Notes on AI

Sean Horgan

Benchmarks

Given the outsized role NVIDIA plays in the chip industry, any benchmark comparison needs to take into consideration how NVIDIA frames and publishes its performance numbers.

SemiAnalysis laid out 3 rules of "Jensen Math" in their NVIDIA GTC 2025 review:

NVIDIA GTC 2025 – Built For Reasoning, Vera Rubin, Kyber, CPO, Dynamo Inference, Jensen Math, Feynman

FLOPs are quoted with 2:4 sparsity (which no one uses) rather than as dense FLOPs, which is the real-world performance metric. For example, the 989.4 TFLOPs of dense FP16 for the H100 is quoted as 1979.81 TFLOPs.

Bandwidth is quoted in bidirectional terms. NVLink5 is quoted as 1.8TB/s because it is 900GB/s of transmit plus 900GB/s of receive. These are added together for the spec sheet, but in the networking world the standard is to quote the unidirectional bandwidth.

GPUs are counted in terms of GPU dies in a package rather than the number of packages. This nomenclature will be adopted from Rubin onwards. The first-generation Vera Rubin racks will be called NVL144, even though the system architecture is similar to the GB200 NVL72, with the same Oberon rack and 72 GPU packages.
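To make the three rules concrete, here is a minimal sketch that converts spec-sheet figures back to the more conventional ones. The function names are hypothetical, and the divisors are assumptions read directly off the rules above (2:4 sparsity doubling the quoted FLOPs, transmit plus receive doubling the quoted bandwidth, and two GPU dies per package as implied by NVL144 having 72 packages). They are not official NVIDIA formulas.

```python
# Hypothetical helpers for undoing the three "Jensen Math" conventions.
# The divisors are assumptions taken from the rules above, not NVIDIA formulas.


def dense_tflops(quoted_sparse_tflops: float) -> float:
    """2:4 sparsity doubles the quoted peak, so halve it for a dense figure."""
    return quoted_sparse_tflops / 2


def unidirectional_gb_per_s(quoted_bidirectional_gb_per_s: float) -> float:
    """Spec sheets add transmit and receive; networking quotes one direction."""
    return quoted_bidirectional_gb_per_s / 2


def gpu_packages(quoted_gpu_dies: int, dies_per_package: int = 2) -> int:
    """Convert a die count (e.g. the 144 in NVL144) back to a package count."""
    if quoted_gpu_dies % dies_per_package != 0:
        raise ValueError("die count is not a whole number of packages")
    return quoted_gpu_dies // dies_per_package


if __name__ == "__main__":
    print(dense_tflops(1979.81))          # H100 FP16 as quoted with sparsity
    print(unidirectional_gb_per_s(1800))  # NVLink5 quoted as 1.8TB/s
    print(gpu_packages(144))              # Vera Rubin NVL144
```

Halving the quoted 1979.81 TFLOPs gives roughly 989.9, which is close to but not exactly the 989.4 figure in the quote, so the factor of two is best read as approximate bookkeeping rather than an exact conversion.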

Optimizations

Sparsity

https://pytorch.org/blog/accelerating-neural-network-training/

The basic idea is to skip calculations involving zero-valued tensor elements: matrix multiplication gets faster when dense kernels are replaced with sparse kernels that bypass the pruned elements.
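As a toy illustration of that idea, the sketch below applies a 2:4 pattern to one weight row (keep two values out of every four) and then computes a dot product that only touches the kept values. It is pure Python and is not how PyTorch or GPU sparse kernels are implemented. The function names are hypothetical, and choosing which values to keep by magnitude is an assumption (one common pruning heuristic), not something the linked post is being quoted on.

```python
# Toy 2:4 sparsity sketch: names and the magnitude-based pruning rule are
# illustrative assumptions, not the PyTorch or hardware implementation.


def compress_2_4(row):
    """Keep the two largest-magnitude values in every group of four.

    Returns a list of (column_index, value) pairs, i.e. half the entries.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    kept = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions (0-3) of the two largest-magnitude entries in the group.
        top_two = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        for i in sorted(top_two):
            kept.append((start + i, group[i]))
    return kept


def sparse_dot(kept, x):
    """Dot product that skips pruned positions entirely."""
    return sum(value * x[col] for col, value in kept)


def to_dense(kept, length):
    """Expand the compressed row back to a dense row with explicit zeros."""
    dense = [0.0] * length
    for col, value in kept:
        dense[col] = value
    return dense


if __name__ == "__main__":
    row = [0.1, -2.0, 0.0, 3.0, 1.5, 0.2, -0.3, 4.0]
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    kept = compress_2_4(row)
    dense_result = sum(w * v for w, v in zip(to_dense(kept, len(row)), x))
    print(len(kept), sparse_dot(kept, x), dense_result)
```

The sparse path performs half the multiplications of the dense path and produces the same result as multiplying by the pruned dense row. Any real speedup depends on kernels and hardware that exploit the pattern.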
