Nvidia has published the first MLPerf 4.1 results for its Blackwell B200 processor. The results show that the Blackwell GPU offers up to four times the performance of its Hopper-based H100 predecessor, underscoring Nvidia’s leadership in AI hardware. There are, however, some caveats and disclaimers we need to point out.
Based on Nvidia’s results, a single Blackwell-based B200 GPU delivers 10,755 tokens/second in the server inference test and 11,264 tokens/second in the offline test. A quick look at the publicly available MLPerf Llama 2 70B benchmark results reveals that a machine based on four Hopper H100 processors delivers similar totals, which supports Nvidia’s claim that a single Blackwell processor is about 3.7 to 4 times faster than a single Hopper H100 GPU. However, we need to dig into the numbers to understand them better.
GPU | Number of GPUs | Offline (tokens/s) | Server (tokens/s) | Offline per GPU | Server per GPU
Nvidia B200 180GB HBM3E | 1 | 11,264 | 10,755 | 11,264 | 10,755
Nvidia H100 80GB HBM3 | 4 | 10,700 | 9,522 | 2,675 | 2,381
Nvidia H200 141GB HBM3E | 1 | 4,488 | 4,202 | 4,488 | 4,202
Nvidia H200 141GB HBM3E | 8 | 32,124 | 29,739 | 4,016 | 3,717
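If you want to sanity-check the per-GPU columns, a back-of-the-envelope Python sketch makes the normalization explicit. It uses only the figures quoted in the table above; the dictionary layout and labels are just for illustration.

```python
# Back-of-the-envelope check: per-GPU throughput is simply the reported system
# total divided by the number of GPUs. All figures come from the MLPerf 4.1
# Llama 2 70B results quoted in the table above (tokens/second).
results = {
    "B200 180GB HBM3E (x1)": {"gpus": 1, "offline": 11264, "server": 10755},
    "H100 80GB HBM3 (x4)":   {"gpus": 4, "offline": 10700, "server": 9522},
    "H200 141GB HBM3E (x1)": {"gpus": 1, "offline": 4488,  "server": 4202},
    "H200 141GB HBM3E (x8)": {"gpus": 8, "offline": 32124, "server": 29739},
}

for name, r in results.items():
    offline_per_gpu = r["offline"] / r["gpus"]
    server_per_gpu = r["server"] / r["gpus"]
    print(f"{name}: {offline_per_gpu:,.0f} offline / {server_per_gpu:,.0f} server tokens/s per GPU")
```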
First, Nvidia’s Blackwell processor used FP4 precision, because its fifth-generation Tensor cores support that format, while the Hopper-based H100 supports only FP8, which is what it ran here. Mixing these formats is allowed under the MLPerf guidelines, but FP4 gives Blackwell twice the throughput of FP8, so that’s the first crucial thing to note.
Next, Nvidia’s comparison is a bit disingenuous, because it pits a single B200 against a four-GPU H100 system. Multi-GPU scaling is never perfect, so a single GPU is pretty much the best-case scenario for per-GPU performance. There are no single-GPU H100 results listed for MLPerf 4.1, and only one B200 submission, which makes the single-GPU H200 result the closest like-for-like comparison. A single H200 achieved 4,488 tokens/s, meaning the B200 is only about 2.5 times faster in that particular matchup.
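To make that concrete, the same kind of quick arithmetic, again using only the offline figures from the table above, shows how the headline speedup shrinks depending on which comparison you pick:

```python
# Quick ratio check using the offline numbers quoted above (tokens/second).
b200_single = 11264        # one B200
h100_per_gpu = 10700 / 4   # 4x H100 system total, normalized per GPU (= 2,675)
h200_single = 4488         # one H200

print(f"B200 vs. per-GPU H100: {b200_single / h100_per_gpu:.1f}x")  # roughly 4.2x
print(f"B200 vs. single H200:  {b200_single / h200_single:.1f}x")   # roughly 2.5x
```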
Memory capacity and bandwidth are also critical factors, and there are vast differences between generations. The B200 GPU in these tests has 180GB of HBM3E, the H100 SXM has 80GB of HBM (up to 96GB in some configurations), and the H200 comes with 96GB of HBM3 or up to 144GB of HBM3E. Tellingly, one result for a single H200 with 96GB of HBM3 only reaches 3,114 tokens/s in the offline test.
So there are differences in number format, GPU count, memory capacity, and configuration that all feed into the “up to 4X” figure. Many of these differences simply come down to the fact that the Blackwell B200 is a brand-new chip with a newer architecture, and all of them affect its final performance.
Coming back to the Nvidia H200 with 141GB of HBM3E memory: it performed exceptionally well not only in the generative AI test on the large Llama 2 70B language model, but in every single test in the data center category. For obvious reasons, it was significantly faster than the H100 in the tests that lean on the GPU’s memory capacity.
For now, Nvidia has only released B200 performance for the MLPerf 4.1 generative AI test on Llama 2 70B. We can’t say whether that’s because the company is still tuning the chip or for other reasons, but MLPerf 4.1 comprises nine core disciplines, and for now we can only guess how the Blackwell B200 will fare in the rest of them.