Nvidia CEO Jensen Huang’s keynote at COMPUTEX 2024 was fascinating. Below are selected excerpts about the world's most powerful AI chip: Nvidia's Blackwell.
Jensen Huang (Figure 1): “Ladies and gentlemen, this is Blackwell. Blackwell is in production. Incredible amounts of technology.”
Blackwell-architecture GPUs pack 208 billion transistors and are manufactured using a custom-built TSMC 4NP process. All Blackwell products feature two reticle-limited dies connected by a 10 terabyte-per-second (TB/s) chip-to-chip interconnect into a single unified GPU.
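To put those headline numbers in perspective, here is a minimal back-of-the-envelope sketch in Python. It only rearranges the figures quoted above; the per-die transistor count is a simple division of the stated total, not an official breakdown, and the 1 TB transfer example is purely illustrative.

```python
# Back-of-the-envelope numbers derived from the figures quoted above.
TOTAL_TRANSISTORS = 208e9      # transistors across the whole Blackwell GPU
DIES_PER_PACKAGE = 2           # two reticle-limited dies per package
LINK_BANDWIDTH_TBPS = 10       # chip-to-chip interconnect, terabytes per second

# Simple split of the quoted total across the two dies (~104 billion each).
transistors_per_die = TOTAL_TRANSISTORS / DIES_PER_PACKAGE
print(f"Transistors per die: ~{transistors_per_die / 1e9:.0f} billion")

# Time the 10 TB/s link would need to move 1 TB between the dies,
# ignoring protocol overhead and real traffic patterns.
seconds_per_tb = 1 / LINK_BANDWIDTH_TBPS
print(f"Time to move 1 TB across the link: {seconds_per_tb * 1e3:.0f} ms")
```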
Jensen Huang (Figure 2): “This is our production board. This is the most complex, highest performance computer the world's ever made. This is the Grace CPU. And these are, you can see each one of these Blackwell dies, two of them connected together. You see that it is the largest die, the largest chip the world makes. And then we connect two of them together with a ten terabyte per second link.”
Jensen Huang (Figure 3): “And the performance is incredible. Take a look at this. So, you see, you see, the computational, the FLOPs, the AI FLOPs, for each generation has increased by a thousand times in eight years. Moore's law in eight years is something along the lines of, oh, I don't know, maybe 40, 60. And in the last eight years, Moore's law has gone a lot, lot less. And so just to compare, even Moore's Law as best of times compared to what Blackwell could do.”
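Huang's “maybe 40, 60” figure corresponds roughly to transistor counts doubling every 18 to 24 months. The quick sketch below (assuming idealized, constant doubling periods, which is a simplification of how Moore's law actually played out) shows how that compounding compares with the claimed 1,000x increase in AI FLOPs over the same eight years.

```python
# Idealized Moore's-law compounding over eight years, for comparison with
# the keynote's claimed 1,000x increase in AI FLOPs over the same period.
YEARS = 8

def moores_law_gain(doubling_period_years: float, years: int = YEARS) -> float:
    """Growth factor if capability doubles every `doubling_period_years` years."""
    return 2 ** (years / doubling_period_years)

print(f"Doubling every 24 months: ~{moores_law_gain(2.0):.0f}x")   # ~16x
print(f"Doubling every 18 months: ~{moores_law_gain(1.5):.0f}x")   # ~40x
print("Claimed AI FLOPs gain:    1000x")
```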
Jensen Huang (Figure 4): “And whenever we bring the computation high, the thing that happens is the cost goes down. (…) the energy used to train a GPT-4, 2 trillion parameter, 8 trillion tokens (…) has gone down by 350 times. Well, Pascal would have taken 1,000 gigawatt hours. (…) we've now taken with Blackwell what used to be 1,000 gigawatt hours to three, an incredible advance, three gigawatt hours.”
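The quoted “350 times” follows directly from the two energy numbers Huang states; a one-line check (using only those quoted figures) gives roughly 333x, consistent with the rounded claim.

```python
# Training-energy claim from the keynote: ~1,000 GWh on Pascal-era hardware
# versus ~3 GWh on Blackwell for a GPT-4-class run (2T parameters, 8T tokens).
PASCAL_TRAINING_GWH = 1_000
BLACKWELL_TRAINING_GWH = 3

reduction = PASCAL_TRAINING_GWH / BLACKWELL_TRAINING_GWH
print(f"Energy reduction: ~{reduction:.0f}x")   # ~333x, quoted as "350 times"
```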
From 17,000 joules (of energy) per token to just 0.4 joules per token! GPT-4 uses about 3 tokens to generate one word.
Jensen Huang:
“Our token generation performance has made it possible for us to drive the energy down by 45,000 times. 17,000 joules per token, that was Pascal. It's kind of like two light bulbs running for two days, that amount of energy, 200 watts running for two days, to generate one token of GPT-4. It takes about three tokens to generate one word. And so the amount of energy necessary for Pascal to generate GPT-4 tokens and have a ChatGPT experience with you was practically impossible. But now we only use 0.4 joules per token, and we can generate tokens at incredible rates and very little energy.”
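The per-token arithmetic is easy to reproduce from the quoted numbers. Dividing 17,000 by 0.4 gives 42,500x, close to the rounded “45,000 times” in the keynote; the per-word figures below also use the “about three tokens per word” rule of thumb as stated.

```python
# Per-token inference energy figures from the keynote, plus the derived
# per-word cost using the quoted ~3 tokens-per-word rule of thumb.
PASCAL_JOULES_PER_TOKEN = 17_000
BLACKWELL_JOULES_PER_TOKEN = 0.4
TOKENS_PER_WORD = 3   # approximate, as stated in the keynote

reduction = PASCAL_JOULES_PER_TOKEN / BLACKWELL_JOULES_PER_TOKEN
print(f"Per-token energy reduction: ~{reduction:,.0f}x")   # 42,500x (~45,000x quoted)
print(f"Pascal energy per word:     ~{PASCAL_JOULES_PER_TOKEN * TOKENS_PER_WORD:,} J")
print(f"Blackwell energy per word:  ~{BLACKWELL_JOULES_PER_TOKEN * TOKENS_PER_WORD:.1f} J")
```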
However, Blackwell alone is still not big enough for AI compute!
Jensen Huang (Figure 5): “Okay, so Blackwell is just an enormous leap. Well, even so, it's not big enough. And so we have to build even larger machines. And so the way that we build it is called DGX.”
The DGX Blackwell (GB200 NVL72) connects 36 Grace CPUs and 72 Blackwell GPUs in a liquid-cooled, rack-scale design whose 72-GPU NVLink domain acts as a single massive GPU.
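A minimal sketch of that rack topology, using only the counts above plus Nvidia's published pairing of one Grace CPU with two Blackwell GPUs per GB200 superchip (included here for illustration, not as an exhaustive bill of materials):

```python
# GB200 NVL72 rack topology as described above: 36 Grace CPUs and 72
# Blackwell GPUs in a single NVLink domain. One Grace plus two Blackwell
# GPUs form one GB200 superchip.
GRACE_CPUS = 36
BLACKWELL_GPUS = 72
GPUS_PER_SUPERCHIP = 2          # two Blackwell GPUs paired with each Grace CPU

superchips = GRACE_CPUS          # one Grace per superchip -> 36 superchips
assert superchips * GPUS_PER_SUPERCHIP == BLACKWELL_GPUS

print(f"GB200 superchips per rack: {superchips}")
print(f"GPUs in the NVLink domain: {BLACKWELL_GPUS} (presented as one large GPU)")
```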
This “massive single GPU” is quite big (see Figure 6). Yet, amazingly, even this is still not enough for AI compute… but that is a story for another post.
Jensen Huang: “And even this is not big enough, even this is not big enough for an AI factory. So we have to connect it all together with very high speed networking.”