Colossus, the largest supercomputer from Elon Musk’s xAI, comes online with 100K NVIDIA H100 GPUs and will soon double in size to 200K GPUs with the addition of 50K more H100s and 50K NVIDIA H200s.
NVIDIA congratulates the xAI team on bringing online the most powerful NVIDIA-based AI training system, built with 100,000 H100 GPUs and set to double to 200,000 GPUs with 50,000 additional H100s and 50,000 H200s.
Elon Musk’s xAI venture has completed development of its ‘Colossus’ supercomputer, which went live on Labor Day. Musk says Colossus is the ‘world’s most powerful AI training system’ and was built in just 122 days from start to finish. Colossus runs on 100,000 NVIDIA H100 data center GPUs, making it the largest training cluster built around the H100.
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the world’s most powerful AI training system. Plus, it’s doubling in size to 200k (50k H200s) in a few months.
Excellent…
— Elon Musk (@elonmusk) September 2, 2024
Elon also announced that in the coming months, Colossus will be upgraded with 50,000 H200 GPUs, NVIDIA’s flagship data center GPU built on the Hopper architecture. The H200 is significantly more powerful than the H100, delivering up to 45% more compute performance in specific generative AI and HPC workloads.
NVIDIA congratulated the xAI team for completing such a large project in just 4 months. NVIDIA added,
Colossus is powered by NVIDIA’s #acceleratedcomputing platform, providing innovative performance with exceptional gains in #energy efficiency.
The xAI Colossus project broke ground in Memphis in June, and training began in July. The cluster is expected to train Grok 3 by December, succeeding Grok 2 as what xAI bills as the world’s most powerful AI. Colossus follows the end of xAI’s deal with Oracle, which had been leasing servers to the company. The new supercluster is more powerful than what Oracle could provide and will double in size in a few months with the addition of the 50,000 H200 GPUs.
The H200 brings nearly 61GB more memory (141GB of HBM3e versus 80GB of HBM3) and significantly higher memory bandwidth at 4.8TB/s, compared to 3.35TB/s on the H100. With that jump in specifications comes higher power consumption, and the H200s will require liquid cooling, just as the H100s in Colossus already do.
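As a quick sanity check on those figures, the per-GPU deltas can be computed directly from NVIDIA’s published SXM specs. The short Python sketch below is purely illustrative arithmetic, not a benchmark:

```python
# Back-of-envelope comparison of NVIDIA's published H100 vs. H200 (SXM) memory specs.
h100 = {"memory_gb": 80, "bandwidth_tbs": 3.35}   # H100 SXM: 80 GB HBM3
h200 = {"memory_gb": 141, "bandwidth_tbs": 4.8}   # H200 SXM: 141 GB HBM3e

extra_memory = h200["memory_gb"] - h100["memory_gb"]
bandwidth_gain = (h200["bandwidth_tbs"] / h100["bandwidth_tbs"] - 1) * 100

print(f"Extra memory per GPU: {extra_memory} GB")       # 61 GB
print(f"Bandwidth uplift:     {bandwidth_gain:.0f}%")   # ~43%
```

The ~43% memory-bandwidth uplift lines up with the article’s figures and helps explain the performance gains cited for bandwidth-bound generative AI workloads.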
Currently, Colossus is the only supercomputer to have reached 100K NVIDIA GPUs. It is reportedly followed by Google AI with 90K GPUs, then OpenAI with 80K H100 GPUs, with Meta AI and Microsoft AI next at 70K and 60K GPUs, respectively.
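For a rough sense of scale, a theoretical-peak estimate for a 100K-GPU H100 cluster can be sketched as below. It uses NVIDIA’s published dense FP8 figure of roughly 989 TFLOPS per H100 SXM; actual training throughput is far lower once networking, utilization, and real-world efficiency are factored in:

```python
# Rough theoretical peak for a 100,000-GPU H100 cluster (dense FP8).
# 989 TFLOPS is NVIDIA's published dense-FP8 spec for the H100 SXM;
# this ignores interconnect and utilization, so treat it as an upper bound.
GPUS = 100_000
FP8_TFLOPS_PER_GPU = 989  # dense, without sparsity

peak_exaflops = GPUS * FP8_TFLOPS_PER_GPU / 1_000_000  # TFLOPS -> EFLOPS
print(f"Theoretical peak: ~{peak_exaflops:.0f} EFLOPS (FP8, dense)")  # ~99 EFLOPS
```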
News source: @NVIDIADC