Elon’s ‘Colossus’ supercomputer built with 100,000 NVIDIA H100 GPUs goes live, H200 upgrade coming soon

By Maya Cantina

xAI's Colossus, billed by Elon Musk as the world's most powerful AI training system, is online with 100K NVIDIA H100 GPUs and is set to double in size in the coming months, with 50K NVIDIA H200 GPUs among the added accelerators.

NVIDIA has congratulated the xAI team on building the most powerful NVIDIA-based AI training system yet: 100,000 H100 GPUs today, with a planned upgrade adding 50,000 more H100 accelerators and 50,000 H200s.

Elon Musk’s xAI venture has completed its ‘Colossus’ supercomputer, which went live over the Labor Day weekend. Musk called Colossus the ‘world’s most powerful AI training system’ and said it was built in 122 days from start to finish. Colossus uses 100,000 NVIDIA H100 data center GPUs, making it the largest known training cluster built around the H100.

Musk also announced that in the coming months, Colossus will be upgraded with 50,000 H200 GPUs, the flagship data center GPU on the Hopper architecture. The H200 is significantly more powerful than the H100, delivering up to roughly 45% more performance in specific generative AI and HPC workloads.

NVIDIA congratulated the xAI team on completing such a large project in just four months, adding:

Colossus is powered by NVIDIA’s #acceleratedcomputing platform, providing innovative performance with exceptional gains in #energyefficiency.

The xAI Colossus project broke ground in Memphis in June, and training on the cluster began in July. The system is expected to train Grok 3 by December, succeeding Grok 2 as what Musk pitches as the world’s most powerful AI. Colossus follows the end of xAI’s deal with Oracle, which had been leasing server capacity to the company. The new supercluster is already more powerful than what Oracle could provide and will double its capacity in a few months with the addition of 50,000 more H200 GPUs.

The H200 brings 61GB more memory (141GB versus 80GB) and significantly higher memory bandwidth: 4.8TB/s compared with 3.35TB/s on the H100. Those upgraded specifications come with a higher power draw, and like the H100s already deployed in Colossus, the H200s will rely on liquid cooling.
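For readers who want to sanity-check those figures, here is a minimal Python sketch using the publicly listed spec-sheet numbers quoted above (the variable names are illustrative, not from any NVIDIA API):

```python
# Quick back-of-the-envelope check on the H100 vs. H200 figures cited above.
# Values come from NVIDIA's public datasheets; the ~45% figure mentioned
# earlier refers to specific generative AI / HPC workloads and does not
# follow from any single specification.

h100 = {"hbm_gb": 80, "bandwidth_tb_s": 3.35}   # H100 SXM
h200 = {"hbm_gb": 141, "bandwidth_tb_s": 4.8}   # H200 SXM

extra_memory = h200["hbm_gb"] - h100["hbm_gb"]
bandwidth_gain = h200["bandwidth_tb_s"] / h100["bandwidth_tb_s"] - 1

print(f"Extra HBM per GPU: {extra_memory} GB")            # 61 GB
print(f"Memory bandwidth uplift: {bandwidth_gain:.0%}")   # ~43%
```

The roughly 43% bandwidth uplift tracks closely with the ~45% workload gains cited earlier, which is plausible given how memory-bandwidth-bound large-model training and inference tend to be.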

Currently, Colossus is the only supercomputer to reach 100K NVIDIA GPUs, followed by Google AI with 90K GPUs and OpenAI with 80K H100 GPUs. Meta AI and Microsoft AI come next with 70K and 60K GPUs, respectively.

News source: @NVIDIADC
