News

xAI supercomputer lights up

Massive machine with 100,000 GPUs, and plans to double to 200,000.

Jon Peddie

X.AI Corp. (xAI), founded by Elon Musk in March 2023, aims to explore the nature of the universe through AI. In August 2024, the company released the Grok-2 beta, featuring two language models: Grok-2 and Grok-2 mini. The models are available on the X platform. In September 2024, xAI launched the Colossus supercomputer, a 100,000-GPU system designed to train the upcoming Grok-3 model. Built with Nvidia’s Spectrum-X networking, Colossus achieved 95% data throughput with no application latency degradation or packet loss.

Colossus from xAI, as envisioned by #FluxSchnell in this AI-generated image.

It has taken almost 60 years for reality to catch up with science fiction. In 1966, British author Dennis Feltham Jones wrote a science-fiction novel called Colossus, about supercomputers taking control of mankind. The book spawned two sequels and the 1970 film Colossus: The Forbin Project. In the film, Dr. Charles Forbin develops a massive computer system called Colossus to protect the US against nuclear attack. Growing increasingly suspicious of the private dialog Colossus carries on with its Russian counterpart, Guardian, Forbin severs the connection between the two supercomputers. Displeased, Colossus gives Forbin an ultimatum: Reconnect the link or face the very thing Colossus was created to prevent.

It’s a really good movie, and this author recommends it to those who have not seen it, even though you now know the ending. But perhaps this is a new beginning.

X.AI Corp. (xAI) is an American artificial intelligence start-up founded by Elon Musk in March 2023. Its stated goal is “to understand the true nature of the universe.”

On August 13, 2024, xAI released the Grok-2 beta. Grok-2 is a language model with what the company calls state-of-the-art reasoning capabilities. The beta release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the X platform (formerly Twitter).

In early September 2024, xAI’s Colossus supercomputer cluster began operation after an impressive build time of 122 days. The system boasts 100,000 Nvidia H100 GPUs, which Musk describes as making it the most powerful AI training system in the world. And, if that’s not enough, Musk has announced plans to double its size to 200,000 GPUs shortly.

Colossus is specifically designed to train the latest version of the Grok language model, known as Grok-3. This AI model is expected to be a significant upgrade from its predecessor, Grok-2, which already ranks second only to ChatGPT-4 in the LLM league tables. With Colossus, Musk aims to build the most powerful LLM out there.

This massive system requires an enormous amount of power: It consumes around 150 MW of electricity and uses up to 1 million gallons of water per day for cooling.
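To put those figures in perspective, here is a back-of-the-envelope sketch in Python that spreads the quoted totals evenly across the 100,000 GPUs. The even split is this example's assumption: the 150 MW figure is taken here to be a facility-wide total that also covers CPUs, networking, storage, and cooling, so the result is a per-GPU share of the whole site and not the draw of a single H100.

```python
# Figures quoted in the article.
total_power_mw = 150              # approximate power consumption
water_gallons_per_day = 1_000_000 # upper bound on daily cooling water
gpu_count = 100_000               # GPUs in the current cluster

# Assumption: totals are divided evenly across GPUs for illustration only.
watts_per_gpu = total_power_mw * 1_000_000 / gpu_count
gallons_per_gpu_per_day = water_gallons_per_day / gpu_count

print(f"Facility power per GPU: {watts_per_gpu:.0f} W")
print(f"Cooling water per GPU:  {gallons_per_gpu_per_day:.0f} gallons/day")
```

On these numbers, the site budgets roughly 1.5 kW and up to 10 gallons of water per day for every GPU installed.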

Recently, Nvidia announced that xAI’s Colossus supercomputer cluster, comprising 100,000 Nvidia Hopper Tensor Core GPUs, achieved this massive scale by using the Nvidia Spectrum-X Ethernet networking platform.

The real xAI Colossus. (Source: xAI)

The state-of-the-art supercomputer was built by xAI and Nvidia in just 122 days, whereas systems of this size typically take from several months to years. Only 19 days passed between the first rack rolling onto the floor and the start of training.

While training the Grok model, Colossus experienced zero application latency degradation and zero packet loss due to flow collisions. It maintained 95% data throughput using Spectrum-X congestion control. Standard Ethernet cannot achieve this at such scale: It creates thousands of flow collisions while delivering only 60% data throughput.
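As a rough illustration of what that gap means, the short Python sketch below compares the two throughput figures quoted above. It assumes both percentages are measured against the same link capacity, which the announcement does not spell out.

```python
# Throughput figures quoted in the article, as fractions of link capacity.
spectrum_x_throughput = 0.95         # with Spectrum-X congestion control
standard_ethernet_throughput = 0.60  # standard Ethernet at this scale

# Assumption: both fractions refer to the same underlying link capacity.
relative_gain = spectrum_x_throughput / standard_ethernet_throughput

print(f"Spectrum-X moves about {relative_gain:.2f}x the data of standard Ethernet")
```

In other words, on the same hardware links, the quoted figures imply roughly 1.6 times as much useful data moved per unit of time.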
