Exacluster with 144 NVIDIA H200 AI GPUs detailed by its designer: Hydra Host enters the stage

At the beginning of this month, we reported on ExaAILabs' Exacluster, a cluster of 18 machines with 144 NVIDIA H200 GPUs and one of the first clusters based on these processors. Since then, Hydra Host, the company that facilitated the construction of the cluster, has given us additional details about the system. The cluster uses Lenovo systems with numerous Hydra Host customizations that played a significant role. The machine can also be rented out via Hydra Host's Brokkr platform when its owner is not using it.

A lot of computing power

The backbone of the cluster consists of 18 Lenovo nodes equipped with 144 NVIDIA H200 GPUs (eight per system) and 20 TB of HBM3E memory, delivering 570 FP8 petaFLOPS of AI compute performance. Sixteen nodes were configured and adapted by Hydra Host for training, which demands massive compute and memory efficiency, while the other two serve as inference nodes. In addition, Hydra Host installed its Brokkr platform for GPU management and remote rental (more on this later).
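The headline totals can be sanity-checked with a few lines of Python. This is only a sketch: the node and GPU counts come from the article, while the per-GPU figures (141 GB of HBM3E and 3,958 FP8 teraFLOPS with sparsity) are assumptions taken from NVIDIA's public H200 SXM specifications, not from Hydra Host.

```python
# Counts from the article.
NODES = 18
GPUS_PER_NODE = 8

# Assumed per-GPU figures (NVIDIA's published H200 SXM specs, not the article).
HBM3E_PER_GPU_GB = 141
FP8_PER_GPU_TFLOPS = 3958  # peak, with sparsity


def cluster_totals(nodes=NODES, gpus_per_node=GPUS_PER_NODE,
                   hbm_gb=HBM3E_PER_GPU_GB, fp8_tflops=FP8_PER_GPU_TFLOPS):
    """Return total GPU count, HBM3E capacity (TB) and FP8 throughput (PFLOPS)."""
    gpus = nodes * gpus_per_node
    return {
        "gpus": gpus,
        "hbm3e_tb": gpus * hbm_gb / 1000,
        "fp8_pflops": gpus * fp8_tflops / 1000,
    }


totals = cluster_totals()
print(totals)
```

Under these assumptions the totals land at 144 GPUs, roughly 20 TB of HBM3E and roughly 570 FP8 petaFLOPS, which matches the figures quoted for the cluster.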

Hydra Host collaborated with Computacenter to design a high-performance network architecture tailored to the needs of the cluster. The configuration uses 3.2 Tb/s InfiniBand for east-west traffic and 400 Gb/s Ethernet for north-south communication, including dual 200 Gb/s connections per server and 400 Gb/s Dell Ethernet. Computacenter's networking engineers supplied all components in line with NVIDIA's reference architecture to ensure trouble-free compatibility.
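As a rough illustration of how these bandwidth figures relate, the sketch below assumes that the 3.2 Tb/s east-west figure is per node and comes from one 400 Gb/s InfiniBand link per GPU. That breakdown is an assumption and is not confirmed by Hydra Host; only the dual 200 Gb/s Ethernet links per server are stated in the article.

```python
GPUS_PER_NODE = 8
IB_LINK_GBPS = 400       # assumed: one 400 Gb/s InfiniBand link per GPU
ETH_LINKS_PER_NODE = 2   # from the article: dual connections per server
ETH_LINK_GBPS = 200      # from the article: 200 Gb/s each

# Per-node bandwidth in each direction, in Gb/s.
east_west_gbps = GPUS_PER_NODE * IB_LINK_GBPS
north_south_gbps = ETH_LINKS_PER_NODE * ETH_LINK_GBPS

# How much more bandwidth is reserved for GPU-to-GPU traffic.
ratio = east_west_gbps / north_south_gbps

print(f"east-west:   {east_west_gbps / 1000} Tb/s per node")
print(f"north-south: {north_south_gbps} Gb/s per node")
print(f"ratio:       {ratio}:1")
```

If the assumption holds, each node dedicates eight times more bandwidth to GPU-to-GPU traffic than to storage and external traffic, which is typical for training clusters.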

“We delivered 18 Lenovo nodes with H200 GPUs (16 connected and two inference nodes), designed the network architecture in cooperation with Computacenter, and facilitated colocation via Patmos,” explained Andrea Holt, a spokesperson for Hydra Host.

The cluster itself is quite powerful, even in terms of general-purpose computing. Each server provides 192 CPU cores from 96-core processors (3,456 cores in total), and the cluster combines them with 36 TB of DDR5 memory and 270 TB of NVMe solid-state storage. There are spare drive bays, so storage capacity can be easily expanded. The supercomputer uses a network custom-built by Hydra Host.
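The general-purpose totals can be broken down per node with simple arithmetic. The sketch below uses only the totals quoted in the article (18 nodes, 3,456 cores, 96-core processors, 36 TB of DDR5, 270 TB of NVMe) and assumes the resources are spread evenly across all nodes, which the article does not state.

```python
NODES = 18
TOTAL_CORES = 3456
CORES_PER_CPU = 96
TOTAL_DDR5_TB = 36
TOTAL_NVME_TB = 270

# Assumes an even split across nodes (not stated in the article).
cores_per_node = TOTAL_CORES // NODES        # CPU cores in each server
cpus_total = TOTAL_CORES // CORES_PER_CPU    # 96-core processors in the cluster
cpus_per_node = cpus_total // NODES          # sockets per server
ddr5_per_node_tb = TOTAL_DDR5_TB / NODES     # DDR5 per server, in TB
nvme_per_node_tb = TOTAL_NVME_TB / NODES     # NVMe per server, in TB

print(cores_per_node, cpus_total, cpus_per_node,
      ddr5_per_node_tb, nvme_per_node_tb)
```

On these numbers, the 3,456-core total corresponds to 36 processors, or two 96-core CPUs (192 cores) per server, alongside 2 TB of DDR5 and 15 TB of NVMe storage per node.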

The company also brought in Patmos for colocation, ensuring sufficient power (about 100 kW) and cooling for the power-hungry, hot-running machines.

The best performance at the best price

Exacluster cost $5 million, or $277,777 per machine on average, which is comparable to the price of a single 8-GPU H200 board rather than a full server. This is where it gets intriguing: who made this price possible?
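The per-machine average follows directly from the quoted total. The sketch below also derives a per-GPU figure; note that this spreads the full system cost (CPUs, memory, storage and networking included) across the GPUs, so it is an illustrative average and not a GPU price.

```python
TOTAL_COST_USD = 5_000_000
NODES = 18
GPUS = 144

cost_per_node = TOTAL_COST_USD / NODES  # average cost of one 8-GPU server
cost_per_gpu = TOTAL_COST_USD / GPUS    # whole-system cost spread per GPU

print(f"per node: ${cost_per_node:,.0f}")
print(f"per GPU:  ${cost_per_gpu:,.0f}")
```

The result is about $277,778 per node (the article truncates this to $277,777) and about $34,722 per GPU.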

On the one hand, Hydra Host is a close NVIDIA partner and offers only NVIDIA GPU services. In addition, its Brokkr software is optimized primarily for CUDA. On the other hand, ExaAI is a company backed by NVIDIA, so it can potentially get preferential pricing.

“We do our best to get our customers the right GPU for their needs at the best price,” said Ryan Horjus, chief sales engineer at Hydra. “This cluster was supported by NVIDIA from the architecture design stage and through their Inception program. Hydra facilitated it for Exa, as it does for other companies.”

Hydra also specializes in building custom solutions for startups, and even helps them earn money on their machines when they are idle.

“Hydra has helped startups get their own clusters at better prices through bulk purchasing,” added Horjus. “They can get excellent prices through our network. They are also able to earn money on their servers when they are not in use via the Brokkr management platform.”

Speaking of Brokkr, it is GPU management and sharing software as well as a monetization platform for GPUs. It gives data centers and startups a turnkey software solution for offering their hardware to customers and earning money from it, explained Ariel Deschapell, chief technology officer and co-founder of Hydra.

“One of its key functions is automated bare-metal provisioning and life cycle management,” said Deschapell. “This means that the platform does all the work of configuring and managing the underlying server and system software: installing the operating system, configuring drivers and other auxiliary software, as well as running tests on the GPUs and other components. This significantly accelerates and standardizes the delivery process, reducing idle time on servers and GPUs.”
