Running the largest artificial intelligence (AI) and machine learning (ML) workloads is a job for the highest-performing systems; such workloads strain even very capable machines. Supermicro’s SuperBlade combines blades built around AMD EPYC™ CPUs and GPUs in a single rack-mounted enclosure (such as the Supermicro SBE-820H-822), and it leverages an extremely fast networking architecture for demanding applications that must communicate with other servers to complete a task.
The Supermicro SuperBlade fits everything into an 8U chassis that can host up to 20 individual servers, so a single chassis can be divided between separate training and inference jobs. The components are key: every server can take full advantage of the 200G HDR InfiniBand network switch without losing performance. Think of it as a cloud-in-a-box, providing easier cluster management along with higher performance and lower latency.
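To make the fabric claim concrete, here is a minimal, hypothetical bandwidth check between two blades written with mpi4py. The host names (blade1, blade2), message size, and repetition count are illustrative assumptions, not Supermicro tooling; the sketch simply measures how quickly two servers in the chassis can exchange data over whatever fabric MPI is configured to use.

```python
# Hypothetical point-to-point bandwidth check between two blades.
# Assumes mpi4py and an MPI stack configured to use the InfiniBand fabric.
# Example launch (host names are placeholders):
#   mpirun -np 2 --host blade1,blade2 python bandwidth.py
import time

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = np.ones(64 * 1024 * 1024, dtype=np.uint8)  # 64 MiB message
reps = 20

comm.Barrier()
start = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(payload, dest=1, tag=0)    # rank 0 sends, then waits
        comm.Recv(payload, source=1, tag=1)
    elif rank == 1:
        comm.Recv(payload, source=0, tag=0)  # rank 1 echoes the payload back
        comm.Send(payload, dest=0, tag=1)
elapsed = time.perf_counter() - start

if rank == 0:
    # Each repetition moves the payload twice (one round trip).
    gbytes = 2 * reps * payload.nbytes / 1e9
    print(f"effective bandwidth: {gbytes / elapsed:.1f} GB/s")
```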
The Supermicro SuperBlade is also designed as a disaggregated server, meaning that components such as CPUs and memory can be upgraded with newer, more efficient parts as technology progresses. This significantly reduces e-waste.
The SuperBlade line supports a wide selection of configurations, including both CPU-only and mixed CPU/GPU models, such as the SBA-4119SG, which comes with up to two AMD EPYC™ 7000-series 64-core CPUs. These components are delivered on blades that slide right in, and slide out just as easily when a blade or the enclosure needs replacing. The SuperBlade servers support a wide network selection as well, ranging from 10G Ethernet to 200G HDR InfiniBand.
The SuperBlade employs Horovod, a distributed model-training framework that coordinates workers through a message-passing interface (MPI), to let multiple ML sessions run in parallel and maximize performance. In a sample test, two SuperBlade nodes processed 3,622 GoogleNet images per second, and eight nodes scaled up to 13,475 GoogleNet images per second, roughly 93% of perfect linear scaling (scaling the two-node figure linearly to eight nodes would project 4 × 3,622 = 14,488).
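As a rough illustration of how a Horovod training session is parallelized across blades, the sketch below shows minimal data-parallel training in PyTorch. It assumes horovod[pytorch] and torchvision are installed with one CUDA GPU per worker; the synthetic dataset, batch size, and learning rate are placeholders, not the configuration behind the benchmark figures above.

```python
# Minimal Horovod data-parallel training sketch (assumed environment:
# horovod[pytorch], torchvision, one CUDA GPU per worker).
import torch
import torch.nn.functional as F
import torchvision
import horovod.torch as hvd

hvd.init()                               # one worker per GPU/blade slot
torch.cuda.set_device(hvd.local_rank())  # pin this worker to its local GPU

# Synthetic stand-in for the ImageNet-style data used in the GoogleNet test.
images = torch.randn(256, 3, 224, 224)
labels = torch.randint(0, 1000, (256,))
dataset = torch.utils.data.TensorDataset(images, labels)

# Each worker trains on its own shard of the dataset.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

model = torchvision.models.googlenet(
    num_classes=1000, aux_logits=False, init_weights=True).cuda()

# Scale the learning rate by the worker count, as Horovod's docs suggest.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Horovod averages gradients across workers with MPI-style allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(2):
    sampler.set_epoch(epoch)             # reshuffle shards each epoch
    for batch, target in loader:
        batch, target = batch.cuda(), target.cuda()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(batch), target)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: loss {loss.item():.3f}")
```

Launched with, say, horovodrun -np 8 python train.py, each worker trains on its own data shard while Horovod averages gradients over the fabric after every step, which is why a fast, low-latency interconnect matters as node counts grow.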
As the numbers show, Supermicro’s SuperBlade improves performance-intensive computing and boosts AI and ML use cases, enabling larger models and data workloads. The combined solution raises operational efficiency: it streamlines processes, monitors for potential breakdowns, applies fixes, keeps accurate and actionable data flowing, and scales training across multiple nodes.