Broadcom boosts AI and ML with Tomahawk 5



Artificial intelligence (AI) and machine learning (ML) are not limited to algorithms: the right hardware to boost your AI and ML calculations is essential.

To accelerate task completion, AI and ML training clusters need high bandwidth and reliable transport with predictable, low tail latency (tail latency refers to the slowest 1-2% of responses, which can hold up the rest of a job). A high-performance interconnect can optimize data center and high-performance computing (HPC) workloads across a portfolio of hyperconverged AI/ML training clusters, resulting in lower latency for better model training, higher network utilization and reduced operational costs.
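Why tail latency matters so much here can be sketched in a few lines. In a synchronous training step, every worker must finish its data exchange before the step completes, so the step time is set by the slowest transfer, not the average one. The numbers below are invented for illustration, not Broadcom benchmarks.

```python
import random

random.seed(42)

def step_time(latencies_ms):
    """A synchronous training step finishes only when the last (slowest) worker does."""
    return max(latencies_ms)

# Hypothetical workload: 99% of transfers take ~10 ms, but 1% of
# stragglers (the "tail") take ~100 ms.
latencies = [100.0 if random.random() < 0.01 else 10.0 for _ in range(256)]

avg = sum(latencies) / len(latencies)
print(f"mean transfer latency: {avg:.1f} ms")
print(f"step completion time:  {step_time(latencies):.1f} ms")
```

Even though the mean latency stays close to 10 ms, a single 100 ms straggler stretches the whole step to 100 ms, which is why predictable tail latency matters more than average latency for training clusters.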

As AI/ML training tasks become more common, higher radix switches, which reduce latency and power, and higher port speeds are essential for creating larger training clusters with a flat network topology.

Ethernet switching for performance optimization

As network bandwidth requirements in data centers continue to increase dramatically, there is also strong pressure to combine general compute and storage infrastructure with optimized AI/ML training processors. As a result, AI/ML training clusters – where multiple machines are dedicated to training – drive demand for fabrics with high-bandwidth connectivity, high radix and faster task completion, all while running the network at high utilization.


To speed up task completion, it is essential to have efficient load balancing to achieve high network utilization, as well as congestion control mechanisms to achieve predictable tail latency. Virtualized and efficient data infrastructures, combined with high-performance hardware, can also improve CPU offload and help network accelerators speed up neural network training.

Ethernet-based infrastructures currently offer the best solution for a unified network. They combine low power with high bandwidth and radix, as well as the fastest serializer/deserializer (SerDes) speeds, with bandwidth predictably doubling every 18-24 months. With these advantages, along with its broad ecosystem, Ethernet can provide the highest performance interconnect per watt and per dollar for AI/ML and cloud-scale infrastructure.
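The doubling cadence above implies a simple compounding calculation. The sketch below is purely illustrative arithmetic, assuming the optimistic end of the 18-24 month cadence and taking Tomahawk 5's 51.2 Tbps as the starting point.

```python
def projected_capacity_tbps(start_tbps, months, cadence_months=24):
    """Project switch capacity, assuming it doubles every `cadence_months`."""
    doublings = months // cadence_months
    return start_tbps * 2 ** doublings

# Starting from 51.2 Tbps, four years at a 24-month cadence
# means two doublings: 51.2 -> 102.4 -> 204.8 Tbps.
print(projected_capacity_tbps(51.2, 48))  # 204.8
```

At the faster 18-month cadence the same four years would allow a third doubling, which is why the 18-24 month range matters so much to cluster planners.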

According to IDC, the global Ethernet switch market grew by 12.7% year-on-year to reach $7.6 billion in the first quarter of 2022 (1Q22). Broadcom offers the Tomahawk family of Ethernet switches to enable the next generation of unified networks.

Today, San Jose-based Broadcom announced the StrataXGS Tomahawk 5 series of switches, which delivers 51.2 terabits per second (Tbps) of Ethernet switching capacity in a single monolithic device, more than twice the bandwidth of its contemporaries, the company says.

“Tomahawk 5 has twice the capacity of Tomahawk 4. As a result, it’s one of the fastest switching chips in the world,” Ram Velaga, SVP/GM of Broadcom’s core switching group, told VentureBeat. “Features and capabilities newly added to optimize the performance of AI and ML networks make Tomahawk 5 twice as fast as the previous version.”

Tomahawk 5 switch chips are designed to help data centers and HPC environments accelerate AI and ML capabilities. The chip uses a Broadcom approach known as cognitive routing, along with advanced shared packet buffering, programmable in-band telemetry and on-chip, hardware-based link failover.

Cognitive routing optimizes network link utilization by automatically and dynamically selecting the least-loaded links in the system for each flow passing through the switch. This is especially important for AI and ML workloads, which frequently mix short- and long-duration, high-bandwidth flows with low entropy.

“Cognitive routing is a step beyond adaptive routing,” Velaga said. “When you use adaptive routing, you’re only aware of data congestion between two points, but you ignore other endpoints.” Cognitive routing, he added, can make the system aware of conditions beyond the next neighbor, redirecting traffic to an optimal path that provides better load balancing while avoiding congestion.

Tomahawk 5 includes real-time dynamic load balancing, which monitors link utilization at the switch and downstream in the network to determine the best path for each flow. It also monitors the health of hardware links and automatically redirects traffic away from failing connections. These features improve network utilization and reduce congestion, resulting in shorter job completion times (JCT).
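The core idea behind this kind of load balancing can be sketched simply. Broadcom's actual cognitive routing is proprietary silicon logic; the toy function below only illustrates the contrast described above: instead of hashing a flow onto a fixed path (as classic ECMP does), a new flow is placed on whichever candidate path currently carries the least load.

```python
def pick_path(path_loads):
    """Return the index of the least-loaded candidate path for a new flow."""
    return min(range(len(path_loads)), key=lambda i: path_loads[i])

# Four hypothetical equal-cost paths with current utilization (0.0-1.0).
# A hash-based scheme might land the flow on the 90%-loaded path;
# load-aware placement picks the 35%-loaded one.
loads = [0.80, 0.35, 0.90, 0.55]
best = pick_path(loads)
print(f"assign new flow to path {best} (load {loads[best]:.2f})")
```

In a real fabric the load inputs would come from switch counters and downstream telemetry rather than a static list, which is precisely what the in-band telemetry and link monitoring described above provide.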

The Future of Ethernet for AI and ML Infrastructures

Ethernet has the features required for successful AI and ML training clusters: high bandwidth, end-to-end congestion management, load balancing and fabric management, at a lower cost than alternatives such as InfiniBand.

It is clear that Ethernet is a robust ecosystem that continues to innovate at a rapid pace. Broadcom has shown that it will keep improving its Ethernet switches to match the pace of innovation in the AI and ML industry, and Ethernet will remain part of HPC infrastructure in the future.

