BrandPosts are written and edited by members of our sponsor community. BrandPosts create an opportunity for an individual sponsor to provide insight and commentary from their point-of-view directly to our audience. The editorial team does not participate in the writing or editing of BrandPosts.
The AI era is accelerating. AI is no longer just a data model in a lab, and the industry is constantly exploring ways to implement AI applications. The compound annual growth rate (CAGR) of AI implementation in the government, finance, Internet, new retail, new manufacturing, and healthcare industries is expected to exceed 30% over the next three years. As AI arrives, is the underlying network infrastructure that provides key support for AI development ready?
Algorithms, computing power, and data are the three driving forces of AI development. Today, deep learning algorithms have seen major breakthroughs. However, algorithm-driven intelligence relies heavily on enormous volumes of sample data and high-performance computing capabilities. Revolutionary changes have taken place in the storage and computing fields to improve the data processing efficiency of AI.
Storage media have evolved from hard disk drives (HDDs) to solid-state drives (SSDs) to meet real-time data access requirements, reducing media latency by more than 100 times. In terms of computing power, the industry has adopted GPU servers and even dedicated AI chips, improving data processing capability more than 100-fold.
As a result, network communication latency has become the bottleneck for further performance improvement. Communication latency has grown from 10% to 60% of the total end-to-end (E2E) storage latency. That is, the storage medium sits idle waiting for communication for more than half of the total storage access time. Computing has a similar bottleneck. For example, in a voice recognition training job, each iteration takes 650 ms to 700 ms, of which communication latency accounts for 400 ms. The expensive processor likewise spends more than half of each iteration waiting for model parameters to synchronize.
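The "more than half" figure for the training example can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above; the function name `communication_share` is illustrative.

```python
def communication_share(communication_ms: float, iteration_ms: float) -> float:
    """Return the fraction of one training iteration spent on communication."""
    if iteration_ms <= 0:
        raise ValueError("iteration_ms must be positive")
    return communication_ms / iteration_ms


COMMUNICATION_MS = 400  # communication latency quoted in the article

# The article gives the iteration time as a range of 650 ms to 700 ms.
for iteration_ms in (650, 700):
    share = communication_share(COMMUNICATION_MS, iteration_ms)
    print(f"{iteration_ms} ms iteration: {share:.1%} spent on communication")
```

At both ends of the quoted range, communication takes up roughly 57% to 62% of each iteration, which is why the processor is idle for more than half of the time.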
So, is the underlying network infrastructure that provides key support for AI development ready? The answer is: only halfway.
New Requirements of RDMA Migration and AI Networks
Replacing TCP/IP with remote direct memory access (RDMA) has become a trend as AI computing and SSD-based distributed storage pursue ultimate performance. Dedicated InfiniBand networks and traditional Ethernet networks are the two network bearer options for RDMA.
InfiniBand is a network communication standard used for high-performance computing. Unlike the traditional TCP/IP protocol stack, InfiniBand has its own network-layer and transport-layer protocols. Most live networks, however, are IP Ethernet networks, so InfiniBand cannot meet the needs of AI computing and distributed storage systems that require many interconnections. In addition, as a dedicated network technology, InfiniBand cannot reuse the O&M experience and platforms built up on IP networks.
RDMA over traditional IP Ethernet networks, for its part, lacks a complete packet loss protection mechanism. Once the packet loss ratio exceeds 10^-3, RDMA throughput decreases sharply. In addition, existing RDMA congestion control and scheduling algorithms easily cause queue congestion on network devices, which can put the whole system at risk.
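To illustrate why even a small loss ratio can hurt so much, here is a rough sketch. It assumes go-back-N-style loss recovery, where one lost packet forces retransmission of every packet in flight; the article does not say which recovery scheme is in use, so this is an assumption for illustration only. The formula is the textbook go-back-N efficiency, and the window size of 1,000 packets is hypothetical.

```python
def go_back_n_efficiency(loss_ratio: float, window_packets: int) -> float:
    """Textbook go-back-N efficiency: (1 - p) / (1 + (N - 1) * p).

    loss_ratio (p) is the per-packet loss probability and
    window_packets (N) is the number of packets in flight.
    """
    if not 0.0 <= loss_ratio < 1.0:
        raise ValueError("loss_ratio must be in [0, 1)")
    if window_packets < 1:
        raise ValueError("window_packets must be at least 1")
    return (1.0 - loss_ratio) / (1.0 + (window_packets - 1) * loss_ratio)


WINDOW_PACKETS = 1000  # hypothetical in-flight window, not from the article

for loss_ratio in (0.0, 1e-5, 1e-4, 1e-3, 1e-2):
    efficiency = go_back_n_efficiency(loss_ratio, WINDOW_PACKETS)
    print(f"loss ratio {loss_ratio:.0e}: about {efficiency:.0%} of lossless throughput")
```

Under these assumptions, useful throughput falls to roughly half at a loss ratio of 10^-3, and drops much further as loss increases. Real deployments will differ, but the shape of the curve shows why zero packet loss matters for RDMA.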
Therefore, RDMA must be carried over an open Ethernet network with zero packet loss and high throughput.
AI Fabric Builds an Intelligent and Lossless Network for the AI Era
At HUAWEI CONNECT 2018, Huawei officially released the AI Fabric Intelligent and Lossless Data Center Network solution. “Zero packet loss”, “low latency”, and “high throughput” are the three core features of AI Fabric. These are based on Huawei’s innovative iLossless AI algorithm. The algorithm provides a series of congestion management and flow control capabilities, such as precise identification of congested flows, dynamic setting of the congestion waterline, and a fast backpressure mechanism, to deliver ultimate network performance.
According to the EANTC test results, AI Fabric can reduce computing latency by up to 44.3% in HPC scenarios and improve input/output operations per second (IOPS) by 25% in distributed storage scenarios. From a business value perspective, the 25% IOPS improvement for storage is equivalent to reducing storage investment by 25% while achieving the same performance. Take a distributed storage system consisting of 512 nodes as an example. With AI Fabric, 384 storage nodes can achieve the IOPS performance of 512 storage nodes on a traditional network. The resulting CAPEX reduction can deliver at least a 45-fold ROI relative to the investment in AI Fabric.
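The node arithmetic in this example can be sketched as follows. The article's figure of 384 nodes corresponds to cutting the node count of a 512-node system by 25%. For comparison, the sketch also computes the count under a stricter assumption that is not from the article: each node simply delivers 25% more IOPS, so total IOPS scales linearly with node count. Both function names are illustrative.

```python
import math


def nodes_after_reduction(baseline_nodes: int, fraction_saved: float) -> int:
    """Node count after removing a given fraction of the baseline nodes."""
    return round(baseline_nodes * (1.0 - fraction_saved))


def nodes_for_same_iops(baseline_nodes: int, per_node_iops_gain: float) -> int:
    """Nodes needed to match baseline IOPS if each node gets faster.

    Assumes total IOPS scales linearly with node count (an assumption
    made here for comparison, not a claim from the article).
    """
    return math.ceil(baseline_nodes / (1.0 + per_node_iops_gain))


BASELINE_NODES = 512  # system size used in the article's example

print(nodes_after_reduction(BASELINE_NODES, 0.25))  # 384, the article's figure
print(nodes_for_same_iops(BASELINE_NODES, 0.25))    # 410 under linear scaling
```

The two models give 384 and 410 nodes respectively, so the exact saving depends on how the 25% IOPS improvement translates into per-node capacity in a given deployment.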
In addition, because it is based on the standard Ethernet architecture, Huawei AI Fabric builds a unified converged network that can carry LAN, SAN, and IPC traffic, eliminating the need to maintain separate dedicated networks.
AI Fabric has also passed EANTC's strict testing and verification. In all test cases for high-performance computing and distributed storage, AI Fabric achieved high throughput and zero packet loss, shortened inter-node HPC communication time by 40% through network latency optimization, and greatly improved the efficiency of innovative services such as AI training.
Currently, AI Fabric has been deployed in the Internet and finance industries. At one retail bank, the intelligent congestion scheduling of AI Fabric accelerates network communication. In an on-site test, the IOPS performance of the storage cluster improved by 20%, and the performance of a single volume reached 350,000 IOPS. AI Fabric accelerates the bank's branch cloud and gives users the same experience as accessing local disks.
As predicted by Huawei Global Industry Vision (GIV), the AI adoption rate will reach 86% by 2025. More and more enterprises regard AI as the next step in their digital transformation strategy. In this transformation, data center networks urgently need to overcome their own limitations and adopt stable, converged link technologies and network protocols with higher throughput, lower latency, and higher reliability. The AI Fabric intelligent and lossless data center network solution is a good choice for the AI era.