RDMA (Remote Direct Memory Access) is a core networking technology for AI training clusters. This article covers how RDMA works, the mainstream protocols, and how it is applied in AI training.
## RDMA Overview
RDMA lets a network card read and write memory on a remote server directly, without the operating system kernel taking part in the data path. This significantly reduces network latency and CPU overhead.
## Core Value
- **Ultra-low latency**: End-to-end latency can drop to the 1-2 microsecond range
- **Low CPU usage**: The network card moves data to and from application memory by DMA; the CPU posts the request but does not copy the data
- **High bandwidth**: Throughput can approach the physical link's line rate
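How latency and bandwidth interact can be shown with a simple back-of-envelope model: transfer time is roughly the fixed latency plus the message size divided by the link rate. The sketch below is a minimal illustration, not a benchmark; the 100 Gb/s link rate and the 2 microsecond latency are assumed values, and the function names are hypothetical.

```python
def transfer_time_us(size_bytes: int, link_gbps: float, latency_us: float) -> float:
    """Estimated one-way transfer time in microseconds.

    Model: fixed latency + serialization time (size / link rate).
    1 Gb/s equals 1000 bits per microsecond.
    """
    bits = size_bytes * 8
    return latency_us + bits / (link_gbps * 1e3)


def effective_gbps(size_bytes: int, link_gbps: float, latency_us: float) -> float:
    """Throughput actually achieved for one message, in Gb/s."""
    bits = size_bytes * 8
    return bits / transfer_time_us(size_bytes, link_gbps, latency_us) / 1e3


if __name__ == "__main__":
    # Assumed figures for illustration only: 100 Gb/s link, 2 us latency.
    for size in (64, 4096, 1 << 20):
        t = transfer_time_us(size, 100.0, 2.0)
        bw = effective_gbps(size, 100.0, 2.0)
        print(f"{size:>8} B  {t:8.3f} us  {bw:7.2f} Gb/s")
```

Under this model, small messages are dominated by latency and large messages by link rate, which is why both properties matter.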
## Mainstream RDMA Protocols
### RoCEv2 (RDMA over Converged Ethernet)
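Carries the InfiniBand transport over UDP/IP (UDP destination port 4791) and depends on a lossless Ethernet fabric. It is currently the most mainstream solution:
- Runs on existing Ethernet infrastructure and is routable across IP subnets
- Switches need to support DCB/QoS features such as PFC (Priority Flow Control) and ECN
- Widely supported by vendors such as Huawei and H3C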
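Because RoCEv2 rides on UDP over ordinary Ethernet, each packet carries a stack of headers around the RDMA payload. The following sketch estimates how much of the frame is payload. It assumes an untagged IPv4 frame and counts only the base headers; the preamble, inter-frame gap, VLAN tags, and operation-specific extended headers are ignored, and the function names are hypothetical.

```python
ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

# Per-packet header sizes in bytes (assumption: untagged IPv4 frame).
HEADER_BYTES = {
    "ethernet": 14,  # destination MAC, source MAC, EtherType
    "ipv4": 20,      # IPv4 header without options
    "udp": 8,
    "bth": 12,       # InfiniBand Base Transport Header
    "icrc": 4,       # invariant CRC over the RDMA packet
    "fcs": 4,        # Ethernet frame check sequence
}


def per_packet_overhead() -> int:
    """Total header and trailer bytes added to each RDMA payload."""
    return sum(HEADER_BYTES.values())


def payload_efficiency(payload_bytes: int) -> float:
    """Fraction of the frame that is RDMA payload."""
    return payload_bytes / (payload_bytes + per_packet_overhead())


if __name__ == "__main__":
    for mtu in (256, 1024, 4096):
        print(f"payload {mtu:>4} B -> efficiency {payload_efficiency(mtu):.3f}")
```

Larger payloads amortize the fixed header cost better, which is one reason a larger MTU is often preferred on RDMA fabrics.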
### InfiniBand
Dedicated high-performance network protocol:
- Lowest latency
- Requires dedicated InfiniBand switches
- Deeply integrated with NVIDIA GPUs since NVIDIA's acquisition of Mellanox
## AI Training Scenario Applications
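RDMA is an essential technology for AI training clusters:
- GPUDirect RDMA: The network card reads and writes GPU memory directly, so data moves between GPUs on different servers without being staged in host memory
- Collective communication optimization: AllReduce and other collective operations are accelerated
- Lossless network: RoCEv2 deployments require DCB/QoS configuration on the switches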
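To make the AllReduce pattern concrete, here is a minimal pure-Python simulation of ring AllReduce, one common algorithm for this operation. It is a sketch under stated assumptions: plain lists stand in for GPU buffers, list copies stand in for RDMA transfers, and the buffer length must be divisible by the number of workers. It is not how any particular communication library is implemented.

```python
def ring_allreduce(buffers):
    """Simulate ring AllReduce (sum) over equally sized per-worker lists.

    Returns a new list of buffers in which every worker holds the
    element-wise sum of all inputs.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must have equal length divisible by worker count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(idx):
        start = (idx % n) * chunk
        return slice(start, start + chunk)

    # Phase 1, reduce-scatter: after n-1 steps worker r owns the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so sends are simultaneous.
        sends = [((r - step) % n, data[r][span(r - step)]) for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            dst = (r + 1) % n
            s = span(idx)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], payload)]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][span(r + 1 - step)]) for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            data[(r + 1) % n][span(idx)] = payload

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6],
             [10, 20, 30, 40, 50, 60],
             [100, 200, 300, 400, 500, 600]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(worker, result)
```

In this scheme each worker sends 2(N-1)/N of its buffer in total, regardless of how many workers N take part, so per-link bandwidth and latency directly bound the time of each training step.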