RDMA Explained: Part 1

This is the first part of a multi-part post covering RDMA, current research, and RDMA’s role in the future of networking.

“Bandwidth problems can be cured with money. Latency problems are harder because speed of light is fixed—you can’t bribe God” - Anonymous

Not Your Father’s Networks

When networks ran at a mere 100 Mbps and silicon was abiding by Moore’s law, CPUs could perform useful work between packet arrivals, since network round trips took on the order of milliseconds. At the time, DRAM latencies were already down in the nanoseconds and clock speeds were still growing. Among other factors, this six-order-of-magnitude gap in latency led to a design decision in the Linux kernel that the network should wake up the CPU on packet arrival (interrupts).

Fast forward to 2018: network speeds have continued to grow rapidly, but DRAM latency and clock speeds have largely stagnated. One study from CMU [1] shows that DRAM latency has improved by only about 1.2x since 1999. Much of the resulting overhead comes from relying on the kernel to manage network processing, i.e. user-kernel boundary crossings and per-packet interrupts.

In modern networking this has led to three major pushes:

  1. Kernel bypass and zero-copy/single-copy
  2. Polling-based processing (interrupt avoidance; see the sketch after this list)
  3. Hardware offloading
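
To get a feel for the second push, even stock Linux exposes a busy-polling knob on ordinary sockets. The sketch below is a minimal, Linux-specific illustration (the port and timeout are made up): it asks the kernel to spin on the device queue for a bounded time on each receive instead of sleeping until an interrupt arrives.

```c
/*
 * Minimal sketch: opt a UDP socket into kernel busy polling (SO_BUSY_POLL),
 * so a blocking recv() spins on the NIC queue for up to `busy_usecs`
 * microseconds before falling back to interrupt-driven wakeup.
 * Linux-specific; older kernels or missing privileges may reject it.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int busy_usecs = 50;                       /* spin up to 50 us per receive */

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_usecs, sizeof(busy_usecs)) < 0)
        perror("SO_BUSY_POLL");

    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(9000),        /* hypothetical port */
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof(buf), 0); /* busy-polls, then blocks */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}
```

This still crosses the kernel on every packet; DPDK and RDMA take the same idea further by polling NIC queues from userspace or from the NIC itself.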

RDMA High Level Overview

RDMA, or Remote Direct Memory Access, offloads the transport layer to silicon, eliminating kernel packet processing and moving the CPU out of the critical path. As the name implies, it allows machines to read from and write into each other’s registered memory regions directly, enabling sub-10-microsecond transfer latencies in controlled benchmarks [2]. However, there are catches.
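
To make “writing into each other’s memory” concrete, here is a minimal sketch of posting a one-sided RDMA WRITE with the libibverbs API. It assumes a connected queue pair and a registered memory region already exist, and that the peer’s buffer address and rkey were exchanged out of band; the function and parameter names here are mine, not from any particular codebase.

```c
/*
 * Minimal sketch (not a full program) of a one-sided RDMA WRITE with
 * libibverbs. Setup of the queue pair, memory registration, and the
 * out-of-band exchange of remote_addr/rkey are assumed to happen elsewhere.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local source buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,               /* key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer's virtual address */
    wr.wr.rdma.rkey        = rkey;               /* peer's remote key */

    /* The NIC performs the transfer; the remote CPU is not involved. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The NICs on both sides move the data; the remote CPU never sees the transfer unless the application arranges a separate notification.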

Catch #1: RDMA’s native transport, InfiniBand, assumes a lossless network fabric, which it achieves through link-level (credit-based) flow control.

Catch #2: You’ll need to overhaul your datacenter with InfiniBand-capable switches and NICs.

Why Can’t I Just Use Ethernet Instead of InfiniBand?

The good news is that you can run RDMA over Ethernet with protocols such as iWARP and RoCE. The bad news is that these protocols are either quite heavyweight (iWARP) or still require a lossless network (RoCE).

RoCE vs. iWARP:

iWARP does not require a lossless network because it implements the entire TCP/IP stack in the NIC. While this is great for generality, it results in far more complex NICs, higher cost, and lower performance [3]. RoCE, on the other hand, uses UDP datagrams (in RoCEv2) but assumes it is running on a lossless network.

This has led to a growing body of work, led by Microsoft, that uses PFC (Priority Flow Control) within the network, but this just pushes the complexity out of the NIC and into the management layer of the network. Guo et al. detail the problems with large-scale RoCE deployments in the SIGCOMM ’16 paper RDMA over Commodity Ethernet at Scale [4], including livelocks, deadlocks, and head-of-line blocking, among others.
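
The head-of-line blocking in particular falls out of how coarse PFC is. The struct below is a rough, illustrative rendering of the IEEE 802.1Qbb pause frame (not taken from any driver): a pause names a priority class and a duration, so one congested receiver silences every flow sharing that class on the upstream link.

```c
/*
 * Rough sketch of an IEEE 802.1Qbb Priority Flow Control (PFC) frame,
 * for illustration only. A pause applies to whole priority classes on a
 * link, not to individual flows.
 */
#include <stdint.h>

struct pfc_frame {
    uint8_t  dst_mac[6];     /* 01:80:C2:00:00:01, the MAC control address */
    uint8_t  src_mac[6];
    uint16_t ethertype;      /* 0x8808: MAC control */
    uint16_t opcode;         /* 0x0101: priority-based flow control */
    uint16_t class_enable;   /* bitmap of which of the 8 priorities to pause */
    uint16_t pause_time[8];  /* per-priority pause duration, in quanta */
} __attribute__((packed));
```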

What’s the Relationship Between DPDK and RDMA?

While DPDK also provides kernel bypass and poll-based mechanisms that cut the kernel out of the data path, the current argument for RDMA over DPDK is that DPDK does not go far enough. Mellanox claims [5] that because packet processing is still done in userspace (rather than on the NIC), the CPU carries a larger share of the burden than it does with RDMA, which offloads processing entirely.

In my opinion this argument is weak because it sidesteps the aforementioned complexity of RDMA deployment, which is the real bottleneck. Additionally, RDMA applications still have to harvest completions, either by busy polling or by event-triggered notification, each with its own latency vs. CPU usage trade-offs [6].
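
Both completion styles are visible directly in the verbs API. The sketch below contrasts them (error handling omitted; the completion queue and its completion channel are assumed to have been created elsewhere).

```c
/*
 * Sketch of the two completion-handling styles in libibverbs.
 * Error handling omitted for brevity.
 */
#include <infiniband/verbs.h>

/* Busy polling: lowest latency, but one core spins at 100%. */
void busy_poll(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                   /* spin until a completion shows up */
    /* ... handle wc ... */
}

/* Event-triggered: sleep until the NIC signals a completion event,
 * trading extra wakeup latency for an idle CPU. */
void event_wait(struct ibv_cq *cq, struct ibv_comp_channel *ch)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    ibv_req_notify_cq(cq, 0);               /* arm the CQ for the next completion */
    ibv_get_cq_event(ch, &ev_cq, &ev_ctx);  /* blocks until the event fires */
    ibv_ack_cq_events(ev_cq, 1);
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* ... handle wc ... */
    }
}
```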

Performance! But at what COST?

There are plenty of performance reports showing RDMA’s benefits over TCP. Guo et al. [4] show that when RoCEv2 was deployed in Bing, 99th-percentile latency was an order of magnitude lower than TCP’s. However, the real problems are programmability, security, deployment, and management. Because of the complexity of RDMA’s management layer, scalability is still a problem. The big question is how much we are willing to sacrifice in the pursuit of performance, and how we can close the gap so that we sacrifice less.

Next

In the next part we’ll take a more detailed look at RDMA’s hardware strategy, namely the host channel adapter (HCA), and at how RDMA transactions and verbs work in RDMA applications.

Bibliography

[1] - Memory scaling: A systems architecture perspective

[2] - Design Guidelines for High Performance RDMA Systems

[3] - Revisiting Network Support for RDMA

[4] - RDMA over Commodity Ethernet at Scale

[5] - Using Hardware to Improve NFV Performance

[6] - Performance Isolation Anomalies in RDMA