This is the first part of a multi-part post going over RDMA, current research and RDMA’s role in the future of networking.
“Bandwidth problems can be cured with money. Latency problems are harder because speed of light is fixed—you can’t bribe God” - Anonymous
When networks ran at a mere 100 Mbps and silicon was abiding by Moore's law, CPUs could perform useful work between packet arrivals, which had round-trip latencies on the order of milliseconds. At the time, DRAM latencies were already in the nanoseconds and clock speeds were growing. Among other factors, this six-order-of-magnitude difference in latency led to a design decision in the Linux kernel that the network should wake up the CPU on packet arrival (interrupts).
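To put rough numbers on that gap (the figures below are illustrative round numbers, not measurements): at a 1 GHz clock, a single millisecond round trip leaves room for about a million cycles of useful work, which is why parking the CPU behind an interrupt was a reasonable design at the time.

```python
# Back-of-the-envelope arithmetic with illustrative round numbers.
rtt_ns = 1_000_000   # ~1 ms round trip on a 100 Mbps-era network, in ns
dram_ns = 100        # ~100 ns DRAM access
clock_ghz = 1        # 1 GHz clock => 1 cycle per nanosecond

# Work a CPU could do while waiting out a single round trip:
cycles_per_rtt = rtt_ns * clock_ghz        # 1,000,000 cycles
dram_accesses_per_rtt = rtt_ns // dram_ns  # 10,000 memory accesses

print(cycles_per_rtt, dram_accesses_per_rtt)
```

Shrink the RTT to single-digit microseconds and the slack collapses, which is the pressure behind the designs discussed next.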
Fast forward to 2018: networks have continued to scale almost linearly, but DRAM latency and clock speeds have stayed roughly flat. One study from CMU shows that DRAM latency has improved by only about 1.2x since 1999. Some of these problems stem from relying on the kernel to manage network processing, i.e. user–kernel boundary crossings and per-packet interrupts.
In modern networking this has led to three major pushes:
RDMA, or Remote Direct Memory Access, offloads the transport layer to silicon to eliminate kernel packet processing and move the CPU out of the critical path. As the name implies, it allows devices to write directly into each other's virtual memory regions and enables sub-10-microsecond transfer latencies in controlled benchmarks. However, there are catches.
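As a mental model of that one-sided semantics, here is a toy Python sketch — the names (`MemoryRegion`, `rdma_write`, `rkey`) are illustrative stand-ins, not the real verbs API. The target registers a memory region and hands out a remote key; a peer holding that key writes bytes straight into the buffer, with no receive handler running on the target's CPU.

```python
# Toy model of one-sided RDMA WRITE semantics. All names here are
# illustrative; the real API (libibverbs) looks nothing like this.
class MemoryRegion:
    """A registered buffer, addressable by peers that hold its rkey."""
    def __init__(self, size, rkey):
        self.buf = bytearray(size)
        self.rkey = rkey  # stand-in for the key produced by registration

def rdma_write(region, rkey, offset, data):
    """Peer-side write: data lands in the target's memory without any
    packet handler running on the target CPU."""
    if rkey != region.rkey:
        raise PermissionError("invalid remote key")
    region.buf[offset:offset + len(data)] = data

mr = MemoryRegion(64, rkey=0x1234)
rdma_write(mr, 0x1234, 0, b"hello")
print(bytes(mr.buf[:5]))  # b'hello'
```

The rkey check is the (thin) protection boundary: anyone who learns the key can write the region, which is part of why security shows up later as one of RDMA's real problems.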
Catch #1: You need a lossless network fabric that supports the native RDMA protocol. That fabric is InfiniBand, which achieves losslessness through link-level flow control.
Catch #2: You'll need to overhaul your datacenter with InfiniBand-enabled switches and NICs.
The good news is that you can use Ethernet with protocols such as iWARP and RoCE. The bad news is that these protocols are either very heavyweight (iWARP) or require a lossless network (RoCE).
iWARP does not require a lossless network because it implements the entire TCP/IP stack in the NIC. While this is great for generality, it results in far more complex NICs, higher cost, and lower performance. RoCE, on the other hand, uses UDP datagrams (in its v2 incarnation) but works under the assumption that it will be running on a lossless network.
This has led to a growing body of work, led by Microsoft, on using PFC, or Priority Flow Control, within the network — but this just pushes the complexity out of the NIC and into the management layer of the network. Guo et al. detail the problems with large-scale RoCE deployments in the SIGCOMM '16 paper RDMA over Commodity Ethernet at Scale, which include live-locks, deadlocks, and head-of-line blocking, among others.
While DPDK also provides kernel bypass and poll-based mechanisms that reduce reliance on the CPU, the current argument for RDMA over DPDK is that DPDK does not go far enough. Mellanox makes the claim that because packet processing is still done in userspace (rather than on the NIC), CPUs shoulder a larger burden compared to RDMA, which offloads it completely.
In my opinion this argument is weak because it sidesteps the aforementioned complexity of RDMA deployment, which is the real bottleneck for RDMA. Additionally, RDMA still uses polling — either busy polling or event-triggered completion — each of which has its own latency vs. CPU-usage trade-offs.
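That trade-off can be sketched with a stand-in completion queue (a Python `queue.Queue` here, not a real RDMA CQ): busy polling buys the lowest wake-up latency by pegging a core, while event-triggered waiting frees the core but pays a sleep/wake-up cost on every completion.

```python
import queue
import threading

cq = queue.Queue()  # stand-in for an RDMA completion queue


def busy_poll(cq):
    """Spin on the CQ: minimal wake-up latency, one core pegged at 100%."""
    spins = 0
    while True:
        try:
            return cq.get_nowait(), spins
        except queue.Empty:
            spins += 1  # cycles burned while no completion is ready


def event_wait(cq):
    """Block until a completion arrives: frees the core, but the
    sleep/wake-up path adds latency on every completion."""
    return cq.get()


# Simulate a completion arriving shortly after we start polling.
threading.Timer(0.01, cq.put, args=("completion",)).start()
work, wasted = busy_poll(cq)
print(work, wasted)  # the spin count is the CPU price of low latency
```

The same dial exists in real deployments: spin when latency is king and cores are cheap, block when the CPU has better things to do.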
There are plenty of performance reports showing RDMA's benefits over TCP. Guo et al. show that when RoCEv2 was deployed in Bing, 99th-percentile latency dropped by an order of magnitude compared to TCP. However, the real problems are programmability, security, deployment, and management. Due to the complexity of RDMA's management layer, scalability is still a problem. The big question is how much we sacrifice in the pursuit of performance, and how we can close the gap to sacrifice less.
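For concreteness, 99th-percentile (p99) latency is just the value below which 99% of samples fall — a tail metric that interrupt jitter and retransmissions inflate far more than the median. The sketch below uses made-up numbers, not Guo et al.'s Bing measurements:

```python
# Made-up latency samples in microseconds — NOT the Bing data —
# purely to show how a tail percentile is read off a sample set.
def p99(samples):
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]  # nearest-rank percentile

tcp_us = [500 + 50 * (i % 7) for i in range(100)] + [5000]  # long tail
rdma_us = [20 + 2 * (i % 7) for i in range(100)] + [90]

print(p99(tcp_us), p99(rdma_us))  # the tail gap dwarfs the median gap
```

Tail latency is the number that matters for fan-out services like search, where one request touches many servers and the slowest reply gates the response.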
In the next section we'll take a more detailed look at RDMA's hardware strategy — namely the host channel adapter (HCA) — and at how RDMA transactions and verbs work in RDMA applications.