Linux kernel bypass Linux kernel bypass

Unfortunately the speed of vanilla Linux kernel networking is not sufficient for more specialized workloads. For example, here at CloudFlare, we are constantly dealing with large packet floods. Vanilla Linux can do only about 1M pps. This is not enough in our environment, especially since the network cards are capable of handling a much higher throughput. Modern 10Gbps NIC’s can usually process at least 10M pps.

et’s prepare a small experiment to convince you that working around Linux is indeed necessary. Let’s see how many packets can be handled by the kernel under perfect conditions. Passing packets to userspace is costly, so instead let’s try to drop them as soon as they leave the network driver code. To my knowledge the fastest way to drop packets in Linux, without hacking the kernel sources, is by placing a DROP rule in the PREROUTING iptables chain:
$ sudo iptables -t raw -I PREROUTING -p udp –dport 4321 –dst -j DROP
$ sudo ethtool -X eth2 weight 1
$ watch ‘ethtool -S eth2|grep rx’
rx_packets: 12.2m/s
rx-0.rx_packets: 1.4m/s
rx-1.rx_packets: 0/s

Ethtool statistics above show that the network card receives a line rate of 12M packets per second. By manipulating an indirection table on a NIC with ethtool -X, we direct all the packets to RX queue #0. As we can see the kernel is able to process 1.4M pps on that queue with a single CPU.
Processing 1.4M pps on a single core is certainly a very good result, but unfortunately the stack doesn’t scale. When the packets hit many cores the numbers drop sharply. Let’s see the numbers when we direct packets to four RX queues:
$ sudo ethtool -X eth2 weight 1 1 1 1
$ watch ‘ethtool -S eth2|grep rx’
rx_packets: 12.1m/s
rx-0.rx_packets: 477.8k/s
rx-1.rx_packets: 447.5k/s
rx-2.rx_packets: 482.6k/s
rx-3.rx_packets: 455.9k/s
Now we process only 480k pps per core. This is bad news. Even optimistically assuming the performance won’t drop further when adding more cores, we would still need more than 20 CPU’s to handle packets at line rate. So the kernel is not going to work.

Solarflare network cards support OpenOnload, a magical network accelerator. It achieves a kernel bypass by implementing the network stack in userspace and using an LD_PRELOAD to overwrite network syscalls of the target program. For low level access to the network card OpenOnload relies on an “EF_VI” library. This library can be used directly and is well documented.
EF_VI, being a proprietary library, can be only used on Solarflare NIC’s, but you may wonder how it actually works behind the scenes. It turns out EF_VI reuses the usual NIC features in a very smart way.
Under the hood each EF_VI program is granted access to a dedicated RX queue, hidden from the kernel. By default the queue receives no packets, until you create an EF_VI “filter”. This filter is nothing more than a hidden flow steering rule. You won’t see it in ethtool -n, but the rule does in fact exist on the network card. Having allocated an RX queue and managed flow steering rules, the only remaining task for EF_VI is to provide a userspace API for accessing the queue.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.