Linux kernel bypass Linux kernel bypass

Unfortunately the speed of vanilla Linux kernel networking is not sufficient for more specialized workloads. For example, here at CloudFlare, we are constantly dealing with large packet floods. Vanilla Linux can do only about 1M pps. This is not enough in our environment, especially since the network cards are capable of handling a much higher throughput. Modern 10Gbps NIC’s can usually process at least 10M pps.

et’s prepare a small experiment to convince you that working around Linux is indeed necessary. Let’s see how many packets can be handled by the kernel under perfect conditions. Passing packets to userspace is costly, so instead let’s try to drop them as soon as they leave the network driver code. To my knowledge the fastest way to drop packets in Linux, without hacking the kernel sources, is by placing a DROP rule in the PREROUTING iptables chain:
$ sudo iptables -t raw -I PREROUTING -p udp –dport 4321 –dst -j DROP
$ sudo ethtool -X eth2 weight 1
$ watch ‘ethtool -S eth2|grep rx’
rx_packets: 12.2m/s
rx-0.rx_packets: 1.4m/s
rx-1.rx_packets: 0/s

Ethtool statistics above show that the network card receives a line rate of 12M packets per second. By manipulating an indirection table on a NIC with ethtool -X, we direct all the packets to RX queue #0. As we can see the kernel is able to process 1.4M pps on that queue with a single CPU.
Processing 1.4M pps on a single core is certainly a very good result, but unfortunately the stack doesn’t scale. When the packets hit many cores the numbers drop sharply. Let’s see the numbers when we direct packets to four RX queues:
$ sudo ethtool -X eth2 weight 1 1 1 1
$ watch ‘ethtool -S eth2|grep rx’
rx_packets: 12.1m/s
rx-0.rx_packets: 477.8k/s
rx-1.rx_packets: 447.5k/s
rx-2.rx_packets: 482.6k/s
rx-3.rx_packets: 455.9k/s
Now we process only 480k pps per core. This is bad news. Even optimistically assuming the performance won’t drop further when adding more cores, we would still need more than 20 CPU’s to handle packets at line rate. So the kernel is not going to work.

Solarflare network cards support OpenOnload, a magical network accelerator. It achieves a kernel bypass by implementing the network stack in userspace and using an LD_PRELOAD to overwrite network syscalls of the target program. For low level access to the network card OpenOnload relies on an “EF_VI” library. This library can be used directly and is well documented.
EF_VI, being a proprietary library, can be only used on Solarflare NIC’s, but you may wonder how it actually works behind the scenes. It turns out EF_VI reuses the usual NIC features in a very smart way.
Under the hood each EF_VI program is granted access to a dedicated RX queue, hidden from the kernel. By default the queue receives no packets, until you create an EF_VI “filter”. This filter is nothing more than a hidden flow steering rule. You won’t see it in ethtool -n, but the rule does in fact exist on the network card. Having allocated an RX queue and managed flow steering rules, the only remaining task for EF_VI is to provide a userspace API for accessing the queue.

High Performance C++ Profiling

High Performance C++ Profiling

My interest in code profiling started when I was making hudbot. What with code injection and patching, function hooking, data hijacking, and OpenGL, I knew I had relatively no experience in what I was attempting and that I could easily be producing some amazing slowdowns if I wasn’t careful.

Unfortunately, C++ profilers seem to come in three varieties, all of which have a fatal downside:

Sampling Profilers which are fast, multi-threaded, but inaccurate and have decent output (sometimes too detailed). Some examples are VTune, CodeAnalyst, google-perftools and Sleepy.
Instrumenting Profilers which are accurate, multi-threaded, but slow, and have decent output. Some examples are GlowCode and the now defunct DevPartner Profiler Community Edition.
Instrumenting Profilers which are fast, accurate, but single threaded and have limited output. These range from extremely simple profilers like Peter Kankowski’s Poor Man’s Profiler to the more complicated and full-featured Shiny C++ Profiler.
The obvious outcome is that if you want fast and accurate, like I did, you’ll have to use an existing profiler or write it yourself and instrument your code manually. With a little work, fancy stuff like call trees can be added. Once you get it tested and working, you can start going crazy profiSegmentation fault.

Oh yeah, about that. There are no multi-threaded instrumented profilers that are open source, and depending on how your single threaded profiler works, the results when trying to use it in a multi-threaded environment can range from bad data to outright crashing. It’s possible to patch the profiler to only allow the main thread in, but this adds unnecessary slowdowns and doesn’t address how to profile other threads. This is where my profiler comes in!

Pieces of a high performance multi-threaded C++ profiler


Latency in cycles and resolution of various timing methods (resolution is hand wavy, not to scale)
The main piece of a high performance profiler is what mechanism is used to get the timestamps. High precision is the obvious main requirement, but it must also have as low a latency as possible. If you’re making millions of calls a second to your profiler, the timestamp mechanism could become the limiting factor in your app’s performance and make it so unresponsive that testing it is infeasible.

On an x86, this means you must go with rdtsc. It is low latency, high precision, and is portable to gcc. This choice is unfortunately not without it’s trade offs. rdtsc does not serialize, so unless you insert a serializing instruction like cpuid before it (and bloat the latency in the process) or use the new rtdscp instruction, the cycle count you receive may not be 100% accurate. rdtsc is not guaranteed to be sync’d across all CPUs in a multi-core / multi-CPU system, so even single threaded timing has the possibility of being incorrect if the thread is scheduled across multiple CPUs. But, and this is a big but, for what I want there is nothing else to use. If someone else has different needs they can replace the timer function, but for the volume of calls I’m interested in, latency needs to be the bare minimum.

Vitesse Data | Welcome

Vitesse Data | Welcome
SSE Optimization CSV file parsing is done using SSE instructions that process the CSV data 16-byte at a time. Drop-in Deployment 100% binary compatibility with Postgres 9.3.5 means there is no need to modify your application or site operation to realize the speed benefits and cost savings in electricity or AWS. Mr. Sulu, Step On It! CSV imports run up to 2X faster. OLAP aggregates run up to 10X faster. All because Vitesse DB pushes your x86 CPU to its limits. Web Server Performance Comparison – DreamHost Web Server Performance Comparison – DreamHost
Remember, Apache supports a larger toolbox of things it can do immediately and is probably the most compatible across all web software out there today… and most websites really don’t get so many concurrent hits as to gain large performance/memory benefits from Lighttpd or nginx. But hey, it never hurts (too much) to swap your web servers around and see what works best for you! The Raspberry Pi Web Server Speed Test – Raspberry Pi Blog The Raspberry Pi Web Server Speed Test – Raspberry Pi Blog
Summary Winner: Nginx Overall I think the fastest and most reliable solution is Nginx. I only say this because it’s more mature than Monkey and has some stability going for it. Monkey however is catching up fast. There seems to be a lot of enthusiasm for the project and as you can see by these tests it does very well especially with text. In the image arena Apache still seems to dominate. I’m not sure why that is, but it clearly handles this function very well. With some tuning you can make Apache handle text well too, but I still think it’s a product that’s past its prime. If I had to recommend anything it would be Nginx, but soon I may be changing that depending on how Monkey progresses. USB 3.0 Tested: How Fast Is It in the Real World? | News & Opinion | USB 3.0 Tested: How Fast Is It in the Real World? | News & Opinion |
So what’s the bottom line? In all cases, I did see a notable performance improvement using USB 3.0, but it wasn’t anywhere near the 10X improvement in rated connection speed, or the two to three times improvement I was hoping to see. Still, writing at 24 MB/sec is a lot better than at 14 MB/sec, and the difference in price is fairly small, so I can recommend these drives as a real improvement. I just wanted more.

Dice News: Speed Test: Comparing Intel C++, GNU C++, and LLVM Clang Compilers

Dice News: Speed Test: Comparing Intel C++, GNU C++, and LLVM Clang Compilers
Conclusion: It’s interesting that the code built with the g++ compiler performed the best in most cases, although the clang compiler proved to be the fastest in terms of compilation time. But I wasn’t able to test much regarding the parallel processing with clang, since its Cilk Plus extension aren’t quite ready, and the Threading Building Blocks team hasn’t ported it yet.