I took every Redis release starting with 2.6 and compiled each of them 12 different ways.
It took my computer (actual computer, real cores, no VM, no SMT/HT) about 5 days to run through the entire test. I repeated the entire test four times and assembled the best results from each run.
The main job of Redis is storing and retrieving data. To store and retrieve data, Redis is constantly creating C strings and allocating/freeing memory. These benchmarks test whether different optimization options can give us a higher-performing Redis without any additional coding.
What is -flto?
LTO is short for Link Time Optimization. Both clang and gcc support LTO.
What is Link Time Optimization?
LTO is an alternative, and preferred, way to extend the optimize-at-compile-time capability of compilers across an entire program. Compilers can normally only inline and optimize code when all relevant functions are visible at the same time within the same compilation unit (.o file). If your project generates 50 independent .o files (or thousands of them), the compiler can’t reach into one compiled .o file, extract a function, and inline it into another .o file.
Historically, the way you get around optimization being limited to compilation units is an amalgamation build, where you basically concatenate all your source files into one giant 110,000-line file. But Redis conditionally includes some files based on your current system, so you can’t just cat *.c > redis-everything.c.
This is where LTO comes in. LTO retains each function’s details in an intermediate representation inside the object files (for LLVM/clang the format is LLVM bitcode; for GCC it’s GIMPLE).
Now your compiler can still compile each source file to an independent object file, but it keeps the ability to cross-inline and optimize functions at Link Time, giving us full Link Time Optimization without needing to concatenate all our files together.
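In practice, enabling LTO mostly means passing -flto at both the compile step and the link step. A minimal sketch of the workflow (the file names here are placeholders, not Redis’s actual build):

```shell
# Compile each source file to its own object file as usual, but with -flto
# so each .o also carries the compiler's intermediate representation.
cc -O2 -flto -c foo.c
cc -O2 -flto -c bar.c

# Link with -flto and the same optimization level; now the "link" step can
# cross-inline and optimize functions across foo.o and bar.o.
cc -O2 -flto foo.o bar.o -o prog
```

Forgetting -flto on the link line is the classic mistake: you pay the cost of emitting the intermediate representation but never get the cross-unit optimization.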
Relevance to Redis
Redis uses an internal string library for managing C strings, along with an internal allocation wrapper around malloc/free. With LTO, the compiler can cross-inline calls to the heavily used string creation/updating/destruction functions, as well as calls to zmalloc/zfree, reducing function call overhead across many operations.
You’ll notice compiles with LTO enabled dominate the benchmark results because they generate faster performing code.
GCC allows per-function optimization annotations. Based on empirical data, we can decorate functions with GCC attributes or pragmas: optimize some functions for size so specific tight-loop functions stay in hot cache more easily, and optimize others to lower their call overhead.
But, you’ll also notice clang outperforms GCC in most benchmarks.
Clang doesn’t allow per-function optimization levels, but with either compiler we could use specific optimization levels for specific files instead of one default -O2 across the entire project. Some data structures clearly benefit from -Os or even -Ofast over -O2.
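Per-file levels need nothing fancier than different flags per compile invocation. A sketch (ziplist.c is a real Redis source file, but the flag assignments here are illustrative, not Redis’s actual Makefile rules):

```shell
# Most files get the project-wide default:
cc -O2 -c server.c

# A data structure that benchmarks better when kept small:
cc -Os -c ziplist.c

# A loop-heavy file that benefits from more aggressive optimization:
cc -Ofast -c dict.c

cc -O2 server.o ziplist.o dict.o -o redis-sketch
```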
Note: none of my tests used -march, but it would be interesting to re-run them under that condition too.
Note: Redis currently has over 150 individual commands, but the redis-benchmark suite, by default, only tests about 10 common ones. There are enough combinations of compiler options and commands to keep a non-trivial cluster busy with a month of continuous performance testing across historical versions.