redhat.com: Practical micro-benchmarking with ‘ltrace’ and ‘sched’

Recently I was asked to look into an issue related to QEMU's I/O subsystem performance – specifically, I was looking for differences in performance between glibc's malloc and other malloc implementations. After a good deal of benchmarking I was unable to see a clear difference between our malloc (glibc) and the others, not because the implementations were similar, but because there was too much noise in the system; the benchmarks were being polluted by other activity, not just in QEMU but elsewhere in the operating system. I really just wanted to see how much time malloc was using, but it was a small signal in a large, noisy system.

To get to the results I needed, I had to isolate malloc’s activity so I could measure it more accurately. In this blog, I’ll document how I extracted the malloc workload, and how to isolate that workload from other system activity so it can be measured more effectively.

(Note, of course, that benchmarks are statistical things, and not all of the methods that worked for me may show statistically significant results for you. YMMV – “your metrics may vary” – but I am hopeful these techniques will prove useful in your own scenario.)

Step One – ltrace
The first step was to analyze the target workload to see what it was doing. If you’re benchmarking your own function, you can add ‘printf()’ statements or other trace functionality to record every call, but in this case – benchmarking a system library – we don’t have that option.

(Ok, it’s open source, we have that option, but it’s not practical).

The ltrace utility is designed to capture calls to system libraries, so it's a good fit for our situation. The syntax we'll use lets us attach to a running process and store the output to a file. You'll need the target's process ID (pid) and the name of a file (filename) in which to place the ltrace output:

ltrace -p <pid> -f -o <filename> -e malloc+free
If we run QEMU in one window and this command in another, we can start ltrace right before starting the test case within QEMU and CTRL-C it once the test case is done, capturing only the relevant data. Your output should look something like this:

15671 exe->malloc(144) = 0x7fa9f6e82eb0
15671 exe->malloc(128) = 0x7fa9f1513580
15662 exe->malloc(40) = 0x7fa9f2338f90
15662 exe->free(0x7fa9f79e0000) = <void>
15662 exe->free(0x7fa9f6c3d100) = <void>
The actual results are not relevant to this blog, but I observed that malloc was called many times with only a few different sizes requested, so if I benchmarked calls of those sizes I’d get a general idea of how our malloc compares to others. A simple sed/awk/grep/sort/uniq/pr script (ok, not so simple) gives us a histogram of malloc calls. Alternatively, you can just read the logs and interpret them yourself. 😉
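For example, assuming the ltrace output went to a file named ltrace.out (the name is mine, not anything ltrace requires), a pipeline like this counts how many times each size was requested, most frequent first:

grep -o 'malloc([0-9]*)' ltrace.out | sed 's/[^0-9]//g' | sort -n | uniq -c | sort -rn | head

Each output line is a count followed by a size, which is all we need in order to pick which sizes are worth benchmarking.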

Step Two – The Microbenchmark
If you want to time something that takes minutes, you can use a stopwatch. If it only takes seconds, gettimeofday() is sufficient; however, a single call to malloc might only take microseconds or even nanoseconds.

Despite the wide range of system performance tools available, we'll choose an old, simple solution – ReaD Time Stamp Counter (RDTSC) – which tells us how many CPU clock cycles (or the equivalent) have passed. Since we're comparing malloc implementations, a relative time is sufficient and we don't need to know (or even worry about) converting clock cycles to microseconds. The RDTSC instruction is CPU-dependent, but it's available on the one I'm using (x86-64) and is typically used like this:

#include <stdint.h>

static __inline__ int64_t rdtsc(void)
{
    unsigned a, d;
    /* RDTSC returns the 64-bit timestamp counter in EDX:EAX */
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

. . .

int64_t clocks_before, clocks_after, clocks_per_malloc;

clocks_before = rdtsc ();
p = malloc (i); /* Test goes here */
clocks_after = rdtsc ();
clocks_per_malloc = clocks_after - clocks_before;
Now we just run that code a few thousand times and we should get some nice clean results (a minimal harness for doing so is sketched below). Since there are only a small number of paths the code can follow through malloc, we should see only a few different times reported… or not?
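A harness for that loop, reusing the rdtsc() helper above, might look something like the sketch below; the iteration count, the 144-byte request size (one of the sizes seen in the ltrace output), and the raw dump of samples are placeholders of mine rather than the exact harness behind the numbers in this post:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SAMPLES 4096   /* "a few thousand times" */

int main(void)
{
    static int64_t samples[SAMPLES];

    for (int i = 0; i < SAMPLES; i++) {
        int64_t clocks_before = rdtsc();
        void *p = malloc(144);          /* Test goes here */
        int64_t clocks_after = rdtsc();
        samples[i] = clocks_after - clocks_before;
        free(p);                        /* untimed cleanup */
    }

    /* Dump the raw cycle counts; a histogram can be built from these. */
    for (int i = 0; i < SAMPLES; i++)
        printf("%lld\n", (long long)samples[i]);
    return 0;
}

Piping the output through sort -n | uniq -c gives a quick view of how many distinct timings actually show up.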

One drawback of the RDTSC instruction is that the CPU is allowed to reorder it relative to other instructions, which causes noise in our results. Fortunately, Intel provides an RDTSCP instruction, which waits for earlier instructions to complete before reading the counter. We'll pair that with a CPUID instruction, which serializes execution and so acts as a barrier against reordering, resulting in this:

static __inline__ int64_t rdtsc_s(void)
{
    unsigned a, d;
    /* CPUID serializes: nothing before the measured region can slip past it */
    asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

static __inline__ int64_t rdtsc_e(void)
{
    unsigned a, d;
    /* RDTSCP waits for prior instructions to finish; it also writes ECX, hence the clobber */
    asm volatile("rdtscp" : "=a" (a), "=d" (d) : : "%rcx");
    /* CPUID keeps code after the measurement from starting before the read completes */
    asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

With those barriers in place the instruction-reordering noise is gone, but the results were still not as clean as they should have been: the scheduler was free to run other tasks on the benchmark's CPU, and hardware interrupts were landing on it as well. This is where the 'sched' part of the title comes in. The fix is to herd every other task, and every interrupt, onto a single CPU (say CPU 0), leaving the benchmark alone on a CPU of its own.
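The exact commands vary from system to system, but a rough sketch of the idea looks like this (run as root; the CPU numbers and the benchmark binary name, malloc-bench, are placeholders, and some kernel threads and IRQs will simply refuse to move):

# herd all hardware interrupts onto CPU 0 (affinity mask 0x1)
for i in /proc/irq/*/smp_affinity; do echo 1 > $i 2>/dev/null; done

# pin every existing task onto CPU 0 as well
for p in $(ps -e -o pid=); do taskset -p -c 0 $p > /dev/null 2>&1; done

# run the benchmark on a CPU that is now otherwise idle
taskset -c 3 ./malloc-bench

Booting with the isolcpus kernel parameter is another way to keep the scheduler from ever putting anything else on the benchmark CPU in the first place.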
You probably don’t want to run your machine like this for long, as it’s effectively a single-core machine; however, the results are now almost as clean as we can get them:

Note that a larger percentage of the calls show the shortest time (i.e. the transitions have shifted to the right), and the transition to calls taking more cycles is sharper. This indicates that fewer of the fast calls are getting interrupted by scheduling or hardware issues.

This is starting to look pretty good, but there's one more thing we can do to clean it up further. If you happen to have access to a machine with a lot of CPU cores and RAM, you'll discover that there's still room for improvement (the remaining noise could be tasks that can't be moved off the relevant CPUs, or hardware contention). An example of those results can be seen below:

Conclusion
In summary, you can see a huge difference between the first and last graph, but that improvement is based on many steps, and there’s still some room for improvement (an Intel paper uses a kernel module to isolate their tests even further, but that’s beyond the scope of what I need to test). As we run more benchmarks and clean up the results, we learn more about what we’re benchmarking. We learn what “noise” needs to be removed (like we’ve done above) and what “noise” is actually part of the result we’re looking for.

For now, this is good enough for my purposes. Look forward to a blog post about changes in malloc!
