High Performance C++ Profiling

My interest in code profiling started when I was making hudbot. What with code injection and patching, function hooking, data hijacking, and OpenGL, I knew I had relatively little experience in what I was attempting and that I could easily be producing some amazing slowdowns if I wasn’t careful.

Unfortunately, C++ profilers seem to come in three varieties, all of which have a fatal downside:

Sampling Profilers which are fast, multi-threaded, but inaccurate and have decent output (sometimes too detailed). Some examples are VTune, CodeAnalyst, google-perftools and Sleepy.
Instrumenting Profilers which are accurate, multi-threaded, but slow, and have decent output. Some examples are GlowCode and the now defunct DevPartner Profiler Community Edition.
Instrumenting Profilers which are fast, accurate, but single threaded and have limited output. These range from extremely simple profilers like Peter Kankowski’s Poor Man’s Profiler to the more complicated and full-featured Shiny C++ Profiler.
The obvious outcome is that if you want fast and accurate, like I did, you’ll have to use an existing profiler or write one yourself, and instrument your code manually. With a little work, fancy stuff like call trees can be added. Once you get it tested and working, you can start going crazy profiSegmentation fault.

Oh yeah, about that. There are no multi-threaded instrumented profilers that are open source, and depending on how your single threaded profiler works, the results when trying to use it in a multi-threaded environment can range from bad data to outright crashing. It’s possible to patch the profiler to only allow the main thread in, but this adds unnecessary slowdowns and doesn’t address how to profile other threads. This is where my profiler comes in!

Pieces of a high performance multi-threaded C++ profiler


Figure: Latency in cycles and resolution of various timing methods (resolution is hand wavy, not to scale)

The main piece of a high performance profiler is what mechanism is used to get the timestamps. High precision is the obvious main requirement, but it must also have as low a latency as possible. If you’re making millions of calls a second to your profiler, the timestamp mechanism could become the limiting factor in your app’s performance and make it so unresponsive that testing it is infeasible.
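
To get a feel for how much a timestamp mechanism costs per call, something like the following rough sketch can be used. It is not taken from the profiler itself: `average_call_ns` and `steady_clock_ticks` are hypothetical names made up for this illustration, and `std::chrono::steady_clock` merely stands in for whichever timer is being evaluated.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: average cost, in nanoseconds, of one call to `timer`.
// `timer` is any callable that returns an integer timestamp.
template <typename Timer>
double average_call_ns(Timer timer, std::uint32_t calls) {
    using clock = std::chrono::steady_clock;
    if (calls == 0) {
        return 0.0;  // nothing to measure
    }
    std::uint64_t sink = 0;
    const auto begin = clock::now();
    for (std::uint32_t i = 0; i < calls; ++i) {
        sink += timer();  // accumulate so the call cannot be optimized away
    }
    const auto end = clock::now();
    volatile std::uint64_t keep = sink;  // force the accumulated value to be used
    (void)keep;
    return std::chrono::duration<double, std::nano>(end - begin).count() / calls;
}

// Example timer under test (an assumption for this sketch, not the
// profiler's timer): the standard monotonic clock.
inline std::uint64_t steady_clock_ticks() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
```

Calling `average_call_ns(steady_clock_ticks, 1000000)` gives an approximate per-call cost. A profiler that takes two timestamps per instrumented zone pays that cost twice for every zone entered, which is how a slow timer ends up dominating at millions of calls a second.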

On x86, this means you must go with rdtsc. It is low latency, high precision, and is portable to gcc. This choice is unfortunately not without its trade-offs. rdtsc does not serialize, so unless you insert a serializing instruction like cpuid before it (and bloat the latency in the process) or use the newer rdtscp instruction, the cycle count you receive may not be 100% accurate. rdtsc is also not guaranteed to be sync’d across all CPUs in a multi-core / multi-CPU system, so even single threaded timing has the possibility of being incorrect if the thread is scheduled across multiple CPUs. But, and this is a big but, for what I want there is nothing else to use. If someone else has different needs they can replace the timer function, but for the volume of calls I’m interested in, latency needs to be the bare minimum.
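
As a minimal sketch of what reading the counter can look like (x86 only, and not the profiler's actual code), the compiler intrinsics `__rdtsc` and `__rdtscp` avoid inline assembly on both MSVC and gcc. The names `read_tsc`, `read_tscp` and `ScopedCycles` are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // __rdtsc, __rdtscp on MSVC
#else
#include <x86intrin.h>  // __rdtsc, __rdtscp on gcc and clang
#endif

// Plain rdtsc: lowest latency, but not serializing, so the CPU may
// reorder it relative to the surrounding instructions.
inline std::uint64_t read_tsc() {
    return __rdtsc();
}

// rdtscp: waits for earlier instructions to execute before reading the
// counter. `aux` receives the IA32_TSC_AUX value, which the OS typically
// sets to a core identifier. Requires a CPU that supports rdtscp.
inline std::uint64_t read_tscp(unsigned int& aux) {
    return __rdtscp(&aux);
}

// Hypothetical RAII helper: adds the cycles spent in a scope to `total`.
// Single threaded only; it makes no attempt to handle the thread being
// moved between CPUs with unsynchronized counters.
class ScopedCycles {
public:
    explicit ScopedCycles(std::uint64_t& total)
        : total_(total), start_(read_tsc()) {}
    ~ScopedCycles() { total_ += read_tsc() - start_; }

    ScopedCycles(const ScopedCycles&) = delete;
    ScopedCycles& operator=(const ScopedCycles&) = delete;

private:
    std::uint64_t& total_;
    std::uint64_t start_;
};
```

Declaring `ScopedCycles zone(total);` at the top of a block is all the manual instrumentation this sketch needs. Swapping `read_tsc` for `read_tscp` inside the helper trades some latency for a tighter measurement, which is the trade-off described above.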

