Wednesday, May 20

Using Valgrind to Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

What is the right time to optimize code? This is a very good question, which usually comes down to two answers. The first answer is to have a good design for the code to begin with, because ‘optimization’ does not mean ‘fixing bad design decisions’. The second answer is that it should happen after the application has been sufficiently debugged and its developers are at risk of getting bored.

There should also be a goal for the optimization, based on what makes sense for the application. Does it need to process data faster? Should it send less data over the network or to disk? Shouldn’t one really have a look at that memory usage? And just what is going on inside those CPU caches that makes performance sometimes drop off a cliff on a single core?

All of this and more can be analyzed using tools from the Valgrind suite, including Cachegrind, Callgrind, DHAT and Massif.

Keeping Those Cores Cool

Modern day processors are designed with low power usage in mind, regardless of whether they are aimed at servers, desktop systems or embedded applications. This essentially means that they are in a low power state when not doing any work (idle loop), with some CPUs and microcontrollers turning off power to parts of the chip which are not being used. Consequently, the more the processor has to do, the more power it will use and the hotter it will get.

Code that needs fewer instructions to perform the same task due to a more efficient algorithm or fewer abstractions not only runs cooler, but also faster. This means that for the user, the experience is that not only does the task complete faster, but the device also gets less hot, with less fan noise. If it is battery powered, the battery will also last longer with a single charge. Basically everyone will be happier.

The weapons of choice here are Cachegrind and Callgrind. Although heap profiling (covered later in this article) can also be useful for power saving, the main focus should be on the processor. This means that we need to know what our code is doing, especially in terms of what parts of our code run most often, as those would be prime targets for optimization.

Track and Trace Those Calls

Running Cachegrind and Callgrind is quite uncomplicated. One simply passes the executable name and any flags it needs to Valgrind along with the tool you wish to use:

$ valgrind --tool=callgrind my_program 

This command would start the Callgrind tool for our program called my_program. Optionally, we can also have Callgrind simulate the CPU caches with --simulate-cache=yes. While running, Callgrind generates an output file called callgrind.out.<pid>, where <pid> is the process ID of the application while it was running. This file is then converted to a human-readable format:

$ callgrind_annotate callgrind.out.<pid> > callgrind00.txt

This produces a file that contains (among other things) a function call summary, ranked by how long execution spent in that particular function, making it obvious that a lot of speed could be gained if that function were to be optimized.

As explained in this article over at Stanford, the use of cache simulation adds details on cache hit/misses:

  • Ir: I cache reads (instructions executed)
  • I1mr: I1 cache read misses (instruction wasn’t in I1 cache but was in L2)
  • I2mr: L2 cache instruction read misses (instruction wasn’t in I1 or L2 cache, had to be fetched from memory)
  • Dr: D cache reads (memory reads)
  • D1mr: D1 cache read misses (data location not in D1 cache, but in L2)
  • D2mr: L2 cache data read misses (location not in D1 or L2)
  • Dw: D cache writes (memory writes)
  • D1mw: D1 cache write misses (location not in D1 cache, but in L2)
  • D2mw: L2 cache data write misses (location not in D1 or L2)

Seeing a large number of cache misses in an algorithm or loop would be a strong hint to optimize it to take up less data in the cache, employ prefetching to prevent cache misses, or take other measures applicable to the code in question.

Using Cachegrind is fairly similar to Callgrind, except that Cachegrind focuses on CPU caches first and foremost, with function calls a secondary concern. This should make it obvious which tool to pick of these two, depending on one’s most pressing questions.

Doing More with Less Memory

Even though computers and even microcontrollers often come with more memory in the form of caches and main system memory (RAM) than a developer in the 1990s could even begin to dream of, there are two negatives that relate to RAM:

  • RAM isn’t infinite; at some point heap space will run out. Best case, it’s just your application getting terminated by the OS, not the entire OS (or RTOS) bailing out and causing a cascade of faults throughout the system.
  • Active RAM costs power. Each part of a dynamic RAM (DRAM) module has to be constantly refreshed in order for the capacitive charges that store the value to be retained. Is becomes an especially important issue for battery-powered devices.

Reducing the amount of memory used does not only affect the system RAM, but also helps with the caches between the CPU’s processing units and RAM. Less data means fewer cache misses and fewer delays as the memory subsystem scrambles to move the requested data from RAM into the L3, L2 and (usually) the L1 cache.

Although a beefy Xeon or Epyc server processors tend to have a healthy 128 MB of L3 cache (or more), a commonly used ARM processor such as the one in the Raspberry Pi 3 (the BCM2837 SoC) has 16 kB L1 cache for data and instructions each, as well 512 kB L2 cache. There is no L3 cache here. Unless your application uses less than 512 kB memory in total (stack and heap), system RAM will be hit regularly, which will heavily affect the application’s performance.

One distinction to make here is that every application tends to have data stored in RAM — whether on the heap or the stack — that may get accessed regularly or only occasionally. Using Valgrind’s Massif and DHAT tools it’s fairly easy to figure out the usage patterns of this data on the heap, as well as which data doesn’t need to be stored at all any more.

Running the Numbers

Massif is the easiest to use of these two tools, all it takes is a single call on the command line:

$ valgrind --tool=massif my_program

This will run the application and output to the file massif.out.<pid>, where <pid> is the process ID of the application as it was being executed. Before we can use this data which Massif gathered, we first have to process it:

$ ms_print massif.out.<pid> > massif00.txt

This will direct the output from the ms_print utility to a file with the details in a human-readable form. It contains a graph of heap usage over time, like this example from the Massif documentation:

    MB
3.952^                                                                    # 
     |                                                                   @#:
     |                                                                 :@@#:
     |                                                            @@::::@@#: 
     |                                                            @ :: :@@#::
     |                                                          @@@ :: :@@#::
     |                                                       @@:@@@ :: :@@#::
     |                                                    :::@ :@@@ :: :@@#::
     |                                                    : :@ :@@@ :: :@@#::
     |                                                  :@: :@ :@@@ :: :@@#:: 
     |                                                @@:@: :@ :@@@ :: :@@#:::
     |                           :       ::         ::@@:@: :@ :@@@ :: :@@#:::
     |                        :@@:    ::::: ::::@@@:::@@:@: :@ :@@@ :: :@@#:::
     |                     ::::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:::::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                ::@@@: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |             :::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |           @@:::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   626.4

Number of snapshots: 63
 Detailed snapshots: [3, 4, 10, 11, 15, 16, 29, 33, 34, 36, 39, 41,
                      42, 43, 44, 49, 50, 51, 53, 55, 56, 57 (peak)]

The graph shows KDE’s Konquerer web browser as it’s started and run for a while. The vertical axis shows the heap usage (in megabytes), and the horizontal axis the number of instructions that have been executed since the application was launched. This way one can get an idea what heap usage looks like, with each of the slices in the graph detailed further in the file, e.g.:

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 15         21,112           19,096           19,000            96            0
 16         22,120           18,088           18,000            88            0
 17         23,128           17,080           17,000            80            0
 18         24,136           16,072           16,000            72            0
 19         25,144           15,064           15,000            64            0
 20         26,152           14,056           14,000            56            0
 21         27,160           13,048           13,000            48            0
 22         28,168           12,040           12,000            40            0
 23         29,176           11,032           11,000            32            0
 24         30,184           10,024           10,000            24            0
99.76% (10,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->79.81% (8,000B) 0x80483C2: g (example.c:5)
| ->39.90% (4,000B) 0x80483E2: f (example.c:11)
| | ->39.90% (4,000B) 0x8048431: main (example.c:23)
| |   
| ->39.90% (4,000B) 0x8048436: main (example.c:25)
|   
->19.95% (2,000B) 0x80483DA: f (example.c:10)
| ->19.95% (2,000B) 0x8048431: main (example.c:23)
|   
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

The first column is the slice number, with detailed slices being expanded, showing the percentage of heap space being taken up by specific data in the heap, as well as where in the code it was allocated. Obviously it is required to compile the application with debug symbols included (-g option for GCC) to get the most out of this functionality.

The use of DHAT is similar to Massif, though it outputs to a JSON format, requiring the browser-based viewer (dh_view.html) to actually analyze the data. DHAT can provide more detailed information about the allocated data in the heap than Massif, including things like allocations which are never fully used. Whether this is necessary depends on what kind of optimizations one desires.

Keep That Toolbox Filled

After previously looking at the other commonly used tools in the Valgrind suite, you should have a pretty good idea of when to use Valgrind to assist in debugging and optimization efforts. Although they are all incredibly useful tools, they are not the end all and be all of debugging analysis tools. As every experienced developer knows, what counts is knowing when to use each approach.

Sometimes all one needs is a simple debugger or solid logging inside the application to shake out the most serious issues. Only when that doesn’t help is it time to start breaking out the heavier tools, with the cautionary warning that with powerful tools comes great responsibility in interpreting the data. The experience and knowledge to make the right decisions and draw the right conclusions is as essential a tool as any other.

No comments:

Post a Comment