We all know what bugs in code are. We don’t like them when they are in programs we use, and they’re even worse when they are in code which we have written. Clearly, the best code is bug-free, but how do we get there?
This isn’t a new question, of course, but it has become ever more important as the total number of lines of code (LoC) that run modern-day society keeps increasing. It also affects hobbyists more and more often, now that everything has a microcontroller inside.
Although many of us know the smug satisfaction of watching a full row of green result markers light up across the board after running the unit tests for a project, the painful reality is that you don’t know whether the code really is functionally correct until it runs in an environment that is akin to the production environment. Yet how can one test an application in this situation?
This is where tools like those contained in the Valgrind suite come into play, allowing us to profile, analyze and otherwise nitpick every single opcode and memory read or write. Let’s take a look, shall we?
It’s Broken, Make It Work Again
When it comes to software development (and hardware development to some extent as well), there are three possible states of being broken:
- It is obviously broken.
- It works, but sometimes it breaks.
- It works fine, but is actually broken.
The first type should not be a surprise to anyone. It’s the kind of failure that happily announces itself with such cheerful terms as ‘SIGSEGV’ (segmentation fault) and ‘SIGBUS’ (bus error), which indicate that the operating system’s kernel has detected that the application is trying to do something that is illegal, or impossible. Dividing an integer by zero is a good example of the latter, although on x86 Linux it is reported with its own signal, ‘SIGFPE’.
The second type of brokenness, where it does run but sometimes throws errors, is more intriguing, in that it allows the application to be run through its paces, transferring data, opening and writing files, and displaying data on screen without any issues. Until suddenly, when doing the same thing a second time, it fails. Or after an hour of working fine it fails. Or it starts doing something ‘weird’, after which the application’s behavior begins to feel almost random.
The third type of brokenness — where it runs but it shouldn’t — is also known as ‘how the heck did this ever work in the first place’, with its discovery usually accompanied by loud exclamations, the questioning of the very fabric of reality, and possibly a few quick prayers to one’s deity of choice depending on theological affinity. This kind of code has managed to reach just the perfect balance within a perfect storm of mistakes that allows it to do the right thing by sheer chance. Until one dares to alter a line of code, of course.
It’s Not Magic, It’s Just Complicated
In its most elementary form, software is merely a series of instructions for the underlying hardware. This hardware attempts to carry out these instructions to the best of its abilities, which involve not only the processing core(s) of the CPU, but also its caches, cache synchronization logic (for multi-core CPUs), memory controller(s) and system memory. On top of this there is usually an operating system (OS) which serves to make life easy for application developers, as they don’t have to worry about implementing a task scheduler, heap and stack management, as well as a lot of other fun details that no application developer wants to mess with.
Each of these elements of the OS and underlying hardware can affect the execution of the code, and each issue will affect different parts of this whole system. This is why we need to have a range of tools. In the case of a suite like Valgrind, the main tools that we find ourselves using are called Memcheck, DRD and Helgrind.
Using Valgrind to Monitor Memory
The default tool that Valgrind uses when started is Memcheck. As the name suggests, it checks memory. More specifically, it inserts a layer between the OS and the application that is being tested. Much like a debugger, it then tracks each memory write and read, keeping track of references, valid memory ranges, whether blocks of memory are still reachable or not, and so on.
Common use cases of Memcheck are to detect memory leaks, e.g.:
int main() {
    int* foo = new int;
    int* bar = new int;
    *foo = 42;
    *bar = 24;
    bar = foo; // the block that bar pointed to can no longer be reached or freed
}
This would produce something like the following in the Memcheck log:
4 bytes in 1 blocks are definitely lost in loss record 1 of 14
This is followed by a backtrace showing where the lost block (previously pointed to by bar) was allocated. These per-block loss records are only printed when --leak-check=full is passed to Memcheck; without it, only a leak summary is shown. Here Memcheck may report ‘definitely’, ‘indirectly’ or ‘possibly’ lost data. Unless you have an obvious problem elsewhere, the ‘definitely’ lost blocks of data are the ones to focus on. Indirectly lost data can only be reached through a block that is itself lost, usually as the result of losing the address of a block of pointers, so fixing the ‘definitely lost’ issue for that block should also resolve any ‘indirectly lost’ issues.
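The difference between ‘definitely’ and ‘indirectly’ lost blocks is easier to see in a small sketch. The Node and OwnedNode types and both function names below are hypothetical, made up for this illustration, and the comments describe what Memcheck would be expected to report rather than captured output.

```cpp
#include <memory>

// Hypothetical two-node list, used only for illustration.
struct Node {
    int value;
    Node* next;
};

// Leaky version: once 'head' is overwritten, nothing points at the first
// node any more. Memcheck would be expected to classify that node as
// "definitely lost" and the node it points to as "indirectly lost".
int sum_leaky() {
    Node* tail = new Node{24, nullptr};
    Node* head = new Node{42, tail};
    int sum = head->value + head->next->value;
    head = nullptr;  // the only pointer to the first node is gone
    return sum;      // 'tail' goes out of scope here, nothing was freed
}

// Same list, but every node is owned by a std::unique_ptr.
struct OwnedNode {
    int value = 0;
    std::unique_ptr<OwnedNode> next;
};

// Fixed version: destroying 'head' destroys the second node as well.
int sum_owned() {
    auto head = std::make_unique<OwnedNode>();
    head->value = 42;
    head->next = std::make_unique<OwnedNode>();
    head->next->value = 24;
    return head->value + head->next->value;  // both nodes are freed on return
}
```

Fixing the one ‘definitely lost’ node in the leaky version (by deleting the list, or by switching to owning pointers as in the second version) takes care of the ‘indirectly lost’ node as well.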
Usually one runs Memcheck with this CLI command to get the most useful information:
$ valgrind --tool=memcheck --log-file=memcheck00.txt --leak-check=full --read-var-info=yes path/to/binary
This way the output will be written to a log file (memcheck00.txt), we will get the full leak report, and Memcheck will use any debug information in the binary, if present, to make the trace even more readable. It’s highly advisable to use binaries that have all debug symbols in place to make one’s life easier.
Finding Other Memory Problems
Memcheck is also very useful for detecting invalid reads and writes, as well as the freeing of memory that was not allocated by the application. Any of these suggests that the application is doing something naughty with memory, which could lead to crashes, corrupted data and other fun. This also includes mismatched allocation and deallocation calls, such as releasing memory obtained from new with free(), or memory obtained from malloc() with delete, which can be an issue when mixing C and C++ code in the same application.
Finally, Memcheck will also sanity check your arguments to malloc and similar memory allocation functions, as well as memcpy and similar C functions, catching a lot of issues that would otherwise show up during testing if one is lucky. The Memcheck manual has an assortment of examples, as do various Memcheck tutorials out there, including ones that walk through debugging a memory leak.
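To make these categories a bit more concrete, here is a minimal sketch of the kind of mistakes involved, next to corrected versions. All function names are hypothetical. The two functions marked BUG have undefined behavior and are only there to show the pattern, and the comments describe what Memcheck would be expected to flag rather than actual log output.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// BUG (do not call outside of a Valgrind experiment): the loop writes
// data[4], one element past the end of the block. Memcheck would be
// expected to flag this as an invalid write.
void off_by_one_write() {
    int* data = new int[4];
    for (int i = 0; i <= 4; ++i) {
        data[i] = i;
    }
    delete[] data;
}

// BUG (same warning applies): memory obtained from new[] is released with
// free(). Memcheck would be expected to flag the mismatched deallocation.
void mismatched_release() {
    int* data = new int[4];
    std::free(data);
}

// Fixed: stay inside the block and pair new[] with delete[].
int fill_and_sum() {
    int* data = new int[4];
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        data[i] = i;
        sum += data[i];
    }
    delete[] data;
    return sum;
}

// Source and destination overlap here. Calling memcpy() on overlapping
// ranges is one of the argument checks mentioned above; memmove() is the
// call that is defined for this case.
void shift_left(char* buf, std::size_t len) {
    if (len > 1) {
        std::memmove(buf, buf + 1, len - 1);
    }
}
```

Recent compilers may already warn about the two buggy functions at build time, which is a nice bonus, but Memcheck also catches the cases where the mistake only becomes visible at run time.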
Keep Your Threads Where We Can See Them
The other two tools in Valgrind that are exceedingly useful are Helgrind and DRD, which focus primarily on multithreading and all the issues that this may cause. Depending on the settings used, they can track thread activity in a fairly coarse fashion, or log every single mutex movement and so on. Of course, the more one tracks, the more one’s application slows to a crawl.
Although it may seem redundant for Valgrind to have two tools which at first glance appear to do the same thing, Helgrind and DRD are not identical. Each uses a different approach for analyzing application behavior and thus each may give (slightly) different results. It’s often a good idea to run both for this reason.
Issues that we can track down using Helgrind and DRD include deadlocks, where two or more threads each try to obtain the lock (mutex, rwlock, or similar) to a resource while already holding a lock that one of the other threads needs. As each thread will only release its own lock after it has obtained the other lock, nothing will happen and the application is effectively frozen.
With DRD we can also trace the behavior of locks, including how long a specific lock was held:
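The classic way to end up in this situation is two threads taking the same two locks in opposite order. The sketch below uses a hypothetical Account type and made-up function names; it shows the deadlock-prone pattern only as a comment and uses std::lock from the C++ standard library as one possible way to avoid it.

```cpp
#include <mutex>
#include <thread>

// Hypothetical account type: each balance is guarded by its own mutex.
struct Account {
    std::mutex lock;
    int balance = 0;
};

// Deadlock-prone pattern (comment only): lock 'from' first, then 'to'.
// If another thread transfers in the opposite direction, each thread can
// end up holding one lock while waiting forever for the other.
//
//     std::lock_guard<std::mutex> a(from.lock);
//     std::lock_guard<std::mutex> b(to.lock);

// Safer sketch: std::lock() acquires both mutexes with a deadlock-avoidance
// algorithm, and the guards only adopt the already-held locks so that they
// are released on scope exit. 'from' and 'to' must be different accounts.
void transfer(Account& from, Account& to, int amount) {
    std::lock(from.lock, to.lock);
    std::lock_guard<std::mutex> hold_from(from.lock, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.lock, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Runs transfers in both directions at once and returns the combined balance.
int run_transfers(int rounds) {
    Account a;
    Account b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&a, &b, rounds] {
        for (int i = 0; i < rounds; ++i) transfer(a, b, 1);
    });
    std::thread t2([&a, &b, rounds] {
        for (int i = 0; i < rounds; ++i) transfer(b, a, 1);
    });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Since every transfer moves money from one account to the other under both locks, the combined balance should come out unchanged no matter how the two threads interleave.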
$ valgrind --tool=drd --exclusive-threshold=10 drd/tests/hold_lock -i 500
...
==10668== Acquired at:
==10668==    at 0x4C267C8: pthread_mutex_lock (drd_pthread_intercepts.c:395)
==10668==    by 0x400D92: main (hold_lock.c:51)
==10668== Lock on mutex 0x7fefffd50 was held during 503 ms (threshold: 10 ms).
==10668==    at 0x4C26ADA: pthread_mutex_unlock (drd_pthread_intercepts.c:441)
==10668==    by 0x400DB5: main (hold_lock.c:55)
Here we set a threshold value of 10 ms, with the test application being instructed to hold the lock for 500 ms. As we can see, the lock (mutex) was held for 503 ms, according to DRD.
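For those who want to try this on their own code, the following is a minimal sketch of a function that holds a lock for a configurable time. It is a hypothetical stand-in, not the actual hold_lock test from the Valgrind source tree, and the name hold_lock_for is made up for this example.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a hold_lock-style test: keeps a mutex locked
// for roughly 'hold_ms' milliseconds and returns the measured time in ms.
long hold_lock_for(int hold_ms) {
    std::mutex m;
    const auto start = std::chrono::steady_clock::now();
    {
        std::lock_guard<std::mutex> guard(m);
        // Stand-in for slow work that is done while the lock is held.
        std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
    }  // the lock is released here
    const auto held = std::chrono::steady_clock::now() - start;
    return static_cast<long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(held).count());
}
```

A program that calls hold_lock_for(500) and is run with the same --exclusive-threshold=10 option would be expected to produce a report along the lines of the one above, assuming that std::mutex is implemented on top of a pthread mutex, as is commonly the case on Linux.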
Sometimes Order Matters
A useful feature of Helgrind is that it tracks the order in which locks are normally acquired, and reports when that order changes:
Thread #1: lock order "0x7FF0006D0 before 0x7FF0006A0" violated
Observed (incorrect) order is: acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400825: main (tc13_laog1.c:23)
 followed by a later acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400853: main (tc13_laog1.c:24)
Required order was established by acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40076D: main (tc13_laog1.c:17)
 followed by a later acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40079B: main (tc13_laog1.c:18)
Acquiring locks in different orders throughout the execution of the application might be totally valid, or it might be indicative of a logic error, so each of these reports has to be judged on its own merits.
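A report like the one above can be provoked with very little code. The sketch below uses hypothetical names (lock_a, lock_b and the three functions) and takes two mutexes in one order and then in the opposite order from a single thread. A single thread cannot deadlock with itself this way, so the code runs to completion, but the inconsistent order is the kind of thing Helgrind would be expected to complain about.

```cpp
#include <mutex>

// Two hypothetical locks, each protecting one counter.
std::mutex lock_a;
std::mutex lock_b;
int shared_a = 0;
int shared_b = 0;

// Establishes the order "lock_a before lock_b".
void update_a_then_b() {
    std::lock_guard<std::mutex> first(lock_a);
    std::lock_guard<std::mutex> second(lock_b);
    ++shared_a;
    ++shared_b;
}

// Takes the same locks in the opposite order. If these two functions ever
// ran on different threads at the same time, they could deadlock.
void update_b_then_a() {
    std::lock_guard<std::mutex> first(lock_b);
    std::lock_guard<std::mutex> second(lock_a);
    ++shared_a;
    ++shared_b;
}

// Calls both from one thread, so nothing can block, and returns the total.
int run_both_orders() {
    update_a_then_b();
    update_b_then_a();
    return shared_a + shared_b;
}
```

The usual fix is to pick one order and stick to it everywhere, in this case by having update_b_then_a take lock_a first as well.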
Thinking About Multithread Flow
Both tools will track data races, which occur when two or more threads access the same resource simultaneously, with at least one of them writing to it, without a locking mechanism or the use of atomics to prevent data corruption and worse. This can be as subtle as a single unsigned 64-bit integer that is being read by one thread while another writes to it. If the read operation isn’t atomic (i.e., the whole 64-bit value is not read in a single indivisible operation), the value can be changed by the writing thread half-way through the reading operation.
Data races are generally bad news, and must be fixed. A data race is reported even for accesses that happen to be atomic at the hardware level (e.g. reading a boolean or 8-bit integer on most architectures), but declaring the variable as an atomic type (e.g. std::atomic from the C++ standard library’s <atomic> header) is an easy way to make DRD and Helgrind happy, while also being the technically correct approach to writing multithreaded code.
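As a minimal sketch of that approach, here is a counter shared by two threads. The function name count_with_atomic is made up for this example, and the racy variant is only described in a comment.

```cpp
#include <atomic>
#include <thread>

// Two threads each increment a shared counter 'per_thread' times.
// With a plain int and ++counter this would be a data race: increments
// could get lost, and DRD and Helgrind would be expected to report it.
// With std::atomic<int> every increment is an indivisible operation.
int count_with_atomic(int per_thread) {
    std::atomic<int> counter{0};
    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            counter.fetch_add(1);  // atomic read-modify-write
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```

For anything more involved than a single counter or flag, a mutex around the shared data is usually the easier option to reason about.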
But Wait, There’s More
In this article we only addressed the Valgrind tools that are most useful for debugging, as bugs tend to be memory and multithreading-related issues. This leaves another highly enjoyable and educational pursuit for any software developer: optimizing code.
After your application has stopped crashing, no longer corrupts data and is behaving itself, what better use of one’s time than to dive deep into its performance statistics to eke out more performance? This is where tools such as Cachegrind, Callgrind and Massif are useful to figure out where the bottlenecks in the application lie, and where one should focus any optimization efforts.
We will have to save that joyful topic for another day, however.