Monday, June 29

Linux-Fu: Parallel Universe

At some point, you simply run out of processing power. Admittedly, that point keeps getting further and further away, but you can still get there. If you run out of CPU time, the answer might be to add more CPUs, although sometimes the bottleneck is something else, like memory or disk. It is also likely that you have access to multiple computers. Who doesn’t have a few Raspberry Pis sitting around their network? Or maybe a server in the basement? Or even some remote servers “in the cloud”? GNU Parallel is a tool that lets you spread work across multiple tasks, either locally or on remote machines. In some ways it is simple, since it looks sort of like xargs but with parallel execution. On the other hand, it has myriad options and configurations that can make it a little daunting to use.

About xargs

In case you don’t use xargs, it is a simple program that, among other things, lets you run a command over a list of files. For example, suppose we want to search all C source files for the string “hackaday” using grep. You could write:

find . -name '*.[ch]' | xargs grep -i hackaday

Here, xargs collects the file names from its input, passes them to grep as arguments, and runs grep again as needed until it exhausts the input. (Note: handling files with spaces in their names is a bit tricky. Using -d '\n' might help, although not all versions of xargs support it.)
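
A more robust approach, which GNU find and xargs have both supported for a long time, is to separate the names with null bytes instead of newlines:

find . -name '*.[ch]' -print0 | xargs -0 grep -i hackaday

The -print0 option terminates each name with a null character, and -0 tells xargs to expect that format, so file names containing spaces (or even newlines) pass through safely.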

In the simplest case, Parallel does the same thing, but it can execute grep (or whatever command you give it) multiple times at once. On a local machine, that lets you put multiple CPU cores to work and cut the total run time. You can also spread the work among different machines, as long as you have passwordless ssh logins to them.
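
For instance, the earlier xargs example translates almost directly. Here is a minimal sketch that relies on Parallel’s defaults:

find . -name '*.[ch]' | parallel grep -i hackaday {}

Parallel reads one file name per input line, substitutes it for {}, and runs as many grep jobs at once as you have CPU cores, which is the default job count.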

Demos

The author of GNU Parallel has a multipart video demonstration of the system; you can see the first part below. The tutorial is also very good and clears up a number of details that might not be obvious from the man page.

Just for my own amusement, I took a directory with some large mp4 files in it and used both xargs and parallel to gzip each file. I know, I know. The files are already compressed, so gzip isn’t going to do much. But I just wanted some large task to time. Here are the results:

[:~/Videos/movies] $ time find *.mp4 | xargs -d '\n' gzip

real    6m10.796s
user    2m52.828s
sys     0m9.718s
[:~/Videos/movies] $ time find *.mp4 | parallel --jobs 8 -d '\n' gzip

real    5m25.050s
user    2m56.676s
sys     0m7.732s

Admittedly, this wasn’t very scientific, and saving about 45 seconds isn’t a tremendous gain, but still. I picked eight jobs because I have an eight-core processor. You might vary that setting depending on what else you’re doing at the time.
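
You do not have to hard-code the count, either. Parallel accepts percentages for the job count, and the ::: syntax supplies the arguments right on the command line instead of through a pipe:

parallel --jobs 100% gzip ::: *.mp4

Here --jobs 100% means one job per CPU core, which is also the default; something like --jobs 50% would leave half your cores free for other work.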

Remote

If you want to use remote computers to process data, you need passwordless ssh access to the other computer (or computers). Of course, chances are a remote computer won’t have the same files and resources, so it makes sense that, by default, your commands run only on the remote servers. You provide a comma-separated list of servers, and if you include the server name “:” (just a colon), your local machine will take a share of the jobs, too.
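
As a sketch, assuming you have a host named pi.local set up for passwordless ssh (the name is just a stand-in for your own machine), something like this spreads the gzip jobs across it and your local box:

# pi.local stands in for your own ssh host
parallel -S pi.local,: --trc {}.gz gzip ::: *.mp4

The -S option lists the servers (the colon is the local machine), and --trc transfers each input file to the worker, returns the named result file, and cleans up the copies afterward.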

This might be very useful if you have a mildly underpowered computer that needs help with a big job. For example, we could imagine a Raspberry Pi-based 3D printer asking a beefier remote host to slice a batch of models in parallel. Even if you think you don’t have any computational heavy lifting, Parallel can do things like process files from a tar archive as they are unpacked, without waiting for the rest of the archive. It can also distribute grep’s work across your CPUs or cores.
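
That last trick can look like this. As a minimal sketch, with big.log standing in for any large file you want to search:

# big.log stands in for any large file to search
cat big.log | parallel --pipe grep -i hackaday

The --pipe option chops standard input into blocks (about one megabyte by default) and feeds one block to each job, so every core gets a share of the search.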

Honestly, it would take a lot to explain each feature in detail, but I hope this has encouraged you to read more about GNU Parallel. Between the videos and the tutorial, you should get a good idea of some of the things you could do with this powerful tool.
