At some point, you simply run out of processing power. Admittedly, that point keeps getting further and further away, but you can still get there. If you run out of CPU time, the answer might be to add more CPUs. However, sometimes the bottleneck is something else, like memory or disk space. It is also likely that you have access to multiple computers. Who doesn’t have a few Raspberry Pis sitting around their network? Or maybe a server in the basement? Or even some remote servers “in the cloud”? GNU Parallel is a tool that lets you spread work across multiple tasks, either locally or on remote machines. In some ways, it is simple, since it looks sort of like xargs but with parallel execution. On the other hand, it has myriad options and configurations that can make it a little daunting to use.
About xargs
In case you don’t use xargs, it is a very simple program that, among other things, lets you do something with a list of files. For example, suppose we want to search all C source files for the string “hackaday” using grep. You could write:
find . -name '*.[ch]' | xargs grep -i hackaday
Here, xargs grabs input lines, calls grep with them as arguments, and once grep completes, repeats the process until it runs out of input lines. (Note: handling files with spaces in their names is a bit tricky. Using -d '\n' might help, although not all versions of xargs support it.)
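If your version of xargs doesn’t have that option, a common workaround (assuming GNU find and xargs, which most Linux systems have) is to pass the names with null terminators instead:

# find emits NUL-terminated names; xargs -0 expects them, so spaces are safe
find . -name '*.[ch]' -print0 | xargs -0 grep -i hackaday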
In the simplest case, Parallel does the same thing, but it can execute grep (or whatever you are using) multiple times at once. On a local machine, this allows you to use multiple CPUs to improve timing. However, you can also spread the work among different machines that have passwordless ssh logins.
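To make that concrete, here is a minimal sketch of the same search run through Parallel. By default, Parallel runs one job per CPU core, and {} stands in for each input line:

# Same search as before, but with one grep job per core
find . -name '*.[ch]' | parallel grep -i hackaday {}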
Demos
The author of GNU Parallel has a multipart video demonstration of the system. You can see the first part below. The tutorial is also very good, and it clears up a number of details that might not be obvious from the man page.
Just for my own amusement, I took a directory with some large mp4 files in it and used both xargs and parallel to gzip each file. I know, I know. The files are already compressed, so gzip isn’t going to do much. But I just wanted some large task to time. Here are the results:
[:~/Videos/movies] $ time find *.mp4 | xargs -d '\n' gzip

real	6m10.796s
user	2m52.828s
sys	0m9.718s

[:~/Videos/movies] $ time find *.mp4 | parallel --jobs 8 -d '\n' gzip

real	5m25.050s
user	2m56.676s
sys	0m7.732s
Admittedly, this wasn’t very scientific, and saving about 45 seconds isn’t a tremendous gain, but still. I picked eight jobs because I have an eight-core processor. You might vary that setting depending on what else you’re doing at the time.
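If you’d rather not hard-code the core count, Parallel also accepts relative job counts. A couple of sketches:

# Run two jobs per CPU core
find *.mp4 | parallel --jobs 200% gzip

# Run as many jobs as there are cores, plus two
find *.mp4 | parallel --jobs +2 gzip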
Remote
If you want to use remote computers to process data, you need passwordless ssh access to the other computer (or computers). Of course, chances are the remote computer won’t have the same files and resources, so it makes sense that, by default, your commands run only on the remote servers. You can provide a comma-separated list of servers, and if you include the server name “:” (just a colon), your local machine will take part in handling jobs, too.
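As a rough sketch (the server names here are made up), the gzip experiment above could be spread across two remote machines plus the local one. The --trc option transfers each input file to the remote host, returns the named result, and cleans up the copies:

# ':' adds the local machine; --trc {}.gz brings back each compressed file
find *.mp4 | parallel -S server1,server2,: --trc {}.gz gzip {}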
This might be very useful if you have a mildly underpowered computer that needs help doing something. For example, we could imagine a Raspberry Pi-based 3D printer asking a remote host to slice a bunch of models in parallel. Even if you think you don’t have any computational heavy lifting, Parallel can do things like process files from a tar archive as they are unpacked, without waiting for the rest of the files. It can also distribute grep’s work across your CPUs or cores, as in the sketch below.
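For the grep case, the documented --pipe option chops standard input into blocks and hands each block to its own grep. A hedged sketch (the log file name is made up):

# Split stdin into roughly 10 MB blocks, running one grep per block
cat huge.log | parallel --pipe --block 10M grep -i hackaday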
Honestly, it would take a lot to explain each feature in detail, but I hope this has encouraged you to read more about GNU Parallel. Between the videos and the tutorial, you should get a good idea of some of the things you could do with this powerful tool.