awk is a kind of Swiss Army knife for text files. However, some of its limitations are often a bit annoying. I’ve used a simple set of functions to make awk a bit better, although I will warn you: it does require GNU extensions to awk. That is, you must use gawk and not other versions. Your system probably maps /usr/bin/awk to something, and that something might be gawk. But it could also be mawk or some other flavor. If you use a Debian-based distro, update-alternatives is your friend here. But for the purposes of this post, I’m going to assume you are using gawk.
By the end of the post, you’ll see how to use my awk add-on functions to split up a line into fields even when there is no single character that separates all the fields. In addition, you’ll be able to refer to the fields using names you decide. You won’t have to remember that $2 is the time field. You’ll say Fields_fields["time"] instead.
The Problem
awk does a lot of common work for you when you use it to process text files. It reads files a record at a time. Normally, a record is a single line. Then it splits the line into fields using whitespace, or some other choice of field separator. You can write code that manipulates the line or individual fields. This default behavior is great, especially since you can change the end-of-record character and the field separator. A surprising number of files fit this sort of format.

Until, of course, they don’t. If you have data coming from a data logging instrument or some database, it could be formatted in a variety of ways. Some fields might have structured data with a variety of separators. This isn’t a deal-breaker. Since you can get at the whole line, you can do almost anything you want, but the logic is harder, and the whole point of using awk is to make things easier.
For example, suppose you had a file from a data recorder that had an eight-digit serial number, followed by a six-character tag, and then two floating point numbers separated by a comma. The pattern might look like:

^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$
This would be hard to handle with the conventional field splitting and you’d normally just write code to split everything apart.
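For illustration, here is roughly what that hand-rolled splitting might look like (a minimal sketch; the variable names are mine, not from the library):

{
    serial = substr($0, 1, 8)              # fixed width: eight-digit serial number
    tag    = substr($0, 9, 6)              # fixed width: six-character tag
    n = split(substr($0, 15), nums, ",")   # the two comma-separated floats
    if (n == 2)
        printf "%s %s %f %f\n", serial, tag, nums[1], nums[2]
}

It works, but every new field means more bookkeeping by hand.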
If you have regular fields, but don’t know how many, you probably want to set FS or FPAT instead. We talked about FPAT a little before when we were abusing awk to read hex files. This library is a little different. You can use it to pick apart a line totally. For example, you might have part of the line with a fixed field length and then multiple types of separators. That can be hard to handle with the other methods.
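As a quick refresher, FPAT describes what a field looks like instead of what separates fields. The classic illustration from the gawk manual handles simple CSV with quoted fields:

BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }   # a field is unquoted text or a quoted string
{ print "third field:", $3 }

Given a line like Robbins,Arnold,"1234 A Pretty Street, NE", the third field is the entire quoted address, comma and all. That only goes so far, though, which is where this library comes in.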
Regular Expressions
To make things easier, I’ll wrap up the gawk match function. That function exists in regular awk, of course, but gawk adds an extension that makes things much easier. Normally, the function performs a regular expression match on a string and tells you where the match starts, if there was a match, and how many characters matched.

With the GNU extensions in gawk, you can provide an extra array argument. That array will get some information about the match. In particular, the zero item of the array will contain the entire match. If the regular expression contains sub-expressions in parentheses, the array will contain those, numbered by the order of the parentheses. It will also contain start and length information.
For example, if your regular expression were "^([0-9]+)([a-z]+)$" and your input string is 123abc, the array would look like this:

array[0] - 123abc
array[1] - 123
array[2] - abc
array[0, "start"] - 1
array[0, "length"] - 6
array[1, "start"] - 1
array[1, "length"] - 3
array[2, "start"] - 4
array[2, "length"] - 3

(The start and length entries use gawk’s two-part subscripts, like array[1, "start"].)
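You can see this for yourself with a one-shot gawk program (my example, not part of the library):

BEGIN {
    if (match("123abc", /^([0-9]+)([a-z]+)$/, arr)) {
        print arr[0], arr[1], arr[2]             # 123abc 123 abc
        print arr[2, "start"], arr[2, "length"]  # 4 3
    }
}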
You can even have nested expressions, so "^(([xyz])[0-9]+)([a-z]+)$" with an input of z1x gives array[1]=z1, array[2]=z, and array[3]=x.
Theory vs Practice
In theory, that’s all you need. You can write a regular expression to pick apart a line, parse it, and then access the pieces using the array. In practice, it is much nicer to have all of that handled for you so you can use plain names to access the data.
As an example data format, consider a line like this:
11/10/2020 07:00 The Best of Bradbury, 14.95 *****
There is a date in US format, a time in 24-hour format, an item name, a price, and a rating from 1 to 5 stars that may not be present. Writing a regular expression to grab each field is a bit complex, but not very hard. Here is one way to do it:
"^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(([0-2][0-9]):([0-5][0-9]))[[:space:]]+([^,]+),[[:space:]]*([0-9.]+)[[:space:]]*([*]{1,5})?[[:space:]]*$"
That’s a mouthful, but it works. Note that each item is in parentheses and some of those are nested. So the date is one field, but the month, day, and year are also fields.
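Before wrapping the pattern up in a library, you can sanity-check it directly with match(). This is just a quick test harness, not the library itself:

BEGIN {
    pat = "^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(([0-2][0-9]):([0-5][0-9]))[[:space:]]+([^,]+),[[:space:]]*([0-9.]+)[[:space:]]*([*]{1,5})?[[:space:]]*$"
    line = "11/10/2020 07:00 The Best of Bradbury, 14.95 *****"
    if (match(line, pat, m))
        print m[1], m[5], m[8], m[9], m[10]   # date, time, item, price, stars
}

Of course, remembering that the price is m[9] is exactly the problem the library solves.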
The Library
Once you grab the files on GitHub, you could put the fields_* functions into your code. You need to do some setup in the BEGIN rule. Then you process each line using fields_process. Here’s a small example (with the functions omitted):
BEGIN {
    fields_setup("^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(([0-2][0-9]):([0-5][0-9]))[[:space:]]+([^,]+),[[:space:]]*([0-9.]+)[[:space:]]*([*]{1,5})?[[:space:]]*$")
    fields_setupN(1, "date")
    fields_setupN(2, "month")
    fields_setupN(3, "day")
    fields_setupN(4, "year")
    fields_setupN(5, "time")
    fields_setupN(6, "hours")
    fields_setupN(7, "minutes")
    fields_setupN(8, "item")
    fields_setupN(9, "price")
    fields_setupN(10, "star")
}

{
    v = fields_process()
    # ... your code here ...
}
In your code you can write something like:
cost=Fields_fields["price"] * 3
Simple, right? The fields_process function returns false if there was no match. You can still access the normal awk fields like $0 or $2 if you want.
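Putting that together, a complete main rule might look like this (my sketch, based on the behavior described above):

{
    if (fields_process()) {
        cost = Fields_fields["price"] * 3    # named access instead of remembering $N
        printf "%s costs %.2f for three\n", Fields_fields["item"], cost
    } else
        print "unparsed line: " $0 > "/dev/stderr"   # the pattern did not match
}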
Inside
The extra functions rely on two things: the extensions to the gawk match function and awk’s associative array mechanism. In the past, I’ve added the named keys to the existing match array so you could get data out either way. However, I’ve modified it so that the match array is local, because I almost never really want that capability, and then you have to filter out the extra fields if you want to dump the entire array.
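To make the mechanism concrete, here is a rough sketch of what functions like these could look like inside. This is my guess at the shape of the code, not the actual library source; see GitHub for the real thing:

function fields_setup(pat) {
    Fields_pattern = pat
    delete Fields_names
}

function fields_setupN(n, name) {
    Fields_names[n] = name
}

function fields_process(    m, i) {   # m and i are locals, so the match array stays private
    delete Fields_fields
    if (!match($0, Fields_pattern, m))
        return 0
    for (i in Fields_names)
        Fields_fields[Fields_names[i]] = m[i]   # copy numbered groups to named keys
    return 1
}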
It is frequently useful to start the regular expression with ^ and end it with $ to anchor the entire string. Just don’t forget that the regular expression needs to handle whitespace consumption, as the example does. This is often a benefit when you have fields that can contain spaces, but if you want spaces to break fields anyway, you are probably better off with the original parsing scheme.
Another trick is to get “the rest of the line” after you have parsed off the first fields. You can do that by adding "(.*)$" to the end of the regular expression. Just don’t forget to set up a tag for it using fields_setupN so that you can fetch the value later.
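For instance, a pattern that grabs an eight-digit serial number and then keeps everything else might be set up like this (a made-up example using the library’s setup calls):

BEGIN {
    fields_setup("^([0-9]{8})[[:space:]]*(.*)$")
    fields_setupN(1, "serial")
    fields_setupN(2, "rest")
}

{
    if (fields_process())
        print Fields_fields["serial"], "->", Fields_fields["rest"]
}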
An easy extension to this library would be to make the pattern an array. The processing function could try each pattern in turn until one matches. Then it would return the index of the matching pattern or false if there were no matches. This would let you define multiple types of lines if you had a complex file format. You’d probably want to have different sets of field tags for each one, too.
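In very rough form, that extension might look something like this (hypothetical names, with the per-pattern tag handling left as an exercise):

function fields_try(patterns,    i, m) {
    for (i = 1; (i in patterns); i++)
        if (match($0, patterns[i], m)) {
            # ... copy m[] into Fields_fields using the tag set for pattern i ...
            return i   # index of the first pattern that matched
        }
    return 0           # false: no pattern matched
}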
I have a long history of abusing tools like awk to do things, like build cross assemblers. Even so, I’m probably not the worst offender.