BotGadget: Squoze Your Data

I have a confession to make. I enjoy the challenge of squeezing software into a tiny space or trying to cut a few more cycles out of a loop. It is like an intricate puzzle. Today, of course, there isn’t nearly as much call for that as there used to be. Today even a “small” microcontroller has a ton of memory and resources.

Even so, there’s still a few cases where you need to squeeze those last few bytes out of memory. Maybe you are trying to maximize memory available for some purpose. Maybe you are anticipating mass production and you are using the smallest microcontroller you can find. Or maybe you’re doing the 1 kB Challenge and just want some advice.

One way to find techniques to maximize resources is to look at what people did “in the old days.” Digital Equipment computers once had a special character set called Squoze (or sometimes DEC Radix-50). This technique can be useful when you need to get a lot of strings into memory. The good news is that you can reliably get 3 characters into 2 bytes (or, as DEC did, 6 characters into 4 bytes). The bad news is that you have to pick a limited character set that you can use. However, that’s not always a big problem.

The Big Blue Connection

The key to Squoze is that you have to pick 40 characters that you want to encode. That’s enough for 26 letters, 10 digits, and 4 punctuation marks (a space plus whatever else you want). Notice that 26 letters means you have to pick upper case or lower case, but not both.

You might wonder why it was called DEC Radix-50 if you can only have 40 characters. The 50 is octal (base 8) which is 40 decimal. The Squoze name was actually used both by DEC and IBM. In the IBM scheme, though, a 36-bit word held two flags and 6 characters from a set of 50 (decimal 50, this time). Historically, the use of Squoze for symbol tables is why a lot of old compilers used 6 character symbols with only certain allowed characters (if you ever wondered about that).

How it Works

Once you have 40 characters in mind, you can treat them each as a digit in a base 40 number. Hold that thought and think about a regular decimal number. If you have a string of digits from 0 to 9 coming in, you can build a decimal value like this:

value=0

while not end of digits

   value=value * 10

   value = value + digit

end while

If you have the digits “643” then the value will progress from 0 to 6 to 64 to 643.

If you generalize to base 40, the “digits” are numbers from 0 to 39 (representing a character) and the *10 becomes *40 in the above pseudo code. If you prefer, you can think of it mathematically as the first digit is multiplied by 40 squared (1600), the second digit by 40 (40 to the first power), and the final digit by 1 (40 to the zero power).

Suppose, then, that we select 0 to represent a space, 1 to represent an A, 2 for a B, and so on. The string “A B” (that is a space in the center” would be 1, 0, 2. Running through the algorithm, the result would be 1602.

Decoding is just the reverse. Starting with 1602, for example, you do an integer division by 1600 to get a 1 (A). Take that away leaving 2. Dividing by 40 gives you a space (0). Then there’s 2 left–a B–so the string is “A B” as you’d hope.

Why Forty?

Why go through all this trouble? Why use 40 and not, say, 41 or 50? Think about the largest number you can represent with three base 40 characters. 39*1600+39*40+39 = 63999. That fits in a 16-bit word. If you went with base 41, the largest possible number is 68920. That’s too big for a 16-bit number.

The point is, using this scheme you can pack 3 characters (from your set of 40) into two bytes. Compared to normal ASCII characters, that’s a savings of 33%. Not much, but compared to other compression schemes it is simple to implement and it always saves 33%. Well, at least, roughly. If your text length isn’t evenly divisible by 3 you may have a padded character or two which reduces a little efficiency depending on the total text length.

Imagine a bunch of LCD menu entries stored in EPROM. Saving 33% could leave you some room for more strings or more unrelated features. Of course, you also need to factor in the size of the code and the character table. Frequently, you don’t need both encoding and decoding at run time, but there’s still some overhead to consider. You have to evaluate your needs and decide appropriately.

Enhancements

If you chafe at the 40 character limitation, you can pull a few stunts to get more at the expense of possible efficiency loss. For example, for some applications, you can assume the first letter of a string is uppercase and the rest are lower case. Or select your 39 most common characters, use one more as an escape character and then pick 40 more for a total of 79 characters. You could use the escape character as a “shift” character, too, if that works better for you.

Implementation

You can find a simple C-language implementation on GitHub. The make file generates two executables, one for encoding and one for decoding. You can define a symbol to avoid building the executables if you want to experiment with the code as a library.

The encoder and decoder will use a default character set if you like. You can also provide one yourself. The code is hardly optimized so if your application shares code and data space, you probably want to spend some time paring the code size down. However, for a demo or if your code space is separate and plentiful, it is probably good enough as it is.

The code is set up so you can run the encode and feed its output to the decoder:

$ ./encode | ./decode
Hello Hackaday. A man, a plan, a canal, Panama.
Characters in=48
(13012) HEL
(19800) LO 
(12843) HAC
(17644) KAD
(2637) AY.
(40) A 
(20854) MAN
(60801) , A
(652) PL
(2198) AN,
(40) A 
(4854) CAN
(2118) AL,
(641) PA
(22453) NAM
(3080) A. 
Bytes out=32
Full String (length=48):
HELLO HACKADAY. A MAN, A PLAN, A CANAL, PANAMA. 
$

Note that the characters are shifted to upper case by the encoder.

Worth It?

This is just one more trick to put in your toolbox. It isn’t for every project. You need to be able to live with the limitations. You also have to be able to accept the additional code and the runtime overhead. Depending on the data you have, there may be better options. For example, a data logger might store floating point numbers or some other scheme.

Huffman encoding, run length encoding, and other methods might do better–it depends on your application. However, I do think it is interesting to mine some of the old techniques and apply them to new solutions.

If you had fun with this, you should throw your hat in the ring with the 1 kB Challenge I mentioned before. Hackaday wants to see what you can do with a 1 kB compiled binary limit for a microcontroller-based project. Give it a shot and learn the space-saving techniques of old along the way.

Filed under: Engineering, Hackaday Columns, Software Development

BotGadget

Tuesday, November 22

Squoze Your Data