I had some fun decoding hexidecimal stings and learned a few things in the process.

I was looking through job postings today and I came across a LoveWithFood posting that really got my attention. Not the posting itself so much (although the job looks really cool), but because they included a small programming question inside the post. It was a hexadecimal-encoded string that contained a message that should be decoded and included in the job submission. The encoded string is this:

e299a5205765204d616b652054756d6d792048617070792120e299a5

I don’t know about you, but I love secret messages. I just couldn’t help myself.

In python, this is really easy because it has encode() and decode() that both take a format as a parameter. So the quick solution is this:

>>> mystring = "e299a5205765204d616b652054756d6d792048617070792120e299a5"
>>> print mystring.decode('hex')

But that’s really boring. And decode() gets to do all of the work, so I set out to implement the decode operation on my own (in python), and I made some interesting discoveries along the way.

Character Encoding

Let me first explain what I’ve learned about character encoding. Well, there is too much to explain, so let me sum up. You’re certainly familiar with ASCII, which is a 7-bit encoding scheme developed by the U.S. to represent character symbols. This means only 128 unique characters can be represented and this is an obvious problem for languages that don’t share our relatively small alphabet.

So unicode was created to solve this problem. Unicode is a standard, but not an implementation. Utf-8 is a unicode implementation that uses 32 bits to represent character symbols (some of the higher bits aren’t used but that’s beyond this post). Utf-8 will use fewer bytes if it can, so not all characters encode to a 32 bit value. One neat fact is that utf-8 uses the same byte value for ASCII characters which has the same value as ASCII encoding, so the character values are the same.

Let’s take a look at a few examples:

>>> 'a'.encode('hex')
'61'

>>> 'é'.encode('hex')
'c3a9'

>>> '…'.encode('hex')
'e280a6'

I just glossed over a huge amount of information, but bear with me. The basic gist is this: given a long hexadecimal-encoded string, decoding it means breaking it apart into individual bytes and decoding each byte by itself. In some cases, decoding a particular byte may require the value of the previous byte. Since a byte is represented as two hexadecimal digits, we need to separate the string into pairings and decode each pairing.

Decoding without decode()

In the encoded string we are given, each pair of two characters contains the encoding for a single printable character, so the goal is to separate this large array into pairings of two that we can individually decoded and then combine to see the final result.

Python’s :: array syntax

Python provides a really powerful array index notation that allows you to quickly slice and dice an array with very little effort. One of these is the :: operator which allows you to take each nth entry starting at some starting point. So, for example, if you wanted to pull out all of the even-indexed elements in an array called A you would say A[0::2]. That is, starting at index 0, take every second element. You can get the odd-indexed elements by changing the start index to 1 like so: A[1::2].

This allows us to separate the large array A into two smaller arrays B and C with the following propery: at each index i from 0 to n, B[i] and C[i] contain the two characters we want to pair together. If only there was a way to zip the two arrays together…

The zip() function

Of course python has a zip() function, which does exactly what we want. Here’s the documented definition:

Return a list of tuples, where each tuple contains the i-th element from each of the argument sequences. The returned list is truncated in length to the length of the shortest argument sequence.

So acquiring a set of tuples that we can then decode individually is as simple as:

>>> [''.join(c) for c in zip(INPUT[0::2], INPUT[1::2])]
['e2', '99', 'a5', '20', '57', '65', '20', '4d', '61', '6b', '65', '20', '54', '75', '6d', '6d', '79', '20', '48', '61', '70', '70', '79', '21', '20', 'e2', '99', 'a5']

The above is a list comprehension and is one of the many features that make Python a joy to work with. If you’re not familiar with list comprehensions, I highly encourage you to learn more about them.

Now that we can take the string apart into pieces, it’s easy to convert each byte to decimal:

>>> [int(''.join(c), 16) for c in zip(INPUT[0::2], INPUT[1::2])]
[226, 153, 165, 32, 87, 101, 32, 77, 97, 107, 101, 32, 84, 117, 109, 109, 121, 32, 72, 97, 112, 112, 121, 33, 32, 226, 153, 165]

And now we can interpret each decimal value as a character:

>>> [chr(int(''.join(c), 16)) for c in zip(foo[0::2], foo[1::2])]
['\xe2', '\x99', '\xa5', ' ', 'W', 'e', ' ', 'M', 'a', 'k', 'e', ' ', 'T', 'u', 'm', 'm', 'y', ' ', 'H', 'a', 'p', 'p', 'y', '!', ' ', '\xe2', '\x99', '\xa5']

And now the message is starting to show. It appears there are unicode characters at the beginning and end of the string. The print function knows how to correctly interpret those sequences, so decoding the string is now:

>>> print ''.join([chr(int(''.join(c), 16)) for c in zip(INPUT[0::2], INPUT[1::2])])
♥ We Make Tummy Happy! ♥

So there you have it. Decoding secret messages is fun, but I think the really cool part is the use of the :: array index operator, the zip function, and python’s list comprehensions.

What about Ruby?

The job requested the answer in Ruby, and it turns out the answer is not so different from what we’ve just done in Python. Here is a Ruby solution:

>> INPUT = "e299a5205765204d616b652054756d6d792048617070792120e299a5"
=> "e299a5205765204d616b652054756d6d792048617070792120e299a5"

>> INPUT.scan(/../)
=> ["e2", "99", "a5", "20", "57", "65", "20", "4d", "61", "6b", "65", "20", "54", "75", "6d", "6d", "79", "20", "48", "61", "70", "70", "79", "21", "20", "e2", "99", "a5"]

>> INPUT.scan(/../).map(&:hex)
=> [226, 153, 165, 32, 87, 101, 32, 77, 97, 107, 101, 32, 84, 117, 109, 109, 121, 32, 72, 97, 112, 112, 121, 33, 32, 226, 153, 165]

>> INPUT.scan(/../).map(&:hex).map(&:chr)
=> ["\xE2", "\x99", "\xA5", " ", "W", "e", " ", "M", "a", "k", "e", " ", "T", "u", "m", "m", "y", " ", "H", "a", "p", "p", "y", "!", " ", "\xE2", "\x99", "\xA5"]

>> INPUT.scan(/../).map(&:hex).map(&:chr).join
=> "\xE2\x99\xA5 We Make Tummy Happy! \xE2\x99\xA5"

>> print INPUT.scan(/../).map(&:hex).map(&:chr).join
♥ We Make Tummy Happy! ♥

In the absense of decode(), I personally find the Ruby solution to be more concise.