Cryptograms
Part II: Solving the Code
In Part I of this essay, we described the 'aristocrat' style
of cryptogram and provided a form that could encode a message according to its rules.
Such messages are usually decoded with a grab-bag of simple analyses (e.g., letter frequency tests) and rules-of-thumb
(guesses of probable common 1-, 2-, and 3-letter words, predictable suffixes like 'er', 'ed', and 'ing', etc.). Trial and error,
and a reasonable amount of perseverence, will usually find success with this approach.
But for a simpler, computerized solution, here we examine an approach that uses a big dictionary
and depends on our enemy spies being able to spell.
|
|
(Be patient ... this process can take some time to complete!) |
| Decoded Message: |
| 1 solution was found for encoded characters caewundobyrtxfzspg Solution 1: tobecmanplishdryuv to become an accomplished pitcher, you really have to use 'mind over batter.' |
A couple warnings may be helpful: First, since this is a web page, I intentionally bail out of puzzles that demand a lot of processing time rather than time out the web server. For example, "HRPB ORWQH HBBP KR KGDFQ KGBI'SB NRSKG M WRK RO PRFBI ALHK ZBYMLHB KGBI GMXB DK" translates to "some folks seem to think they're worth a lot of money just because they have it" ... even after adding "theyre" to the dictionary (contractions are a pain), this takes two minutes to solve using the Perl utility I use for tougher problems, and that's too long for a Web page.
Second, be VERY CAREFUL to transcribe the puzzle correctly. A user submitted: "qnhq pml kj jy nkxjmqa na ravkrar qy iafhiil vnhpfa nkj phga qy nhxxl ghpp". The second word has a typo, and should be "fml" (the answer then is "that guy is so hirsute he decided to legally change his name to harry mann"). With the typo, "guy" is the non-word "nuy", and that trips up this algorithm.
The Puzzle :
Advice for solving cryptograms without a key usually begins with a discussion about letter frequency, although this can be an unreliable approach for a couple of reasons. First, different types of writing, from different times and for different audiences, can favor different vocabularies, and this can impact the popularity of different letters.
For example, I counted the letters from a New Yorker article about the Unicorn Tapestries in the Cloisters, and then ran the same analysis on the Charles Dickens novel "Little Dorrit":
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
The table offers some comfort ... the same 10 letters occur most frequently even contrasting contemporary magazine copy with 19th century serialized fiction, and the top 4 letters (e, t, a, and o) occur in exactly the same order. This suggests that a good solution strategy would be to compute the frequency of use for each character in an encoded message, and assign the top four to these letters.
But the other interesing question to ask is, "How many letters into a message must I be to compute a reliable frequency distribution of letters?" 25? 50? 100? Although there may be a clever approach to determine the statistics of this, I wrote a short script to check the "Unicorn" text directly, and the results are summarized in the following chart:

The script measures the relative stability of the rankings as characters are encountered in the text, counting the number of characters that hold their position in the hit parade as each character is added. (For example, after 100 characters have been read, the letters of the alphabet, written in order of frequency of use, might be "etanrohimsuvpflwdgcbxjykqz" and then we read the 101st character and rerank the letters to "etanrohismuvpflwdgcbxjykqz" ... the two strings are the same through the first eight characters, so we score that an '8').
What you find is that after the primacy of the letter 'e', there is enough similarity in the frequency of use of the next letters, that deciding even which is second- or third-ranked can not be done with much certainty. The rankings of the top five letters do become reasonably set after 2,000 characters of text, but most cryptogram puzzles are 50 to 100 characters in length, which means you can assign "e" with some certainty, and maybe "t" ... otherwise other rules or algorithms are needed. (And even that may be going out on a limb ... the second quote I tried when testing the page was 104 characters without spaces and was particularly fond of the letter 'o', not 'e'!)
Simplifications and Complications
For a message that uses all the letters in the alphabet in its key, there would be about 403 octillion permutations to consider (4 x 1026 ... 4 followed by a LOT of zeros). The odds are slightly better since most messages use only 70%-85% of the available letters, but even a message using just half the available 26 letters is simpler only by a paltry 6 billion permutations or so ... hardly worth troubling oneself over.
In short, brute force (generating and testing every possible key) is probably not a good strategy here.
Power of a Pattern (or, The Benefit of Being Normal)
To simplify in a more practical way, I introduce a way to 'normalize' a word through a transformation that has the same result for both the encrypted or decrypted form of a word. The normalized form of the word retains information about A) the number of unique characters used to build the word and B) the sequence in which those letters occur.
For example, consider the word 'bottle'. It consists of six letters, but only five 'unique' letters (the 't' is used twice). Map the symbols in the encoded word to letters from the alphabet in the order they are encountered. So, for 'bottle', let b = a, o = b, t = c, l = d, and e = e. Using this 'normalized' approach, the word 'bottle' is normalized as 'abccde'.
Any single-letter encryption of the word 'bottle', when normalized, will have the same form as the word 'bottle' will when normalized. If we add the normal form of each word as a searchable field in our database of commonly used words, we can often limit the possible solutions of an encrypted word to a much smaller universe of candidates to test.
I used a word list with 109,583 entries. If we look at all six-letter words in this list as candidates for the identity of 'xhnncq', we have 11,492 to consider (over 10% of the list!). But if we constrain our search to words of the same length and same normal form as the encrypted word, there are only 850 such words to wonder about. The search space is reduced by more than an order of magnitude!
Using What You Know (or Assume to Know)
The other problem facing us is the collection of unique words that make up the encrypted text. Even if we can narrow down the candidate list for a given word to a few hundred, a short 10-word message would still threaten to defeat the supercomputer you keep in the basement for such problems. However, since the symbols used to code the message are consistent throughout the message, we have an out.
In our solution algorithm, we gather the list of possible words that fit the normalized form of our first word. We then loop through this list and in each case we assume the word is the 'correct' choice, and update our puzzle key to reflect the concordance of encrypted to decrypted characters that this implies. When we then query our dictionary for word 2 in the puzzle, we select words that fit the normal form AND require that the words agree with (or, at least, don't conflict with) the key we've started to construct. When such a query returns a result set of zero words, then we assume we've found a dead-end, and backtrack to the next candidate word
To add a little practical speed, when the encrypted puzzle is loaded, an array is created of the unique words in the puzzle sorted longest to shortest. The sorting seems to help speed up the solution of many (not all) puzzles. Messages composed of predominantly 5, 6, and 7 letter words tend to take the longest to solve.
The garden-variety newspaper cryptogram often has one very long word which can make syntactic analysis of the puzzle difficult, but is a spectacular fatal weakness that our dictionary-based approach can exploit. For example:
AY UBNYLB ZV ZNNYLQDORCBE QOANCBM, GYH MBZDDG CZPB AY HRB "LOVE YPBM UZAABM."
When normalized, that fourth word ("ZNNYLQDORCBE") has just 9 possible dictionary alternatives, and the script can leverage this fact to quickly shrink the search space and nail down a whole lot of letters at once. As a result, the solution is generated in just 14 seconds ("To become an accomplished pitcher, you really have to use "mind over batter".)
The Weakest Link(s)
This algorithm will (usually) generate rapid solutions but has a major weakness: If the encoded message uses a word that is not in our dictionary, either because our dictionary is not up to the spy's vocabulary level or the spy can't spell very well, a 'correct' decoding of the word will look like it is incorrect. A corollary of this problem is that many cryptograms are quotes that include the name of the author of the quote, and proper names often do not appear in our commonly-used-words dictionary.
Another interesting problem with this sort of puzzle is that solutions are often not unique. Most puzzles have only one 'real' solution, but this decoding algorithm, which has no sense of syntax or idiom, may find several. Puzzles comprised of just a small set of unique words are problematic for all sorts of decryption approaches, but this one in particular. An encrypted form of the question "Are my eyes really brown?" not only takes an agonizing 5.5 minutes to solve, but also results in over 350 valid, if not entirely sensible, solutions. ("Are by eyes really print?")
Even longer puzzles display this problem, although a longer puzzle usually uses enough letters that the bulk of any candidate solution is the same, and only a few alternatives emerge. For example, I pulled an encoded quote by Erasmus from Louise B. Moll's puzzle book, "Clever Cryptograms" ("Those who would not make room for the poor fellow to come in through the door will have him thrust upon them through the ceiling.") This only requires a couple of minutes to solve, but results in multiple solutions since "have", "haze", and "hake" all fit, with "make" and "maze". The correct answer is usually obvious, as long as there aren't several hundred to consider.
In the implementation presented here, I employ recursion to solve the puzzle and must forcibly end processing after 1000 calls to the recursive function, simply because it is easy for the script to quickly overreach the processing resources allocated by the web server if allowed to run amok. (An ActiveState Perl implementation of the same algorithm run from a command prompt will run to completion without a similar safety net.)
Links
Online sources of cryptograms abound. One that I particularly like is the Cryptogram Corner by Parsly, McGee, and Vanzo, which provides a daily puzzle, an interactive form for solving it (which I'd like better if it worked with Firefox, having all but banned IE from my PC), and links to several other cryptogram pages.
Copyright 2006 Mark Van Dine, All Rights Reserved

