The online project notebook of
Mark Van Dine
omo sanze lettere




- Current
- Notebook
- College

A collection of quotes and excerpts.


  Light Verse for a Heavy Universe
  Puzzle Machine
  Google
  Wikipedia
  Purportal.com
  Yahoo!
  Professional Writers & Editors
  BBC News
  Drama Queen
  Commoner's View


Cracking the Code, Part III

In Part II of this essay, we provided a dictionary-based solver for the 'aristocrat' style of cryptogram.

As long as the scrambled message is made up of commonly used words, this solver will usually do the trick, but I've been managing software projects long enough to have learned the value of a solid testing plan, so I decided to test the solver against some real 'cons' from the American Cryptogram Association (ACA).

I discovered that the solver was pretty good, but also that puzzle designers aren't afraid to bend a rule to create a challenge, and now and then they are apt to display a sick sense of humor.

SPOILERS Warning! In the following I provide some answers to May-June 2005 ACA aristocrat cons. Purists are hereby warned.

The Good, the Weird, and the Just Plain Mean

Several times a year the ACA publishes 'The Cryptogram', a Journal of the American Cryptogram Association, which includes club news, interesting columns, book reviews, and sets of different types of cryptograms, or "cons". The 'Aristocrat' con (an ACA term, as far as I know) is the puzzle we are discussing here (Xenocrypts are cons done in other languages, and Patristocrats eliminate the word divisions and punctuation, and frankly, give me a headache.)

In the May-June 2005 issue, 25 aristocrat cons were provided. I am going to discuss these specific cons in the following AND PROVIDE SOME ANSWERS. If you are an association member and do not appreciate SPOILERS, now is the time to stop reading!

I entered each in the dictionary-based Solver to see how it would do. In general, not bad, although I learned a thing or two toward's tightening up the code. For example, the abbreviation 'TV' tripped me up, as did many contractions unless I took care to remove apostrophes and some punctuation from the puzzle text.

Sometimes the difficulty was entirely predictable, as when proper names enter into the code text (I had 'Aztec' and 'Churchill' in my word list, but the 'G' in 'G. Ace' (short for Goodman Ace) stumped me. In another case, the printed con misplaced the hyphen in 'heavy-weight', and since neither 'heav' or 'yweight' are words, the Solver gave up.

And sometimes the vocabulary just stumped the program. For example, cons included the words 'maquillage' (makeup), 'gerenuk' (an east African antelope), and 'massif' ("a block of the earth's crust bounded by faults or flexures and displaced as a unit without internal change" ... the second definition in Webster's). Gerenuk. Come on.

Usually I just had to edit out apostrophes or drop suspiscious-looking proper names (which the Journal flags with asterisks) and the Solver would figure out the remaining words. The Solver was also immune to several obvious attempts to skew letter distributions in unlikely directions, including one very clever puzzle that used only words starting with 'M', and another that worked so hard to accomplish this that the answer was unintelligible ("obscure symbolic lawsuit drawn; scenario fought. weigh complaint. buoyant neighbor impugns ignoble felony. ostracized."). What?

The web-based Solver has some built-in fail-safes to keep it from processing too long, so I occaisionally had to employ the command-line Perl cousin of the Solver, which is built for more diagnostic flexibility and tougher problems.

But then I ran into the 25th (and last) con. 24 words, and each and every one of them was four letters long. It was to inspire me to mutter a few four-letter words of my own.

Attack of the Four-Letter Words :

The given puzzle is:

kjka izvn nvhr gudy kjuq. dfhu fhsm rgnu 'fjuh vbhz ljrf', fvwr kjlq owvu sgsa shji, koar ivoz onsa ksoh konr izvt fonh kgzm.

I first assumed that this was another editing slip-up ... that a patristocrat had been added to the list of aristos by mistake. But ACA patristocrats are given in 5-letter clusters (and don't include punctuation or other hints about word boundaries), and although 'The Cryptogram' may have the occaisional misplaced hyphen, they never publish mistakes of that magnitude any way.

There are over 3,000 four-letter words in our word list of 109,000+, but 80% of these have the same normal pattern indicating no repeating letters (true of 22 of the 24 words in the puzzle as well). This fact makes it tough for our depth-first algorithm to rule out dead-ends very effectively, and as a result we have to check a lot of potential word combinations. It isn't impossible, but this is an approach that will usually solve most puzzles in a few seconds. The PHP version of the algorithm gives up quickly. The beefier Perl version will chug away, but even on a fast PC, it takes almost 5 hours of processing time before a solution is found, and an exhaustive search had a long way to go (and I wasn't going to wait around to see what that would take).

The problem gave me the opportunity to create an algorithm I'd been wondering about ... one that tried to create a decoding key a letter at a time, rather than a word at a time. The pseudocode looks something like this:

1 Determine the set of unique symbols used in the puzzle called C-UNIQUE

2 Create a blank KEY with the same number of slots as C-UNIQUE.

3 Determine the array of unique words used in the puzzle called WORDS

4 Starting with the first blank position in KEY, take turns substituting each alphabet letter not already used in the KEY ...

5 With this temporary key, check the dictionary for each encoded word in the WORDS array*, and count the number of candidates for each:

- If any of the words have ZERO candidates, give up. Go back to step 4 and go on with the next alphabet letter.

- If each word in WORDS has exactly ONE dictionary candidate, the puzzle is solved! Print the answer. (Since many puzzles are non-unique, you can decide whether or not to bail out now. If not, return to step 4 and look for more solutions.)

- If all of the words in WORDS have at least one, but at least one has more than one solution, add the letter to the key, and then start the analysis again (from step 4) with the next available slot in the KEY.

*We start with a "blank" key, but this is actually a key with all SQL wildcard characters. For each encrypted word, we query our dictionary database for the set of words with the same 'normal' form (see Cryptograms II for a description of how this is defined) AND that match any known key letters. Early in the process, few key letters are known, and so the wildcards are used instead. For example, if our key-under-construction is mostly blank, but does suggest the letter 'b' to decrypt the code letter 'k', when we query the dictionary for candidate solutions for the encrypted "KJKA", the SQL is something like: SELECT COUNT(word) FROM Dictionary WHERE normal="abac" AND word LIKE "b_b_". (There are 7 such: babe, babu, baby, bibs, bobs, bubo and bubs).

This, too, can be a lot of processing, depending on the puzzle, since it is fairly easy to get 5 and 6 characters deep into the key before 'impossible' alternatives start to give themselves away (hundreds of millions of tests), but a full cycle can run in about half a day, and a few simple heuristics can improve on that fairly dramatically. For example, my perl script allows me to analyze the letter frequency in the puzzle, and from that I can make an educated guess about which symbol is letter 'e'. (Of course, in this puzzle, 'e' turned out to be the third most frequent letter ... nothing's ever easy ... so a little trial and error was involved.)

Using this approach, in about an hour I generated the solution:

baby frog goes into bank. then held sign 'hand over cash', hops back upon lily leaf, buys four ugly blue bugs from huge bird.

Not exactly a message that rewards the effort, although I do appreciate the creativity that went into its construction!

 

Copyright 2005 Mark Van Dine, All Rights Reserved

 
     

Recent additions:

- Tough Sudoku

- Scramble Redux

- Sudoku

- Cryptograms III

- Cryptograms II

- Cryptograms I

- Peg Solitaire

- Mining Words

- What's My Line?

- Solve the Jumble!

- Scramble Squares

- SRAT Fever



Check out 'The Puzzle Machine'!

 

home - contact