The Primordial Quiz

Evolving an Answer to a Logical Puzzle



Sometimes puzzles can be solved by an algorithm that takes a dumb guess and evolves it to a solution using a tactic suggested by the action of Natural Selection: gradual accumulation of small improvements to the original guess, as measured by some sort of fitness test.

- Introduction
- Propp's Self-Referential Aptitude test
- Simulation Results
- Conclusion
- Programs for Solving Propp's Puzzle


Introduction

In his book "The Blind Watchmaker", Richard Dawkins illustrated the power of algorithms that accumulate small change to find optimal solutions to problems. In one example, he shows how a simple computer program can start with a random series of characters, and through an iterative process "evolve" that pattern to match a specific phrase.

Successive random guesses, conjuring up the old example of monkeys at typewriters, is only a strategy for the very patient. A 28 character string of text suggests 27(28-1) possible combinations (26 characters and a space raised to the 27th power) … about 4.4 * 1038 different combinations, which would take several million times the current age of the universe to get through, even if you were processing a billion guesses a second.

It is one of a series of clever examples Dawkins provides to show how application of the basic mechanisms of evolution (non-random selection of variation, even random variation) can be put to work solving difficult problems.

The basic framework of such a solution is to create a population of candidate solutions, and to assess each member of the population with a "fitness function".  The member(s) with the highest score from the fitness function are used to create a new generation (a new population) of candidate solutions.

Dawkins' candidates are just strings of random characters to begin with.  Each such string is compared to the sentence “Methinks it is like a weasel” (a line from Hamlet).  The string that is the closest match (has the highest relative fitness) is used to create the next population of candidate strings.  Fitness is measured by comparing the characters from a candidate string with the 'target' string, awarding one point for every character that matches.  (If I'm looking for a match for the word "weasel" and the guess is "xcathj", I score one point because the two strings have a match in the third character ... the letter 'a').

The guess with the highest score is simply copied several times to create the next generation of candidates, allowing for mutation (typos) when the copies are made (e.g., every 100th character or so we insert a random character into the copy). The fitness is measured for the members of the new population, and the most fit is used to make another generation, and so on. The occaisional typos intoduce variation that sometimes acts to improve fitness. We keep making new generations from our best match until we find a guess that completely matches our target.

In Dawkin's example, a match is 'evolved' in just a few seconds on a standard desktop computer.

Propp's Self-Referential Aptitude Test

Several years ago I was sent a copy of the "Self-Referential Aptitude Test" (SRAT), a puzzle designed by Jim Propp, an Associate Professor of Mathematics at the University of Wisconsin (Madison).

It is an interesting puzzle because it is a twenty question multiple-choice quiz about ... the quiz itself! A sample question from the quiz:


4. The number of questions with the answer A is

(A) 4
(B) 5
(C) 6
(D) 7
(E) 8


Propp states that the quiz can be solved using logic, and that it has a single unique solution.

Now, I've met people who could sustain the level of logical focus needed to score 20 out of 20 on a test like this, but I am not one of them.  Fourth grade math problems with 5, rather than 20, moving parts, have me on the ropes for hours.

So, I decided to build a computer program to solve the SRAT using the Dawkins approach to the "weasel" problem.

Any proposed solution to the quiz can be represented as a string 20 characters long (each an A, B, C, D, or E, corresponding to multiple choice answer to the question.)

An advantage of the quiz being Self-Referential is that we can create a fitness test for our program that uses any given set of answers to determine how “correct” those answers are.  That is, the answers for a completed quiz can be (must be) compared to themselves to see how well they score.  For example, if we answer Question 4 as “C … there are six questions answered 'A' in this test”, we only have to count the ‘A’s in our answer string to see if we are correct.  If that answer is correct, we score one point.  Unlike the Dawkins example, we don’t go into this puzzle knowing the answer key. But even so, we have a non-ambiguous definition of fitness, and that's really what matters. We want to evolve an answer string that results in a perfect score (20 out of 20).

Simulation Results

It turns out that such a program is successful, but there can be a lot of variation in the results.  A C program that uses 100 members per generation, a mutation rate of 10% (meaning that when we copy a 'parent' string from one generation to the next, the chances are one in 10 every time we copy one of the answers in the string that a 'typo' will occur and a different answer will be placed in the copy). We choose to stop the test after 10,000 generations.

Although this is a sort of simple genetic algorithm (GA), the 10% mutation rate is uncharacteristically high. This is because other GAs, written to mimic actual chromosome exchange, employ a mechanism called 'crossover' to keep a reasonable amount of variation flowing into the experiment.

With these parameters, however, the program finds the unique solution every fourth or fifth try, usually in about 1,000 generations.  (My best attempt found the solution after 223 generations.  The worst successful attempt took 3,569 generations).  The program usually can get to a score of 15 or 16 (out of 20) within 100 generations, but most of the time will get locked up in some sort of local optimum and won’t advance beyond a score of 16 or 17.

In an attempt to find a "sweet spot" of available parameters (a combination of mutation rate, initial population size, and number of generations that would usually succeed), I created a lot of simulations attempting to find optimal population sizes and mutation rates. Population size turned out to be not much of a factor.

The mutation rate was another story. In a regular Genetic Algorithm, most of the variation in a population will come from "crossover", when pairs of solution strings swap portions of each other as part of the "breeding" phase.  Prior to adding crossover as a source of variation, I ran a simulations on population of 100 using mutation rates from 10% to 30%, at 1% increments (10 trials at each setting).

The best results occurred with mutation rates on the range of 20% to 26%, and I was surprised to see success in all 10 trials run when the mutation rate was 23%.  It is also interesting to note that at these levels of mutation, it is common to see interim results fall off from temporary peaks,  For example, by the 100th generation the program may find a solution that scores a 19, but then subsequent mutation substantially modifies the population and the peak on the next round might be a 17.  This is to some degree a strength of the algorithm, in that it is less likely to stagnate at false peaks.

Conclusion

When modified to include a simple sort of crossover, the program was routinely successful, even with a low mutation rate. 

Richard Dawkins uses simple models like this to illustrate that the non-random accumulation of small changes over many generations, the central mechanism of Darwin's theory of evolution, is immensely powerful and can certainly account for the variation among living things we see in our world today.

But there is a power to this approach that is also very appealing because of its very human-ness.  Although many of us spend years mastering logical disciplines for applications in engineering, mathematics, and science, our own personal experience so often relies on the simpler exercise of trying something, seeing if it works, and then modifying in order to move closer to an optimal state.  And that, it seems, is a very powerful strategy indeed.

Programs for Solving Propp's Puzzle

propp.c

Written in Borland C++ version 5.02.  This is a simple console program and I write very vanilla C code, so it should be fairly portable.

This version incorporates crossover, but the 'mutation-only' code is simply commented out and should not be too tough to re-enable.

When run at the console, propp.exe will take two arguments:  population size and mutation rate.  Population size will default to 100 if not specified and won't go above 1,000 (again ... easy to change).  Mutation rate is specified as number of mutations per thousand (don't ask ... a chimp could have written something more intuitive but I was pressed for time). If you want to enter this on the command line, you'll have to include the population size argument as well.

propfast.c

What a terrible name for a slow program.  This is a modification of the propp.c program that let me generate the results of the main simulation mentioned in the article.  It loops the quiz through populations from 100 to 1,000, with mutation rates from 1% to 10% (and 10 repeats per combination). It runs for a long time, and is crying for some code-tuning.

This program will compile and run using gcc under Linux. The only changes required were for generating random numbers ... Borland C++ seems to object to a generic function using RAND_MAX, and I did not research why this was so.


 
Copyright 2006 Mark Van Dine, All Rights Reserved