This post, and the programming it contains, were some of the first programming I ever did. As a result, they’re really quite bad. I include it for posterity, and because it’s funny.

The problem

In high school I used to use a Wordlock on my locker. It had four dials of ten letters each, and each ring rotated independently. Predictably, one summer I returned to school to find that I had forgotten my combination - one of a thousand possibilities.

But not probabilities. There were a few caveats. Most of the words that could have been formed (AEJX, WQRX) were unlikely. I had the advantage of knowing that I would have chosen a real word (MATH, OXEN) rather than a jumble of letters. I figured the 1000 possibilities could be narrowed down to just a few (~100) probabilities, and that a little scripting would do the trick.

I had been learning just a little bit of Python over the summer and this challenge seemed like the very definition of a real-world application. It was combinatorics, arrays, recursion, and language analysis. I found the Enchant module for Python (although, in retrospect, just a simple check against wordlist.txt would have done the trick).

#!/usr/bin/env python
import numpy
import enchant
d = enchant.Dict("en_US")

After a standard preamble, I wanted to create a structure to approximate the physical structure of the rings on the real lock. My ring function took the alphabet and turned it from a straight string of characters into an array of string literals.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def ring(x):
   y = []
   for i in range(len(x)):
   return y

The four rings ran [k,t], [a-j], [u-z], and [k-t] again, respectively. I assigned four ring arrays using the ring function.

r1 = ring(alphabet)[10:20]
r2 = ring(alphabet)[0:10]
r3 = ring(alphabet)[20:30]
r4 = ring(alphabet)[10:20]

Then came the combinatorics. Using every possible combination of the four rings, I wanted to compile the string literals back into a word, check the word against a dictionary, and, if it was a real word, print it. Despite there being a thousand combinations, there were rarely more than about 50 words.

for i in range(len(r1)):
   for j in range(len(r2)):
      for k in range(len(r3)):
       for l in range(len(r4)):
          x = (r1[i])+(r2[j])+(r3[k])+(r4[l])
        if d.check(x) == True:
           print x

Using this method, I found my combination (NEXT) almost immediately from the list of twenty or so words, but the thought occurred to me: what if there had been fifty words to sift through? A hundred?

The next thing I did was to write a frequency generator. I compiled a few large, well-known texts off Project Gutenberg (the Declaration of Independence, Sherlock Holmes, Leaves of Grass, Huckleberry Finn, and Ulysses - that seemed to cover all my bases) and wrote a script to strip the text of punctuation, divide it by spaces into words, and create a table with each word and the number of times it appears in a good sampling of the english language.

# !/usr/bin/env python
import re, string, operator, pickle

regex = re.compile('[%s]' % re.escape(string.punctuation)) #regex rule

a = {} #frequencies dictionary
b = {} #output dictionary
temp = []
def strip(x): #strips a string of punctuation
  for i in x:
    return temp

# stages: plaintext to string to words
f = strip(open('/home/james/code/declaration.txt', 'r') .read().split(None))

for i in f:
  a[i] = a.get(i,0) + 1
klist = sorted(a, key=a.get)

# write python dict to a file
output = open('frequencylist.pkl', 'wb')
pickle.dump(a, output)

The script took nearly no time at all to run, and it gave me a pickl’d version of the dictionary class I could import later into another program without having to run all over again. The next step was to take my list of candidates and reorder them by frequency.

Once the dictionary (not the Enchant english-language dictionary, the python-array-classification dictionary) was imported, every word in the candidates was checked against the array and sorted into a new list by frequency; words like this went to the top, and words like tank went to the bottom. I assumed (hopefully with some degree of accuracy) that the words most likely to be chosen for a word lock would fall somewhere near the middle-top; not too likely, but not seriously obscure either.

#!/usr/bin/env python
import pickle
pkl_file = open('frequencylist.pkl', 'rb')
list = pickle.load(pkl_file)
candidates = {}
rejects = []
words = open('/home/james/code/words.txt', 'r').read().split(None)
for word in words:
if list.has_key(word) == True:
  candidates[word] = list.get(word,0)
else: rejects.append(word)
  candidates = sorted(candidates.items(), key=lambda(k,v):(v,k))
print "candidates:", candidates
print "rejects:" , rejects


The general workflow was to take the 1000 possibilities, reduce them down to far, far fewer probabilities, and sort them byfrequency in the hopes of finding the correct combination sooner than later. There were a few implicit assumptions, however:

  • That the combination would have meaning. (WQXR rather than XRQW)
  • That the combination would be a real word. (WOOT rather than WQXR)
  • That the word would be in the dictionary. (FORK rather than WOOT)
  • And then a final hope that the word would be a likely one.

As it turned out, the script worked nearly perfectly, running in under five seconds total and figuring out my combination nearly immediately. However, there were a few issues:

The scripts were written very early on during my Python learning curve, and are thus clunky. They do array manipulations that could have more easily been done with mapping or list comprehensions, they juggle strings and string literals, and they’re actually pretty slow. Compared to the far preferable grep-based solution, five seconds is abysmal.

  • They make assumptions. In a real cryptographic situation, this would be a last-ditch effort, hoping the password was truly password rather than p4s5w0r_d

  • The scripts were distributed. They did in three scripts what should have been just one. This is because I wasn’t sure of the exact terms and conditions of the problem I was working on, and I wasn’t confident enough to risk damaging the first step in the process to add the second step in the same script. There’s lots of juggling of text files and pickled dictionaries and the like.

To my further shame, I learned a few weeks later that regular expressions (of which I heard much but understood little) were designed to handle exactly these sorts of problems. With 20/20 hindsight, opening up a terminal and entering

egrep "^[k-t][a-j][u-z][k-t]$" /usr/share/dict/words

would have done the trick about a million times faster, but that wasn’t the point, was it?