English spelling is extremely difficult. It has few rules, and even those are frequently broken. Despite all this, millions of people write in English every day. How does everyone manage it? Is spelling learned by brute memorization, or does the brain do something clever?
Despite what preschool teachers might tell you, people don’t learn to spell by brute memorization. When we hear a new word—sometimes even a made-up word—we can usually guess how it will be spelled. If we simply memorized the spelling of every word, guessing new spellings would be impossible. It’s much more likely that the brain of an English speaker develops a probabilistic system for spelling words. This system might have some hard-wired exceptions, but it’s less rigid than a simple lookup table.
So, if the brain can learn a system for English spelling, can an artificial neural network do the same? Spoiler alert: yes it can.
You might wonder what the input to such a neural network should look like. For example, could we feed the network raw recordings of speech and expect it to produce spellings? At this point in time, I think it would be unreasonable to expect good results from that approach; even modern speech recognition systems first convert speech into phonetic transcriptions before attempting to produce predictions over entire words. So, instead of feeding my network raw sound, I decided to feed it phonetic transcriptions in IPA (the International Phonetic Alphabet). IPA is a standard way to write down speech sounds without needing large audio files. As an added bonus, it’s easy to find dictionaries with IPA transcriptions.
After deciding on the input format, I had to design the network itself. Since words vary in length, I had little choice but to use a Recurrent Neural Network. In fact, the architecture I chose closely resembles the one used in Graves et al., 2006 for speech recognition. To get technical, each phoneme of the input is duplicated four times (“mmmmʌʌʌʌŋŋŋŋkkkkiiii”), two separate RNNs are run over the sequence (one forwards, one backwards), and Connectionist Temporal Classification (CTC) determines the final network output (i.e. the spelling). This setup has two side effects (a code sketch follows the list below):
- Spellings cannot be more than four times the length of the phonetic transcription.
- The network is encouraged to “align” phonemes with their corresponding letters.
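To make the setup concrete, here is a minimal PyTorch sketch of this kind of bidirectional-RNN-plus-CTC architecture. Everything in it (the tiny phoneme inventory, the `SpellerNet` name, the layer sizes) is illustrative, not taken from the actual project:

```python
import torch
import torch.nn as nn

# Illustrative phoneme inventory; a real one covers every IPA symbol in the dictionary.
PHONEMES = ["m", "ʌ", "ŋ", "k", "i"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"
BLANK = 0  # CTC blank label; letters occupy indices 1..26


class SpellerNet(nn.Module):
    def __init__(self, n_phonemes, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 32)
        # One forward and one backward RNN over the duplicated sequence.
        self.rnn = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, len(LETTERS) + 1)  # +1 for the blank

    def forward(self, phoneme_ids):
        # Duplicate each phoneme four times ("mʌŋki" -> "mmmmʌʌʌʌ...").
        x = phoneme_ids.repeat_interleave(4, dim=1)
        h, _ = self.rnn(self.embed(x))
        return self.out(h).log_softmax(dim=-1)


net = SpellerNet(len(PHONEMES))
ctc_loss = nn.CTCLoss(blank=BLANK)

# One training pair: /mʌŋki/ -> "monkey".
phones = torch.tensor([[0, 1, 2, 3, 4]])  # indices into PHONEMES
target = torch.tensor([[ord(c) - ord("a") + 1 for c in "monkey"]])

log_probs = net(phones).transpose(0, 1)  # CTCLoss expects (T, N, C)
loss = ctc_loss(log_probs, target,
                input_lengths=torch.tensor([log_probs.size(0)]),
                target_lengths=torch.tensor([target.size(1)]))
loss.backward()
```

The duplication step is what caps spellings at four times the transcription length: CTC can emit at most one letter per timestep, and there are exactly four timesteps per phoneme.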
With our network laid out, we can talk a bit about data. A while ago I had used the CMU pronunciation dictionary for a speech synthesis project, and I repurposed that dictionary as network training data. After a bit of preprocessing, I split the dictionary up into 105,139 training samples and 11,474 testing samples (where a sample is a single word). Uncompressed, this data was a bit over 2MB. The neural network, on the other hand, was only 1.8MB. This meant that, at the very least, the network couldn’t simply store an uncompressed copy of the data.
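I won’t reproduce my exact preprocessing, but the gist looks something like the sketch below. The CMU dictionary is written in ARPAbet rather than IPA, so one step is a symbol-for-symbol conversion; the `ARPA_TO_IPA` table here is heavily abbreviated, and the file name and 90/10 split are assumptions:

```python
import random

# Abbreviated ARPAbet -> IPA mapping; the full table covers every CMU symbol.
ARPA_TO_IPA = {"M": "m", "AH": "ʌ", "NG": "ŋ", "K": "k", "IY": "i"}

def parse_line(line):
    # CMU dict entries look like: "MONKEY  M AH1 NG K IY0"
    word, *phones = line.split()
    ipa = "".join(ARPA_TO_IPA[p.rstrip("012")] for p in phones)  # drop stress digits
    return word.lower(), ipa

random.seed(0)
with open("cmudict.txt") as f:
    samples = [parse_line(l) for l in f if l[0].isalpha()]  # skip comment lines
random.shuffle(samples)
split = int(len(samples) * 0.9)  # roughly the 105,139 / 11,474 split above
train, test = samples[:split], samples[split:]
```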
When I went to train the network, I found that it learned quite quickly. In fact, I started by training the network on a single core of a 10-year-old iMac. A single epoch took a few hours, but the result was fairly satisfying. I ran the network on a few random words to see what it could come up with:
Actual word | Input | Output |
---|---|---|
invader | InveIdəɹ | invader |
twelve | twɛlv | twelve |
evaluate | IvæljueIt | evaluate |
guilty | gIlti | gilty |
None of these words have a one-to-one mapping between phonemes and letters. In “invader”, the letter “a” only has the /eI/ sound because it is followed by “de”. In “twelve”, we see the general rule that words cannot end in a “v”. In “evaluate”, an “e” is added to the end of the word to give the last “a” an /eI/ sound. The last example is particularly notable because the network got it wrong. Not many words are spelled like “guilty”, so this mistake is unsurprising (the fact that “guilty” is a common word has no effect; see my note at the bottom about word frequency).
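As an aside, the “Output” column comes from decoding the network’s per-timestep letter probabilities. The actual project’s decoder may differ, but the textbook greedy CTC decode (take the best label at each timestep, collapse repeats, then drop blanks) looks like this, reusing `net` and `phones` from the earlier sketch:

```python
def greedy_ctc_decode(log_probs, blank=0):
    # Pick the most likely label at each timestep, collapse repeats, drop blanks.
    best = log_probs.argmax(dim=-1)[:, 0].tolist()  # first (only) batch element
    letters = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            letters.append(chr(label - 1 + ord("a")))
        prev = label
    return "".join(letters)

print(greedy_ctc_decode(net(phones).transpose(0, 1)))
```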
The trained network had a validation success rate of about 40%, meaning that it could correctly spell 40% of the words it had never seen before. This didn’t seem impressive, but I didn’t want to judge without trying myself. I selected 19 random words from the validation set and tried to spell them based on their IPA transcriptions. It was harder than I expected. The dictionary contains a huge number of uncommon words (city names, plant names, and so on). I was only able to spell 9 out of the 19 words (47%). This should give you an idea of what I was up against:
Phonetics | My guess | Network’s guess | True word |
---|---|---|---|
mʌkweI | muckway | mcway | mcquay |
fɹægmʌnt | fragment | fragment | fragment |
ʌgwangʌ | agwanga | aguanga | aguanga |
pɛtɹoʊmInəɹʌlz | petrominerals | petrominerals | petrominerals |
At this point, I started wondering about overfitting. After training for one epoch, cross-validation indicated that virtually no overfitting had taken place. However, I knew that further training might cause the network to lose its ability to generalize to new words. When you think about it, though, we humans probably overfit spelling ourselves. Most of us can think of times when we’ve overthought the spelling of a word and ended up misspelling it as a result. In some sense, we fixate on weird edge cases, leading us to overthink words with normal spellings. So, as an experiment, I set out to intentionally overfit a network on spelling.
I started by training the existing 1.8MB network for more epochs. Its validation success rate stabilized at around 41%, while the training success rate inched up to 48%. Encouragingly, the validation score never got worse. Eager to force a network to memorize more English spelling, I increased the number of neurons in the network. The new network took up an alarming 7MB, but it still only achieved a 44% validation success rate and a 56% training success rate. Annoyingly, that’s still an F, albeit on the world’s hardest spelling test. My conclusion was that humans aren’t the only “neural” entities that struggle with spelling tests.
One final thing to note is that my experiments weighted every word equally. In reality, English word usage follows a heavy-tailed distribution, and people are undoubtedly better at spelling common words than extremely infrequent ones. In fact, one can sometimes get by with a vocabulary of only 1,000 words. The takeaway is that, while a 50% success rate on uniformly sampled dictionary words might not sound impressive, it could be more than enough in practice.
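To see why, here is a toy illustration of frequency weighting. The numbers are made up: `freq` would come from a real corpus, and `correct` from running the speller on each word:

```python
# Made-up numbers to illustrate frequency weighting.
freq = {"the": 0.05, "guilty": 1e-4, "aguanga": 1e-9}
correct = {"the": True, "guilty": False, "aguanga": False}

unweighted = sum(correct.values()) / len(correct)                            # ~0.33
weighted = sum(f for w, f in freq.items() if correct[w]) / sum(freq.values())  # ~0.998
print(unweighted, weighted)  # the weighted rate is dominated by common words
```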
The code for this project is available on GitHub. You might note from the commit history that this is not a new project: I started implementing this idea months ago, but I had limited training hardware at the time. I only decided to pick the project back up once I was less restricted by hardware.