Can Neural Networks Learn to Spell?

English spelling is extremely difficult. It has few rules, and even they are frequently broken. Despite all this, millions of people write in English every day. How does everyone do it? Is spelling learned by brute memorization, or does the brain do something clever?

Despite what pre-school teachers might tell you, people don’t learn to spell by brute memorization. When we hear a new word—sometimes even a made up word—we can usually guess how it will be spelled. If we simply memorized the spelling of every word, guessing new spellings would be impossible. It’s much more likely that the brain of an English speaker develops a probabilistic system for spelling words. This system might have some hard-wired exceptions, but it’s less rigid than a simple lookup table.

So, if the brain can learn a system for English spelling, can an artificial neural network do the same? Spoiler alert: yes it can.

You might wonder what the input to such a neural network might look like. For example, could we feed the network recordings of speech and expect it to produce spellings? At this point in time, I think it would be unreasonable to expect good results from such an approach. Even modern speech recognition systems first convert speech into phonetic transcriptions before attempting to produce predictions over entire words. So, instead of feeding my network raw sound, I decided to feed it phonetic transcriptions in IPA. IPA is a standard way to write down speech sounds without needing large audio files. As an added bonus, it’s easy to find dictionaries with IPA transcriptions.

After deciding on the input format, I had to design the network itself. Since words can vary in length, I had little choice but to use a Recurrent Neural Network. In fact, the architecture I chose closely resembles the one used in Graves et al., 2006 for speech recognition. To get technical, each phoneme of the input is duplicated four times (“mmmmʌʌʌʌŋŋŋŋkkkkiiii”), two separate RNNs are run on the sequence (one backwards, one forwards), and Connectionist Temporal Classification determines the final network output (i.e. the spelling). This setup has two side effects:

  1. Spellings cannot be more than 4 times the length of the phonetic transcriptions.
  2. The network is encouraged to “align” phonemes with their corresponding letters.

With our network laid out, we can talk a bit about data. A while ago I had used the CMU pronunciation dictionary for a speech synthesis project, and I repurposed that dictionary as network training data. After a bit of preprocessing, I split the dictionary up into 105,139 training samples and 11,474 testing samples (where a sample is a single word). Uncompressed, this data was a bit over 2MB. The neural network, on the other hand, was only 1.8MB. This meant that, in the very least, the network couldn’t simply store an uncompressed copy of the data.

When I went to train the network, I found that it learned quite quickly. In fact, I started by training the network on a single core of a 10-year-old iMac. A single epoch took a few hours, but the result was fairly satisfying. I ran the network on a few random words to see what it could come up with:

Actual word Input Output
invader InveIdəɹ invader
twelve twɛlv twelve
evaluate IvæljueIt evaluate
guilty gIlti gilty

None of these words have a one-to-one mapping between phonemes and letters. In “invader”, the letter “a” only has the /eI/ sound because it is followed by “de”. In “twelve”, we see the general rule that words cannot end in a “v”. In “evaluate”, “e” is added to the end of the word to give the last “a” an /eI/ sound. The last example is particularly notable because the network got it wrong. Not many words are spelled like “guilty”, so this mistake is unsurprising (the fact that “guilty” is a common word has no effect; see my note at the bottom about word frequency).

The trained network had a validation success rate of about 40%, meaning that it could correctly spell 40% of the words it had never seen before. This didn’t seem impressive, but I didn’t want to judge without trying myself. I selected 19 random words from the validation set and tried to spell them based on their IPA transcriptions. It was harder than I expected. The dictionary contains a huge number of uncommon words (e.g., city names, names of plants, etc.). I was only able to spell 9 out of the 19 words (47%). This should give you an idea of what I was up against:

Phonetics My guess Network’s guess True word
mʌkweI muckway mcway mcquay
fɹægmʌnt fragment fragment fragment
ʌgwangʌ agwanga aguanga aguanga
pɛtɹoʊmInəɹʌlz petrominerals petrominerals petrominerals

At this point, I started wondering about overfitting. After training for one epoch, cross-validation indicated that virtually no overfitting had taken place. However, I knew that further training might cause the network to lose its ability to generalize to new words. When you think about it, though, humans probably overfit spelling ourselves. Most of us can think of times when we’ve overthought the spelling of a word and ended up misspelling it as a result. In some sense, we tend to focus on weird edge cases, leading us to overthink words with normal spellings. So, as an experiment, I set out to intentionally overfit a network on spelling.

I started by training the existing 1.8MB network for more epochs. Its validation success rate stabilized at around 41%, while the training success rate inched up to 48%. Encouragingly, the validation score never got worse. Eager to force a network to memorize more English spelling, I increased the number of neurons in the network. The new network took up an alarming 7MB, but it still only achieved a 44% validation success rate and a 56% training success rate. Annoyingly, that’s still an F, albeit on the world’s hardest spelling test. My conclusion was that humans aren’t the only “neural” entities that struggle with spelling tests.

One final thing to note is that my experiments weighted every word equally. In reality, English words are distributed on a heavy-tailed probability distribution. It is undoubtable that people are better at spelling common words than extremely infrequent ones. In fact, one can sometimes get by with only 1000 words. The takeaway is that, while a 50% success rate might not sound impressive, it could be more than enough.

The code for this project is available on Github. You might note from the commit history that this is not a new project. I had started implementing this idea months ago, but I had limited training hardware at the time. I only decided to pick the project back up once I was less restricted by hardware.

Ancient Philosophy as a Classification Problem

I took the first few weeks of an ancient philosophy class and found it extremely frustrating. I generally love philosophy, but ancient philosophy was somehow different. Throughout every reading and every lecture, I couldn’t help but think that ancient philosophy was just a narrow-minded subset of data science. This might sound reductive—perhaps altogether absurd—until you think about what ancient philosophers like Socrates were really after.

Socrates and Plato spent a lot of time trying to get at the essence of concepts like virtue, courage, and happiness. Typically, Socrates would bait someone into defining a broad term like “courage”, then would go on to refute the definition with a string of counter-examples. After having done this, he would conclude that they—Socrates and the collocutor—had no clue what “courage” was.

The TA for the class did a similar thing in our discussion section. He posed a simple question: what is a watch? It turned out that nobody knew. Is a pocket watch a watch? What about a smartwatch? What if you strap your phone to your arm—is that a watch? It went on, case after case, and it seemed more and more difficult to nail down a universal definition. In fact, it seemed that “watchness” had more to do with brand marketing than with the function or structure of the product. The newest Fitbit is essentially a smartwatch, yet people don’t call it that. What is it that makes the Moto 360 a watch while the Fitbit is not? None of us could figure it out.

What frustrated me at the time, and probably led me to drop the class, was that philosophers were approaching the problem from the wrong direction. Defining terms is—and always has been—a simple matter of classification. Data scientists have been dealing with this problem for years. In the case of a watch, we want a function that maps an object to a yes-or-no label (”watch” or “no watch”). The function could be anything: a neural network, a decision tree, an SVM, you name it. However, for some reason, philosophers only seemed to want to use one kind of classifier: a sentence.

Essentially, philosophers like Socrates were trying to make classifiers out of language. They wanted a definition that could classify every instance correctly. So, as a data scientist, I am not surprised that Socrates struggled so much. Of course a human knows what a watch is; we are essentially huge neural networks. However, there is absolutely no reason to assume that our neural circuitry could be compressed down to less than a kilobyte (the size of a lengthy paragraph).

With this being said, definitions can be extremely useful. Just like simple regression models, definitions are low variance. Basically, a definition gives you a “good sense” of a categorization without needing tons of data. However, one should never expect a definition to fulfill the duties of a high variance classifier like a deep neural network. Definitions cannot and should not cover every edge case. Capturing the intricacies of meaning is a job for a complex model like the human brain. That, I believe, is what Socrates was missing.