Random Fun with Linear SVMs

I was first introduced to Support Vector Machines by MIT’s AI course. The lecturer did a lot of math, arrived at a quadratic optimization problem, and then said “solving this is a job for a numerical analyst” (paraphrased). Discouraged, I went a while without implementing SVMs at all. Even after I had learned some numerical analysis, I still tended to prefer conceptually simpler models like neural networks or decision trees.

A few days ago, I came to a realization that totally changed my feelings about SVMs. In particular, I realized that I could use the hinge loss to train a regular neural network. Like any loss function, the hinge loss measures how well a model classifies inputs. However, it has a particular property that makes it special: when you minimize the hinge loss, you are essentially training an SVM.

Suddenly, I had a way to work SVMs into my existing ML framework. An SVM can be trained just like a neural network, provided you use the right loss function (combined with L2 regularization). All the complicating factors, like kernels and quadratic programming, no longer seemed relevant. I finally had a way to benefit from SVMs without the complications that seemed to come along with them.

Implementing the hinge loss in my own ML framework was fairly simple, although I had trouble deciding which version of the multi-class hinge loss to use. Annoyingly, Wikipedia only mentioned one variant of said loss. Thanks to me, that is now fixed.
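For a rough idea of what the two options look like, here is an illustrative NumPy sketch of the two common multi-class variants (this is just the math written out, not the code from my framework):

```python
import numpy as np

def weston_watkins_hinge(scores, label, margin=1.0):
    """Sum the margin violations of every incorrect class."""
    violations = np.maximum(0.0, scores - scores[label] + margin)
    violations[label] = 0.0  # the true class contributes no loss
    return violations.sum()

def crammer_singer_hinge(scores, label, margin=1.0):
    """Only count the single worst margin violation."""
    violations = np.maximum(0.0, scores - scores[label] + margin)
    violations[label] = 0.0
    return violations.max()

# Three-class example where class 0 is the true label.
scores = np.array([2.0, 1.5, -0.3])
print(weston_watkins_hinge(scores, 0))  # 0.5: only class 1 violates the margin
print(crammer_singer_hinge(scores, 0))  # 0.5
```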

Once I had an implementation of the hinge loss, I trained a linear SVM on the MNIST handwriting dataset. The results were amazing! After less than a second of training, the model had a 92% validation accuracy. But I didn’t want to stop there.

Until this week, I had never used the CIFAR dataset. If you aren’t familiar with it, CIFAR is a database of labeled color images. For example, there are thousands of 32×32 images of cars, each labeled “automobile”. I trained a linear SVM on CIFAR-10 (a CIFAR dataset with 10 categories). The resulting SVM was able to classify 38% of images correctly, which is pretty impressive for such a simple model.
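To give an idea of how little machinery is involved, here is a sketch using scikit-learn’s SGDClassifier, which minimizes the hinge loss with L2 regularization, i.e. it trains a linear SVM by gradient descent. The data loading is a placeholder, and this is not the code behind the numbers above:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Placeholder data: stand-ins for flattened 32x32x3 CIFAR-10 images in [0, 1]
# with integer labels in 0..9. Swap in the real dataset to get real numbers.
X_train, y_train = np.random.rand(1000, 3072), np.random.randint(0, 10, 1000)
X_test, y_test = np.random.rand(200, 3072), np.random.randint(0, 10, 200)

# loss="hinge" + penalty="l2" is exactly the linear SVM objective, and plain
# SGD keeps training time down to seconds.
svm = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=20)
svm.fit(X_train, y_train)
print("validation accuracy:", svm.score(X_test, y_test))
```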

Unlike most ML models, a linear SVM is easy to visualize. For each class, the SVM learns a weight vector describing samples in that class. We can treat these vectors as images to see exactly what the SVM is looking for:

[Learned class-weight images: Car, Boat, Frog, Horse]

You can probably see why the car image looks the way it does. The boat image is also fairly intuitive: most boats are brown-ish objects floating in a blue-ish ocean. The frog is clear as well: frogs are approximately green blobs. If you squint hard enough at the horse, you can probably explain it too.
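The visualization itself takes only a few lines. Here is a sketch, assuming the learned weights are laid out as one row per class and the images were flattened in height-width-channel order (the class names are CIFAR-10’s; the random weights are a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt

def class_vector_to_image(weights):
    """Rescale a 3072-dim class weight vector into a viewable 32x32 RGB image."""
    img = weights.reshape(32, 32, 3)
    return (img - img.min()) / (img.max() - img.min() + 1e-12)

# Placeholder: in practice, use the trained model's weight matrix
# (e.g. svm.coef_ from scikit-learn), one 3072-dim row per class.
class_weights = np.random.randn(10, 3072)
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

for i, name in enumerate(class_names):
    plt.subplot(2, 5, i + 1)
    plt.imshow(class_vector_to_image(class_weights[i]))
    plt.title(name)
    plt.axis("off")
plt.show()
```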

Perhaps the nicest thing about linear SVMs is that they learn so quickly. The CIFAR SVM took about 10 seconds to train. The MNIST SVM took less than a second. That’s something you don’t get from the typical neural network.

Now that I have had some fun with linear SVMs, I hope that they will become part of my everyday ML process. At the very least, they can often give a good baseline for classification accuracy.

Can Neural Networks Learn to Spell?

English spelling is extremely difficult. It has few rules, and even they are frequently broken. Despite all this, millions of people write in English every day. How does everyone do it? Is spelling learned by brute memorization, or does the brain do something clever?

Despite what pre-school teachers might tell you, people don’t learn to spell by brute memorization. When we hear a new word—sometimes even a made-up word—we can usually guess how it will be spelled. If we simply memorized the spelling of every word, guessing new spellings would be impossible. It’s much more likely that the brain of an English speaker develops a probabilistic system for spelling words. This system might have some hard-wired exceptions, but it’s less rigid than a simple lookup table.

So, if the brain can learn a system for English spelling, can an artificial neural network do the same? Spoiler alert: yes it can.

You might wonder what the input to such a neural network might look like. For example, could we feed the network recordings of speech and expect it to produce spellings? At this point in time, I think it would be unreasonable to expect good results from such an approach. Even modern speech recognition systems first convert speech into phonetic transcriptions before attempting to produce predictions over entire words. So, instead of feeding my network raw sound, I decided to feed it phonetic transcriptions in IPA. IPA is a standard way to write down speech sounds without needing large audio files. As an added bonus, it’s easy to find dictionaries with IPA transcriptions.
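Concretely, the network’s input can be as simple as one one-hot vector per IPA symbol. A tiny sketch (the symbol inventory here is abbreviated; a real one would cover every symbol in the dictionary):

```python
import numpy as np

# Abbreviated symbol inventory, just for illustration.
IPA_SYMBOLS = ["m", "ʌ", "ŋ", "k", "i", "ɪ", "n", "v", "eɪ", "d", "ə", "ɹ"]
SYMBOL_TO_INDEX = {s: i for i, s in enumerate(IPA_SYMBOLS)}

def encode_ipa(phonemes):
    """Turn a list of IPA symbols into a (length, num_symbols) one-hot matrix."""
    onehots = np.zeros((len(phonemes), len(IPA_SYMBOLS)))
    for t, symbol in enumerate(phonemes):
        onehots[t, SYMBOL_TO_INDEX[symbol]] = 1.0
    return onehots

# "monkey" is /mʌŋki/, so it becomes a 5x12 matrix of one-hot rows.
print(encode_ipa(["m", "ʌ", "ŋ", "k", "i"]).shape)  # (5, 12)
```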

After deciding on the input format, I had to design the network itself. Since words can vary in length, I had little choice but to use a Recurrent Neural Network. In fact, the architecture I chose closely resembles the one used in Graves et al., 2006 for speech recognition. To get technical, each phoneme of the input is duplicated four times (“mmmmʌʌʌʌŋŋŋŋkkkkiiii”), two separate RNNs are run on the sequence (one backwards, one forwards), and Connectionist Temporal Classification (CTC) determines the final network output (i.e. the spelling); a rough sketch follows the list below. This setup has two side effects:

  1. Spellings cannot be more than 4 times the length of the phonetic transcriptions.
  2. The network is encouraged to “align” phonemes with their corresponding letters.
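Here is a rough PyTorch sketch of that layout. This is not my actual implementation (which lives in my own framework); the hidden size, alphabet handling, and letter indices are stand-ins:

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 50   # size of the IPA symbol inventory (assumed)
NUM_LETTERS = 26    # output alphabet a-z; CTC adds its own blank symbol
DUPLICATION = 4     # each input phoneme is repeated four times

class SpellingNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # A bidirectional LSTM plays the role of the forward and backward RNNs.
        self.rnn = nn.LSTM(NUM_PHONEMES, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, NUM_LETTERS + 1)  # +1 for the CTC blank

    def forward(self, phoneme_onehots):
        # Duplicate each phoneme so the spelling can be up to 4x longer.
        x = phoneme_onehots.repeat_interleave(DUPLICATION, dim=1)
        hidden_states, _ = self.rnn(x)
        return self.out(hidden_states).log_softmax(dim=-1)

# One CTC training step on a single 5-phoneme word ("monkey" -> 6 letters).
model = SpellingNet()
ctc_loss = nn.CTCLoss(blank=NUM_LETTERS)
phonemes = torch.zeros(1, 5, NUM_PHONEMES)         # placeholder one-hot input
targets = torch.tensor([[12, 14, 13, 10, 4, 24]])  # letter indices for "monkey"
log_probs = model(phonemes).transpose(0, 1)        # CTCLoss expects (T, N, C)
loss = ctc_loss(log_probs, targets,
                torch.tensor([5 * DUPLICATION]),   # input length after duplication
                torch.tensor([6]))                 # target spelling length
loss.backward()
```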

With our network laid out, we can talk a bit about data. A while ago I had used the CMU pronunciation dictionary for a speech synthesis project, and I repurposed that dictionary as network training data. After a bit of preprocessing, I split the dictionary up into 105,139 training samples and 11,474 testing samples (where a sample is a single word). Uncompressed, this data was a bit over 2MB. The neural network, on the other hand, was only 1.8MB. This meant that, at the very least, the network couldn’t simply store an uncompressed copy of the data.
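The split itself was nothing fancy. It looked something like this (the preprocessing is omitted, and `entries` is a placeholder for the parsed dictionary):

```python
import random

# Placeholder: `entries` would be (phonemes, spelling) pairs parsed from the
# preprocessed pronunciation dictionary.
entries = [(["m", "ʌ", "ŋ", "k", "i"], "monkey")] * 116613

random.seed(0)
random.shuffle(entries)
num_test = 11474  # hold out roughly 10% of the words for validation
train_samples, test_samples = entries[:-num_test], entries[-num_test:]
print(len(train_samples), len(test_samples))  # 105139 11474
```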

When I went to train the network, I found that it learned quite quickly. In fact, I started by training the network on a single core of a 10-year-old iMac. A single epoch took a few hours, but the result was fairly satisfying. I ran the network on a few random words to see what it could come up with:

| Actual word | Input | Output |
| --- | --- | --- |
| invader | ɪnveɪdəɹ | invader |
| twelve | twɛlv | twelve |
| evaluate | ɪvæljueɪt | evaluate |
| guilty | gɪlti | gilty |

None of these words have a one-to-one mapping between phonemes and letters. In “invader”, the letter “a” only has the /eɪ/ sound because it is followed by “de”. In “twelve”, we see the general rule that words cannot end in a “v”. In “evaluate”, “e” is added to the end of the word to give the last “a” an /eɪ/ sound. The last example is particularly notable because the network got it wrong. Not many words are spelled like “guilty”, so this mistake is unsurprising (the fact that “guilty” is a common word has no effect; see my note at the bottom about word frequency).

The trained network had a validation success rate of about 40%, meaning that it could correctly spell 40% of the words it had never seen before. This didn’t seem impressive, but I didn’t want to judge without trying myself. I selected 19 random words from the validation set and tried to spell them based on their IPA transcriptions. It was harder than I expected. The dictionary contains a huge number of uncommon words (e.g., city names, names of plants, etc.). I was only able to spell 9 out of the 19 words (47%). This should give you an idea of what I was up against:

| Phonetics | My guess | Network’s guess | True word |
| --- | --- | --- | --- |
| mʌkweɪ | muckway | mcway | mcquay |
| fɹægmʌnt | fragment | fragment | fragment |
| ʌgwangʌ | agwanga | aguanga | aguanga |
| pɛtɹoʊmɪnəɹʌlz | petrominerals | petrominerals | petrominerals |

At this point, I started wondering about overfitting. After training for one epoch, cross-validation indicated that virtually no overfitting had taken place. However, I knew that further training might cause the network to lose its ability to generalize to new words. When you think about it, though, we humans probably overfit spelling ourselves. Most of us can think of times when we’ve overthought the spelling of a word and ended up misspelling it as a result. In some sense, we focus on weird edge cases, which leads us to overthink words with normal spellings. So, as an experiment, I set out to intentionally overfit a network on spelling.

I started by training the existing 1.8MB network for more epochs. Its validation success rate stabilized at around 41%, while the training success rate inched up to 48%. Encouragingly, the validation score never got worse. Eager to force a network to memorize more English spelling, I increased the number of neurons in the network. The new network took up an alarming 7MB, but it still only achieved a 44% validation success rate and a 56% training success rate. Annoyingly, that’s still an F, albeit on the world’s hardest spelling test. My conclusion was that humans aren’t the only “neural” entities that struggle with spelling tests.

One final thing to note is that my experiments weighted every word equally. In reality, English word frequencies follow a heavy-tailed distribution, and people are undoubtedly better at spelling common words than extremely infrequent ones. In fact, one can sometimes get by with only 1000 words. The takeaway is that, while a 50% success rate might not sound impressive, it could be more than enough in practice.

The code for this project is available on GitHub. You might note from the commit history that this is not a new project. I had started implementing this idea months ago, but I had limited training hardware at the time. I only decided to pick the project back up once I was less restricted by hardware.

Ancient Philosophy as a Classification Problem

I took the first few weeks of an ancient philosophy class and found it extremely frustrating. I generally love philosophy, but ancient philosophy was somehow different. Throughout every reading and every lecture, I couldn’t help but think that ancient philosophy was just a narrow-minded subset of data science. This might sound reductive—perhaps altogether absurd—until you think about what ancient philosophers like Socrates were really after.

Socrates and Plato spent a lot of time trying to get at the essence of concepts like virtue, courage, and happiness. Typically, Socrates would bait someone into defining a broad term like “courage”, then would go on to refute the definition with a string of counter-examples. After having done this, he would conclude that they—Socrates and his interlocutor—had no clue what “courage” was.

The TA for the class did a similar thing in our discussion section. He posed a simple question: what is a watch? It turned out that nobody knew. Is a pocket watch a watch? What about a smartwatch? What if you strap your phone to your arm—is that a watch? It went on, case after case, and it seemed more and more difficult to nail down a universal definition. In fact, it seemed that “watchness” had more to do with brand marketing than with the function or structure of the product. The newest Fitbit is essentially a smartwatch, yet people don’t call it that. What is it that makes the Moto 360 a watch while the Fitbit is not? None of us could figure it out.

What frustrated me at the time, and probably led me to drop the class, was that philosophers were approaching the problem from the wrong direction. Defining terms is—and always has been—a simple matter of classification. Data scientists have been dealing with this problem for years. In the case of a watch, we want a function that maps an object to a yes-or-no label (“watch” or “no watch”). The function could be anything: a neural network, a decision tree, an SVM, you name it. However, for some reason, philosophers only seemed to want to use one kind of classifier: a sentence.
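To make the contrast concrete, this is the sort of classifier a data scientist would reach for. The features and labels here are entirely made up; the point is only that the “definition” ends up encoded in a model rather than a sentence:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up features: [worn_on_wrist, tells_time, has_touchscreen, marketed_as_watch]
objects = [
    [1, 1, 0, 1],  # classic wristwatch
    [0, 1, 0, 1],  # pocket watch
    [1, 1, 1, 1],  # smartwatch
    [1, 1, 1, 0],  # fitness tracker
    [0, 1, 1, 0],  # phone strapped to an arm
]
is_watch = [1, 1, 1, 0, 0]

classifier = DecisionTreeClassifier().fit(objects, is_watch)
print(classifier.predict([[1, 1, 1, 0]]))  # verdict on a new gadget
```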

Essentially, philosophers like Socrates were trying to make classifiers out of language. They wanted a definition that could classify every instance correctly. So, as a data scientist, I am not surprised that Socrates struggled so much. Of course a human knows what a watch is; we are essentially huge neural networks. However, there is absolutely no reason to assume that our neural circuitry could be compressed down to less than a kilobyte (the size of a lengthy paragraph).

That being said, definitions can be extremely useful. Just like simple regression models, definitions are low-variance. Basically, a definition gives you a “good sense” of a categorization without needing tons of data. However, one should never expect a definition to fulfill the duties of a high-variance classifier like a deep neural network. Definitions cannot and should not cover every edge case. Capturing the intricacies of meaning is a job for a complex model like the human brain. That, I believe, is what Socrates was missing.