Adversarial Train/Test Splits

The MNIST dataset is split into 60K training images and 10K test images. Everyone pretty much assumes that the training images are enough to learn how to classify the test images, but this is not necessarily the case. So today, in the theme of being a contrarian, I’ll show you how to split the 70K MNIST images in an adversarial manner, such that models trained on the adversarial training set perform poorly on the adversarial test set.

Class imbalance is the most obvious way to make an adversarial split. If you put all the 2’s in the test set, then a model trained on the training set will miss all those 2’s. However, since this answer is fairly boring and doesn’t say very much, let’s assume for the rest of this post that we want class balance. In other words, we want the test set to contain 1K examples of each digit. This also means that, just by looking at the labels, you won’t be able to tell that the split is adversarial.

Here’s another idea: we can train a model on all 70K digits, then get the test set by picking the digits that have the worst loss. What we end up with is a test set that contains the “worst” 1K examples of each class. It turns out that this approach gives good results, even with fairly simple models.
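
Here is a minimal sketch of that selection step in Go. It assumes you already have a per-example loss from a model trained on all 70K images; the MNIST loading and model code are omitted, and the numbers in main are made up for illustration:

package main

import (
	"fmt"
	"sort"
)

// worstPerClass returns the indices of the n highest-loss examples in
// each class. These indices form the adversarial test set; everything
// else becomes the training set.
func worstPerClass(labels []int, losses []float64, numClasses, n int) []int {
	byClass := make([][]int, numClasses)
	for i, label := range labels {
		byClass[label] = append(byClass[label], i)
	}
	var test []int
	for _, idxs := range byClass {
		// Sort each class's examples from highest loss to lowest.
		sort.Slice(idxs, func(a, b int) bool {
			return losses[idxs[a]] > losses[idxs[b]]
		})
		test = append(test, idxs[:n]...)
	}
	return test
}

func main() {
	// Toy demo: six examples, two classes, keep the worst one per class.
	labels := []int{0, 1, 0, 1, 0, 1}
	losses := []float64{0.1, 2.3, 0.9, 0.2, 0.4, 0.7}
	fmt.Println(worstPerClass(labels, losses, 2, 1)) // prints [2 1]
}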

When I tried the above idea with a small, single-layer MLP, the results were surprisingly good. The model, while extremely simple, gets 97% test accuracy on the standard MNIST split. On the adversarial split, however, it only gets 81% test accuracy. Additionally, if you actually look at the samples in the training set versus those in the test set, the difference is quite noticeable. The test set consists of a bunch of distorted, smudgy, or partially-drawn digits.

Example images from the adversarial train/test split.
Training samples (left), testing samples (right).

In a sense, the result I just described was obtained by cheating. I used a certain architecture to generate an adversarial split, and then I used the same architecture to test the effectiveness of that split. Perhaps I just made a split that’s hard for an MLP, but not hard in general. To truly demonstrate that the split is adversarial, I needed to test it on a different architecture. To this end, I tried a simple CNN that gets 99% accuracy on the standard test set. On the MLP-generated adversarial split, this model gets a whopping 87% test accuracy. So, it’s pretty clear that the adversarial split works across architectures.

There are plenty of directions to take this from here. The following approaches could probably be used to generate even worse splits:

  • Use an ensemble of architectures to select the adversarial test set.
  • Use some kind of iterative process, continually making the split worse and worse.
  • Try some kind of clustering algorithm to make a test set that comes from a fundamentally different distribution than the training set.

Maybe this result doesn’t mean anything, but I still think it’s cool. It shows that the way you choose a train/test split can matter. It also shows that, perhaps, MNIST is not as uniform as people think.

The code for all of this can be found here.

Decision Trees as RL Policies

In supervised learning, there are very good “shallow” models like XGBoost and SVMs. These models can learn powerful classifiers without an artificial neuron in sight. So why, then, is modern reinforcement learning totally dominated by neural networks? My answer: no good reason. And now I want to show everyone that shallow architectures can do RL too.

Right now, using absolutely no feature engineering, I can train an ensemble of decision trees to play various video games from the raw pixels. The performance isn’t comparable to deep RL algorithms yet, and it may never be for vision-based tasks (for good reason!), but it’s fairly impressive nonetheless.

A tree ensemble playing Atari Pong

So how exactly do I train shallow models on RL tasks? You might have a few ideas, and so did I. Today, I’ll just be telling you about the one that actually worked.

The algorithm itself is so simple that I’m almost embarrassed by my previous (failed) attempts at tree-based RL. Essentially, I use gradient boosting with gradients from a policy gradient estimator. I call the resulting algorithm policy gradient boosting. In practice, I use a slightly more complex algorithm (a tree-based variant of PPO), but there is probably plenty of room for simplification.

With policy gradient boosting, we build up an ensemble of trees in an additive fashion. For every batch of experience, we add a few more trees to our model, making minor adjustments in the direction of the policy gradient. After hundreds or even thousands of trees, we can end up with a pretty good policy.
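
To make that concrete, here is a stripped-down sketch in Go. It is not the treeagent code: it assumes the per-sample gradients of the action logits (from whatever policy gradient estimator you like) have already been computed, and it fits a depth-1 stump on a fixed feature and threshold instead of growing a real tree:

package main

import "fmt"

// Stump is a depth-1 regression "tree": it thresholds one feature and
// outputs one of two constant logit adjustments.
type Stump struct {
	Feature   int
	Threshold float64
	Low, High []float64
}

func (s *Stump) Apply(obs []float64) []float64 {
	if obs[s.Feature] < s.Threshold {
		return s.Low
	}
	return s.High
}

// Logits evaluates the ensemble: it sums the stumps' outputs, scaled by
// a small step size, to produce action logits for a softmax policy.
func Logits(ensemble []*Stump, step float64, obs []float64, numActions int) []float64 {
	logits := make([]float64, numActions)
	for _, s := range ensemble {
		for i, v := range s.Apply(obs) {
			logits[i] += step * v
		}
	}
	return logits
}

// fitStump fits a stump to the per-sample gradients: it splits the batch
// on the given feature and threshold, then uses the mean gradient on
// each side as that side's output.
func fitStump(obs, grads [][]float64, feature int, threshold float64) *Stump {
	numActions := len(grads[0])
	s := &Stump{
		Feature:   feature,
		Threshold: threshold,
		Low:       make([]float64, numActions),
		High:      make([]float64, numActions),
	}
	lowCount, highCount := 0.0, 0.0
	for i, o := range obs {
		if o[feature] < threshold {
			lowCount++
			for j, g := range grads[i] {
				s.Low[j] += g
			}
		} else {
			highCount++
			for j, g := range grads[i] {
				s.High[j] += g
			}
		}
	}
	for j := range s.Low {
		if lowCount > 0 {
			s.Low[j] /= lowCount
		}
		if highCount > 0 {
			s.High[j] /= highCount
		}
	}
	return s
}

func main() {
	// Two fake observations and their per-sample policy gradients with
	// respect to the action logits (these numbers are made up).
	obs := [][]float64{{0.2, 0.9}, {0.8, 0.1}}
	grads := [][]float64{{1, -1}, {-1, 1}}

	// One boosting round: add a stump that nudges the logits in the
	// direction of the policy gradient.
	var ensemble []*Stump
	ensemble = append(ensemble, fitStump(obs, grads, 0, 0.5))
	fmt.Println(Logits(ensemble, 0.1, obs[0], 2)) // prints [0.1 -0.1]
}

The real algorithm grows proper trees and uses a PPO-style objective, but the additive structure is the same: every batch of experience contributes a few more trees.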

Now that I’ve found an algorithm that works pretty well, I want to figure out better hyper-parameters for it. I doubt that tree-based PPO is the best (or even a good) technique, and I doubt that my regularization heuristic is very good either. Yet, even with these somewhat random choices, my models perform well on very difficult tasks! This has convinced me that shallow architectures could really disrupt the modern RL landscape, given the proper care and attention.

All the code for this project is in my treeagent repository, and there are some video demonstrations up on this YouTube playlist. If you’re interested, feel free to contribute to treeagent on Github or send me a PM on Twitter.

Keeping Tabs On All My Neural Networks

When I’m out in public, I look at my watch a lot. It’s not because I’m nervous, or because I’m obsessed with the time. It’s because I’m checking on my neural networks.

At any given moment, I probably have four different machines crunching through ML tasks (e.g. training neural networks or downloading data). To keep tabs on all these machines, I use my own logging system called StatusHub. With StatusHub, I can use my phone, my watch, my tablet, or my laptop to see logs from every job across all of my machines. On my watch, I see a scrollable list that looks like this:

On my phone or laptop, I can see the same log through a web UI:

I can even view the log through the command-line, but I won’t bore you with a picture of that one.

Pushing log messages

You push log messages to a StatusHub server with the sh-log command. Without sh-log, I might train a neural network like so:

$ go run *.go
2017/07/03 18:32:57 done 4002029568 updates: cost=0.108497
2017/07/03 18:32:58 done 4002168832 updates: cost=0.114127
2017/07/03 18:32:59 done 4002308096 updates: cost=0.109726
...

As you can see, the program already produces log messages, but annoyingly they only go to standard error. To push the messages to StatusHub, we can simply use sh-log:

$ sh-log TrainEmbedding go run *.go
2017/07/03 18:32:57 done 4002029568 updates: cost=0.108497
2017/07/03 18:32:58 done 4002168832 updates: cost=0.114127
2017/07/03 18:32:59 done 4002308096 updates: cost=0.109726
...

In the above example, sh-log executes go run *.go and echoes the standard output/error to a StatusHub server (which is configured via environment variables). The first argument to sh-log is the service name, which helps to distinguish between different jobs in the log. If you look back at the screenshots from the beginning of this post, the service names should stand out right away.

The sh-log command also plays nicely with UNIX pipes. If you don’t provide a command for sh-log to run, it reads directly from standard input. For example, this is how I log information about my GPU:

$ nvidia-smi | head -n 9 | tail -n 1 | cut -b 3-77 | sed -e 's/\s\s*/ /g' | sh-log GPU
23% 35C P8 10W / 250W | 63MiB / 12186MiB | 0% Default

Viewing logs

The simplest way to view a log is via the StatusHub web UI or through the Android Wear application. However, StatusHub also ships with some commands for reading and manipulating logs.

To dump the log for a given service, there is the sh-dump command:

$ sh-dump tweetdump
...
-rw-r--r--  1 alex  staff  12141949085 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142001648 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142061169 Jul  3 19:10 tweets.csv
-rw-r--r--  1 alex  staff  12142116283 Jul  3 19:10 tweets.csv

You can also use the sh-stream command to view the output of a service in real time, for example:

$ sh-stream tweetdump
-rw-r--r--  1 alex  staff  12141949085 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142001648 Jul  3 19:09 tweets.csv
-rw-r--r--  1 alex  staff  12142061169 Jul  3 19:10 tweets.csv
-rw-r--r--  1 alex  staff  12142116283 Jul  3 19:10 tweets.csv
...

My favorite tool, though, is sh-avg. Using sh-avg, you can compute the averages of numerical fields over the last several log messages. For example, to average the results from the “TrainEmbedding” service:

$ sh-avg TrainEmbedding
size 10: cost=0.108642
size 20: cost=0.108811
size 50: cost=0.108578

You can also specify a particular average size (i.e. the number of log records to average):

$ sh-avg TrainEmbedding 32
size 32: cost=0.108722

If you want to be able to quickly see averages from your phone or smartwatch, you can set up a job to log the averages of another job:

$ while (true) do sh-log EmbedAvg sh-avg TrainEmbedding 30; sleep 30; done

As you can see, StatusHub allows you to be a command-line ninja with magical logging powers.

Going crazy

Once you have basic building blocks like sh-log and sh-stream, the possibilities are boundless.

With a pipe-based IRC client like ii, you can push chat logs to StatusHub in one terminal command. This makes it easy to keep tabs on IRC activity, even from devices without an IRC client (e.g. a smartwatch).

You could also pipe sh-stream into ii in order to send log messages to someone on IRC. This might not seem useful, but it actually could be. For example, say you want to be notified when a process finishes running. You could run this in one terminal:

$ ./some_long_task; sh-log Notify echo 'Task done!'

And then in some other terminal, perhaps on some other machine, run something like this:

$ sh-stream Notify | send-irc-messages

Using StatusHub yourself

The StatusHub repository has official installation instructions, but I thought I’d give a gist here as well. There are really three parts to a successful StatusHub installation:

  1. An sh-server process running on some internet-accessible machine. All of your devices should be able to connect to this machine over HTTP/HTTPS, either through a reverse proxy or via port forwarding.
  2. A set of jobs that do logging via the sh-log command. To have sh-log go to the correct server, you will need to set some environment variables.
  3. One or more devices from which you will consume logs. These devices simply need a browser, but you can also install the StatusHub commands and set up your environment variables accordingly.

Why I’m Remaking OpenAI Universe

OpenAI Universe was very exciting when it first came out. With Universe, it was suddenly possible to train AI to play Flash games, book flights online, and perform various other real-world tasks. Unfortunately, Universe never really took off in the AI world. I don’t know why others aren’t using Universe, but I can tell you why I’m not using it and why I am working hard to build an alternative.

Since Universe has a lot of moving pieces, I’ll give a basic overview of how it works. Every Universe environment runs an instance of Chrome inside a Docker container. An AI interacts with the environment using VNC, which can send keyboard and mouse inputs to Chrome and receive back screen images in real time. To train RL agents on Flash games, Universe uses OCR to read the score off the screen and send it back to the agent as a reward signal. A program in the container controls game flow by looking at the screen and waiting for menus and “game over” pages.

In my view, the biggest problem with Universe is that VNC and Flash need to run in real time. This means that any hiccups on your training machine (e.g. a CPU spike due to a software update) might suddenly change the frame rate at which your AI experiences its virtual environment. It also means that you can’t run Universe at all on a slow machine. This rules out a lot of cloud hosting, such as many EC2 t2 instances.

Since Flash games are hard to reverse-engineer, Universe has no choice but to look at the screen to figure out the state of a game. This approach isn’t very stable. Imagine a game where the “game over” screen fades in gradually, but the “Submit Highscore” button on said screen can be pressed before the fade is complete. If you are running a random agent (one which clicks randomly), it might hit “Submit Highscore” before Universe is able to detect the “game over” screen. The result is that a strange new screen will pop up instead of the “game over” screen, and the Universe infrastructure will crash. I talk more about this flaw in this unresolved Github issue. This is a real issue because most RL algorithms start by taking random actions to explore the environment.

On top of the problems I just mentioned, it seems that OpenAI has internally abandoned Universe. When OpenAI unveiled Universe six months ago, they said the following in their blog post:

In upcoming weeks, we’ll release our environment integration tools, so anyone can contribute new environment integrations.

I guess weeks turned into months turned into never. Those tools are still not available. In the same blog post, they also promised demonstration data:

We’re compiling a large dataset of human demonstrations on Universe environments, which will be released publicly.

That never happened either, which saddens me since I would have loved to get my hands on that data.

Enter μniverse, my alternative to OpenAI Universe. Unlike Universe, μniverse focuses on HTML5 games. Since HTML5 games are written in JavaScript, it is possible to inject new code into them. This makes it much easier (and more reliable) to extract the score and “game over” status from a game. Menu automation is also simpler for HTML5 games, since you can trigger event handlers directly without worrying about the UI. The biggest benefit of HTML5 games, though, is that you can spoof time.

All JavaScript applications use the same APIs for time: Date, setTimeout, requestAnimationFrame, etc. If you inject your own implementation of these APIs, you can completely control how fast time appears to pass. This is exactly what I do in μniverse. Using a time spoofing API, games in μniverse run as fast or as slow as you want them to. You have to explicitly tell the environment to advance time; it won’t happen behind your back. This means that the speed of your machine has no bearing on the AI’s perception of its environment.

In μniverse, it is easy to guarantee that all actions are “safe”. In other words, μniverse ensures that there is nothing the agent can do to escape or break its environment. For example, it is easy to disable things like the “Back to Main Menu” button present in many games. Universe attempted to do this by forbidding clicks at certain locations, but that feature is undocumented and not fully supported.

I’ve been adding games to μniverse for a few weeks now. Right now I’ve integrated 27 games, and I expect that number to keep rising (update Jun 26, 2017: up to 54 games now). As I go, I’ve been training AI agents on some of the new games. It’s very satisfying to see AI perform better than me at some games. Throughout this process, I have to say that μniverse has felt more stable than Universe (I’ve trained games using both systems). μniverse is also much faster in terms of environment boot times, but that’s just a bonus.

I’d like to close with a nerdy technical note. While μniverse still runs Chrome inside of Docker, it runs Chrome in the new headless mode. This means that VNC is not needed. Instead, μniverse uses the Chrome DevTools protocol to capture screenshots, trigger events, navigate to webpages, and inject JavaScript. Hooking directly into Chrome like this is simpler and much more stable. It feels like a very elegant solution, and it has proven to be so in practice.

The Meta-Learning Quest: Part 1

Over the course of billions of years, a crude meta-learning algorithm was able to produce the brain. This meta-learning algorithm, known as biological evolution, is slow and inefficient. Nevertheless, it produced organisms that can learn to solve complex problems in a matter of seconds. In essence, a slow and inefficient learning algorithm produced a fast and efficient one. That’s the beauty of meta-learning, and it’s what I believe will lead to strong AI.

Today, in 2017, meta-learning is still in its infancy. There are few papers published on the subject, and they all approach fairly easy problems. For more details on this, I have a YouTube video that describes the current state of meta-learning.

This is where my quest comes in. Meta-learning is far from producing strong AI, but I want to take it several steps closer.

The first step in my quest was to develop a meta-learning algorithm that could plausibly scale up to human-level intelligence. To this end, I developed a memory-augmented neural network (MANN) which I call sgdstore. Unlike other MANNs, sgdstore works by training a neural network dynamically and using the ever-changing network as a large memory bank. I have had great successes with sgdstore, and I believe it’s ready for harder challenges.

As a meta-learning algorithm, evolution had to produce adaptable creatures that could succeed in many diverse situations. If we want to use meta-learning to produce strong AI, we will need our own set of diverse challenges for meta-learning to solve. For this set of challenges, I am looking to OpenAI Universe. Universe has over 1,000 virtual environments, many of which are video games. If I could use meta-learning to teach an sgdstore model to play new video games quickly, that would be a huge step forward. After all, life isn’t very far from a video game.

By the way, the idea of using meta-learning on OpenAI Universe is not a brand-new idea. Ilya Sutskever gave a talk in 2016 about this very subject. However, while the idea is simple, there are many technical obstacles in the way. I suspect that Evolution Strategies was part of OpenAI’s approach to overcoming these obstacles. As I will describe, I am taking a slightly different path.

Among the many technical difficulties involved in meta-learning, one is particularly troublesome: memory consumption. My sgdstore model is trained via back-propagation, the traditional algorithm for training neural networks. As it runs, back-propagation needs to store intermediate values. This means that, as episodes of training get longer and longer, memory consumption grows worse and worse. To give you an idea, the meta-learning tasks with which I’ve tested sgdstore are on the order of 100 time-steps. If a video game is played at 10 frames-per-second, then 5 minutes of play would amount to 3,000 time-steps. Without any modification to my algorithm, this would be infeasible.

Luckily, there are algorithms to make back-propagation use asymptotically less memory. In the extreme, it’s possible to make memory consumption grow logarithmically with episode length (a huge win). I actually thought of that algorithm myself, but of course–knowing my luck–it was already known. I went ahead and implemented the algorithm in a Github repo called lazyrnn. I am confident that it will come in handy.
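
To give a flavor of the logarithmic-memory trick (a toy sketch with a scalar recurrence, not the actual lazyrnn code): instead of storing every hidden state, you recompute states on demand, recursing on halves of the episode so that only about log2(T) states are alive at once.

package main

import (
	"fmt"
	"math"
)

const w = 0.5 // recurrent weight of the toy RNN: h[t+1] = tanh(w*h[t] + x[t])

func stepForward(h, x float64) float64 {
	return math.Tanh(w*h + x)
}

// stepGrad returns dL/dh[t] given h[t], x[t], and dL/dh[t+1].
// (Gradients with respect to w are omitted for brevity.)
func stepGrad(h, x, gradNext float64) float64 {
	out := stepForward(h, x)
	return gradNext * (1 - out*out) * w
}

// backprop returns dL/dh[lo] given h[lo] and dL/dh[hi], recomputing
// intermediate states rather than storing them. Memory grows with the
// recursion depth, i.e. logarithmically in hi-lo.
func backprop(h float64, lo, hi int, xs []float64, gradOut float64) float64 {
	if hi-lo == 1 {
		return stepGrad(h, xs[lo], gradOut)
	}
	mid := (lo + hi) / 2
	hMid := h
	for t := lo; t < mid; t++ {
		hMid = stepForward(hMid, xs[t])
	}
	gradMid := backprop(hMid, mid, hi, xs, gradOut)
	return backprop(h, lo, mid, xs, gradMid)
}

func main() {
	xs := []float64{0.1, -0.3, 0.7, 0.2, -0.5, 0.4, 0.0, 0.9}
	// 1.0 is the gradient of the loss with respect to the final state.
	fmt.Println(backprop(0, 0, len(xs), xs, 1.0))
}

Each recursive call stores only one extra state, so memory scales with the recursion depth rather than the episode length, at the cost of recomputing the forward pass O(log T) times.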

As an aside, OpenAI’s recent variant of Evolution Strategies addresses the memory issue in a different way. By avoiding back-propagation, ES does not need more memory for longer episodes. However, there are other difficulties with ES that make me shy away from it. For one, it seems to depend greatly on network parameterization and weight decay, neither of which I expect to carry over nicely to Recurrent Neural Networks like sgdstore or LSTMs.

Now that I have a way of training sgdstore on long episodes (lazyrnn), and I have a source of simulated experience (OpenAI Universe), I am left with one final task. Since the meta-learner will be performing reinforcement learning (learning from a reward signal), I need to implement a powerful RL algorithm. For this, I plan to use Trust Region Policy Optimization (TRPO). TRPO is pretty much the state of the art when it comes to RL. Over the past two days, I have been implementing forward automatic differentiation so that I can compute Fisher-vector products for TRPO.

I hope to have TRPO implemented within the next few days, at which point I will begin my meta-learning experiments. With any luck, sgdstore will be able to learn something. From my experience, though, I doubt anything will work on the first try. I’m sure many challenges await me.

 

Slice Aliasing Is Nicer Than You Realize

Most languages handle array slicing in one of two ways: with copying or with aliasing. JavaScript copies:

var arr = [1, 2, 3];
var arr1 = arr.slice(1, 2);
arr[1] = 0;
console.log(arr1); // prints [ 2 ]

Go, on the other hand, aliases:

arr := []int{1, 2, 3}
arr1 := arr[1:2]
arr[1] = 0
fmt.Println(arr1) // prints [0]

Until yesterday, my ML framework’s vector library did not support aliasing. When you sliced a vector, you created a copy. I thought this would make things easier for a few reasons. For one, aliasing means that you have to worry about overlapping memory regions. Have you ever tried to implement your own version of memmove()? It’s not fun. However, after months of working with my framework, it became abundantly clear that aliasing would be nice to have.

In my view, the biggest merit of aliasing is performance. Suppose I have a highly-optimized function dot(x, y) which computes the dot product of two vectors (i.e. arrays of numbers). If I want to use this to compute a matrix-vector product, I can call dot once for each row of the matrix. But wait: how do I extract each row from the matrix? If this is Go, I can create a reference to the row in constant time without copying any memory. If this is JavaScript, on the other hand, I have to use slice() to make a copy of the row. That’s obviously worse for performance, especially with GC overhead.
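
For example, in Go a matrix row can be “extracted” with a constant-time re-slice. This sketch (not my framework’s actual API) shows the idea:

// rowDot computes the dot product of row i of a row-major matrix with v.
// The sub-slice aliases mat's backing array: no copying, no allocation.
func rowDot(mat []float64, cols, i int, v []float64) float64 {
	row := mat[i*cols : (i+1)*cols]
	var sum float64
	for j, x := range row {
		sum += x * v[j]
	}
	return sum
}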

Suppose I wanted to optimize the dot function in JavaScript by adding offset arguments, such as dot(x, xOffset, y, yOffset). Now I could use the offset arguments instead of slicing. But wait, I just added two more arguments to my function! And what’s more, I have to add offset arguments to every similar math function I want to optimize. As someone who maintains a vector manipulation library, that seems like a lot of extra clutter.

As a note, Java suffers from exactly what I described above. Since there’s no array aliasing, the Arrays class is cluttered with method variants like fill(short[] a, int fromIndex, int toIndex, short val). If Java arrays worked like Go slices, you wouldn’t need those pesky index arguments.

You could avoid the extra arguments by wrapping every array in a type that looks something like this:

SlicedVector {
  int    start;
  int    length;
  type[] backingArray;
}

If all your APIs took these structures instead of arrays, then you could slice for free. But now you’ve implemented aliasing! You would have all the same problems you’d have with aliasing: overlapping array arguments, weird bugs relating to mutability, etc. So why not just use aliasing from the start?

Overall, it’s surprising how little I thought about aliasing before creating my own vector library. Usually it’s not an issue. It was only yesterday, when I profiled a program and found that 30% of its overhead was coming from vector slicing, that I finally decided in favor of aliasing.

The Bug That Wasted a Month of GPU Time

After getting my hands on a good GPU, I spent a month trying to train a single neural network. The network wasn’t even that special—it was just an image classifier. In case you don’t know, an image classifier tries to name the objects in images. You might show it a picture of a phone, for example, and it’d output something like “rotary phone”. This is an extremely popular application of neural networks, so I figured I was in for an easy ride.

To my disappointment, the network was doing really poorly. I expected an error rate of about 7%, but I was getting more like 35%. I tried lots of different fixes: changing the batch size, using different optimization algorithms, and even tinkering with L2 regularization. Nothing seemed to help.

I needed to be able to run experiments faster. After tweaking the network, I would have to wait a few days before seeing the (negative) results. I knew I hadn’t put much time into optimizing my code, so I figured I could squeeze more performance out of it. When I ran a profile, I found that most of the overhead wasn’t even coming from the network itself. Rather, data augmentation was taking up most of the training time.

Tipped off by the profiler results, I tried to optimize the cropping code by combining it with some other image logic. This is when I really looked at the cropping function:

func crop(img image.Image, x, y int) image.Image {
	res := image.NewRGBA(image.Rect(0, 0, InputImageSize,
		InputImageSize))
	for y := 0; y < img.Bounds().Dy(); y++ {
		for x := 0; x < img.Bounds().Dx(); x++ {
			c := img.At(x+img.Bounds().Min.X,
				y+img.Bounds().Min.Y)
			res.Set(x, y, c)
		}
	}
	return res
}

So, that code is pretty broken. For one thing, the x and y arguments are shadowed by the for loop variables, so I was always cropping the top-left corner out of every image. Second, the code was really slow because I was looping over the pixels of the larger image rather than only touching the pixels in the cropped region.
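
For reference, a sane version looks something like this (a sketch of the fix, in the same style as the snippet above: it actually uses the x and y offsets and only loops over the cropped region):

func crop(img image.Image, x, y int) image.Image {
	res := image.NewRGBA(image.Rect(0, 0, InputImageSize,
		InputImageSize))
	for destY := 0; destY < InputImageSize; destY++ {
		for destX := 0; destX < InputImageSize; destX++ {
			// Read from the offset (x, y) within the source image.
			c := img.At(x+destX+img.Bounds().Min.X,
				y+destY+img.Bounds().Min.Y)
			res.Set(destX, destY, c)
		}
	}
	return res
}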

So, why was my neural network so bad? Because I was always feeding it the top-left corner of its training images. I don’t know if you’ve noticed, but the subject of a picture is rarely in the top-left corner. All things considered, it’s amazing that the network could get 60% accuracy with such terrible inputs.

Now for the good news. After I replaced the cropping code with something sane, the network learned to classify images very well. It was a ResNet-34, and my results were comparable to the ones in the paper for the same network. And now that I’ve finished what I set out to do, I have my GPU back. I can finally move on to more interesting things.

Random Fun with Linear SVMs

I was first introduced to Support Vector Machines by MIT’s AI course. The lecturer did a lot of math, arrived at a quadratic optimization problem, and then said “solving this is a job for a numerical analyst” (paraphrased). Discouraged, I went a while without implementing SVMs at all. Even after I had learned some numerical analysis, I still tended to prefer conceptually simpler models like neural networks or decision trees.

A few days ago, I came to a realization that totally changed my feelings about SVMs. In particular, I realized that I could use the hinge loss to train a regular neural network. Like any loss function, the hinge loss measures how well a model classifies inputs. However, it has a particular property that makes it special: when you minimize the hinge loss, you are essentially training an SVM.

Suddenly, I had a way to work SVMs into my existing ML framework. An SVM can be trained just like a neural network, provided you use the right loss function (combined with L2 regularization). All the complicating factors–stuff like kernels and quadratic programming–no longer seemed relevant. I finally had a way to benefit from SVMs without all the complications that seemed to come along with them.

Implementing the hinge loss in my own ML framework was fairly simple, although I had trouble deciding which version of the multi-class hinge loss to use. Annoyingly, Wikipedia only mentioned one variant of said loss. Thanks to me, that is now fixed.
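
For the curious, here is one common variant (the Weston-Watkins multi-class hinge loss), sketched in Go; I’m not claiming it’s exactly the variant I ended up implementing:

// hingeLoss computes the Weston-Watkins multi-class hinge loss: every
// wrong class is penalized if its score comes within a margin of 1 of
// the correct class's score.
func hingeLoss(scores []float64, label int) float64 {
	var loss float64
	for i, s := range scores {
		if i == label {
			continue
		}
		if margin := s - scores[label] + 1; margin > 0 {
			loss += margin
		}
	}
	return loss
}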

Once I had an implementation of the hinge loss, I trained a linear SVM on the MNIST handwriting dataset. The results were amazing! After less than a second of training, the model had a 92% validation accuracy. But I didn’t want to stop there.

Until this week, I have never used the CIFAR dataset. If you aren’t familiar, CIFAR is a database of labeled color images. For example, there are thousands of 32×32 images of cars, each labeled “automobile”. I trained a linear SVM on CIFAR-10 (a CIFAR dataset with 10 categories). The resulting SVM was able to classify 38% of images correctly, which is pretty impressive for such a simple model.

Unlike most ML models, it is easy to visualize what a linear SVM is doing. For each class, the SVM finds a vector describing samples in that class. We can treat these vectors as images to see exactly what the SVM is looking for:

SVM weight vectors visualized as images: car, boat, frog, horse.

You can probably see why the car looks the way it does. The boat image is fairly obvious too: most boats are brown-ish objects floating in a blue-ish ocean. The frog is also quite clear: frogs are approximately green blobs. If you squint hard enough at the horse, you can probably explain it as well.

Perhaps the nicest thing about linear SVMs is that they learn so quickly. The CIFAR SVM took about 10 seconds to train. The MNIST SVM took less than a second. That’s something you don’t get from the typical neural network.

Now that I have had some fun with linear SVMs, I hope that they will become part of my everyday ML process. At the very least, they can often give a good baseline for classification accuracy.

Can Neural Networks Learn to Spell?

English spelling is extremely difficult. It has few rules, and even those are frequently broken. Despite all this, millions of people write in English every day. How does everyone do it? Is spelling learned by brute memorization, or does the brain do something clever?

Despite what pre-school teachers might tell you, people don’t learn to spell by brute memorization. When we hear a new word—sometimes even a made up word—we can usually guess how it will be spelled. If we simply memorized the spelling of every word, guessing new spellings would be impossible. It’s much more likely that the brain of an English speaker develops a probabilistic system for spelling words. This system might have some hard-wired exceptions, but it’s less rigid than a simple lookup table.

So, if the brain can learn a system for English spelling, can an artificial neural network do the same? Spoiler alert: yes it can.

You might wonder what the input to such a neural network might look like. For example, could we feed the network recordings of speech and expect it to produce spellings? At this point in time, I think it would be unreasonable to expect good results from such an approach. Even modern speech recognition systems first convert speech into phonetic transcriptions before attempting to produce predictions over entire words. So, instead of feeding my network raw sound, I decided to feed it phonetic transcriptions in IPA. IPA is a standard way to write down speech sounds without needing large audio files. As an added bonus, it’s easy to find dictionaries with IPA transcriptions.

After deciding on the input format, I had to design the network itself. Since words can vary in length, I had little choice but to use a Recurrent Neural Network. In fact, the architecture I chose closely resembles the one used in Graves et al., 2006 for speech recognition. To get technical, each phoneme of the input is duplicated four times (“mmmmʌʌʌʌŋŋŋŋkkkkiiii”), two separate RNNs are run on the sequence (one backwards, one forwards), and Connectionist Temporal Classification determines the final network output (i.e. the spelling). This setup has two side effects:

  1. Spellings cannot be more than 4 times the length of the phonetic transcriptions.
  2. The network is encouraged to “align” phonemes with their corresponding letters.
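
To make the input format concrete, the duplication step is as simple as this (a sketch that treats each rune as one phoneme, which glosses over multi-character IPA symbols):

// expandPhonemes repeats each symbol of an IPA string k times, e.g.
// expandPhonemes("mʌŋki", 4) == "mmmmʌʌʌʌŋŋŋŋkkkkiiii".
func expandPhonemes(ipa string, k int) string {
	var out []rune
	for _, r := range ipa {
		for i := 0; i < k; i++ {
			out = append(out, r)
		}
	}
	return string(out)
}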

With our network laid out, we can talk a bit about data. A while ago I had used the CMU pronunciation dictionary for a speech synthesis project, and I repurposed that dictionary as network training data. After a bit of preprocessing, I split the dictionary up into 105,139 training samples and 11,474 testing samples (where a sample is a single word). Uncompressed, this data was a bit over 2MB. The neural network, on the other hand, was only 1.8MB. This meant that, at the very least, the network couldn’t simply store an uncompressed copy of the data.

When I went to train the network, I found that it learned quite quickly. In fact, I started by training the network on a single core of a 10-year-old iMac. A single epoch took a few hours, but the result was fairly satisfying. I ran the network on a few random words to see what it could come up with:

Actual word   Input        Output
invader       InveIdəɹ     invader
twelve        twɛlv        twelve
evaluate      IvæljueIt    evaluate
guilty        gIlti        gilty

None of these words have a one-to-one mapping between phonemes and letters. In “invader”, the letter “a” only has the /eI/ sound because it is followed by “de”. In “twelve”, we see the general rule that words cannot end in a “v”. In “evaluate”, “e” is added to the end of the word to give the last “a” an /eI/ sound. The last example is particularly notable because the network got it wrong. Not many words are spelled like “guilty”, so this mistake is unsurprising (the fact that “guilty” is a common word has no effect; see my note at the bottom about word frequency).

The trained network had a validation success rate of about 40%, meaning that it could correctly spell 40% of the words it had never seen before. This didn’t seem impressive, but I didn’t want to judge without trying myself. I selected 19 random words from the validation set and tried to spell them based on their IPA transcriptions. It was harder than I expected. The dictionary contains a huge number of uncommon words (e.g., city names, names of plants, etc.). I was only able to spell 9 out of the 19 words (47%). This should give you an idea of what I was up against:

Phonetics        My guess        Network’s guess   True word
mʌkweI           muckway         mcway             mcquay
fɹægmʌnt         fragment        fragment          fragment
ʌgwangʌ          agwanga         aguanga           aguanga
pɛtɹoʊmInəɹʌlz   petrominerals   petrominerals     petrominerals

At this point, I started wondering about overfitting. After training for one epoch, cross-validation indicated that virtually no overfitting had taken place. However, I knew that further training might cause the network to lose its ability to generalize to new words. When you think about it, though, we humans probably overfit spelling ourselves. Most of us can think of times when we’ve overthought the spelling of a word and ended up misspelling it as a result. In some sense, we tend to focus on weird edge cases, leading us to overthink words with normal spellings. So, as an experiment, I set out to intentionally overfit a network on spelling.

I started by training the existing 1.8MB network for more epochs. Its validation success rate stabilized at around 41%, while the training success rate inched up to 48%. Encouragingly, the validation score never got worse. Eager to force a network to memorize more English spelling, I increased the number of neurons in the network. The new network took up an alarming 7MB, but it still only achieved a 44% validation success rate and a 56% training success rate. Annoyingly, that’s still an F, albeit on the world’s hardest spelling test. My conclusion was that humans aren’t the only “neural” entities that struggle with spelling tests.

One final thing to note is that my experiments weighted every word equally. In reality, English words follow a heavy-tailed probability distribution. It is undeniable that people are better at spelling common words than extremely infrequent ones. In fact, one can sometimes get by with only 1000 words. The takeaway is that, while a 50% success rate might not sound impressive, it could be more than enough.

The code for this project is available on Github. You might note from the commit history that this is not a new project. I had started implementing this idea months ago, but I had limited training hardware at the time. I only decided to pick the project back up once I was less restricted by hardware.

Ancient Philosophy as a Classification Problem

I took the first few weeks of an ancient philosophy class and found it extremely frustrating. I generally love philosophy, but ancient philosophy was somehow different. Throughout every reading and every lecture, I couldn’t help but think that ancient philosophy was just a narrow-minded subset of data science. This might sound reductive—perhaps altogether absurd—until you think about what ancient philosophers like Socrates were really after.

Socrates and Plato spent a lot of time trying to get at the essence of concepts like virtue, courage, and happiness. Typically, Socrates would bait someone into defining a broad term like “courage”, then would go on to refute the definition with a string of counter-examples. After having done this, he would conclude that they (Socrates and his interlocutor) had no clue what “courage” was.

The TA for the class did a similar thing in our discussion section. He posed a simple question: what is a watch? It turned out that nobody knew. Is a pocket watch a watch? What about a smartwatch? What if you strap your phone to your arm—is that a watch? It went on, case after case, and it seemed more and more difficult to nail down a universal definition. In fact, it seemed that “watchness” had more to do with brand marketing than with the function or structure of the product. The newest Fitbit is essentially a smartwatch, yet people don’t call it that. What is it that makes the Moto 360 a watch while the Fitbit is not? None of us could figure it out.

What frustrated me at the time, and probably led me to drop the class, was that philosophers were approaching the problem from the wrong direction. Defining terms is, and always has been, a simple matter of classification. Data scientists have been dealing with this problem for years. In the case of a watch, we want a function that maps an object to a yes-or-no label (“watch” or “no watch”). The function could be anything: a neural network, a decision tree, an SVM, you name it. However, for some reason, philosophers only seemed to want to use one kind of classifier: a sentence.

Essentially, philosophers like Socrates were trying to make classifiers out of language. They wanted a definition that could classify every instance correctly. So, as a data scientist, I am not surprised that Socrates struggled so much. Of course a human knows what a watch is; we are essentially huge neural networks. However, there is absolutely no reason to assume that our neural circuitry could be compressed down to less than a kilobyte (the size of a lengthy paragraph).

With this being said, definitions can be extremely useful. Just like simple regression models, definitions are low variance. Basically, a definition gives you a “good sense” of a categorization without needing tons of data. However, one should never expect a definition to fulfill the duties of a high variance classifier like a deep neural network. Definitions cannot and should not cover every edge case. Capturing the intricacies of meaning is a job for a complex model like the human brain. That, I believe, is what Socrates was missing.