Anybody who does research knows that ideas often don’t pan out. However, it is fairly rare to see papers or blogs about negative results. This is unfortunate, since negative results can tell us just as much as positive ones, if not more.
Today, I want to share some of my recent negative results in machine learning research. I’ll also include links to the code for each project, and share some theories as to why each idea didn’t work.
Project 1: reptile-gen
Source code: https://github.com/unixpickle/reptile-gen
Premise: This idea came from thinking about the connection between meta-learning and sequence modeling. Research has shown that sequence modeling techniques, such as self-attention and temporal convolutions (or the combination of the two), can be used as effective meta-learners. I wondered if the reverse was also true: are effective meta-learners also good sequence models?
It turns out that many sequence modeling tasks, such as text and image generation, can be posed as meta-learning problems. This means that MAML and Reptile can theoretically solve these problems on top of nothing but a feedforward network. Instead of using explicit state transitions like an RNN, reptile-gen uses a feedforward network’s parameters as a hidden state, and uses SGD to update this hidden state. More details can be found in the README of the GitHub repository.
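The mechanism above can be made concrete with a small sketch. This is my own minimal numpy illustration, not code from the repository: the "hidden state" of the sequence model is the weight matrix of a tiny linear model, and the state transition is one SGD step per (input, target) pair.

```python
import numpy as np

def sequence_loss(w, xs, ys, lr=0.1):
    """Process a sequence one element at a time, taking an SGD step after
    each element. Returns the summed squared error and the final weights,
    which play the role of the RNN's final hidden state."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = x @ w
        total += float(np.sum((pred - y) ** 2))
        w = w - lr * np.outer(x, pred - y)  # SGD step acts as the state update
    return total, w

rng = np.random.default_rng(0)
w_true = rng.normal(size=(4, 1))   # defines a learnable sequence
xs = rng.normal(size=(16, 4))
ys = xs @ w_true
w0 = np.zeros((4, 1))

l_sgd, _ = sequence_loss(w0, xs, ys, lr=0.1)     # state updated by SGD
l_frozen, _ = sequence_loss(w0, xs, ys, lr=0.0)  # no state updates at all
print(l_sgd, l_frozen)
```

With the inner-loop SGD enabled, the model adapts to the sequence as it reads it, so the cumulative loss drops below the frozen baseline; a meta-learner like MAML or Reptile would then tune the initial weights `w0` in an outer loop.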
Experiments: I applied two meta-learning algorithms, MAML and Reptile, to sequence modeling tasks. I applied both of these algorithms on top of several different feedforward models (the best one was a feedforward network that resembles one step of an LSTM). I tried two tasks: generating sequences of characters, and generating MNIST digits pixel-by-pixel.
Results: I never ended up getting high-quality samples from any of these experiments. For MNIST, I got things that looked digit-esque, but the samples were always very distorted, and my LSTM baseline converged to much better solutions with much less tuning. I did notice a big gap in performance between MAML and Reptile, with MAML consistently winning out. I also noticed that architecture mattered a lot, with the LSTM-like model performing better than a vanilla MLP. Gated activations also seemed to boost the performance of the MLPs, although not by much.
Takeaways: My main takeaway from this project was that MAML is truly better than Reptile. Reptile doesn’t back-propagate through the inner-loop, and as a result it seems to have much more trouble modeling long sequences. This is in contrast to our findings in the original Reptile paper, where Reptile performed about as well as MAML. How could this be the case? Well, in that paper, we were testing Reptile and MAML with small inner-loops consisting of fewer than 100 samples; in this experiment, the MNIST inner-loop had 784 samples, and the inputs were (x,y) indices (which inherently share very little information, unlike similar images).
While working on this project, I went through a few different implementations of MAML. My first implementation was very easy to use without modifying the PyTorch model at all; I didn’t expect such a plug-and-play implementation to be possible. This made my mind much more open to MAML as an algorithm in general, and I’d be very willing to use it in future projects.
Another takeaway is that sequence modeling is hard. We should feel grateful for Transformers, LSTMs, and the like. There are plenty of architectures which ought to be able to model sequences, but fail to capture long-term dependencies in practice.
Project 2: seqtree
Source code: https://github.com/unixpickle/seqtree
Premise: As I’ve demonstrated before, I am fascinated by decision tree learning algorithms. Ensembles of decision trees are powerful function approximators, and it’s theoretically simple to apply them to a diverse range of tasks. But can they be used as effective sequence models? Naturally, I wondered if decision trees could be used to model the sequences that reptile-gen failed to. I also wanted to experiment with something I called “feature cascading”, where leaves of some trees in an ensemble could generate features for future trees in the ensemble.
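Here is a toy sketch of how I read the "feature cascading" idea; this is a hypothetical reconstruction, not the repository's implementation. After fitting each regression stump on the current residual (gradient-boosting style), the stump's leaf assignment is appended as a new binary feature that later members of the ensemble can split on. Note that the post below reports this idea was unhelpful in practice; the sketch only illustrates the mechanism.

```python
import numpy as np

def fit_stump(X, y):
    """Greedy one-split regression stump: returns (feature, threshold,
    left mean, right mean) minimizing squared error."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            mask = X[:, f] <= t
            pred = np.where(mask, y[mask].mean(), y[~mask].mean())
            err = float(np.sum((y - pred) ** 2))
            if best is None or err < best[0]:
                best = (err, f, t, float(y[mask].mean()), float(y[~mask].mean()))
    return best[1:]

def predict_stump(stump, X):
    f, t, left, right = stump
    return np.where(X[:, f] <= t, left, right)

def fit_cascade(X, y, n_stumps=4, lr=0.5):
    feats = X.copy()
    preds = np.zeros(len(y))
    for _ in range(n_stumps):
        stump = fit_stump(feats, y - preds)      # fit the current residual
        preds += lr * predict_stump(stump, feats)
        # Cascade: expose this stump's leaf assignment as a new binary
        # feature for every later stump in the ensemble.
        leaf = (feats[:, stump[0]] <= stump[1]).astype(float)
        feats = np.column_stack([feats, leaf])
    return preds

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)
preds = fit_cascade(X, y)
mse_init = float(np.mean(y ** 2))
mse_final = float(np.mean((y - preds) ** 2))
print(mse_init, mse_final)
```

One visible cost of the cascade, mentioned in the takeaways below: the feature matrix grows by one column per tree, which is exactly the runtime blowup described later in this section.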
Experiments: As with reptile-gen, I tried two tasks: MNIST digit generation, and text generation. I tried two different approaches for MNIST digit generation: a position-invariant model, and a position-aware model. In the position-invariant model, a single ensemble of trees looks at a window of pixels above and to the left of the current pixel, and tries to predict the current pixel; in the position-aware model, there is a separate ensemble for each location in the image, each of which can look at all of the previous pixels. For text generation, I only used a position-invariant model.
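The position-invariant feature window can be sketched as follows; the exact window shape and padding in the repository may differ, so treat this as an assumed layout for illustration. For each pixel, the ensemble sees the pixels in a few rows above plus the pixels to its left on the current row, with zeros outside the image:

```python
import numpy as np

def window_features(img, r, c, height=2, width=3):
    """Gather the causal window around pixel (r, c): the `height` rows above
    (full 2*width+1 columns each) plus the `width` pixels to the left on the
    current row, zero-padded outside the image."""
    feats = []
    for dr in range(-height, 1):
        for dc in range(-width, width + 1):
            if dr == 0 and dc >= 0:
                continue  # never look at the current or future pixels
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < img.shape[0] and 0 <= cc < img.shape[1]
            feats.append(float(img[rr, cc]) if inside else 0.0)
    return np.array(feats)

img = np.arange(16.0).reshape(4, 4)
center = window_features(img, 2, 2)   # interior pixel: real neighbors
corner = window_features(img, 0, 0)   # first pixel: all zero padding
print(len(center))  # (height + 1) * (2 * width + 1) - (width + 1) features
```

Because the same feature layout is used at every position, a single ensemble serves the whole image; the position-aware variant instead trains one ensemble per (r, c) on the full prefix of pixels.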
Results: The position-invariant models underfit drastically. For MNIST, they generated chunky, skewed digits. The position-aware model was on the other side of the spectrum, overfitting drastically to the training set after only a few trees in each ensemble. My feature cascading idea was unhelpful, and greatly hindered runtime performance since the feature space grew rapidly with training.
Takeaways: Decision tree ensembles simply can’t do certain things well. There are two possible reasons for this: 1) there is no way to build hierarchical representations with linearly-combined ensembles of trees; 2) the greedy nature of tree building prevents complex relationships from being modeled properly, and makes it difficult to perform complex computations.
Another more subtle realization was that decision tree training is not very easy to scale. With neural networks, it’s always possible to add more neurons and put more machines in your cluster. With trees, on the other hand, there’s no obvious knob to turn to throw more compute at the problem and consistently get better results. Tree building algorithms themselves are also somewhat harder to parallelize, since they rely on fewer batched operations. I had trouble getting full CPU utilization, even on a single 64-core cloud instance.
Project 3: pca-compress
Source code: https://github.com/unixpickle/pca-compress
Premise: Neural networks contain a lot of redundancy. In many cases, it is possible to match a network’s accuracy with a much smaller, carefully pruned network. However, there are some caveats that make this fact hard to exploit. First of all, it is difficult to train sparse networks from scratch, so sparsity does not help much to accelerate training. Furthermore, the best sparsity results seem to involve unstructured sparsity, i.e. arbitrary sparsity masks that are hard to implement efficiently on modern hardware.
I wanted to find a pruning method that could be applied quickly, ideally before training, that would also be efficient on modern hardware. To do this, I tried a form of rank-reduction where the linear layers (i.e. convolutional and fully-connected layers) were compressed without affecting the final number of activations coming out of each layer. I wanted to do this rank-reduction in a data-aware way, allowing it to exploit redundancy and structure in the data (and in the activations of the network while processing the data). The README of the GitHub repository includes a much more detailed description of the exact algorithms I tried.
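A rough sketch of the data-aware rank-reduction idea, in numpy; this is my reconstruction of the simple PCA variant for a single fully-connected layer (the repository's algorithms differ in details). The layer's inputs are projected onto the top principal components of their empirical covariance, and the projection is folded into the weights, so the layer keeps the same number of output activations but factors into two smaller matmuls:

```python
import numpy as np

def pca_reduce(W, X, rank):
    """W: (out, in) weight matrix; X: (n, in) sample input activations.
    Returns A: (rank, in) and B: (out, rank) such that
    X @ W.T ~= (X @ A.T) @ B.T, i.e. the layer becomes two thin layers."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :rank]          # (in, rank) top principal directions
    return P.T, W @ P                       # project inputs, fold back into W

rng = np.random.default_rng(0)
# Inputs that actually live in a 2-dimensional subspace of an 8-dim space,
# mimicking redundancy in a network's activations:
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8))
W = rng.normal(size=(5, 8))
A, B = pca_reduce(W, X, rank=2)
err = float(np.max(np.abs(X @ A.T @ B.T - X @ W.T)))
print(A.shape, B.shape, err)
```

The payoff is hardware-friendly structure: an `out x in` matmul becomes `rank x in` followed by `out x rank`, dense matrices that any BLAS or GPU kernel handles efficiently, unlike an unstructured sparsity mask. When the activations really are redundant, as in this toy example, the low-rank layer reproduces the original outputs almost exactly.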
Experiments: I tried small-scale MNIST experiments and medium-scale ImageNet experiments. I call the latter “medium-scale” because I reserve “large-scale” for things like GPT-2 and BERT, both of which are out of reach for my compute. I tried pruning before and after training. For ImageNet, I also tried iteratively pruning and re-training. A lot of these experiments were motivated by the lottery ticket hypothesis, which stipulates that it may be possible to train sparse networks from scratch with the right set of initial parameters.
I tried several methods of rank-reduction. The simplest, which was based on PCA, only looked at the statistics of activations and knew nothing about the optimization objective. The more complex approaches, which I call “output-aware”, considered both inputs and outputs, trying to prevent the network’s output from changing too much after pruning.
Results: For my MNIST baseline, I was able to prune networks considerably (upwards of 80%) without any significant loss in accuracy. I also found that an output-aware pruning method was better than the simple PCA baseline. However, results and comparisons at this small scale did not accurately predict ImageNet results.
On ImageNet, PCA pruning was uniformly the best approach. A pre-trained ResNet-18 pruned with PCA to 50% rank across all convolutional layers experienced a 10% decrease in top-1 performance before any tuning or re-training. With iterative re-training, this gap was reduced to closer to 1.8%, which is still worse than the state of the art. My output-aware pruning methods resulted in a performance gap closer to 30% before re-training (much worse than PCA).
I never got around to experimenting with larger architectures (e.g. ResNet-50) and more severe levels of sparsity (e.g. 90%). Some day I may revisit this project and try such things, but at the moment I simply don’t have the compute available to run these experiments.
At the end of the day, why would I look at these results and consider this a research project that “didn’t pan out”? Mostly because I never managed to get the reduction in compute that I was hoping to achieve. My hope was that I could prune networks without a ton of compute-heavy re-training. My grand ambition was to figure out how to prune networks at or near initialization, but this did not pan out either. My only decent results were with iterative pruning, which is computationally expensive and defeats most of the purpose of the exercise. Perhaps my pruning approach could be used to find good, production-ready models, but it cannot be used (as far as I know) to speed up training time.
Takeaways: One takeaway is that results on small-scale experiments don’t always carry over to larger experiments. I developed a bunch of fancy pruning algorithms which worked well on MNIST, but none of them beat the PCA baseline on ImageNet. I never quite figured out why PCA pruning worked the best on ImageNet, but my working theory is that the L2 penalty used during training resulted in activations that had little-to-no variance in discriminatively “unimportant” directions.