Training a Speech Synthesizer

Computers are much cooler when they talk to you, which is probably why people have been working on speech synthesis for decades (arguably centuries). I decided to dip my toe into this domain by using modern deep learning tools to train my own speech synthesizer.

Creating a speech synthesizer with today's tools is actually pretty straightforward. We want to generate an audio waveform, which means we probably want to train a generative model. There are a plethora of generative models we could reach for, such as diffusion models or autoregressive Transformers.

For this project, I opted to use VQ-VAE, a two-stage approach where we first learn to compress the data and then model the compressed representation. Following this approach, I first created a low-bitrate representation of audio, and then I trained an autoregressive Transformer on top of these latent codes.

I could have sourced training data from many places, but I chose to take a particularly easy route: I created synthetic data using Apple's speech synthesizer. In particular, I trained my model on a dataset of about 1M examples of Siri saying short strings of text.

The samples from the model are sometimes quite impressive, and other times surprisingly broken. If you came here for samples, you can skip to the last section of this post. All training ran on a single Mac Studio over the course of a few weeks, and the code is written in Swift and uses Honeycrisp as a deep learning framework.

Collecting the data

To train a deep learning model, we typically need to collect a large amount of training data. For once, this was the easy part. All I did was feed 100 ebooks from Project Gutenberg into the Mac's built-in say command-line utility.
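In case it's useful, here is a rough Swift sketch of that kind of pipeline. The paths, the output format, and the lack of a voice flag are all illustrative rather than my exact setup; the point is just that each short string of text becomes one invocation of say and one audio file.

```swift
import Foundation

// Illustrative sketch: feed one line of text to the macOS `say` utility and
// write the result to an audio file instead of playing it out loud.
func synthesize(line: String, to outputURL: URL) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/say")
    // `-o` makes `say` write audio to a file rather than the speakers.
    process.arguments = ["-o", outputURL.path, line]
    try process.run()
    process.waitUntilExit()
}

// Example usage: turn a few short strings into one audio file each.
let lines = ["Call me Ishmael.", "It was the best of times, it was the worst of times."]
do {
    for (i, line) in lines.enumerated() {
        try synthesize(line: line, to: URL(fileURLWithPath: "/tmp/sample_\(i).aiff"))
    }
} catch {
    print("say failed: \(error)")
}
```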

I ended up producing about 1.5M short audio files. After de-duplicating the actual texts, I was left with about 1M of these. This was likely more than I needed, but it's better to be safe than sorry when you have the option to produce unlimited data.

Training discrete codes

Raw audio data is big. At a sample rate of 24kHz, a five second audio file contains 24000*5=120000 samples. If you've worked with present-day language models, that's a much longer sequence than you'd want, especially when training on a single machine.

Luckily, audio data is very compressible. The VQ-VAE paradigm takes advantage of this by first learning to compress the data into sequences of discrete codes before modeling those sequences with a more powerful model. In my case, I compressed the audio down to discrete codes at 96 Hz, meaning that each five second audio clip becomes 480 tokens, which is much more manageable for a powerful sequence model.
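As a quick reminder of what the "VQ" part does, here is a toy nearest-neighbor quantizer in plain Swift. This isn't my actual model, which operates on Honeycrisp tensors with a trained codebook; it just shows how a continuous encoder output turns into a discrete token index.

```swift
// Toy vector quantization: replace a continuous vector with the index of the
// closest codebook entry (squared Euclidean distance).
func quantize(_ vector: [Float], codebook: [[Float]]) -> Int {
    var bestIndex = 0
    var bestDistance = Float.greatestFiniteMagnitude
    for (i, code) in codebook.enumerated() {
        var distance: Float = 0
        for (a, b) in zip(vector, code) {
            distance += (a - b) * (a - b)
        }
        if distance < bestDistance {
            bestDistance = distance
            bestIndex = i
        }
    }
    return bestIndex
}

// Toy usage: a 2-entry codebook and one encoder output vector.
let codebook: [[Float]] = [[0.0, 0.0], [1.0, 1.0]]
let token = quantize([0.9, 1.2], codebook: codebook)  // -> 1
```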

Sadly, 480 codes probably aren't enough to encode every detail of an audio waveform. When we decode these discrete codes into raw audio samples, we would like our model to be able to "fill in the blanks", so to speak, by doing some generation of its own.

To this end, I decided to try normalizing flows. In general, flows aren't very popular anymore, and they do have various pitfalls. However, unlike diffusion models, they can produce a sample with a single forward pass. This seemed like a nice property to have when you want to decode audio quickly.
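For readers who haven't seen one, here is a toy affine coupling layer, the building block most normalizing flows are made of. This is not the decoder I trained; it's only meant to show why sampling is a single cheap forward pass through a stack of invertible layers, with an equally cheap inverse for likelihood training. The scaleNet and shiftNet closures stand in for small neural networks.

```swift
import Foundation

// Toy affine coupling layer: keep half of the input, and affinely transform
// the other half using scale/shift values predicted from the first half.
struct AffineCoupling {
    // Stand-ins for small neural networks.
    var scaleNet: ([Float]) -> [Float]
    var shiftNet: ([Float]) -> [Float]

    // Forward pass (used for sampling): one cheap transform, no iteration.
    func forward(_ x: [Float]) -> [Float] {
        let half = x.count / 2
        let x1 = Array(x[..<half])
        let x2 = Array(x[half...])
        let s = scaleNet(x1)
        let t = shiftNet(x1)
        let y2 = zip(zip(x2, s), t).map { $0.0 * exp($0.1) + $1 }
        return x1 + y2
    }

    // Inverse pass (used for training by maximum likelihood) is just as cheap.
    func inverse(_ y: [Float]) -> [Float] {
        let half = y.count / 2
        let y1 = Array(y[..<half])
        let y2 = Array(y[half...])
        let s = scaleNet(y1)
        let t = shiftNet(y1)
        let x2 = zip(zip(y2, t), s).map { ($0.0 - $0.1) * exp(-$1) }
        return y1 + x2
    }
}
```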

Training the VQ codes was, surprisingly, the toughest part of this project. Initially, I had some trouble making training stable. At one point, I even discovered a bug in a previous version of Apple's Metal Performance Shaders which caused incorrect gradients. Even after tuning hyperparameters, dealing with divergences, and scaling up the model considerably, one might argue that the codes are still pretty bad.

To see how well the codes work, we can compress a waveform into discrete codes, and then decode it back. We can see from this example that we are losing a lot of detail and quality:

I probably could have trained better codes, especially if I had abandoned normalizing flows. However, I had already invested about two weeks of GPU time into training the VQ codes for this project, and I wanted to move on to the fun part.

Training a transformer

Once we can encode audio into low-bitrate sequences of discrete codes, we are in a great position to train a Transformer. I opted to reuse my existing Transformer implementation with Rotary Embeddings, and it needed no real modifications to work for this project.
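For context, rotary embeddings encode position by rotating pairs of query/key channels by position-dependent angles. Here's a stripped-down sketch of that operation on a single vector; my actual implementation works on Honeycrisp tensors, so treat this as illustrative only.

```swift
import Foundation

// Stripped-down rotary position embedding (RoPE) applied to one vector:
// rotate each pair of channels by an angle that depends on the token position,
// with frequencies falling off geometrically across channel pairs.
func applyRotary(_ x: [Float], position: Int, base: Double = 10_000) -> [Float] {
    var out = x
    let halfDim = x.count / 2
    for i in 0..<halfDim {
        let theta = Double(position) / pow(base, Double(2 * i) / Double(x.count))
        let (c, s) = (Float(cos(theta)), Float(sin(theta)))
        let (a, b) = (x[2 * i], x[2 * i + 1])
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    }
    return out
}
```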

I only trained the Transformer for a few days, during which time I dropped the learning rate twice (as you can see from the learning curve). I found that training was ridiculously stable.

[Figure: a learning curve for training the Transformer on audio. The y-axis is loss, and the x-axis is training steps.]
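Dropping the learning rate a couple of times is just a manual step decay; conceptually it looks like the sketch below, though the drop points and decay factor here are placeholders rather than my actual values.

```swift
// Step-style schedule: reduce the learning rate after certain step counts.
// The drop points and the 0.3 factor are hypothetical, for illustration only.
func learningRate(step: Int, baseRate: Float = 1e-4) -> Float {
    var rate = baseRate
    for dropStep in [200_000, 400_000] where step >= dropStep {
        rate *= 0.3
    }
    return rate
}
```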

Early in training, the samples from the model sounded pretty hilarious. For example, here is the model attempting to say "the quick brown fox jumps over the lazy dog":

Even after training the model for a few days, I was surprised to find that it liked to put weird sounds at the end of its samples. For example:

To make a long story short, it turned out that every training example's caption ended with "\n\r". This was an artifact of my data preprocessing. When I was sampling from the model, I did not put these control sequences at the end of the prompt, and this caused the model to do weird things (like creating extra sounds after the prompt). To my relief, when I added "\n\r" to every prompt I fed to the model, the artifacts were gone.
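In code, the fix amounted to appending the same terminator to every prompt before feeding it to the model, along these lines (the function name is just for illustration):

```swift
// Every training caption ended with "\n\r" as a preprocessing artifact, so
// sampling prompts need the same terminator for the model to behave.
func formatPrompt(_ text: String) -> String {
    return text + "\n\r"
}

let prompt = formatPrompt("the quick brown fox jumps over the lazy dog")
```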

Successes and failure cases

The best way to get to know a model is to try it on a bunch of prompts. In this section, I'll walk through a few observations I've made about the model. If you want to try it yourself, you can follow the instructions on GitHub.

Surprisingly, the model can sometimes pronounce unusual (out of distribution) words, such as "DALL-E". Of course, when this fails, you could also use a phonetic spelling that might be easier for the model.

Interestingly, the model has no problem with English tongue twisters. However, I did notice that the second example occurs in the training corpus (although it is split in the middle across two different training examples).

There are a lot of numbers in the training data, but the model still hasn't quite mastered how to say large numbers. For example, it might start at the wrong order of magnitude, and then have to make up digits:

While the model is okay with traditional tongue twisters, it does have many tongue twisters of its own. For example, these are seemingly easy prompts where the model simply cannot say what you'd expect: