Make a dataset from one long song
The raw audio is just a sequence of real numbers. We think of ‘true’ audio as being a continuous wave, and we take samples from that wave a fixed number of times per second (the sampling rate) – these samples are the numbers in the raw audio file.
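To make this concrete, here is a minimal numpy sketch of sampling a continuous wave; the 44.1 kHz rate and the 440 Hz sine are my own illustrative choices, not from the project:

```python
import numpy as np

SAMPLE_RATE = 44100  # samples per second (assumed; standard CD-quality rate)

# Sample one second of a 440 Hz sine wave standing in for the 'true' continuous audio
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # sample times in seconds
raw_audio = np.sin(2 * np.pi * 440 * t)    # the raw audio: a plain sequence of real numbers

print(raw_audio.shape)  # (44100,) – one second of audio
```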
A frame is a sequence of samples. If this were next-character prediction for generating text, a frame would be like a ‘character’: it’s what we predict at each step, and so implicitly we are telling the model that a song is a sequence of frames sampled from a distribution over all frames it has seen in the training set. I tried frame sizes of 40, 50, 100, 800, 1000, 4000, and 40,000.
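Framing the raw samples can be sketched like this (frame size 800 is one of the values above; trimming to a whole number of frames is my assumption about how the leftovers are handled):

```python
import numpy as np

def make_frames(raw_audio, frame_size):
    """Split a 1-D array of samples into consecutive, non-overlapping frames."""
    n_frames = len(raw_audio) // frame_size
    # Drop any trailing samples that don't fill a whole frame
    return raw_audio[:n_frames * frame_size].reshape(n_frames, frame_size)

samples = np.random.randn(44100)   # one second of fake audio
frames = make_frames(samples, 800)
print(frames.shape)                # (55, 800): 55 frames of 800 samples each
```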
Ideally we might do next-step prediction over the whole song, but in practice we need to break the song up into sequences of frames for ease of training. The length of these sequences determines how far back in time the model can track dependencies. I used sequence lengths of 50, 100, 1000, 4000, and 10,000.
Each example shown to the network is a sequence of frames. The example gets split into an input and a target, and the output of the network is a vector representing its prediction of the target at each step.
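The input/target split for one example looks roughly like this (shifting by one frame, the standard next-step setup; the variable names are mine):

```python
import numpy as np

seq_len, frame_size = 100, 800
example = np.random.randn(seq_len + 1, frame_size)  # one example: seq_len + 1 frames

inputs = example[:-1]   # frames 0 .. seq_len-1
targets = example[1:]   # frames 1 .. seq_len: each target is the next frame

print(inputs.shape, targets.shape)  # (100, 800) (100, 800)
```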
The examples also get put into mini-batches – the network sees a mini-batch of examples before computing each gradient update. I used Fuel to batch and stream my data examples, with a mini-batch size of 128, and ran for 30 epochs.
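In place of the Fuel pipeline, here is a generic sketch of mini-batching (batch size 128 as above; the shuffling and the dropping of the last partial batch are my assumptions):

```python
import numpy as np

def iterate_minibatches(examples, batch_size=128, rng=None):
    """Yield shuffled mini-batches, dropping the last partial batch."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(len(examples))
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[order[start:start + batch_size]]

examples = np.random.randn(1000, 10).astype(np.float32)  # 1000 toy examples
batches = list(iterate_minibatches(examples, batch_size=128))
print(len(batches), batches[0].shape)  # 7 full batches of shape (128, 10)
```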
Build and train the model
RNNs (Recurrent Neural Networks) are the de facto standard for modeling time-dependent data with a neural network. I had a couple of ideas about working with the full song, or with FFT data and convnets, but for my first project I wanted to start with something that had more research behind it.
From reading a couple of articles about various RNN architectures (including this one), it seemed that GRUs (Gated Recurrent Units) perform comparably to LSTMs (Long Short-Term Memory units), but often train somewhat faster because they have fewer parameters. I ran some initial trials comparing an LSTM against a GRU and did not find a difference, so all subsequent trials used GRUs. However, I later found a lot of bugs in my code, so it would be worth re-running the comparison to see whether the results really are similar.
I used a 2-layer network with around a couple hundred GRUs per layer (199, 256).
Given an input, as described above, the network makes a prediction for the next step at each frame in the input sequence. This is compared to the target, and the cost is computed as the mean squared error over the sequence: mean_over_sequence[(prediction − target)^2].
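The cost above can be written directly in numpy (a sketch – the real model computed this inside the training framework):

```python
import numpy as np

def sequence_mse(prediction, target):
    """Mean squared error, averaged over every step and sample in the sequence."""
    return np.mean((prediction - target) ** 2)

pred = np.zeros((100, 800))   # toy prediction: all zeros
targ = np.ones((100, 800))    # toy target: all ones
print(sequence_mse(pred, targ))  # 1.0
```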
Generate audio from a trained model
After training, we have a model that should be fairly good at producing a prediction vector given an input vector. To generate novel sound, we feed a seed sequence to the network, take its prediction, feed that prediction back in as the next input, and repeat for as long as we want to generate audio.
An important point we discussed in class is to cut the seed off the front of the generated data (the seed itself is obviously not ‘generated’).
I used a very simplistic method to construct an audio sequence from these overlapping predictions: I took the last frame of each prediction and concatenated them. I generated a fixed length of 30 seconds of audio.
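The generation loop described in the last few paragraphs can be sketched as follows; `model` here is a stand-in for the trained network (anything that maps a (seq_len, frame_size) array to a same-shape array of next-step predictions), not the actual GRU:

```python
import numpy as np

def generate_audio(model, seed, n_steps):
    """Autoregressive generation: feed each prediction back in as input.

    Returns only the generated frames, concatenated into one flat audio
    signal – the seed is excluded, since it is not 'generated'.
    """
    sequence = seed.copy()
    generated = []
    for _ in range(n_steps):
        prediction = model(sequence)
        new_frame = prediction[-1]          # last frame of the prediction
        generated.append(new_frame)
        # Slide the window: drop the oldest frame, append the new one
        sequence = np.vstack([sequence[1:], new_frame[None, :]])
    return np.concatenate(generated)

# Toy stand-in model: predicts each input frame unchanged
identity_model = lambda seq: seq
seed = np.random.randn(50, 800)                       # 50 frames of 800 samples
audio = generate_audio(identity_model, seed, n_steps=100)
print(audio.shape)  # (80000,): 100 frames of 800 samples each
```

For a fixed 30-second output, `n_steps` would be chosen as `30 * sample_rate / frame_size`, rounded to an integer.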