Reflections and future directions


Frames and data representation

It seems so much of the time that as long as we throw “enough” data at a neural network, the structure of that data doesn’t matter too much. At the beginning of the project, I spent a lot of time listening to and looking at waveforms of different frame and sequence lengths, trying to imagine what would make sense to learn from. But various people in the class seem to have used quite different lengths for both of these, with fairly consistent results. So I’m not sure how much my mind has changed about data representation being important, but next time I would definitely try to get to generating more audio, sooner.

I would also have liked to experiment with Fourier transforms of the data – one idea I had was to take the Fourier transform of each frame, plot it as an ‘image’ using the phase as one axis and the amplitude as the other axis, pass these images through a convnet, and use the result as the input to the RNN at each timestep. I wasn’t sure how I would reconstruct audio from the outputs though – perhaps the convnet filters could be constrained to be invertible, or maybe I could have done something like this.
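
As a rough sketch of the Fourier part of that idea (not something from the project itself), the snippet below splits one frame into amplitude and phase with NumPy and then inverts the transform. The 16 kHz sample rate, the 440 Hz test tone, and the frame length of 1000 are assumptions chosen only for illustration. It suggests that the amplitude/phase representation itself can be inverted; the open question would be whether a network can predict both well enough.

```python
import numpy as np

# Hypothetical frame: 1000 samples of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
frame_length = 1000
t = np.arange(frame_length) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t)

# Forward transform: one complex coefficient per frequency bin.
spectrum = np.fft.rfft(frame)
amplitude = np.abs(spectrum)   # magnitude of each bin
phase = np.angle(spectrum)     # phase of each bin, in radians

# Inverse: rebuild the complex spectrum, then return to the time domain.
rebuilt = np.fft.irfft(amplitude * np.exp(1j * phase), n=frame_length)
```

If only the amplitude were predicted, the phase would have to be estimated some other way before inverting.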


Network implementation

I used the Blocks framework to build the network – I started out writing my own Theano code, but realized I might not have time to implement any interesting models if I spent too much time implementing things like plotting and resuming models. In retrospect, I’m not sure how I feel about this decision. I think it was good to become familiar with the way Blocks does things – I spent a lot of time reading the source code and docs. Next time though, I would implement it myself from scratch, maybe using pieces from Blocks that I found useful.

I’d also like to test with more layers, and different layer sizes, as well as experimenting with skip-connections. I think a big part of music is that patterns build and change throughout the song, and although that’s maybe not so evident in this pretty monotonous music, I think propagating information about input patterns to later timesteps could be important.



I used a fairly large batch size, and I don’t think that this was a good idea. I would like to rerun some things with a batch size of 1 or 10 or something.

I also trained for a fixed number of epochs (30), and I don’t think that this was enough. The models were taking so long to train though – I’d have to figure out why that is, first.

Lastly, I don’t think that mean squared error is quite the right cost to use. It implies that every sample in the frame of audio is equally important to the frame sounding like a good frame, and I don’t think that this is necessarily true – sort of the way we see that generative models produce noisy or blurred images when they’re trained to reconstruct directly to pixel space, I think that using MSE is part of the reason that the audio sounds noisy.

I would have liked to explore the idea I mentioned in class, using a GAN where the discriminator has to discriminate between noise and a frame, but that will be for the future.



I would like to do something more intelligent than just concatenating the last-predicted frames in order to produce audio – a simple averaging over a window in the overlapped predictions might be a good first step, but ultimately this feels like a hack. Thinking about this the way a person might compose music, I would like to do something that could ‘revise’ its predictions – maybe using a bidirectional RNN would have been a good approach to this, or perhaps the “denoising GAN” that I mentioned above would produce better samples over time.
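
A minimal sketch of the "simple averaging" first step, under stated assumptions: the helper name `average_overlaps` is hypothetical, and it assumes that the k-th prediction has shape (sequence_length, frame_length) and covers frame positions k through k + sequence_length - 1. Every frame position is then the mean of all predictions that cover it.

```python
import numpy as np

def average_overlaps(predictions):
    """Average overlapping predictions into one flat audio array.

    Assumes predictions[k] has shape (sequence_length, frame_length) and
    covers frame positions k .. k + sequence_length - 1 (an assumption).
    """
    predictions = np.asarray(predictions, dtype=np.float64)
    num_steps, sequence_length, frame_length = predictions.shape
    total_frames = num_steps + sequence_length - 1
    summed = np.zeros((total_frames, frame_length))
    counts = np.zeros((total_frames, 1))
    for k in range(num_steps):
        # Accumulate this prediction over the positions it covers.
        summed[k:k + sequence_length] += predictions[k]
        counts[k:k + sequence_length] += 1
    # Every position is covered at least once, so counts is never zero.
    return (summed / counts).reshape(-1)
```

A weighted window (for example, trusting later frames in each prediction more) would be a small change to the accumulation step.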


Sound generation summary

This is an overview of the process of sound generation that I used. You can see this implemented in the code in my github repo, and read some things I would like to do differently here.

Make a dataset from one long song

The raw audio is just a sequence of real numbers. We think of ‘true’ audio as being a continuous wave, and we take samples from that wave a certain number of times per second – these are the numbers in the raw audio file.

A frame is a sequence of samples. If this were next-character prediction for generating text, a frame is like a ‘character’. It’s what we predict at each step, and so implicitly we are telling the model that a song is made of a sequence of frames sampled from a distribution over all frames it’s seen in the training set. I tried different frame sizes of 40, 50, 100, 800, 1000, 4000, and 40 000.

Ideally we might do next-step prediction over the whole song, but in practice we need to break the song up into sequences of frames for ease of training. The length of this sequence tells the model how long to consider time dependencies for. I used sequence lengths of 50, 100, 1000, 4000, and 10 000.

Each example shown to the network is a sequence of frames – this example gets broken up into input and target, and the output of the network is a vector representing what it thought the target would be at each step.
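
One way to picture the input/target split is the NumPy sketch below. It is an illustration rather than the code from the repo: the tiny sizes and the `np.arange` stand-in for audio are assumptions, and it assumes the target is simply the input shifted forward by one frame, which is why each example needs one extra frame of audio.

```python
import numpy as np

frame_length = 4      # samples per frame (tiny, for illustration only)
sequence_length = 3   # frames per example
audio = np.arange(40, dtype=np.float32)  # stand-in for raw audio samples

example_length = frame_length * sequence_length
# Each example needs one extra frame after it to serve as the final target.
num_examples = (len(audio) - frame_length) // example_length

inputs, targets = [], []
for i in range(num_examples):
    start = i * example_length
    chunk = audio[start:start + example_length + frame_length]
    frames = chunk.reshape(sequence_length + 1, frame_length)
    inputs.append(frames[:-1])   # frames 0 .. n-1
    targets.append(frames[1:])   # frames 1 .. n (shifted by one frame)

inputs = np.stack(inputs)    # (num_examples, sequence_length, frame_length)
targets = np.stack(targets)  # same shape as inputs
```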

The examples also get put in mini-batches – the network sees a mini-batch of examples before computing a gradient update. I used Fuel to batch and stream my data examples, and I used a mini-batch size of 128, and ran for 30 epochs.


Build and train the model

RNNs (Recurrent Neural Networks) are the de facto standard for modeling time-dependent data with a neural network. I had a couple ideas about working with the full song, or with FFT data and convnets, but I wanted to start with something that I felt had more research behind it for my first project.

From reading a couple articles about various RNN architectures (including this one), it seemed that GRUs (Gated Recurrent Units) perform comparably with LSTMs (Long Short Term Memory [units]), but often train somewhat faster due to having fewer parameters. I did some very initial trials with an LSTM vs GRU, and did not find a difference, so all subsequent trials used GRUs. However, I found a lot of bugs in my code, so it would be worth re-running to see if the results really are similar.

I used a 2-layer network with around a couple hundred GRUs per layer (199, 256).

Given an input, as described above, the network makes a prediction for the next step at each frame in the input vector. This is compared to the target, and cost calculated by mean squared error: mean_for_sequence[(prediction – target)^2].
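
Written out with NumPy, that cost might look like the sketch below. The function name `mse_cost` is hypothetical, and averaging over every sample in the sequence is an assumption about what "mean_for_sequence" covers.

```python
import numpy as np

def mse_cost(prediction, target):
    """Mean squared error over every sample in the sequence."""
    prediction = np.asarray(prediction, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.mean((prediction - target) ** 2)
```

Note that every sample contributes with the same weight, whatever its position in the frame.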


Generate audio from a trained model

After training, we have a model that should be pretty good at outputting a prediction vector, given an input vector. In order to generate novel sound, we give a seed sequence as input to the network, get a prediction, and then give that prediction back to the network as input and repeat for as long as we want to create audio.

An important thing that we discussed in class is to cut the seed off of the generated data (obviously, the seed is not ‘generated’).

I did something very simplistic to construct an audio sequence from these overlapping predictions – just took the last frame in each and concatenated them. I generated a fixed length of 30 seconds.
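
The generation loop described above can be sketched roughly as follows. Here `predict` is a hypothetical stand-in for the trained network (any function that maps an array of shape (sequence_length, frame_length) to a prediction of the same shape), and `generate` is an illustrative name, not a function from the repo.

```python
import numpy as np

def generate(predict, seed, num_frames):
    """Feed predictions back as input, keeping only the last frame of each.

    predict: hypothetical callable, (sequence_length, frame_length) -> same shape.
    seed:    starting input, shape (sequence_length, frame_length).
    """
    window = np.array(seed, dtype=np.float64)
    generated = []
    for _ in range(num_frames):
        prediction = np.asarray(predict(window), dtype=np.float64)
        # Only the last frame is beyond what the network was shown.
        generated.append(prediction[-1].copy())
        # The whole prediction becomes the next input.
        window = prediction
    # The seed is never appended, so it cannot leak into the output.
    return np.concatenate(generated)
```

Because the seed is only ever used as the first input and never appended, there is nothing to cut off afterwards in this version.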

Final tries – still mostly noise

After working out a lot of bugs in my code, I regenerated the audio from the models I had trained previously, and also ran some new experiments based on other students’ work. Specifically, in data pre-processing I switched from normalizing [-1,1] to dividing by the standard deviation as suggested by Melvin, and tried some very different sequence and frame sizes (40, 40000) and (40000, 40).
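
For comparison, the two pre-processing options could look like the sketch below. Both function names are hypothetical, and min-max scaling is only one assumed way of "normalizing [-1,1]"; the original code may have done it differently.

```python
import numpy as np

def scale_to_unit_range(audio):
    """Min-max scale so the smallest sample is -1 and the largest is 1."""
    audio = np.asarray(audio, dtype=np.float64)
    lo, hi = audio.min(), audio.max()
    return 2.0 * (audio - lo) / (hi - lo) - 1.0

def scale_by_std(audio):
    """Divide by the standard deviation; values are not bounded to [-1, 1]."""
    audio = np.asarray(audio, dtype=np.float64)
    return audio / audio.std()
```

Either way, the scaling has to be undone before writing generated samples back out as int16, and both functions would divide by zero on constant (silent) audio.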

The sound I generate is mostly like this:

It pulsates but is very noisy. The forget gate initialization doesn’t appear to make much difference – the audio is almost identical (I used the same seed, for comparison).

I’m not sure which I think sounds better between the (4000,50) and the (1000, 100) – the first sounds like it would be a bit more melodic (if it weren’t mostly noise), while the second (shorter frame_length, longer sequence_length) sounds like it’s got the time changes better – presumably this is due to the longer sequence lengths.

The experiments I ran with much shorter frame lengths and longer sequences (800, 8000), (40, 10000), and (100, 7000) are still running, several days later …. I’m not sure what I’m doing differently from other people in the class that’s making them take so much longer.

Problems and solutions

Hokay. I went down a number of code rabbit holes last week, trying to figure out why the audio I was generating sounded so fast relative to the original.

I was also trying to find other bugs, clean up the code, and consolidate memories about what I did and how all the pieces work … I learned a _lot_ of new things for this project. I had been hoping last week to finish up the 2-layer LSTM/GRU, and move on to a more interesting model and at least get some preliminary results this weekend. But I discovered a whole bunch of problems with my code (and had some other things happen in real-life-land) and fixing them has taken longer than I anticipated.

I decided it was probably better to do this one model well rather than leave it half-finished and half-start something else. Ultimately, I think it’s been a lot more instructive to cause myself all these problems than it would have been to get things right the first time. I’ll do a separate blog post describing the model more in detail, and another one describing the things I would have tried if I had time. For now, here’s a summary of the issues I had, and how I solved (some of) them:

1. Audio too fast

Troubleshooting: This took me the longest to troubleshoot by far. I used tag.test values and stepped through the debugger making sure all the tensors were the right shapes at the right times, triple-checked my scaling, reshaping and data pre-processing methods (and found another couple unrelated bugs, described in (2)!), also checked my hdf5 file creation, and went through Fuel’s code to see exactly what it was doing with that file to turn it into a datastream. Then I checked the code for generating the wav file, made sure it was int16 and I was setting the sample rate correctly, plotted the input, target, and prediction frames to make sure they were at the same scales … I also tried taking some of these frames and saving them at different sample rates to see if they sounded similar to my audio, and they didn’t, so I was almost at the point of thinking my models were just learning faster time signatures, as implausible as that seemed… but of course in the end it came down to something simple.

Solution: To generate, I select a random seed frame from the training data. When I checked this code initially, I just checked that the predictions at each step were the right size/shape/sample rate etc., and they were … but when I compiled the predictions I did something bad. I wanted to take the last (truly predicted) frame, but I made an indexing mistake, so instead of getting a frame, I got a whole example (sequence + frame). This got flattened in later code and cast as int16, so was sped up by about sequence_length number of times (50-100) – this is a lot more than I expected, and is why the samples I saved didn’t sound similar (I only sped it up by at most 10 times in my little tests).
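
To make the size of that mistake concrete, here is a small illustration with a zero array standing in for one network output. The shapes (50 frames of 4000 samples) are taken from the hyperparameters mentioned elsewhere in these posts; the variable names are hypothetical.

```python
import numpy as np

sequence_length, frame_length = 50, 4000
# Stand-in for one prediction from the network.
prediction = np.zeros((sequence_length, frame_length))

last_frame = prediction[-1]             # intended: one frame per step
whole_example = prediction.reshape(-1)  # what a wrong index yields once flattened

# How many times more samples get appended per generation step.
ratio = whole_example.size // last_frame.size
```

A cheap guard would be to assert that each piece appended to the output has exactly frame_length samples.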

2. Training on too little data

Troubleshooting: I had a couple lines of code which cut the raw audio such that it could be reshaped into examples (frame_length x sequence_length). I did something ‘cute’ with negative indices and mod that I thought was working, but it was not – it cut off way too much at the end. I didn’t look into it too much, just replaced it with something simpler, and double-checked the lengths this time.

Also, I realized that the loop for my window_shift was not working properly, so I was only going across the audio once (i.e. examples did not overlap at all, and there were relatively few of them).



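# Keep only as much audio as fits whole examples, plus one extra frame for the final target.
num_examples = len(data) // example_length
leftover = len(data) % example_length
if leftover < frame_length:
    # Not enough leftover audio for the extra frame, so drop the last example.
    data_to_use = data[:((num_examples-1)*example_length)+frame_length]
else:
    data_to_use = data[:(num_examples*example_length)+frame_length]


# Overlapping passes: start each pass example_shift samples later than the previous one.
# (data_to_use must be a list here for .extend to work.)
shift = 0
while len(data[shift:]) >= example_length + frame_length:
    num_ex_this_pass = len(data[shift:]) // example_length
    data_to_use.extend( data[shift:shift+(num_ex_this_pass*example_length)] )
    num_examples += num_ex_this_pass
    shift += example_shift
data_to_use.extend( data[shift:shift+frame_length] )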

3. Including seed information in generated audio

This issue was discussed a lot in class. I had tried to be careful about removing the seed sequence from the audio I generated, but because of the problem described in (1),  when I removed the first sequence-length of data, there were still many (sped up) pieces of audio after it which had information from the seed. This means my previously generated audio samples are not reliable.

Solution: Fixed by other fixes (but was still worth looking into to make sure I was doing this correctly).

4. Audio is not

After fixing the bugs in my generation code, and re-generating audio from my trained models, I found that after a few seconds they only predict silence. Other students also had this problem.

Possible solutions:

  1. Forget-gate bias initialization: I read this paper, and heard about this issue from other students – if the forget gate in an LSTM/GRU is initialized to 0, it’s hard for it to ever learn to remember. This seems intuitively obvious, but is easy to overlook. I ran exactly the same models but with this being the only difference; we’ll see how that sounds.
  2. Shorter frame size, longer sequences: I’m training these models now, but it’s a lot slower (lots more data). We’ll see how it goes!
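
A back-of-the-envelope illustration of the first point, as a simplification rather than a model of the actual network: if the forget gate sits at sigmoid(bias) and new inputs are ignored, a value stored in the cell state shrinks by that factor at every step. The function names are hypothetical, and this is the LSTM picture; gate conventions in a GRU differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def retained_fraction(forget_bias, num_steps):
    """Fraction of a cell-state value left after num_steps, assuming the
    forget gate stays at sigmoid(forget_bias) and inputs are ignored."""
    return sigmoid(forget_bias) ** num_steps
```

With a bias of 0 the gate starts at 0.5, so half of the stored value is lost at every step; a positive bias starts the gate closer to 1 and the decay over a 50-frame sequence is much slower.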


More noise

I ran some more experiments looking at frame size and sequence length and seeing how these affect the generated audio.

Specifically, I ran the following combinations of (frame_size, sequence length):

  • 4000, 50
  • 1000, 50
  • 1000, 100
  • 4000, 100

In general, the problems I’m having seem to be the following:

  1. The predictions are very similar to the seed
  2. The audio fades out to noise as time goes on
  3. It’s too fast

I’m not sure how to describe this, but it sounds more like someone playing a fuzzy piano or steel drums than anything like the original “music”. It sounds kind of neat, but it’s definitely too quick. There must be a bug/downsample in the way I’m making the audio but I haven’t found where…

Making noise: success!

That title is very accurate – I’ve now succeeded in making pretty pure noise, although it does seem to have a liiitle bit of pulsation in it.

[UPDATE: Alex pointed out that doing mysequence.astype(‘int16’) did not actually modify the array in place, so my audio wasn’t really encoded properly. Proper version below]
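
The behaviour Alex pointed out is easy to reproduce in a few lines. The array values below are made up for illustration; the point is that NumPy's `astype` returns a new array and leaves the original untouched.

```python
import numpy as np

mysequence = np.array([0.5, -1.25, 3.9])

mysequence.astype('int16')              # result is discarded; mysequence is unchanged
unchanged_dtype = mysequence.dtype      # still float64

converted = mysequence.astype('int16')  # keep the returned copy instead
```

Note that the conversion also truncates toward zero, so samples need to be rescaled to the int16 range before casting.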

I started using bokeh to do live-plotting of learning curves and thought I was saving them as pngs, but it turns out I was not, so I only have the first couple batches of the latest experiment I ran to put here for now:

[UPDATE: Here’s the training loss for the model described below]


But the good news is I’m now truly set up with Blocks, Fuel, and bokeh to run and log multiple experiments.

I’m currently using a 2-layer LSTM, with the following hyperparameters:

  • frame length: 4000 samples
  • sequence length: 50 frames
  • batch size: 128
  • hidden layer size: 200
  • learning rate: 0.002
  • epochs: 30

I’m also using gradient clipping. Most of these decisions are fairly arbitrary at the moment – I’ve just been working on getting plots and audio made so that now I can look at the details of my experiments.
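
The post doesn't say which kind of clipping was used, so the sketch below shows one common variant as an assumption: rescaling all gradients together when their combined L2 norm exceeds a threshold. The function name `clip_by_global_norm` and the threshold value are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed threshold; smaller gradients pass through unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= threshold:
        return [g.copy() for g in grads]
    scale = threshold / total_norm
    return [g * scale for g in grads]
```

Rescaling keeps the direction of the update and only limits its size, which is why it is often preferred over clipping each element separately.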

The model trains and everything, but the audio generated has a lot of noise in it and does not sound very much like the vocal data it was trained on – I’m going to try a better overlapping to smooth out the transitions.