Reflections and future directions


Frames and data representation

So much of the time, it seems that as long as we throw “enough” data at a neural network, the structure of that data doesn’t matter very much. At the beginning of the project, I spent a lot of time listening to and looking at waveforms of different frame and sequence lengths, trying to imagine what would make sense to learn from. But various people in the class seem to have used quite different lengths for both, with fairly consistent results. So I’m not sure my mind has changed much about data representation being important, but next time I would definitely try to get to generating more audio, sooner.

I would also have liked to experiment with Fourier transforms of the data – one idea I had was to take the Fourier transform of each frame, plot each frame as an ‘image’ using the phase as one axis and the amplitude as the other, pass these images through a convnet, and use that as the input to the RNN at each timestep. I wasn’t sure how I would reconstruct audio from the outputs though – perhaps the convnet filters could be constrained to be invertible, or maybe I could have done something like this.
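To make the representation part of that idea concrete, here’s a rough numpy sketch of what I had in mind (the function names are mine, and this covers only the amplitude/phase ‘image’, not the convnet):

```python
import numpy as np

def frames_to_fft_images(frames):
    """Turn each frame of raw audio into a 2-channel 'image' of
    amplitude and phase, one value per FFT bin.

    frames: array of shape (n_frames, frame_length)
    returns: array of shape (n_frames, 2, frame_length // 2 + 1)
    """
    spectrum = np.fft.rfft(frames, axis=-1)   # complex spectrum per frame
    amplitude = np.abs(spectrum)              # magnitude of each bin
    phase = np.angle(spectrum)                # phase of each bin
    return np.stack([amplitude, phase], axis=1)

def fft_images_to_frames(images):
    """Invert the representation back to raw audio. This is exact,
    since we kept both amplitude and phase."""
    amplitude, phase = images[:, 0], images[:, 1]
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.irfft(spectrum, axis=-1)
```

Since this transform is invertible on its own, the reconstruction problem above is really about inverting the convnet, not the Fourier step.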


Network implementation

I used the Blocks framework to build the network – I started out writing my own theano code, but realized I might not have time to implement any interesting models if I spent too much time implementing things like plotting and resuming models. In retrospect, I’m not sure how I feel about this decision. I think it was good to become familiar with the way Blocks does things – I spent a lot of time reading the source code and docs. Next time though, I would implement myself from scratch, maybe using pieces from Blocks that I found useful.

I’d also like to test with more layers, and different layer sizes, as well as experimenting with skip-connections. I think a big part of music is that patterns build and change throughout the song, and although that’s maybe not so evident in this pretty monotonous music, I think propagating information about input patterns to later timesteps could be important.



I used a fairly large batch size, and I don’t think that this was a good idea. I would like to rerun some things with a batch size of 1 or 10 or something.

I also trained for a fixed number of epochs (30), and I don’t think that this was enough. The models were taking so long to train, though – I’d have to figure out why that is, first.

Lastly, I don’t think that mean squared error is quite the right cost to use. It implies that every sample in the frame of audio is equally important to the frame sounding like a good frame, and I don’t think that this is necessarily true – in the same way that generative models produce noisy or blurred images when they’re trained to reconstruct directly in pixel space, I think that using MSE is part of the reason that the audio sounds noisy.

I would have liked to explore the idea I mentioned in class, using a GAN where the discriminator has to discriminate between noise and a frame, but that will be for the future.



I would like to do something more intelligent than just concatenating the last-predicted frames in order to produce audio – a simple averaging over a window in the overlapped predictions might be a good first step, but ultimately this feels like a hack. Thinking about this the way a person might compose music, I would like to do something that could ‘revise’ its predictions – maybe using a bidirectional RNN would have been a good approach to this, or perhaps the “denoising GAN” that I mentioned above would produce better samples over time.
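That simple averaging first step could look something like this (a hypothetical helper, not code I actually ran – it assumes consecutive predicted windows are shifted by a fixed hop):

```python
import numpy as np

def average_overlapped(predictions, hop):
    """Average overlapping predicted windows into one signal.

    predictions: array (n_windows, window_length); window i is assumed
                 to start at sample i * hop
    hop:         shift in samples between consecutive windows
    """
    n_windows, window_length = predictions.shape
    out = np.zeros(hop * (n_windows - 1) + window_length)
    counts = np.zeros_like(out)
    for i, window in enumerate(predictions):
        out[i * hop:i * hop + window_length] += window
        counts[i * hop:i * hop + window_length] += 1
    # each output sample is the mean over every window that covers it
    return out / counts
```

It’s still a hack in the sense that the model never gets to revise anything – it just smooths the seams.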



Sound generation summary

This is an overview of the process of sound generation that I used. You can see this implemented in the code in my github repo, and read some things I would like to do differently here.

Make a dataset from one long song

The raw audio is just a sequence of real numbers. We think of ‘true’ audio as being a continuous wave, and we take samples from that wave a certain number of times per second – these are the numbers in the raw audio file.

A frame is a sequence of samples. If this were next-character prediction for generating text, a frame would be like a ‘character’. It’s what we predict at each step, and so implicitly we are telling the model that a song is made of a sequence of frames sampled from a distribution over all the frames it’s seen in the training set. I tried frame sizes of 40, 50, 100, 800, 1000, 4000, and 40 000.

Ideally we might do next-step prediction over the whole song, but in practice we need to break the song up into sequences of frames for ease of training. The length of this sequence tells the model how long to consider time dependencies for. I used sequence lengths of 50, 100, 1000, 4000, and 10 000.
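The cutting itself is simple – a minimal numpy sketch (my own helper, not the exact code I used, and without the overlapping windows discussed later):

```python
import numpy as np

def make_examples(audio, frame_length, sequence_length):
    """Cut a 1-D audio array into examples of shape
    (sequence_length, frame_length), discarding any leftover tail."""
    example_length = frame_length * sequence_length
    n_examples = len(audio) // example_length
    trimmed = audio[:n_examples * example_length]
    return trimmed.reshape(n_examples, sequence_length, frame_length)
```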


Each example shown to the network is a sequence of frames – this example gets broken up into input and target, and the output of the network is a vector representing what it thought the target would be at each step.
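In code, the input/target split is just a shift by one frame (toy sketch):

```python
def split_input_target(example):
    """An example is a sequence of frames; the input is every frame but
    the last, and the target is the same sequence shifted by one frame,
    so that target[t] is the frame that follows input[t]."""
    return example[:-1], example[1:]
```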

The examples also get put in mini-batches – the network sees a mini-batch of examples before computing a gradient update. I used Fuel to batch and stream my data examples, and I used a mini-batch size of 128, and ran for 30 epochs.


Build and train the model

RNNs (Recurrent Neural Networks) are the de facto standard for modeling time-dependent data with a neural network. I had a couple of ideas about working with the full song, or with FFT data and convnets, but I wanted to start with something that I felt had more research behind it for my first project.

From reading a couple of articles about various RNN architectures (including this one), it seemed that GRUs (Gated Recurrent Units) perform comparably with LSTMs (Long Short-Term Memory [units]), but often train somewhat faster due to having fewer parameters. I did some very initial trials with an LSTM vs. a GRU and did not find a difference, so all subsequent trials used GRUs. However, I found a lot of bugs in my code, so it would be worth re-running to see if the results really are similar.

I used a 2-layer network with a couple hundred GRUs per layer (199, 256).

Given an input, as described above, the network makes a prediction for the next step at each frame in the input sequence. This is compared to the target, and the cost is calculated by mean squared error: the mean over the sequence of (prediction − target)².
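In numpy terms (a toy version of the cost, not my actual Theano graph):

```python
import numpy as np

def sequence_mse(prediction, target):
    """Mean squared error, averaged over every sample in the sequence.
    prediction and target have shape (sequence_length, frame_length)."""
    return np.mean((prediction - target) ** 2)
```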


Generate audio from a trained model

After training, we have a model that should be pretty good at outputting a prediction vector, given an input vector. In order to generate novel sound, we give a seed sequence as input to the network, get a prediction, and then give that prediction back to the network as input and repeat for as long as we want to create audio.
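A toy version of this loop (with `predict` standing in for the compiled Theano function; the sliding-window bookkeeping is my reconstruction of the process, not the exact code):

```python
import numpy as np

def generate(predict, seed, n_steps):
    """Iteratively feed predictions back in as input.

    predict: maps a (sequence_length, frame_length) input to a
             same-shaped prediction (one next-frame per position)
    seed:    (sequence_length, frame_length) array from the training set
    """
    sequence = seed.copy()
    generated_frames = []
    for _ in range(n_steps):
        prediction = predict(sequence)
        next_frame = prediction[-1]          # only the last frame is new
        generated_frames.append(next_frame)
        # drop the oldest frame, append the newly generated one
        sequence = np.concatenate([sequence[1:], next_frame[None, :]])
    # the seed is never included in the returned audio
    return np.concatenate(generated_frames)
```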

An important thing that we discussed in class is to cut the seed off of the generated data (obviously, the seed is not ‘generated’).


I did something very simplistic to construct an audio sequence from these overlapping predictions – just took the last frame in each and concatenated them. I generated a fixed length of 30 seconds.

Final tries – still mostly noise

After working out a lot of bugs in my code, I regenerated the audio from the models I had trained previously, and also ran some new experiments based on other students’ work. Specifically, in data pre-processing I switched from normalizing [-1,1] to dividing by the standard deviation as suggested by Melvin, and tried some very different sequence and frame sizes (40, 40000) and (40000, 40).

The sound I generate is mostly like this:

It pulsates but is very noisy. The forget gate initialization doesn’t appear to make much difference – the audio is almost identical (I used the same seed, for comparison).

I’m not sure which I think sounds better between the (4000,50) and the (1000, 100) – the first sounds like it would be a bit more melodic (if it weren’t mostly noise), while the second (shorter frame_length, longer sequence_length) sounds like it’s got the time changes better – presumably this is due to the longer sequence lengths.

The experiments I ran with much shorter frame lengths and longer sequences (800, 8000), (40, 10000), and (100, 7000) are still running, several days later …. I’m not sure what I’m doing differently from other people in the class that’s making them take so much longer.

Problems and solutions

Hokay. I went down a number of code rabbit holes last week, trying to figure out why the audio I was generating sounded so fast relative to the original.

I was also trying to find other bugs, clean up the code, and consolidate memories about what I did and how all the pieces work … I learned a _lot_ of new things for this project. I had been hoping last week to finish up the 2-layer LSTM/GRU, and move on to a more interesting model and at least get some preliminary results this weekend. But I discovered a whole bunch of problems with my code (and had some other things happen in real-life-land) and fixing them has taken longer than I anticipated.

I decided it was probably better to do this one model well rather than leave it half-finished and half-start something else. Ultimately, I think it’s been a lot more instructive to cause myself all these problems than it would have been to get things right the first time. I’ll do a separate blog post describing the model more in detail, and another one describing the things I would have tried if I had time. For now, here’s a summary of the issues I had, and how I solved (some of) them:

1. Audio too fast

Troubleshooting: This took me the longest to troubleshoot by far. I used tag.test values and stepped through the debugger, making sure all the tensors were the right shapes at the right times. I triple-checked my scaling, reshaping, and data pre-processing methods (and found another couple of unrelated bugs, described in (2)!), checked my hdf5 file creation, and went through Fuel’s code to see exactly what it was doing with that file to turn it into a datastream. Then I checked the code for generating the wav file, made sure it was int16 and that I was setting the sample rate correctly, and plotted the input, target, and prediction frames to make sure they were at the same scales. I also tried taking some of these frames and saving them at different sample rates to see if they sounded similar to my audio, and they didn’t, so I was almost at the point of thinking my models were just learning faster time signatures, as implausible as that seemed… but of course in the end it came down to something simple.

Solution: To generate, I select a random seed frame from the training data. When I checked this code initially, I just checked that the predictions at each step were the right size/shape/sample rate etc., and they were … but when I compiled the predictions I did something bad. I wanted to take the last (truly predicted) frame, but I made an indexing mistake, so instead of getting a frame, I got a whole example (sequence + frame). This got flattened in later code and cast as int16, so the audio was sped up by about sequence_length times (50–100) – a lot more than I expected, which is why the samples I saved didn’t sound similar (I only sped them up by at most 10 times in my little tests).
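A toy reconstruction of the mistake (the variable names are mine):

```python
import numpy as np

# One generation step returns a prediction of shape
# (sequence_length, frame_length): one predicted next-frame for every
# position in the input. Only the final row is genuinely new audio.
sequence_length, frame_length = 4, 3
predictions = np.arange(sequence_length * frame_length).reshape(
    sequence_length, frame_length)

whole_example = predictions.flatten()  # the buggy grab: seq_len * frame_len samples
last_frame = predictions[-1]           # the fix: just frame_length samples
```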

2. Training on too little data

Troubleshooting: I had a couple of lines of code which cut the raw audio so that it could be reshaped into examples (frame_length x sequence_length). I did something ‘cute’ with negative indices and mod that I thought was working, but it was not – it cut off way too much at the end. I didn’t look into it too much, just replaced it with something simpler, and double-checked the lengths this time.

Also, I realized that the loop for my window_shift was not working properly, so I was only going across the audio once (i.e. examples did not overlap at all, and there were relatively few of them).



num_examples = len(data) // example_length
leftover = len(data) % example_length
if leftover < frame_length:
    # not enough left over for the final target frame: drop one example
    data_to_use = data[:((num_examples - 1) * example_length) + frame_length]
else:
    data_to_use = data[:(num_examples * example_length) + frame_length]


shift = 0
while len(data[shift:]) >= example_length + frame_length:
    num_ex_this_pass = len(data[shift:]) // example_length
    data_to_use.extend(data[shift:shift + (num_ex_this_pass * example_length)])
    num_examples += num_ex_this_pass
    shift += example_shift
    data_to_use.extend(data[shift:shift + frame_length])

3. Including seed information in generated audio

This issue was discussed a lot in class. I had tried to be careful about removing the seed sequence from the audio I generated, but because of the problem described in (1), when I removed the first sequence-length of data, there were still many (sped-up) pieces of audio after it which had information from the seed. This means my previously generated audio samples are not reliable.

Solution: Fixed by the other fixes (but it was still worth looking into, to make sure I was doing this correctly).

4. Audio fades to silence

After fixing the bugs in my generation code, and re-generating audio from my trained models, I found that after a few seconds they only predict silence. Other students also had this problem.

Possible solutions:

  1. Forget-gate bias initialization: I read this paper, and heard about this issue from other students – if the forget gate in an LSTM/GRU is initialized to 0, it’s hard for it to ever learn to remember. This seems intuitively obvious, but is easy to overlook. I ran exactly the same models but with this being the only difference; we’ll see how that sounds.
  2. Shorter frame size, longer sequences: I’m training these models now, but it’s a lot slower (lots more data). We’ll see how it goes!


More noise

I ran some more experiments looking at frame size and sequence length and seeing how these affect the generated audio.

Specifically, I ran the following combinations of (frame_size, sequence_length):

  • 4000, 50
  • 1000, 50
  • 1000, 100
  • 4000, 100

4000, 50

1000, 50

1000, 100

4000, 100

In general, the problems I’m having seem to be the following:

  1. The predictions are very similar to the seed
  2. The audio fades out to noise as time goes on
  3. It’s too fast

I’m not sure how to describe this, but it sounds more like someone playing a fuzzy piano or steel drums than anything like the original “music”. It sounds kind of neat, but it’s definitely too quick. There must be a bug/downsample in the way I’m making the audio but I haven’t found where…

Making noise: success!

That title is very accurate – I’ve now succeeded in making pretty pure noise, although it does seem to have a liiitle bit of pulsation in it.

[UPDATE: Alex pointed out that doing mysequence.astype(‘int16’) did not actually modify the array in place, so my audio wasn’t really encoded properly. Proper version below]

I started using bokeh to do live-plotting of learning curves and thought I was saving them as pngs, but it turns out I was not, so I only have the first couple batches of the latest experiment I ran to put here for now:

[UPDATE: Here’s the training loss for the model described below]


But the good news is I’m now truly set up with Blocks, Fuel, and bokeh to run and log multiple experiments.

I’m currently using a 2-layer LSTM, with the following hyperparameters:

  • frame length: 4000 samples
  • sequence length: 50 frames
  • batch size: 128
  • hidden layer size: 200
  • learning rate: 0.002
  • epochs: 30

I’m also using gradient clipping. Most of these decisions are fairly arbitrary at the moment – I’ve just been working on getting plots and audio made so that now I can look at the details of my experiments.

The model trains and everything, but the audio generated has a lot of noise in it and does not sound very much like the vocal data it was trained on – I’m going to try a better overlapping to smooth out the transitions.


Making noise: first attempt

I have another couple posts half-written that are more reflective about the structure of this data, and what it means to take differently sized batches and input lengths and sequences.

I was thinking a lot about this kind of data representation stuff, and what it would mean for the kinds of models I should use and how I should go about training… This is the kind of thing that is interesting to me, and I really have trouble seeing how the model is learning any structure from the kinds of input/target data we’re giving it … batch splits that generate just a couple hundred or thousand training examples, on sections of audio that don’t share much patterning as far as I can see…

But I tend to get lost in details easily, and it seems like great audio is being generated by a lot of people in the class, so I decided I should be doing less reading and thinking, and more random-decision-making and bad-audio-generation!

In fact my decisions have been heavily conditioned on some of the great work of other students, particularly Chris, Melvin, & Ryan. I’ve also found Andrej Karpathy’s posts and code very helpful. My LSTM and knowledge of Blocks/Fuel is based mostly on Mohammed’s very clear code.

In point form, this is what I’ve done so far:

  1. Data exploration: Read the wave file and used matplotlib to do some visualizations
  2. Data preprocessing: Used numpy and raw Python/Theano to split the data into train, test, and validation sets at an 80:10:10 split (about 140mil:17mil:17mil frames), subtracted the mean, and normalized to [-1,1] *
  3. Example creation: Cut up the data into examples as follows (I was going to use Bart’s transformer, but it doesn’t let you create overlapping examples.)**
    1. Examples with
      • window_shift of 1000 frames (i.e. about 140 000 training, 17 000 each test and validation examples)
      • x_length of 8000 frames (i.e. a half a second of data per timestep)
      • seq_length 25 (i.e. a sequence of 25 steps, 200 000 frames, 12 seconds)
    2. Mini-batches of 100 examples
    3. Truncated BPTT of 100 timesteps
  4. Set up an LSTM (using Tanh) using Blocks
  5. Attempted to train using squared-error loss

*I asked a question about this on the course website and made a blog post about it – how are people doing their mean subtraction and normalization? Before, or after the test/train/validation split? Is there a better way? Does it matter?

**I started making my own version of a Fuel transformer, but realized I don’t know anything about how it iterates, and that it may not be trivial to start the next example at an index other than the end of your current example. I made my own data slicer instead, similar to Chris’s, and fed the examples to Fuel afterwards.



Sidenote – test/val preprocessing

I asked a question about this on the course website – how are people doing their mean subtraction and normalization? Before, or after the test/train/validation split?

Currently, I split first, then calculate the mean of the training data, and subtract the mean of the training data from the train, test, and validation sets. I think this is the easiest and most statistically-okay way to do this. If you use the mean of the whole dataset, you’re introducing information from the test/validation sets into training.

To normalize, I calculate the min and max for each of the training, test, and validation sets. The training data is rescaled based on the training min and max. For the test and val sets, I use the smallest min and largest max between each set and the training set (separately for test and val, of course).

Otherwise, if the test/val data happened to have values higher than the training data, I would clip them out, or would not be taking into account information from the training data. I can imagine cases where you wouldn’t want to do this (e.g. where you think your test/val sets have some kind of bias), but I think it’s safe to assume in this case that the training, test, and val data all come from the same distribution, since they’re from the same file.
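Putting the whole scheme together, a small numpy sketch (my reconstruction of what I just described, not the exact code I used):

```python
import numpy as np

def preprocess(train, test, val):
    """Subtract the training mean from every split, then rescale each
    split to [-1, 1] using the widest min/max between that split and
    the training data."""
    mean = train.mean()
    train, test, val = train - mean, test - mean, val - mean

    def rescale(split):
        # use the training min/max unless this split exceeds them
        lo = min(split.min(), train.min())
        hi = max(split.max(), train.max())
        return 2 * (split - lo) / (hi - lo) - 1

    return rescale(train), rescale(test), rescale(val)
```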

I wonder if there isn’t a better way to do this preprocessing though, particularly for the test and validation sets. Maybe something as simple as subtracting mean(mean_of_train, mean_of_test), if you thought your test data were biased? But it seems in general like there should be something more intelligent to do.

Spiritual ascension “music”: data exploration

Audio generation time! This is the audio our generative models will learn from:

I’ve put music in quotes in the title, because I remember my elementary school music teacher telling us that there are two basic components of music: rhythm and melody, and that really melody is optional – all you need to make a song is a beat. The spiritual ascension track that we’re using for this project is somewhat lacking in that department.

My intuition is that the lack of simple structure actually makes this generative task harder in some ways than say, generating speech (although it seems like it would be harder to make really good speech than really good spiritual ascension music). At the least, I think it makes it harder to choose a good sample length such that from timestep to timestep you have some kind of real patterning to learn.



It’s hard to make a model that will learn to understand its input data better than you do.

Exploring, Part 1

So I started taking a look at the waveform of the track for various time intervals:

One second: (16 000 frames)


5 seconds:


10 seconds:


30 seconds:


1 minute:


2 minutes:


2nd minute:


3rd minute:


5th minute:


10th minute:


35th minute:


100th minute:


-5th minute:


-20th minute:


Last minute:


Observations, Part 1:

  • The audio doesn’t start right away (about 2000 frames in), and fades out at the end (about 20 000 frames from the end)
  • Not much structure/patterning until 10 seconds – after that (visible in the 30 seconds and the 1 minute plots) there’s a pretty clear “pulsation” every 110 000 frames (7 seconds) or so – about 9 per minute
  • After a bit less than a minute, things change up – the next few minutes each have about 14 “pulsations” per minute
  • By looking at some random minute-long segments, it looks like the “pulsations” vary in length/shape throughout the “song”, but there is consistently some pattern in the increase/decrease in amplitude



Why binary stochastic units train slowly

In Geoffrey Hinton’s coursera lecture 9, he talks about Alex Graves adding noise to weights in an RNN for handwriting recognition (possibly this paper is a good reference for this?).

Hinton goes on to say that just adding noise in activations is a good regularizer, and talks about doing this in an MLP by making the units binary and stochastic on the forward pass, and then doing backprop as though we’d done the forward pass deterministically (as a usual logistic unit).

So you compute the logistic p, and treat that p as the probability of outputting a 1. In the forward pass, you make a random decision to output 1 or 0 depending on that probability p, but in the backward pass you use the real value of p.

\displaystyle p_i = \mathrm{sigmoid}(w_i^\top x + b_i)

Forward pass:

\displaystyle h_i \sim \mathrm{Bernoulli}(p_i)

Backward pass:

\displaystyle \frac{\partial C}{\partial p_i} \quad (\text{not } \frac{\partial C}{\partial h_i})
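A little numpy sketch of this forward/backward pair (my own toy version of the trick, not code from the lecture):

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(z):
    """Stochastic forward pass: fire with probability p = sigmoid(z)."""
    p = sigmoid(z)
    h = (rng.uniform(size=p.shape) < p).astype(float)
    return h, p

def backward(dC_dh, p):
    """Backward pass pretends the unit was a plain deterministic
    sigmoid: we multiply by the sigmoid derivative p(1-p) w.r.t. z,
    ignoring the sampling step (whose true derivative is zero)."""
    return dC_dh * p * (1 - p)
```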

This works pretty well as a regularizer (it increases error on the training set, but decreases error on the test set). However, it does train a lot slower. Yiulau asked in the lecture comments why that happens.


There are three main reasons:

  1. We’re adding stochasticity to the gradient of the update.
  2. By quantizing p to be 0 or 1, we’re losing information, and therefore have reduced capacity.
  3. The true gradient w.r.t. the activations is 0, so really, we should not be able to do anything (there should be no weight updates). But we just use the gradient as though the unit had been a sigmoid, and it seems to work pretty well.

Why is the true gradient 0? Consider \frac{\partial C}{\partial p_i} ; to get from \frac{\partial C}{\partial h_i} (from our forward pass) to \frac{\partial C}{\partial p_i} , we need to multiply by \frac{\partial h_i}{\partial p_i} . How did we get h_i from p_i again?

\displaystyle h_i = 1_{U<p_i}

Where U is sampled from the uniform distribution on [0,1]. The derivative of this function w.r.t. p_i is 0 (almost everywhere), so when we do the chain rule, we should be multiplying by this 0 in the second term:

\displaystyle \frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial h_i}\frac{\partial h_i}{\partial p_i}

And therefore we should never have any gradient. But we just ignore this \frac{\partial h_i}{\partial p_i} term, and we get a regularizer!

I suppose that this amounts to adding noise to the gradient, and I wonder if it has a specific form.