Frames and data representation
Much of the time, it seems that as long as we throw “enough” data at a neural network, the structure of that data doesn’t matter too much. At the beginning of the project, I spent a lot of time listening to and looking at waveforms of different frame and sequence lengths, trying to imagine what would make sense to learn from. But various people in the class seem to have used quite different lengths for both, with fairly consistent results. So I’m not sure how much my mind has changed about data representation being important, but next time I would definitely try to get to generating more audio, sooner.
I would also have liked to experiment with Fourier transforms of the data – one idea I had was to take the Fourier transform of each frame, plot it as an ‘image’ using the phase as one axis and the amplitude as the other, pass these images through a convnet, and use that as the input to the RNN at each timestep. I wasn’t sure how I would reconstruct audio from the outputs, though – perhaps the convnet filters could be constrained to be invertible, or maybe I could have done something like this.
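As a rough sketch of the representation I was imagining (the bin count, normalization, and histogram construction here are my own made-up choices, and the question of inverting it back to audio is left open):

```python
import numpy as np

def frame_to_polar_image(frame, n_bins=64):
    """Map one audio frame to a 2-D 'image': a histogram over
    (quantized phase, quantized amplitude) pairs of its spectrum."""
    spectrum = np.fft.rfft(frame)
    phase = np.angle(spectrum)                        # in [-pi, pi]
    amplitude = np.abs(spectrum)
    amplitude = amplitude / (amplitude.max() + 1e-8)  # normalize to [0, 1)

    # quantize each (phase, amplitude) pair into a pixel and count hits
    rows = ((phase + np.pi) / (2 * np.pi) * (n_bins - 1)).astype(int)
    cols = (amplitude * (n_bins - 1)).astype(int)
    image = np.zeros((n_bins, n_bins))
    np.add.at(image, (rows, cols), 1.0)
    return image

# a 1024-sample frame of a 440 Hz sine at 16 kHz
frame = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)
img = frame_to_polar_image(frame)                     # shape (64, 64)
```

Each frame becomes a 2-D histogram over (phase, amplitude) pairs, which a convnet could consume like any other image.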
I used the Blocks framework to build the network – I started out writing my own theano code, but realized I might not have time to implement any interesting models if I spent too much time implementing things like plotting and resuming models. In retrospect, I’m not sure how I feel about this decision. I think it was good to become familiar with the way Blocks does things – I spent a lot of time reading the source code and docs. Next time though, I would implement myself from scratch, maybe using pieces from Blocks that I found useful.
I’d also like to test with more layers, and different layer sizes, as well as experimenting with skip-connections. I think a big part of music is that patterns build and change throughout the song, and although that’s maybe not so evident in this pretty monotonous music, I think propagating information about input patterns to later timesteps could be important.
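To make the skip-connection idea concrete, here is a toy numpy sketch of one timestep – the dimensions, and the specific choice of feeding the raw input to every layer, are assumptions for illustration, not anything I actually ran:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 8, 16, 3  # input dim, hidden dim, number of stacked layers

# layer 0 sees x; deeper layers see [h_below, x] -- the skip-connection
weights = [(rng.normal(scale=0.1, size=(H, D)),
            rng.normal(scale=0.1, size=(H, H)))]
for _ in range(L - 1):
    weights.append((rng.normal(scale=0.1, size=(H, H + D)),
                    rng.normal(scale=0.1, size=(H, H))))

def step(x, hiddens):
    """One timestep of a stacked vanilla RNN with input skip-connections."""
    new_hiddens, below = [], x
    for i, (h, (W_in, W_rec)) in enumerate(zip(hiddens, weights)):
        layer_in = below if i == 0 else np.concatenate([below, x])
        new_hiddens.append(np.tanh(W_in @ layer_in + W_rec @ h))
        below = new_hiddens[-1]
    return new_hiddens

hiddens = [np.zeros(H) for _ in range(L)]
hiddens = step(rng.normal(size=D), hiddens)
```

The point is just that the upper layers get a direct view of the input, so information about input patterns doesn’t have to survive squashing through every layer below.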
I used a fairly large batch size, and I don’t think that this was a good idea. I would like to rerun some things with a batch size of 1 or 10 or something.
I also trained for a fixed number of epochs (30), and I don’t think that this was enough. The models were taking so long to train, though – I’d have to figure out what ‘enough’ is, first.
Lastly, I don’t think that mean squared error is quite the right cost to use. It implies that every sample in the frame is equally important to the frame sounding like a good frame, and I don’t think that’s necessarily true – much the way generative models produce noisy or blurred images when they’re trained to reconstruct directly in pixel space, I think that using MSE is part of the reason the audio sounds noisy.
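A small numpy illustration of why I suspect this: a slightly phase-shifted sine sounds identical to the target but has nonzero MSE, while white noise scaled to that exact same MSE sounds much worse – the cost can’t tell the two apart. (The frequency and shift here are arbitrary.)

```python
import numpy as np

t = np.arange(2048) / 16000.0
target = np.sin(2 * np.pi * 440 * t)

# a slightly phase-shifted sine: sounds the same, but MSE is nonzero
shifted = np.sin(2 * np.pi * 440 * t + 0.4)
mse_shifted = np.mean((shifted - target) ** 2)

# white noise scaled so the noisy signal has exactly the same MSE
rng = np.random.default_rng(0)
unit_noise = rng.normal(size=t.shape)
unit_noise /= np.sqrt(np.mean(unit_noise ** 2))  # normalize to unit RMS
noisy = target + unit_noise * np.sqrt(mse_shifted)
mse_noisy = np.mean((noisy - target) ** 2)

# mse_shifted and mse_noisy agree, but 'shifted' is the better frame
```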
I would have liked to explore the idea I mentioned in class, using a GAN where the discriminator has to discriminate between noise and a frame, but that will be for the future.
I would like to do something more intelligent than just concatenating the last-predicted frames to produce audio – simply averaging the overlapped predictions over a window might be a good first step, but ultimately this feels like a hack. Thinking about this the way a person might compose music, I would like something that could ‘revise’ its predictions – maybe a bidirectional RNN would have been a good approach to this, or perhaps the “denoising GAN” that I mentioned above would produce better samples over time.
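That windowed-averaging first step could look something like this (the hop size and the uniform averaging are assumptions; a tapered window would probably be the next refinement):

```python
import numpy as np

def overlap_average(frames, hop):
    """Combine overlapping predicted frames, averaging wherever
    two or more frames cover the same output sample."""
    frame_len = len(frames[0])
    out_len = hop * (len(frames) - 1) + frame_len
    acc = np.zeros(out_len)
    counts = np.zeros(out_len)
    for i, frame in enumerate(frames):
        acc[i * hop:i * hop + frame_len] += frame
        counts[i * hop:i * hop + frame_len] += 1
    return acc / counts

# two length-4 frames with hop 2: the overlapping middle is averaged
audio = overlap_average([np.ones(4), 3 * np.ones(4)], hop=2)
# audio == [1, 1, 2, 2, 3, 3]
```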