Hokay. I went down a number of code rabbit holes last week, trying to figure out why the audio I was generating sounded so fast relative to the original.
I was also trying to find other bugs, clean up the code, and consolidate memories about what I did and how all the pieces work … I learned a _lot_ of new things for this project. I had been hoping last week to finish up the 2-layer LSTM/GRU and move on to a more interesting model, and at least get some preliminary results this weekend. But I discovered a whole bunch of problems with my code (and had some other things happen in real-life-land), and fixing them has taken longer than I anticipated.
I decided it was probably better to do this one model well rather than leave it half-finished and half-start something else. Ultimately, I think it’s been a lot more instructive to cause myself all these problems than it would have been to get things right the first time. I’ll do a separate blog post describing the model in more detail, and another one describing the things I would have tried if I had time. For now, here’s a summary of the issues I had, and how I solved (some of) them:
1. Audio too fast
Troubleshooting: This took me the longest to troubleshoot by far. I used `tag.test_value`s and stepped through the debugger making sure all the tensors were the right shapes at the right times, and triple-checked my scaling, reshaping, and data pre-processing methods (and found another couple of unrelated bugs, described in (2)!). I also checked my hdf5 file creation, and went through Fuel’s code to see exactly what it was doing with that file to turn it into a datastream. Then I checked the code for generating the wav file, made sure it was int16 and that I was setting the sample rate correctly, and plotted the input, target, and prediction frames to make sure they were at the same scales … I also tried taking some of these frames and saving them at different sample rates to see if they sounded similar to my audio. They didn’t, so I was almost at the point of thinking my models were just learning faster time signatures, as implausible as that seemed … but of course in the end it came down to something simple.
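One sanity check that would have caught this much earlier (a hypothetical helper, not code from the project): the duration of a clip is just `num_samples / sample_rate`, so comparing the expected duration against the actual length of the concatenated output immediately exposes any blow-up in the number of samples.

```python
import numpy as np

def expected_duration(num_frames, frame_length, sample_rate):
    """Duration in seconds of `num_frames` frames of `frame_length` samples each."""
    return num_frames * frame_length / sample_rate

# hypothetical numbers: 100 generated frames of 160 samples at 16 kHz
frames = [np.zeros(160, dtype=np.int16) for _ in range(100)]
audio = np.concatenate(frames)

actual = len(audio) / 16000  # seconds of audio actually written
assert actual == expected_duration(100, 160, 16000)  # 1.0 second
```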
Solution: To generate, I select a random seed frame from the training data. When I checked this code initially, I just checked that the predictions at each step were the right size/shape/sample rate etc., and they were … but when I compiled the predictions I did something bad. I wanted to take the last (truly predicted) frame, but I made an indexing mistake, so instead of getting a frame, I got a whole example (sequence + frame). This got flattened in later code and cast as int16, so the audio was sped up by about sequence_length times (50–100). That is a lot more than I expected, and is why the samples I saved didn’t sound similar (I had only sped them up by at most 10 times in my little tests).
2. Training on too little data
Troubleshooting: I had a couple of lines of code which cut the raw audio so that it could be reshaped into examples (frame_length x sequence_length). I did something ‘cute’ with negative indices and mod that I thought was working, but it was not – it cut off way too much at the end. I didn’t look into it too much, just replaced it with something simpler, and double-checked the lengths this time.
Also, I realized that the loop for my window_shift was not working properly, so I was only going across the audio once (i.e. examples did not overlap at all, and there were relatively few of them).
```python
# The simpler cut: keep as many whole examples as fit, plus one
# extra frame at the end to serve as the final target.
num_examples = len(data) // example_length
leftover = len(data) % example_length
if leftover < frame_length:
    # not enough leftover for a target frame: drop the last example
    data_to_use = data[:((num_examples - 1) * example_length) + frame_length]
else:
    data_to_use = data[:(num_examples * example_length) + frame_length]

# The corrected window_shift loop: each pass shifts the start by
# example_shift and collects the overlapping examples that fit.
shift = 0
while len(data[shift:]) >= example_length + frame_length:
    num_ex_this_pass = len(data[shift:]) // example_length
    data_to_use.extend(data[shift:shift + (num_ex_this_pass * example_length)])
    num_examples += num_ex_this_pass
    shift += example_shift
    data_to_use.extend(data[shift:shift + frame_length])
```
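To convince myself the overlap actually matters, here is a tiny standalone sketch (all lengths hypothetical) comparing the number of examples from a single pass against the number collected with a shifting window:

```python
def count_examples(n, example_length, frame_length, example_shift):
    """Count examples gathered by repeatedly shifting the start by example_shift."""
    total, shift = 0, 0
    while n - shift >= example_length + frame_length:
        total += (n - shift) // example_length
        shift += example_shift
    return total

n = 10_000                          # hypothetical number of audio samples
example_length, frame_length = 500, 100

no_overlap = n // example_length    # a single pass: the original buggy behavior
overlap = count_examples(n, example_length, frame_length, example_shift=100)

assert no_overlap == 20
assert overlap > no_overlap         # many more training examples with overlap
```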
3. Including seed information in generated audio
This issue was discussed a lot in class. I had tried to be careful about removing the seed sequence from the audio I generated, but because of the problem described in (1), when I removed the first sequence-length of data, there were still many (sped up) pieces of audio after it which had information from the seed. This means my previously generated audio samples are not reliable.
Solution: Fixed by the other fixes (but it was still worth looking into, to make sure I was doing this part correctly).
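For reference, the seed-stripping itself is just a slice (names and sizes hypothetical); the subtlety in my case was that the generated array was already corrupted upstream, so slicing off the first sequence-length of data was not enough:

```python
import numpy as np

sequence_length, frame_length = 50, 160

seed = np.random.randn(sequence_length * frame_length)     # seed audio
new_frames = np.random.randn(200 * frame_length)           # 200 generated frames
generated = np.concatenate([seed, new_frames])

# drop the seed: everything after the first sequence_length frames
audio_out = generated[sequence_length * frame_length:]
assert audio_out.size == new_frames.size
```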
4. Audio goes silent
After fixing the bugs in my generation code, and re-generating audio from my trained models, I found that after a few seconds they only predict silence. Other students also had this problem.
- Forget-gate bias initialization: I read this paper, and heard about this issue from other students – if the forget gate’s bias in an LSTM/GRU is initialized to 0, it’s hard for the network to ever learn to remember. This seems intuitively obvious, but is easy to overlook. I ran exactly the same models with this being the only difference; we’ll see how that sounds.
- Shorter frame size, longer sequences: I’m training these models now, but it’s a lot slower (lots more data). We’ll see how it goes!
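The forget-gate fix in the first bullet can be sketched in plain numpy (the `[input, forget, cell, output]` bias layout is an assumed convention, not my framework’s actual code): the idea is simply to start the forget-gate bias at 1 instead of 0, so the gate initially lets the cell state through.

```python
import numpy as np

hidden_size = 128

# LSTM biases stored as one vector of four gate slices:
# [input, forget, cell, output] -- this ordering is an assumption
b = np.zeros(4 * hidden_size)

# set the forget-gate slice to 1 so sigmoid(b_f) ~ 0.73 at the start,
# i.e. the cell begins by remembering rather than forgetting
b[hidden_size:2 * hidden_size] = 1.0

forget_bias = b[hidden_size:2 * hidden_size]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

assert np.all(forget_bias == 1.0)
assert sigmoid(1.0) > 0.5  # gate biased toward remembering
```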