After working out a lot of bugs in my code, I regenerated the audio from the models I had trained previously, and also ran some new experiments based on other students’ work. Specifically, in data pre-processing I switched from normalizing to [-1, 1] to dividing by the standard deviation, as suggested by Melvin, and tried some very different sequence and frame sizes: (40, 40000) and (40000, 40).
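For reference, the two pre-processing options can be sketched roughly as follows in NumPy. This is only an illustration, not the actual training code: the function names are made up, and it assumes that normalizing to [-1, 1] means dividing by the peak absolute sample value.

```python
import numpy as np

def scale_to_unit_range(audio):
    """Divide by the peak absolute value so samples fall in [-1, 1].

    Assumption: this is what "normalizing to [-1, 1]" refers to.
    """
    audio = np.asarray(audio, dtype=np.float64)
    peak = np.max(np.abs(audio))
    # Leave an all-zero (silent) signal untouched to avoid dividing by zero.
    return audio / peak if peak > 0 else audio

def scale_by_std(audio):
    """Divide by the standard deviation so the result has unit std.

    The mean is not subtracted here; the signal is only rescaled.
    """
    audio = np.asarray(audio, dtype=np.float64)
    std = audio.std()
    return audio / std if std > 0 else audio
```

One practical difference: peak scaling is determined by the single loudest sample, while dividing by the standard deviation depends on the overall spread of the signal, so a few loud clicks affect it less.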
The sound I generate is mostly like this:
It pulsates but is very noisy. The forget gate initialization doesn’t appear to make much difference – the audio is almost identical (I used the same seed, for comparison).
I’m not sure which sounds better, (4000, 50) or (1000, 100): the first sounds like it would be a bit more melodic (if it weren’t mostly noise), while the second (shorter frame_length, longer sequence_length) seems to capture the changes over time better, presumably because of the longer sequence length.
The experiments I ran with much shorter frame lengths and longer sequences, (800, 8000), (40, 10000), and (100, 7000), are still running several days later. I’m not sure what I’m doing differently from other people in the class that makes them take so much longer.
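To make the two numbers in each pair concrete, here is a minimal sketch of how a waveform could be cut into training sequences. It assumes the pairs are ordered (frame_length, sequence_length) and that each training example is sequence_length consecutive frames of frame_length samples; `make_sequences` is a hypothetical helper, not the function used in these experiments.

```python
import numpy as np

def make_sequences(audio, frame_length, sequence_length):
    """Reshape a 1-D waveform into (n_sequences, sequence_length, frame_length).

    Trailing samples that do not fill a whole sequence are dropped.
    """
    audio = np.asarray(audio)
    samples_per_sequence = frame_length * sequence_length
    n_sequences = len(audio) // samples_per_sequence
    usable = audio[: n_sequences * samples_per_sequence]
    return usable.reshape(n_sequences, sequence_length, frame_length)

# Stand-in for a real recording: one million samples.
audio = np.arange(1_000_000)
print(make_sequences(audio, 4000, 50).shape)   # frame_length=4000, sequence_length=50
print(make_sequences(audio, 1000, 100).shape)  # frame_length=1000, sequence_length=100
```

Under these assumptions, a recurrent model takes one step per frame, so sequence_length is the number of time steps it is unrolled over, and frame_length is the size of the input at each step.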