Making noise: first attempt

I have another couple of posts half-written that are more reflective about the structure of this data, and about what it means to use different batch sizes, input lengths, and sequence lengths.

I was thinking a lot about this kind of data representation question, and about what it would mean for the kinds of models I should use and how I should go about training. This is the kind of thing that is interesting to me, and I really have trouble seeing how the model is learning any structure from the kind of input/target data we’re giving it: batch splits that generate just a couple hundred or thousand training examples, on sections of audio that don’t share much patterning as far as I can see.

But I tend to get lost in details easily, and it seems like great audio is being generated by a lot of people in the class, so I decided I should be doing less reading and thinking, and more random-decision-making and bad-audio-generation!

In fact my decisions have been heavily conditioned on some of the great work of other students, particularly Chris, Melvin, & Ryan. I’ve also found Andrej Karpathy’s posts and code very helpful. My LSTM and my knowledge of Blocks/Fuel are based mostly on Mohammed’s very clear code.

In point form, this is what I’ve done so far:

  1. Data exploration: Used scipy.io.wavfile to read the wave file and matplotlib to do some visualizations
  2. Data preprocessing: Used numpy and raw Python/Theano to split the data 80:10:10 into train, test, and validation sets (about 140mil:17mil:17mil frames), and subtracted the mean and normalized to [-1,1] *
  3. Example creation: Cut up the data into examples as follows (I was going to use Bart’s transformer, but it doesn’t let you create overlapping examples.)**
    1. Examples with
      • window_shift of 1000 frames (i.e. about 140 000 training examples, and 17 000 each for test and validation)
      • x_length of 8000 frames (i.e. half a second of data per timestep)
      • seq_length of 25 (i.e. a sequence of 25 steps, 200 000 frames, about 12.5 seconds)
    2. Mini-batches of 100 examples
    3. Truncated BPTT of 100 timesteps
  4. Set up an LSTM (with tanh activations) using Blocks
  5. Attempted to train using squared-error loss
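
For reference, the squared-error cost in step 5 is just the mean squared difference between predicted and target frames. A minimal Theano sketch of that cost (the variable names and tensor shapes here are only illustrative, and in the real setup the predictions come out of the LSTM rather than being fed in as an input):

```python
import theano
import theano.tensor as tensor

# Illustrative shapes: (timesteps, batch, x_length)
targets = tensor.tensor3('targets')
predictions = tensor.tensor3('predictions')   # in the real model, the LSTM readout

# Mean squared error over every frame in the mini-batch
cost = tensor.sqr(predictions - targets).mean()
cost.name = 'squared_error'

compute_cost = theano.function([predictions, targets], cost)
```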

*I asked a question about this on the course website and made a blog post about it – how are people doing their mean subtraction and normalization? Before, or after the test/train/validation split? Is there a better way? Does it matter?

**I started making my own version of a Fuel transformer, but realized I don’t know anything about how it iterates, and that it may not be trivial to start the next example at an index other than the end of the current example. I made my own data slicer instead, similar to Chris’s, and fed the examples to Fuel afterwards.
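
Roughly, the slicer does something like the following (the function name and the next-window targets are just how I’m sketching it here; the default window_shift, x_length, and seq_length are the values listed above):

```python
import numpy as np

def iter_examples(data, x_length=8000, seq_length=25, window_shift=1000):
    """Yield overlapping (input, target) example pairs from a 1-D array of frames.

    Each example is a sequence of seq_length windows of x_length frames, and
    the target for each window is simply the next window. Consecutive examples
    start window_shift frames apart, so they overlap heavily; materializing all
    ~140 000 of them at once would blow up memory, hence a generator.
    """
    example_span = seq_length * x_length               # frames in one input sequence
    last_start = len(data) - example_span - x_length   # leave room for the final target
    for start in range(0, last_start + 1, window_shift):
        chunk = data[start:start + example_span + x_length]
        windows = chunk.reshape(seq_length + 1, x_length)
        yield windows[:-1], windows[1:]                 # (inputs, next-window targets)
```

Batching 100 of these pairs at a time gives the mini-batches of 100 examples from step 3.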

Sidenote – test/val preprocessing

I asked a question about this on the course website – how are people doing their mean subtraction and normalization? Before, or after the test/train/validation split?

Currently, I split first, then calculate the mean of the training data and subtract it from the train, test, and validation sets. I think this is the easiest and most statistically sound way to do it. If you use the mean of the whole dataset, you’re leaking information from the test/validation sets into training.

To normalize, I calculate the min and max for each of the training, test, and validation sets. The training data is rescaled based on the training min and max. For the test and val sets, I use the smallest min and largest max between each set and the training set (separately for test and val, of course).

Otherwise, if the test/val data happened to have values outside the training range, I would either be clipping them or not taking the training data into account at all. I can imagine cases where you wouldn’t want to do this (e.g. where you think your test/val data have some kind of bias), but I think it’s safe to assume here that the training, test, and validation data all come from the same distribution, since they’re from the same file.
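
In code, the whole procedure looks roughly like this (the function name and return layout are just for illustration; train, test, and val are the raw numpy arrays of frames):

```python
import numpy as np

def preprocess_splits(train, test, val):
    """Mean-subtract and rescale to [-1, 1] as described above.

    The mean comes from the training data only. Each split is rescaled using
    the widest of its own min/max and the training min/max, so test/val values
    outside the training range are neither clipped nor scaled without any
    reference to the training data.
    """
    train_mean = train.mean()
    train = train - train_mean
    test = test - train_mean
    val = val - train_mean

    def rescale(x, lo, hi):
        # Map [lo, hi] linearly onto [-1, 1]
        return 2.0 * (x - lo) / (hi - lo) - 1.0

    t_lo, t_hi = train.min(), train.max()
    train = rescale(train, t_lo, t_hi)
    # Test/val: smallest min and largest max between that set and training
    test = rescale(test, min(test.min(), t_lo), max(test.max(), t_hi))
    val = rescale(val, min(val.min(), t_lo), max(val.max(), t_hi))
    return train, test, val
```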

I wonder if there isn’t a better way to do this preprocessing though, particularly for the test and validation sets. Maybe something as simple as subtracting mean(mean_of_train, mean_of_test), if you thought your test data were biased? But it seems like in general there should be something more intelligent to do.

Spiritual ascension “music”: data exploration

Audio generation time! This is the audio our generative models will learn from:

I’ve put music in quotes in the title because I remember my elementary school music teacher telling us that there are two basic components of music: rhythm and melody, and that melody is really optional – all you need to make a song is a beat. The spiritual ascension track we’re using for this project is somewhat lacking in the beat department.

My intuition is that the lack of simple structure actually makes this generative task harder in some ways than, say, generating speech (although it seems like it would be harder to make really good speech than really good spiritual ascension music). At the least, I think it makes it harder to choose a good sample length such that there is some real patterning to learn from timestep to timestep.

 

Conjecture:

It’s hard to make a model that will learn to understand its input data better than you do.

Exploring, Part 1

So I started taking a look at the waveform of the track for various time intervals:
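
The plots below were made with something along these lines (the filename is a placeholder; the track is sampled at 16 000 frames per second):

```python
import scipy.io.wavfile as wav
import matplotlib.pyplot as plt

# Placeholder filename; wav.read returns the sample rate and the frame array
rate, data = wav.read('spiritual_ascension.wav')

def plot_interval(start_sec, length_sec, title):
    """Plot length_sec seconds of the waveform, starting start_sec seconds in."""
    start = int(start_sec * rate)
    end = start + int(length_sec * rate)
    plt.figure(figsize=(12, 3))
    plt.plot(data[start:end])
    plt.title(title)
    plt.show()

plot_interval(0, 1, 'One second (16 000 frames)')
plot_interval(0, 30, '30 seconds')
plot_interval(4 * 60, 60, 'Fifth minute')
plot_interval(len(data) / float(rate) - 60, 60, 'Last minute')
```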

[Waveform plots at increasing scales: one second (16 000 frames), 5 seconds, 10 seconds, 30 seconds, 1 minute, and 2 minutes from the start; the 2nd, 3rd, 5th, 10th, 35th, and 100th minutes; the 5th and 20th minutes before the end; and the last minute.]

Observations, Part 1:

  • The audio doesn’t start right away (about 2000 frames in), and fades out at the end (about 20 000 frames from the end)
  • Not much structure/patterning until about 10 seconds – after that (visible in the 30-second and 1-minute plots) there’s a pretty clear “pulsation” every 110 000 frames (about 7 seconds), i.e. roughly 9 per minute
  • After a bit less than a minute, things change up – the next few minutes each have about 14 “pulsations” per minute
  • Looking at some random minute-long segments, the “pulsations” seem to vary in length/shape throughout the “song”, but there is consistently some pattern of increase/decrease in amplitude