Audio generation time! This is the audio our generative models will learn from:
I’ve put music in quotes in the title because I remember my elementary school music teacher telling us that there are two basic components of music: rhythm and melody, and that melody is really optional – all you need to make a song is a beat. The spiritual ascension track we’re using for this project is somewhat lacking in that department.
My intuition is that the lack of simple structure actually makes this generative task harder in some ways than, say, generating speech (although it seems like it would be harder to make really good speech than really good spiritual ascension music). At the very least, I think it makes it harder to choose a sample length such that, from timestep to timestep, there’s some kind of real patterning to learn.
It’s hard to make a model that will learn to understand its input data better than you do.
Exploring, Part 1
So I started taking a look at the waveform of the track for various time intervals:
One second (16 000 frames):
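Slicing out intervals like this is straightforward once you know the sample rate (one second is 16 000 frames here, i.e. 16 kHz). A minimal sketch, using a synthetic waveform as a stand-in for the actual track (the real thing would be loaded from disk, e.g. with soundfile or scipy.io.wavfile):

```python
import numpy as np

SAMPLE_RATE = 16_000  # frames per second, matching the 16 000-frame second above

# Synthetic stand-in for the real track: a 90-second sine wave.
duration_s = 90
t = np.arange(duration_s * SAMPLE_RATE) / SAMPLE_RATE
audio = 0.5 * np.sin(2 * np.pi * 220 * t)

def segment(audio, start_s, length_s, sr=SAMPLE_RATE):
    """Slice out length_s seconds of audio starting at start_s seconds."""
    start = int(start_s * sr)
    return audio[start : start + int(length_s * sr)]

one_second = segment(audio, 30, 1)       # 16 000 frames
thirty_seconds = segment(audio, 30, 30)  # 480 000 frames
print(len(one_second), len(thirty_seconds))
```

Each segment can then be handed straight to a plotting library to get the waveform views at different time scales.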
Observations, Part 1:
- The audio doesn’t start right away (about 2000 frames in), and fades out starting about 20 000 frames before the end
- Not much structure/patterning until 10 seconds – after that (visible in the 30 seconds and the 1 minute plots) there’s a pretty clear “pulsation” every 110 000 frames (7 seconds) or so – about 9 per minute
- After a bit less than a minute, things change up – the next few minutes each have about 14 “pulsations” per minute
- By looking at some random minute-long segments, it looks like the “pulsations” vary in length/shape throughout the “song”, but there is consistently some pattern in the increase/decrease in amplitude
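One way to put a number on these “pulsations” without eyeballing plots is to compute a coarse amplitude envelope and autocorrelate it; the first strong peak past lag zero gives the repetition period. A sketch of that idea, again on a synthetic waveform with a built-in 7-second pulse (all parameter choices here, like the 0.1-second hop, are assumptions, not anything from the original analysis):

```python
import numpy as np

SAMPLE_RATE = 16_000

# Synthetic stand-in: a 200 Hz tone whose amplitude pulses every 7 seconds,
# mimicking the pattern described above.
duration_s = 60
t = np.arange(duration_s * SAMPLE_RATE) / SAMPLE_RATE
envelope = 0.6 + 0.4 * np.sin(2 * np.pi * t / 7.0)
audio = envelope * np.sin(2 * np.pi * 200 * t)

def pulsation_period(audio, sr=SAMPLE_RATE, hop=1600):
    """Estimate the dominant amplitude-pulsation period, in seconds."""
    # Crude envelope: RMS over non-overlapping hops (0.1 s each here).
    n = len(audio) // hop
    rms = np.sqrt(np.mean(audio[: n * hop].reshape(n, hop) ** 2, axis=1))
    rms = rms - rms.mean()
    # Autocorrelation of the envelope, keeping non-negative lags only.
    ac = np.correlate(rms, rms, mode="full")[n - 1 :]
    # Skip lags shorter than 1 s to avoid the trivial zero-lag peak.
    min_lag = int(1.0 * sr / hop)
    peak = min_lag + np.argmax(ac[min_lag:])
    return peak * hop / sr

print(round(pulsation_period(audio), 1))
```

On the real track this would give a per-segment period estimate, which is one way to confirm the shift from ~9 pulsations per minute to ~14 after the first minute.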