A Parametric Model for Generative Audio Synthesis

Krishna Subramani & Alexandre D'Hooge & Preeti Rao

This notebook presents the latent space learned by our network and the sounds we synthesized from it.

Here is an example of the 2-dimensional latent space we obtained when training on brass and organ sounds from NSynth.

[Figure: latent space]

Note: All the sounds are sustained sounds, both those used for training and those sampled from the network.

CAUTION: Some of the sounds can be loud; adjust your volume accordingly!

First experiment:

We play sounds sampled from each cluster of the latent space, alongside similar sounds from the training set for comparison.

Brass sound from the training set
Brass sound sampled from the latent space
Organ sound from the training set
Organ sound sampled from the latent space
Transition from brass to organ sampled from the latent space
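
As a rough illustration of how a brass-to-organ transition could be obtained, the sketch below linearly interpolates between two points of a 2-dimensional latent space. This is only a minimal sketch, not the code behind the sounds above: the function name `interpolate_latent` and the two cluster coordinates are hypothetical, and the actual path used in our experiments may differ. Each point along the path would then be passed to the decoder to synthesize one audio frame.

```python
import numpy as np

def interpolate_latent(z_start, z_end, num_steps):
    """Return num_steps points evenly spaced on the line from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    # Interpolation weights from 0 (start) to 1 (end), shaped as a column
    # so they broadcast against the 2-D latent vectors.
    weights = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - weights) * z_start + weights * z_end

# Hypothetical cluster centres, chosen only for illustration.
z_brass = [-1.0, 0.5]
z_organ = [1.0, -0.5]

path = interpolate_latent(z_brass, z_organ, num_steps=5)
print(path)  # one 2-D latent point per row, from brass to organ
```

A straight line is the simplest choice of path; any other smooth curve between the two clusters could be sampled in the same way.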

Second experiment:

The network has been trained only on odd MIDI pitches, and we generate samples conditioned on even MIDI pitches. (The latent space is different from the previous one but has a very similar structure.)

Brass sound with MIDI pitch 69 = 440 Hz (seen during training)
Brass sound with MIDI pitch 68 ≈ 415 Hz (never seen before)
Organ sound with MIDI pitch 69 = 440 Hz (seen during training)
Organ sound with MIDI pitch 68 ≈ 415 Hz (never seen before)
Frequency sweep from MIDI 57 to MIDI 69 in the brass cluster
Frequency sweep from MIDI 57 to MIDI 69 in the organ cluster
Frequency sweep from MIDI 57 to MIDI 69 and from brass to organ
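
For reference, the MIDI pitches quoted above map to frequencies through the standard equal-temperament relation f = 440 · 2^((m − 69) / 12). The sketch below applies it to the pitches used here and builds a continuous pitch trajectory for a sweep from MIDI 57 to MIDI 69. The helper name `midi_to_hz` and the choice of 100 steps are assumptions made for illustration, not values taken from our experiments.

```python
import numpy as np

def midi_to_hz(midi_pitch):
    """Convert MIDI pitch number(s) to frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    midi_pitch = np.asarray(midi_pitch, dtype=float)
    return 440.0 * 2.0 ** ((midi_pitch - 69.0) / 12.0)

print(round(float(midi_to_hz(69)), 2))  # 440.0 (odd pitch, seen during training)
print(round(float(midi_to_hz(68)), 2))  # 415.3 (even pitch, never seen)

# Continuous sweep over one octave: fractional MIDI values from 57 to 69.
# The number of steps (100) is an arbitrary, illustrative choice.
sweep_midi = np.linspace(57.0, 69.0, num=100)
sweep_hz = midi_to_hz(sweep_midi)
print(sweep_hz[0], sweep_hz[-1])  # endpoints of the sweep: 220 Hz and 440 Hz
```

Because the sweep passes through fractional MIDI values, most of the pitches it visits lie between the pitches the network was conditioned on during training.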

Final note: The sampled sounds are, for now, of subpar quality. These are early results, intended only as a proof of concept for our network, which reduces the representation of each audio frame to just 2 dimensions. What we want to emphasize is how smooth the interpolation between the two sounds is, and that the network keeps a fairly consistent timbre even through a continuous frequency sweep.