Bird Sound Generator

The Generative Adversarial Network (GAN) is a recent framework proposed to improve the performance of generative models by introducing adversarial training. For this, two different Artificial Neural Networks are trained simultaneously: a Generator, G, and a Discriminator, D. The Generator learns the distribution of the training data and tries to generate new samples from a random noise input, as similar as possible to the original samples. At the same time, the Discriminator tries to distinguish between real and generated samples. In short, a GAN is a two-player game between the two networks.

A GAN is trained by alternately passing the real samples and the generated samples through the Discriminator and computing the classification loss. The Discriminator is then updated by backpropagation to minimize the classification loss on all samples, and the Generator is updated to maximize the Discriminator’s loss on the generated samples, that is, it is updated to generate better and better samples, increasingly similar to the real ones.
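Formally, this game corresponds to the minimax objective introduced by Goodfellow et al. (2014), where D(x) is the Discriminator’s estimated probability that x is real and G(z) is the Generator’s output for the noise vector z:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$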

The most common applications of GANs involve image data, and these models are currently capable of generating unbelievably realistic, high-quality pictures. Moreover, style transfer results became spectacular with the use of Cycle-GANs, a more recent variant of these generative models. Finally, integrating GANs with Natural Language Processing (NLP) methods even made it possible for computers to create images just by reading a text label. Sounds like science fiction, right? Look at the examples below!

In the first image we can see hyper-realistic pictures of imaginary celebrities, entirely generated by a GAN. The second is an example of style transfer done by a Cycle-GAN, allowing toggling between Monet-like paintings and photos, and the last example shows the potential of fusing NLP with GANs.


GANs and Sounds

Having some experience with GANs and images (check my Master Thesis on this page), I thought it would be cool to try the capabilities of these models with sound data, that is, simple 1D arrays. I’ve seen some cool experiments like this, so why not do it myself?

After some research, I decided to use the British Bird Song Dataset from Kaggle to get the sound data, and Tensorflow as the main library for deep learning.

First of all, these were all the imports needed to run this project:

import tensorflow as tf
# allow GPU memory to grow as needed instead of pre-allocating it all
for g in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(g, True)
import numpy as np
import os
import soundfile as sf
import glob
import random
import sys
import time
import pandas as pd
import matplotlib.pyplot as plt

Dataset

The dataset is composed of 264 sound files (.flac extension) with different durations and bird songs. To open the files as arrays in Python I used the soundfile library.

Example of a sound file from the dataset
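As a quick sanity check, here is a minimal sketch of how a single file can be opened with soundfile (the filename is just a placeholder, not an actual file from the dataset):

import soundfile as sf

# read one FLAC file into a float32 NumPy array together with its sample rate
data, samplerate = sf.read('dataset/songs/example.flac', dtype='float32')
print(data.shape, samplerate)  # a 1D array of samples and a rate such as 44100 Hz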

Loading and preparing the data

Now we have to load and prepare the data! The files have different sizes and are too big to be fed to a simple Neural Network, so let’s also create a function to split each file into chunks of roughly 5 seconds:

split_size = 5  # target chunk duration in seconds

def chunks(x, n):
    # split array x into consecutive pieces of length n, dropping the short tail
    n = max(1, n)
    return [x[i:i+n] for i in range(0, len(x), n) if len(x[i:i+n]) == n]

def load_dataset():
    sounds = []
    rates = []
    min_chunk_size = float('inf')
    for filename in glob.glob('dataset/songs/*.flac'):
        data, samplerate = sf.read(filename, dtype='float32')
        seconds = len(data)/samplerate
        num_chunks = int(seconds/split_size)
        if num_chunks > 0:
            chunk_size = int(len(data) / num_chunks)
            if chunk_size < min_chunk_size:
                min_chunk_size = chunk_size
            splitted = chunks(data, chunk_size)
            splitted = [np.array(c) for c in splitted]
            sounds.extend(splitted)
            rates.extend([samplerate]*len(splitted))
    # crop every chunk to the length of the smallest one
    sounds = [s[:min_chunk_size] for s in sounds]
    return sounds, rates, min_chunk_size

After loading all files and splitting them into smaller pieces, we end up with a list of 3414 arrays, each of which is then cropped to match the size of the smallest one. That way all elements of the list have the same length and correspond to a few seconds of bird sounds.
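For illustration, calling the function looks like this (the printed numbers are only indicative of what to expect, not exact values):

sounds, rates, min_chunk_size = load_dataset()

print(len(sounds))        # number of chunks, around 3414
print(min_chunk_size)     # number of samples every chunk is cropped to
print(sounds[0].shape)    # (min_chunk_size,)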

Building the models

Both the Generator and the Discriminator are simple Multi Layer Perceptrons (MLPs), that is, Artificial Neural Networks with only dense layers of neurons. The loss function used for both is the Binary Cross Entropy Loss and the optimizer is the Adam Optimizer.

The Generator receives an input array of 500 elements (input noise) and has a total of 4 layers. The first comprises 256 neurons, the second 512 neurons and the third 1024 neurons, all followed by a Leaky ReLU activation function. This activation function is very similar to the ReLU but with a small negative slope, allowing neurons with negative output to activate and recover instead of being stuck at a 0 activation. The final layer has a number of neurons equal to the size of the dataset sound arrays (so the input of the Discriminator always has the same dimension) and Tanh as activation function.
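For reference, the Leaky ReLU used here (with a negative slope of 0.2, as in the code below) is defined as:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.2\,x & \text{if } x < 0 \end{cases}$$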

def Generator(min_chunk_size):
    
    G = tf.keras.models.Sequential()
    
    G.add(tf.keras.layers.Dense(units=256,input_dim=500))
    G.add(tf.keras.layers.LeakyReLU(0.2))
    
    G.add(tf.keras.layers.Dense(units=512))
    G.add(tf.keras.layers.LeakyReLU(0.2))
    
    G.add(tf.keras.layers.Dense(units=1024))
    G.add(tf.keras.layers.LeakyReLU(0.2))
    
    G.add(tf.keras.layers.Dense(units=min_chunk_size, activation='tanh'))
    
    G.compile(loss='binary_crossentropy', optimizer='adam')
    
    return G

The Discriminator receives the real and generated samples as input and also has a total of 4 layers. The first comprises 1024 neurons, the second 512 neurons and the third 256 neurons, all also followed by a Leaky ReLU activation function. The last layer has one single neuron that outputs the probability of the input being a real sample (1 corresponds to 100% probability of being a real sample), with a Sigmoid activation function.

def Discriminator(min_chunk_size):
    
    D = tf.keras.models.Sequential()
    
    D.add(tf.keras.layers.Dense(units=1024,input_dim=min_chunk_size))
    D.add(tf.keras.layers.LeakyReLU(0.2))

    D.add(tf.keras.layers.Dense(units=512))
    D.add(tf.keras.layers.LeakyReLU(0.2))
  
    D.add(tf.keras.layers.Dense(units=256))
    D.add(tf.keras.layers.LeakyReLU(0.2))
    
    D.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
    
    D.compile(loss='binary_crossentropy', optimizer='adam')
    
    return D

Training method

As explained at the beginning of this post, each training step consists of creating a random input noise, feeding it to the Generator to get the fake (generated) batch and then getting the output of the Discriminator for both the fake and real batches. The two neural networks are then updated to follow their respective goals: the Discriminator has to distinguish between real and fake samples and the Generator has to create samples that are classified as real by the Discriminator (that’s why we use an array of ones as ground truth when computing the loss of the Generator).

def train_step(X_real):

    noise = tf.random.normal([batch_size, 500])

    # persistent tape: we need gradients for both the Generator and the Discriminator
    with tf.GradientTape(persistent=True) as tape:

        X_fake = G(noise, training=True)
        P_fake = D(X_fake, training=True)

        X_real = tf.stack(X_real, axis=0)
        P_real = D(X_real, training=True)

        # the Generator wants the fake samples to be classified as real (ones)
        G_loss = bce_loss(tf.ones_like(P_fake), P_fake)
        # the Discriminator wants fakes classified as 0 and real samples as 1
        D_loss = bce_loss(tf.zeros_like(P_fake), P_fake)
        D_loss += bce_loss(tf.ones_like(P_real), P_real)

    grads = tape.gradient(G_loss, G.trainable_variables)
    G_opt.apply_gradients(zip(grads, G.trainable_variables))

    grads = tape.gradient(D_loss, D.trainable_variables)
    D_opt.apply_gradients(zip(grads, D.trainable_variables))

    del tape  # free the persistent tape

    # D already outputs probabilities (sigmoid layer), so no extra sigmoid is needed
    P_avg_real = tf.reduce_mean(P_real)
    P_avg_fake = tf.reduce_mean(P_fake)

    return P_avg_real, P_avg_fake, G_loss, D_loss

Training loop

After defining the number of epochs and the batch size, loading the dataset and building the models, the training step is called repeatedly for all batches. At the end of each epoch, three output samples are saved, so we can follow the evolution of the Generator’s outputs throughout the training.

epochs = 250
batch_size = 32

#load dataset
sounds, rates, min_chunk_size = load_dataset()

#build models
G = Generator(min_chunk_size)
# G.summary() #uncomment this to see the architecture of the generator

D = Discriminator(min_chunk_size)
# D.summary() #uncomment this to see the architecture of the discriminator

#define optimizers (the Discriminator uses a 5 times larger learning rate)
G_opt = tf.keras.optimizers.Adam(2e-4, 0.5)
D_opt = tf.keras.optimizers.Adam(2e-4*5, 0.5)

#define loss function (the Discriminator already outputs probabilities, not logits)
bce_loss = tf.keras.losses.BinaryCrossentropy()

#create a random noise batch of 3 to feed to the generator at the end of each epoch
debug_noise = tf.random.normal([3, 500])

losses_g = []
losses_d = []
prob_avg_real_list = []
prob_avg_fake_list = []

#training loop
for epoch in range(epochs):

    #record the epoch start time and shuffle the dataset list
    epoch_start = time.time()
    random.shuffle(sounds)

    print(f'* Epoch {epoch+1}/{epochs}')

    #initialize some temporary training variables
    avg_probs = np.zeros(2, np.float32)
    avg_losses = np.zeros(2, np.float32)
    steps = int(np.ceil(len(sounds) / batch_size))
    losses_g__ = []
    losses_d__ = []
    prob_avg_real_list__ = []
    prob_avg_fake_list__ = []
    idx = 0

    #loop through batches
    for step in range(steps):

        sys.stdout.write("\r   - Step {}/{}".format(step+1, steps))
        sys.stdout.flush()

        #training step
        P_avg_real, P_avg_fake, G_loss, D_loss = train_step(sounds[idx:idx+batch_size])

        #accumulate the epoch averages and save the per-step training data
        avg_probs += np.array((P_avg_real, P_avg_fake)) / steps
        avg_losses += np.array((G_loss, D_loss)) / steps
        prob_avg_real_list__.append(float(P_avg_real))
        prob_avg_fake_list__.append(float(P_avg_fake))
        losses_g__.append(float(G_loss))
        losses_d__.append(float(D_loss))

        idx = idx + batch_size

    #save epoch summary data 
    prob_avg_real_list.append(np.mean(prob_avg_real_list__))
    prob_avg_fake_list.append(np.mean(prob_avg_fake_list__))
    losses_g.append(np.mean(losses_g__))
    losses_d.append(np.mean(losses_d__))

    dt = time.time() - epoch_start
    print(' - %ds - D real avg: %f, D fake avg: %f' % (dt, *tuple(avg_probs)))

    #get the output of the generator for the debug noise and save it as three sound files
    output = G(debug_noise)
    for i in range(output.shape[0]):
        sf.write('output_epoch{}_{}.wav'.format(epoch, i), output[i].numpy(), rates[0])

After training the models, it’s a good habit to look at the evolution of the losses and the probabilities throughout the epochs, which are all saved in the lists created previously.

#save losses evolution as a plot
plt.figure()
plt.plot(losses_g, label="Generator Loss")
plt.plot(losses_d, label="Discriminator Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Generator and Discriminator Loss")
plt.legend()
plt.savefig("losses.png")
plt.close('all')

#save losses evolution as a CSV
pd.DataFrame({'D LOSS': losses_d, 'G LOSS': losses_g}).to_csv("losses.csv", index=False)

#save probabilities evolution as a plot
plt.figure()
plt.plot(prob_avg_real_list, label="Real probability")
plt.plot(prob_avg_fake_list, label="Fake probability")
plt.xlabel("Epoch")
plt.ylabel("Probability")
plt.title("Real and Fake Probabilities")
plt.legend()
plt.savefig("probabilities.png")
plt.close('all')

#save probabilities evolution as a CSV
pd.DataFrame({'REAL PROB': prob_avg_real_list, 'FAKE PROB': prob_avg_fake_list}).to_csv("probs.csv", index=False)

Results

Firstly, looking at the training plots:

  • It’s difficult to interpret the losses plot because of the 3 peaks.
  • The fake and real probabilities always remained between 0.26 and 0.36, which can mean that the discriminator struggled not only to distinguish between real and fake samples but also to learn the distribution of the real sounds, and was never confident in its predictions.

As it is difficult to get information from the losses plot, let’s look at the CSV with the data and remove those peaks to check the actual training evolution. Now it’s possible to see that the discriminator’s loss remained low throughout training relative to the generator’s loss, which generally increased.
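Here is a minimal sketch of how that can be done, assuming the losses.csv saved above and using a simple percentile cutoff to drop the peaks (the exact threshold is an assumption, not the value used for the plots in this post):

import pandas as pd
import matplotlib.pyplot as plt

# reload the saved losses and drop the steps where the generator loss spiked
losses = pd.read_csv("losses.csv")
threshold = losses['G LOSS'].quantile(0.95)  # assumed cutoff for the peaks
smooth = losses[losses['G LOSS'] < threshold]

plt.plot(smooth['G LOSS'].values, label="Generator Loss")
plt.plot(smooth['D LOSS'].values, label="Discriminator Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("losses_no_peaks.png")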

Regardless of the technical characteristics of the training, what really matters is whether the generator is capable of creating bird sounds from just random noise. I know you are curious to hear the generator outputs, so here you go!

Generator’s input noise

Generator’s output after 1 epoch

Generator’s output after 50 epochs

Generator’s output after 150 epochs

Generator’s output after 250 epochs – last output

So, despite the imperfect training metrics, the generator was actually capable of generating sound from the very first epoch! The first output is very soft and has a lot of noise, but it’s possible to hear some sounds that resemble birds singing, which is amazing! Naturally, as the training evolves, the sounds become sharper and more intense, and after the last epoch it’s possible to hear lots of different birds singing at the same time; some of the sounds even appear and disappear in the middle of the 5 seconds. A good sound cleaning process would make it even better!

Conclusion

If I close my eyes and listen to the last generated output I can imagine myself exploring some isolated rain forest, which is enough to consider this project a complete success!
