3.4. Classification of Audio Spectrograms Using Convolutional Neural Networks (CNN)


In addition to the traditional supervised learning methods, we also investigate loose gravel sound classification with a convolutional neural network (CNN) and compare the two approaches.

Gravel acoustics were converted to spectrogram images using the Fast Fourier Transform (FFT). A pre-trained network, GoogLeNet, was then trained on these images after some fine-tuning of the network. GoogLeNet is a 22-layer deep convolutional neural network and a variant of the Inception network developed by researchers at Google [31,33]. GoogLeNet and the other technologies involved are described in the remainder of this section.

Sound waves are made up of high- and low-pressure regions moving through a medium. Such pressure patterns make sounds distinguishable. These waves have characteristics such as wavelength, frequency, speed, and time. Machines can classify sounds based on such characteristics, just as humans do [70,71].

A spectrogram is a way to visualize the frequency spectrum of a sound wave as it varies over time. It can be thought of as a picture of the frequency spectrum in which intensity is shown by varying colors or brightness. One way to create a spectrogram is with the FFT, a digital process, and this is the method used to generate spectrograms in this study. Digitally sampled time-domain data is broken into chunks, usually overlapping, and each chunk is Fourier transformed to calculate the magnitude of its frequency spectrum. Each chunk corresponds to a vertical line in the spectrogram, and these spectra are laid side by side to form an image or three-dimensional surface carrying time, frequency, and amplitude information [72]. Amplitude is shown by color intensity; brighter colors indicate higher amplitudes. Spectrograms of gravel and non-gravel sounds are shown in Figure 6.
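As a rough illustration of this process, the following sketch generates a spectrogram image from an audio clip using SciPy and Matplotlib; the file name, FFT length, and overlap are assumptions for illustration, not the exact settings used in the study.

```python
# A minimal sketch of generating a spectrogram image from an audio file with
# SciPy and Matplotlib. File name, FFT length, and overlap are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("gravel_clip.wav")   # hypothetical input file
if samples.ndim > 1:                            # keep a single channel
    samples = samples[:, 0]
samples = samples.astype(np.float32)            # work in floating point

# Break the signal into overlapping chunks and Fourier-transform each chunk.
f, t, Sxx = spectrogram(samples, fs=fs, nperseg=1024, noverlap=512)

# Plot the magnitude in dB; each column of Sxx is one vertical line of the image.
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.savefig("gravel_clip_spectrogram.png", dpi=150)
```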

Figure 6. Spectrogram images of non-gravel and gravel sounds.

Neural networks (NNs) are inspired by the human brain. A neural network comprises many artificial neurons with weights and biases. These networks learn feature representations, thus eliminating the manual feature selection process [63]. The training process uses backpropagation to minimize a loss function, L = g(x, y, θ), by tuning the parameters θ. The loss function measures the difference between predicted and actual values; the cross-entropy loss is a common choice in classification problems. The loss is optimized iteratively by following its gradient, scaled by the learning rate. The learning rate is an important parameter: it controls how much the weights of each neuron are updated. A higher learning rate can approach the goal quickly but risks overshooting or settling in a local minimum [73,74,75,76,77]. The goal of the optimization is to reach an acceptable value close to the global minimum of the loss function. The most common optimizers are stochastic gradient descent and its variants. These networks are composed of connected layers, each layer containing many neurons. NNs with many layers are referred to as deep neural networks (DNNs). Multiple layers enable them to solve complex problems that relatively shallow networks usually cannot, and network depth appears to contribute to improved classification [78,79].
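The following minimal sketch illustrates the update rule θ ← θ − η∇L for a cross-entropy loss on toy data; the data, model, and learning rate are purely illustrative and not taken from the study.

```python
# A small numerical sketch of the update rule theta <- theta - lr * grad(L),
# using a logistic (cross-entropy) loss on toy data. Shapes and values are
# illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))          # 16 samples, 4 features
y = rng.integers(0, 2, size=16)       # binary labels (gravel / non-gravel)
theta = np.zeros(4)                   # model parameters
lr = 0.1                              # learning rate

for step in range(100):
    p = 1.0 / (1.0 + np.exp(-X @ theta))                 # predicted probabilities
    loss = -np.mean(y * np.log(p + 1e-12)
                    + (1 - y) * np.log(1 - p + 1e-12))   # cross-entropy loss
    grad = X.T @ (p - y) / len(y)                        # gradient w.r.t. theta
    theta -= lr * grad                                   # gradient-descent update
```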

Convolutional neural networks (CNNs) have become popular in machine learning research and are widely applied to visual recognition and audio analysis. In several studies, CNNs classify spectrograms for musical onset detection, classification of acoustic scenes and events, emotion recognition, or identification of dangerous situations in underground car parks to trigger an automatic alert from sound [80,81,82,83,84]. CNNs contain specialized layers for extracting features from images, called convolutional layers. Convolutional layers have filters that learn features such as edges, circles, or textures. Each convolutional layer convolves its input and passes the result to the next layer, producing increasingly complex feature maps of the image [85].
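As a small illustration, the sketch below shows a single convolutional layer producing feature maps from an image-sized input; PyTorch is assumed here purely for illustration, as this section does not specify the study's toolchain.

```python
# A minimal sketch of a convolutional layer producing feature maps from a
# spectrogram-sized image; PyTorch is assumed for illustration only.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
relu = nn.ReLU()

image = torch.randn(1, 3, 224, 224)   # one RGB spectrogram image
features = relu(conv(image))          # 16 feature maps, each 224 x 224
print(features.shape)                 # torch.Size([1, 16, 224, 224])
```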

One of the first CNNs was LeNet, which was used to recognize digits and characters. The LeNet architecture includes two convolutional layers and two fully connected layers [86]. One reason for the success of CNNs is their ability to capture spatially local and hierarchical features from images. Later, a deeper CNN called AlexNet was proposed, which achieved record-breaking accuracy on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2010) classification task [87]. In addition to its increased depth, AlexNet uses the rectified linear unit (ReLU) as its activation function and overlapping max pooling to downsample the feature maps between layers.
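The following sketch shows these two ingredients, ReLU activation and overlapping max pooling (a 3 × 3 window with stride 2), again assuming PyTorch for illustration.

```python
# Overlapping max pooling in the AlexNet style: a 3x3 window with stride 2,
# so adjacent windows overlap. Values are illustrative.
import torch
import torch.nn as nn

relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=3, stride=2)   # kernel larger than stride -> overlap

x = torch.randn(1, 16, 224, 224)               # feature maps from a conv layer
print(pool(relu(x)).shape)                     # torch.Size([1, 16, 111, 111])
```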

Training CNNs requires a considerable amount of data and time, which in most cases are not available. Using a pre-trained network with transfer learning is typically much faster and easier than training a network from scratch. Pre-trained networks are CNNs whose descriptors have been learned by training on large data sets. These descriptors from pre-trained networks can help solve many visual recognition problems with high accuracy [88].

Many pre-trained networks have been developed over time, such as the residual neural network (ResNet), AlexNet, GoogLeNet, FractalNet, VGG, etc. These networks are trained on different data sets and have variants depending on the number of layers in the architecture. Pre-trained networks are trained on millions of images from publicly available data sets. The training requires a considerable amount of computational power and may take weeks, depending on the complexity of the network architecture. By taking advantage of transfer learning, other classification problems can often be solved by fine-tuning pre-trained networks. Fine-tuning is the task of further training and tweaking a pre-trained network on a small data set with fewer classes than the original one [89].

For this study, the data set is considerably small (i.e., 237 spectrogram images) for training a network from scratch, but we can still take advantage of pre-trained convolutional networks. Data augmentation is a technique used to artificially create new training data from existing training data [90]. We used data augmentation techniques such as image resizing, horizontal flipping, and random rotation, increasing the image data set fourfold and feeding it to the CNN as four different sets of images. Each image was 224 × 224 pixels, the default input image size required by GoogLeNet.
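The following sketch illustrates the augmentation steps named above (resize, horizontal flip, and random rotation) using torchvision transforms; the rotation range and data set path are assumptions, not the exact settings used in the study.

```python
# A sketch of the augmentations named above (resize, horizontal flip, random
# rotation); the rotation range and folder layout are assumptions.
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.Resize((224, 224)),          # GoogLeNet's default input size
    transforms.RandomHorizontalFlip(),      # horizontal flip
    transforms.RandomRotation(degrees=15),  # random rotation (range assumed)
    transforms.ToTensor(),
])

# Spectrogram images organised in class folders, e.g. gravel/ and non_gravel/
train_data = datasets.ImageFolder("spectrograms/train", transform=augment)
```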

We used GoogLeNet for the classification of the spectrograms of gravel acoustics. GoogLeNet, or Inception v1, was proposed by Google research in collaboration with several universities. The GoogLeNet architecture outperformed its counterparts in the classification and detection challenges of the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14), providing a lower error rate than AlexNet, the winner of the 2012 challenge. The GoogLeNet architecture consists of 22 layers. It introduced features such as 1 × 1 convolutions and global average pooling, which reduce the number of parameters and allow a deeper architecture. GoogLeNet is pre-trained on the ImageNet data set, which comprises over a million images across 1000 classes. This large data set contains abundant examples of a wide variety of images, so the feature knowledge gained by GoogLeNet can be useful for classifying images from other data sets. In this study, we leverage the knowledge GoogLeNet gained from training on this larger image data set to help classify gravel audio spectrograms with a relatively small data set of 237 spectrograms. This approach can help achieve better results. More details about the GoogLeNet architecture can be found in [91].
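As an illustration of this fine-tuning step, the following sketch loads a pre-trained GoogLeNet, replaces its 1000-class head with a two-class (gravel/non-gravel) layer, and runs one training step. torchvision is assumed here purely for illustration, and all hyperparameters and the toy batch are illustrative rather than those used in the study.

```python
# A minimal sketch of fine-tuning a pre-trained GoogLeNet for the two-class
# gravel / non-gravel task; framework and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)   # replace the 1000-class head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One training step on a hypothetical batch of spectrogram tensors.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```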
