Building a State-of-the-Art Audio Classifier through Machine Learning

Summary

Highly accurate audio classifiers have many practical applications in all walks of life, from medicine to industry. Developing such classifiers, however, is arduous. Unlike in computer vision, advances in computer listening are still at an early stage, and audio classifiers typically achieve lower accuracies than image classifiers. However, with the ready availability of curated public audio datasets and ML classification algorithms, it is easier than ever to build accurate classifiers.

Several recently published research papers describe classifiers trained on the UrbanSound8k dataset. However, these classifiers reach only 50-79% accuracy. In this project, I aim to improve significantly on the current high of 79% by employing various ML strategies. To obtain generalizable and reliable results, I will train the models using the industry gold standard, k-fold cross-validation, on the training set (80% of the source data). The trained model will then be tested on the test set, the remaining 20% of unseen source data. Experiments will be repeated, and even run on different operating systems, to measure variability.

Furthermore, I will investigate whether the successful strategies above have broader applicability by trying them on unrelated audio datasets. If they do, I will propose a pipeline of steps that anyone can adopt to develop an accurate audio classifier. In the future, I plan to employ unsupervised ML strategies, such as clustering, on audio datasets. Accurate clustering strategies eliminate the need for labeled data and help sprout new applications.

Question / Proposal

Background: An audio classifier takes audio input and predicts which class the input belongs to. UrbanSound8k is a public dataset of 8732 labeled sound clips of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.

Several research papers have reported audio classifiers developed on this dataset. The best among them reached 79% accuracy. This is considerably lower than what is typically observed with image classifiers, which routinely reach 90% or more, and it prompted me to investigate whether I could improve the accuracy of audio classifiers further. Higher-accuracy models are advantageous since they can be readily used to solve real-world problems.

 

Question: Is it possible to improve upon the best currently published classifier accuracy on the UrbanSound8k dataset? Along the way, is it possible to put forth a standardized pipeline of steps that can be followed by anyone to develop a performant classifier on any audio dataset?

 

Hypothesis: It is possible to build an audio classifier on the UrbanSound8k dataset with an accuracy of 90% or more and propose a pipeline of steps to build accurate classifiers on other audio datasets.

Research

When I became interested in computer listening and looked into published audio classifiers and their data sources, I came across three main papers and a few blogs (McFee, Humphrey, & Bello; Salamon, Jacoby, & Bello; Piczak; Saeed; Collis). All of these works reference the UrbanSound8k dataset (Salamon, Jacoby, & Bello). This publicly available labeled dataset has 8732 audio files in .wav format from ten categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. The best reported classifier accuracies on this dataset were 74% without data augmentation (Piczak) and 79% with data augmentation (Salamon and Bello), both using a CNN, an advanced deep-learning neural network model.

My aim is simple: improve the classifier accuracy significantly, to 90% or more. In ML, classifiers fall into two broad types based on the complexity of the trained model: classic/simple learners and advanced deep-learning neural networks. I decided to employ both. While complex deep-learning models generally give better accuracies, it is occasionally possible to obtain better results with simpler models. Moreover, simpler models are easier to build, understand, and maintain.

Having a better-performing, dependable model is paramount, since such models could be deployed in mission-critical situations. Imagine an audio classifier that pre-screens normal/abnormal heart sounds in a clinical setting, or a model that predicts impending failure of machinery just by listening to the sounds it produces (Mannes, J).

Method / Testing and Redesign

Materials

For this experiment, I used Python (version 3.4), a language popularly used for ML. I also used Anaconda (version 1.8.1), a Python package manager, along with many packages used in data science (numpy and pandas, among others), and IPython Notebook, an IDE that allows programmers to write a segment of code in a 'cell' and run one cell's worth of code at a time. I ran all the experiments at home on both Windows and Mac OS machines.

Procedure

The basic ML procedure is as follows (Brownlee, Jason):

  1. Define the Problem
  2. Prepare Data
    1. Data Selection
    2. Data Preprocessing
    3. Data Transformation
  3. Spot Check Algorithms
  4. Improve Results
  5. Present Results

To start preparing the data, I downloaded the UrbanSound8k dataset. Next, I extracted features from this dataset and passed them in as is, with neither preprocessing nor parameter tuning. I started with classic classifiers. In an attempt to find the best one, I tried many different classic classifiers: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), and Support Vector Machines (SVM). These are some of the most popular classic classifiers in ML.

To present the audio data to these classifiers, I extracted and passed in features from the dataset. For audio data, five features are popularly used: MFCCs, Chroma, Mel, Contrast, and Tonnetz (Collis, J.). I combined classifiers and features in two ways: first, trying all five features with each classifier; second, trying each classifier with each individual feature separately. For the accuracy measurement, I used K-fold (K = 10) cross-validation. This led me to identify KNN as the best classifier, and I got a similar accuracy rate with either the full feature set or just MFCCs. Accuracy% equals (# of correct predictions / # of total samples tried) * 100. The features are the input/independent variables, and the predicted class label is the single output/dependent variable.
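A minimal sketch of this feature-extraction and spot-checking step is shown below. It assumes recent versions of scikit-learn and librosa; the extract_mfcc helper and the X, y arrays are illustrative names rather than the exact code used in the experiments.

```python
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def extract_mfcc(path, n_mfcc=40):
    """Load one clip and return its time-averaged MFCC vector."""
    signal, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)


def spot_check(X, y):
    """Run 10-fold cross-validation for each classic classifier on features X and labels y."""
    classifiers = [
        ("LR", LogisticRegression(max_iter=1000)),
        ("LDA", LinearDiscriminantAnalysis()),
        ("KNN", KNeighborsClassifier()),
        ("CART", DecisionTreeClassifier()),
        ("NB", GaussianNB()),
        ("SVM", SVC()),
    ]
    for name, clf in classifiers:
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print("{}: {:.3f} (+/- {:.3f})".format(name, scores.mean(), scores.std()))

# Usage (wav_paths and labels would come from the UrbanSound8k metadata):
# X = np.array([extract_mfcc(p) for p in wav_paths])
# y = np.array(labels)
# spot_check(X, y)
```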

To further boost the accuracy, I attempted parameter tuning on the KNN classifier. Earlier, I had run each classifier with its default parameters; this time, I used sklearn's GridSearchCV API to find an optimal parameter combination.
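The sketch below shows how such a grid search over KNN parameters might look; the parameter grid is only an example, not necessarily the exact grid searched in this project.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Example grid; the values actually searched may differ.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
# grid.fit(X_train, y_train)            # MFCC features and labels from the training set
# print(grid.best_params_, grid.best_score_)
```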

Next, I attempted data preprocessing. For that, I used sklearn's MinMaxScaler with a -1 to +1 range to normalize the data.
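A minimal sketch of this scaling step, assuming the features have already been split into training and test sets:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
# Fit on the training features only, then apply the same transform to the
# test features so that no information leaks from the test set.
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
```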

The last step I tried was data augmentation; with more input data, classifiers generally predict more accurately. Following the strategy outlined in a research paper (Salamon, Justin, and Juan Pablo Bello), I used the muda library to augment the data. With the augmented dataset, the classifier's accuracy reached its highest level. A confusion matrix was generated for the model to gain better insight into its performance, and the standard deviation was measured to assess the extent of variability. In all the experiments, the input data was randomly split into training and test sets in an 80:20 ratio. Final accuracy results were obtained on the test set, while the model was trained on the training set using a 10-fold cross-validation strategy.
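The overall evaluation protocol can be sketched as follows, assuming the (augmented) MFCC features are already in arrays X and y; the random_state value and the untuned KNeighborsClassifier() are placeholders rather than the exact settings used.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier


def evaluate(X, y):
    """80:20 split, 10-fold CV on the training portion, then final test-set accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)        # 80:20 random split
    knn = KNeighborsClassifier()                      # tuned parameters would go here
    cv_scores = cross_val_score(knn, X_train, y_train, cv=10)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print("10-fold CV accuracy: {:.3f} (+/- {:.3f})".format(cv_scores.mean(), cv_scores.std()))
    print("Test accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))
    print(confusion_matrix(y_test, y_pred))
```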

Results

Figure 1 shows the accuracy rates of each classifier in the study when given the full feature set. In this study, KNN was the best classifier, with 82% accuracy (standard deviation = 0.013).

Figure 2 shows results from my second approach, using only MFCCs as input to the same classifiers. The accuracy of the KNN classifier with MFCCs alone was nearly identical to that obtained with the entire feature set, making MFCCs the single best feature in the set. Because of this, I used only MFCCs as input in subsequent experiments.

Results from the various experiments aimed at boosting the KNN accuracy from Figure 2 are summarized in Table 1. Applying all three approaches together, namely parameter tuning, preprocessing, and data augmentation, to the KNN classifier with MFCCs as input yielded the highest accuracy of 99.4%. Figure 3 shows more details on this result, and Figure 4 contains a confusion matrix for the classifier. A confusion matrix allows easy visualization of an algorithm's performance and is widely used to display results in ML.

To test whether my audio classifier development technique can be used in other audio problem domains, I applied it to a heart sound dataset (Bentley, Peter) containing both normal and abnormal heart sounds. On this dataset, I achieved 98% accuracy with K-fold cross-validation and 100% accuracy on the separately provided test dataset.

Conclusion

The KNN classifier was the best performer of the six audio classifiers I tried. Furthermore, KNN performed equally well with the full feature set or with MFCCs alone (compare Figure 1 and Figure 2). That means I can use just MFCCs as input in subsequent experiments, which has many advantages. First, I save time during feature extraction. Second, the scaling step in data preprocessing becomes more efficient. Third, I can reduce overfitting while training the classifier, a common problem in ML. Fourth, my program's memory requirements are lower than when loading the entire feature set.

All the steps outlined in Table 1, namely preprocessing through scaling, tuning the classifier parameters, and augmenting the dataset, are essential for reaching high levels of accuracy. These results were observed consistently every time I repeated the experiments, on both Mac and Windows machines.

Finally, and more importantly, I obtained similar accuracy results on the validation data and on the unseen test data that was set aside during the data preparation stage. Additionally, I obtained similar results even with an advanced ML model, a Feed-Forward Neural Network. Lastly, the steps used to develop the audio classifier on UrbanSound8K are applicable to developing accurate classifiers on other audio datasets, as I have shown with the classifier on the heart sound dataset.
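This report does not specify how the Feed-Forward Neural Network was built; one minimal way to reproduce such a model is sklearn's MLPClassifier, sketched below with an illustrative architecture.

```python
from sklearn.neural_network import MLPClassifier

# Illustrative layer sizes; the exact architecture used is not specified above.
ffnn = MLPClassifier(hidden_layer_sizes=(256, 128), activation="relu",
                     max_iter=500, random_state=42)
# ffnn.fit(X_train_scaled, y_train)
# print("Test accuracy:", ffnn.score(X_test_scaled, y_test))
```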

My results are reliable, as I diligently followed best practices and avoided known traps such as over-fitting. Since I used k-fold cross-validation for training, and the accuracy rates are consistent on both the validation set and the test set, the possibility of over-fitting is remote. Moreover, I got similar results with advanced deep-learning models, which use their own learning strategy. I am planning a journal publication of my results that will include a pointer to my online codebase.

A common benchmark in machine learning is a model's performance relative to humans. In this context, MFCCs are a relevant input feature: they are engineered signal vectors optimized to closely match human hearing.

Given the state-of-the-art results I obtained, it is natural to ask why the accuracies in published works were lower. I believe it is due to the pipeline of steps I followed; no single paper followed all of them. Moreover, some of the papers did not focus entirely on obtaining the best possible accuracy.

In conclusion, I have outlined a series of steps needed to improve upon current audio classifier accuracy on the UrbanSound8K dataset. By applying these steps, I was able to build a classifier that produced state-of-the-art results for classifying environmental sounds. With such high accuracy, it is possible to envision their practical use in the real world.

About me

Hi, my name is Sruthi Kurada and I am a 9th grader. I started coding in 2nd grade by learning simple block code to grasp the basics of logic, and then took on more advanced projects each year. I learned how to code through online websites such as Codecademy and Udacity. I was originally interested in robotics and the physical aspect of computing, but now I am drawn to computing because of the endless capabilities code can accomplish. Anyone can make an impact by looking at a problem and working out solutions.

An innovator who inspires me is Steve Jobs. Although he wasn't a programmer, he strove for the best in his co-workers and himself and was brutally honest about progress. He was able to simplify the hardware of his devices, and his company is one of the most valuable today.

I am planning to go to college and would like to major in Computer Science. After college, I would like to work at a company such as Google for 5-10 years and then use the expertise I gain to create my own computer science company.

Winning the Google Science Fair would mean a lot to me, as it would show that my project is a practical solution to a problem, that it is well received by experts in the field, and that I am a budding engineer. The prizes would also help lessen college expenses.

Health & Safety

No special health or safety procedures were needed, as this was a computer science project.

Bibliography, references, and acknowledgements

Acknowledgments:

I would like to thank my father, who initially pointed me to machine learning and signal processing tutorials to build the needed expertise.


References:

  • Salamon, Justin, et al. (2014, Nov 03). “Urban Sound Dataset - Two Datasets and a taxonomy for Urban Sound Research.” Urban Sound Datasets - Home serv.cusp.nyu.edu/projects/urbansounddataset/.
  • Mannes, J. (2017, January 30). The sound of impending failure. Retrieved March 14, 2018, from https://techcrunch.com/2017/01/29/the-sound-of-impending-failure/
  • McFee, B., Humphrey, E., & Bello, J. (2015, August 01). “A software framework for musical data augmentation.” Retrieved March 14, 2018, from https://github.com/bmcfee/muda
  • J. Salamon, C. Jacoby and J. P. Bello. (2014, Nov 03). “A Dataset and Taxonomy for Urban Sound Research”, 22nd ACM International Conference on Multimedia, Orlando USA, Nov. 2014.
  • McFee, Brian, McVicar, Matt, Nieto, Oriol, Balke, Stefan, Thome, Carl, Liang, Dawen, … Lee, Hojin. (2017, February 17). librosa 0.5.0. Zenodo. http://doi.org/10.5281/zenodo.293021
  • Salamon, Justin, and Juan Pablo Bello. (2016, August 15). “Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification.” doi:10.1109/LSP.2017.2657381.
  • Collis, J. (2016, December) “Using deep learning to build an audio classifier to recognise birdsong”
  • Brownlee, Jason. “Applied Machine Learning Process.” Machine Learning Mastery, 26 Sept. 2016, machinelearningmastery.com/process-for-working-through-machine-learning-problems/.
  • Saeed, Aaqib. “Urban Sound Classification, Part 1.” Aaqib Saeed, 3 Sept. 2016, aqibsaeed.github.io/2016-09-03-urban-sound-classification-part-1/.
  • Piczak, Karol J. “ESC: Dataset for Environmental Sound Classification.” Proceedings of the 23rd ACM International Conference on Multimedia - MM '15, 2015, doi:10.1145/2733373.2806390.
  • “Classification of Normal/Abnormal Heart Sound Recordings: the PhysioNet/Computing in Cardiology Challenge 2016.”
  • Bentley, Peter, et al. “Classifying Heart Sounds Challenge.” Classifying Heart Sounds Challenge, PASCAL, www.peterjbentley.com/heartchallenge/index.html.