Sound recognition using Deep Learning

TimeScaleNet, a multi-resolution time-domain architecture for speech and environmental sound recognition

Summary: This project is part of my activities on Deep Learning for audio, which I have been developing since early 2018.

In recent years, the use of Deep Learning techniques in audio signal processing has significantly improved the performance of sound recognition systems. This paradigm shift has prompted the scientific community to develop machine learning strategies to create efficient representations directly from temporal raw waveforms for Machine Hearing tasks.

In this project, I develop a multi-resolution approach, which allows the deep neural network to efficiently encode relevant information contained in unprocessed acoustic signals in the time domain.

The proposed neural network, TimeScaleNet, aims at learning a representation of the sound by analysing its temporal dependencies at two scales: that of the individual audio sample, and that of $20~\text{ms}$ audio frames. The proposed approach improves the interpretability of the learning scheme by unifying advanced Deep Learning and signal processing techniques.

Figure: Architecture of the TimeScaleNet neural network.

In particular, the architecture of TimeScaleNet introduces a new form of recurrent neural cell, directly inspired by IIR digital signal processing, which acts as a bank of biquadratic (biquad) IIR filters with learnable bandwidths and represents the sound signature as a two-dimensional map. This new approach improves recognition performance and automatically builds a representation similar to a time-frequency spectrogram, whose parameters are chosen by the neural network itself. It yields a semantic representation that is specific to the training dataset, at a low computational cost.
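
To make the idea concrete, here is a minimal PyTorch sketch of such a recurrent cell, written as a bank of band-pass biquad filters whose centre frequencies and bandwidths are trainable. It follows the standard audio-EQ-cookbook band-pass parameterisation; the class and parameter names (`LearnableBiquadBandpass`, `theta_f0`, `theta_q`) and the sample-by-sample Python loop are illustrative choices for this sketch, not the actual TimeScaleNet implementation.

```python
import math
import torch
import torch.nn as nn


class LearnableBiquadBandpass(nn.Module):
    """Bank of band-pass biquad IIR filters with trainable centre frequency and bandwidth.

    Minimal sketch following the standard audio-EQ-cookbook band-pass design;
    this is not the actual TimeScaleNet cell.
    """

    def __init__(self, n_filters: int, sample_rate: float = 16000.0):
        super().__init__()
        self.sample_rate = sample_rate
        # Unconstrained parameters, mapped to valid ranges in coefficients().
        self.theta_f0 = nn.Parameter(torch.rand(n_filters))   # -> centre frequency
        self.theta_q = nn.Parameter(torch.zeros(n_filters))   # -> quality factor (bandwidth)

    def coefficients(self):
        nyquist = self.sample_rate / 2.0
        f0 = 1.0 + torch.sigmoid(self.theta_f0) * (nyquist - 2.0)  # (0, Nyquist)
        q = 0.5 + 9.5 * torch.sigmoid(self.theta_q)                # bandwidth control
        w0 = 2.0 * math.pi * f0 / self.sample_rate
        alpha = torch.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        # Normalised band-pass coefficients (0 dB peak gain).
        b0, b1, b2 = alpha / a0, torch.zeros_like(alpha), -alpha / a0
        a1, a2 = -2.0 * torch.cos(w0) / a0, (1.0 - alpha) / a0
        return b0, b1, b2, a1, a2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, time) raw waveform  ->  (batch, n_filters, time) map."""
        b0, b1, b2, a1, a2 = self.coefficients()
        batch, time = x.shape
        n = b0.shape[0]
        y = x.new_zeros(batch, n, time)
        x1 = x2 = y1 = y2 = x.new_zeros(batch, n)  # filter memories
        for t in range(time):
            xt = x[:, t:t + 1]                     # broadcast over the n filters
            yt = b0 * xt + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
            y[:, :, t] = yt
            x2, x1 = x1, xt.expand(batch, n)
            y2, y1 = y1, yt
        return y


# Example: 0.2 s of audio at 16 kHz analysed by 32 trainable band-pass filters.
cell = LearnableBiquadBandpass(n_filters=32)
maps = cell(torch.randn(2, 3200))   # -> shape (2, 32, 3200)
```

Because every filter is defined by only a handful of interpretable parameters, the learned map can be read directly as a data-driven time-frequency representation.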

The frame-level time-frequency representation is then processed by a depthwise-separable residual convolution network. This second scale of analysis aims at efficiently encoding the relationships between temporal fluctuations at the frame time scale, across the learned clustered frequency bands, over the range of $[20~\text{ms},\,200~\text{ms}]$.

Figure: Depthwise-separable subnetwork of 1D atrous convolutions.
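
As an illustration of this second analysis scale, the sketch below shows what a residual block of depthwise-separable 1D atrous (dilated) convolutions can look like in PyTorch. The channel count, kernel size and dilation schedule are assumptions made for the example, not the exact TimeScaleNet hyperparameters.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableAtrousBlock(nn.Module):
    """Residual block of depthwise-separable 1D atrous (dilated) convolutions.

    Sketch only: hyperparameters are illustrative, not those of TimeScaleNet.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep the frame axis length unchanged
        # Depthwise convolution: one dilated filter per channel, along the frame axis.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=padding, dilation=dilation,
                                   groups=channels, bias=False)
        # Pointwise (1x1) convolution: mixes information across the learned bands.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bands, frames); residual connection around the block.
        y = self.pointwise(self.depthwise(x))
        return self.act(self.norm(y) + x)


# Stacking blocks with increasing dilation widens the temporal receptive field
# at the frame scale (covering roughly a few tens to hundreds of milliseconds).
frame_features = torch.randn(8, 64, 100)   # (batch, bands, frames)
net = nn.Sequential(*[DepthwiseSeparableAtrousBlock(64, dilation=d)
                      for d in (1, 2, 4, 8)])
out = net(frame_features)                  # same shape as the input
```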

TimeScaleNet was tested on both a speech command dataset (Speech Commands Dataset v2) and an environmental sound dataset (ESC-10). For speech recognition, we obtain a very high accuracy of $\mathbf{94.87 \pm 0.24\,\%}$, which exceeds the performance of most existing algorithms. For environmental sounds, the performance is more moderate, which suggests that the atrous subnetwork architecture needs to be improved to be more efficient on small datasets, in particular those whose examples exhibit rather stationary signal characteristics.

Figure: Confusion matrix obtained on the speech dataset.

Within the framework of the project, we were also interested in the representation built by the neural network. Quite remarkably, the network constructs a representation of the sounds by building filters similar to those described in the literature on cognitive models of hearing. More specifically, this representation follows a mel-type scale for frequencies below 2500 Hz, which encode the content of vowels and nasals, and switches to an ERB-type scale close to Glasberg and Moore's model for higher frequencies, which encode consonants, fricatives and plosives.

Figure: Frequency representation learned by TimeScaleNet.
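
For reference, the two auditory scales mentioned above can be compared in a few lines of NumPy. The formulas are the standard mel and Glasberg and Moore ERB expressions from the psychoacoustics literature; the helper names are purely illustrative, and the crossover around 2500 Hz is a property of the filters learned by the network, not of these formulas.

```python
import numpy as np


def mel_scale(f_hz: np.ndarray) -> np.ndarray:
    """Mel scale (O'Shaughnessy formulation)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)


def erb_rate(f_hz: np.ndarray) -> np.ndarray:
    """ERB-rate scale of Glasberg and Moore (1990)."""
    return 21.4 * np.log10(1.0 + 4.37 * f_hz / 1000.0)


def erb_bandwidth(f_hz: np.ndarray) -> np.ndarray:
    """Equivalent rectangular bandwidth (Hz) at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


freqs = np.linspace(50.0, 8000.0, 160)
# Normalised curves make it easy to compare the band spacing implied by each
# scale with the centre frequencies actually learned by the network.
mel_norm = mel_scale(freqs) / mel_scale(freqs[-1])
erb_norm = erb_rate(freqs) / erb_rate(freqs[-1])
```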

Éric Bavu
Full Professor

My research interests include Deep Learning in acoustics, inverse problems in the time domain, and acoustic source localization.
