2.3.3. Feature extraction


To test the objectivity of the qualitative annotation and the potential to automatically differentiate cry distress levels, several Machine Learning (ML) and Deep Learning (DL) algorithms were applied. The first approach used the first 13 Mel Frequency Cepstral Coefficients (MFCCs) of every CU as input features, computed with Librosa, a Python 3 package for audio analysis. The second approach used spectrograms of each CU and a Convolutional Neural Network (CNN) (O’Shea and Nash, 2015) with 2D convolutional and dense layers; pooling and batch normalization layers were incorporated to prevent overfitting and optimize training. Both approaches used 80% of the samples to train the model and 20% to validate the algorithm during the learning process.
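As an illustration of the two approaches, the sketch below extracts the first 13 MFCCs per CU with Librosa and defines a small 2D CNN with pooling and batch normalization layers. The file paths, spectrogram settings, network dimensions, and split seed are assumptions for the example, not the exact configuration used in the study.

```python
# Minimal sketch of the two approaches; file paths, network size and training
# hyperparameters are illustrative assumptions, not the study's exact setup.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

def mfcc_features(wav_path):
    """Approach 1: first 13 MFCCs of a cry unit, averaged over time."""
    y, sr = librosa.load(wav_path, sr=None)            # keep the original 48 kHz rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)                           # one 13-dimensional vector per CU

def log_mel_spectrogram(wav_path, n_mels=64):
    """Approach 2: log-scaled mel spectrogram used as CNN input."""
    y, sr = librosa.load(wav_path, sr=None)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(spec, ref=np.max)       # pad/crop to a fixed width before stacking

def build_cnn(input_shape, n_classes):
    """Small 2D CNN with pooling and batch normalization, as described above."""
    return keras.Sequential([
        keras.Input(shape=input_shape),                 # e.g. (n_mels, n_frames, 1)
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

# 80/20 split of the cry units into training and validation sets:
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
```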

Within CEs (cry episodes), the actual cries are not continuous vocalizations but are punctuated by inspirations and spontaneous pauses or silence periods. The total duration in seconds of cry parts within the CE is defined as cryCE (amount of cry in cry episodes), while the total duration in seconds of unvoiced periods (inspirations, pauses, etc.) within the CE is termed unvoicedCE (unvoiced parts in cry episodes). The percentages of cry and unvoiced parts within every CE were also computed and are denoted cryCE (%) and unvoicedCE (%), respectively.
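For clarity, these episode-level measures can be computed from the segment durations of a CE as in the short sketch below, assuming the percentages are taken over the total CE duration (cry plus unvoiced); the function and variable names are illustrative, not taken from the study's code.

```python
def ce_timing_features(cry_durations, unvoiced_durations):
    """Compute cryCE, unvoicedCE (in seconds) and their percentages for one CE.

    cry_durations      -- durations (s) of the cry parts within the cry episode
    unvoiced_durations -- durations (s) of inspirations, pauses and silences
    """
    cry_ce = sum(cry_durations)            # cryCE: total crying time in the CE
    unvoiced_ce = sum(unvoiced_durations)  # unvoicedCE: total unvoiced time in the CE
    total = cry_ce + unvoiced_ce
    cry_ce_pct = 100.0 * cry_ce / total if total > 0 else 0.0
    unvoiced_ce_pct = 100.0 * unvoiced_ce / total if total > 0 else 0.0
    return {
        "cryCE (s)": cry_ce,
        "unvoicedCE (s)": unvoiced_ce,
        "cryCE (%)": cry_ce_pct,
        "unvoicedCE (%)": unvoiced_ce_pct,
    }

# Example: a CE with three cry parts separated by two inspirations/pauses.
print(ce_timing_features([1.2, 0.9, 1.5], [0.3, 0.4]))
```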

Audio processing of each CU was conducted with the Praat software (Boersma, 2002), applying a band-pass filter between 200 and 1,200 Hz to compute the F0 and a low-pass filter at 10,000 Hz to compute the spectrum (Rautava et al., 2007). Audio recordings were collected with a sampling rate of 48,000 Hz. The main frequency features computed were the F0 and its descriptive statistics (maximum, minimum, mean, standard deviation), the resonance frequencies of the vocal tract (F1, F2, F3), the percentage of high pitch (F0 ≥ 800 Hz) (Kheddache and Tadj, 2013), and the hyper-phonation (F0 ≥ 1,000 Hz) (Zeskind et al., 2011) level of the CU. Other voice quality parameters related to the phonation of the vocalization were also included: local jitter (Jitter: micro-variations of the F0 measured as deviations in pitch period length), local shimmer (Shimmer: amplitude deviations between pitch periods), and the harmonic-to-noise ratio (HNR: quantifies the amount of additive noise in the voice signal) (Teixeira et al., 2013).
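Praat analyses of this kind can also be scripted from Python through the Parselmouth interface; the sketch below is a minimal example of computing F0 statistics, high-pitch and hyper-phonation percentages, jitter, shimmer, HNR, and formants for one CU. The filter smoothing width, pitch-analysis settings, and jitter/shimmer arguments are standard Praat defaults used here as assumptions, not necessarily the protocol's exact values.

```python
# Hedged Parselmouth sketch for one cry unit; filter smoothing, pitch settings
# and jitter/shimmer arguments are Praat defaults assumed for illustration.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("cry_unit.wav")                      # 48 kHz recording

# Band-pass filter 200-1,200 Hz before the F0 analysis (100 Hz smoothing assumed).
snd_f0 = call(snd, "Filter (pass Hann band)", 200, 1200, 100)

# F0 contour restricted to the same 200-1,200 Hz range.
pitch = snd_f0.to_pitch(pitch_floor=200, pitch_ceiling=1200)
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                              # keep voiced frames only

f0_stats = {
    "F0 max": f0.max(), "F0 min": f0.min(),
    "F0 mean": f0.mean(), "F0 sd": f0.std(),
    "high pitch (%)": 100.0 * np.mean(f0 >= 800),            # F0 >= 800 Hz
    "hyper-phonation (%)": 100.0 * np.mean(f0 >= 1000),      # F0 >= 1,000 Hz
}

# Voice quality: jitter, shimmer and HNR.
point_process = call(snd, "To PointProcess (periodic, cc)", 200, 1200)
jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, point_process], "Get shimmer (local)",
               0, 0, 0.0001, 0.02, 1.3, 1.6)
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 200, 0.1, 1.0)
hnr = call(harmonicity, "Get mean", 0, 0)

# First three formants (vocal tract resonances F1-F3) at the CU midpoint.
formant = snd.to_formant_burg()
t_mid = snd.duration / 2
f1, f2, f3 = (formant.get_value_at_time(i, t_mid) for i in (1, 2, 3))
```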
