The aim of our experiments is to study vocal learning and acoustic matching during self-supervised learning while listening to one speaker or to several. The experimental setup for Experiment 1 consists of a small audio dataset, two minutes long, of a female native French speaker repeating five sentences three times. The audio .wav file is converted into 12-dimensional MFCC vectors computed over 25 ms windows, tested either with a stride of 10 ms or with no stride. The stride is the temporal shift between two samples: if one sound sample covers [0, 25 ms], the next sample covers [10 ms, 35 ms]. A stride of 25 ms guarantees that there is no overlap across samples. The whole sequence represents 14,000 MFCC vectors in the strided case and 10,000 MFCC vectors in the non-strided case.
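For concreteness, the windowing arithmetic above can be reproduced with a standard MFCC extractor. The following is a minimal sketch, assuming the librosa library and a hypothetical file name; it illustrates the 25 ms window / 10 ms stride configuration, not the authors' exact pipeline.

```python
# Sketch of the MFCC extraction described above, assuming librosa.
# Window, stride, and n_mfcc follow the text; the file name is hypothetical.
import librosa

y, sr = librosa.load("speaker1_sentences.wav", sr=None)

win_length = int(0.025 * sr)   # 25 ms analysis window
hop_length = int(0.010 * sr)   # 10 ms stride (set hop = win for no overlap)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=12,
    n_fft=win_length, win_length=win_length, hop_length=hop_length,
)
print(mfcc.shape)  # (12, n_frames): one 12-dim vector per 10 ms step
```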

The numbers of Striatal and GP units are chosen to match the number of MFCC vectors, i.e., 14,000 units per layer (or 10,000 units without strides). We do so in order to test how reliably our architecture retrieves the input data with an orthogonal representation. The compression rate is, however, low (1:1). The MFCC vectors are ordered only by their time of appearance in the .wav file.
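The article's learning rules are not reproduced here, but the 1:1 (orthogonal) coding itself is easy to picture: each unit stores exactly one MFCC vector, indexed by temporal order, and retrieval is a nearest-neighbour lookup. The sketch below is an illustration of that idea only, with random stand-in data, not the authors' code.

```python
# Illustration of a 1:1 orthogonal coding: one Striatal/GP unit per MFCC vector,
# stored in temporal order. Random data stands in for the real MFCC sequence.
import numpy as np

rng = np.random.default_rng(0)
mfcc_seq = rng.standard_normal((14_000, 12))   # stand-in for the 14,000 MFCC vectors

units = mfcc_seq.copy()                        # one unit per vector (compression 1:1)

def retrieve(query):
    """Return the index of the unit closest to the query vector."""
    return int(np.argmin(np.linalg.norm(units - query, axis=1)))

assert retrieve(mfcc_seq[42]) == 42            # each input maps back to its own unit
```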

In contrast, Experiment 2 will use a larger audio dataset, 27 minutes long, from six native French speakers: the same speaker as in Experiment 1 plus two other women and three men, all repeating the same sentences as in the previous experiment. The audio .wav file is converted into 12-dimensional MFCC vectors computed over 25 ms windows, which corresponds to 140,000 MFCC vectors with the 10 ms stride. The numbers of Striatal and GP units are kept the same as in the first experiment (14,000 units), which means that the size of the BG layers is now ten times smaller than the total number of MFCC vectors to be retrieved in the sequence. The compression rate this time is high (1:10). This second experiment will serve to test the generalization capabilities of our architecture and its robustness to high input variability, replicating the correspondence problem.
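How the 14,000 units come to cover 140,000 vectors is determined by the learning itself. Purely as an illustration of what a 1:10 compression means, the sketch below uses k-means vector quantization, which is a generic stand-in and not necessarily the article's mechanism; sizes are scaled down tenfold for speed, keeping the 1:10 ratio.

```python
# Illustration only: a 1:10 compression via vector quantization (k-means),
# a stand-in for, not the article's, learning mechanism. Sizes scaled down
# 10x from the real 14,000 units / 140,000 vectors; the ratio is what matters.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mfcc_seq = rng.standard_normal((14_000, 12))   # stand-in MFCC sequence

vq = MiniBatchKMeans(n_clusters=1_400, n_init=1, random_state=0).fit(mfcc_seq)
codes = vq.predict(mfcc_seq)                   # each vector mapped to one unit
print(len(mfcc_seq) / vq.n_clusters)           # 10.0 input vectors per unit
```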

The sentences used in the audio database were selected because they cover all the syllables of French. For the supervised method, each period takes 10 minutes on a conventional laptop. Stabilization is monitored through the global error: the learning stage can be stopped either when the error falls below a certain threshold or after a maximum number of iterations. The unsupervised method can take much longer, 30 minutes to one hour, to stabilize the dynamics below a given error level. In our computations, we let the system stabilize itself for a maximum of ten periods, independently of any particular threshold level. We provide links to sample .wav files and results, as well as to the source code, at https://git.cyu.fr/apitti/inferno.
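The two stopping rules (error threshold or iteration cap) amount to a simple training-loop skeleton. In the sketch below, the training step and its error decay are hypothetical placeholders, not the released code; only the stopping logic mirrors the text.

```python
# Skeleton of the stopping rule described above: stop when the global error
# drops below a threshold, or after a maximum number of periods.
# `train_one_period` and its error decay are hypothetical placeholders.

def train_one_period(state):
    """Placeholder for one learning period; returns (new_state, global_error)."""
    state["error"] *= 0.7                      # toy error decay for illustration
    return state, state["error"]

state = {"error": 1.0}
threshold, max_periods = 1e-3, 10              # ten periods, as in our computations

for period in range(max_periods):
    state, error = train_one_period(state)
    if error < threshold:                      # early stop on the global error
        break
print(period + 1, error)
```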
