We created two sets of stimuli, one for the speech condition and one for the non-speech condition. The speech stimuli were naturally produced, edited consonant-vowel (CV) syllables [fε] and [fa]. The formants were stable throughout the vowels and corresponded to the Czech low-mid front /ε/ and low /a/, respectively. The first three formants of [ε] in [fε] were 755 Hz, 1646 Hz, and 2710 Hz, and the first three formants of [a] in [fa] were 864 Hz, 1287 Hz, and 2831 Hz; these values are in line with the formants of Czech vowels produced by women reported by Skarnitzl and Volín (2012). The duration of the vowels [ε] and [a] (extracted from the CV frames) was modified using PSOLA in Praat (Boersma and Weenink, 1992–2020). The vowel [a] had a duration of 220 ms, and [ε] was resynthesized with three durations, namely 180, 220, and 360 ms, chosen as follows: 220 ms was judged (by three expert phoneticians) to be a typical duration of the mid and low short vowels in an isolated CV syllable; 360 ms represented a long vowel in a CV syllable that was not perceived as unnaturally exaggerated; and 180 ms, used for the short /ε/, was considered sufficiently distinct from the long /ε:/. To create the stimuli, we cut out the initial fricative consonant [f] from one recorded syllable and combined it with the target [a] and [ε] vowels, such that the fricative [f] was identical across all four speech stimuli and had a duration of 150 ms. None of the created [f] + V syllables carries lexical or morphological content in Czech. The speech stimuli had been used in a behavioral study on vowel perception with Czech-exposed infants (Paillereau et al., 2021) and, more recently, along with the non-speech stimuli described below, in an ERP study with Czech newborns (Chládková et al., under review).
To test the discrimination of a spectral contrast, the non-focal [fε] and the focal [fa], each with a 220-ms vowel, were used. The vowel [a] is considered focal because the distance between its first and second formants is d_a = 2.07 Bark, whereas the vowel [ε] in [fε] is non-focal because its first two formants are d_ε = 4.08 Bark apart. The difference between [a] and [ε] thus lies in their perceptual prominence, with [a] being the more prominent vowel. The discrimination of a durational contrast was tested with the short (180-ms) and the long (360-ms) [fε]. As with the spectral dimension, the short and the long vowel differ in perceptual prominence: the short vowel contains energy over a shorter time interval (i.e., less energy in total) and can thus be seen as a perceptually less prominent stimulus than the long vowel, whose energy extends over a longer interval. The intensity of the stimuli was peak-scaled so that it matched across all four syllables.
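The F1-F2 distances quoted above can be roughly checked with a few lines of Python. This is only a sketch: the text does not say which Hz-to-Bark conversion was used, so the formula 6·asinh(f/600), a commonly used closed-form approximation, is an assumption, and the function names are hypothetical. With this formula the distances come out at about 2.07 Bark for [a] and 4.09 Bark for [ε], close to the reported 2.07 and 4.08 Bark.

```python
import math


def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark.

    ASSUMPTION: the source does not name its conversion formula;
    6 * asinh(f / 600) is one common approximation, used here for illustration.
    """
    return 6.0 * math.asinh(f_hz / 600.0)


def f1_f2_distance_bark(f1_hz, f2_hz):
    """Distance between the first two formants on the Bark scale."""
    return hz_to_bark(f2_hz) - hz_to_bark(f1_hz)


# Formant values (Hz) as reported for the speech stimuli.
d_a = f1_f2_distance_bark(864.0, 1287.0)    # [a] in [fa]
d_eps = f1_f2_distance_bark(755.0, 1646.0)  # [ε] in [fε]

print(f"[a]:       F2 - F1 = {d_a:.2f} Bark")
print(f"[epsilon]: F2 - F1 = {d_eps:.2f} Bark")
```

Other Bark approximations give somewhat different absolute values, but the ordering (a smaller F1-F2 distance for [a] than for [ε]) does not depend on the choice, since any such conversion is monotonic and the comparison here is large.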
The non-speech stimuli were inharmonic tone complexes with spectral and durational properties mimicking those of the vowels described above. Inharmonic tone complexes are comparable in complexity to vowels in that their source signal contains a series of frequency components (analogous to the harmonics of a fundamental frequency) and is filtered with vocal-tract-like formants. At the same time, the inharmonic tone complexes are not confusable with vowels because their source signal frequencies are spaced inharmonically (Goudbeek et al., 2009; Scharinger et al., 2014). The tone complexes in the present experiment had 15 inharmonically spaced frequency components, the first at 500 Hz and each subsequent component at 1.15 times the frequency of the preceding one. The inharmonic source signal was filtered with three formants: those of [a] for the focal spectral stimulus, and those of [ε] for the non-focal spectral stimulus and for the short and long durational stimuli. Durations of the non-speech stimuli were identical to the durations of the vowels from the speech condition. The amplitude was ramped linearly over 5 ms at stimulus onset and offset. Sound intensity was scaled to be identical across all four stimuli. As in the speech condition, the [a]-like focal tone (prominent) and the [ε]-like non-focal 220-ms tone (non-prominent) were used to test discrimination of spectral differences, and the 180-ms [ε]-like tone (non-prominent) and the 360-ms [ε]-like tone (prominent) were used to test discrimination of duration differences.
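The construction of the inharmonic source signal can be sketched as follows. This is a minimal illustration, not the authors' synthesis script: the sampling rate, the equal component amplitudes, and the zero starting phases are assumptions, all function names are hypothetical, and the formant filtering step is left out because the filter bandwidths are not given in the text.

```python
import math


def component_frequencies(first_hz=500.0, ratio=1.15, n_components=15):
    """Frequencies of the inharmonic components: 500 Hz, then x1.15 each step."""
    return [first_hz * ratio ** k for k in range(n_components)]


def linear_ramp_envelope(n_samples, ramp_samples):
    """Amplitude envelope with linear onset and offset ramps, flat in between."""
    envelope = [1.0] * n_samples
    for i in range(min(ramp_samples, n_samples // 2)):
        gain = i / ramp_samples
        envelope[i] = gain                   # onset ramp
        envelope[n_samples - 1 - i] = gain   # offset ramp
    return envelope


def inharmonic_source(duration_s, sample_rate=44100, ramp_s=0.005):
    """Unfiltered inharmonic tone complex with 5-ms linear ramps.

    ASSUMPTIONS: 44.1 kHz sampling rate, equal-amplitude sine components,
    zero starting phase. None of these is stated in the source.
    """
    freqs = component_frequencies()
    n_samples = round(duration_s * sample_rate)
    envelope = linear_ramp_envelope(n_samples, round(ramp_s * sample_rate))
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = sum(math.sin(2.0 * math.pi * f * t) for f in freqs) / len(freqs)
        samples.append(envelope[i] * value)
    return samples


if __name__ == "__main__":
    for k, f in enumerate(component_frequencies(), start=1):
        print(f"component {k:2d}: {f:8.1f} Hz")
    # One source signal per stimulus duration used in the experiment (in seconds).
    for duration in (0.180, 0.220, 0.360):
        print(duration, len(inharmonic_source(duration)))
```

Because each component is a fixed multiple (1.15) of the previous one rather than an integer multiple of a common fundamental, the components do not form a harmonic series, which is what keeps the complex from sounding like a voiced vowel.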