B. Stimuli

Steven P. Gianakas
Matthew B. Winn

There were three speech continua that varied in spectral cues changing at different speeds: fast (stop formant transitions), medium (fricative spectra), and slow (vowel formants), described below. Speech stimuli consisted of modified natural speech tokens that were originally spoken by a native speaker of American English. Stimulus contexts were chosen to minimize any bias other than lexical bias (e.g., bias from lexical frequency or familiarity). Real-word targets within each stimulus set were chosen with the intention of having roughly equivalent and sufficient lexical frequency and familiarity. Our initial verification of those features was based on the HML database, which is an online database of 20 000 words containing lexical frequency and familiarity ratings (Nusbaum et al., 1984). The familiarity and frequency scores from the HML database were also comparable to the scores found in a more recent lexical neighborhood and phonotactic probability database (Vitevitch and Luce, 2004). The HML database rates familiarity from 1 (least familiar) to 7 (most familiar), with a mean of 5.5. Any familiarity rating of 5.5 or higher indicates that the word is familiar and its meaning is well known (Nusbaum et al., 1984); all real-word stimuli had familiarity ratings of 7 except for “dash” (6.916). The frequency of words is quantified in usage per 1 × 10⁶ words of text, with a mean of 40.8. Frequency ratings were broadly similar across the real-word stimuli: “dash” was 11, “dot” was 13, “safe” was 58, “share” was 98, and “gap” was 17. The exception was “back,” which was more common at 967.

For all stimuli, phonetic environments surrounding the segment of interest were kept consistent across stimulus sets using a cross-fading/splicing method, described in detail for each stimulus below. All stimulus modifications were completed using the Praat software. Stimuli are available to download via the supplementary materials.1

A continuum from /æ/ to /ɑ/ was created in the /d_ʃ/ context, creating sounds that ranged from “dash” (a real word) to “dosh” (a non-word), and also in the /d_t/ context (ranging from “dat” to “dot”). Recordings of the words dash and dot spoken by a native speaker of American English were manipulated to produce a continuum from /æ/ to /ɑ/, using the method described by Winn and Litovsky (2015). In short, the vocalic portion of the syllable, from the offset of the /d/ burst to the vowel/consonant boundary, was decomposed into voice source and vocal tract filter using the LPC inverse-filtering algorithm in Praat. The filter was represented using FormantGrids that were systematically modified between the formant contours for the original vowel (/æ/) and the formant contours for a recording of the vowel /ɑ/ in dosh. Formants 1 through 4 were sampled at 30 time points throughout the vowel; continuum steps were interpolated using the Bark frequency scale to account for auditory non-linearities in frequency perception. High-frequency energy normally lost during LPC decomposition was restored using the method described in full detail by Winn and Litovsky (2015). A uniform /d/ burst (with prevoicing) was prepended to each vowel.
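The Bark-scale interpolation of formant contours can be illustrated with a minimal Python sketch. This is not the authors' Praat script. It assumes Traunmüller's (1990) Hz-to-Bark approximation, which the text does not specify, and evenly spaced continuum steps. The function names, the number of steps, and the formant values in the usage lines are all hypothetical.

```python
# Sketch of Bark-scale interpolation between two formant contours.
# ASSUMPTION: Traunmueller (1990) Hz<->Bark approximation; the source does
# not say which Bark formula was used.

def hz_to_bark(f_hz):
    """Convert frequency in Hz to Bark (Traunmueller approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Invert hz_to_bark: convert Bark back to Hz."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def interpolate_formant(f_start_hz, f_end_hz, step, n_steps):
    """Formant value at continuum step `step` (0 .. n_steps - 1).

    Interpolation is linear in Bark, so steps are evenly spaced on an
    auditory scale rather than in Hz.
    """
    w = step / (n_steps - 1)
    z = (1.0 - w) * hz_to_bark(f_start_hz) + w * hz_to_bark(f_end_hz)
    return bark_to_hz(z)

def make_continuum(contour_a, contour_b, n_steps):
    """Interpolate two equal-length contours (Hz values at matching time
    points) and return one contour per continuum step."""
    return [
        [interpolate_formant(a, b, step, n_steps)
         for a, b in zip(contour_a, contour_b)]
        for step in range(n_steps)
    ]

# Hypothetical F1 contours (Hz) at three time points, for illustration only.
f1_ae = [650.0, 700.0, 680.0]
f1_ah = [700.0, 750.0, 740.0]
f1_steps = make_continuum(f1_ae, f1_ah, n_steps=7)
```

In the procedure described above, the same interpolation would be applied to each of formants 1 through 4 at each of the 30 sampled time points.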

Using the continuum of /dæ/ to /dɑ/, we created two lexical environments by prepending the continuum steps to either a 300-ms /ʃ/ segment excised from the original recording of dash or to the /t/ (including a 94-ms closure gap) excised from the original recording of dot. The final result was a set of two continua: one that changed from dash to dosh (where lexical bias would favor the /æ/ endpoint) and another that changed from dat to dot (where bias should favor the /ɑ/ endpoint). See Fig. 1 for an illustration of the stimulus construction for the dash-dosh and dat-dot continua.

(Color online) Illustration of stimulus construction for the dash-dosh and dat-dot continua. Concat refers to concatenation of two waveform segments. Blend refers to cross-fading of two waveform segments.

A continuum from /s/ to /ʃ/ was created and prepended to /_eɪf/ (“shafe,” a non-word, and “safe,” a real word) and to /_eɪr/ (“share,” a real word, and “sare,” a non-word). /s/ and /ʃ/ fricative segments of equal duration were extracted from recordings of safe and share. The fricative segments were subjected to a blending/mixture procedure in which the amplitude of one segment was multiplied by a factor between 0 and 1, and the other segment was multiplied by 1 minus that factor. For example, /s/ would be multiplied by 0.2 and /ʃ/ would be multiplied by 0.8. These fricatives were prepended to the vowel and offset of safe. The differences in vowel-onset formant transitions were discarded based on the judgment by the authors that the /s/-onset vowel permitted natural-like perception of either fricative (furthermore, any residual effects of formant transitions would be balanced across opposing lexical contexts).
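The complementary-weight mixing of the two fricatives can be sketched in a few lines of Python. This is an illustration only, not the authors' Praat procedure. It assumes that the two segments are equal-length lists of samples and that continuum weights are evenly spaced, which the text does not state. The function names and the number of steps are hypothetical.

```python
# Sketch of fricative blending with complementary amplitude weights.
# ASSUMPTION: evenly spaced weights across the continuum (not stated in
# the source); samples are plain lists of floats.

def blend_fricatives(s_samples, sh_samples, s_weight):
    """Mix /s/ and /sh/ segments: /s/ scaled by s_weight, /sh/ by 1 - s_weight."""
    if len(s_samples) != len(sh_samples):
        raise ValueError("segments must have equal duration (same length)")
    if not 0.0 <= s_weight <= 1.0:
        raise ValueError("s_weight must lie between 0 and 1")
    return [s_weight * s + (1.0 - s_weight) * sh
            for s, sh in zip(s_samples, sh_samples)]

def fricative_continuum(s_samples, sh_samples, n_steps):
    """Return n_steps blends running from pure /sh/ (weight 0) to pure /s/ (weight 1)."""
    return [blend_fricatives(s_samples, sh_samples, step / (n_steps - 1))
            for step in range(n_steps)]

# The example from the text: /s/ multiplied by 0.2 and /sh/ by 0.8.
# The sample values here are made up for illustration.
mixed = blend_fricatives([0.5, -0.5, 0.25], [0.1, 0.2, -0.3], s_weight=0.2)
```

Because the two weights always sum to 1, the overall level stays comparable across steps while the spectral balance shifts from one fricative to the other.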

To create the /_eɪr/ environment, the first half of the /_eɪf/ syllable was spliced onto the second half of the syllable share so that, regardless of phonetic environment (/_eɪf/ or /_eɪr/), the syllable onset (including fricative and vowel onset) remained exactly the same, removing any acoustic bias other than the segments that determined the word offset. Splicing was done using a cross-fading procedure in which an 80-ms offset ramp on the first half was combined with an 80-ms onset ramp on the second half to ensure a smooth transition. The exact placement of the cross-splicing boundary was chosen strategically to avoid any discontinuities in envelope periodicity. The final result was a set of two continua: one that changed from safe to shafe (where bias should favor /s/) and another that changed from share to sare (where bias should favor /ʃ/). For the share-sare continuum, the duration of the initial fricative was roughly 220 ms and the /eɪ/ was approximately 162 ms (although there is not a straightforward transition into the /r/); the whole syllable coda was approximately 334 ms. The safe-shafe continuum had the same durations except for the coda /f/, which was approximately 321 ms. See Fig. 2 for an illustration of the share-sare and shafe-safe continua.
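A cross-fade splice with matched 80-ms ramps can be sketched as follows. This is a minimal illustration rather than the authors' Praat implementation. It assumes linear ramps, since the ramp shape is not given in the text, and it treats each half-syllable as a list of samples. The function name and the sampling rate in the usage lines are hypothetical.

```python
# Sketch of cross-fade splicing: the end of the first segment fades out
# while the start of the second segment fades in over the same window.
# ASSUMPTION: linear ramps (the source does not specify the ramp shape).

def crossfade_splice(first, second, sample_rate, ramp_ms=80.0):
    """Join `first` and `second`, overlapping them by `ramp_ms` milliseconds."""
    n = int(round(sample_rate * ramp_ms / 1000.0))
    if n < 1 or n > len(first) or n > len(second):
        raise ValueError("ramp must be at least 1 sample and fit in both segments")
    head = first[:len(first) - n]   # untouched part of the first segment
    tail = second[n:]               # untouched part of the second segment
    overlap = []
    for i in range(n):
        w = (i + 1) / (n + 1)       # onset weight, rising toward 1
        overlap.append((1.0 - w) * first[len(first) - n + i] + w * second[i])
    return head + overlap + tail

# Hypothetical usage: two constant dummy "segments" at a made-up 1000-Hz rate.
joined = crossfade_splice([1.0] * 200, [0.0] * 300, sample_rate=1000)
```

The offset and onset weights sum to 1 at every sample, so two segments of equal level pass through the overlap without a dip or bump in amplitude. The spliced result is shorter than the two inputs combined by the length of the ramp.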

(Color online) Illustration of the creation of the continuum for share-sare and shafe-safe. Concat refers to concatenation of two waveform segments. Onset and offset ramps were applied to blend the /e/ vowel onset into the offset portion of either the /_eɪr/ or /_eɪf/ vocalic portions, ensuring equivalent vowel onsets but natural transitions into the different syllable codas.

A continuum from /bæ/ to /ɡæ/ was created and prepended to /_p/ (with 85 ms closure duration) to make “gap,” a real word, and “bap,” a non-word, or to /_k/ (with 75 ms closure duration) to make “gack,” a non-word, and “back,” a real word. This continuum was made using the same basic procedure as for the /æ/-/ɑ/ continuum, where formant contours were imposed on a voice source that was derived from an inverse-filtered utterance. The onset (prevoicing and release burst) that was prepended to the vowel was a blend of 50% /b/ and 50% /ɡ/ onsets so that the only cue available for the distinction was the set of formant transitions. After a continuum of bap-gap syllables was generated, the /k/ from back was appended to the vocalic portion, which was truncated to remove glottalization that could cue /p/. This step was taken with a similar motivation as for the /ʃ/-/s/ continuum, so that regardless of word-final phonetic environment, the syllable onset remained exactly the same within each continuum. The final result was a set of two continua: one that changed from back to gack (where bias should favor /b/) and another that changed from gap to bap (where bias should favor /ɡ/); see Fig. 3.

(Color online) Illustration of the creation of the continuum for back-gack and gap-bap. Concat refers to concatenation of two waveform segments.
