Use of ultrasound in studies with long collection times requires a method of fixing the ultrasound transducer position relative to the head; because the resulting images contain no hard oral cavity structures, it is otherwise impossible to directly compare images across time and/or subjects. For this study, there was the additional need to allow participants to play a trombone while having their tongue imaged with ultrasound. We used a modified version of the University of Canterbury non-metallic jaw brace (Derrick et al., 2015) that was narrow enough not to contact the trombone tubing running along the left side of the player’s face. The device ties probe motion to jaw motion and thus reduces motion variance. An assessment of the motion variance of the system, evaluating tongue and head movement data collected using both ultrasound and electromagnetic articulography (EMA; Derrick et al., 2015), showed that 95 percent confidence intervals of probe motion and rotation were well within acceptable parameters described in a widely cited paper that traced head and transducer motion using an optical system (Whalen et al., 2005). We are not aware of any alternative systems available at the time of the data collection that would have been compatible with trombone performance. Similarly, EMA would have been unsuitable for use in this study due to long setup times (fixing the sensors in place requires anywhere from 20 to 45 min) and the danger of sensors coming loose during a long experiment (participants were recorded for around 45 min, on average) that featured possibly more forceful tongue movements as well as higher amounts of airflow than previous speech-only experiments.
Furthermore, EMA only provides data for isolated flesh points that will be inconsistently placed across individuals and it is very difficult and often impossible to position articulography sensors at the back of the tongue due to the gag reflex, meaning that we probably would have been unable to document the differences in tongue position we found at the back of the tongue using ultrasound imaging.
All study data were collected using a GE Healthcare Logiq e (version 11) ultrasound machine with an 8C-RS wide-band microconvex array 4.0–10.0 MHz transducer. Midsagittal videos of tongue movements were captured on either a late 2013 15′′ 2.6 GHz MacBook Pro or a late 2012 HP Elitebook 8570p laptop with a 2.8 GHz i5 processor, both running Windows 7 (64-bit). Two USB inputs were encoded using the command line utility FFmpeg (FFmpeg, 2015): the video signal, transmitted via an Epiphan VGA2USB pro frame grabber, and the audio signal, recorded with a Sennheiser MKH 416 shotgun microphone connected to a Sound Devices LLC USBPre 2 microphone amplifier. Video was encoded with either the x264 codec (for video recorded on the MacBook Pro) or the MJPEG codec (for video recorded on the HP Elitebook), while audio was encoded as uncompressed 44.1 kHz mono.
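As a rough illustration of the encoding setup described above, the following sketch assembles an FFmpeg argument list for one recording. It is not the command used in the study: the device names, the output file name, the Matroska container, and the 16-bit sample depth are all assumptions, and `build_ffmpeg_args` is a hypothetical helper. The list is only built and printed, not executed.

```python
def build_ffmpeg_args(video_device, audio_device, codec, out_path):
    """Return an FFmpeg argument list for one capture session (not executed here)."""
    if codec not in ("libx264", "mjpeg"):
        raise ValueError("codec must be 'libx264' or 'mjpeg'")
    return [
        "ffmpeg",
        "-f", "dshow",            # DirectShow capture, the usual route on Windows
        "-i", f"video={video_device}:audio={audio_device}",
        "-c:v", codec,            # x264 or MJPEG video, as named in the text
        "-c:a", "pcm_s16le",      # uncompressed PCM; the 16-bit depth is an assumption
        "-ar", "44100",           # 44.1 kHz sampling rate
        "-ac", "1",               # mono
        out_path,
    ]


# Device names and file name below are placeholders, not the study's values.
args = build_ffmpeg_args(
    "Hypothetical Frame Grabber",
    "Hypothetical USB Audio",
    "mjpeg",
    "session01.mkv",
)
print(args)
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with device names that contain spaces.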
Although the ultrasound machine acquired images within a 110-degree field of view at 155–181 Hz depending on scan depth (155 Hz for 10 cm, 167 Hz for 9 cm, and 181 Hz for 8 cm), the bandwidth limitations of the frame grabber meant that the frame rates recorded to the laptops reached only 58–60 Hz. Frames were encoded in a progressive scan uyvy422 pixel format (packed YUV 4:2:2, 16 bits per pixel; 2:1 horizontal chroma downsampling, no vertical downsampling) at 1024 × 768. Because the slowest scan rate of 155 Hz corresponds to a frame period of 1/155 s ≈ 6.45 ms, the potential temporal misalignment of image content grabbed from the top versus bottom of the ultrasound machine screen (via the frame grabber; the misalignment is called ‘tearing’) would never exceed 6.45 milliseconds.
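The 6.45 ms bound can be reproduced with a short calculation. This sketch assumes that the worst-case tearing equals one refresh interval of the ultrasound machine at the slowest scan rate; the variable names are illustrative.

```python
# Scan rates reported in the text, keyed by scan depth in cm.
scan_rates_hz = {10: 155, 9: 167, 8: 181}

# Frame period in milliseconds for each scan depth.
frame_period_ms = {depth: 1000.0 / rate for depth, rate in scan_rates_hz.items()}

# Assumed bound on tearing: one full frame period at the slowest scan rate.
worst_case_ms = max(frame_period_ms.values())

for depth, period in sorted(frame_period_ms.items()):
    print(f"{depth} cm scan depth: {period:.2f} ms per frame")
print(f"worst-case misalignment: {worst_case_ms:.2f} ms")  # 6.45 ms
```

The shallower scan depths run at higher rates, so their frame periods, and hence the possible misalignment, are shorter.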
All NZE-speaking participants and one Tongan participant were recorded in a small sound-attenuated room at the University of Canterbury in Christchurch, New Zealand. No equivalent room was available for the other Tongan participants, so their recordings were completed in a small empty room on the campus of the Royal Tongan Police Band in Nuku’alofa, the capital city of the Kingdom of Tonga.
All NZE-speaking participants except the first were asked to read a list of 803 real mono- and polysyllabic words off a computer screen. Words were presented in blocks of three to five items using Microsoft PowerPoint, with the next slide appearing after a pre-specified, regular interval; the first participant instead read a list of words of similar length printed on paper and presented in lines of three to seven items, depending on orthographic length. Words were chosen to elicit all eleven monophthongs of NZE (see Figure 1) in stressed position plus unstressed schwa (see Heyne, 2016, pp. 252–255 for the full word list). Note that we distinguish schwa occurring in non-final and final positions in our analyses, as we were previously able to show that these sounds are acoustically and articulatorily different and display phonetic variability with speech style comparable to other vowel phonemes (Heyne and Derrick, 2016a). Words were also chosen to elicit all combinations with preceding coronal (/t, d, n/) and velar (/k, g/) consonants, as well as rhotics and laterals. Although it is well known that read speech and wordlists result in somewhat unnatural speech production (Barry and Andreeva, 2001; Zimmerer, 2009; Wagner et al., 2015), this form of elicitation was chosen to ensure that the desired phoneme combinations were reliably produced, and to facilitate automatic acoustic segmentation. While the blocks usually contained words with the same stressed consonant-vowel combination, the sequence of the blocks was randomized so participants would not be able to predict the initial sound of the first word on the following slide; all NZE participants read the list in the same order. This procedure resulted in nine blocks of speech recordings lasting roughly 2 min and 20 s each, except for the first participant, who was shown the next block after he had finished reading the previous one.
The same setup was used for the Tongan speakers, who read through a list of 1,154 real mono- and polysyllabic words to elicit all five vowels of Tongan, both as short and long vowels, and occurring in combination with the language’s coronal and velar consonants (see Table 1; see Heyne, 2016, pp. 249–251 for the full word list); all Tongan participants read the list in the same order. In Tongan, ‘stress’ is commonly realized as a pitch accent on the penultimate mora of a word (Anderson and Otsuka, 2003, 2006; Garellek and White, 2015), although there are some intricate rules for ‘stress’ shift that do not apply when lexical items are elicited via a list. We only analyzed stressed vowels with stress assigned to the penultimate mora. Because Tongan words are often quite short, consisting minimally of a single vowel phoneme, eliciting the Tongan wordlist took less time than eliciting the NZE wordlist, even though the latter contained fewer words.
Additionally, speakers from both language groups were asked to read out the syllables /tatatatata/ or /dadadadada/ at the beginning and end of each recording block to elicit coronal productions used to temporally align tongue movement with the resulting rise in the audio waveform intensity (Miller and Finch, 2011).
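To illustrate the audio side of this alignment step, the sketch below locates the first rise in waveform intensity using a windowed RMS envelope and a relative threshold. It is a minimal stand-in rather than the procedure used in the study: the function name, the 10 ms window, and the 25% threshold are assumptions chosen for illustration, and the input is a synthetic signal.

```python
import math


def first_intensity_rise(samples, sample_rate, window_s=0.010, threshold_ratio=0.25):
    """Return the start time (s) of the first window whose RMS reaches
    threshold_ratio times the peak windowed RMS, or None if the signal is silent."""
    win = max(1, round(window_s * sample_rate))
    rms = []
    # Non-overlapping windows; a trailing partial window is ignored.
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        rms.append(math.sqrt(sum(x * x for x in chunk) / win))
    if not rms or max(rms) == 0:
        return None
    threshold = threshold_ratio * max(rms)
    for i, value in enumerate(rms):
        if value >= threshold:
            return i * win / sample_rate
    return None


# Synthetic check: 0.5 s of silence followed by 0.5 s of a 200 Hz tone.
sr = 44100
signal = [0.0] * (sr // 2) + [
    math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)
]
onset = first_intensity_rise(signal, sr)
print(onset)
```

The returned time is only resolved to one window length, so a shorter window gives a finer estimate at the cost of a noisier envelope. The matching tongue movement onset would still have to be identified in the ultrasound video.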
The musical passages performed by all study participants were designed to elicit a large number of sustained productions of different notes within the most commonly used registers of the trombone. Notes were elicited at different intensities (piano, mezzopiano, mezzoforte, and forte; we also collected some notes produced at fortissimo intensity but removed them due to insufficient token numbers across the two language groups) and with various articulations including double-tonguing, which features a back-and-forth motion of the tongue to produce coronal and velar articulations. To control as much as possible for the intonation of the produced notes, five out of a total of seven passages did not require any slide movement and participants were asked to ‘lock’ the slide for this part of the recordings (the slide lock on a trombone prevents extension of the slide). The difficulty of the selected musical passages was quite low to ensure that even amateurs could execute them without prior practice. Participants were asked to produce the same /tatatatata/ or /dadadadada/ syllables described above at the beginning and end of each recording block in order to allow for proper audio/video alignment.
Trombone players these days can choose to perform on instruments produced by a large number of manufacturers, built of various materials and with varying physical dimensions, both of which influence the sound produced by the instrument (Pyle, 1981; Ayers et al., 1985; Carral and Campbell, 2002; Campbell et al., 2013 among others). For this reason, we asked all participants to perform on the same plastic trombone (‘pBone’ - Warwick Music, Ltd., United Kingdom) and mouthpiece (6 1/2 AL by Arnold’s and Son’s, Wiesbaden, Germany); the first English participant performed on his own ‘pBone’ using his own larger mouthpiece.