Robert Mac Auslan, Joel Mac Auslan, and Linda J. Ferrier-Reid of Phonologics chart the evolution of intelligibility assessments for non-native speakers of English
With the increased use of global English in business, medicine, government, technology, and many other industries, the need to speak intelligibly is growing across the world. The intelligibility of non-native speakers is therefore an important issue in both the academic world and the workplace.
To train speakers for better intelligibility, we must first be able to judge that intelligibility quickly and accurately. Human judges need extensive training, and their judgments are often biased and inconsistent. Technology is stepping in here to provide quicker and more objective ratings of speaker intelligibility. This article introduces a variety of such technologies available today and the areas in which they are particularly critical.
Reduced Intelligibility can Lead to Fatal Miscommunications
Miscommunication can occur in any human interaction. In some situations, miscommunication results in serious consequences, as medical institutions know to their cost. Anecdotes of such miscommunications are very common, with some of the most chilling examples coming out of the airline industry, where results can be fatal.
Communication in the air is generally carried out in English. Indeed, nothing underscores the subtle complexities of speech communication more strikingly than the miscommunications that occur among pilots, crewmembers, and air traffic controllers. When different words or phrases sound exactly or nearly alike, it can be problematic. Confusion is possible, for example, because “left” can sound very much like “west.”
In a Federal Air Surgeon’s Medical Bulletin article entitled “Thee… Uhhmm… Ah…, ATC-Pilot Communications,” Mike Wayda writes, “When you produce these hesitations while speaking, you are using ‘place holders,’ or ‘filled pauses,’ a type of speech dysfluency especially common in pilot-controller exchanges.” Until recently, such speech dysfluencies and other mistakes were not considered important; however, new research suggests that there is a correlation between miscommunications and mistakes.
What is Intelligibility?
How do we define intelligibility, and how is it measured? Intelligibility refers to the ability of a listener to recognize and understand a word, phrase, or sentence produced by a non-impaired speaker. Intelligibility is influenced by the social and linguistic context of the speech: if the listener is familiar with the topic under discussion, intelligibility will be higher, and it is also higher when the speaker is in a noise-free environment. Finally, intelligibility varies according to how familiar the listener is with the speech pattern of the speaker. (A well-known phenomenon is the seemingly miraculous improvement of a non-native speaker’s intelligibility over time in the ears of his or her teacher, even when objective testing shows no real improvement.)
Intelligibility is often measured by the percentage of phonemes that listeners can accurately transcribe from recorded speech. It is also often rated on Likert scales, where the listener selects from options ranging from, for example, “totally unintelligible” to “completely intelligible.”
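To make the transcription-based measure concrete, here is a minimal sketch in Python. The phoneme sequences and the scoring function are our own invented illustration, not part of any test described in this article; real assessments use far more careful phonetic alignment.

```python
from difflib import SequenceMatcher

def percent_phonemes_correct(target, transcribed):
    """Percentage of target phonemes recovered in a listener's transcription.

    Aligns the two phoneme sequences and counts matching phonemes; this
    simple alignment is illustrative only.
    """
    matcher = SequenceMatcher(None, target, transcribed)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(target)

# Hypothetical example: a listener hears "rice" /r ay s/ but transcribes
# only /r ay/ (final /s/ dropped, as described for some Mandarin speakers).
target = ["r", "ay", "s"]
heard = ["r", "ay"]
print(f"{percent_phonemes_correct(target, heard):.0f}% phonemes correct")  # 67%
```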
What is a “Foreign Accent”?
Here we are interested in an accent only to the extent that it reduces intelligibility, concentrating on pronunciation and ignoring vocabulary and grammar. Non-native speakers are sometimes unintelligible because the speech patterns of their first language interfere with their pronunciation of American English. German speakers, for example, often substitute /v/ for /w/. Some languages, such as Mandarin Chinese, do not allow obstruents (sounds created by restricting airflow through the oral cavity) at the end of a word or syllable, so the final consonant is omitted: in the word rice, the final /s/ sound is left off. In some languages, the /t/ sound is produced more like a /d/, which can lead to confusions of meaning, such as English listeners hearing die instead of tie.
Prosodic effects are also important. Prosody covers a number of systems that affect intelligibility, including intonation and sentence stress or accent, which in English are determined mostly by the speaker’s focus and by whether an item is being mentioned for the first time in the conversation. Unfortunately, there are few simple rules to guide the learner of English; word-stress patterns must generally be learned on a word-by-word basis. In addition, speakers of tone languages, such as Mandarin Chinese, have difficulty carrying an uninterrupted pitch contour over an utterance and assigning correct sentence stress to the most important word or words in a sentence. To the ears of native speakers, their productions sound “jerky.”
How did Speech Assessment Evolve?
Human-Scored Testing
Initially, all speech testing relied on the judgment of a human listener, who is, of course, prone to fatigue, bias, and unreliability. This is probably still the most common way to evaluate speaking effectiveness and intelligibility. Speakers are evaluated in reading, responding to prompts, or in free conversation.
The SPEAK Test
The Speaking Proficiency English Assessment Kit (SPEAK) is an oral test developed by the Educational Testing Service (ETS), and it perhaps epitomizes the traditional way of evaluating speech. Its aim is to evaluate the examinee’s proficiency in spoken English. ETS also developed the four-skills (listening, reading, speaking, and writing) TOEFL iBT test, whose speaking section is scored by human listeners and has undergone extensive statistical and reliability analysis. The speaking section of the TOEFL is not available separately from the other sections, but institutions wishing to test speaking skills only may choose the TOEIC (Test of English for International Communication) Speaking Test, also developed by ETS and available as a stand-alone assessment.
Acoustic Analysis of Speech
Since acoustic analysis methods became readily available in the 1960s, there has been a steady stream of research documenting particular features of standard American English speech in single words and sentences and, more recently, of non-native speech, allowing comparison of the two. These studies have enabled the computer analysis of speech in programs such as the Versant Testing System, Carnegie Speech Assessment, and the Automated Pronunciation Screening Test (APST), which use large-scale statistical studies of native and non-native speech as the basis for their assessments. Because of the difficulty of training listeners to achieve reasonable reliability with each other, and the time it takes to score spoken tests, computer-based testing offers the hope of more rapid and reliable intelligibility assessment. These three tests are described below.
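The statistical comparison underlying such systems can be sketched in miniature: measure an acoustic feature of the speaker’s speech and ask how far it falls from the distribution of that feature in native speech. In the Python sketch below, the feature is voice onset time (VOT), which helps distinguish /t/ from /d/; the native measurements and the single-feature z-score are invented for illustration, since real systems model many features jointly over large corpora.

```python
from statistics import mean, stdev

# Invented native-speaker measurements of voice onset time (VOT, in ms)
# for word-initial /t/; real systems are trained on large speech corpora.
native_vot_ms = [75, 82, 68, 90, 71, 85, 78, 80, 74, 88]

def feature_zscore(value, reference):
    """How many standard deviations a speaker's value lies from the
    native-speaker mean for this feature."""
    return (value - mean(reference)) / stdev(reference)

# A speaker producing /t/ with a short, /d/-like VOT of 30 ms:
z = feature_zscore(30, native_vot_ms)
print(f"z = {z:.1f}")  # strongly negative: far outside the native range
```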
The Versant Testing System
Versant Technology originally developed a telephone-based test in which the speaker repeated items or responded to prompts. This first test primarily evaluated speaker fluency. More recently, Versant has developed a system presented on a computer, described on their website: “The Versant testing system, based on the patented Ordinate technology, uses a speech-processing system that is specifically designed to analyze speech from native and non-native speakers of the language tested. In addition to recognizing words, the system also locates and evaluates relevant segments, syllables, and phrases in speech. The Versant testing system then uses statistical modeling techniques to assess the spoken performance.
“Base measures are then derived from the linguistic units (segments, syllables, words), based on statistical models built from the performance of native and non-native speakers. The base measures are combined into four diagnostic sub-scores using advanced statistical modeling techniques. Two of the diagnostic sub-scores are based on the content of what is spoken, and two are based on the manner in which the responses are spoken. An overall score is calculated as a weighted combination of the diagnostic sub-scores.”
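Versant does not publish its weights, but the final step it describes, combining diagnostic sub-scores into an overall score as a weighted combination, is easy to illustrate. In the Python sketch below, the sub-score names follow the two content-based and two manner-based categories mentioned above, while the values, scales, and weights are invented for illustration.

```python
# Hypothetical sub-scores and invented weights; Versant's actual
# sub-scores, scales, and weights are proprietary.
subscores = {
    "sentence_mastery": 62,   # content-based
    "vocabulary": 58,         # content-based
    "fluency": 55,            # manner-based
    "pronunciation": 51,      # manner-based
}
weights = {
    "sentence_mastery": 0.30,
    "vocabulary": 0.20,
    "fluency": 0.30,
    "pronunciation": 0.20,
}

# Overall score as a weighted combination of the diagnostic sub-scores.
overall = sum(weights[name] * score for name, score in subscores.items())
print(f"Overall score: {overall:.1f}")  # 56.9 for these invented values
```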
Carnegie Speech Assessment
This system uses speech recognition and pinpointing technology licensed from Carnegie Mellon University to assess an individual’s speech. By pinpointing exactly what was correct and incorrect in the speaker’s pronunciation, grammar, and fluency, the system produces accurate and objective English assessments. Specific features include:
• Rapid assessment of spoken English by analyzing each student’s speech against a statistical composite voice model of native speakers.
• Self-directed tutorials that reduce administrative requirements and costs.
• Tunable grading scale that customizes results to each organization’s operational or educational requirements.
• Immediately available and objective reports that can be compared across multiple applicants as well as across business and educational enterprises.
• Detailed reports on individual users that allow information on each applicant’s language proficiency to flow from hiring to training departments, eliminating redundant assessments.
Automated Pronunciation Screening Test (APST)
The APST uses knowledge-based speech analysis and is based on the careful study and acoustic analysis of the target speech. It is designed to test large groups of non-native speakers quickly, accurately, and objectively. Speakers first practice recording items and then read words and sentences, which are recorded into the computer. These recordings are sent to Phonologics via the web, where they are automatically scored and a report is made available to the test administrator within minutes. The test provides sub-scores on particular aspects of speech and a summary score that indicates the intelligibility of the speaker to American English listeners.
The initial human-scored version of the APST was developed to screen the large numbers of non-native speakers at Northeastern University in Boston, MA. The program provided a summary and sub-scores and was used with standard TOEFL scores to determine whether international teaching assistants should be allowed into the lab or classroom or first receive intelligibility training. This first version showed the need for a more objective and quickly scored version of the test. A second automated prototype was developed with funding from NIH. Further development of the APST has been under the auspices of the Speech Technology and Applied Research Corp.
Automated vs. Human Testing
It is important to test how well automated tests correspond with the judgments of human listeners. To check this, the authors first obtained intelligibility rankings of three non-native speakers and one native speaker using the APST. They then asked five native English listeners to judge the same recordings used for the APST analysis. The judges were asked to do two things: rate the speakers on a nine-point intelligibility scale and place them in top, middle, or bottom intelligibility positions. On both measures, the human evaluators all rated the speakers consistently with their APST scores. (A full version of this study is available on the Phonologics website.) This particular study, then, showed the APST to agree with human judges. These new technologies offer the prospect of accurate results that agree with the judgments of human listeners, but without the labor and time commitments, and with the promise of more objective results. They allow us to place speakers in classes or positions more quickly and accurately, and without the bias that can so often creep into the human-scored process.
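Agreement studies of this kind are commonly summarized with a rank correlation. The Python sketch below computes Spearman’s rho between human ratings and automated scores; the four speakers’ numbers are invented for illustration and are not the data from the study described above.

```python
def ranks(values):
    """Average 1-based ranks, with ties receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Made-up ratings for four speakers: mean human rating (nine-point scale)
# and an automated intelligibility score (percent).
human = [8.2, 5.1, 3.4, 6.0]
automated = [91, 60, 42, 73]
print(f"Spearman rho = {spearman(human, automated):.2f}")  # 1.00: same ordering
```

A rho near 1.0, as in this invented example, would indicate that the automated test orders speakers the same way the human judges do.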
Robert Mac Auslan, PhD, is VP of operations, Joel Mac Auslan, PhD, is chief technology officer, and Linda J. Ferrier-Reid, PhD, is chief linguistics officer at Phonologics, Inc. To find out more, visit their website at Phonologics.com.