

Biology 202
2001 Second Web Report
On Serendip

Mystifying the Senses: Bimodal Speech Perception

Alice Goff

My grandmother, like many elderly people, suffers from hearing loss. Recently, however, she has begun to lose her sight as well. Curiously enough, although her measured level of auditory impairment has remained the same, her hearing seems to have deteriorated further since macular degeneration claimed her ability to see. Could this simply be the result of the alienation that comes with losing another sense? Her situation led me to wonder about my own hearing ability. I have often experienced difficulty hearing in settings where I cannot see the person who is talking to me: in a movie theater, or over the telephone. Such experiences call into question the conventional notion of sensory processing, in which each distinct input is received by its respective sense organ and the processed result is relayed to the brain. How, then, can we explain an apparent reliance of two different sensory percepts on each other? Is there more to hearing than our ears?

Scientific evidence for the existence of sensory integration has long existed, but the first formal theory to this effect was stumbled upon by Harry McGurk and John MacDonald of the University of Surrey (1). The scientists were studying how infants perceive speech by playing a video of a mother talking in one place while playing the sound of her voice in another. Almost by accident, they began to play with the consequences of dubbing a particular audio sound onto video of the mother saying a different sound (2). They found that when the auditory syllable "ba-ba" was imposed on the visual syllable "ga-ga", "da-da" was heard. The same occurred when the audio and visual syllables were reversed. Likewise, "pa-pa" dubbed onto "ka-ka" was heard as "ta-ta". When one of the sensory inputs was eliminated by closing the eyes or plugging the ears, the correct syllable was identified (2). McGurk and MacDonald found "Contemporary, auditory-based theories of speech perception...inadequate to accommodate these new observations" and concluded that some allowance must be made for the influence of vision on hearing (2). The conventional theory of the senses was thus challenged.
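The pattern of these fusions can be summarized compactly. Below is a minimal sketch, in Python, that simply records the dubbing results as this paper reports them; it is a lookup table of observations, not a model of the underlying mechanism, and the function name and structure are mine, for illustration only.

```python
# (auditory syllable, visual syllable) -> syllable actually heard,
# as reported by McGurk and MacDonald (2) and summarized above.
mcgurk_fusions = {
    ("ba-ba", "ga-ga"): "da-da",
    ("ga-ga", "ba-ba"): "da-da",  # the reversed pairing fused the same way
    ("pa-pa", "ka-ka"): "ta-ta",
}

def perceived(audio, visual):
    """Return the reported percept: with both inputs present, conflicting
    syllables fuse; with one input eliminated, the other is identified
    correctly."""
    if visual is None:   # eyes closed: the audio alone is heard correctly
        return audio
    if audio is None:    # ears plugged: the visible syllable is identified
        return visual
    return mcgurk_fusions.get((audio, visual), audio)

print(perceived("ba-ba", "ga-ga"))  # -> da-da
print(perceived("ba-ba", None))     # -> ba-ba
```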

So, speech perception is bimodal. Of course, as science repeatedly shows, nothing is as simple as that. The questions remain: how does this integration occur? When does it occur? What neurological systems are involved? It has become generally accepted that audio and visual inputs are received by independent organs (the ears and eyes) and that integration occurs sometime after these two systems have "processed" the input. In fact, a model for speech perception is often drawn from the work of the linguist Noam Chomsky. Chomsky asserts the existence of a mental module which is independently responsible for speech creation (5). This module is somewhat akin to the modular systems of vision or hearing (5). Such an idea of speech perception is generally accepted by current researchers: there is some system within our mental capabilities which receives input from the auditory and visual systems and "decides" on what was heard (1). It is at the integration stage that opinions and hypotheses seem to conflict. One theory of bimodal speech integration is that of categorical perception. This extremely complicated model has for its basis the idea that no distinction is made between inputs within a certain category: all inputs in a category are treated without regard to their individual characteristics (3), as the sketch below illustrates. However, a newer theory of integration has become more prominent, and it refutes the pillars of the categorical perception model.
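To make the categorical idea concrete, here is a minimal illustration; the cue name, the boundary value, and the two categories are invented for illustration only, not taken from the model's literature.

```python
# Categorical perception collapses a continuous input into a discrete
# category; every within-category difference is then discarded.

def categorize(voicing_cue):
    """Map a continuous acoustic cue (scaled 0 to 1, boundary assumed
    at 0.5) onto a discrete syllable category."""
    return "ba" if voicing_cue < 0.5 else "pa"

# 0.10 and 0.45 are acoustically different stimuli, but once classified
# the categorical model treats them as identical instances of "ba".
print(categorize(0.10), categorize(0.45))  # -> ba ba
```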

The fuzzy logical model of perception was developed by the scientists Dominic Massaro and Ron Cole. Their research rests on a sort of "democratic" theory of bimodal speech processing. They hold that audio input is continuous, that a decision is made on the basis of many sensory qualities of a given input, and that "experience is critical" (5). They have broken the process into three stages: evaluation, integration, and decision (5), each of which involves recognition against prototypes stored in long-term memory (1). In the evaluation stage (carried out independently for each type of input, audio and visual), the stimulus is compared against all of the stored alternatives and a consensus is reached, producing an output from that system (5). This is called the "relative-goodness-of-match rule" (1). The integration stage combines these output values and generates "an overall degree of support for each speech alternative" (5). The decision stage takes the integration results and produces a definitive response (5). These stages occur successively but also overlap (1). Significant statistical theory accompanies these assertions, but actual anatomical and structural knowledge of the integration process remains something of a mystery (4).
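The three stages can be sketched in a few lines of Python. The support values below are invented for illustration; in the actual model they are estimated from experimental data. Integration is multiplicative in Massaro's formulation, and the decision normalizes each alternative's support against the total.

```python
def flmp_decision(audio_support, visual_support):
    """Integrate per-alternative audio and visual support values, then
    decide by relative goodness of match (normalizing over the total)."""
    # Integration: combine the independently evaluated degrees of match.
    combined = {alt: audio_support[alt] * visual_support[alt]
                for alt in audio_support}
    # Decision: each alternative's support relative to all alternatives.
    total = sum(combined.values())
    return {alt: round(support / total, 2)
            for alt, support in combined.items()}

# Evaluation stage output (assumed values, 0 = no match, 1 = perfect match)
# for an auditory "ba" dubbed onto a visual "ga", compared against three
# stored prototypes.
audio_support = {"ba": 0.9, "da": 0.5, "ga": 0.1}
visual_support = {"ba": 0.1, "da": 0.5, "ga": 0.9}

print(flmp_decision(audio_support, visual_support))
# -> {'ba': 0.21, 'da': 0.58, 'ga': 0.21}
```

Under these assumed values the fused alternative "da" receives the greatest relative support, which is exactly the McGurk result described earlier.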

Perhaps the most fascinating ramification of this line of research is its implication for educational advancement. As with many scientific discoveries, the program resulting from the fuzzy logical model of perception was reached in a rather upside-down manner. If speech perception is a combination of two sensory inputs, then for those who cannot speak because of hearing loss, could the visual component be used as a sort of speech therapy? Indeed, that is the project formulated by Massaro, Cole, and Mike Macon, which utilizes a computer-generated human head designed to speak using extremely precise mouth movements (4). This hairless head, aptly named Baldi (with an 'i' rather than a 'y' since he is from California (3)), is part of a computer program implemented at the Tucker-Maxon Oral School in Portland, Oregon, which is dedicated to teaching its deaf students how to speak. By seeing and mimicking Baldi's oral gestures the students are able to form words more accurately (6). Baldi also has the superhuman quality of being able to appear as only a jaw and tongue, allowing the children to gain a more exact understanding of the physicality of speech (7).

The development of bimodal speech therapy has revolutionary implications for neuroscience and for thinking about all types of perception. How can we persist in thinking of our five distinct senses when these perceptual systems are clearly interlinked and interdependent? Bimodal speech perception also introduces the idea of an independent organ within the brain which serves as 'integrator' and 'decider' for the various inputs that enter our nervous system. The notion that our brain takes inputs from the outside world and converts them into something potentially quite different also destabilizes our confidence that what we perceive is what exists in reality. Now, however, it is not our ears or eyes that deceive us, but rather the conflict that stems from the inputs of these two systems. Although this is unsettling, clear optimism can be drawn from the educational ramifications of this scientific knowledge. The question remains which exact structures are responsible for the bimodal phenomenon. Perhaps the larger question is: what other systems utilize this integrative method of perception? Is it even useful to think in terms of separate senses? Thus conventions continue to be broken.

WWW Sources

1)"Speech recognition and Sensory Integration" by Dominic Massaro and David Stork , An outline of the theories involved in speechreading and some statistical background.

2) Harry McGurk and John MacDonald, "Hearing Lips and Seeing Voices," Nature 264 (1976): 746-748.

3) Massaro, Dominic. Perceiving Talking Faces. Cambridge, Massachusetts: The MIT Press, 1998.

4) "Speech: A Sight to Behold" by Barbara Rodriguez., Science notes from UCSC Summer 1996

5) "From Speech is Special to Computer Aided Language Learning" by Dominic Massaro and Ron Cole, See article title under heading "Perption of Visible and Bimodal Speech"

6) Summary of Baldi as a Language Tutor.

7) The Animated Face, BALDI. A description of the uses of Baldi at the Tucker-Maxon Oral School.



