Speech Research
at the
I. P. Pavlov Institute in
Leningrad/St. Petersburg
L. A. Chistovich,
Early Intervention Institute, St. Petersburg, Russia
J. M. Pickett,
Professor Emeritus, Gallaudet University
Windy Hill Lab, Surry, Maine, USA
R. J. Porter, Jr.,
Professor Emeritus, University of New Orleans,
Lambda Consulting, Tampa, Florida, USA

This document contains an HTML version of an oral presentation given by R. J. Porter at the 16th International Congress on Acoustics and the 135th Meeting of the Acoustical Society of America, Seattle, Washington, 25th of June, 1998.

  



 
 Introductory Remarks

Good afternoon. I'm Bob Porter. As some of you are aware, Ludmilla Chistovich and Mac Pickett are unable to attend today's symposium due to ill health. They asked if I would prepare a representative sampling of some of the work of the St. Petersburg group. It is a very great honor and pleasure to do so, although I am sure you, as I, would have preferred to have Professors Chistovich and Pickett here with us in person.

In my presentation, I have attempted to incorporate the suggestions my co-authors have made. Interpretations and, especially, misinterpretations, I claim solely for myself.


Oral Presentation
The work of Ludmilla Chistovich's group needs little introduction. It has served as impetus and inspiration to many of us, and has provided a challenging and contrasting perspective on speech perception and production that has energized discussion and research world-wide. 

The work of the Pavlov Institute researchers is known to most of us primarily through the publication of two monographs and one book. The covers of these are shown in the first overhead; two are English translations, supported by the United States Defense Department, of the 1962 and 1965 monographs. The 1976 book, to my knowledge, has not been translated. I obtained a copy in the early 1980s and was fortunate to have both a Russian-speaking student, David Evans-St. Romaine, and a neighbor, Irene Gulowska, who helped me translate some of it.

The research of the Pavlov group covered a vast range of topics, from the organization of speech muscle activity to the processing of signals at the auditory nerve. Today, however, I would like to point to two areas that are nearly unique to Chistovich's laboratory and which, I think, continue to have special meaning for contemporary speech researchers.

The first, and earliest, of these is the shadowing and mimicry work done by Ludmilla Chistovich and her husband Valerji Kozhevnikov. It is sometimes forgotten that the earliest work of this type was published by Chistovich in the 1950s. It was elaborated somewhat in the 1965 paper by Kozhevnikov and Chistovich (shown at the top of the overhead along with its Defense Department cover) and then only summarized in the two later publications shown at the bottom of the overhead.

The earliest shadowing work used real speakers to provide stimuli. Both the stimulus speakers and the experimental subjects, who were often the authors themselves, were fitted with a variety of devices designed to capture features of articulation. One such device, an artificial palate (one of the first of its type), is shown in the next overhead, along with a record of two utterances in which the artificial palate and other measures were taken. These other measures include a laryngeal microphone, a thermistor sensor of nasal air-flow, lip-contact switches, a microphone, and so forth.

In a typical shadowing experiment, the speaker would read from a random list of syllables or syllable clusters. The listener's task was to begin to repeat the speaker's utterance as soon as it began, repeating everything as rapidly and accurately as possible. The principal dependent measure was the delay between distinctive gesture onsets in the speaker and those in the listener. A typical finding is shown in the next overhead.

On the top is a figure like the one we have just seen, except that only three traces for the speaker and the listener are shown. The bottom three traces are from the stimulus-speaker, the critical marker being the speaker's closing and then opening of the lips to produce an inter-vocalic /p/. These gestures, recorded using lip-contact switches, are marked by the vertical lines. The corresponding gesture of the shadowing listener is shown in the top-most trace. Obviously, the listener, or examinee, generates lip movement after the speaker's has begun; the delay between the onset of the speaker's gesture and that of the listener can be recorded as a shadowing choice reaction time.
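The delay measure just described can be sketched in a few lines. This is only an illustration under stated assumptions: the lip-contact switch records are treated as sampled 0/1 sequences, the 1000 Hz sampling rate and the two example traces are invented for the sketch, and the function names are hypothetical. Nothing here is taken from the original recordings or apparatus.

```python
def onset_time_ms(contact_trace, sample_rate_hz):
    """Time (ms) of the first sample at which the lip-contact switch is closed.

    contact_trace is an assumed representation: a sequence of 0 (lips open)
    and 1 (lips in contact). Returns None if the lips never close.
    """
    for index, closed in enumerate(contact_trace):
        if closed:
            return 1000.0 * index / sample_rate_hz
    return None


def shadowing_delay_ms(speaker_trace, listener_trace, sample_rate_hz):
    """Listener's gesture onset minus speaker's gesture onset, in ms."""
    speaker_onset = onset_time_ms(speaker_trace, sample_rate_hz)
    listener_onset = onset_time_ms(listener_trace, sample_rate_hz)
    if speaker_onset is None or listener_onset is None:
        return None
    return listener_onset - speaker_onset


# Invented example traces sampled at an assumed 1000 Hz:
# the speaker closes the lips at 200 ms, the listener at 330 ms.
speaker = [0] * 200 + [1] * 80 + [0] * 220
listener = [0] * 330 + [1] * 80 + [0] * 90

print(shadowing_delay_ms(speaker, listener, 1000))  # 130.0
```

In this sketch the reaction time is simply the difference between two onset times, so any articulatory marker with a clear onset (lip closure, palate contact, nasal air-flow) could be substituted for the lip-contact trace.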

A summary of the reaction times for inter-vocalic consonants of different types is shown at the bottom of the overhead, for two different examinees. Note that the average choice reaction times are very similar across consonant type, for both examinees. Note also that the reaction times have a range of somewhere between 100 and 150 milliseconds.

These reaction times are extraordinarily fast. I can remember seeing these results when I was a know-it-all graduate student, and thinking that there must have been some error made in measurement, or some important experimental control left out. I had learned in my perception classes that even practiced subjects cannot produce choice reaction times faster than 300 msec or so. It was a particular thrill for me, therefore, to actually replicate and extend these data, with my graduate student Xavier Castellanos, in 1980, as well as in subsequent papers with Jim Lubker, Dan Harbison, and Emily Tobey.

To my knowledge, shadowing reaction times are the fastest known for this type of perception-reaction task. In fact, these data provide, in my opinion, the strongest evidence of all in support of a hypothesis that perception and production of speech are linked in a direct and special way.

An interesting contrast in reaction-time results is obtained if listeners' shadowing responses are compared to responses obtained when listeners are asked to either (1) wait until the utterance is complete and then mimic it as quickly as possible, or (2) listen to and write down the identity of the utterance as quickly as possible. In one 1966 study (published in the KTH Proceedings) of this type, Chistovich used a continuum of synthetic vowels produced at Fant's laboratory. The results are shown on the next overhead.

The ordinate displays reaction times in fractions of seconds; the abscissa represents the continuum of stimulus vowels, ranging from [i] to [e] to [a]. Results from one subject are shown. Note that the shadowing reaction times in the bottom-most line are the fastest, averaging somewhat below 200 msec. Mimicking, on the other hand, is delayed by nearly a second. Both of these results, however, are different from the written responses. The reaction times for writing were measured from the beginning of the vowel to the touching of a pen to a writing tablet.

Two features are of interest. First, the written responses occur somewhat more slowly than the shadowing, but not nearly as slowly as mimicking. Second, we observe the variation in reaction time across the continuum of stimuli. As you might suspect, the longest reaction times occur for those stimuli which are at the boundaries of phonetic categories and which, we may presume, require more processing before the listener can decide on the proper orthographic response. Most interesting, however, is the contrast between the written and shadowed reaction times. That is, the shadowing reactions show no phonetic category sensitivity at all, suggesting the very important conclusion that the auditory system provides for a direct and rapid conversion of acoustic input into gesture-control without intermediating reference to phonemic distinctions.

Another distinguishing research approach of the Pavlov Institute group, in addition to shadowing, was the frequent use of subjective ratings of stimulus qualities to compare stimuli. The researchers viewed these more qualitative measures as valuable complements to the prothetic difference-limens typically employed in auditory psychophysics.

This method was most often used to examine the Pavlov researchers' proposal that special filtering properties of the auditory system help listeners hear consonants and other dynamic features of speech. This model has much in common with more general models of audition developed by Eberhard Zwicker's group in Munich, notably those of Terhardt and Fastl, but the model is adapted particularly to speech in the Pavlov lab. A schematic of the Russian model is illustrated in the next overhead. This schematic is from a 1977 article by Chistovich entitled "The Possible Nature of Acoustic Events."

If we follow the structure of the model left to right, we see that the input signal is first presumed to be analyzed by a bank of critical-band filters. The output of each filter is half-wave rectified and then logarithmically compressed. The resulting outputs are passed through modulation-band filters, Unit #3 in the figure, and the outputs of the #3 filters are summed and passed through a detector. This Unit #3 modulation-band filter is the crux of the Pavlov model. It is important to keep in mind that the modulation-band filter shown in the figure represents only one of a large set of such modulation filters, all of which receive the same rectified and compressed output from each of the critical bands. Each of these unseen filters is assumed to be sensitive, however, to a different modulation frequency in the 2 to 50 Hz region. In other words, the filters are sensitive to modulations, or energy variations, which occur over periods of 20 msec to 500 msec. Although it is not directly apparent from the model as shown, this proposed mechanism has the very interesting property of converting formant transitions with different frequency extents into energy envelopes which have magnitudes proportional to the degree of frequency change, and modulation rates proportional to the rate of spectral change. In terms of speech, this mechanism has the potential to change formant dynamics into modulation sensations with rate, magnitude, and phase characteristics that track articulator movements. The full-blown model is discussed in greater detail in the 1976 book. The important aspect of this model, however, is that the perceptual system is presumed to use modulation sensations, and not just spectral slope, in speech perception.
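For readers who think more easily in code, the signal path just described can be sketched roughly as follows. This is a minimal illustration and not the Pavlov group's implementation: the second-order band-pass sections, the three critical-band center frequencies and bandwidths, the `log1p` form of the compression, the 10 Hz modulation-filter bandwidth, and the RMS measure standing in for the detector are all assumptions chosen for the sketch, and every function name is hypothetical. The first 250 ms of output are skipped so that the filters' onset transients do not enter the measure.

```python
import math


def modulation_period_ms(rate_hz):
    """Period of one modulation cycle: 50 Hz gives 20 ms, 2 Hz gives 500 ms."""
    return 1000.0 / rate_hz


def bandpass(signal, center_hz, bandwidth_hz, sample_rate_hz):
    """Generic second-order band-pass section with unity gain at its center."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) * bandwidth_hz / (2.0 * center_hz)
    b0, b2 = alpha, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def half_wave_rectify(signal):
    return [max(x, 0.0) for x in signal]


def log_compress(signal, gain=10.0):
    # Assumed form of the logarithmic compression; zero input stays at zero.
    return [math.log1p(gain * x) for x in signal]


def modulation_response(signal, modulation_hz, sample_rate_hz,
                        critical_bands=((500.0, 110.0), (1000.0, 160.0), (2000.0, 300.0)),
                        modulation_bandwidth_hz=10.0, skip_s=0.25):
    """Output of one modulation channel, summed over the critical bands."""
    summed = [0.0] * len(signal)
    for center_hz, bandwidth_hz in critical_bands:
        band = bandpass(signal, center_hz, bandwidth_hz, sample_rate_hz)
        compressed = log_compress(half_wave_rectify(band))
        # Modulation-band filter (Unit #3 in the talk's figure).
        modulated = bandpass(compressed, modulation_hz,
                             modulation_bandwidth_hz, sample_rate_hz)
        summed = [s + m for s, m in zip(summed, modulated)]
    steady = summed[int(skip_s * sample_rate_hz):]
    # RMS stands in for the detector, which the talk does not specify.
    return math.sqrt(sum(x * x for x in steady) / len(steady))


def am_tone(carrier_hz, modulation_hz, depth, duration_s, sample_rate_hz):
    """Sinusoidally amplitude-modulated tone used as a test input."""
    n = int(duration_s * sample_rate_hz)
    return [(1.0 + depth * math.sin(2.0 * math.pi * modulation_hz * i / sample_rate_hz))
            * math.sin(2.0 * math.pi * carrier_hz * i / sample_rate_hz)
            for i in range(n)]


FS = 8000  # assumed sampling rate, Hz
matched = modulation_response(am_tone(1000.0, 25.0, 0.8, 1.0, FS), 25.0, FS)
mismatched = modulation_response(am_tone(1000.0, 5.0, 0.8, 1.0, FS), 25.0, FS)
print(matched > mismatched)
```

The comparison at the end feeds a 25 Hz modulation channel with a tone modulated at 25 Hz and with one modulated at 5 Hz; in this sketch the channel responds more strongly to the matching rate. A fuller version would run a whole set of such channels, one per modulation frequency in the 2 to 50 Hz region, over the same rectified and compressed band outputs.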

The next overhead provides results from one of a number of studies examining the characteristics of the hypothesized modulation filters. In the study, listeners were asked to judge the similarity of members of pairs of one-second stimuli that differed in terms of a 300 msec, or shorter, amplitude modulation occurring for one formant of a three-formant signal. Modulated pink noise signals were also used. The modulation rate, number of modulation cycles, and degree of modulation were all varied systematically. The results in the upper right panel reveal the band-pass characteristic of one listener's modulation filter centered at about 25 Hz. (This subject reported that the modulated signal sounded something like a flapped [r].) What the figure shows is that signals with modulation rates progressively different from 25 Hz were judged as having progressively smaller modulation sensation magnitude compared to the 25 Hz signal. The left-hand panel shows the modulation-sensation threshold for the 25 Hz signal, and the panel below shows results for different numbers of cycles of modulation of noise versus formants. Note, at the bottom, that listeners are as sensitive to one cycle of formant modulation as they are to several, a result that does not hold for the pink noise. This is important because it shows both that modulation sensations can be extracted from single cycles of modulation of formants and that these sensations are available to serve perceptual functions in speech.

A 1976 study by Rodionov, Kozhevnikov, and René Carré, who was visiting the lab from France, further investigated the modulation sensations. They used 31 Hz haversine amplitude modulations occurring in one, two, three or five harmonic components of a five-component signal. They found that listeners can combine the sensations of the dynamics of formants, across different frequency regions, in ways that could signal articulator movements. Unfortunately, time does not permit further detailed discussion of this interesting result. It is supported, however, by additional Institute research in the late 1970's and early 1980's. A more extended discussion of the hypothesis, and of the evidence in support of it, can be found in my 1985 and 1987 papers in Speech Communication and in the Journal of Phonetics.

One might summarize the Russian work and its theoretical basis this way: Speech perception and production are inextricably related by movement. Production involves dynamic movement, spectral dynamics reflect those movements, and modulation-processing mechanisms directly recover and recognize those movements.

Regardless of the theory or interpretation, however, the St. Petersburg data continue to intrigue and inspire new researchers. That inspiration is the true measure of the value of the research conducted by the resourceful and inventive scientists at the Pavlov Institute.


I would like to conclude by reading from letters sent to me by Mac Pickett, and by Ludmilla Chistovich, in response to my request for some personal commentary I might include in my presentation.


Finally, Ludmilla sent me some photos to share with you which were taken during the 1966 and 1973 conferences referred to by Mac Pickett in his letter. The first is a picture from the 1966 conference. As near as I can tell, this is a picture of an early speech synthesizer. (It is Ken Stevens, of course.)

Other photos include a picture of Lenin overlooking presentations at the 1966 Leningrad Conference, and one of Gunnar Fant and Ludmilla Chistovich, also at the 1966 conference. Pictures from the 1973 conference include those of Al Liberman, Ken Stevens and Gunnar Fant, and Hiroya Fujisaki.

And, finally, here is a 1973 picture of Mac Pickett in St. Petersburg and a color photo of Valerji Kozhevnikov pointing out features of the modulation filter model in 1973.
 


If you would like a photocopy of this paper, please email your name and address to R. J. Porter.


Attachments/Addenda

Prof. Chistovich's letter provides some references to the Russian work.  Additional references can be found in the R. J. Porter review paper noted below.

Speech perception/production issues are further discussed in the revision of Prof. Pickett's book, Sounds of Speech Communication (Allyn and Bacon), which should be available at the end of 1998 or the beginning of 1999.
 

 Selected R. J. Porter references related to Pavlov group work.
Porter, R.J., Jr. and Castellanos, F.X. Speech-production measures of speech perception: Rapid shadowing of VCV syllables. J. Acoust. Soc. Am. 67(4), 1349-1356, 1980.   This paper replicates and extends the original shadowing work.

Porter, R.J., Jr. and Lubker, J. Rapid reproduction of vowel-vowel sequences: Evidence for a fast and direct acoustic-motoric linkage in speech. J. Sp. Hear. Res. 23, 593-603, 1980.   EMG signals clarify the time-course of shadowing. Very fast muscle reactions are observed.

Porter, R.J., Jr. Pavlov Institute Research in Speech Perception: Finding phonetic messages in modulations. Speech Comm. 4, 31-39, 1985.  Review of the classic Russian work as well as of a collection of additional articles which are not widely known.

Porter, R.J. Speech messages, modulations and motions. J. Phonet. 13, 193-197, 1985. A brief account of the possible relation of modulation sensations and the control of articulator movements.

Porter, R.J. What is the relation between speech production and speech perception? In: A. Allport, D. Mackay, W. Prinz and E. Scheerer (Eds.) Language, Perception, and Production. Academic Press, London, pp. 85-106, 1987. The extension of the basic modulation idea with particular reference to the relation of modulation and articulator movements.

Harbison, D., R. J. Porter, and E. Tobey. Shadowing and simple reaction-time in stutterers and non-stutterers. J. Acoust. Soc. Am. 86(4), 1277-1284, 1989.   Stutterers shadow a bit differently than non-stutterers, apparently because of implementation difficulties.

Cespedes-Mohrke, D. A., Porter, R. J., Reimer, T. E., and Trahan, R. E., Jr. Exponential transient component analysis for speech modulation decomposition. Proc. IEEE Southeastcon '92, vol. 1, 299-302, 1992.   A short report of an experimental application of a modulation sensation model to the analysis of speech signals.
