Andrew Gardner and Irfan Essa (1997) “Prosody Analysis for Speaker Affect Determination” In Proceedings of Perceptual User Interfaces Workshop (PUI 1997), Banff, Alberta, Canada, Oct 1997
Speech is a complex waveform containing verbal (e.g., phoneme, syllable, and word) and nonverbal (e.g., speaker identity, emotional state, and tone) information. Both the verbal and nonverbal aspects of speech are extremely important in interpersonal communication and human-machine interaction. However, work in machine perception of speech has focused primarily on the verbal, or content-oriented, goals of speech recognition, speech compression, and speech labeling. Use of nonverbal information has been limited to speaker identification applications. While the success of research in these areas is well documented, this success is fundamentally limited by the effect of nonverbal information on the speech waveform. The extra-linguistic aspect of speech is treated as a source of variability that, in theory, can be minimized with an appropriate preprocessing technique; determining such robust techniques is, however, far from trivial.
It is widely believed in the speech processing community that the nonverbal component of speech contains higher-level information that provides cues for auditory scene analysis, speech understanding, and the determination of a speaker’s psychological state or conversational tone. We believe that identifying such nonverbal cues can improve the performance of classic speech processing tasks and will be necessary for the realization of natural, robust human-computer speech interfaces. In this paper, we address the problem of how to systematically analyze the nonverbal aspect of the speech waveform to determine speaker affect, focusing specifically on the pitch contour.