You Sound Sad, Human

Computers are getting pretty good at detecting our emotions.
(Photo: Roman Sotola/Shutterstock)

Imagine a future where your phone can analyze your emotional state by the sound of your voice. That's not as improbable as it sounds; computers are already pretty good at discerning your feelings, and they're getting better all the time. In fact, computers that listen for the right things can identify six basic emotions correctly more than nine times out of 10, according to a new study.

Science fiction is filled with examples of computers that can conversationally hold their own with humans: Buck Rogers' Dr. Theopolis; 2001: A Space Odyssey nemesis HAL; and the holographic and chronically irritated Arnold Rimmer from Red Dwarf, just to name a few. In reality, computers would struggle to communicate with us in such a natural way, in part because there's a lot more to speaking than transmitting words and phrases.

For one thing, there's emotion. It'd be nice if computers could recognize your mood from your tone of voice. That way, the customer service robot you're trying to talk to could figure out when you're about to boil over and hand you over to a manager (not that that would necessarily help). Emotion-recognizing computers might also aid frustrated students or help calm a stressed-out driver.

To bring that idea closer to reality, Northeastern University researchers Reza Asadi and Harriet Fell looked at three groups of features in human speech. The first, mel-frequency cepstral coefficients (MFCCs), separate out the effects of the throat, tongue, and lips, which act as filters on the underlying sound of a person's vocal cords. The second, Teager energy operators (TEOs), capture the flow of air through the vocal tract, revealing tension or stress in the throat. Finally, Asadi and Fell looked at so-called acoustic landmarks, transition spots in speech that we hear as the start of a word, say, or the end of a sentence.
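For readers curious what a Teager energy operator looks like in practice, here is a minimal sketch of the standard discrete Teager-Kaiser form, which combines each audio sample with its two neighbors. The article does not describe how Asadi and Fell computed their TEO features, so this should be read as an illustration of the general idea, not their pipeline. The helper names `teager_energy` and `sine` and the test tones are assumptions made up for the example.

```python
import math

def teager_energy(samples):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample (the two endpoints are skipped).
    """
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]

def sine(amplitude, omega, length):
    """Hypothetical test tone: amplitude * sin(omega * n) for n = 0..length-1."""
    return [amplitude * math.sin(omega * n) for n in range(length)]

# Two made-up tones with the same loudness but different frequencies.
low_tone = teager_energy(sine(1.0, 0.1, 200))
high_tone = teager_energy(sine(1.0, 0.3, 200))

print(round(low_tone[0], 6), round(high_tone[0], 6))
```

For a pure tone the operator returns a constant equal to the squared amplitude times the squared sine of the frequency, so it responds to both how loud and how fast a signal oscillates. That sensitivity is one reason it is often used as a stress indicator in speech work.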

To find out whether those features were enough to detect emotion, Asadi and Fell first extracted MFCCs, TEOs, and landmarks from the Linguistic Data Consortium's Emotional Prosody and Speech Transcripts, a set of short audio clips featuring actors speaking in a variety of emotional states. Using a portion of that data, the computer scientists trained a simple, off-the-shelf computer algorithm to tell the difference between six emotions—anger, fear, disgust, sadness, joy, and neutrality.
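The train-on-some, test-on-the-rest procedure described above can be sketched in a few lines. The article does not name the off-the-shelf algorithm the researchers used, so this example swaps in a nearest-centroid classifier purely for illustration. The feature vectors are synthetic numbers generated on the spot, not values from the Linguistic Data Consortium clips, and every function name here is hypothetical.

```python
import random

EMOTIONS = ["anger", "fear", "disgust", "sadness", "joy", "neutrality"]

def centroid(vectors):
    """Average a list of equal-length feature vectors, column by column."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(examples):
    """Build a model mapping each label to the centroid of its examples."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vectors) for label, vectors in by_label.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, features) == label for features, label in examples)
    return hits / len(examples)

# Synthetic stand-in data: 40 fake two-number "clips" per emotion,
# clustered around a different made-up center for each emotion.
rng = random.Random(0)
clips = []
for i, emotion in enumerate(EMOTIONS):
    for _ in range(40):
        features = [5 * i + rng.gauss(0, 0.5), -5 * i + rng.gauss(0, 0.5)]
        clips.append((features, emotion))

# Train on a portion of the data, then score on clips the model never saw.
rng.shuffle(clips)
split = int(0.7 * len(clips))
train_clips, test_clips = clips[:split], clips[split:]
model = train(train_clips)
test_accuracy = accuracy(model, test_clips)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because the fake clusters are deliberately far apart, this toy model scores far better than real speech data would allow. The point is only the workflow: extract features, fit on one portion, and report accuracy on the held-out remainder.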

Remarkably, their approach was able to correctly identify the emotions expressed in 91 percent of test clips, an improvement of nine percent over previous efforts. Acoustic features were best at detecting sadness and joy, Asadi and Fell found, while TEOs, the features related to airflow, were particularly useful for identifying anger and fear—something those customer service call centers might want to take into account.

Asadi will present the research today at the Acoustical Society of America's Spring meeting in Pittsburgh.

Quick Studies is an award-winning series that sheds light on new research and discoveries that change the way we look at the world.