Researchers Show Emotion AI Can Match Human Emotional Perception

Emotions are difficult to understand, even for humans. Because of this, building AI systems that accurately identify emotions is also a hard problem. Many things factor into how someone reads another person’s emotions: tone of voice, facial expression, context, word choice, nonverbal cues, and more. Vocal tone is known to be a strong indicator of emotion and is accessible in a wide range of digital environments. Recently, researchers decoupled the semantic meaning of words from vocal tone to determine how well an AI can classify emotions from tone alone, compared to a person.

In this paper, a group of German computer science and psychology researchers used Canadian and German databases of nonsensical emotional audio clips: the sentences carried no real meaning but were spoken with a specific emotion, so the words themselves gave nothing away. These clips were used to train machine learning (ML) models to classify the vocal emotion as fear, anger, joy, sadness, disgust, or neutral. The models produced fairly accurate emotion classifications from audio clips as short as 1.5 seconds.
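Setting aside the paper’s exact features and model architecture, the general recipe is straightforward: turn each short clip into a fixed-length vector of acoustic features, then train a supervised classifier on the labelled emotions. Below is a minimal sketch of that recipe; the file paths, the choice of MFCC features, and the support-vector classifier are illustrative assumptions, not the authors’ actual pipeline.

```python
# Minimal sketch: classify the emotion of short (~1.5 s) audio clips from
# vocal acoustics alone. The MFCC features and SVM classifier are illustrative
# assumptions, not the specific setup used in the paper.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(path, duration=1.5, sr=16000):
    """Load up to `duration` seconds of audio and summarize it as mean MFCCs."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)
    return mfcc.mean(axis=1)                            # one vector per clip

# Hypothetical (path, label) pairs; a real corpus would have hundreds of clips.
clips = [
    ("clips/0001.wav", "joy"),
    ("clips/0002.wav", "anger"),
    ("clips/0003.wav", "sadness"),
    ("clips/0004.wav", "fear"),
    ("clips/0005.wav", "disgust"),
    ("clips/0006.wav", "neutral"),
]

X = np.stack([clip_features(path) for path, _ in clips])
y = np.array([label for _, label in clips])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```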

The same clips were also played for human participants, who listened and classified the emotions they heard. The results showed that the AI’s emotion classifications were comparable in accuracy to the human classifications. The researchers collected demographic information about the human participants, including sex, age, first language, residence, and prior English experience; however, no information is given about the demographic distribution of these participants. This is notable in light of Milton’s double empathy problem, which describes how a person’s accent, age, race, gender, and neurotype can significantly affect how well they perceive someone else’s emotions, depending on the demographics of the person they are perceiving.

It’s important that these researchers acknowledged their data limitations. While their models achieved high accuracy, the datasets used were small and created by voice actors. With only ~1,500 files recorded by 34 people available for this study, the sample is too small to capture much vocal diversity. Using voice actors further complicates the applicability of models built from this data, since voice actors tend to exaggerate emotions. Beyond the content of the audio files, the study was limited to 1–5 second clips because of the data available, which raises questions about how well the results transfer to longer audio. Still, studies like this are a very useful starting point for comparing AI and human emotion recognition abilities.


Here at Valence, we strive to create emotion classification systems built from representative, real-world data rather than from a small sample of voice actors. By training our ML models on more realistic audio samples, we create highly accurate systems that work in real-life situations for all people. Our North American English models are trained on a sample representative of the US and Canada with respect to age, race, gender, geography, and neurotype, ensuring they work for a diverse set of voices. We also crowdsourced this data from real people, rather than voice actors, so that it stays as close as possible to real-life scenarios.

We also aim to create systems that take broader emotional context into account. Our systems use longer, sentence-level audio files to track shifts in tone and emotion over time. Longer recordings provide more context, helping the AI follow the emotional trajectory of a sentence or conversation as emotions change, sometimes rapidly. While it’s impressive that these researchers could glean emotions from very short audio files, such short clips are useful in only a few contexts. At Valence, it is important to us that our models can track changes in emotion over longer periods of time, giving them a wider range of applicability across situations.
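As an illustration of the sliding-window idea behind sentence-level tracking, the sketch below scores overlapping windows of a longer recording with a short-clip classifier like the one sketched earlier. The window and hop lengths, the featurizer, and the file path are all hypothetical assumptions for illustration, not Valence’s production pipeline.

```python
# Minimal sketch: follow how the predicted emotion shifts across a longer
# recording by scoring overlapping windows with a short-clip classifier
# (e.g. the hypothetical `model` from the sketch above). Window and hop
# lengths are illustrative choices.
import librosa

def emotion_trajectory(path, model, window_s=3.0, hop_s=1.0, sr=16000):
    """Return (start_time_seconds, predicted_emotion) for each window."""
    y, sr = librosa.load(path, sr=sr)
    win, hop = int(window_s * sr), int(hop_s * sr)
    trajectory = []
    for start in range(0, max(1, len(y) - win + 1), hop):
        segment = y[start:start + win]
        features = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20).mean(axis=1)
        label = model.predict(features.reshape(1, -1))[0]
        trajectory.append((start / sr, label))
    return trajectory

# Hypothetical usage: print the emotion label predicted for each window.
# for t, emotion in emotion_trajectory("calls/example_call.wav", model):
#     print(f"{t:6.1f}s  {emotion}")
```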


Specifics about our model offerings can be found here: Pulse API. You can read more about our dedication to creating representative datasets and AI models from diverse data here: Empowering Diversity in Data-Centric AI.
