Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics

arXiv — cs.LG · Wednesday, November 12, 2025 at 5:00:00 AM
The field of Speech Emotion Recognition (SER) has seen significant progress, particularly through the integration of deep learning techniques and textual information, but physiological signals produced during speech have been largely overlooked. To address this gap, the researchers examined phonation excitation information and articulatory kinematics and built the STEM-E2VA dataset, which pairs audio with physiological recordings such as electroglottography (EGG) and electromagnetic articulography (EMA); these signals carry cues about speaker traits and emotional state. The study also explored the feasibility of using physiological data estimated from speech itself rather than directly collected EGG and EMA. The experimental results confirmed that incorporating physiological information improves SER, highlighting its potential for practical use in real-world scenarios.
— via World Pulse Now AI Editorial System
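To make the idea concrete, the sketch below shows one generic way to fuse an audio embedding with physiological features (real or estimated from speech) for emotion classification. It is a minimal illustration, not the paper's architecture: the layer sizes, feature dimensions, and late-fusion design are assumptions.

```python
# Minimal late-fusion SER sketch: combine an audio embedding with EGG- and
# EMA-style feature vectors (directly measured or estimated from speech).
# All dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    def __init__(self, audio_dim=768, egg_dim=64, ema_dim=36, n_emotions=4):
        super().__init__()
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.egg_proj = nn.Sequential(nn.Linear(egg_dim, 64), nn.ReLU())
        self.ema_proj = nn.Sequential(nn.Linear(ema_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(256 + 64 + 64, n_emotions)

    def forward(self, audio_emb, egg_feat, ema_feat):
        # Concatenate the three utterance-level representations and classify.
        fused = torch.cat([self.audio_proj(audio_emb),
                           self.egg_proj(egg_feat),
                           self.ema_proj(ema_feat)], dim=-1)
        return self.classifier(fused)

# Example forward pass with random stand-ins for real features.
model = FusionSER()
logits = model(torch.randn(8, 768), torch.randn(8, 64), torch.randn(8, 36))
print(logits.shape)  # torch.Size([8, 4])
```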

Recommended Readings
CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition
Positive · Artificial Intelligence
CO-VADA is a new approach aimed at reducing bias in speech emotion recognition (SER) systems. Bias often arises from spurious correlations between speaker characteristics and emotion labels, resulting in unfair predictions across demographic groups. Unlike many existing methods that require changes to model architecture or demographic annotations, CO-VADA operates without such modifications. It identifies biased training samples and uses voice conversion to generate augmented samples that help the model focus on emotion-relevant features, enhancing fairness in SER systems.
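The sketch below illustrates the general idea of confidence-oriented voice-conversion augmentation described above. It is only an illustration under stated assumptions: the low-confidence selection rule is a stand-in for CO-VADA's actual criterion, and `voice_convert` is a hypothetical placeholder for any off-the-shelf voice-conversion model.

```python
# Illustrative sketch: pick training samples the model is unsure about and
# regenerate them with a different speaker identity via voice conversion,
# keeping the emotion label fixed. Not the paper's exact procedure.
import numpy as np

def select_low_confidence(probs, labels, threshold=0.6):
    """Return indices whose predicted probability on the true label is low."""
    conf = probs[np.arange(len(labels)), labels]
    return np.where(conf < threshold)[0]

def voice_convert(waveform, target_speaker):
    # Hypothetical placeholder: apply a pretrained voice-conversion model here,
    # changing speaker identity while preserving content and emotion.
    return waveform

def augment(waveforms, probs, labels, target_speakers):
    augmented = []
    for i in select_low_confidence(probs, labels):
        spk = np.random.choice(target_speakers)
        augmented.append((voice_convert(waveforms[i], spk), labels[i]))
    return augmented
```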
Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model
Positive · Artificial Intelligence
Progress in speech emotion recognition (SER) has been hindered by the lack of large, quality-labelled training data. A new framework has been proposed that uses cross-modal information transfer and mutual information regularization to enhance data augmentation. The approach was tested on benchmark datasets including IEMOCAP, MSP-IMPROV, and MSP-Podcast, yielding improved emotion-prediction performance compared to existing methods.
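As a rough illustration of mutual information regularization between paired cross-modal embeddings, the sketch below uses the common InfoNCE lower bound. The framework above may use a different estimator, so treat this purely as a generic example; the embedding names and temperature are assumptions.

```python
# Generic mutual-information regularizer via the InfoNCE lower bound between
# paired embeddings (e.g., speech and text for the same utterance).
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE estimate for paired embeddings z_a, z_b of shape (batch, dim)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # pairwise similarities
    targets = torch.arange(z_a.size(0))       # positives lie on the diagonal
    return -F.cross_entropy(logits, targets)  # higher value = more shared information

# Subtracting this bound from the training loss encourages generated or augmented
# speech embeddings to retain information shared with the paired modality.
z_speech, z_text = torch.randn(16, 128), torch.randn(16, 128)
print(infonce_mi_lower_bound(z_speech, z_text))
```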