
A Lightweight Tri-Stream Feature Fusion Network for Speech Emotion Recognition
Author(s) - Ronghe Cao, Yunxing Wang, Xiaolong Wu, Shuang Jin, Huiling Niu
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3587607
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Understanding and modeling emotions from speech is a fundamental challenge in speech processing and a key enabler of emotionally intelligent human-computer interaction. However, defining and extracting robust emotional features remains difficult because human affect is nuanced and context-dependent. Existing approaches, whether built on prosodic features or on deep representations from pre-trained models, often struggle to capture the full spectrum of emotional cues present in real-world speech. To address these limitations, we introduce Tri-Stream, a novel speech emotion recognition (SER) framework that concurrently leverages spectrogram and waveform modalities. Tri-Stream integrates three complementary feature streams: spectral patterns extracted by a Swin Transformer, deep acoustic representations from HuBERT, and engineered prosodic features capturing rhythmic information. These streams are fused and processed by a GRU-based classifier for the final emotion prediction. Extensive evaluations on four benchmark datasets (IEMOCAP, SAVEE, RAVDESS, EMO-DB) show that Tri-Stream consistently outperforms state-of-the-art baselines, achieving 79.86% unweighted accuracy on IEMOCAP and the best performance on the remaining datasets, underscoring its effectiveness and robustness across diverse emotional speech corpora.
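The fusion-then-classify pipeline described in the abstract can be sketched minimally: three per-frame feature streams are concatenated frame-wise and the fused sequence is summarized by a GRU whose final hidden state feeds an emotion classifier. This is a hedged illustration only — the feature dimensions, hidden size, number of classes, and random placeholder vectors standing in for the Swin Transformer, HuBERT, and prosodic extractors are all assumptions, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_final_state(x, W, U, b, hidden):
    """Run a single-layer GRU over a (T, D) sequence; return the last hidden state.
    W: (D, 3H), U: (H, 3H), b: (3H,) pack the update/reset/candidate weights."""
    H = hidden
    h = np.zeros(H)
    for x_t in x:
        g_x = x_t @ W + b
        g_h = h @ U
        z = sigmoid(g_x[:H] + g_h[:H])                 # update gate
        r = sigmoid(g_x[H:2 * H] + g_h[H:2 * H])       # reset gate
        h_tilde = np.tanh(g_x[2 * H:] + r * g_h[2 * H:])  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
T = 50  # frames in one utterance (illustrative)

# Placeholder per-frame embeddings for the three streams (dims are assumptions):
swin = rng.normal(size=(T, 96))      # spectral patterns (Swin Transformer stand-in)
hubert = rng.normal(size=(T, 768))   # deep acoustic representations (HuBERT stand-in)
prosody = rng.normal(size=(T, 32))   # engineered prosodic / rhythmic features

# Frame-wise concatenation fuses the streams before the GRU-based classifier
fused = np.concatenate([swin, hubert, prosody], axis=1)   # (T, 896)

D, H, n_classes = fused.shape[1], 128, 4  # e.g. 4 emotion classes as on IEMOCAP
W = rng.normal(scale=0.01, size=(D, 3 * H))
U = rng.normal(scale=0.01, size=(H, 3 * H))
b = np.zeros(3 * H)

h_last = gru_final_state(fused, W, U, b, H)

# Linear head + softmax over emotion classes
logits = h_last @ rng.normal(scale=0.01, size=(H, n_classes))
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(fused.shape, probs.shape)
```

In a trained system, `W`, `U`, `b`, and the classification head would be learned jointly, and the upstream backbones would produce the per-frame embeddings that the random arrays merely stand in for here.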