Abstract:
Speech emotion recognition (SER) plays a crucial role in classifying the emotional information conveyed through audio signals, providing more accurate and convenient solutions for human-computer interaction, emotion analysis, and other fields. Recent research has focused on training Transformer-based models on human-annotated emotional datasets to capture long-range dependencies by modeling fixed-scale feature representations and processing time-varying spectral features as images. However, extracting efficient and robust common speech features from small-scale datasets is challenging, and dealing with scale variance is difficult due to the lack of inherent inductive bias (IB). To address these challenges, this paper proposes a novel architecture that extracts Multi-Scale features from raw signals and embeds them into Self-supervised Features, i.e., MSSF. Technically, this paper first designs a spatial pyramid reduction cell that combines rich multi-scale speech features by using multiple convolutions with different kernel sizes. Next, these features are embedded into a pre-trained self-supervised model to obtain multi-scale, discriminative, and common features for SER tasks, and the predicted labels are output through the final classification head. Additionally, this paper designs a parallel convolution block whose features are fused with the multi-scale features. Finally, MSSF is fine-tuned on the benchmark IEMOCAP corpus for four emotions. Compared with previous methods, the proposed model shows improvements on four common metrics, demonstrating its superiority. © 2023 Technical Committee on Control Theory, Chinese Association of Automation.
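The abstract gives only the high-level design, so the PyTorch sketch below is an assumption-laden illustration rather than the authors' implementation: the kernel sizes, strides, channel widths, fusion by element-wise addition, and the small Transformer encoder standing in for the unnamed pre-trained self-supervised model are all hypothetical choices.

```python
import torch
import torch.nn as nn

class PyramidReductionCell(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes applied to the
    raw waveform; branch outputs are concatenated into multi-scale features.
    Kernel sizes, stride, and channel widths are illustrative assumptions."""
    def __init__(self, out_channels=64, kernel_sizes=(3, 7, 15), stride=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, out_channels, k, stride=stride, padding=k // 2),
                nn.BatchNorm1d(out_channels),
                nn.GELU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, wav):                       # wav: (batch, 1, samples)
        return torch.cat([b(wav) for b in self.branches], dim=1)


class MSSFSketch(nn.Module):
    """Hypothetical end-to-end layout: pyramid cell plus a parallel
    convolution block, fused by element-wise addition, projected to the
    width of a self-supervised encoder, then a classification head over
    four IEMOCAP emotions."""
    def __init__(self, ssl_dim=768, num_emotions=4):
        super().__init__()
        self.pyramid = PyramidReductionCell(out_channels=64)   # -> 192 channels
        self.parallel_conv = nn.Sequential(                    # parallel conv block
            nn.Conv1d(1, 192, 7, stride=16, padding=3),
            nn.BatchNorm1d(192),
            nn.GELU(),
        )
        self.proj = nn.Conv1d(192, ssl_dim, 1)
        # Stand-in for the pre-trained self-supervised model (not named in the
        # abstract); a real implementation would load and fine-tune one instead.
        layer = nn.TransformerEncoderLayer(d_model=ssl_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(ssl_dim, num_emotions)

    def forward(self, wav):
        fused = self.pyramid(wav) + self.parallel_conv(wav)    # feature fusion
        tokens = self.proj(fused).transpose(1, 2)              # (batch, frames, dim)
        hidden = self.encoder(tokens)                          # contextual features
        return self.head(hidden.mean(dim=1))                   # utterance-level logits


# Example: one 1-second utterance at 16 kHz -> logits over four emotions.
model = MSSFSketch()
logits = model(torch.randn(1, 1, 16000))
print(logits.shape)   # torch.Size([1, 4])
```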
ISSN: 1934-1768
Year: 2023
Volume: 2023-July
Page: 8701-8706
Language: English