A Bimodal Approach for Speech Emotion Recognition using Audio and Text

Oxana Verkholyak1,2,3+, Anastasia Dvoynikova1,2, and Alexey Karpov1,2
 

1St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
overkholyak@gmail.com, {karpov, dvoynikova.a}@iias.spb.su

2ITMO University, Kronverkskiy Prospekt, 49, St. Petersburg, Russia

3Ulm University, Helmholtzstraße 16, 89081 Ulm, Germany

 

Abstract

This paper presents a novel bimodal speech emotion recognition system based on the analysis of acoustic and linguistic information. We propose a decision-level fusion strategy that leverages both emotions and sentiments extracted from audio and text transcriptions of extemporaneous speech utterances. We perform an experimental study to demonstrate the effectiveness of the proposed methods on the emotional speech database RAMAS, reporting classification results for 7 emotional states (happy, surprised, angry, sad, scared, disgusted, neutral) and 3 sentiment categories (positive, negative, neutral). We compare the relative performance of unimodal vs. bimodal systems, analyze their effectiveness at different levels of annotation agreement, and discuss the effect of reducing the training data size on the overall performance of the systems. We also provide important insights into the contribution of each modality to the best overall performance for emotion classification, which reaches UAR = 72.01% at the highest (5th) level of annotation agreement.

 

Keywords: Computational paralinguistics, Speech emotion recognition, Sentiment analysis, Bimodal fusion, Annotation agreement

 

+: Corresponding author: St. Petersburg Federal Research Center of the Russian Academy of Sciences
Tel: +7-(812)-328-70-81

 

Journal of Internet Services and Information Security (JISIS), 11(1): 80-96, February 2021

Received: November 19, 2020; Accepted: February 9, 2021; Published: February 28, 2021

DOI: 10.22667/JISIS.2021.02.28.080