Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure

Bozzali M;
2014-01-01

Abstract

Speech emotion recognition (SER) is a challenging task in demanding human-machine interaction systems. Standard approaches based on the categorical model of emotions achieve low performance, probably because they model emotions as distinct and independent affective states. Starting from the recently investigated dimensional circumplex model of emotions, SER systems are instead structured as the prediction of valence and arousal on a continuous scale in a two-dimensional domain. In this study, we propose the use of a partial least squares (PLS) regression model, optimized according to specific feature selection procedures and trained on the Italian speech corpus EMOVO, suggesting a way to automatically label the corpus in terms of arousal and valence. New speech features related to speech amplitude modulation, which is caused by slowly varying articulatory motion, and standard features extracted from the pitch contour have been included in the regression model. Over the seven primary emotions (including the neutral state), the female model obtains an average coefficient of determination R² of 0.72 (maximum of 0.95 for fear and minimum of 0.60 for sadness), and the male model obtains an R² of 0.81 (maximum of 0.89 for anger and minimum of 0.71 for joy). (C) 2014 Elsevier B.V. All rights reserved.
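The abstract reports model quality as a coefficient of determination. As a minimal sketch, assuming the standard definition R² = 1 - SS_res / SS_tot (the abstract does not say exactly how the value was computed), it can be evaluated as follows. The function name `r_squared` and the sample arousal values are hypothetical, chosen only for illustration, and are not taken from the paper or from EMOVO.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot (assumed standard form)."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error between labels and predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: variation of the labels around their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical arousal labels and regression outputs (illustrative values only).
arousal_true = [0.1, 0.4, 0.5, 0.8, 0.9]
arousal_pred = [0.2, 0.35, 0.55, 0.7, 0.95]
print(round(r_squared(arousal_true, arousal_pred), 3))
```

A value of 1 means the predictions reproduce the labels exactly, while a value of 0 means they do no better than always predicting the mean label.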
Knowledge-Based Systems, vol. 63, pp. 68-81, 2014
Mencattini A; Martinelli E; Costantini G; Todisco M; Basile B; Bozzali M; Di Natale C
File: 91-Mencattini_Knowledge-Based Systems 2014.pdf (publisher's PDF, Adobe PDF, 2.05 MB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1784361