Systematic Evaluation of Deep Learning Paradigms for Speech Emotion Recognition Using Diverse Audio Sources

Authors

  • Yogeshkumar Prajapati, Research Scholar, Gujarat Technological University, Ahmedabad, Gujarat, India
  • Dr. Priyesh Gandhi, Provost, Sigma University, Vadodara, Gujarat, India
  • Dr. Sheshang Degadwala, Professor and Head, Department of Computer Engineering, Sigma University, Vadodara, Gujarat, India

DOI:

https://doi.org/10.32628/IJSRST25123148

Keywords:

Speech Emotion Recognition, Audio Feature Extraction, Machine Learning, Data Augmentation, Ensemble Learning

Abstract

Speech emotion recognition is one of the most challenging problems in human-computer interaction, with important ramifications for assistive technologies, customer support, and mental health monitoring. Despite substantial advances in machine learning, accurately identifying emotional states from speech remains difficult because vocal emotional expression is complex and nuanced across diverse speakers and contexts. This study presents a comprehensive evaluation of Speech Emotion Recognition (SER) systems across multiple machine learning paradigms using four benchmark datasets (CREMA-D, RAVDESS, SAVEE, and TESS). We implement a multi-feature extraction approach incorporating prosodic, spectral, and voice quality features, and employ data augmentation techniques to enhance model robustness. Our investigation spans traditional machine learning algorithms, ensemble methods, and deep learning architectures, including CNN and RNN implementations. Performance evaluation shows that the Stacking Classifier performs best (accuracy: 72.54%, F1-score: 72.47%), with strong performance from Random Forest (68.31% accuracy) and ResNet (66% accuracy). This comparative analysis advances affective computing by providing detailed insight into the effectiveness of various approaches for emotion recognition in speech, with implications for developing more sophisticated emotional intelligence systems.
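The pipeline outlined in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the librosa feature set (MFCCs, chroma, zero-crossing rate, RMS energy, spectral centroid), the augmentation settings (additive noise, a two-semitone pitch shift), and the stacking base learners (Random Forest and SVM with a logistic-regression meta-learner) are assumptions standing in for the paper's unspecified configuration.

```python
# Hand-crafted feature extraction + simple augmentation + stacking ensemble.
# All feature choices, augmentation parameters, and base learners below are
# illustrative assumptions, not the paper's exact setup.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def extract_features(y, sr):
    """Concatenate mean/std summaries of prosodic and spectral descriptors."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral envelope
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # pitch-class energy
    zcr = librosa.feature.zero_crossing_rate(y)               # noisiness proxy
    rms = librosa.feature.rms(y=y)                            # energy (prosody)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral brightness
    feats = [mfcc, chroma, zcr, rms, centroid]
    return np.concatenate(
        [np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in feats]
    )

def augment(y, sr):
    """Two simple augmentations: additive noise and a pitch shift (assumed settings)."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=2)
    return [noisy, shifted]

# Stacking ensemble over classical base learners, mirroring the comparison of
# traditional ML and ensemble methods reported in the abstract.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
# X and y_labels would be built by applying extract_features (and augment) to the
# CREMA-D / RAVDESS / SAVEE / TESS audio files before calling stack.fit(X, y_labels).
```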


References

Akçay, M. B., & Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56-76.

Latif, S., Qadir, J., Farooq, S., & Imran, M. A. (2018). Cross lingual speech emotion recognition: Urdu vs. western languages. In 2018 International Conference on Frontiers of Information Technology (FIT) (pp. 88-93). IEEE.

Poria, S., Cambria, E., Bajpai, R., & Hussain, A. (2017). A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37, 98-125.

Harar, P., Galaz, Z., Alonso-Hernandez, J. B., Mekyska, J., Burget, R., & Smekal, Z. (2017). Towards robust voice pathology detection. Neural Computing and Applications, 32(15), 11337-11350.

Yoon, S., Byun, S., & Jung, K. (2018). Multimodal speech emotion recognition using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 112-118). IEEE.

Aouani, H., & Ben Ayed, Y. (2020). Speech emotion recognition with deep learning. Procedia Computer Science, 176, 251-260.

Andayani, F., Theng, L. B., Tsun, M. T., & Chua, C. (2022). Hybrid LSTM-transformer model for emotion recognition from speech audio files. IEEE Access, 10, 36018-36027.

Dutt, A., & Gader, P. (2021). Wavelet-based deep emotion recognition (WaDER). In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6329-6333). IEEE.

Amjad, A., Kordel, P., & Fernandez-Rojas, R. (2022). Emotion recognition from speech using convolutional neural networks with attention mechanisms. Engineering Applications of Artificial Intelligence, 110, 104684.

Lian, Z., Liu, B., & Tao, J. (2021). CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 985-1000.

Mustaqeem, Sajjad, M., & Kwon, S. (2020). Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access, 8, 79861-79875.

Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4), 377-390.

Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS One, 13(5), e0196391.

Jackson, P., & Haq, S. (2014). Surrey Audio-Visual Expressed Emotion (SAVEE) database. University of Surrey: Guildford, UK.

Haq, S., & Jackson, P. J. (2009). Speaker-dependent audio-visual emotion recognition. In ICMI-MLMI '09: Proceedings of the 2009 International Conference on Multimodal Interfaces (pp. 53-60).

Dupuis, K., & Pichora-Fuller, M. K. (2010). Toronto emotional speech set (TESS). Scholars Portal Dataverse, V1.

Issa, D., Demirci, M. F., & Yazici, A. (2020). Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control, 59, 101894.

Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control, 47, 312-323.

Sarma, M., Ghahremani, P., Povey, D., Goel, N. K., Sarma, K. K., & Dehak, N. (2018). Emotion identification from raw speech signals using DNNs. In Interspeech 2018 (pp. 3097-3101).

Kwon, S. (2020). A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1), 183.

Published

20-06-2025

Section

Research Articles