NATIONAL CHENG KUNG UNIVERSITY, TAINAN, TAIWAN
BANYAN
Volume 31 Issue 4 - March 3, 2017
Commentary
Article Digest
Chung-Hsien Wu
Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels
Woei-Jer Chuang
Effects of the RGD loop and C-terminus of rhodostomin on regulating integrin αIIbβ3 recognition
Chuh-Yung Chen
The value of teaching and contribution of industry-university cooperative research for 40 years
Chen-Sheng Yeh
What can a chemist do in nanomedicine?
Yei-Chin Chao
Challenge of Long-Term Indigenous Development of Critical Systems – Development of Hydrogen Peroxide Satellite Reaction Control System (RCS) as an Example
Wei-Min Zhang
Breakdown of Bose-Einstein Distribution in Photonic Crystals
Yang-Yih Chen
Evolution of breaking waves on sloping beaches
Sun-Yuan Hsieh
A multi-index hybrid trie for IP lookup and updates
Article Digest
Po-Wu Gean
Social isolation-induced increased NMDA receptors in ventral hippocampus primes mice for aggressive behaviors
News Release
NCKU Press Center
NCKU President shares her love of science with high school girls in Taiwan
NCKU Press Center
Taiwan-Germany Workshop and Symposium opens in NCKU
NCKU Press Center
New reading center “KnowLEDGE” set up at NCKU to accommodate more students
NCKU Press Center
NCKU calligrapher displays Chibifu on wall
Banyan Forum
Opportunities
Activities
Editorial Group
Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels
Chung-Hsien Wu*, Wei-Bin Liang
Department of Computer Science and Information Engineering, National Cheng Kung University
【105 Outstanding Research Award】Special Issue

Speech is one of the most fundamental and natural means of human communication. With the exponential growth in available computing power and significant progress in speech technologies, spoken dialogue systems (SDSs) have been successfully applied in several domains. However, applications of SDSs are still limited to simple informational dialogue systems, such as navigation systems and air travel information systems [1][2]. To enable more complex applications (e.g., home nursing [3], educational/tutoring, and chatting [4]), new capabilities, such as affective interaction, are needed. However, to achieve affective interaction via speech, several problems in speech technologies still exist, including low accuracy in recognizing highly affective speech and a lack of affect-related common sense and basic knowledge. This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information (AP) and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features, including spectrum-, formant-, and pitch-related features, are extracted from the detected emotional salient segments of the input speech. Three types of models, Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), and Multilayer Perceptrons (MLPs), are adopted as the base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from an existing Chinese knowledge base called HowNet are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. The maximum entropy model (MaxEnt) is then utilized to characterize the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method integrates the AP-based and SL-based recognition results into a final emotion decision.
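As a rough illustration of the final decision step, the sketch below combines the two recognizers' per-emotion confidence vectors with a weighted product. The weight alpha and the example scores are hypothetical placeholders; in practice the fusion weight would be tuned on development data.

```python
import numpy as np

EMOTIONS = ["Neutral", "Happy", "Angry", "Sad"]

def weighted_product_fusion(ap_conf, sl_conf, alpha=0.6):
    """Weighted product fusion of AP-based and SL-based confidences.

    ap_conf, sl_conf: per-emotion confidence vectors from the two
    recognizers. alpha weights the AP-based recognizer (hypothetical
    value; it would be tuned on held-out data in practice).
    """
    ap = np.asarray(ap_conf, dtype=float)
    sl = np.asarray(sl_conf, dtype=float)
    fused = (ap ** alpha) * (sl ** (1.0 - alpha))
    fused /= fused.sum()          # renormalize into a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: acoustic evidence leans Angry, semantic evidence leans Happy.
label, scores = weighted_product_fusion(
    ap_conf=[0.10, 0.20, 0.60, 0.10],
    sl_conf=[0.05, 0.55, 0.30, 0.10],
)
print(label, scores.round(3))   # acoustic evidence dominates here
```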

Figure 1 illustrates the block diagram of the training and testing procedures for AP- and SL-based emotion recognition. In the AP-based approach, emotional salient segments (ESS) are first detected from the input speech. Acoustic and prosodic features, including spectrum-, formant-, and pitch-related features, are extracted from the detected segments and used to construct the GMM-based, SVM-based, and MLP-based base-level classifiers. The MDT is then employed to combine the three classifiers by selecting the most promising classifier for AP-based emotion recognition. In the SL-based approach, the word sequence produced by a speech recognizer is used: the semantic labels of the words are derived from HowNet [5], and a text-mining approach extracts the Emotion Association Rules (EARs) of the word sequence. The MaxEnt model [6] is then employed to characterize the relation between emotional states and EARs and to output the emotion recognition result. Finally, the outputs of the two recognizers are integrated using a weighted product fusion method to determine the final emotional state. Furthermore, to investigate the effect of individual personality characteristics, the personality trait obtained from the Eysenck Personality Questionnaire (EPQ) for a specific speaker is incorporated for personalized emotion recognition.
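The sketch below illustrates MDT-style classifier selection on the AP side, assuming feature vectors have already been extracted from the emotional salient segments. The GMMClassifier wrapper, the meta-labeling rule, and all hyperparameters are simplifications for illustration; the MDT described in the paper operates on meta-level attributes rather than raw feature vectors.

```python
# A minimal sketch: three base-level classifiers plus a meta decision
# tree (MDT) that selects, per utterance, which classifier to trust.
# X_* are assumed spectrum/formant/pitch feature matrices, y_* emotion
# labels; the paper's MDT uses meta-level attributes, not raw features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

class GMMClassifier:
    """One GMM per emotion class; classify by maximum log-likelihood."""
    def __init__(self, n_components=4):
        self.n_components = n_components

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_ = {c: GaussianMixture(self.n_components).fit(X[y == c])
                        for c in self.classes_}
        return self

    def predict(self, X):
        loglik = np.column_stack(
            [self.models_[c].score_samples(X) for c in self.classes_])
        return self.classes_[loglik.argmax(axis=1)]

def train_mdt(X_train, y_train, X_meta, y_meta):
    """Train the base classifiers, then a decision tree that predicts
    which base classifier is correct on held-out meta-training data."""
    bases = [GMMClassifier().fit(X_train, y_train),
             SVC().fit(X_train, y_train),
             MLPClassifier(max_iter=500).fit(X_train, y_train)]
    preds = np.column_stack([b.predict(X_meta) for b in bases])
    correct = preds == np.asarray(y_meta)[:, None]
    # Meta-label: index of the first correct base classifier
    # (a simplified selection criterion for illustration).
    meta_y = np.where(correct.any(axis=1), correct.argmax(axis=1), 0)
    mdt = DecisionTreeClassifier(max_depth=3).fit(X_meta, meta_y)
    return bases, mdt

def predict_mdt(bases, mdt, X):
    choice = mdt.predict(X)                  # selected classifier index
    preds = np.column_stack([b.predict(X) for b in bases])
    return preds[np.arange(len(X)), choice]
```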

For evaluation, 2,033 utterances covering four emotional states (Neutral, Happy, Angry, and Sad) were collected. The evaluation results are shown in Table 1. According to the EPQ result, speaker A is an extrovert, and the recognition performance for the correspondingly strongly expressed emotions, happy and angry, was improved. For speaker B, who is neither an extrovert nor an introvert, the difference in the evaluation results is small. In addition, the subjects were satisfied with the fine-tuned system when they tested it again. The evaluation shows that the proposed approach performs well on the emotion recognition task. In summary, the average recognition accuracy of the system reaches 85.79% when personality traits are considered. These results confirm the effectiveness of the proposed approach.
Figure 1: Overview of the training and testing flowchart for acoustic-prosodic information-based recognition, semantic label-based recognition, and the personality trait


Copyright National Cheng Kung University