
Feature separation

The neural network must be fed meaningful information. If there are no clusters within the data and no distinguishing relationships between features, then there is nothing for the neural network to learn. Here we show that the features extracted from the LPC do in fact provide useful information to the neural network. A subset of four emotions has been selected for visual clarity. The feature values have been normalized by mean and variance to generate a roughly Gaussian distribution of the frequency location of the fourth formant. Notice that the mean, variance, and skewness of the four probability density functions are distinct.
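As a rough illustration of this comparison, the sketch below normalizes fourth-formant frequencies by a pooled mean and standard deviation and then computes the mean, variance, and skewness for each emotion. It assumes the normalization is applied to the pooled data (the text does not say); the emotion names and the synthetic frequencies are hypothetical placeholders, not values from the study.

```python
import numpy as np

def moments(x):
    """Return (mean, variance, skewness) of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = x.var()
    skew = np.mean((x - mu) ** 3) / var ** 1.5
    return mu, var, skew

def pooled_standardize(groups):
    """Normalize every group by the mean and std of the pooled data (assumed scheme)."""
    pooled = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
    mu, sigma = pooled.mean(), pooled.std()
    return {name: (np.asarray(v, dtype=float) - mu) / sigma
            for name, v in groups.items()}

# Hypothetical fourth-formant frequencies in Hz, for illustration only.
rng = np.random.default_rng(0)
f4 = {
    "emotion_a": rng.normal(3500.0, 150.0, 200),
    "emotion_b": rng.normal(3700.0, 250.0, 200),
}
normalized = pooled_standardize(f4)
stats = {name: moments(v) for name, v in normalized.items()}
```

If the per-emotion entries in `stats` differ clearly, the feature carries information that a classifier could exploit.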

Computational complexity

This algorithmic pipeline involves a significant number of computationally heavy tasks. In its original iteration, each sample required 50 seconds of processing time before the neural network was trained. We reduced the order of the filter as far as an acceptable level of passband ripple allowed, vectorized the LPC windowing implementation, and replaced the find operation used in formant detection with logical indexing. These measures reduced the average computation time to about half a second per sample.
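Two of these optimizations can be sketched in NumPy, used here as a stand-in for the original implementation, which is not shown. The Hamming window, the frame length and hop, and the frequency band are assumptions chosen for illustration only.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping windowed frames with one indexing operation."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds the sample indices of frame i; no Python loop is needed.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)  # Hamming window is an assumption

def select_with_find(freqs, lo, hi):
    """Index-based selection, analogous to a find-style lookup."""
    freqs = np.asarray(freqs, dtype=float)
    idx = np.nonzero((freqs >= lo) & (freqs <= hi))[0]
    return freqs[idx]

def select_with_mask(freqs, lo, hi):
    """Logical indexing: apply the boolean mask directly."""
    freqs = np.asarray(freqs, dtype=float)
    return freqs[(freqs >= lo) & (freqs <= hi)]

frames = frame_signal(np.arange(10), frame_len=4, hop=2)
candidates = [100.0, 500.0, 900.0, 4000.0]  # hypothetical candidate frequencies in Hz
kept = select_with_mask(candidates, 300.0, 3500.0)
```

Both selection functions return the same values; the masked version simply skips the intermediate index array.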

Neural network learning curves

To evaluate the performance of the neural network, we split the database into two distinct sections. The training set (70% of the database) was used to train the neural network. A logarithmically spaced range of regularization values was tested. When each trained set of internodal weights was applied to the test set (30% of the database), we were able to determine the true accuracy of the neural network. We found that the test set accuracy actually peaked at a low value of regularization and then decreased. This is a characteristic signature of having far too small a database. Given that the size of the database was fixed, we attempted to compensate by minimizing the number of input features to the neural network. It is important that the ratio of training examples to input features be large. If we had instead seen a gap between the training and test set performance even at high levels of regularization, we would have known that our choice of input features was inherently faulty.
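The evaluation procedure can be sketched as a 70/30 split followed by a sweep over logarithmically spaced regularization values. The sketch below is not the authors' network: a ridge-regularized linear classifier stands in for the neural network so the example stays short, and the synthetic data, the class count, and the regularization range are all assumptions.

```python
import numpy as np

def split_70_30(X, y, rng):
    """Shuffle, then put 70% of the samples in the training set and 30% in the test set."""
    idx = rng.permutation(len(X))
    cut = (7 * len(X)) // 10
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]

def fit_ridge_classifier(X, y, n_classes, lam):
    """Stand-in model: least squares on one-hot targets with an L2 penalty lam."""
    Y = np.eye(n_classes)[y]
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(X @ W, axis=1) == y))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 200, 10, 4
y = rng.integers(0, n_classes, size=n_samples)
centers = rng.normal(size=(n_classes, n_features))
X = centers[y] + rng.normal(scale=2.0, size=(n_samples, n_features))

X_train, y_train, X_test, y_test = split_70_30(X, y, rng)
lambdas = np.logspace(-3, 3, 7)  # assumed range: 0.001 up to 1000

learning_curve = []
for lam in lambdas:
    W = fit_ridge_classifier(X_train, y_train, n_classes, lam)
    learning_curve.append((lam, accuracy(W, X_train, y_train), accuracy(W, X_test, y_test)))
```

Plotting the training and test accuracies in `learning_curve` against `lambdas` gives curves of the kind discussed above.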

Algorithm performance

We established a human performance baseline by playing a randomized list of 100 samples to three individuals. None were able to significantly outperform chance. Clearly, humans struggle to classify a precise emotion from such a large list, especially without any context. If the test is altered to be a binary matching test in which the subject is asked to evaluate whether a given emotion is intoned in a given sample, human performance improves markedly. This suggests that in normal everyday interaction, we rely on semantic content and situational context to derive an emotional expectation; then, we simply evaluate the speaker's tone to see if it matches that expectation.

The algorithm correctly identifies the tone of the speaker's voice with 26 percent accuracy. The algorithm was particularly adept at detecting contempt (41%), anxiety (37%), sadness (33%), and boredom (31%). It struggled to identify cold anger (4%).
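Overall and per-emotion figures of this kind can be computed from the true and predicted labels of the test set. The sketch below shows one way to do so; the label arrays are toy values for illustration and are not data from the study.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """For each class, the fraction of its samples that were labeled correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(n_classes, np.nan)  # NaN marks classes absent from y_true
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_pred[mask] == c)
    return acc

# Toy labels for three hypothetical emotion classes.
y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0])
overall = float(np.mean(y_true == y_pred))
by_class = per_class_accuracy(y_true, y_pred, n_classes=3)
```

Reporting both numbers matters because a modest overall accuracy can hide classes that are recognized well alongside classes that are almost never recognized.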

Source:  OpenStax, Robust classification of highly-specific emotion in human speech. OpenStax CNX. Dec 14, 2012 Download for free at http://cnx.org/content/col11465/1.1
