In a recent paper presented at the 2019 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Amazon showcased a new approach to training a neural network on a publicly available data set to recognize a speaker's emotion.
Before we look at what Amazon has demonstrated, let's shed some light on emotion recognition, one of the most active areas of contemporary research in conversational AI.
Beyond the words themselves, a person's tone of voice reveals a great deal about how they are feeling. Many industries stand to benefit from recognizing emotions: it can help in patient health monitoring, make conversational-AI systems more engaging, and provide implicit customer feedback that enables voice agents like Alexa to learn from their mistakes.
This new approach, developed by a team of researchers from Amazon, used a publicly available data set to train a neural network known as an adversarial autoencoder: an encoder-decoder network consisting of two components:

- an encoder, which learns to produce a compact (or latent) representation of the input speech
- a decoder, which reconstructs the input from that compact representation
The compact, or latent, representation encodes all properties of the training example. In the model developed by Amazon, part of the latent representation is dedicated to the speaker's emotional state, while the remaining part captures all other characteristics of the input.
Amazon's latent representation of emotion consists of three network nodes, one for each of three emotional measures:
- valence, whether the speaker’s emotion is positive or negative;
- activation, whether the speaker is alert and engaged or passive;
- and dominance, whether the speaker feels in control of the situation.
The remaining part of the latent representation is much larger, comprising 100 nodes.
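The split latent representation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Amazon's actual model: the input dimension, the randomly initialized weights, and the single-layer encoder and decoder are all assumptions; only the 3-node emotion part and 100-node general part come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 40-dim acoustic-feature input, 3 emotion
# nodes (valence, activation, dominance) plus 100 general latent nodes.
INPUT_DIM, EMOTION_DIM, OTHER_DIM = 40, 3, 100
LATENT_DIM = EMOTION_DIM + OTHER_DIM

# Randomly initialized weights stand in for trained parameters.
W_enc = rng.normal(0, 0.1, (LATENT_DIM, INPUT_DIM))
W_dec = rng.normal(0, 0.1, (INPUT_DIM, LATENT_DIM))

def encode(x):
    """Map an input frame to a latent code, then split off the emotion part."""
    z = np.tanh(W_enc @ x)
    return z[:EMOTION_DIM], z[EMOTION_DIM:]

def decode(emotion, other):
    """Reconstruct the input from the concatenated latent parts."""
    return W_dec @ np.concatenate([emotion, other])

x = rng.normal(size=INPUT_DIM)      # one frame of acoustic features
emotion, other = encode(x)
x_hat = decode(emotion, other)
print(emotion.shape, other.shape, x_hat.shape)  # (3,) (100,) (40,)
```

The key design point is that the decoder sees both halves of the code, so the network cannot simply discard the emotion nodes: they must carry information useful for reconstructing the input.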
Amazon conducted the training in three phases. In the first phase, both the encoder and the decoder were trained on data without labels. In the second phase, adversarial training was applied to tune the encoder: a separate neural network, the adversarial discriminator, tries to distinguish the encoder's representations of real data from artificial representations drawn from a target probability distribution, and the encoder in turn learns to produce representations that fit that distribution. In the third phase, the encoder was tuned to ensure that the latent representation predicts the emotional labels of the training data.
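The second, adversarial phase can be illustrated with a toy discriminator in NumPy. This is a sketch under stated assumptions, not Amazon's implementation: the discriminator here is plain logistic regression, the target distribution is a standard normal, and the "encoder codes" are simulated as a shifted distribution that does not yet match it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: "artificial" codes sampled from the target prior
# (standard normal) and codes from a not-yet-trained encoder (shifted).
DIM, N = 8, 256
prior_samples = rng.normal(0.0, 1.0, (N, DIM))   # matches the prior
encoder_codes = rng.normal(1.5, 1.0, (N, DIM))   # does not (yet)

# Logistic-regression discriminator: predicts P(code came from prior).
w, b = np.zeros(DIM), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(lr=0.1):
    """One gradient-descent step on binary cross-entropy."""
    global w, b
    X = np.vstack([prior_samples, encoder_codes])
    y = np.concatenate([np.ones(N), np.zeros(N)])  # 1 = from prior
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)
    # Accuracy at telling the two sources apart.
    return np.mean((p > 0.5) == y)

for _ in range(200):
    acc = step()
print(f"discriminator accuracy: {acc:.2f}")
```

Because the two distributions differ, the discriminator quickly learns to tell them apart. In full adversarial training, the encoder would then be updated to fool this discriminator, pushing its codes toward the target distribution until the discriminator can no longer separate them.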
All three training phases are repeated until the best-performing model emerges. For training, Amazon used a public data set containing 10,000 utterances from 10 different speakers, labeled according to the three emotional measures (valence, activation, and dominance).
Amazon reports that their system achieved 3% better accuracy in assessing valence than a conventionally trained network, and a 4% improvement when the inputs to the network took the form of acoustic characteristics of 20-millisecond frames, or audio snippets.
Going forward, we can expect further contributions from Amazon toward improving speech-based emotion detection.