Google has unveiled an AI-powered speech synthesis system named Tacotron 2, which marks a major step forward in human-like articulation. According to tech analysts, the new text-to-speech system produces computer-generated speech that is hard to distinguish from a human voice. Google’s AI researchers report that their model achieved a MOS (Mean Opinion Score, the average of listeners’ ratings of naturalness) of 4.53, compared with a MOS of 4.58 for professionally recorded speech. The tech giant’s shift in vision from “mobile-first” to “AI-first”, announced by Sundar Pichai during the Google I/O 2017 developers conference, continues to bear fruit. Several AI products were launched last year, including Google Lens, Smart Reply for Gmail and Google Assistant for iPhone. Tacotron 2 is the latest addition to this list.
How It Works
The system works in two stages. First, a neural network converts the input text into a spectrogram, a time-frequency representation of how the speech should sound. This spectrogram is then fed into Google’s WaveNet algorithm, which generates the audio waveform that mimics human speech. The algorithm can learn different voices and can even generate artificial breaths.
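To make the idea of a spectrogram concrete, here is a minimal sketch in Python using only NumPy. It is not Tacotron 2's code and it runs in the opposite direction (from audio to spectrogram rather than from text to spectrogram); it only illustrates what the intermediate representation looks like. The function name, frame length, hop size, sample rate and test tone are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a simple short-time Fourier transform.

    Hypothetical helper for illustration; frame_len and hop are arbitrary.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One row per time frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 16000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate   # one second of time stamps
tone = np.sin(2 * np.pi * 1000 * t)        # a 1 kHz test tone, not real speech

spec = spectrogram(tone)
# Frequency bin with the most energy, averaged over time.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_hz)
```

Each row of the resulting array is a short slice of time and each column is a frequency band, which is why a spectrogram is often drawn as an image. A vocoder such as WaveNet has the reverse job: given such an array, it produces the waveform.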
In terms of capabilities, Tacotron 2 can use context to differentiate between identically spelled words. For example, it can distinguish between the noun “desert” and the verb “desert” and alter the pronunciation accordingly. Context-driven pronunciation is the highlight of Tacotron 2. The system can also recognize the sentence type (such as a statement or a question) and adjust its pitch and intonation accordingly.
With Tacotron 2, Google is taking one more step towards realizing its “AI-first” vision. We can expect more AI products from the company in the near future.