Can you tell which sentence was spoken by Google's AI? New audio samples are tough to distinguish from human speech

Google says the second generation of its text-to-speech system achieves near-human accuracy at imitating a person reading text aloud. Audio files published by Quartz give you a chance to assess this claim.

The text-to-speech system, Tacotron 2, is made up of two deep neural networks. The first translates text into a spectrogram, a visual representation of how the frequencies in audio change over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the spectrogram and generates the matching audio.
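For readers unfamiliar with spectrograms, the short Python sketch below shows what one is by computing it from a synthetic tone. It is only an illustration and not part of Tacotron 2, which goes the other way and predicts a spectrogram from text. The function name, frame size, hop size and sample rate are all assumptions chosen for the example.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of time frames; each frame lists the magnitude of every frequency bin.

    Illustrative only: a plain short-time Fourier transform, not Google's method.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper the frame with a Hann window to reduce edge artifacts.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Discrete Fourier transform, keeping only the non-negative frequencies.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


SAMPLE_RATE = 8000  # samples per second (assumed for this example)
TONE_HZ = 1000      # a pure 1 kHz test tone standing in for speech

samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(samples)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
print(len(spec), "frames x", len(spec[0]), "frequency bins; peak at", peak_hz, "Hz")
```

Each row of the result is one moment in time and each column is one frequency band, which is the grid a spectrogram image displays. A second network such as WaveNet would then have to turn a grid like this back into a waveform.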

Quartz has six one-sentence audio clips, three generated by AI and three spoken by a human hired by Google. Visit the Quartz site to play the clips and see whether you can distinguish the human from the AI.

Although answers are not provided, Quartz author Dave Gershgorn notes, "if you reveal the 'page source' and look at the filenames of each on the Google research website, one is labeled 'gen,' ostensibly to mark the generated sample."

The system is currently trained to mimic only one female voice. Google would need to retrain it to speak in a male voice or a different female voice, Quartz notes.

Copyright © 2022 Becker's Healthcare. All Rights Reserved.

