Can you tell which sentence was spoken by Google's AI? New audio samples are tough to distinguish from human speech

Google says the second generation of its text-to-speech system comes close to human accuracy when imitating a person reading text aloud. Audio files provided by Quartz give you a chance to assess this claim.

The text-to-speech system, Tacotron 2, is made up of two deep neural networks. The first translates text into a spectrogram, a visual representation of audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the spectrogram and generates the matching audio.
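
For readers unfamiliar with spectrograms, the short Python sketch below shows what one records. It does not reproduce Tacotron 2 or WaveNet, and none of it comes from Google's or Quartz's material: the `spectrogram` function, the sample rate, the frame size and the two test tones are all assumptions chosen for illustration. It builds a signal that switches from a 500 Hz tone to a 1,500 Hz tone, slices it into overlapping frames, and measures how much energy each frame holds at each frequency.

```python
import cmath
import math

# All constants are illustrative assumptions, not values used by Tacotron 2.
SAMPLE_RATE = 8000   # samples per second
FRAME_SIZE = 64      # samples per analysis frame
HOP = 32             # samples between the starts of consecutive frames


def spectrogram(samples, frame_size, hop):
    """Return one list of frequency-bin magnitudes per frame of audio."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
            for n, s in enumerate(frame)
        ]
        # Plain discrete Fourier transform over the non-negative frequencies.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def tone(freq_hz, num_samples):
    """Generate a pure sine tone at the given frequency."""
    return [
        math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(num_samples)
    ]


# A signal whose pitch changes halfway through.
signal = tone(500, 256) + tone(1500, 256)
spec = spectrogram(signal, FRAME_SIZE, HOP)

# Strongest frequency in each frame, converted from bin index to hertz.
peak_hz = [
    max(range(len(frame)), key=frame.__getitem__) * SAMPLE_RATE / FRAME_SIZE
    for frame in spec
]
print(peak_hz[0], peak_hz[-1])
```

Each row of `spec` is one slice of time and each column is one frequency band, which is the grid a spectrogram image draws. In this toy signal the strongest band moves from 500 Hz in the first frame to 1,500 Hz in the last.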

Quartz has six one-sentence audio clips, with three generated by AI and three spoken by a human hired by Google. Readers can play the clips on the Quartz site and try to distinguish the human from the AI.

Although answers are not provided, author Dave Gershgorn notes “if you reveal the ‘page source’ and look at the filenames of each on the Google research website, one is labeled ‘gen,’ ostensibly to mark the generated sample.”

The system is currently trained to mimic only one female voice. Google would need to retrain the system for it to speak in a male voice or a different female voice, Quartz notes.
