The text-to-speech system, Tacotron 2, is made up of two deep neural networks. The first translates text into a spectrogram, a visual representation of audio frequencies over time expressed as a sequence of spectral vectors. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the spectrogram and generates the corresponding audio.
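To make the idea of a spectrogram concrete, here is a minimal Python sketch that turns a short test tone into a grid of frequency strengths over time. It is an illustration only, not Google's code: the function name `spectrogram`, the frame and hop sizes, and the 1,000 Hz test tone are all assumptions chosen for the example, and the representation Tacotron 2 actually uses may differ from this plain magnitude spectrogram.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return a magnitude spectrogram of shape (num_frames, frame_len // 2 + 1).

    Hypothetical helper for illustration; frame_len and hop are assumed values.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the audio into short overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Each row is one spectral vector: the strength of every frequency
    # present during that short slice of time.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 8000                        # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate  # one second of timestamps
tone = np.sin(2 * np.pi * 1000 * t)       # a pure 1,000 Hz tone

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256    # convert bin index to hertz

print(spec.shape)  # (time frames, frequency bins)
print(peak_hz)     # strongest frequency in the tone
```

Plotted as an image, each row of `spec` becomes one column of the familiar spectrogram picture. In the system described above, the first network would produce a grid like this from text, and the second would work in the opposite direction, from the grid back to sound.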
Quartz has six one-sentence audio clips, three generated by the AI and three spoken by a human hired by Google. Visit the site here to play the clips and see whether you can tell the human from the AI.
Although answers are not provided, author Dave Gershgorn notes “if you reveal the ‘page source’ and look at the filenames of each on the Google research website, one is labeled ‘gen,’ ostensibly to mark the generated sample.”
The system is currently trained to mimic only one female voice. Google would need to retrain it to speak in a male voice or a different female voice, Quartz notes.