Microsoft’s New AI Technology Can Mimic Anyone’s Voice Based on a 3-Second Sample
VALL-E was trained on audio from over 7,000 English language speakers.
Microsoft has unveiled a somewhat eerie new AI model. Researchers claim that VALL-E can listen to and simulate virtually anyone’s voice. While most AI models that recreate human voices require at least a minute of recorded audio as input, VALL-E needs just a three-second sample.
To develop VALL-E, researchers tapped Meta’s Libri-Light library, which contains audio from over 7,000 speakers, and used it to train the AI on 60,000 hours of English-language recordings.
The company calls VALL-E a “neural codec language model.” It builds on EnCodec, an AI-based audio compression codec from Meta, and uses its discrete audio representations to produce text-to-speech audio.
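In broad strokes, a neural codec language model treats speech generation as next-token prediction over discrete codec tokens rather than raw waveforms. The toy sketch below is purely illustrative (all function names, token counts, and the 75 frames-per-second rate are assumptions, not Microsoft’s or Meta’s actual API): a short voice prompt is quantized into codec tokens, combined with the text to be spoken, and a model predicts continuation tokens that a codec decoder would turn back into audio.

```python
import random

VOCAB_SIZE = 1024  # assumed size of the discrete codec "vocabulary"

def encode_audio(seconds: float, frames_per_sec: int = 75) -> list[int]:
    """Stand-in for a neural codec encoder: maps audio to discrete
    tokens. Real codecs produce learned codes; this emits random IDs
    purely to show the shape of the data."""
    return [random.randrange(VOCAB_SIZE) for _ in range(int(seconds * frames_per_sec))]

def predict_continuation(prompt_tokens: list[int], text: str, n_new: int) -> list[int]:
    """Stand-in for the language model: given the voice-prompt tokens
    and the target text, autoregressively emit new codec tokens that
    (in the real system) would preserve the speaker's timbre."""
    context = list(prompt_tokens) + [ord(c) % VOCAB_SIZE for c in text]
    out: list[int] = []
    for _ in range(n_new):
        # A real model would condition each step on `context + out`;
        # here we just emit placeholder tokens.
        out.append(random.randrange(VOCAB_SIZE))
    return out

# A 3-second prompt yields ~225 codec tokens at the assumed 75 frames/sec.
prompt = encode_audio(3.0)
new_tokens = predict_continuation(prompt, "Hello, world.", n_new=150)
print(len(prompt), len(new_tokens))
```

The key idea this sketch captures is that the three-second sample is not cloned directly; it becomes a token prefix that conditions generation, which is why such a short prompt can suffice.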
Some VALL-E voices are surprisingly realistic, while others fall short. It seems that to produce an accurate simulation, the voice fed into the system must sound reasonably similar to one of the speakers the model was trained on.
Microsoft plans to continue developing the model to improve accuracy and the pronunciation of certain words. For now, the code isn’t open source due to the risk of deepfakes, but those interested can check out a demo of VALL-E.
Surprised there isn’t more chatter around VALL-E
This new model by @Microsoft can generate speech in any voice after only hearing a 3s sample of that voice
Demo → https://t.co/GgFO6kWKha pic.twitter.com/JY88vf4lYc
— Steven Tey (@steventey) January 9, 2023