First came deepfake images, strikingly convincing and created by an artificial intelligence (AI) called Dall-E. Then ChatGPT from OpenAI sent the world into a frenzy with its AI-generated texts; there is talk of a game changer and a Google search killer, in which Microsoft reportedly wants to invest heavily to boost its own search engine. Now Microsoft follows up with Vall-E, an AI built on Meta technology that is said to imitate voices almost perfectly, based on an audio sample only three seconds long. That is impressive, but also extremely dangerous, especially from the point of view of fact checkers. The floodgates seem to be open to abuse through deepfake audio.
Voice generator Vall-E delivers almost perfect imitation
Vall-E is intended to render given text as audible speech in human voices, and according to initial reports it sometimes comes very, very close to the original.
Vall-E is based on a technology called Encodec, which Facebook parent Meta introduced in October 2022. Previous approaches generated voices directly at the waveform level. Vall-E, on the other hand, actually learns how a person speaks and breaks this information down into discrete components, called tokens.
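The step from a raw waveform to discrete tokens can be pictured with a toy quantizer. The sketch below is purely illustrative and has nothing to do with Meta's actual Encodec: it cuts a fake audio signal into short frames and maps each frame to the index of its nearest entry in a made-up codebook, so the clip becomes a short sequence of token IDs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "audio": one second of a noisy 220 Hz sine wave at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(sr)

# Cut the waveform into short frames (here 20 ms, i.e. 320 samples each).
frame_len = 320
frames = wave[: sr - sr % frame_len].reshape(-1, frame_len)

# A fixed random "codebook": each row stands for one reusable sound snippet.
codebook = rng.standard_normal((64, frame_len))

# Quantize: each frame becomes the index (token) of its nearest codebook entry.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
tokens = dists.argmin(axis=1)

print(tokens[:10])   # a short sequence of discrete token IDs
print(tokens.shape)  # one token per 20 ms frame
```

A model that operates on such token sequences can treat speech much like text, which is what makes the language-model-style approach behind Vall-E possible in the first place.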

Vall-E sounds like you and me – in all situations
Once Vall-E has learned a voice, it can imitate the original surprisingly well from written text. It preserves the speaker's expressive range, such as emotional undertones and individual timbre. Acoustic environments can also be reproduced, such as the sound of a voice on a cell phone.
Microsoft envisions many possible uses. Originally developed for text-to-speech applications, the technology could, for example, read out a written message in the author's own voice in messenger services. Corrections to audio that has already been recorded also seem possible.
Another interesting possibility is having entire audiobooks or podcasts recorded via Vall-E without the original speakers actually having to be present.
Not yet publicly available
Vall-E is not yet available to the public, but a demo site has been set up to give a first impression of the AI's capabilities. It also becomes clear there that Vall-E can do a lot but still has its limits, at least for now. The AI is currently trained on approximately 60,000 hours of audio material from around 7,000 English speakers.
If the three-second audio sample resembles one of these voices, Vall-E produces a very good result. Otherwise, the AI generates the voice on its own, and it then sounds noticeably like a computer. However, it is easy to see that every additional hour of sample material from ever more training speakers will advance and perfect the AI, including in languages other than English.
Anyone who follows the thread of current developments and imagines a combination of an AI such as the chatbot ChatGPT with Vall-E is no longer in the realm of science fiction, as Ars Technica reports. Add AI-generated visual representations and things get really exciting.
Vall-E: The blueprint for voice deepfake?
If a human no longer needs to deliberately speak content, and arbitrary sound samples are enough to let an AI speak any text in a deceptively similar voice, criminals everywhere will prick up their ears. Imagine people soon receiving convincing shock calls in which their own children supposedly phone their parents in an alleged emergency and ask for money. Mimikama is full of examples of such shock calls, which even without voice AI still work far too often and cause victims significant financial losses.
But this is not the only place where examples of abuse can be found; it is easy to imagine what becomes possible. Politicians could have words put into their mouths that they have never said and would never say. What about the evidentiary value or admissibility of sound recordings in court? Original sound recordings, which are so important in journalism, could be misused by unscrupulous members of the profession or used to deceive the press. In the context of cyber warfare, as we are currently experiencing daily in the Ukraine war, such voice deepfakes are capable of shaping public sentiment and thus negatively influencing the course of the conflict.
Social media is ideal for distributing fakes, whether images, texts or voices, with the known consequences. If everything we rely on with our five senses can be faked, what happens to our judgment? Enormous media literacy will be necessary here.
Microsoft is taking precautions – hopefully
Vall-E's capabilities carry enormous risks of voice manipulation. Microsoft sees this too and proactively points to possible safeguards:
“Because VALL-E can synthesize speech that preserves the identity of the speaker, there is a risk that the model could be misused, for example to spoof voice recognition or impersonate a specific speaker. To minimize such risks, a detection model can be built to distinguish whether an audio clip has been synthesized by VALL-E. As we further develop the models, we will also put Microsoft's AI principles into practice.”
Microsoft
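The detection model Microsoft describes can be pictured as an ordinary binary classifier. The following minimal sketch is purely illustrative and has nothing to do with Microsoft's actual system: the per-clip "features" and the separation between real (label 0) and synthetic (label 1) audio are invented for the demo, and the detector is a plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented features for 200 audio clips, e.g. spectral statistics per clip.
# Assumption for the demo: synthetic clips differ slightly in feature space.
real = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
fake = rng.normal(loc=0.8, scale=1.0, size=(100, 4))
X = np.vstack([real, fake])
y = np.array([0] * 100 + [1] * 100)

# Minimal logistic-regression detector trained by gradient descent.
w, b = np.zeros(4), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability "synthetic"
    w -= 0.1 * (X.T @ (p - y) / len(y))  # gradient step on the weights
    b -= 0.1 * (p - y).mean()            # gradient step on the bias

p = 1 / (1 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

In reality, the hard part is exactly what this toy glosses over: finding features that reliably separate synthesized from genuine speech, especially as the generators keep improving.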
A little fun fact at the end: Where does the name come from?
A little joke. We all know the lonely little robot from the Pixar film “WALL·E” who unexpectedly finds love. The name of the image AI Dall-E was formed from the name of this little cleaning robot and that of the surrealist artist Salvador Dalí. Vall-E is the further development; the V stands for voice.
Source:
Microsoft
Notes:
1) This content reflects the current state of affairs at the time of publication. The reproduction of individual images, screenshots, embeds or video sequences serves to discuss the topic. 2) Individual contributions were created with the use of machine assistance and were carefully checked by the Mimikama editorial team before publication. (Reason)

