Microsoft, in collaboration with Chinese researchers, has created a text-to-speech AI that can produce realistic speech after training on only 200 voice samples and their matching transcriptions.
The system relies in part on ‘Transformers,’ a type of neural network that dynamically weighs how strongly each element of the input bears on each element of the output. Transformers allow even long, complex sentences to be processed quickly.
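The weighing that Transformers perform can be illustrated with a minimal sketch of scaled dot-product self-attention. This is not the researchers' code: the function names are hypothetical, and the learned query, key, and value projections that a real Transformer applies are omitted for brevity.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention (assumption: no learned projections).

    Each output vector is a weighted average of all input vectors, with
    weights derived from scaled dot products between the inputs.
    Returns (outputs, weights), where weights[i][j] is how much
    input j contributes to output i.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights
```

Calling `self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` returns one output per input, and each row of weights sums to 1. Because every position is compared with every other in a single pass, long sentences do not have to be read one element at a time.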
The AI also includes a denoising encoder, a component trained to recover clean data from corrupted input, which makes it more efficient still.
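The article does not describe how the encoder's input is corrupted, so the following is only a sketch of the general denoising idea under an assumed scheme: some tokens are randomly replaced with a mask symbol, and a model would then be trained to reconstruct the original sequence. The function name, the `<mask>` symbol, and the masking probability are all hypothetical.

```python
import random


def corrupt(tokens, mask_prob=0.3, mask_token="<mask>", seed=None):
    """Randomly replace tokens with a mask symbol (assumed noise scheme).

    A denoising model is trained on pairs of (corrupted, original)
    sequences and learns to restore the original from the noisy copy.
    """
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    return [
        mask_token if rng.random() < mask_prob else token
        for token in tokens
    ]


def make_training_pair(tokens, mask_prob=0.3, seed=None):
    """Return (noisy_input, clean_target) for one training example."""
    return corrupt(tokens, mask_prob=mask_prob, seed=seed), list(tokens)
```

Because the training target is simply the uncorrupted sequence, examples of this kind can be built from speech or text alone, without a matching transcription.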
“We demonstrate in our experiments that our designed components are necessary to develop the capability of speech and text transformation with few paired data,” the researchers wrote.
Although the results still sound somewhat robotic, they are highly accurate, with a word intelligibility rate of 99.84 percent.
This technology could make text-to-speech far more accessible, since little effort is required to obtain realistic voices.
In the future, producing realistic speech may require even less effort, as the researchers hope to train the system on unpaired data.