AI-driven voice assistants are having a moment. Experts estimate that, by the end of this year, there will be 21.4 million smart speakers active in the US alone. And demand is expected to increase in years to come.
AI-powered voice assistants are becoming part of our day-to-day lives, and are therefore changing our economies as well. For instance, using these devices to run Google searches is becoming so common that businesses are starting to allocate resources to voice assistant SEO. And natural language processing is becoming so refined that some businesses are relying on it for important parts of their marketing and sales processes.
In August of 2018, Google Assistant began supporting bilingual usage. Previously, a multilingual user couldn’t switch between languages on the fly when communicating with their assistant. Switching was possible, but only by navigating to the device’s settings and changing its language.
Now, it’s possible to set up Google Assistant so it can understand two languages effortlessly. And the Google AI team is working towards a product that’s fluent in three languages at the same time. But how is this achieved? In order to understand how a device like Google Assistant becomes multilingual, we need to understand how machines process language.
Behind every voice assistant, there’s complex and fascinating technology. The companies behind these devices have to teach them to both produce and recognize speech: That is, to talk, listen, understand and give relevant responses. This endeavor becomes particularly complex when we consider foreign language-speaking or multilingual users.
In this article, we’ll explore how assistants are trained to communicate with us in our language, and what role voice-over services play in crafting a fully-functional multilingual product.
Processing Linguistic Data Is Harder Than You Think
Natural language processing is a discipline within artificial intelligence that aims to develop hardware and software that can process linguistic data. Teaching computers to speak is complicated. While any 2008 home PC can handle incredible amounts of structured data, computers are less equipped to deal with unstructured data. And linguistic information is unstructured data. The very nature of our languages, with their spontaneity, endless contextual nuances, and aesthetic dimension, brings about a whole new layer of complexity.
When we’re teaching a computer to process language, we’re dealing with three great difficulties: How foreign our human languages are to the way a computer operates, the very nature of our language as nuanced and dependent on endless variables, and our growing but still very limited understanding of how our brains work in relation to language.
How Your AI Assistant Works
Let’s say you ask Siri what the weather’s going to be like tomorrow.
Your phone will capture the audio and convert it into text, so it can be processed.
Then, through natural language processing software, your phone will try to decipher the meaning in your words. If your command is structured as a question, the software will identify the semantic markers suggesting that you’ve asked a question. The words “weather” and “tomorrow” inform the software about the content of the question. Then, it will look up the answer on your behalf and communicate its results by turning them into audio.
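The steps above can be sketched as a toy pipeline. This is only an illustration under heavy assumptions, not how Siri is actually built: the "audio" is already a text string, the question detection is a simple keyword check, and the forecast table and every function name are hypothetical stand-ins.

```python
import re

# Hypothetical stand-ins: a real assistant would use trained models and a weather service.
QUESTION_WORDS = {"what", "when", "where", "who", "why", "how", "will", "is"}
FAKE_FORECASTS = {"today": "sunny", "tomorrow": "rainy"}


def speech_to_text(audio: str) -> str:
    """Placeholder for speech recognition: the 'audio' here is already text."""
    return audio.strip()


def parse_request(text: str) -> dict:
    """Pick out whether the text is a question, its topic, and its time frame."""
    words = re.findall(r"[a-z']+", text.lower())
    is_question = text.endswith("?") or (bool(words) and words[0] in QUESTION_WORDS)
    topic = "weather" if "weather" in words else None
    day = next((w for w in words if w in FAKE_FORECASTS), None)
    return {"is_question": is_question, "topic": topic, "day": day}


def answer(request: dict) -> str:
    """Look up a result for the parsed request, or admit defeat."""
    if request["is_question"] and request["topic"] == "weather" and request["day"]:
        forecast = FAKE_FORECASTS[request["day"]]
        return f"The weather {request['day']} will be {forecast}."
    return "Sorry, I didn't understand that."


def text_to_speech(text: str) -> str:
    """Placeholder for speech synthesis: returns the text that would be spoken."""
    return text


def handle(audio: str) -> str:
    """Run the whole chain: audio in, spoken answer out."""
    return text_to_speech(answer(parse_request(speech_to_text(audio))))


print(handle("What's the weather going to be like tomorrow?"))
```

Each function stands for one stage of the process, so swapping the keyword check for a real language model would not change the overall shape.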
Let’s focus on two parts of this process: The initial input and the output. How does Siri understand what we’re saying and how does Siri communicate to us in our language?
Multilingual Voice Commands: Accents & Phonemes
In 2011, when Siri was first released, it faced some backlash. Some considered the overall experience to be subpar. Others complained specifically about the assistant being unable to understand their accent. This was due to a lack of diversity in the material used to train the neural networks that Siri relies on.
Basically, NLP software learns to deal with language through audio and text input. If we only use speech samples from people from a certain locale, with a certain accent (or with purposeful accent neutrality), our software will fail to understand rarer speech patterns or regional accents. That’s why some companies in the field are starting to look for international voice-over services that can provide diverse command samples.
But voice-over artists aren’t only involved in feeding data to train neural networks. They also give them the tools to talk to us: phonemes. Phonemes are a language’s smallest possible units of significant sound. We speak by combining phonemes.
And, as Marco Tabini from Macworld explained in 2013:
“When asked to transform a sentence into speech, the synthesis engine will first look for a predefined entry in its database. If it doesn’t find one, it will then try to make sense of the input’s linguistic makeup, so that it can assign the proper intonation to all the words. Next, it will break it down into combinations of phonemes, and look for the most appropriate candidate sounds in its database.”
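The lookup order described in the quote can be sketched roughly as follows. This is a simplified illustration, not the engine Tabini describes: the tiny tables, the file names, and the `synthesize` function are all hypothetical, the phoneme symbols are simplified ARPAbet-style labels, and the intonation step is left out entirely.

```python
from typing import List

# Hypothetical data: real engines draw on far larger recorded databases.
PREDEFINED = {"hello": "hello.wav"}  # whole utterances recorded in advance
LEXICON = {  # word -> phonemes (simplified ARPAbet-style symbols)
    "rain": ["R", "EY", "N"],
    "today": ["T", "AH", "D", "EY"],
}
SOUNDS = {p: f"{p.lower()}.wav" for p in ["R", "EY", "N", "T", "AH", "D"]}


def synthesize(sentence: str) -> List[str]:
    """Return the list of sound clips that would be played for a sentence."""
    key = sentence.lower().strip(" .!?")

    # Step 1: look for a predefined entry covering the whole input.
    if key in PREDEFINED:
        return [PREDEFINED[key]]

    # Step 2: otherwise break each word into phonemes and pick a clip for each.
    # Words missing from the toy lexicon are silently skipped in this sketch.
    clips = []
    for word in key.split():
        for phoneme in LEXICON.get(word, []):
            clips.append(SOUNDS[phoneme])
    return clips


print(synthesize("Hello!"))
print(synthesize("Rain today"))
```

A production system would also choose among many recorded candidates per phoneme based on context, which is where a large and varied set of voice recordings matters.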
Voice-over artists are key actors in the natural language processing field, providing material to refine the software’s understanding of our languages and giving a voice to our AI assistants.