Google has announced an ambitious new project to develop a single AI language model that supports 1,000 of the world’s most spoken languages. As a first step toward that goal, the company presented a model trained on more than 400 languages, which it describes as “the largest language coverage seen in language models to date.”
Language and AI have arguably always been at the heart of Google’s products, but recent advances in machine learning, including the development of powerful, versatile “large language models” (LLMs), have put a renewed emphasis on these areas.
Google has already begun to integrate these language models into products such as Google Search, while fending off criticism of the systems’ functionality. Language models have a number of weaknesses, including a susceptibility to harmful societal biases such as racism and xenophobia, and an inability to parse language with human sensitivity.
Speaking to The Verge, Zoubin Ghahramani, Google’s vice president of AI research, said the company believes a model of this scale will make it easier to bring various AI capabilities to languages that are underrepresented in online spaces and AI training datasets (also known as “low-resource languages”).
“By having a single model that is exposed to and trained on many different languages, we get much better performance on our low resource languages,” says Ghahramani. “The way we get to 1,000 languages is not by building 1,000 different models. Languages are like organisms, they’ve evolved from one another and they have certain similarities. And we can find some pretty spectacular advances in what we call zero-shot learning when we incorporate data from a new language into our 1,000 language model and get the ability to translate [what it’s learned] from a high-resource language to a low-resource language.”
Previous research has shown the effectiveness of this approach, and the scale of Google’s planned model could offer significant gains over earlier work. Large-scale projects of this kind have become typical of tech companies seeking to dominate AI research. A comparable effort is the ongoing attempt by Meta, Facebook’s parent company, to build a “universal language translator.”
However, access to data is a challenge when training across so many languages, and Google says it will fund the collection of data for low-resource languages, including audio recordings and written texts, to support work on the 1,000-language model.
The company says it has no firm plans for where the model’s functionality will ultimately appear, only that it expects it to find a range of uses across Google’s products, from Google Translate to YouTube captions.