Introduction
AI-driven translation services are increasingly seen as a tool to bridge language barriers in public governance, aligning with global efforts to close digital divides and promote inclusivity. The United Nations’ proposed Global Digital Compact emphasizes “closing digital divides by guaranteeing connectivity and digital skills” and ensuring an inclusive digital future (The Global Digital Compact and a responsible, inclusive transition). Language access is a critical component of digital inclusion – yet today’s major translation platforms cover just around 100 languages, barely 1% of the world’s 7,000+ tongues (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T). These covered languages skew heavily toward those with abundant online data (often European), leaving many Asian, African, and indigenous languages underrepresented (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T). In public services, this gap translates to whole communities being unable to access information or participate fully in civic life in their native languages. Addressing this challenge with AI translation not only advances inclusivity but also supports cultural and linguistic rights, echoing the Compact’s call for a “secure digital future for all” (The Global Digital Compact and a responsible, inclusive transition) and better AI governance. Public agencies and researchers are now exploring how new AI and NLP techniques can extend high-quality translation to low-resource languages (those with scant digital data) and medium-resource languages (moderate data) – thereby helping all citizens access services and information in their mother tongue. The following analysis delves into the policy implications of this trend, the technical breakthroughs enabling it, major initiatives driving progress, case studies of specific languages (with a focus on Asia), and future directions linking these efforts to the Global Digital Compact’s goals.
Image: Speech bubbles with “hello” in multiple languages – symbolizing the multilingual communication that AI-driven translation aims to facilitate. Ensuring all languages (not just widely spoken ones) are included is key to digital inclusivity.
Policy Implications for Digital Governance and Inclusivity
AI translation for low- and medium-resource languages carries significant policy implications, especially regarding digital governance, equity, and inclusion. At a high level, it aligns with governmental goals of making digital services accessible to all citizens regardless of language. For example, the Global Digital Compact calls for the internet and digital public services to be “inclusive, open, secure and shared,” with human rights applied in the digital sphere (The Global Digital Compact and a responsible, inclusive transition). Ensuring that citizens can interact with government websites, forms, and information in their own language is fundamental to these principles.
- Language Access as a Digital Right: Treating language accessibility as part of digital rights is gaining traction. Many governments are recognizing that lack of content in one’s native language is a form of digital divide. Even languages with millions of speakers can be “low-resource” online – for instance, Hindi is one of the world’s most spoken languages but has relatively little high-quality bilingual content on the web (Low-resource Languages in AI Translation: A Business Guide). The result is that speakers of Hindi (or Assamese, Khmer, etc.) often cannot find online government information in their language. Policies are increasingly acknowledging that bridging this gap is essential for inclusive governance. New Jersey in the U.S., for example, passed a legal mandate in 2022 requiring improved language access in state services, including translated documents for programs like unemployment insurance (New Jersey shares AI translation tool materials with other states | StateScoop). This policy impetus led New Jersey to develop an AI translation assistant to help provide Spanish translations for public benefits, dramatically speeding up service delivery while maintaining accuracy (New Jersey shares AI translation tool materials with other states | StateScoop) (New Jersey shares AI translation tool materials with other states | StateScoop). Such cases illustrate how policy can drive adoption of AI translators to meet legal obligations for multilingual service delivery.
- Challenges and Gaps in Existing Policies: While some jurisdictions have language access laws or constitutional provisions for multilingualism (India’s 22 scheduled languages, South Africa’s 12 official languages, etc.), implementation is uneven. Traditional approaches (human translators) are costly and slow, leading to patchy compliance. AI offers a potential solution, but many policies do not yet address AI translation specifically. Concerns exist around quality and liability: if an automated system produces an incorrect translation of an official document, who is responsible? Government agencies tend to be risk-averse about this (New Jersey shares AI translation tool materials with other states | StateScoop). Current policies may lack guidelines on vetting AI-generated translations or integrating them with human review. There are also gaps in policies to incentivize data sharing – high-quality machine translation needs bilingual text data, and much of that is held in silos (e.g. translations of laws, court judgments, etc. that governments possess). Initiatives like the European Union’s Connecting Europe Facility have funded projects (e.g. the PRINCIPLE Project) to gather and release such public-sector data for under-resourced languages (PRINCIPLE project – PRINCIPLE Project website). However, beyond the EU, few governance frameworks systematically encourage turning government translations into open data that could fuel AI models.
- Inclusivity and Cultural Preservation: A policy dimension goes beyond service delivery to cultural inclusion. International frameworks like UNESCO’s International Decade of Indigenous Languages (2022–2032) emphasize using technology to support endangered languages. Governments partnering with tech companies to add indigenous or minority languages to AI translators see it as part of preserving culture and rights. For instance, the Government of Nunavut (Canada) worked with Microsoft to add Inuktitut and later Inuinnaqtun to Microsoft Translator (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada) (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada). This was framed not just as a tech upgrade but as a step toward reconciliation and fulfilling the rights of Inuit communities to maintain their language (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada) (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada). Policies that support such collaborations can amplify marginalized voices online. On the other hand, there are policy concerns about linguistic bias – i.e. AI systems performing poorly for certain languages. This is increasingly viewed as an AI fairness issue. Researchers note that improving translation for low-resource languages “can provide a positive impact on AI fairness, since both high-resource and low-resource languages see improvements” (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research). Regulators and AI ethics guidelines (including the Global Digital Compact’s focus on AI governance) may start urging that AI models be evaluated for how well they serve diverse languages, not just the global lingua francas.
- Governance and Quality Control: In public governance contexts, AI governance principles are being applied to translation tools. Transparency is one: agencies should know if an AI model’s output is reliable enough for official use. Some jurisdictions might require human oversight or certification of critical translations. Another policy aspect is data governance – ensuring that the data used to train translation AI (often scraped from the web) respects privacy and intellectual property, and that it includes local sources with community consent. The Global Digital Compact’s draft highlights “applying human rights in the digital sphere” and managing data responsibly (The Global Digital Compact and a responsible, inclusive transition). In practice, this means policymakers might push for community consultation when deploying AI translators for indigenous languages (to respect data sovereignty and avoid misrepresentation of sacred terms, for example). UNESCO has even issued guidelines for ethically handling indigenous language data in AI (Digital preservation of Indigenous languages: At the intersection of …). All these implications show that while AI-driven translation holds great promise for inclusivity, it needs supportive policies to guide its use. Governments must update language access laws, invest in data creation, and set quality standards so that AI translation in governance truly serves its purpose as a public good and not just a tech experiment.
Technical Advancements Enabling Low-Resource Language Translation
Recent years have seen major advances in AI and natural language processing (NLP) that are making it feasible to support translations for low- and medium-resource languages. Traditional machine translation struggled when parallel data (translations) were scarce, but new techniques allow models to generalize knowledge from other languages and even learn from monolingual text alone. Key technical developments include neural network architectures, multilingual training, knowledge distillation, retrieval-augmented translation, and zero-resource learning. We examine these in turn and how they help raise translation quality for under-served languages – including cases like Frisian (closely related to Dutch) and many others.
- Neural Networks and Multilingual NMT: The shift from phrase-based statistical MT to neural machine translation (NMT) was a game-changer. Neural models can better capture context and linguistic structure, which proved especially beneficial for languages with complex grammar or when translating between very different languages (Low-resource Languages in AI Translation: A Business Guide). Moreover, a single neural model can be trained on multiple languages, allowing transfer learning between them. A seminal effort by Google introduced a Multilingual NMT system that uses one model for many language pairs, with a special token to indicate target language (Zero-Shot Translation with Google’s Multilingual Neural Machine …). This enabled “zero-shot” translation – the model could translate a pair it never explicitly saw in training by using its internal multilingual representation (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T). For example, a model trained on English↔French and English↔Spanish learned to translate Spanish→French without direct Spanish-French data (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T). This approach is extremely useful for low-resource languages if they can piggyback on a related higher-resource language in the training mix. In practice, a low-resource language like Frisian (spoken in the Netherlands) can be combined in a model with a medium-resource language like Dutch or English, allowing the Frisian translations to benefit from the richer patterns the model learned from those larger languages. Researchers note it’s common in very low-resource cases to leverage data from the closest not-that-low-resource languages (Is EXTREMELY Low Resource Translation (1-10k sents) Possible (with Expert Knowledge)??? : r/LanguageTechnology). In fact, multilingual models naturally enable “efficient knowledge sharing across similar languages” (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research) – the system can find commonalities in vocabulary and grammar (e.g. Dutch and Frisian share many linguistic features) and use that to improve the weaker language’s output. Modern large-scale systems like Meta’s No Language Left Behind (NLLB) and Google’s thousand-language project explicitly train on hundreds of languages together, massively increasing the chance that every low-resource language has some ally (geographically or linguistically) in the training data. The result is that translation quality, even for languages with minimal direct data, has improved substantially. For instance, Microsoft’s new multilingual Z-code Mixture-of-Experts model reported a 15% quality jump on English–Slovenian (a relatively low-resource pair) compared to older models (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research), thanks to knowledge transfer and a smarter architecture that activates different “experts” for each language family.
- Knowledge Distillation and Model Compression: Another technical advance helping low-resource MT is knowledge distillation. This technique involves using a large “teacher” model or an ensemble of models to train a smaller “student” model, effectively transferring knowledge in a compressed form (Improving Low-Resource Neural Machine Translation With Teacher …). Distillation has two major benefits in this context: (1) it can create lightweight translation models that run on limited infrastructure (important if a local government or community group wants to deploy a translator offline or on a smartphone), and (2) it can distill knowledge from high-resource translations into a low-resource model. For example, researchers have distilled ensembles of multiple translation models into a single model for a low-resource language, achieving better accuracy than training the small model from scratch (Collective Wisdom: Improving Low-resource Neural Machine …). Another use of distillation is to take a powerful multilingual model and fine-tune it on a specific low-resource language by using the big model’s outputs as additional training data (a form of pseudo-labeling). Essentially, the large model “teaches” the smaller one by generating translations for sentences in the low-resource language (using, say, a pivot through a related language). This mitigates the data scarcity by producing synthetic parallel data. A notable practical example is the creation of specialized domain translators: New Jersey’s unemployment benefits translator was built by fine-tuning an off-the-shelf large language model (like GPT) on a curated glossary of agency-specific terms (New Jersey shares AI translation tool materials with other states | StateScoop) (New Jersey shares AI translation tool materials with other states | StateScoop). In effect, they distilled the expert knowledge of bilingual staff into the AI model so it could produce high-quality translations of bureaucratic terms. This process took a year of compiling and vetting terms, but it paid off with translations that officials say are on par with human translators for that domain (New Jersey shares AI translation tool materials with other states | StateScoop). Distillation thus helps incorporate expert knowledge into AI models and makes deploying MT more feasible in governance settings where computing resources might be constrained or where custom vocabulary is needed.
- Retrieval-Augmented Translation: A cutting-edge area is retrieval-based or retrieval-augmented translation. In these systems, the translation model is paired with a large database of reference sentences (a bit like a translation memory). When the model needs to translate a new sentence, it first retrieves similar sentences or phrases from the database (which could be millions of human-translated pairs from a corpus or even bilingual dictionaries). These retrieved examples are then used to guide the model’s translation of the new sentence. This approach is powerful for low-resource scenarios because even if the model itself wasn’t fully trained on a particular rare word or idiom, there’s a chance that the repository contains a close match that the model can imitate. One prominent implementation is k-Nearest Neighbor MT (kNN-MT), where, at each decoding step, the system looks up the k nearest neighbors of the current decoder state in a datastore built from previously translated examples and nudges the output toward the target words those neighbors produced ([PDF] Towards Faster k-Nearest-Neighbor Machine Translation) ([PDF] Low Resource Retrieval Augmented Adaptive Neural Machine …). For low-resource language pairs, retrieval can inject much-needed context. For example, an English→Igbo translation system might consult a lookup table of known proper names or technical terms that have been translated in the past (Igbo being low-resource, direct model training might not capture all those terms). Early studies show retrieval-augmented models can significantly improve accuracy, especially on niche vocabulary and in cross-domain settings where the training corpus is limited ([PDF] Improving Retrieval Augmented Neural Machine Translation by …) (Study on English to Igbo Translations: A Comparative Analysis of …). In public governance, this means an AI translator could incorporate, say, a government’s bilingual glossary of legal terms or previous translations of similar documents to produce more consistent results (a toy sketch of this idea appears below). The New Jersey system above effectively did a form of this by building a glossary and ensuring the AI used those mappings (New Jersey shares AI translation tool materials with other states | StateScoop). As research continues, we anticipate more robust frameworks where an MT model can query large external knowledge bases (including multilingual Wikipedia or national term banks) on the fly to translate low-resource languages with higher fidelity. Retrieval methods complement pure neural approaches and help overcome the data gap by reusing whatever parallel fragments exist for a language.
- Zero-Resource and Unsupervised Translation: Perhaps the most striking recent development is the ability to create translation models without any direct parallel training data at all – so-called zero-resource MT. This goes a step further than the zero-shot discussed earlier. Zero-shot assumed a multilingual model had other language pairs to learn from; zero-resource aims to build a translator when you have no sentence-aligned corpus for a given language pair. Techniques enabling this typically rely on monolingual data, plus either a multilingual model or clever training loops. An early breakthrough was unsupervised MT, where two monolingual corpora (for language A and B) are used to initialize two language models that are then induced to translate by iterative refinement (back-translation cycles) without ever seeing a direct A↔B example. While unsupervised results were initially rough, they demonstrated it’s possible to get a basic translation system with zero parallel sentences (e.g., English–Gujarati or English–Kazakh translators were built this way in research) ([PDF] Refining Low-Resource Unsupervised Translation by Language …). Building on this, large tech companies have scaled zero-resource methods. Google researchers recently described how they expanded Google Translate to 24 new languages using monolingual text only, by first creating sizable monolingual datasets for each (through specialized web crawling and language identification) and then training a massively multilingual model with a self-supervised learning component to enable translation without parallel data (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T) (Google Translate adds 24 languages). The result was the addition of languages like Bhojpuri, Krio, Twi, Dhivehi, and Sanskrit into Google Translate – languages for which traditional parallel corpora were practically non-existent (Google Translate adds 24 languages). This zero-resource milestone was achieved by combining techniques: training the model to jointly perform translation and language modeling, so it learns to align meanings across languages internally, and using back-translation (having the model generate synthetic translations to create pseudo-parallel data for fine-tuning; sketched below). Google reported that quality for these languages does not yet match that of high-resource languages, but the approach is a technical milestone that made it feasible to offer at least rudimentary translation for these communities (Google Translate adds 24 languages). Similarly, Meta’s NLLB-200 model used a combination of multilingual training and data augmentation (creating new training sentences via translation of related languages or via back-translation) to achieve evaluated translations in 200 languages, many of which had minimal prior support (200 languages within a single AI model: A breakthrough in high …). The takeaway is that through zero-resource techniques, any language with some written content can potentially get an AI translator. We are moving from a world where only well-resourced languages benefited from AI to one where even an endangered language can have a machine translation system, as long as we can gather some monolingual texts and possibly a bilingual lexicon.
This opens exciting possibilities for public governance: local governments or NGOs could train translation models for minority languages that have no translation databases, using only community texts and leveraging a multilingual AI backbone.
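To make some of the techniques above more concrete, the following minimal sketches illustrate them in simplified form. The first shows the core mechanism behind multilingual NMT: a single model serves many language pairs, and the target language is selected simply by a language code. This is only a sketch, assuming the Hugging Face transformers library and Meta’s openly released NLLB-200 distilled checkpoint (facebook/nllb-200-distilled-600M); the language codes follow that project’s FLORES-200 conventions, and the English→Khmer pair shown is an arbitrary example.

```python
# Minimal sketch: one multilingual model, target selected by a language code.
# Assumes the Hugging Face `transformers` library and the open NLLB-200
# distilled checkpoint; language codes follow FLORES-200 conventions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")  # English source
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate(sentence: str, target_code: str = "khm_Khmr") -> str:
    """Translate one sentence; switching `target_code` (e.g. to lao_Laoo or
    sin_Sinh) reuses the same model for a different low-resource target."""
    inputs = tokenizer(sentence, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_code),
        max_new_tokens=64,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print(translate("Where can I apply for a business licence?"))
```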
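The distillation bullet above describes a large “teacher” model generating translations that a smaller “student” learns from. The sketch below shows only the data-preparation half of that recipe (sequence-level distillation, sometimes called pseudo-labeling); `teacher_translate` is a hypothetical placeholder standing in for whichever large model is available, not a real API.

```python
# Sketch of sequence-level knowledge distillation: a large "teacher" model
# labels monolingual source text, and the resulting synthetic pairs become
# training data for a small "student" model. `teacher_translate` is a stub.
import json
from typing import Callable, Iterable

def teacher_translate(sentence: str) -> str:
    # Placeholder: in practice, call the large multilingual teacher model here.
    return f"<teacher translation of: {sentence}>"

def build_distillation_corpus(monolingual_source: Iterable[str],
                              teacher: Callable[[str], str],
                              out_path: str = "distilled_pairs.jsonl") -> None:
    """Write (source, teacher output) pairs that a lightweight student model
    can be fine-tuned on, instead of relying on scarce human translations."""
    with open(out_path, "w", encoding="utf-8") as f:
        for src in monolingual_source:
            pair = {"src": src, "tgt": teacher(src)}
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

build_distillation_corpus(
    ["Unemployment benefits are paid every two weeks.",
     "Submit the appeal form within 21 days."],
    teacher_translate,
)
```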
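The retrieval-augmentation bullet above can likewise be reduced to a toy illustration: fuzzy-match a new sentence against a small translation memory or glossary and hand the matches to the translation model as extra context. The translation-memory entries below are placeholders, and production systems such as kNN-MT retrieve at the level of decoder states rather than whole sentences; this pure-Python version only conveys the general idea.

```python
# Toy retrieval-augmented translation: fetch the closest translation-memory
# entries for a new sentence and pass them to the MT model or LLM as context.
# (Real kNN-MT retrieves per decoding step; TM entries here are placeholders.)
import difflib
from typing import List, Tuple

TRANSLATION_MEMORY = {
    "How do I renew my passport?": "<vetted translation 1>",
    "Where do I file an unemployment claim?": "<vetted translation 2>",
    "What documents are required for registration?": "<vetted translation 3>",
}

def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return up to k (source, translation) pairs most similar to the query."""
    matches = difflib.get_close_matches(query, list(TRANSLATION_MEMORY), n=k, cutoff=0.3)
    return [(m, TRANSLATION_MEMORY[m]) for m in matches]

def build_context(query: str) -> str:
    """Bundle retrieved examples with the new sentence for the translator."""
    examples = "\n".join(f"{src} => {tgt}" for src, tgt in retrieve(query))
    return f"Reference translations:\n{examples}\n\nNow translate:\n{query}"

print(build_context("How can I renew my passport online?"))
```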
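Finally, the back-translation loop at the heart of zero-resource training can be sketched in a few lines: monolingual target-language text is translated “backwards” into the source language by a reverse-direction model, and the resulting pairs are added to the forward model’s training data. Note the contrast with the distillation sketch above: there the source side was authentic and the target synthetic, while here the target side is authentic human text. `reverse_translate` is a placeholder for whatever target→source model is available, however weak.

```python
# Sketch of back-translation for data augmentation: monolingual TARGET text
# plus a target->source model yield synthetic (source, target) pairs in which
# the target side is authentic human text. `reverse_translate` is a stub.
import json
from typing import Callable, Iterable

def reverse_translate(target_sentence: str) -> str:
    # Placeholder for a target->source MT model (even a weak one is useful).
    return f"<synthetic source for: {target_sentence}>"

def back_translate(monolingual_target: Iterable[str],
                   reverse_model: Callable[[str], str],
                   out_path: str = "backtranslated_pairs.jsonl") -> None:
    """Create pseudo-parallel data; the forward (source->target) model is then
    retrained on real pairs plus these synthetic ones, and the cycle can repeat."""
    with open(out_path, "w", encoding="utf-8") as f:
        for tgt in monolingual_target:
            pair = {"src": reverse_model(tgt), "tgt": tgt}
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

back_translate(
    ["<monolingual sentence 1 in the low-resource language>",
     "<monolingual sentence 2 in the low-resource language>"],
    reverse_translate,
)
```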
In summary, technical advances like deep neural networks, cross-lingual transfer, distillation, retrieval augmentation, and zero-resource learning form a toolkit that is dramatically lowering the barrier to including low- and medium-resource languages in translation services. Languages closely related to a medium-resource tongue (e.g. Frisian vis-à-vis Dutch, Kazakh vis-à-vis Turkish, Lao vis-à-vis Thai) particularly benefit, since many techniques exploit similarity to amplify scarce data (Is EXTREMELY Low Resource Translation (1-10k sents) Possible (with Expert Knowledge)??? : r/LanguageTechnology). Even languages without such relatives or with unique scripts are seeing progress thanks to innovations in data crawling and unsupervised learning. These technical aspects provide the foundation on which companies and governments are building initiatives to realize multilingual digital governance.
Major Initiatives Supporting Low-Resource Languages
A number of major initiatives – from tech giants, international organizations, and research consortia – are actively working to improve AI translation for low-resource languages. These programs not only push the technology forward but also often involve policy and community components, recognizing that language technology development requires cross-sector collaboration. Below we overview notable initiatives from Google, Microsoft, the EU (PRINCIPLE project and others), Meta, and grassroots efforts that are making a difference.
- Google Translate & Research Efforts: Google has been a leader in expanding language coverage. In May 2022, Google Translate added 24 new languages to its platform (bringing it to 133 total), including many African, South Asian, and indigenous American languages that had never been supported before (Google Translate adds 24 languages). This expansion – which included Assamese, Bhojpuri, Dhivehi, Konkani, Maithili, Sanskrit, Tsonga, Twi, Lingala, Tigrinya, Quechua, and others – was enabled by the zero-resource techniques discussed earlier (Google Translate adds 24 languages). Google’s research division published a corresponding paper, “Building Machine Translation Systems for the Next Thousand Languages,” detailing how they gathered monolingual data and used it to unlock translation for languages with no parallel data (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T). In practice, Google’s massively multilingual model (reportedly trained on some 25 billion sentence pairs overall) can transfer knowledge from high-resource languages to low-resource ones, e.g. using French→English and Irish→English data to support French→Irish translation. Beyond Google Translate itself, Google has also open-sourced tools and datasets. For instance, the FLORES evaluation datasets (developed by Facebook/Meta but used widely) and Google’s own releases are improving the benchmarking of low-resource MT ([PDF] The FLORES Evaluation Datasets for Low-Resource Machine …). Google’s stated goal, aligned with its AI for Social Good initiatives, is to eventually support 1,000 languages with translation and speech technology. This effort echoes the Global Digital Compact’s theme of “digital commons as a public good” (The Global Digital Compact and a responsible, inclusive transition), since many of Google’s research outputs (like multilingual models or datasets) are shared openly for anyone to use. However, it also raises governance questions about a private company effectively gatekeeping which languages get digital support. Google has tried to involve communities – e.g., it credited “native speakers, professors and linguists” for helping evaluate the new languages and invited user contributions through its Translate Community platform (Google Translate adds 24 languages). This kind of participatory approach is important to ensure the technology truly serves local needs.
- Microsoft Translator and Z-Code Initiative: Microsoft’s translation team has likewise made big strides in low-resource language support, often through a mix of research and partnerships. In October 2021, Microsoft Translator hit 100 languages, adding 12 new ones that included low-resource and even endangered languages like Georgian, Kyrgyz, Mongolian (both Cyrillic and Traditional scripts), Tibetan, Uyghur, and Yoruba (Microsoft Translator: Now translating 100 languages and counting! – Microsoft Research). Many of these were added in collaboration with communities or linguists (for example, the inclusion of Inuktitut in early 2021 was done hand-in-hand with the Nunavut government sharing its bilingual legislative records (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada)). Under the hood, Microsoft credits its Project Z-Code – a Mixture-of-Experts model architecture – for enabling high-quality translation with fewer training examples (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research) (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research). Z-Code uses a giant model where different “experts” handle different language groups, allowing specialization and transfer learning among similar languages (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research). According to Microsoft, this approach “opens the way to high quality machine translation beyond the high-resource languages and improves the quality of low-resource languages that lack significant training data.” (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research) Notably, Microsoft has tied this directly to AI fairness by noting that bringing up low-resource quality narrows the gap between rich and poor languages in AI (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research). Microsoft’s initiatives also extend to domain-specific needs: they offer custom translator training via Azure Cognitive Services where even a small organization can upload whatever data they have to create a tailored model. On the community side, Microsoft launched the AI for Cultural Heritage program which funded projects to preserve languages. A success story from this is the Maori language: Microsoft worked with Maori translators in New Zealand to create a Maori translation system, respecting cultural protocols around certain words. Similarly, in India, Microsoft collaborated on translating Hindi, Marathi, and others in real-time for education as part of its Local Language Program. These efforts demonstrate a pattern: Microsoft often pairs tech innovation (like Z-Code or bilingual support in Office products) with partnerships and policy advocacy. In Canada, Microsoft’s President highlighted that adding Inuinnaqtun (with only ~600 speakers) to Translator was a direct response to Truth and Reconciliation Commission calls to support Indigenous languages. By anchoring tech projects in policy frameworks (reconciliation, inclusion mandates, etc.), Microsoft shows how corporate initiatives can align with public governance goals.
- The EU’s PRINCIPLE Project and Similar Programs: The PRINCIPLE project (Providing Resources in Irish, Norwegian, Croatian, and Icelandic for Language Engineering) is an EU-funded initiative specifically targeting public sector translation needs in under-resourced European languages (PRINCIPLE project – PRINCIPLE Project website) (PRINCIPLE project – PRINCIPLE Project website). Running from 2019–2021, it assembled a consortium of universities and a language tech company to collect and curate high-quality parallel data for those four languages, focusing on the domains of e-governance: eJustice and eProcurement (PRINCIPLE project – PRINCIPLE Project website). The idea was to feed the EU’s official eTranslation platform with better data so it can produce reliable translations of, say, legal documents or procurement notices into Icelandic, Irish, etc. (PRINCIPLE project – PRINCIPLE Project website) (PRINCIPLE project – PRINCIPLE Project website). PRINCIPLE worked by recruiting national institutions to donate translations (e.g., the National Library of Norway contributed data) and had “early adopter” government offices pilot the MT systems to give feedback (PRINCIPLE project – PRINCIPLE Project website) (PRINCIPLE project – PRINCIPLE Project website). By the end, it uploaded a trove of vetted bilingual texts into ELRC-SHARE (the EU’s language resource repository) for ongoing use (PRINCIPLE project – PRINCIPLE Project website) (PRINCIPLE project – PRINCIPLE Project website). PRINCIPLE is just one example; the EU has a series of efforts under its Connecting Europe Facility and Horizon programs aimed at digital language equality. Other projects like European Language Grid, ELE (European Language Equality), and regional efforts (e.g., fostering MT for the Baltic languages, or initiatives for Basque and Catalan) all contribute to building MT for “medium-resource” languages that are official in smaller countries or regions. These initiatives underscore a policy principle: multilingual governance. The EU, by necessity, must operate in two dozen languages and is extending that ethos to include minority languages, supported by AI. It sets a precedent for other multi-lingual nations or federations – invest in data and AI for all your languages, not just the most spoken. Also, the EU model stresses open results: data and tools arising from these projects are usually made public, aligning with the notion of digital public goods.
- Meta (Facebook) and Open-Source Breakthroughs: Meta’s AI research has launched a high-profile initiative called No Language Left Behind, which produced the NLLB-200 model capable of translating between 200 languages with strong results (200 languages within a single AI model: A breakthrough in high …). What’s significant is that Meta open-sourced this model (the 54 billion-parameter version and smaller variants) and the evaluation dataset FLORES-200 (Meta AI Research Topic – No Language Left Behind). NLLB involved creating or collecting data for many languages that previously had almost no resources – including a number of Asian (like Lao, Khmer, Acehnese), African (Wolof, Lingala, Luganda, etc.), and Pacific languages. Meta also tackled cases like Myanmar (Burmese), which, despite tens of millions of speakers, had sparse parallel data due to limited web presence and a unique script. NLLB applied novel training techniques (like tagging languages to handle multiple scripts, data balancing to not let high-resource languages dominate the training, and intensive back-translation for low-resource pairs) to boost these languages. The result was often a huge quality leap – internal tests showed 10+ BLEU point improvements on some low-resource translations over prior systems (Microsoft taps AI techniques to bring Translator to 100 languages) (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research). In addition to text, Meta has looked at speech translation for low-resource languages. One notable project translated Hokkien, an oral language with no widely used writing system, using speech-to-speech modeling – a feat of zero-resource learning that can be extended to other minority languages that lack standard text corpora. Meta’s work is significant for public governance because of its open-source stance: by releasing models and inviting collaboration, it enables local researchers or governments (who may not have huge AI labs) to leverage state-of-the-art technology. In fact, Meta recently partnered with UNESCO on a Language Technology Partnership Program to get local contributors involved in building MT for indigenous languages (Meta Partners With UNESCO on Indigenous Language Translation | PCMag) (Meta Partners With UNESCO on Indigenous Language Translation | PCMag). They are asking governments and NGOs to contribute recordings and translations, and one of the first to join is the government of Nunavut for Inuit languages (Meta Partners With UNESCO on Indigenous Language Translation | PCMag). This model – big tech providing infrastructure and know-how, international bodies providing convening power, and local authorities providing data and validation – could become a blueprint for addressing the long tail of languages. It’s essentially a multi-stakeholder approach to the global digital divide in languages, resonating with the Open Government Partnership’s idea that we need multi-stakeholder models to deliver the Global Digital Compact goals (Five Ways OGP’s Multi-Stakeholder Model Can Deliver the Global …).
- Grassroots and Other Notable Initiatives: Beyond the tech giants and governments, there are grassroots efforts making remarkable contributions. Masakhane, for example, is a volunteer community of African AI researchers and translators who collaborate on open-source machine translation for dozens of African languages. They have published research on MT for languages like Swahili, Yoruba, Amharic, Shona, and more – many of which had little presence in mainstream MT research before. Masakhane emphasizes participatory research, where native speakers drive the data collection and evaluation, ensuring culturally informed outcomes. Likewise, in South Asia, projects like AI4Bharat in India focus on open-source models for Indian languages. AI4Bharat, based at IIT Madras, has released translation models (such as IndicTrans for 11 Indic languages) and datasets, complementing the Indian government’s efforts through the Bhashini platform (Bhashini CEO Amitabh Nag talks about how their AI tool is bridging India’s language divide – India Today). Bhashini (National Language Translation Mission) is a major government initiative aiming to “enable all Indians easy access to the internet and digital services in their own language” (Under Bhashini, IISc to open source 16000 hours of speech data). It provides APIs and tools for translation and speech in 22 scheduled Indian languages, and hosts an ecosystem of 300+ AI language models (Tech Talk: India needs more local language models, datasets … – Mint). Notably, Bhashini is built on an open approach – many of its models are open-source, and it crowdsources data via apps where citizens can contribute translations or validate outputs (India turns to AI to capture its 121 languages for digital services). This mass collaboration approach is helping create datasets for languages like Sindhi, which previously had virtually no digital corpus (Bhashini CEO Amitabh Nag talks about how their AI tool is bridging India’s language divide – India Today). In academia, initiatives like the LoResMT Workshop (Low-Resource Languages for MT) and special tracks at conferences (e.g., WMT’s low-resource shared tasks on languages like Tamil–English or Livonian–English) are spurring research interest. International organizations also pitch in: UNICEF and Translators without Borders have an open project named AI4D for African languages, releasing datasets for French-Swahili, etc., to aid humanitarian communication. All these efforts, big and small, feed into a virtuous cycle: better data and models for low-resource languages lead to their inclusion in widely used platforms (Google, Microsoft, etc.), which in turn raises public awareness and usage, encouraging more investment. Ultimately, these initiatives support a more inclusive digital public square, where speakers of all languages can be heard and can access information – fulfilling the vision of leaving no one behind in the digital age.
Case Studies: Low-Resource Languages in Focus
To ground the discussion, we look at several case studies of low-resource languages – particularly from Asia – and examine how they are being included in AI-driven translation efforts. These examples highlight the progress made, the remaining gaps, and the real-world impact of translation technology on communities.
- Sinhala and Nepali (South Asia): Sinhala (Sinhalese) and Nepali are official languages of Sri Lanka and Nepal, each spoken by millions, yet historically considered “very low-resource” in the MT field. Until a few years ago, translation systems for these languages were rudimentary due to scarce parallel corpora. This began to change with focused efforts: researchers created new evaluation datasets (e.g., the FLORES-101 and FLORES-200 benchmarks include Sinhala–English and Nepali–English test sets) ([PDF] The FLORES Evaluation Datasets for Low-Resource Machine …), which spurred competition and innovation. Facebook’s team in 2019 demonstrated that using back-translation and multilingual training could greatly improve Sinhala and Nepali translation quality, even though direct training data was tiny ([PDF] On the Off-Target Problem of Zero-Shot Multilingual Neural Machine …). By incorporating Sinhala and Nepali into a multilingual model with related languages (like Hindi or Bengali, which share some script or vocabulary similarities), they achieved understandable translations where earlier systems failed. Both languages have in fact been available in Google Translate since the mid-2010s; what has changed recently is quality, as they were folded into massively multilingual models that transfer knowledge from related languages. The improvements have had immediate practical benefits. For instance, during the COVID-19 pandemic, critical health information and guidelines could be quickly translated and disseminated in Nepali and Sinhala via these AI systems. However, challenges remain: the quality is still not as high as for, say, Spanish or Chinese. Idioms and complex syntax often trip up the models, and human post-editing is needed for important texts. Local initiatives are complementing the big players – in Nepal, the Nepali Wikipedia community and universities are collaborating to build bigger parallel corpora (like translating TED talks into Nepali, etc.), and in Sri Lanka, NGOs have worked on Tamil–Sinhala–English corpora given the trilingual policy environment. These case studies show that inclusion is underway, but continued effort is required to reach truly high-quality MT. They also illustrate how languages with moderate speaker populations but low digital presence are finally getting attention, aligning with digital divide concerns.
- Khmer and Lao (Southeast Asia): Khmer (Cambodian) and Lao are two neighboring languages with unique scripts (abugida systems) and relatively few digital resources. Khmer has around 16 million speakers and Lao about 7 million, yet for a long time neither had robust machine translation available. The complexity of their scripts and lack of standardized romanization made data crawling difficult (language identification algorithms often struggled to recognize Khmer vs random Unicode). Google Translate added both languages in the early 2010s, initially with phrase-based models that provided only rudimentary translations. The shift to neural MT improved things significantly – e.g., a comparative study found NMT cut Khmer→English translation errors by more than half compared to the older system. Still, progress was slow; these languages were not part of most multilingual research models until recently. Meta’s NLLB included both Khmer and Lao, and reported large gains in BLEU scores, thanks to the model learning from related Thai and Vietnamese data and effective use of monolingual text. In fact, the NLLB model ranked much better on Khmer and Lao translation than Google’s system at the time, according to the FLORES evaluation. This demonstrates how research models can push the envelope for languages that commercial systems haven’t fully optimized. On the usage side, translation tech is crucial in these countries for government transparency. For example, Cambodia publishes many official documents in Khmer; having decent English translations via AI allows international organizations and researchers to understand local developments. Conversely, translating international content into Khmer/Lao helps local populations. An anecdotal case: during the ASEAN meetings, real-time translation systems were trialed to translate between English and Khmer for delegates – using a fine-tuned engine built on top of Google’s API with additional training on past speeches. It worked moderately well, illustrating potential for such tech in diplomacy and regional governance. However, these languages also show the need for local capacity building: the number of Khmer or Lao computational linguists is small. Investment in local NLP talent and perhaps a regional ASEAN language tech initiative could ensure that improvements in MT for these languages continue and that they aren’t left solely to Silicon Valley priorities.
- Lesser-Used Indian Languages: India’s linguistic landscape provides several case studies. While Hindi and a few others (Tamil, Telugu) have moderate resources, many official languages like Oriya (Odia), Assamese, Manipuri (Meiteilon), and Kannada have much less digital content. In 2022’s Google Translate expansion, Assamese (25 million speakers) and Meiteilon (~2 million) were added for the first time (Google Translate adds 24 languages) (Google Translate adds 24 languages). Inclusion of Meiteilon (the language of Manipur state) was especially notable – it has an alternate script (Meitei Mayek) and limited online presence. Google’s team had to build a custom font recognition and data pipeline just to get enough monolingual text to train a model. The result was a working translator, which local journalists and officials began using to produce rough drafts of content in Manipuri. Similarly, Assamese inclusion has helped a state government initiative that translates agricultural advisories for farmers: what used to be done by hand for a handful of documents can now be automated for many more, then lightly edited by bilingual staff. Bhashini, as mentioned, is systematically tackling these languages with open-source models. It has released baseline NMT models for Maithili and Sanskrit, languages with limited parallel data but significant cultural importance (Sanskrit, though classical, is used in official mottos and religious texts, and Bhashini’s model helps translate simple sentences to and from Sanskrit for heritage preservation). One success story is Sindhi: Bhashini reports generating digital data for Sindhi “where none existed before” by crowdsourcing sentence translations (Bhashini CEO Amitabh Nag talks about how their AI tool is bridging India’s language divide – India Today). This data is now feeding MT engines. The impact is that Sindhi speakers (about 25 million in India and Pakistan) might soon have AI-assisted translation for web content, aiding education and administration in Sindhi. A key lesson from India’s case studies is the importance of language-specific customization. Each language had unique hurdles – whether script, dialect variation, or lack of standard terminology – and addressing those often required involving native linguists and domain experts. The payoff is visible in more inclusive governance: e.g., Indian courts exploring real-time translation of proceedings from English to local languages so litigants can follow in their mother tongue (Bhashini CEO Amitabh Nag talks about how their AI tool is bridging India’s language divide – India Today). This is being piloted with Hindi and could extend to others, fundamentally transforming access to justice.
- Inuktitut and Inuinnaqtun (Indigenous North America): As a contrast to Asian cases, it’s worth revisiting the Inuktut example from Canada as a case study. Inuktitut, spoken in Nunavut, had one significant dataset – the Nunavut Hansard (parliament transcripts) – which was used to create a statistical MT in the 2000s. But quality was low and it wasn’t in public use. The partnership with Microsoft changed that: by updating to neural models and involving the community in testing, the translation quality improved to usable levels (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada) (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada). A year after launching, feedback from Inuit users helped identify errors and the model was retrained (“improved with use” as the system gathered more corrections) (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada). The addition of Inuinnaqtun, a severely endangered sister dialect, is a powerful demonstration of AI aiding language preservation – it shows that even languages with a few hundred speakers can be supported if you combine community effort with advanced AI. The social impact in Nunavut has been significant: unilingual Inuktut speakers can now use a smartphone app to translate English medical instructions on the fly (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada), and younger Inuit have a tool to help them learn Inuinnaqtun by comparing translations. The Nunavut case exemplifies many of the themes discussed: policy (a government took initiative, aligning with cultural policy goals), technical (had to adapt to a polysynthetic language and syllabic script), and multi-stakeholder approach (tech company + local government + heritage society). It’s a model that could be replicated for, say, certain indigenous languages in Asia like Bodo in India or Kadazan in Malaysia – languages where local governments could spearhead data collection and partner with AI firms to get them on the MT map.
These case studies collectively show that inclusion of low-resource languages in AI translation is no longer a distant dream but an ongoing reality. Asian languages from Sinhala to Assamese, Southeast Asian languages, and minority languages globally are progressively being covered. The quality may vary, but the trajectory is clearly toward improvement. Each success story also highlights what is needed to sustain it: local expertise, data, funding, and continuous research attention. Importantly, bringing these languages online via translation not only helps speakers consume content but can also encourage content production in those languages (since people know it can be translated out for a wider audience). That virtuous cycle can help revitalize languages and ensure they thrive in the digital era.
Future Directions and Emerging Trends
Looking ahead, several research directions and technological trends are likely to shape the future of AI translation for low- and medium-resource languages – and by extension, influence how public governance can leverage these tools. We identify key areas to watch, as well as notable research works that underpin them, connecting them to the broader goals of inclusivity and better AI governance in the Global Digital Compact.
- Beyond Text: Multimodal and Speech Translation: One frontier is extending translation to speech and other modalities for low-resource languages. Meta’s recent demo of speech-to-speech translation for Hokkien (an oral language) hints at what’s possible: using audio recordings to train models that can translate spoken input to spoken output without needing text. This could be revolutionary for many endangered languages that have rich oral traditions but limited written corpora. Similarly, efforts like Mozilla’s Common Voice (which collects voice data for many languages) and Meta’s Massively Multilingual Speech project (covering 1,000+ languages in speech recognition) will provide the raw material to do speech translation. For public services, this means down the line we might have real-time voice translators for patient-doctor communication in minority languages, or for citizens calling a government hotline and conversing in their native tongue. It also raises new governance questions: how to ensure accuracy in speech translation (which is doubly hard due to speech recognition errors), and how to protect privacy when potentially sensitive spoken data is processed by cloud AI. Research papers like Jia et al.’s work on direct speech-to-speech translation and advances in automatic speech recognition (ASR) for low-resource languages (some Indian languages ASR models are now on par with high-resource ASR after being trained on massive data via Bhashini’s Vaani project (Under Bhashini, IISc to open source 16000 hours of speech data)) are paving the way. The convergence of speech and text translation will make AI translators more accessible (no need to be literate to use them) and more embedded in devices (e.g., handheld translator gadgets for field outreach in multi-lingual societies).
- Integration of Large Language Models (LLMs): The rise of LLMs like GPT-4, which are trained on vast amounts of multilingual data, is also influencing machine translation. These models were not specifically built for translation, but they have shown surprisingly strong translation abilities – even for language pairs they haven’t seen much, through few-shot prompting or chain-of-thought reasoning. A future direction is LLM-assisted translation, where a large model can be used to post-edit or refine translations from a dedicated MT system, or even handle translation tasks directly for low-resource languages by leveraging its general world knowledge. For example, an LLM could be prompted with a few example translations for a low-resource language and then translate new sentences with moderate success. This approach might be useful for languages with extremely little data (no parallel text, but maybe some descriptive grammars or dictionaries), essentially performing translation as an emergent capability. Early experiments in research (e.g., LLM-driven zero-shot MT and using LLMs to aid rule-based MT for no-resource languages (Survey of Low-Resource Machine Translation – MIT Press Direct) (Is EXTREMELY Low Resource Translation (1-10k sents) Possible (with Expert Knowledge)??? : r/LanguageTechnology)) suggest this is promising. OpenAI’s GPT models and Anthropic’s Claude have been noted to support some low-resource languages to an extent (Is EXTREMELY Low Resource Translation (1-10k sents) Possible (with Expert Knowledge)??? : r/LanguageTechnology). The benefit for public governance is that LLMs could act as universal translators in a pinch – imagine a civic chatbot that can communicate in any input language by internally translating via an LLM. However, the caveat is that LLMs can hallucinate or make errors, so combining them with factual translation memories or domain-specific models (hybrid systems) will be an important research direction. Key papers to watch include “Prompting techniques for MT” and “Evaluation of GPT-4 on low-resource translation” (some very recent studies have shown GPT-4 approaching or exceeding the quality of specialized MT systems in high-resource settings, but low-resource evaluation is ongoing).
- Continuous Learning and Community Feedback Loops: Another trend is incorporating user feedback at scale to improve translation systems. We saw glimpses of this in New Jersey’s case (staff corrections used to update the model) and Nunavut’s case (community testing leading to improvements). Future MT systems, especially those deployed in governance, may have built-in active learning – e.g., if a translated document is reviewed by a human who fixes errors, the corrections could be fed back for the model to learn from. This is similar to how Google’s search or voice recognition improves with user interactions. The challenge in a government context is maintaining quality control and not allowing malicious inputs to degrade the model. Nonetheless, research in online learning for NMT and federated learning (where models learn from user data without that data leaving the user’s device) might allow secure feedback loops. Over time, this means the translator for a given language pair used by a government agency could become increasingly specialized and accurate in the language of bureaucracy or law simply by learning from every deployment. Projects like Opus-MT (an open project releasing baseline models which users can fine-tune on their data) and papers on domain adaptation for NMT are relevant here. As these techniques mature, policymakers might need to consider procedures for validating model updates and ensuring that the learning process doesn’t introduce bias – essentially an aspect of AI governance dealing with model lifecycle.
- Inclusion in Global Benchmarks and Research: A subtle but important future direction is the normalization of low-resource languages in mainstream AI research. For years, MT progress was reported mostly on languages like Chinese, French, English (because test sets existed there). Now, with benchmarks like FLORES-200 (covering 200 languages) (Meta AI Research Topic – No Language Left Behind) and initiatives like WMT Shared Tasks on Low-Resource MT, progress (or lack thereof) on low-resource languages is visible and trackable. This helps direct attention and funding. We can expect new research papers that specifically tackle long-tail languages, perhaps proposing novel architectures suited for data-scarce regimes (like meta-learning approaches that learn how to learn new language translations from minimal data). One interesting line of research is character-level or byte-level models that can handle any script (useful for languages with unique scripts or no standard orthography). Another is leveraging cross-lingual embeddings to quickly extend lexicons – e.g., Facebook’s LASER embeddings project creates vector representations such that sentences with the same meaning cluster together regardless of language, which can be used for translation retrieval in zero-shot scenarios. The more such research is done, the more tools will trickle down into practical use. Academic collaborations with local universities in Asia, Africa, Latin America will be crucial so that research isn’t just centralized in English-speaking labs. The Global Digital Compact’s emphasis on “capacity building” for digital skills ([PDF] Global Digital Compact: rev. 2 | Digital Emerging Technologies) resonates here: empowering scholars from under-represented language communities to partake in AI development ensures solutions are culturally informed and sustainable.
- Policy and Ethics Research: Finally, future directions are not just technical. There is a growing body of interdisciplinary work examining the ethical and social implications of AI language technology. For example, questions of bias: does the MT system inadvertently favor more dominant languages even when translating (e.g., by choosing more formal tone that sounds “unnatural” in the target low-resource language)? How to respect dialectal differences – a translation model for “Malay” might need to handle Malaysian Malay and Indonesian Bahasa variants neutrally, which is as much a sociopolitical decision as a technical one. Research and papers on “Bias in Multilingual AI” and “Language technology and linguistic human rights” will inform governance. We might see frameworks for AI language rights, where citizens can expect certain levels of service in their language and recourse if AI translations cause harm (like serious misunderstandings). Additionally, work on data governance might lead to international agreements on sharing linguistic data as a common resource – a direct tie-in to the Digital Compact’s call for “digital commons” (The Global Digital Compact and a responsible, inclusive transition). Imagine a future where countries agree to pool publicly funded translations (parliament transcripts, public brochures, etc.) into a global repository that any AI can learn from – this could exponentially increase data for low-resource languages and firmly establish translation models as a public good.
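To make the LLM-assisted translation direction above concrete, the sketch below builds a few-shot prompt of the kind described and sends it to a model. This is a sketch under stated assumptions: `call_llm` is a hypothetical placeholder for whatever hosted or locally run large language model an agency uses, and the demonstration pairs are placeholders to be replaced with vetted human translations.

```python
# Sketch of few-shot, LLM-assisted translation. `call_llm` is a placeholder
# for any hosted or local large language model; the demonstration pairs and
# the target language are placeholders to be filled with vetted examples.
from typing import Callable, List, Tuple

FEW_SHOT_PAIRS: List[Tuple[str, str]] = [
    ("Good morning.", "<vetted example translation 1>"),
    ("Where is the nearest clinic?", "<vetted example translation 2>"),
]

def few_shot_prompt(sentence: str, target_language: str) -> str:
    """Assemble demonstrations plus the new sentence into one prompt."""
    demos = "\n".join(f"English: {en}\n{target_language}: {tgt}"
                      for en, tgt in FEW_SHOT_PAIRS)
    return (f"Translate from English into {target_language}.\n\n"
            f"{demos}\n\nEnglish: {sentence}\n{target_language}:")

def translate_with_llm(sentence: str, target_language: str,
                       call_llm: Callable[[str], str]) -> str:
    """Send the prompt to the model and return its (unverified) translation."""
    return call_llm(few_shot_prompt(sentence, target_language)).strip()

# Usage with a dummy model call; a real deployment would also route the output
# through human review or a domain glossary check before publication.
print(translate_with_llm("The office is closed on public holidays.", "Sinhala",
                         lambda prompt: "<LLM output>"))
```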
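The feedback-loop idea above can be grounded in an equally small sketch: reviewers’ post-edits are logged in a structured file, and once enough vetted corrections accumulate they are handed to whatever fine-tuning or adaptation pipeline the deployed MT system supports. The file name and retraining threshold here are arbitrary illustrations, not part of any particular product.

```python
# Sketch of a human-in-the-loop feedback store for a deployed translator.
# Reviewers' corrections are appended to a JSONL log; once enough accumulate,
# the log becomes fine-tuning data. Paths and thresholds are illustrative.
import json
import os

FEEDBACK_LOG = "post_edits.jsonl"
RETRAIN_THRESHOLD = 500  # number of vetted corrections before a retraining run

def log_correction(source: str, machine_output: str, human_fix: str) -> None:
    """Record one reviewed segment: the source, the raw MT, and the post-edit."""
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        record = {"src": source, "mt": machine_output, "reference": human_fix}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def ready_to_retrain() -> bool:
    """True once enough corrections exist to justify a fine-tuning run."""
    if not os.path.exists(FEEDBACK_LOG):
        return False
    with open(FEEDBACK_LOG, encoding="utf-8") as f:
        return sum(1 for _ in f) >= RETRAIN_THRESHOLD

log_correction("Benefits are paid every two weeks.",
               "<raw machine translation>",
               "<reviewer-corrected translation>")
if ready_to_retrain():
    pass  # hand post_edits.jsonl to the model's fine-tuning / adaptation pipeline
```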
In conclusion, the future of AI translation for low- and medium-resource languages looks dynamic and promising. With sustained research, inclusive policy-making, and collaboration across governments, academia, and industry, we can envision a world where every citizen can instantly bridge language gaps to engage with information and services. This directly supports the Global Digital Compact’s goals of enhancing inclusivity and closing digital divides, and it demonstrates a constructive path for AI governance – one where governance isn’t just about restricting AI, but proactively guiding it to uplift marginalized communities and languages. Research milestones and proactive policies, working hand in hand, will drive us toward more equitable multilingual digital governance, ensuring that no language is left behind in our AI-powered future.
References (selected):
- Guterres, A. Roadmap for Digital Cooperation, United Nations, 2020 – outlines the vision leading to the Global Digital Compact (The Global Digital Compact and a responsible, inclusive transition).
- Milengo Business Guide. Low-resource Languages in AI Translation – discusses how AI is expanding to “rare” languages and mentions EU’s PRINCIPLE project (Low-resource Languages in AI Translation: A Business Guide).
- StateScoop. “New Jersey shares AI translation tool…” – case of NJ using AI to meet language access law, with Spanish glossary and LLM integration (New Jersey shares AI translation tool materials with other states | StateScoop).
- Microsoft Research. “Z-code Mixture of Experts models…” – technical blog on improving low-resource translation quality via transfer learning and MoE (Microsoft Translator enhanced with Z-code Mixture of Experts models – Microsoft Research).
- Google AI Blog. “Unlocking Zero-Resource MT to Support New Languages…” – explains how 24 new languages were added with monolingual data and zero-shot techniques (Google Translate adds 24 languages) (Unlocking Zero-Resource Machine Translation to Support New Languages in Google T).
- Meta AI. “No Language Left Behind” – reports open-sourcing of a 200-language MT model and emphasis on low-resource improvements (200 languages within a single AI model: A breakthrough in high …).
- India Today. Interview with Bhashini CEO – highlights open-source AI models for 22 Indian languages and use-cases in governance (courts, farmers) (Bhashini CEO Amitabh Nag talks about how their AI tool is bridging India’s language divide – India Today).
- Microsoft News. “Nunavut…and Microsoft” – describes adding Inuktitut and Inuinnaqtun to Translator and community impact (Government of Nunavut preserving endangered Inuit languages and culture with the help of artificial intelligence and Microsoft – Microsoft News Center Canada).
- Reddit r/LanguageTechnology discussion – (community insight) on leveraging closely related languages for extremely low-resource MT (Is EXTREMELY Low Resource Translation (1-10k sents) Possible (with Expert Knowledge)??? : r/LanguageTechnology).