India AI Impact Summit promises need a real plan — link public procurement to vernacular AI standards and build a national language data commons, freely accessible to researchers and startups
Published Date – 11 April 2026, 11:53 PM

By Siri Dandur Chandrashekar, Dr Maya K
When Prime Minister Narendra Modi’s speech at the India AI Impact Summit 2026 was translated live into Indian languages at Bharat Mandapam, it was a striking image, held up as proof that India’s AI moment had arrived. India has 22 scheduled languages. Not all of them were in the room. For a country where nine in ten people don’t speak English at home, that gap, small enough to go unnoticed at a summit and large enough to exclude millions, is precisely the story the applause drowned out.
The AI summit, the first high-level AI gathering hosted in the Global South, declared that AI stands for “All Inclusive” and must be “a multiplier, not a monopoly.” Only about 10% of India’s population speaks English, and just 0.02% speak it as a native language. Hindi is the mother tongue of roughly 43.6% of Indians, but Bengali, Marathi, Tamil, Telugu, Kannada, Odia, Punjabi, and Assamese together account for hundreds of millions more, each with distinct scripts, grammatical structures and cultural contexts that English-dominant models are poorly equipped to handle.
The Sarvam Moment
The summit’s most concrete language AI product came from Bengaluru-based Sarvam AI, which unveiled Vikram, an open-source model released in 30-billion and 105-billion parameter versions. Elsewhere, AI4Bharat has gathered 300,000 hours of raw speech, including 6,000 hours of transcribed data, across the 22 scheduled languages. BharatGen is building a multimodal Large Language Model (LLM) for all 22, and Gnani.ai handles millions of multilingual voice interactions daily for several financial services companies.
But Vikram’s benchmark numbers are so far self-reported, and the harder test, whether any of these models can reason in Indian languages rather than merely respond in them, remains unverified. The ambition is real. The gap between research and deployment is equally real.
Data Wall
Every model is only as good as the data it is trained on, and here India faces a structural problem that investment alone cannot quickly solve. English has decades of digitised text, labelled datasets, and internet content. Languages like Bodo, Dogri and Santali, scheduled languages with millions of speakers, have almost no usable digital training data at all.
Then there is the labelling problem. Building good AI requires human annotators who are fluent, culturally aware, and able to evaluate model outputs. OpenAI’s IndQA benchmark required 261 Indian researchers and linguists to build culturally grounded prompts across 11 Indic languages. Doing that systematically across 22 languages and hundreds of dialects requires a workforce and funding pipeline that barely exist in organised form today.
• The recent AI summit’s real test isn’t in investment pledges. It will be seen in everyday moments — an anganwadi worker in Jharkhand logging child nutrition by voice in Santali, without forms or intermediaries; an elderly woman in a Meghalaya village accessing a government health scheme in Khasi, without her grandson’s help
Code-switching, the way Indian speakers routinely mix two or more languages within a single sentence, adds another layer. Linguist Aravind Joshi’s foundational work identified it as one of the hardest problems in computational linguistics decades ago. It remains unsolved. IBM’s MILU benchmark evaluates AI not just on grammar and syntax but also on cultural fluency across 41 subjects, including law, health and history; most newly announced models have not been tested against it. For healthcare and governance, that gap is not academic. A mistranslated dosage instruction isn’t an inconvenience; it is a harm.
Real Progress, Real Gaps
The IndiaAI Mission has over Rs 10,300 crore allocated and has scaled to 38,000 GPUs; Sarvam’s Vikram is a direct product of that subsidised access. Bhashini supports 22 Indian languages across 350 AI models. But accountability remains thin: clear data governance policies are still missing, and regulatory frameworks for data privacy and AI ethics are largely absent.
Efforts to build multilingual AI for government services also risk getting stuck in bureaucratic delays. The critical question: how much of the mission’s allocation is specifically directed toward low-resource language development, as opposed to general infrastructure?
The Bedrock Others Built
India is not the first country to face this problem. Norway built the Norwegian Language Bank (Språkbanken), a publicly funded repository of text and speech data freely available for AI training, years before large language models became a mainstream concern. Norwegian is today among the best-served non-English languages in AI, not because Norway has the largest tech industry, but because it treated language data as public infrastructure. A national language data commons, freely accessible to every researcher and startup, would do more for multilingual AI than any GPU procurement announcement.
Wales offers the sharpest example of what a mandate alone can accomplish. The Welsh Language Act of 1993 put Welsh on an equal footing with English in public services, effectively making Welsh-language capability a requirement for public digital services. Technology companies that wanted government contracts had to make their products work in Welsh, a language spoken by roughly 850,000 people. Welsh now has voice assistants and digital services that work as reliably as their English equivalents, not because the market demanded it, but because policy created the demand.
• Only about 10% of Indians speak English — and just 0.02% as a first language. While Hindi reaches 43.6%, hundreds of millions more speak Bengali, Marathi, Tamil, Telugu, Kannada, Odia, Punjabi, and Assamese — each with distinct scripts, grammar, and cultural context that English-first AI simply cannot capture
The most transferable model for India comes from Africa. The Masakhane project, a grassroots Natural Language Processing initiative, built open-source models and datasets for over 50 African languages by recruiting, training and paying local community speakers to label data in their own languages. The results outperformed those of centralised, outsourced labelling workforces because the annotators understood the cultural context behind the words they tagged.
India has 1.4 billion people and a vast pool of speakers of every scheduled language. A structured community annotation program would simultaneously solve the data scarcity problem, create dignified local employment, and produce more accurate training data than any outsourced alternative.
The linguistic complexity that makes this hard is precisely what makes it valuable. Whoever builds AI that works authentically across Indian languages unlocks agricultural advisory, healthcare navigation, government service delivery, vernacular fintech and legal aid — sectors representing the vast majority of India’s population.
Crack it here, and you have a replicable playbook for Indonesia, Nigeria and Brazil. India’s diversity isn’t a complication to be managed; it is the size of the prize. Investors who back vernacular AI now are not doing charity; they are entering a market that is vast, largely untouched, and will only grow.
The Only Measure That Matters
India made history at this summit. It also made promises. The history is already recorded. The promises now need a plan: one that ties public procurement to vernacular AI performance standards and funds community annotation at scale.
The summit’s real test won’t be measured in parameter counts or investment pledges. It will be measured the first time an anganwadi worker in Jharkhand logs child nutrition data by voice, in Santali, without a form or a translator. The first time an elderly woman in a Meghalaya village navigates a government health scheme in Khasi without her grandson’s help. The first time an AI conversation in India feels less like a translation and more like a mother tongue. That moment hasn’t arrived yet. Whether the $250 billion announced in Delhi brings it closer is the only question that matters.
Regional Intelligence: Where India stands now
Sarvam AI: Released two open-source reasoning models, Sarvam-30B and Sarvam-105B, trained from scratch entirely in India on compute provided under the IndiaAI Mission. Both use a Mixture-of-Experts (MoE) architecture built with in-house data curation, tokenisation, training, and inference pipelines. Sarvam-30B, with only 2.4B active parameters, is optimised for real-time deployment and powers Samvaad, the company’s conversational agent platform. Sarvam-105B, with a 128K context window, is designed for complex reasoning and agentic workflows and powers Indus, the company’s AI assistant. Both models are available on Hugging Face and AI Kosh under an Apache 2.0 licence. On Indian language benchmarks, Sarvam-105B wins 90% of comparisons on average and Sarvam-30B 89%, outperforming several significantly larger global models.
AI4Bharat: Its Automatic Speech Recognition programme spans all 22 constitutionally recognised languages, combining large-scale data crawling with ground-level collection across over 400 districts. The dataset includes 300,000 hours of raw speech, 6,000 hours of transcribed data, and 6,400 hours of mined audio-text pairs, further augmented by pseudo-labelled data from sources like YouTube. Its state-of-the-art models, IndicWav2Vec, IndicWhisper, and IndicConformer, are benchmarked on evaluation suites such as Vistaar, IndicSUPERB, Lahaja, and Svarah. The latest model supports all 22 languages, with future work targeting 8 kHz telephony data, domain-specific adaptation, and offline functionality for low-resource languages.
BharatGen: The world’s first government-funded multimodal LLM initiative, backed by India’s Department of Science & Technology under the National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS) and implemented by IIT Bombay in partnership with IIIT Hyderabad, IIT Kanpur, IIT Mandi and others. It supports text, speech, and image modalities across 22 Indian languages, with a focus on India-centric data collection, open-source development, and cultural representation. Formally launched in October 2024, the project is being executed through a network of 25 Technology Innovation Hubs, with a roadmap extending through July 2026 and targeting sectors including governance, healthcare, agriculture, and education.
Gnani.ai: A Bengaluru-born agentic AI and voice infrastructure company that launched Vachana STT, a foundational speech-to-text model trained on over one million hours of real-world Indian voice data under the IndiaAI Mission. The system currently processes approximately 10 million calls per day at sub-200 ms latency, supporting both real-time and batch transcription across more than 15 Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, and Malayalam. One of only four companies selected under the IndiaAI Mission to build sovereign foundational AI models, Gnani.ai serves over 200 large enterprises, including the Tata Group, Air India, Airtel, and Bank of Baroda.

(Siri Dandur Chandrashekar is a second-year undergraduate student and Dr Maya K is an Assistant Professor in the Department of Economics, CHRIST (Deemed to be University), Bengaluru)
