Building a Voice AI for 22 Indian Languages: What We Learned Designing TARA for the Real Indian Workforce

When we started building TARA, the first decision we made was also the most consequential: we would not build an English-first product with Indian language support added later. We would build a multilingual product from the ground up, where English is one of 23 supported languages (22 Indian languages plus English), not the default.

This sounds straightforward. In practice, it required rethinking almost every architectural assumption behind standard voice AI design: how we transcribe speech to text, how sentiment analysis works across grammatical structures that have no English analogue, and how we detect bias signals in conversations where the language, and with it the emotional framing, can change mid-sentence.

The Code-Switching Problem

The defining characteristic of multilingual Indian professional conversations is code-switching: the fluid movement between two or more languages within a single conversation, and often within a single sentence. A manager in Bengaluru discussing a field employee's performance might say: 'Aapka Q2 ka kaam bahut accha tha, the OKR numbers were strong, but the handover documentation, that needs improvement.' (The Hindi opening means "Your Q2 work was very good.")

For a voice AI designed for English, this input is a failure state. For TARA, it is the expected input. Our transcription engine is trained specifically on code-switched Indian professional speech, rather than assembled from separate single-language models stitched together. The distinction matters: isolated language models lose context at each switch boundary, whereas a model trained on code-switched speech keeps the narrative continuous through the switch.
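To make "switch boundary" concrete, here is a minimal sketch that splits a romanized Hindi-English utterance into same-language spans. It is not TARA's transcription engine: the tiny word list `HINDI_ROMANIZED` is a hypothetical stand-in for a trained token-level language identifier, and the names `Span`, `tag_token`, `split_at_switches` and `count_switches` are illustrative only. What it shows is where a pipeline of isolated per-language models would have to hand off between models.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical toy lexicon of romanized Hindi words, for illustration only.
# A real system would use a trained token-level language identifier instead.
HINDI_ROMANIZED = {"aapka", "ka", "kaam", "bahut", "accha", "tha"}


@dataclass
class Span:
    """A run of consecutive tokens tagged with the same language."""
    lang: str
    tokens: list[str] = field(default_factory=list)


def tag_token(token: str) -> str:
    """Label one token 'hi' (romanized Hindi, per the toy lexicon) or 'en' (everything else)."""
    word = token.strip(".,;:'\"").lower()
    return "hi" if word in HINDI_ROMANIZED else "en"


def split_at_switches(utterance: str) -> list[Span]:
    """Group tokens into same-language spans; each span after the first begins at a switch boundary."""
    spans: list[Span] = []
    for token in utterance.split():
        lang = tag_token(token)
        if spans and spans[-1].lang == lang:
            spans[-1].tokens.append(token)  # same language: extend the current span
        else:
            spans.append(Span(lang, [token]))  # language changed: start a new span
    return spans


def count_switches(spans: list[Span]) -> int:
    """Number of switch boundaries, i.e. transitions between adjacent spans."""
    return max(len(spans) - 1, 0)


if __name__ == "__main__":
    for span in split_at_switches("Aapka Q2 ka kaam bahut accha tha, the OKR numbers were strong"):
        print(span.lang, " ".join(span.tokens))
```

Run on the opening of the manager's sentence, the sketch yields four spans (hi, en, hi, en), with "Q2" already an English island inside the Hindi clause. A chain of isolated per-language models would hand off at each of those three boundaries, which is exactly where context gets lost.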

Sentiment Inference Across Grammatical Structures

Sentiment analysis in English relies heavily on adjective-noun structures and explicit evaluative language: 'good work,' 'poor performance,' 'excellent outcome.' In many Indian languages, evaluation is conveyed through verb form, tonal emphasis, and contextual implication rather than explicit adjectives. A Telugu speaker conveying that an employee 'did well enough' might use a verb construction that, translated literally, reads as neutral — but carries a distinctly positive connotation in context.

Training sentiment models on literal translations of Indian language speech produces systematically incorrect results. TARA's sentiment models were trained on contextually annotated Indian language performance conversations — not translated text — specifically to capture these nuances.

What This Means for Bias Detection

TARA looks for three kinds of bias signal in these conversations:

Language-linked bias: When the same event is described positively in English and negatively in the regional language, TARA flags the divergence for HR review.
Tonal bias: Audio-level analysis catches tonal signals (impatience, warmth, dismissiveness) that text transcription alone cannot capture.
Structural bias: TARA tracks which topics managers spend the most time on, what questions they ask, and whether the conversation structure differs significantly between employees.
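The language-linked and structural checks above can be approximated with simple comparisons. The sketch below is a minimal illustration, not TARA's documented method. It assumes per-language sentiment scores in [-1, 1] for the same event and per-topic talk time in seconds are already available; the function names, both thresholds, and the use of total variation distance are illustrative assumptions. Tonal bias depends on audio features and is not covered here.

```python
from __future__ import annotations

# Hypothetical thresholds, chosen for illustration only (not TARA's values).
SENTIMENT_GAP_THRESHOLD = 1.0
STRUCTURE_GAP_THRESHOLD = 0.3


def language_divergence(scores_by_lang: dict[str, float]) -> bool:
    """Flag an event described positively in one language and negatively in another.

    Scores are assumed to lie in [-1, 1]; the flag also requires the gap between
    the most positive and most negative score to reach SENTIMENT_GAP_THRESHOLD.
    """
    values = list(scores_by_lang.values())
    if len(values) < 2:
        return False
    top, bottom = max(values), min(values)
    return top > 0 > bottom and (top - bottom) >= SENTIMENT_GAP_THRESHOLD


def topic_shares(seconds_by_topic: dict[str, float]) -> dict[str, float]:
    """Convert per-topic talk time into fractions of the whole conversation."""
    total = sum(seconds_by_topic.values())
    return {t: s / total for t, s in seconds_by_topic.items()} if total else {}


def structural_gap(a: dict[str, float], b: dict[str, float]) -> float:
    """Total variation distance between two topic-time distributions (0 = identical, 1 = disjoint)."""
    pa, pb = topic_shares(a), topic_shares(b)
    topics = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(t, 0.0) - pb.get(t, 0.0)) for t in topics)


def structural_flag(a: dict[str, float], b: dict[str, float]) -> bool:
    """Flag two employees' conversations whose structure differs beyond the threshold."""
    return structural_gap(a, b) >= STRUCTURE_GAP_THRESHOLD


if __name__ == "__main__":
    print(language_divergence({"en": 0.6, "te": -0.5}))
    print(structural_gap({"delivery": 30, "growth": 30}, {"delivery": 60}))
```

Total variation distance is one simple way to compare how talk time is distributed across topics; Jensen-Shannon divergence would serve a similar purpose. Any real thresholds would presumably need calibrating against conversations that HR has actually reviewed.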

The Indian workforce is not a simplified version of a Western workforce that happens to speak different languages. It is a fundamentally different organisational environment with its own communication patterns, cultural norms, and professional conventions. Building AI for it requires starting from that reality — not from an English-language baseline and working backward.
