
Multilingual LLMs and Localization: Building AI for a Global World

A comprehensive guide to multilingual large language models covering cross-lingual transfer, tokenization challenges, cultural adaptation, and production strategies for deploying AI systems that serve users across languages and cultures.

9 min read

With over 7,000 languages spoken globally, the English-centric nature of most large language models represents a significant limitation on AI's potential impact. Multilingual LLMs aim to bridge this gap, enabling AI systems that understand and generate text across languages. However, building truly effective multilingual systems involves far more than training on diverse data—it requires careful consideration of tokenization, cross-lingual transfer, cultural context, and the inherent trade-offs between language coverage and per-language quality.

The Multilingual Challenge

Language modeling at scale seems straightforward: train on text from many languages, and the model learns to handle them all. Reality is more complex. English dominates the internet and, consequently, most training corpora. Common Crawl, the foundation of many training datasets, contains roughly 46% English content, while languages like Swahili, Bengali, or Thai represent fractions of a percent each.

This imbalance creates cascading effects. Models trained on data proportional to web prevalence effectively become English-primary systems with limited capability in other languages. But over-sampling low-resource languages to achieve balance degrades high-resource language performance and can introduce noise from lower-quality sources.

The challenge extends beyond data quantity. Different languages have different:

Morphological complexity: English has relatively simple word formation; Turkish or Finnish have extensive agglutination where single words carry what would be entire phrases in English.

Writing systems: Latin script, Cyrillic, Arabic, Chinese characters, and dozens of other scripts require different tokenization approaches.

Syntactic structure: Word order, case marking, and grammatical gender vary dramatically across language families.

Cultural context: Concepts, idioms, and pragmatic conventions differ across cultures, affecting appropriate responses.

Resource availability: Evaluation benchmarks, parallel corpora, and linguistic expertise are abundant for some languages and nearly nonexistent for others.

Building effective multilingual systems requires addressing all these challenges simultaneously while managing computational and data constraints.

Tokenization: The Foundation of Multilingual Models

Tokenization determines how text is split into the discrete units that models process. For multilingual models, tokenization choices have outsized impact on performance, efficiency, and fairness across languages.

The Fertility Problem

Standard subword tokenization (BPE, WordPiece, Unigram) learns merge patterns from training data. When English dominates training, the vocabulary becomes English-optimized:

  • "understanding" → 1 token
  • "Verständnis" (German for "understanding") → 3-4 tokens
  • "समझ" (Hindi for "understanding") → 4-6 tokens

This disparity, called fertility, means non-English text requires more tokens to encode the same semantic content. The consequences are severe:

Cost: API pricing is per-token. Hindi users pay 3-5× more than English users for equivalent content.

Context limits: A 4K context window holds ~3K English words but only ~1K Hindi words worth of content.

Quality: Higher fertility correlates with reduced generation quality. The model must predict more tokens, each with less semantic content.

Latency: More tokens means more generation steps, increasing response time.
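
To see the fertility gap concretely, the following sketch counts tokens per whitespace-delimited word for parallel sentences. It assumes the tiktoken package with the cl100k_base encoding (an approximation of the GPT-4 tokenizer); the sample sentences are illustrative:

```python
# Rough fertility check: tokens per word for parallel sentences in several languages.
# Assumes `pip install tiktoken`; cl100k_base approximates the GPT-4 tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Illustrative parallel sentences (same meaning, different languages).
samples = {
    "en": "Understanding improves with practice.",
    "de": "Das Verständnis verbessert sich mit Übung.",
    "hi": "समझ अभ्यास के साथ बेहतर होती है।",
}

baseline = None
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    fertility = n_tokens / n_words          # tokens per word
    baseline = baseline or fertility        # treat the first (English) entry as 1.0x
    print(f"{lang}: {n_tokens} tokens, {fertility:.2f} tokens/word, "
          f"{fertility / baseline:.1f}x vs English")
```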

Vocabulary Expansion Strategies

Several approaches address the fertility gap:

Larger vocabularies: Increasing from 32K to 100K+ tokens allocates more vocabulary to non-English languages. Qwen-3 uses 151K tokens; multilingual models from Google and Meta use 250K+. Diminishing returns and memory costs limit this approach.

Language-specific tokenizers: Train separate tokenizers per language, then combine. This ensures each language gets appropriate coverage but complicates the model architecture.

Balanced sampling during tokenizer training: Over-sample low-resource languages when learning BPE merges. This reduces fertility disparity at the cost of slight English efficiency loss.

Character or byte-level fallback: Use subword tokens for common patterns but fall back to characters or bytes for unknown text. ByT5 and CANINE explore this approach.
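
As a rough illustration of balanced sampling during tokenizer training, the sketch below feeds a BPE trainer a stream that picks languages uniformly rather than proportionally to corpus size. It assumes the Hugging Face tokenizers package; the tiny in-memory corpora stand in for real datasets:

```python
# Sketch: train a BPE tokenizer on a language-balanced stream instead of raw web proportions.
# Assumes the HuggingFace `tokenizers` package; corpora here are tiny illustrative stand-ins.
import random
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpora = {
    "en": ["the model reads text"] * 10_000,      # high-resource language
    "sw": ["mfano unasoma maandishi"] * 100,      # low-resource language
}

def balanced_lines(corpora, n_samples=50_000, seed=0):
    """Draw each training line from a uniformly chosen language, oversampling small corpora."""
    rng = random.Random(seed)
    langs = list(corpora)
    for _ in range(n_samples):
        lang = rng.choice(langs)                  # uniform over languages, not over lines
        yield rng.choice(corpora[lang])

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=100_000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(balanced_lines(corpora), trainer=trainer)
tokenizer.save("balanced-bpe.json")
```

Real pipelines typically use temperature-weighted rather than strictly uniform language sampling (see the data mixing section below), but the mechanism of biasing the training stream is the same.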

Modern Tokenizer Comparisons

| Model | Vocabulary Size | Avg. Non-English Fertility | Approach |
|---|---|---|---|
| GPT-4 | 100K | 1.8× | BPE with expanded vocab |
| Llama-3 | 128K | 1.6× | Tiktoken with multilingual data |
| Qwen-3 | 151K | 1.4× | Language-balanced BPE |
| BLOOM | 250K | 1.3× | Unigram with massive vocab |
| mT5 | 250K | 1.3× | SentencePiece multilingual |

The trend is clear: newer models invest in larger, more balanced vocabularies. The efficiency gains from reduced fertility outweigh the memory cost of additional embeddings.

Tokenization for Specific Scripts

Some scripts require special handling:

CJK (Chinese, Japanese, Korean): Character-based tokenization often works well since characters carry semantic meaning. However, Japanese mixes three scripts (hiragana, katakana, kanji) requiring unified handling.

Arabic: Right-to-left text, connected letters, and diacritics complicate tokenization. Arabic also has significant dialectal variation (Egyptian, Gulf, Levantine) that models often fail to distinguish.

Indic scripts: Multiple scripts with complex combining characters. Proper handling requires Unicode normalization and understanding of script-specific rules.

Code-mixing: Users frequently mix languages within sentences (Hinglish = Hindi + English). Tokenizers must handle mixed-script text gracefully.
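
A minimal, standard-library sketch of the preprocessing this implies: Unicode NFC normalization (important for Indic combining characters) and a rough script check to flag code-mixed input. The first-word-of-the-Unicode-name heuristic is a simplification, not a production script detector:

```python
# Sketch: normalize Indic text and flag mixed-script (code-mixed) inputs before tokenization.
# Uses only the Python standard library; the script heuristic is a simplification.
import unicodedata

def preprocess(text: str) -> str:
    # NFC composes base characters with combining marks, which many Indic scripts require
    return unicodedata.normalize("NFC", text)

def scripts_used(text: str) -> set[str]:
    """Rough script detection from Unicode character names (e.g. 'DEVANAGARI LETTER KA')."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])   # first word of the name is usually the script
    return scripts

text = preprocess("mujhe यह movie पसंद है")    # Hinglish: Latin + Devanagari in one sentence
print(scripts_used(text))                      # e.g. {'LATIN', 'DEVANAGARI'}
```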

Cross-Lingual Transfer: How Models Learn to Generalize

A remarkable property of multilingual models is cross-lingual transfer: learning in one language improves performance in others. A model fine-tuned on English question-answering often improves at German QA without seeing German examples. Understanding this phenomenon is key to building effective multilingual systems.

The Shared Representation Hypothesis

Cross-lingual transfer suggests that multilingual models develop language-agnostic representations—an internal "interlingua" where meaning is encoded similarly regardless of surface language. Evidence for this includes:

Parallel sentence alignment: Sentences with identical meaning in different languages cluster together in embedding space.

Zero-shot transfer: Task learning transfers across languages without explicit training.

Translation emergence: Models can translate between language pairs never seen together in training.

However, the picture is more nuanced. Recent research (2024-2025) reveals that models often process non-English inputs by implicitly converting to English-like representations, performing reasoning, then converting back. This "English pivot" behavior suggests true language-agnostic representation may be an approximation rather than reality.
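
The parallel-sentence evidence is easy to reproduce with off-the-shelf multilingual embeddings. The sketch below assumes the sentence-transformers package; the model name is one multilingual option among several, and the sentences are illustrative:

```python
# Sketch: check that translations land near each other in a multilingual embedding space.
# Assumes the `sentence-transformers` package; the model is one multilingual option.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The weather is nice today.",          # English
    "Das Wetter ist heute schön.",         # German translation of the same sentence
    "I forgot my keys at the office.",     # unrelated English sentence
]
embeddings = model.encode(sentences, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

print(f"EN vs DE (parallel):  {float(similarities[0][1]):.2f}")   # typically high
print(f"EN vs EN (unrelated): {float(similarities[0][2]):.2f}")   # typically much lower
```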

Transfer Patterns

Not all language pairs transfer equally:

Language family effects: Transfer works best within language families. Spanish-Italian transfer is strong; Spanish-Japanese transfer is weak.

Typological similarity: Languages with similar syntax transfer better regardless of family. SOV languages (Japanese, Korean, Turkish) share transfer patterns despite being unrelated.

Script effects: Shared scripts facilitate transfer. The Arabic script connects Arabic, Farsi, and Urdu despite different language families.

Resource effects: High-resource languages serve as better source languages. English-to-X transfer is usually stronger than X-to-English, reflecting English's training dominance.

Altruistic, Selfish, and Stagnant Languages

Research has identified distinct roles that languages play during multilingual training:

Altruistic languages: Improve performance on related languages when added to training. French improves Spanish and Italian; Hindi improves Urdu.

Selfish languages: Primarily benefit themselves, contributing little to other languages. Often typologically isolated languages like Basque.

Stagnant languages: Neither benefit from nor contribute to other languages. Often very low-resource languages with insufficient training signal.

Understanding these patterns helps design training data mixtures that maximize positive transfer.

Cross-lingual In-Context Pre-training

A 2025 technique called Cross-lingual In-context Pre-training (CrossIC-PT) explicitly encourages transfer by interleaving semantically related texts in different languages within the same context window:

Rather than:

```
[English document 1] [English document 2] [French document 1] [French document 2]
```

CrossIC-PT uses:

```
[English doc about climate] [French doc about climate] [English doc about politics] [French doc about politics]
```

This forces the model to connect semantically similar content across languages, strengthening the shared representation. Experiments show 3-4% improvement across languages using this approach.
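
A minimal sketch of this style of packing: group documents by topic so that related texts in different languages share a context window. Topic labels are given explicitly here, whereas real pipelines would cluster by translation links or embedding similarity; this illustrates the idea rather than the paper's exact implementation:

```python
# Sketch of CrossIC-PT-style packing: interleave related documents on the same topic
# across languages within one context window, instead of grouping documents by language.

def pack_cross_lingual(docs, max_docs_per_window=4):
    """docs: list of (topic, lang, text). Returns context windows mixing languages per topic."""
    by_topic = {}
    for topic, lang, text in docs:
        by_topic.setdefault(topic, []).append(text)

    windows, current = [], []
    for texts in by_topic.values():
        for text in texts:                  # related texts in different languages sit adjacent
            current.append(text)
            if len(current) == max_docs_per_window:
                windows.append(" ".join(current))
                current = []
    if current:
        windows.append(" ".join(current))
    return windows

docs = [
    ("climate", "en", "Global temperatures rose again this year ..."),
    ("climate", "fr", "Les températures mondiales ont encore augmenté cette année ..."),
    ("politics", "en", "Parliament debated the new budget ..."),
    ("politics", "fr", "Le Parlement a débattu du nouveau budget ..."),
]
for window in pack_cross_lingual(docs):
    print(window[:120])
```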

Multilingual Model Performance Analysis

Understanding how different models perform across languages helps select the right model for your use case.

Benchmark Performance by Language Family

Performance varies significantly by language family due to training data distribution and linguistic similarity to English:

| Language Family | Representative Languages | GPT-4 | Claude 3.5 | Llama-3 70B | Qwen-2.5 |
|---|---|---|---|---|---|
| Germanic | German, Dutch, Swedish | 95% | 94% | 88% | 85% |
| Romance | French, Spanish, Italian | 94% | 93% | 87% | 84% |
| Slavic | Russian, Polish, Czech | 88% | 87% | 78% | 80% |
| Sino-Tibetan | Chinese (Mandarin) | 91% | 90% | 82% | 96% |
| Japonic | Japanese | 89% | 88% | 75% | 88% |
| Dravidian | Tamil, Telugu, Kannada | 72% | 71% | 58% | 65% |
| Indo-Iranian | Hindi, Bengali, Farsi | 78% | 77% | 65% | 72% |
| Afroasiatic | Arabic, Hebrew, Amharic | 80% | 79% | 68% | 75% |
| Niger-Congo | Swahili, Yoruba, Zulu | 65% | 64% | 48% | 55% |

Scores represent average performance across MMLU, TruthfulQA, and task completion benchmarks, normalized to English = 100%

Quality Tiers by Language

Based on production testing, languages can be grouped into quality tiers:

Tier 1 (>90% of English quality):

  • English, German, French, Spanish, Italian, Dutch, Portuguese
  • Suitable for production applications requiring high reliability

Tier 2 (75-90% of English quality):

  • Chinese, Japanese, Korean, Russian, Arabic, Hindi, Polish, Turkish
  • Suitable for most production uses with quality monitoring

Tier 3 (50-75% of English quality):

  • Thai, Vietnamese, Indonesian, Greek, Hebrew, Czech, Romanian
  • Suitable for assisted tasks, requires human review for critical applications

Tier 4 (<50% of English quality):

  • Low-resource languages: Swahili, Amharic, Yoruba, Burmese, Khmer
  • Experimental use only, significant quality gaps

Cross-Lingual Task Performance

Different task types show different cross-lingual patterns:

| Task Type | High-Resource Transfer | Low-Resource Transfer | Notes |
|---|---|---|---|
| Classification | 85-95% | 60-80% | Transfers well |
| NER | 75-90% | 50-70% | Language-specific patterns matter |
| QA | 80-92% | 55-75% | Knowledge availability varies |
| Summarization | 78-88% | 50-70% | Grammatical differences impact quality |
| Generation | 70-85% | 40-65% | Most challenging transfer |
| Translation | N/A | N/A | Requires parallel data |

Token Efficiency Analysis

The fertility gap (tokens per semantic unit) significantly impacts cost and performance:

| Language | Avg Tokens/Word | Fertility vs English | Cost Multiplier |
|---|---|---|---|
| English | 1.3 | 1.0× | 1.0× |
| Spanish | 1.4 | 1.1× | 1.1× |
| German | 1.6 | 1.2× | 1.2× |
| French | 1.5 | 1.15× | 1.15× |
| Chinese | 2.1 | 1.6× | 1.6× |
| Japanese | 2.8 | 2.2× | 2.2× |
| Korean | 2.5 | 1.9× | 1.9× |
| Arabic | 2.4 | 1.8× | 1.8× |
| Hindi | 3.2 | 2.5× | 2.5× |
| Thai | 4.1 | 3.2× | 3.2× |

Based on GPT-4 tokenizer; newer tokenizers (Qwen-3, Llama-3) show improved efficiency

Training Multilingual Models

Training effective multilingual models requires careful decisions about data mixing, curriculum design, and capacity allocation.

Data Mixing Strategies

The fundamental question is how to weight languages in training data:

Proportional sampling: Sample each language proportionally to its prevalence. Result: English dominates, low-resource languages underfit.

Temperature-based sampling: Apply a temperature to the language distribution: $P(L_i) \propto P_{\text{raw}}(L_i)^{1/T}$

Higher temperature flattens the distribution, giving more weight to low-resource languages. Exponents of 1/T ≈ 0.3-0.5 (i.e., T ≈ 2-3) are common. A short sketch comparing these weightings appears at the end of this subsection.

Square root sampling: Sample proportional to the square root of language size. A middle ground between proportional and uniform.

Uniform sampling: Equal weight to all languages regardless of data size. Low-resource languages get maximum exposure but risk overfitting to limited data.

Dynamic sampling: Adjust weights during training based on per-language validation loss, allocating more compute to languages that are still making progress.

Empirical findings suggest different strategies for different training phases:

  • Early training: More uniform sampling to establish representations
  • Later training: More proportional sampling to maximize high-resource quality
  • Fine-tuning: Focus sampling on the target language or language group
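
The weighting strategies above differ mainly in the exponent applied to raw corpus sizes. The following minimal sketch (with illustrative corpus sizes) shows how the resulting sampling probabilities compare:

```python
# Sketch: per-language sampling probabilities under different mixing strategies.
# Corpus sizes are illustrative token counts, not real dataset statistics.
corpus_tokens = {"en": 1_000_000, "de": 200_000, "hi": 50_000, "sw": 5_000}

def sampling_probs(sizes, exponent=1.0):
    """exponent=1.0 -> proportional, 0.5 -> square root, ~0.3 -> temperature (1/T), 0.0 -> uniform."""
    weights = {lang: n ** exponent for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

for name, exp in [("proportional", 1.0), ("temperature 1/T=0.3", 0.3),
                  ("square root", 0.5), ("uniform", 0.0)]:
    probs = sampling_probs(corpus_tokens, exp)
    print(name, {lang: round(p, 3) for lang, p in probs.items()})
```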

Capacity Allocation

Multilingual models face the "curse of multilinguality": performance on any single language typically lags behind a monolingual model of equivalent size. The model's capacity must be shared across many languages.

Strategies to mitigate this:

Scale: Simply make models larger. GPT-4 and Claude handle 95+ languages competently through sheer scale.

Modular architectures: Use language-specific components (embedding layers, adapter modules) while sharing core capacity.

Mixture of Experts: Different experts specialize in different language families. MoE provides effective capacity scaling with sublinear compute increase.

Language-specific fine-tuning: Start with a multilingual base, then fine-tune separate versions per language or language group.

Continual Pre-training for New Languages

Adding new languages to an existing model (continual pre-training) risks catastrophic forgetting of existing languages. Techniques to mitigate this:

Data replay: Mix new language data with samples from original training languages.

Elastic weight consolidation: Regularize to prevent important weights from changing too much.

Progressive expansion: Gradually increase the new language's weight in the training mix.

Adapter-based addition: Add language-specific adapters rather than modifying base weights.

Recent work on "rethinking multilingual continual pretraining" (2024-2025) suggests that careful data mixing with even small amounts of original-language data prevents most forgetting while enabling efficient language addition.
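
A minimal sketch of data replay, assuming in-memory document lists; the 10% replay fraction is an illustrative choice, not a recommended value:

```python
# Sketch: data replay for continual pre-training -- mix the new language with a small
# fraction of the original mixture to limit catastrophic forgetting.
import random

def replay_stream(new_lang_docs, original_docs, replay_fraction=0.1, seed=0):
    """Yield training docs, drawing `replay_fraction` of samples from the original mixture."""
    rng = random.Random(seed)
    while True:
        if rng.random() < replay_fraction:
            yield rng.choice(original_docs)     # replayed original-language sample
        else:
            yield rng.choice(new_lang_docs)     # new target-language sample

stream = replay_stream(new_lang_docs=["dokumen bahasa baru ..."],
                       original_docs=["original English document ..."])
batch = [next(stream) for _ in range(8)]
print(batch)
```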

Fine-Tuning Multilingual Models

Fine-tuning multilingual models for specific tasks introduces additional considerations beyond monolingual fine-tuning.

Cross-lingual Fine-tuning Strategies

English-only fine-tuning: Train only on English task data, rely on transfer. Simple but leaves performance gaps for distant languages.

Translate-train: Machine-translate English training data to target languages. Effective but translation quality limits ceiling.

Multilingual fine-tuning: Collect or generate task data in multiple languages. Best results but highest data cost.

Few-shot multilingual: Fine-tune on English with few examples from target languages. Often achieves most of multilingual fine-tuning's benefit with much less data.

Adapter-Based Approaches

Parameter-efficient fine-tuning methods are particularly valuable for multilingual settings:

Language adapters: Small modules trained per-language that modulate the base model. The base model stays frozen and shared.

Task + Language adapters: Separate adapters for task knowledge (shared across languages) and language-specific adaptation. Compose them at inference.

AdaMergeX (2025): Adaptively merges adapters from source languages to adapt to target languages. Enables zero-shot transfer to new languages by combining existing adapters.
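
To make the adapter composition concrete, here is a small PyTorch sketch of a frozen base layer combined with a shared task adapter and per-language adapters. The stacking order, sizes, and names are illustrative rather than any specific library's API:

```python
# Sketch: compose a shared task adapter with per-language adapters over a frozen base layer.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck adapter

class AdaptedLayer(nn.Module):
    def __init__(self, base_layer, task_adapter, language_adapters):
        super().__init__()
        self.base = base_layer                          # frozen, shared across languages
        self.task = task_adapter                        # shared task knowledge
        self.lang = nn.ModuleDict(language_adapters)    # one adapter per language

    def forward(self, x, lang="de"):
        h = self.base(x)
        h = self.lang[lang](h)                          # language-specific adaptation
        return self.task(h)                             # then task-specific adaptation

layer = AdaptedLayer(
    base_layer=nn.Linear(768, 768),
    task_adapter=BottleneckAdapter(),
    language_adapters={"de": BottleneckAdapter(), "hi": BottleneckAdapter()},
)
for p in layer.base.parameters():
    p.requires_grad = False                             # only the adapters are trained

out = layer(torch.randn(2, 768), lang="hi")
```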

FLARE: Latent Fusion for Better Transfer

A late 2024 approach called FLARE integrates source and target language representations within LoRA adapters' bottleneck layers. By performing latent fusion of English and target-language representations, FLARE reduces the performance gap between English and other languages to 8-12% on average, compared to 20-30% gaps with standard fine-tuning.

Cultural Adaptation and Localization

True multilingual capability extends beyond language to culture. A model that translates literally without cultural adaptation produces technically correct but pragmatically wrong outputs.

The Transfer-Localization Tradeoff

Cross-lingual transfer is desirable for factual knowledge:

  • "What is the speed of light?" should get the same answer regardless of query language.

But localization is desirable for culturally-situated responses:

  • "What should I eat for breakfast?" should reflect local cuisine
  • "Who is the president?" should understand country context from language
  • Humor, idioms, and politeness norms vary by culture

Research identifies this as the "transfer-localization plane"—a framework for quantifying both desirable knowledge transfer and undesirable "cultural erasure" when models over-transfer.

Cultural Capabilities

High-quality multilingual models demonstrate:

Idiomatic expression understanding: Recognizing that "it's raining cats and dogs" means heavy rain, with equivalent idiom use in other languages.

Register adaptation: Adjusting formality based on cultural norms. Japanese requires explicit politeness levels; English formality is more implicit.

Cultural knowledge: Understanding local holidays, customs, historical figures, and current events for each language's primary cultures.

Code-switching handling: Naturally mixing languages as bilingual speakers do, rather than enforcing artificial language boundaries.

Pragmatic appropriateness: Responses that would be appropriate in the cultural context of the query language.

Evaluation for Cultural Competence

Standard benchmarks often miss cultural aspects. Evaluating cultural competence requires:

Localized benchmarks: Test sets created by native speakers reflecting their cultural context, not translated from English.

Human evaluation: Automated metrics miss pragmatic appropriateness. Native speaker judgment is essential.

Comparative evaluation: Same semantic query in different languages should sometimes get different answers (local context) and sometimes the same (factual knowledge).

Red-teaming: Testing for cultural stereotypes, biases, and inappropriate generalizations.

Case Studies: Multilingual Deployment

Case Study 1: Global Customer Support

A multinational corporation deployed multilingual LLMs for customer support across 40+ countries:

Challenge:

  • 15 primary languages, 40+ secondary languages
  • Varying quality requirements by market (Tier 1 markets need 95%+ accuracy)
  • Real-time response requirements (<2s latency)
  • Cost constraints ($0.02/query budget)

Solution Architecture:

```
User Query → Language Detection → Router
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
              Tier 1 Model   Tier 2 Model    Tier 3 Model
              (GPT-4)        (Claude 3.5)    (Llama-3 70B)
                    │               │               │
                    └───────────────┼───────────────┘
                                    ▼
                            Response + QA Check
```

Results:

  • Tier 1 languages: 94% customer satisfaction
  • Tier 2 languages: 87% customer satisfaction
  • Tier 3 languages: 72% satisfaction (with human escalation path)
  • 60% cost reduction vs. human-only support

Lessons learned:

  • Language detection fails on roughly 8% of short queries—use the user's profile language as a fallback
  • Code-switching in Asian markets requires specialized handling
  • Cultural adaptation crucial for satisfaction scores

Case Study 2: Multilingual Content Generation

A media company generates content in 12 languages for regional audiences:

Requirements:

  • Generate 1000+ articles/day across languages
  • Maintain brand voice consistency
  • Culturally appropriate content
  • SEO optimization per market

Architecture:

  1. Generate master content in English
  2. Adapt (not translate) to each target language
  3. Apply cultural localization rules
  4. Human review for Tier 1 markets

Quality metrics:

| Language | Automation Rate | Human Edit Rate | Reader Engagement |
|---|---|---|---|
| Spanish | 85% | 15% | 98% of English |
| German | 82% | 18% | 96% of English |
| Japanese | 68% | 32% | 91% of English |
| Arabic | 65% | 35% | 88% of English |
| Hindi | 55% | 45% | 82% of English |

Case Study 3: Multilingual RAG System

An enterprise deployed RAG across documents in 8 languages:

Challenges:

  • Documents in multiple languages (EN, DE, FR, ES, IT, PT, ZH, JA)
  • Users query in their preferred language
  • Need cross-lingual retrieval (German query → French document)

Solution:

  • Multilingual embeddings (mE5, multilingual-e5-large)
  • Cross-lingual reranking with multilingual model
  • Answer generation in query language
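
A minimal sketch of the cross-lingual retrieval step, assuming the sentence-transformers package and the multilingual-e5-large model (which expects "query:"/"passage:" prefixes); the documents and query are illustrative:

```python
# Sketch: cross-lingual retrieval with a multilingual embedding model.
# Assumes `sentence-transformers`; multilingual-e5 expects "query:"/"passage:" prefixes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

documents = [
    "passage: La politique de congés s'applique à tous les employés permanents.",  # French
    "passage: Alle Rechnungen müssen innerhalb von 30 Tagen bezahlt werden.",       # German
]
query = "query: Wer hat Anspruch auf die Urlaubsregelung?"   # German query, French answer doc

doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))   # cross-lingual hit: German query -> French doc
```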

Performance:

| Query Lang → Doc Lang | Recall@10 | Answer Quality |
|---|---|---|
| Same language | 92% | 94% |
| Cross-lingual (related) | 85% | 88% |
| Cross-lingual (distant) | 71% | 78% |

Production Deployment

Deploying multilingual models in production involves considerations beyond monolingual deployment.

Language Detection and Routing

Production systems need to identify query language for:

Model selection: Route to language-specific models if using a model suite rather than one multilingual model.

Response language matching: Generate responses in the query language (or user's preferred language).

Evaluation and monitoring: Track per-language quality metrics.

Language detection is usually accurate for single-language inputs but struggles with:

  • Very short queries
  • Code-mixed text
  • Similar languages (Norwegian/Swedish, Hindi/Urdu)
  • Transliterated text (Hindi written in Latin script)
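
A minimal sketch of detection with fallbacks, assuming the langdetect package; the short-query heuristic and profile fallback are illustrative choices:

```python
# Sketch: language detection with fallbacks for short or ambiguous inputs.
# Assumes the `langdetect` package; thresholds and fallbacks are illustrative.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0                    # langdetect is stochastic without a fixed seed

def resolve_language(query: str, user_profile_lang: str | None = None,
                     default: str = "en") -> str:
    text = query.strip()
    if len(text.split()) < 3 and user_profile_lang:
        return user_profile_lang            # short queries: trust the user's profile instead
    try:
        return detect(text)                 # e.g. "de", "hi", "ja"
    except Exception:                       # empty or non-linguistic input
        return user_profile_lang or default

print(resolve_language("Wie setze ich mein Passwort zurück?"))      # -> "de"
print(resolve_language("ok thx", user_profile_lang="hi"))           # -> "hi" (profile fallback)
```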

Quality Variance Across Languages

Multilingual models have inconsistent quality across languages. Production systems should:

Monitor per-language metrics: Track quality scores, user feedback, and task success rates by language.

Set quality thresholds: Gate features that require high reliability for languages where quality is insufficient.

Provide fallback paths: When quality is low, offer machine translation to English or human escalation.

Communicate limitations: Be transparent with users about which languages are well-supported vs. best-effort.
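
These practices can be wired into a simple routing policy. The sketch below is illustrative configuration only; the thresholds and quality scores are made-up values, not measured ones:

```python
# Sketch: per-language quality gating with fallback routes.
from dataclasses import dataclass

@dataclass
class LanguagePolicy:
    quality_score: float        # rolling eval / feedback score for this language (0-1)
    min_quality: float = 0.85   # gate for fully automated responses

POLICIES = {
    "de": LanguagePolicy(quality_score=0.93),
    "th": LanguagePolicy(quality_score=0.72),
    "sw": LanguagePolicy(quality_score=0.51),
}

def route(lang: str) -> str:
    policy = POLICIES.get(lang)
    if policy is None or policy.quality_score < 0.6:
        return "human_escalation"           # unsupported or very low quality
    if policy.quality_score < policy.min_quality:
        return "llm_with_human_review"      # answer, but queue for review
    return "llm_automated"

print(route("de"), route("th"), route("sw"), route("yo"))
```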

Cost Considerations

Multilingual deployment affects costs through:

Fertility impact: Higher token counts for some languages increase API costs and compute requirements.

Model size: Multilingual models are typically larger than monolingual models for equivalent single-language quality.

Evaluation costs: Testing across many languages multiplies QA effort.

Support costs: User issues in many languages require multilingual support capability.

Regulatory and Compliance

Different regions have different requirements:

Data residency: Some jurisdictions require data processing within their borders.

Content moderation: Moderation systems must handle all supported languages effectively.

Accessibility: Multilingual support may be legally required in some contexts (government services, healthcare).

Right to explanation: GDPR and similar regulations require explanations in users' languages.

Current State of Multilingual Models (2025)

The landscape of multilingual LLMs has evolved significantly:

Leading Models

GPT-4 / GPT-4 Turbo: Strong multilingual capability across 95+ languages. Quality varies significantly between high and low-resource languages.

Claude 3.5 / Claude 4: Comparable multilingual breadth with particular strength in European languages and nuanced cultural understanding.

Gemini 1.5 / Gemini 2: Google's models leverage their translation expertise. Strong in languages covered by Google Translate.

Llama-3: Open-weight with decent multilingual capability. Community has extended through continued pretraining (e.g., Chinese-Llama, Japanese-Llama).

Qwen-3 (2025): Major expansion from Qwen-2.5's 29 languages to 119 languages and dialects. Key advances:

  • Pre-trained on ~36 trillion tokens (2× Qwen-2.5's 18T)
  • Models range from 0.6B to 235B parameters (dense and MoE)
  • Unified thinking mode (complex reasoning) and non-thinking mode (rapid responses)
  • Qwen3-Embedding-8B ranks #1 on MTEB multilingual leaderboard (score 70.58, June 2025)
  • Qwen-MT enables translation across 92 major languages covering 95%+ of global population

Aya (Cohere): Explicitly multilingual-focused, covering 100+ languages with particular attention to underserved languages. Recent research on Aya Expanse explores debiasing techniques in the cross-lingual latent space (EMNLP 2025).

Specialized Multilingual Models

BLOOM: 176B parameter model trained on 46 languages with explicit multilingual focus. Open weights.

mT5 / mT0: Encoder-decoder multilingual models strong for classification and structured generation.

NLLB (No Language Left Behind): Meta's translation-focused model covering 200+ languages.

SeamlessM4T: Multimodal multilingual model handling speech and text across 100+ languages.

Benchmarks and Evaluation

XTREME / XTREME-R: Cross-lingual benchmark suite covering diverse tasks.

XGLUE: Microsoft's cross-lingual evaluation benchmark.

FLORES: Translation quality benchmark for 200 languages.

TyDi QA: Question answering in typologically diverse languages.

Belebele: Reading comprehension across 122 languages (Meta, 2023).

Future Directions

Expanding Language Coverage

Current models serve perhaps 100 languages well. The remaining 6,900+ languages face a chicken-and-egg problem: no data means no model coverage, which means no digital presence to generate data.

Approaches being explored:

Language family clustering: Leverage transfer within language families to bootstrap coverage of related languages.

Multilingual speech-to-text: Spoken language data is more abundant than written for many languages.

Community-driven data collection: Partnering with language communities to create evaluation data and identify model failures.

Synthetic data generation: Using existing models to generate training data in low-resource languages (with careful quality control).

Reducing the Quality Gap

The gap between English and other languages, even in "multilingual" models, remains substantial:

Better evaluation: More comprehensive benchmarks for more languages to identify and target gaps.

Architectural innovations: Language-specific modules, better tokenization, and MoE routing may help allocate capacity more effectively.

Training efficiency: Methods to get more signal from limited data in low-resource languages.

Cultural AI

Beyond language, building AI that respects cultural differences:

Localized preference learning: RLHF with raters from diverse cultural backgrounds, not just English-speaking annotators.

Configurable cultural behavior: Allow users or deployers to specify cultural context for appropriate responses.

Avoiding cultural flattening: Preserving diversity rather than defaulting to English/Western norms.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
