
Multilingual LLMs and Localization: Building AI for a Global World

A comprehensive guide to multilingual large language models covering cross-lingual transfer, tokenization challenges, cultural adaptation, and production strategies for deploying AI systems that serve users across languages and cultures.

9 min read

With over 7,000 languages spoken globally, the English-centric nature of most large language models represents a significant limitation on AI's potential impact. Multilingual LLMs aim to bridge this gap, enabling AI systems that understand and generate text across languages. However, building truly effective multilingual systems involves far more than training on diverse data—it requires careful consideration of tokenization, cross-lingual transfer, cultural context, and the inherent trade-offs between language coverage and per-language quality.

The Multilingual Challenge

Language modeling at scale seems straightforward: train on text from many languages, and the model learns to handle them all. Reality is more complex. English dominates the internet and, consequently, most training corpora. Common Crawl, the foundation of many training datasets, contains roughly 46% English content, while languages like Swahili, Bengali, or Thai represent fractions of a percent each.

This imbalance creates cascading effects. Models trained on data proportional to web prevalence effectively become English-primary systems with limited capability in other languages. But over-sampling low-resource languages to achieve balance degrades high-resource language performance and can introduce noise from lower-quality sources.

The challenge extends beyond data quantity. Different languages have different:

Morphological complexity: English has relatively simple word formation; Turkish or Finnish have extensive agglutination where single words carry what would be entire phrases in English.

Writing systems: Latin script, Cyrillic, Arabic, Chinese characters, and dozens of other scripts require different tokenization approaches.

Syntactic structure: Word order, case marking, and grammatical gender vary dramatically across language families.

Cultural context: Concepts, idioms, and pragmatic conventions differ across cultures, affecting appropriate responses.

Resource availability: Evaluation benchmarks, parallel corpora, and linguistic expertise are abundant for some languages and nearly nonexistent for others.

Building effective multilingual systems requires addressing all these challenges simultaneously while managing computational and data constraints.

Tokenization: The Foundation of Multilingual Models

Tokenization determines how text is split into the discrete units that models process. For multilingual models, tokenization choices have outsized impact on performance, efficiency, and fairness across languages.

The Fertility Problem

Standard subword tokenization (BPE, WordPiece, Unigram) learns merge patterns from training data. When English dominates training, the vocabulary becomes English-optimized:

  • "understanding" → 1 token
  • "Verständnis" (German for "understanding") → 3-4 tokens
  • "समझ" (Hindi for "understanding") → 4-6 tokens

This disparity, called fertility, means non-English text requires more tokens to encode the same semantic content. The consequences are severe:

Cost: API pricing is per-token. Hindi users pay 3-5× more than English users for equivalent content.

Context limits: A 4K context window holds ~3K English words but only ~1K Hindi words worth of content.

Quality: Higher fertility correlates with reduced generation quality. The model must predict more tokens, each with less semantic content.

Latency: More tokens means more generation steps, increasing response time.
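
To see the fertility gap concretely, the following sketch counts tokens per whitespace-delimited word for parallel sentences. It assumes the tiktoken package with the cl100k_base encoding (an approximation of the GPT-4 tokenizer); the sample sentences are illustrative:

```python
# Rough fertility check: tokens per word for parallel sentences in several languages.
# Assumes `pip install tiktoken`; cl100k_base approximates the GPT-4 tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Illustrative parallel sentences (same meaning, different languages).
samples = {
    "en": "Understanding improves with practice.",
    "de": "Das Verständnis verbessert sich mit Übung.",
    "hi": "समझ अभ्यास के साथ बेहतर होती है।",
}

baseline = None
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    fertility = n_tokens / n_words          # tokens per word
    baseline = baseline or fertility        # treat the first (English) entry as 1.0x
    print(f"{lang}: {n_tokens} tokens, {fertility:.2f} tokens/word, "
          f"{fertility / baseline:.1f}x vs English")
```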

Vocabulary Expansion Strategies

Several approaches address the fertility gap:

Larger vocabularies: Increasing from 32K to 100K+ tokens allocates more vocabulary to non-English languages. Qwen-3 uses 151K tokens; multilingual models from Google and Meta use 250K+. Diminishing returns and memory costs limit this approach.

Language-specific tokenizers: Train separate tokenizers per language, then combine. This ensures each language gets appropriate coverage but complicates the model architecture.

Balanced sampling during tokenizer training: Over-sample low-resource languages when learning BPE merges. This reduces fertility disparity at the cost of slight English efficiency loss.

Character or byte-level fallback: Use subword tokens for common patterns but fall back to characters or bytes for unknown text. ByT5 and CANINE explore this approach.
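
As a rough illustration of balanced sampling during tokenizer training, the sketch below feeds a BPE trainer a stream that picks languages uniformly rather than proportionally to corpus size. It assumes the Hugging Face tokenizers package; the tiny in-memory corpora stand in for real datasets:

```python
# Sketch: train a BPE tokenizer on a language-balanced stream instead of raw web proportions.
# Assumes the HuggingFace `tokenizers` package; corpora here are tiny illustrative stand-ins.
import random
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpora = {
    "en": ["the model reads text"] * 10_000,      # high-resource language
    "sw": ["mfano unasoma maandishi"] * 100,      # low-resource language
}

def balanced_lines(corpora, n_samples=50_000, seed=0):
    """Draw each training line from a uniformly chosen language, oversampling small corpora."""
    rng = random.Random(seed)
    langs = list(corpora)
    for _ in range(n_samples):
        lang = rng.choice(langs)                  # uniform over languages, not over lines
        yield rng.choice(corpora[lang])

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=100_000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(balanced_lines(corpora), trainer=trainer)
tokenizer.save("balanced-bpe.json")
```

Real pipelines typically use temperature-weighted rather than strictly uniform language sampling (see the data mixing section below), but the mechanism of biasing the training stream is the same.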

Modern Tokenizer Comparisons

| Model | Vocabulary Size | Avg. Non-English Fertility | Approach |
|---|---|---|---|
| GPT-4 | 100K | 1.8× | BPE with expanded vocab |
| Llama-3 | 128K | 1.6× | Tiktoken with multilingual data |
| Qwen-3 | 151K | 1.4× | Language-balanced BPE |
| BLOOM | 250K | 1.3× | Unigram with massive vocab |
| mT5 | 250K | 1.3× | SentencePiece multilingual |

The trend is clear: newer models invest in larger, more balanced vocabularies. The efficiency gains from reduced fertility outweigh the memory cost of additional embeddings.

Tokenization for Specific Scripts

Some scripts require special handling:

CJK (Chinese, Japanese, Korean): Character-based tokenization often works well since characters carry semantic meaning. However, Japanese mixes three scripts (hiragana, katakana, kanji) requiring unified handling.

Arabic: Right-to-left text, connected letters, and diacritics complicate tokenization. Arabic also has significant dialectal variation (Egyptian, Gulf, Levantine) that models often fail to distinguish.

Indic scripts: Multiple scripts with complex combining characters. Proper handling requires Unicode normalization and understanding of script-specific rules.

Code-mixing: Users frequently mix languages within sentences (Hinglish = Hindi + English). Tokenizers must handle mixed-script text gracefully.
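
A minimal, standard-library sketch of the preprocessing this implies: Unicode NFC normalization (important for Indic combining characters) and a rough script check to flag code-mixed input. The first-word-of-the-Unicode-name heuristic is a simplification, not a production script detector:

```python
# Sketch: normalize Indic text and flag mixed-script (code-mixed) inputs before tokenization.
# Uses only the Python standard library; the script heuristic is a simplification.
import unicodedata

def preprocess(text: str) -> str:
    # NFC composes base characters with combining marks, which many Indic scripts require
    return unicodedata.normalize("NFC", text)

def scripts_used(text: str) -> set[str]:
    """Rough script detection from Unicode character names (e.g. 'DEVANAGARI LETTER KA')."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])   # first word of the name is usually the script
    return scripts

text = preprocess("mujhe यह movie पसंद है")    # Hinglish: Latin + Devanagari in one sentence
print(scripts_used(text))                      # e.g. {'LATIN', 'DEVANAGARI'}
```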

Cross-Lingual Transfer: How Models Learn to Generalize

A remarkable property of multilingual models is cross-lingual transfer: learning in one language improves performance in others. A model fine-tuned on English question-answering often improves at German QA without seeing German examples. Understanding this phenomenon is key to building effective multilingual systems.

The Shared Representation Hypothesis

Cross-lingual transfer suggests that multilingual models develop language-agnostic representations—an internal "interlingua" where meaning is encoded similarly regardless of surface language. Evidence for this includes:

Parallel sentence alignment: Sentences with identical meaning in different languages cluster together in embedding space.

Zero-shot transfer: Task learning transfers across languages without explicit training.

Translation emergence: Models can translate between language pairs never seen together in training.

However, the picture is more nuanced. Recent research (2024-2025) reveals that models often process non-English inputs by implicitly converting to English-like representations, performing reasoning, then converting back. This "English pivot" behavior suggests true language-agnostic representation may be an approximation rather than reality.
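
The parallel-sentence evidence is easy to reproduce with off-the-shelf multilingual embeddings. The sketch below assumes the sentence-transformers package; the model name is one multilingual option among several, and the sentences are illustrative:

```python
# Sketch: check that translations land near each other in a multilingual embedding space.
# Assumes the `sentence-transformers` package; the model is one multilingual option.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The weather is nice today.",          # English
    "Das Wetter ist heute schön.",         # German translation of the same sentence
    "I forgot my keys at the office.",     # unrelated English sentence
]
embeddings = model.encode(sentences, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

print(f"EN vs DE (parallel):  {float(similarities[0][1]):.2f}")   # typically high
print(f"EN vs EN (unrelated): {float(similarities[0][2]):.2f}")   # typically much lower
```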

Transfer Patterns

Not all language pairs transfer equally:

Language family effects: Transfer works best within language families. Spanish-Italian transfer is strong; Spanish-Japanese transfer is weak.

Typological similarity: Languages with similar syntax transfer better regardless of family. SOV languages (Japanese, Korean, Turkish) share transfer patterns despite being unrelated.

Script effects: Shared scripts facilitate transfer. The Arabic script connects Arabic, Farsi, and Urdu despite different language families.

Resource effects: High-resource languages serve as better source languages. English-to-X transfer is usually stronger than X-to-English, reflecting English's training dominance.

Altruistic, Selfish, and Stagnant Languages

Research has identified distinct roles that languages play during multilingual training:

Altruistic languages: Improve performance on related languages when added to training. French improves Spanish and Italian; Hindi improves Urdu.

Selfish languages: Primarily benefit themselves, contributing little to other languages. Often typologically isolated languages like Basque.

Stagnant languages: Neither benefit from nor contribute to other languages. Often very low-resource languages with insufficient training signal.

Understanding these patterns helps design training data mixtures that maximize positive transfer.

Cross-lingual In-Context Pre-training

A 2025 technique called Cross-lingual In-context Pre-training (CrossIC-PT) explicitly encourages transfer by interleaving semantically related texts in different languages within the same context window:

Rather than:

```
[English document 1] [English document 2] [French document 1] [French document 2]
```

CrossIC-PT uses:

```
[English doc about climate] [French doc about climate] [English doc about politics] [French doc about politics]
```

This forces the model to connect semantically similar content across languages, strengthening the shared representation. Experiments show 3-4% improvement across languages using this approach.
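
A minimal sketch of this style of packing: group documents by topic so that related texts in different languages share a context window. Topic labels are given explicitly here, whereas real pipelines would cluster by translation links or embedding similarity; this illustrates the idea rather than the paper's exact implementation:

```python
# Sketch of CrossIC-PT-style packing: interleave related documents on the same topic
# across languages within one context window, instead of grouping documents by language.

def pack_cross_lingual(docs, max_docs_per_window=4):
    """docs: list of (topic, lang, text). Returns context windows mixing languages per topic."""
    by_topic = {}
    for topic, lang, text in docs:
        by_topic.setdefault(topic, []).append(text)

    windows, current = [], []
    for texts in by_topic.values():
        for text in texts:                  # related texts in different languages sit adjacent
            current.append(text)
            if len(current) == max_docs_per_window:
                windows.append(" ".join(current))
                current = []
    if current:
        windows.append(" ".join(current))
    return windows

docs = [
    ("climate", "en", "Global temperatures rose again this year ..."),
    ("climate", "fr", "Les températures mondiales ont encore augmenté cette année ..."),
    ("politics", "en", "Parliament debated the new budget ..."),
    ("politics", "fr", "Le Parlement a débattu du nouveau budget ..."),
]
for window in pack_cross_lingual(docs):
    print(window[:120])
```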

Multilingual Model Performance Analysis

Understanding how different models perform across languages helps select the right model for your use case.

Benchmark Performance by Language Family

Performance varies significantly by language family due to training data distribution and linguistic similarity to English:

| Language Family | Representative Languages | GPT-4 | Claude 3.5 | Llama-3 70B | Qwen-2.5 |
|---|---|---|---|---|---|
| Germanic | German, Dutch, Swedish | 95% | 94% | 88% | 85% |
| Romance | French, Spanish, Italian | 94% | 93% | 87% | 84% |
| Slavic | Russian, Polish, Czech | 88% | 87% | 78% | 80% |
| Sino-Tibetan | Chinese (Mandarin) | 91% | 90% | 82% | 96% |
| Japonic | Japanese | 89% | 88% | 75% | 88% |
| Dravidian | Tamil, Telugu, Kannada | 72% | 71% | 58% | 65% |
| Indo-Iranian | Hindi, Bengali, Farsi | 78% | 77% | 65% | 72% |
| Afroasiatic | Arabic, Hebrew, Amharic | 80% | 79% | 68% | 75% |
| Niger-Congo | Swahili, Yoruba, Zulu | 65% | 64% | 48% | 55% |

Scores represent average performance across MMLU, TruthfulQA, and task completion benchmarks, normalized to English = 100%

Quality Tiers by Language

Based on production testing, languages can be grouped into quality tiers:

Tier 1 (>90% of English quality):

  • English, German, French, Spanish, Italian, Dutch, Portuguese
  • Suitable for production applications requiring high reliability

Tier 2 (75-90% of English quality):

  • Chinese, Japanese, Korean, Russian, Arabic, Hindi, Polish, Turkish
  • Suitable for most production uses with quality monitoring

Tier 3 (50-75% of English quality):

  • Thai, Vietnamese, Indonesian, Greek, Hebrew, Czech, Romanian
  • Suitable for assisted tasks, requires human review for critical applications

Tier 4 (<50% of English quality):

  • Low-resource languages: Swahili, Amharic, Yoruba, Burmese, Khmer
  • Experimental use only, significant quality gaps

Cross-Lingual Task Performance

Different task types show different cross-lingual patterns:

| Task Type | High-Resource Transfer | Low-Resource Transfer | Notes |
|---|---|---|---|
| Classification | 85-95% | 60-80% | Transfers well |
| NER | 75-90% | 50-70% | Language-specific patterns matter |
| QA | 80-92% | 55-75% | Knowledge availability varies |
| Summarization | 78-88% | 50-70% | Grammatical differences impact quality |
| Generation | 70-85% | 40-65% | Most challenging transfer |
| Translation | N/A | N/A | Requires parallel data |

Token Efficiency Analysis

The fertility gap (tokens per semantic unit) significantly impacts cost and performance:

| Language | Avg Tokens/Word | Fertility vs English | Cost Multiplier |
|---|---|---|---|
| English | 1.3 | 1.0× | 1.0× |
| Spanish | 1.4 | 1.1× | 1.1× |
| German | 1.6 | 1.2× | 1.2× |
| French | 1.5 | 1.15× | 1.15× |
| Chinese | 2.1 | 1.6× | 1.6× |
| Japanese | 2.8 | 2.2× | 2.2× |
| Korean | 2.5 | 1.9× | 1.9× |
| Arabic | 2.4 | 1.8× | 1.8× |
| Hindi | 3.2 | 2.5× | 2.5× |
| Thai | 4.1 | 3.2× | 3.2× |

Based on GPT-4 tokenizer; newer tokenizers (Qwen-3, Llama-3) show improved efficiency

Training Multilingual Models

Training effective multilingual models requires careful decisions about data mixing, curriculum design, and capacity allocation.

Data Mixing Strategies

The fundamental question is how to weight languages in training data:

Proportional sampling: Sample each language proportionally to its prevalence. Result: English dominates, low-resource languages underfit.

Temperature-based sampling: Apply a temperature to the language distribution: $P(L_i) \propto P_{\text{raw}}(L_i)^{1/T}$

Higher temperature flattens the distribution, giving more weight to low-resource languages. Exponents of 1/T ≈ 0.3-0.5 (i.e., T ≈ 2-3) are common. A short sketch comparing these weightings appears at the end of this subsection.

Square root sampling: Sample proportional to the square root of language size. A middle ground between proportional and uniform.

Uniform sampling: Equal weight to all languages regardless of data size. Low-resource languages get maximum exposure but risk overfitting to limited data.

Dynamic sampling: Adjust weights during training based on per-language validation loss, allocating more compute to languages that are still making progress.

Empirical findings suggest different strategies for different training phases:

  • Early training: More uniform sampling to establish representations
  • Later training: More proportional sampling to maximize high-resource quality
  • Fine-tuning: Focus sampling on the target language or language group
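
The weighting strategies above differ mainly in the exponent applied to raw corpus sizes. The following minimal sketch (with illustrative corpus sizes) shows how the resulting sampling probabilities compare:

```python
# Sketch: per-language sampling probabilities under different mixing strategies.
# Corpus sizes are illustrative token counts, not real dataset statistics.
corpus_tokens = {"en": 1_000_000, "de": 200_000, "hi": 50_000, "sw": 5_000}

def sampling_probs(sizes, exponent=1.0):
    """exponent=1.0 -> proportional, 0.5 -> square root, ~0.3 -> temperature (1/T), 0.0 -> uniform."""
    weights = {lang: n ** exponent for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

for name, exp in [("proportional", 1.0), ("temperature 1/T=0.3", 0.3),
                  ("square root", 0.5), ("uniform", 0.0)]:
    probs = sampling_probs(corpus_tokens, exp)
    print(name, {lang: round(p, 3) for lang, p in probs.items()})
```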

Capacity Allocation

Multilingual models face the "curse of multilinguality": performance on any single language typically lags behind a monolingual model of equivalent size. The model's capacity must be shared across many languages.

Strategies to mitigate this:

Scale: Simply make models larger. GPT-4 and Claude handle 95+ languages competently through sheer scale.

Modular architectures: Use language-specific components (embedding layers, adapter modules) while sharing core capacity.

Mixture of Experts: Different experts specialize in different language families. MoE provides effective capacity scaling with sublinear compute increase.

Language-specific fine-tuning: Start with a multilingual base, then fine-tune separate versions per language or language group.

Continual Pre-training for New Languages

Adding new languages to an existing model (continual pre-training) risks catastrophic forgetting of existing languages. Techniques to mitigate this:

Data replay: Mix new language data with samples from original training languages.

Elastic weight consolidation: Regularize to prevent important weights from changing too much.

Progressive expansion: Gradually increase the new language's weight in the training mix.

Adapter-based addition: Add language-specific adapters rather than modifying base weights.

Recent work on "rethinking multilingual continual pretraining" (2024-2025) suggests that careful data mixing with even small amounts of original-language data prevents most forgetting while enabling efficient language addition.
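
A minimal sketch of data replay, assuming in-memory document lists; the 10% replay fraction is an illustrative choice, not a recommended value:

```python
# Sketch: data replay for continual pre-training -- mix the new language with a small
# fraction of the original mixture to limit catastrophic forgetting.
import random

def replay_stream(new_lang_docs, original_docs, replay_fraction=0.1, seed=0):
    """Yield training docs, drawing `replay_fraction` of samples from the original mixture."""
    rng = random.Random(seed)
    while True:
        if rng.random() < replay_fraction:
            yield rng.choice(original_docs)     # replayed original-language sample
        else:
            yield rng.choice(new_lang_docs)     # new target-language sample

stream = replay_stream(new_lang_docs=["dokumen bahasa baru ..."],
                       original_docs=["original English document ..."])
batch = [next(stream) for _ in range(8)]
print(batch)
```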

Fine-Tuning Multilingual Models

Fine-tuning multilingual models for specific tasks introduces additional considerations beyond monolingual fine-tuning.

Cross-lingual Fine-tuning Strategies

English-only fine-tuning: Train only on English task data, rely on transfer. Simple but leaves performance gaps for distant languages.

Translate-train: Machine-translate English training data to target languages. Effective but translation quality limits ceiling.

Multilingual fine-tuning: Collect or generate task data in multiple languages. Best results but highest data cost.

Few-shot multilingual: Fine-tune on English with few examples from target languages. Often achieves most of multilingual fine-tuning's benefit with much less data.

Adapter-Based Approaches

Parameter-efficient fine-tuning methods are particularly valuable for multilingual settings:

Language adapters: Small modules trained per-language that modulate the base model. The base model stays frozen and shared.

Task + Language adapters: Separate adapters for task knowledge (shared across languages) and language-specific adaptation. Compose them at inference.

AdaMergeX (2025): Adaptively merges adapters from source languages to adapt to target languages. Enables zero-shot transfer to new languages by combining existing adapters.
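
To make the adapter composition concrete, here is a small PyTorch sketch of a frozen base layer combined with a shared task adapter and per-language adapters. The stacking order, sizes, and names are illustrative rather than any specific library's API:

```python
# Sketch: compose a shared task adapter with per-language adapters over a frozen base layer.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck adapter

class AdaptedLayer(nn.Module):
    def __init__(self, base_layer, task_adapter, language_adapters):
        super().__init__()
        self.base = base_layer                          # frozen, shared across languages
        self.task = task_adapter                        # shared task knowledge
        self.lang = nn.ModuleDict(language_adapters)    # one adapter per language

    def forward(self, x, lang="de"):
        h = self.base(x)
        h = self.lang[lang](h)                          # language-specific adaptation
        return self.task(h)                             # then task-specific adaptation

layer = AdaptedLayer(
    base_layer=nn.Linear(768, 768),
    task_adapter=BottleneckAdapter(),
    language_adapters={"de": BottleneckAdapter(), "hi": BottleneckAdapter()},
)
for p in layer.base.parameters():
    p.requires_grad = False                             # only the adapters are trained

out = layer(torch.randn(2, 768), lang="hi")
```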

FLARE: Latent Fusion for Better Transfer

A late 2024 approach called FLARE integrates source and target language representations within LoRA adapters' bottleneck layers. By performing latent fusion of English and target-language representations, FLARE reduces the performance gap between English and other languages to 8-12% on average, compared to 20-30% gaps with standard fine-tuning.

Cultural Adaptation and Localization

True multilingual capability extends beyond language to culture. A model that translates literally without cultural adaptation produces technically correct but pragmatically wrong outputs.

The Transfer-Localization Tradeoff

Cross-lingual transfer is desirable for factual knowledge:

  • "What is the speed of light?" should get the same answer regardless of query language.

But localization is desirable for culturally-situated responses:

  • "What should I eat for breakfast?" should reflect local cuisine
  • "Who is the president?" should understand country context from language
  • Humor, idioms, and politeness norms vary by culture

Research identifies this as the "transfer-localization plane"—a framework for quantifying both desirable knowledge transfer and undesirable "cultural erasure" when models over-transfer.

Cultural Capabilities

High-quality multilingual models demonstrate:

Idiomatic expression understanding: Recognizing that "it's raining cats and dogs" means heavy rain, with equivalent idiom use in other languages.

Register adaptation: Adjusting formality based on cultural norms. Japanese requires explicit politeness levels; English formality is more implicit.

Cultural knowledge: Understanding local holidays, customs, historical figures, and current events for each language's primary cultures.

Code-switching handling: Naturally mixing languages as bilingual speakers do, rather than enforcing artificial language boundaries.

Pragmatic appropriateness: Responses that would be appropriate in the cultural context of the query language.

Evaluation for Cultural Competence

Standard benchmarks often miss cultural aspects. Evaluating cultural competence requires:

Localized benchmarks: Test sets created by native speakers reflecting their cultural context, not translated from English.

Human evaluation: Automated metrics miss pragmatic appropriateness. Native speaker judgment is essential.

Comparative evaluation: Same semantic query in different languages should sometimes get different answers (local context) and sometimes the same (factual knowledge).

Red-teaming: Testing for cultural stereotypes, biases, and inappropriate generalizations.

Case Studies: Multilingual Deployment

Case Study 1: Global Customer Support

A multinational corporation deployed multilingual LLMs for customer support across 40+ countries:

Challenge:

  • 15 primary languages, 40+ secondary languages
  • Varying quality requirements by market (Tier 1 markets need 95%+ accuracy)
  • Real-time response requirements (<2s latency)
  • Cost constraints ($0.02/query budget)

Solution Architecture:

```
User Query → Language Detection → Router
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
              Tier 1 Model   Tier 2 Model    Tier 3 Model
              (GPT-4)        (Claude 3.5)    (Llama-3 70B)
                    │               │               │
                    └───────────────┼───────────────┘
                                    ▼
                            Response + QA Check
```

Results:

  • Tier 1 languages: 94% customer satisfaction
  • Tier 2 languages: 87% customer satisfaction
  • Tier 3 languages: 72% satisfaction (with human escalation path)
  • 60% cost reduction vs. human-only support

Lessons learned:

  • Language detection fails on roughly 8% of short queries—use the user's profile language as a fallback
  • Code-switching in Asian markets requires specialized handling
  • Cultural adaptation crucial for satisfaction scores

Case Study 2: Multilingual Content Generation

A media company generates content in 12 languages for regional audiences:

Requirements:

  • Generate 1000+ articles/day across languages
  • Maintain brand voice consistency
  • Culturally appropriate content
  • SEO optimization per market

Architecture:

  1. Generate master content in English
  2. Adapt (not translate) to each target language
  3. Apply cultural localization rules
  4. Human review for Tier 1 markets

Quality metrics:

| Language | Automation Rate | Human Edit Rate | Reader Engagement |
|---|---|---|---|
| Spanish | 85% | 15% | 98% of English |
| German | 82% | 18% | 96% of English |
| Japanese | 68% | 32% | 91% of English |
| Arabic | 65% | 35% | 88% of English |
| Hindi | 55% | 45% | 82% of English |

Case Study 3: Multilingual RAG System

An enterprise deployed RAG across documents in 8 languages:

Challenges:

  • Documents in multiple languages (EN, DE, FR, ES, IT, PT, ZH, JA)
  • Users query in their preferred language
  • Need cross-lingual retrieval (German query → French document)

Solution:

  • Multilingual embeddings (mE5, multilingual-e5-large)
  • Cross-lingual reranking with multilingual model
  • Answer generation in query language
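
A minimal sketch of the cross-lingual retrieval step, assuming the sentence-transformers package and the multilingual-e5-large model (which expects "query:"/"passage:" prefixes); the documents and query are illustrative:

```python
# Sketch: cross-lingual retrieval with a multilingual embedding model.
# Assumes `sentence-transformers`; multilingual-e5 expects "query:"/"passage:" prefixes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

documents = [
    "passage: La politique de congés s'applique à tous les employés permanents.",  # French
    "passage: Alle Rechnungen müssen innerhalb von 30 Tagen bezahlt werden.",       # German
]
query = "query: Wer hat Anspruch auf die Urlaubsregelung?"   # German query, French answer doc

doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))   # cross-lingual hit: German query -> French doc
```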

Performance:

| Query Lang → Doc Lang | Recall@10 | Answer Quality |
|---|---|---|
| Same language | 92% | 94% |
| Cross-lingual (related) | 85% | 88% |
| Cross-lingual (distant) | 71% | 78% |

Production Deployment

Deploying multilingual models in production involves considerations beyond monolingual deployment.

Language Detection and Routing

Production systems need to identify query language for:

Model selection: Route to language-specific models if using a model suite rather than one multilingual model.

Response language matching: Generate responses in the query language (or user's preferred language).

Evaluation and monitoring: Track per-language quality metrics.

Language detection is usually accurate for single-language inputs but struggles with:

  • Very short queries
  • Code-mixed text
  • Similar languages (Norwegian/Swedish, Hindi/Urdu)
  • Transliterated text (Hindi written in Latin script)
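
A minimal sketch of detection with fallbacks, assuming the langdetect package; the short-query heuristic and profile fallback are illustrative choices:

```python
# Sketch: language detection with fallbacks for short or ambiguous inputs.
# Assumes the `langdetect` package; thresholds and fallbacks are illustrative.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0                    # langdetect is stochastic without a fixed seed

def resolve_language(query: str, user_profile_lang: str | None = None,
                     default: str = "en") -> str:
    text = query.strip()
    if len(text.split()) < 3 and user_profile_lang:
        return user_profile_lang            # short queries: trust the user's profile instead
    try:
        return detect(text)                 # e.g. "de", "hi", "ja"
    except Exception:                       # empty or non-linguistic input
        return user_profile_lang or default

print(resolve_language("Wie setze ich mein Passwort zurück?"))      # -> "de"
print(resolve_language("ok thx", user_profile_lang="hi"))           # -> "hi" (profile fallback)
```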

Quality Variance Across Languages

Multilingual models have inconsistent quality across languages. Production systems should:

Monitor per-language metrics: Track quality scores, user feedback, and task success rates by language.

Set quality thresholds: Gate features that require high reliability for languages where quality is insufficient.

Provide fallback paths: When quality is low, offer machine translation to English or human escalation.

Communicate limitations: Be transparent with users about which languages are well-supported vs. best-effort.
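
These practices can be wired into a simple routing policy. The sketch below is illustrative configuration only; the thresholds and quality scores are made-up values, not measured ones:

```python
# Sketch: per-language quality gating with fallback routes.
from dataclasses import dataclass

@dataclass
class LanguagePolicy:
    quality_score: float        # rolling eval / feedback score for this language (0-1)
    min_quality: float = 0.85   # gate for fully automated responses

POLICIES = {
    "de": LanguagePolicy(quality_score=0.93),
    "th": LanguagePolicy(quality_score=0.72),
    "sw": LanguagePolicy(quality_score=0.51),
}

def route(lang: str) -> str:
    policy = POLICIES.get(lang)
    if policy is None or policy.quality_score < 0.6:
        return "human_escalation"           # unsupported or very low quality
    if policy.quality_score < policy.min_quality:
        return "llm_with_human_review"      # answer, but queue for review
    return "llm_automated"

print(route("de"), route("th"), route("sw"), route("yo"))
```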

Cost Considerations

Multilingual deployment affects costs through:

Fertility impact: Higher token counts for some languages increase API costs and compute requirements.

Model size: Multilingual models are typically larger than monolingual models for equivalent single-language quality.

Evaluation costs: Testing across many languages multiplies QA effort.

Support costs: User issues in many languages require multilingual support capability.

Regulatory and Compliance

Different regions have different requirements:

Data residency: Some jurisdictions require data processing within their borders.

Content moderation: Moderation systems must handle all supported languages effectively.

Accessibility: Multilingual support may be legally required in some contexts (government services, healthcare).

Right to explanation: GDPR and similar regulations require explanations in users' languages.

Current State of Multilingual Models (2025)

The landscape of multilingual LLMs has evolved significantly:

Leading Models

GPT-4 / GPT-4 Turbo: Strong multilingual capability across 95+ languages. Quality varies significantly between high and low-resource languages.

Claude 3.5 / Claude 4: Comparable multilingual breadth with particular strength in European languages and nuanced cultural understanding.

Gemini 1.5 / Gemini 2: Google's models leverage their translation expertise. Strong in languages covered by Google Translate.

Llama-3: Open-weight with decent multilingual capability. Community has extended through continued pretraining (e.g., Chinese-Llama, Japanese-Llama).

Qwen-3 (2025): Major expansion from Qwen-2.5's 29 languages to 119 languages and dialects. Key advances:

  • Pre-trained on ~36 trillion tokens (2× Qwen-2.5's 18T)
  • Models range from 0.6B to 235B parameters (dense and MoE)
  • Unified thinking mode (complex reasoning) and non-thinking mode (rapid responses)
  • Qwen3-Embedding-8B ranks #1 on MTEB multilingual leaderboard (score 70.58, June 2025)
  • Qwen-MT enables translation across 92 major languages covering 95%+ of global population

Aya (Cohere): Explicitly multilingual-focused, covering 100+ languages with particular attention to underserved languages. Recent research on Aya Expanse explores debiasing techniques in the cross-lingual latent space (EMNLP 2025).

Specialized Multilingual Models

BLOOM: 176B parameter model trained on 46 languages with explicit multilingual focus. Open weights.

mT5 / mT0: Encoder-decoder multilingual models strong for classification and structured generation.

NLLB (No Language Left Behind): Meta's translation-focused model covering 200+ languages.

SeamlessM4T: Multimodal multilingual model handling speech and text across 100+ languages.

Benchmarks and Evaluation

XTREME / XTREME-R: Cross-lingual benchmark suite covering diverse tasks.

XGLUE: Microsoft's cross-lingual evaluation benchmark.

FLORES: Translation quality benchmark for 200 languages.

TyDi QA: Question answering in typologically diverse languages.

Belebele: Reading comprehension across 122 languages (Meta, 2023).

Future Directions

Expanding Language Coverage

Current models serve perhaps 100 languages well. The remaining 6,900+ languages face a chicken-and-egg problem: no data means no model coverage, which means no digital presence to generate data.

Approaches being explored:

Language family clustering: Leverage transfer within language families to bootstrap coverage of related languages.

Multilingual speech-to-text: Spoken language data is more abundant than written for many languages.

Community-driven data collection: Partnering with language communities to create evaluation data and identify model failures.

Synthetic data generation: Using existing models to generate training data in low-resource languages (with careful quality control).

Reducing the Quality Gap

The gap between English and other languages, even in "multilingual" models, remains substantial:

Better evaluation: More comprehensive benchmarks for more languages to identify and target gaps.

Architectural innovations: Language-specific modules, better tokenization, and MoE routing may help allocate capacity more effectively.

Training efficiency: Methods to get more signal from limited data in low-resource languages.

Cultural AI

Beyond language, building AI that respects cultural differences:

Localized preference learning: RLHF with raters from diverse cultural backgrounds, not just English-speaking annotators.

Configurable cultural behavior: Allow users or deployers to specify cultural context for appropriate responses.

Avoiding cultural flattening: Preserving diversity rather than defaulting to English/Western norms.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
