
Edge AI Models: A Comprehensive Guide to On-Device LLM Deployment

A comprehensive guide to deploying language models on edge devices—covering model selection (Phi, Gemma, Qwen, Llama), quantization techniques, runtime frameworks, and deployment patterns across mobile, browser, desktop, and IoT platforms.


Introduction

The dominant paradigm for LLM deployment has been cloud-based inference—send a request to an API endpoint, wait for a response. But this approach has fundamental limitations: latency depends on network conditions, costs scale with usage, privacy-sensitive data must leave the device, and offline use is impossible. Edge AI offers an alternative: run the model directly on the user's device.

Edge deployment brings inference to where data originates and where users interact. A smartphone can understand voice commands without network connectivity. A browser can provide writing assistance without sending every keystroke to a server. An embedded device can make decisions in milliseconds rather than waiting for round-trip network latency. The tradeoffs are real—edge devices have limited compute, memory, and power—but for many applications, the benefits outweigh the constraints.

This guide explores the complete landscape of edge AI for language models. We'll examine which models are suitable for edge deployment, how quantization makes large models fit in small memory budgets, which runtime frameworks enable efficient inference across platforms, and how to deploy effectively on mobile devices, browsers, desktops, and embedded systems.

Why Edge AI Matters

Understanding when edge deployment makes sense requires examining its advantages and limitations compared to cloud inference.

Advantages of Edge Deployment

Latency improvements are often dramatic. Cloud inference involves network round-trips that add 50-500ms depending on conditions. Edge inference eliminates this entirely—responses begin generating immediately. For interactive applications like autocomplete, voice assistants, or real-time translation, this latency reduction transforms user experience.

Privacy preservation keeps sensitive data on-device. Medical notes, financial information, personal conversations—none of this needs to leave the user's device. This simplifies compliance with regulations like GDPR and HIPAA, and addresses user privacy concerns that prevent adoption of cloud-based AI features.

Offline capability enables use without network connectivity. Mobile users in areas with poor coverage, airplane passengers, field workers in remote locations—all can benefit from AI features that work anywhere. Even in connected environments, offline capability provides resilience against network failures.

Cost structure differs fundamentally from cloud inference. Cloud APIs charge per token, meaning costs scale linearly with usage. Edge deployment has fixed costs (model download, device compute) regardless of how heavily features are used. For high-frequency, low-stakes inference—autocomplete, classification, small generations—edge deployment can be dramatically cheaper at scale.
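
To make this concrete, here is a rough back-of-the-envelope comparison. Every price and usage figure in it is an assumed placeholder rather than real pricing, so substitute numbers from your own provider and telemetry.

```python
# Back-of-the-envelope comparison of cloud vs. edge inference cost.
# All prices and usage figures below are hypothetical placeholders.

CLOUD_PRICE_PER_1K_TOKENS = 0.0005   # assumed blended input/output price, USD
TOKENS_PER_REQUEST = 300             # assumed average request size
REQUESTS_PER_USER_PER_DAY = 50       # e.g. autocomplete and classification calls
USERS = 100_000
DAYS = 30

monthly_tokens = TOKENS_PER_REQUEST * REQUESTS_PER_USER_PER_DAY * USERS * DAYS
cloud_cost = monthly_tokens / 1000 * CLOUD_PRICE_PER_1K_TOKENS

# Edge cost is roughly fixed: one model download per user, plus device compute
# the user already owns. Bandwidth is the main marginal item.
MODEL_SIZE_GB = 2.0                  # e.g. a 3B model quantized to 4-bit
BANDWIDTH_COST_PER_GB = 0.05         # assumed CDN egress price, USD
edge_cost = USERS * MODEL_SIZE_GB * BANDWIDTH_COST_PER_GB

print(f"Cloud: ${cloud_cost:,.0f}/month   Edge (one-time downloads): ${edge_cost:,.0f}")
```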

Customization opportunities emerge when you control the full stack. Fine-tune models for specific use cases without sharing training data. Optimize inference for your specific hardware. Integrate tightly with device capabilities. Cloud APIs offer limited customization; edge deployment offers complete control.

Limitations and Tradeoffs

Model capability is constrained by device resources. A smartphone can run a 3B parameter model; a data center can run a 400B parameter model. For tasks requiring deep reasoning, extensive knowledge, or long context, cloud models will outperform what's feasible on edge devices.

Memory constraints limit model size and context length. Mobile devices typically have 4-8GB RAM shared with the operating system and other apps. After quantization, a 7B model might require 4GB—consuming a significant fraction of available memory and leaving little room for long contexts.

Compute constraints affect generation speed. Mobile GPUs and CPUs generate tokens more slowly than data center hardware. A cloud API might generate 100+ tokens per second; a mobile device might manage 10-30. For long generations, this difference is noticeable.

Power consumption matters for battery-powered devices. Running inference continuously drains batteries quickly. Edge AI features must be designed with power awareness—triggering inference only when needed, using efficient model architectures, and monitoring power impact.

Update complexity increases compared to cloud deployment. Updating a cloud model is instant for all users. Updating an edge model requires users to download new weights—potentially hundreds of megabytes. Version fragmentation becomes possible as users run different model versions.

When to Choose Edge vs Cloud

Edge deployment fits best when latency is critical (interactive applications, real-time processing), privacy is paramount (sensitive data, regulated industries), offline use is needed (mobile apps, field deployment), usage volume is high (frequent small inferences), or customization is valuable (domain-specific fine-tuning).

Cloud deployment fits best when capability is critical (complex reasoning, large context), the latest models are needed (cutting-edge capabilities), usage is sporadic (occasional large tasks), universal access matters (same capability across all devices), or simplicity is valued (no edge deployment complexity).

Many applications benefit from hybrid approaches. Use edge models for common, latency-sensitive operations; escalate to cloud models for complex tasks. Pre-process on device to reduce data sent to cloud. Cache cloud results for offline access.

The Edge Model Landscape

Not all models are suitable for edge deployment. The models designed for edge have specific characteristics: smaller parameter counts, efficient architectures, and training optimized for the capability-per-parameter sweet spot.

Microsoft Phi Series

Microsoft's Phi models pioneered the "small but capable" approach. Rather than scaling parameters, Phi models achieve capability through high-quality training data and curriculum learning.

Phi-4-mini-instruct is the latest lightweight model from Microsoft's Phi-4 family. With 3.8B parameters, it demonstrates reasoning and multilingual performance comparable to much larger 7B-9B models. Trained on high-quality synthetic data and filtered public datasets, it represents the current state of the art for edge-sized models.

The Phi-4 reasoning models, at 14B parameters, represent Microsoft's push into reasoning-focused training and rival much larger models on complex reasoning tasks. While larger than typical edge models, they run well on high-end laptops and desktops.

Phi-3.5-mini remains popular at 3.8B parameters, fitting comfortably on mobile devices after quantization (around 2GB at 4-bit). The 128K context length is notable—longer than many larger models.

Phi-3.5-MoE uses Mixture of Experts to provide 42B total parameters with only 6.6B active per token. This architecture achieves higher capability while maintaining edge-viable compute requirements.

Phi-3-vision adds multimodal capability, accepting images alongside text. Edge deployment of multimodal models is particularly valuable—process camera input locally without uploading images.

Phi models excel at reasoning tasks relative to their size. They're particularly strong on mathematics, coding, and structured reasoning.

Google Gemma Series

Google's Gemma models bring Google's research advances to open weights accessible for edge deployment.

Gemma-3n represents Google's latest edge-focused architecture. The Gemma-3n-E2B-IT model is an instruction-tuned multimodal model that accepts text, image, audio, and video inputs. While the raw parameter count is around 5B, it uses selective parameter activation to run with a memory footprint closer to a traditional 2B model—a significant innovation for edge deployment.

Gemma 3n models are specifically designed for efficient execution on everyday devices such as laptops, tablets, and phones, making them ideal for on-device and low-resource deployments.

Gemma 2 2B remains a solid choice for pure edge deployment. At 2 billion parameters, it fits easily on mobile devices. Google specifically optimized this size point for mobile inference.

Gemma 2 9B offers more capability while remaining edge-viable on high-end devices. After quantization, it can run on devices with 6GB+ available memory. The capability jump from 2B to 9B is significant for complex tasks.

Gemma 2 27B pushes the boundary of what counts as "edge"—viable on high-end laptops and desktops, but too large for mobile. It offers capability approaching cloud models while running locally.

Gemma models benefit from Google's training infrastructure and data. They tend to be well-rounded across tasks, with good multilingual support covering many languages beyond English.

Alibaba Qwen Series

Qwen models from Alibaba offer strong multilingual capability, particularly for Chinese and Asian languages, while maintaining English performance.

Qwen3-0.6B is the smallest dense model in Alibaba's Qwen3 family, released under the Apache 2.0 license. Despite its tiny size, it inherits strong reasoning, improved agent and tool-use capabilities, and broad multilingual support with 32K context length. It's among the most downloaded text generation models on Hugging Face, demonstrating strong community adoption.

Qwen3-4B offers a good balance of capability and efficiency for edge deployment, with NPU-optimized versions available for Qualcomm Snapdragon devices.

Qwen2.5-3B fits the mobile sweet spot with 3 billion parameters. Strong performance across languages makes it particularly valuable for international applications.

Qwen2.5-7B offers significant capability improvement while remaining edge-viable on capable devices. The model particularly excels at Chinese language tasks while maintaining strong English performance.

Qwen2.5-Coder variants are specifically optimized for code generation. For developer tools running on-device, these specialized models outperform general models of similar size.

Qwen2.5-Math variants excel at mathematical reasoning. If your edge application involves calculations or math problems, these specialized models are worth considering.

The Qwen family's strength is breadth—models across sizes, specializations, and languages. This makes it easier to find a model that matches specific edge deployment requirements.

Meta Llama Series

Meta's Llama models have become the default choice for many deployments, with an ecosystem of tools, fine-tunes, and optimizations built around them.

Llama 3.2 1B is Meta's smallest model, designed explicitly for edge deployment. At 1 billion parameters, it runs on nearly any device. Capability is limited but sufficient for classification, simple generation, and embedding tasks.

Llama 3.2 3B offers substantially more capability while remaining edge-friendly. This size point balances capability and efficiency well for mobile deployment. Grouped-Query Attention keeps inference memory-efficient, and strong multilingual support and built-in safety tuning make it well suited to edge and on-premises deployments.

Llama 3.2 11B Vision adds multimodal capability with image understanding. The model is larger but opens possibilities for on-device image analysis.

MobileLLaMA 1.4B is a lightweight transformer model specifically built for mobile and edge devices. It downsizes LLaMA while maintaining competitive performance on language understanding and reasoning benchmarks, offering an option between the 1B and 3B Llama variants.

Llama 3.1 8B remains popular for edge deployment despite being designed before the explicit edge focus. The extensive ecosystem of fine-tunes, quantizations, and tooling makes deployment straightforward.

Llama's advantage is ecosystem. Whatever edge deployment challenge you face, someone has likely solved it for Llama. Quantized versions, optimized runtimes, fine-tuned variants—all are readily available.

Other Notable Models

SmolLM3-3B from Hugging Face is the latest in the SmolLM series and outperforms Llama-3.2-3B and Qwen2.5-3B at the 3B scale while staying competitive with many 4B-class alternatives. The original SmolLM models (135M, 360M, 1.7B) remain excellent choices for highly constrained environments where even a 3B model is too large.

Ministral 3B and 8B from Mistral AI are designed specifically for edge deployment, capable of running on a wide range of hardware. They represent Mistral's entry into the dedicated edge model space.

StableLM from Stability AI offers models optimized for specific tasks. StableLM Zephyr 3B is tuned for chat, making it particularly suitable for conversational edge applications.

RWKV models use a different architecture (linear attention) that offers different tradeoffs. Memory usage grows more slowly with context length, potentially advantageous for edge deployment with long contexts.

Mistral 7B, while at the larger end of edge-viable models, deserves mention for its exceptional capability-per-parameter ratio. On capable edge devices, it provides cloud-competitive quality.

Choosing a Model for Edge

Model selection depends on multiple factors that must be balanced.

Target devices determine the size ceiling. Mobile phones: 1-3B parameters comfortably, up to 7B on high-end devices. Laptops: 7-13B parameters comfortably. Embedded systems: potentially smaller, depending on hardware.

Task requirements affect model choice. Simple classification might work with 1B models. Conversational AI benefits from 3B+. Code generation or complex reasoning might require 7B+ for acceptable quality.

Language requirements matter significantly. If your application needs Chinese, Qwen models have advantages. For primarily English applications, Phi and Llama are strong choices. Gemma offers good multilingual breadth.

Ecosystem considerations affect development speed. Llama's extensive ecosystem means faster development. Less common models might require more custom work.

Licensing varies across models. Llama has usage restrictions at scale. Gemma has specific terms. Phi has its own license. Ensure the model's license permits your intended use.

Quantization: Making Models Fit

Full-precision models are too large for edge deployment. A 7B parameter model in FP16 requires 14GB just for weights. Quantization reduces precision, shrinking model size while preserving most capability.

Understanding Quantization

Neural network weights are stored as floating-point numbers—typically 16-bit (FP16) or 32-bit (FP32) during training. Quantization converts these to lower precision: 8-bit integers (INT8), 4-bit integers (INT4), or mixed formats.

The size reduction is straightforward: FP16 uses 2 bytes per parameter, INT8 uses 1 byte, INT4 uses 0.5 bytes. A 7B model goes from 14GB (FP16) to 7GB (INT8) to 3.5GB (INT4). This makes previously impossible deployments viable.
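
A quick sketch of that arithmetic, counting weight bytes only (real model files add metadata and often keep a few tensors at higher precision, so actual sizes differ somewhat):

```python
# Estimate weight memory at different precisions (weights only).

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_size_gb(n_params: float, precision: str) -> float:
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP16", "INT8", "INT4"):
    print(f"7B model @ {precision}: {weight_size_gb(7e9, precision):.1f} GB")

# 7B model @ FP16: 14.0 GB
# 7B model @ INT8: 7.0 GB
# 7B model @ INT4: 3.5 GB
```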

Quality impact varies by quantization method and target precision. INT8 quantization typically preserves nearly all model capability—quality loss is measurable but rarely noticeable in practice. INT4 quantization has more significant impact, with degradation particularly visible on complex reasoning tasks. The quality-size tradeoff must be evaluated for specific use cases.

Quantization Approaches

Post-training quantization (PTQ) converts an already-trained model to lower precision without additional training. It's fast and requires only the model weights and a small calibration dataset. Most edge deployments use PTQ approaches.

Quantization-aware training (QAT) incorporates quantization into the training process, allowing the model to learn to perform well at low precision. QAT typically produces higher quality quantized models but requires access to training infrastructure and data.

For most practitioners, PTQ is the practical choice—take an existing model and quantize it for deployment. QAT is relevant when training custom models or when pushing quality at extreme quantization levels.

GGUF Format

GGUF (GPT-Generated Unified Format) has become the standard format for quantized models in the llama.cpp ecosystem. It supports multiple quantization levels and is designed for efficient loading and inference.

GGUF quantization levels offer different tradeoffs. Q8_0 provides 8-bit quantization with minimal quality loss and roughly 50% size reduction from FP16. Q6_K offers 6-bit quantization as a middle ground. Q5_K_M provides 5-bit with quality-preserving techniques. Q4_K_M offers 4-bit with good quality retention. Q4_0 provides simple 4-bit with more quality loss but smaller size. Q3_K and Q2_K push to very low bit widths for extremely constrained deployments.

The "K" variants use k-quant methods that apply different quantization levels to different parts of the model, preserving quality in sensitive layers while aggressively quantizing elsewhere.

GGUF models are available pre-quantized from repositories like Hugging Face. TheBloke and other quantizers provide GGUF versions of popular models at various quantization levels.
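
As a minimal sketch of fetching one of these pre-quantized files, the snippet below uses huggingface_hub. The repository and filename are illustrative examples only; browse the Hub for the exact model and quantization level you need.

```python
# Fetch a pre-quantized GGUF file from the Hugging Face Hub.
# pip install huggingface_hub
# The repo_id and filename below are illustrative placeholders.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",    # example community quant repo
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",      # 4-bit k-quant variant
)
print(model_path)  # local cache path, ready to hand to a llama.cpp-based runtime
```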

AWQ (Activation-aware Weight Quantization)

AWQ improves on simple quantization by considering how weights interact with typical activations. Weights that significantly affect activations are preserved at higher precision, while less impactful weights are quantized more aggressively.

AWQ typically achieves better quality than naive quantization at the same bit width. A 4-bit AWQ model often approaches 8-bit naive quantization in quality while being half the size.

AWQ is supported by inference frameworks like vLLM and TensorRT-LLM. For server-side inference with quantized models, AWQ is often the best choice. Edge support is growing but less universal than GGUF.

GPTQ

GPTQ uses a sophisticated algorithm to find quantization parameters that minimize output error. It solves an optimization problem for each layer, finding the quantization that best preserves the layer's behavior.

GPTQ produces high-quality quantized models, particularly at 4-bit precision. The quantization process is slower than simpler methods but runs once at preparation time.

GPTQ models are widely available and supported by frameworks including Hugging Face Transformers with the optimum library.
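
As a hedged sketch of that path, the snippet below loads a pre-quantized GPTQ checkpoint through Transformers. It assumes optimum plus a GPTQ kernel backend (and accelerate for device_map) are installed, and the repository name is just an example.

```python
# Load a pre-quantized GPTQ checkpoint with Hugging Face Transformers.
# Assumes optimum and a GPTQ backend (e.g. auto-gptq or gptqmodel) plus
# accelerate are installed; the repo name is an illustrative example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"   # example 4-bit GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```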

Quantization Best Practices

Start with pre-quantized models rather than quantizing yourself. Repositories like Hugging Face Hub have GGUF, AWQ, and GPTQ versions of popular models ready for deployment.

Evaluate quality on your specific tasks. Quantization impact varies by model and task. A model that works well at Q4 for classification might struggle at Q4 for code generation. Test with representative workloads.

Match quantization to device capability. INT8 might be sufficient if it fits in memory—the quality is higher than INT4. Only drop to lower precision when memory constraints require it.

Consider mixed-precision approaches. Some frameworks support running critical layers at higher precision while quantizing the rest. This preserves quality for sensitive operations.

Runtime Frameworks

The runtime framework executes the model on device hardware. Different frameworks target different platforms and offer different optimization strategies.

llama.cpp

llama.cpp has emerged as the dominant runtime for edge LLM deployment. Originally implementing Llama inference in C++, it now supports most popular model architectures and runs on virtually any hardware. As of 2025, llama.cpp has over 1,200 contributors and nearly 4,000 releases, reflecting its central role in the edge AI ecosystem.

Platform support is exceptionally broad. llama.cpp runs on x86 and ARM CPUs, Apple Silicon with Metal acceleration, NVIDIA GPUs with CUDA, AMD GPUs with ROCm and HIP, and Moore Threads GPUs via MUSA. The same codebase deploys across platforms.

Quantization support now spans 1.5-bit to 8-bit integer quantization for maximum flexibility in the size-quality tradeoff. The recommended quantization levels include Q4_K_M as the safe default for phones and lighter Macs, Q5_K_M for improved detail and reasoning stability, and Q8_0 when quality matters most and memory isn't a constraint.

Advanced quantization features include imatrix (importance matrix) support for optimized quantization that minimizes quality loss on important weights. The --leave-output-tensor option preserves the output layer at higher precision for improved quality, and FlashAttention CUDA kernels support all KV cache quantization type combinations.

GGUF integration is native—llama.cpp defines and fully supports the GGUF format. Hugging Face provides online tools including GGUF-my-repo for converting and quantizing models, GGUF-my-LoRA for adapter conversion, and GGUF-editor for metadata editing.

The unified memory option (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1) on Linux allows swapping to system RAM instead of crashing when GPU VRAM is exhausted—useful for pushing the limits of model size on memory-constrained devices.

Language bindings extend llama.cpp to other languages. llama-cpp-python provides Python bindings. Node.js, Rust, Go, and other bindings exist. This enables integration with applications in any language.
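
A minimal llama-cpp-python sketch, assuming a GGUF file is already on disk (for example the one fetched in the earlier snippet):

```python
# Minimal llama-cpp-python usage.
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # path to your GGUF file
    n_ctx=4096,        # context window to allocate (KV cache memory scales with this)
    n_gpu_layers=-1,   # offload all layers to GPU/Metal if a backend is available
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why edge inference reduces latency."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```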

For most edge deployments, llama.cpp or a framework built on it is the starting point.

ONNX Runtime

ONNX Runtime from Microsoft provides cross-platform inference for models in ONNX format. It's particularly strong on Windows and for deployment alongside other Microsoft tools.

Platform coverage includes Windows, macOS, Linux, Android, iOS, and web (via WebAssembly). This breadth makes it suitable for cross-platform applications.

Execution providers optimize for specific hardware. The CUDA provider accelerates on NVIDIA GPUs. The CoreML provider uses Apple's framework. The DirectML provider accelerates on Windows GPUs. The NNAPI provider uses Android's neural network API.

Model conversion from PyTorch or TensorFlow to ONNX is well-supported through standard export tools. Optimization passes improve inference efficiency after conversion.

ONNX Runtime GenAI extends the core runtime with LLM-specific features optimized for generative AI. As of version 0.11.4, it includes continuous batching for improved throughput when handling multiple concurrent requests, and CUDA Graph support for LLMs that reduces kernel launch overhead by capturing and replaying GPU operations. The library provides pre-built support for popular edge models including SmolLM3 and the gpt-oss family, simplifying deployment of these models across platforms.

TensorRT and TensorRT-LLM

NVIDIA's TensorRT provides highly optimized inference on NVIDIA GPUs. TensorRT-LLM extends this with LLM-specific optimizations.

Performance on NVIDIA hardware is typically best-in-class. TensorRT applies sophisticated optimizations including layer fusion, precision calibration, and kernel auto-tuning.

TensorRT-LLM adds features specific to LLMs: efficient attention implementations, KV cache management, batching strategies, and quantization support including FP8 on Hopper GPUs.

The TensorRT Edge-LLM SDK extends these capabilities specifically for edge deployment on NVIDIA's Jetson platform. It provides optimized inference paths designed for the constrained resources of embedded NVIDIA hardware while maintaining the performance benefits of TensorRT.

The newest Jetson T4000 (Blackwell architecture) pushes embedded GPU capability to new heights: 1200 FP4 TFLOPs of compute and 64GB of unified memory. This blurs the line between edge and datacenter—models that previously required cloud deployment can run on this embedded platform.

Platform limitation is significant—TensorRT only works on NVIDIA GPUs. For NVIDIA-equipped edge devices (Jetson series, laptops with discrete GPUs), it's an excellent choice. For other platforms, alternatives are needed.

Core ML

Apple's Core ML framework provides optimized inference on Apple devices. For iOS, macOS, and other Apple platforms, Core ML leverages the Neural Engine, GPU, and CPU efficiently.

Hardware utilization is automatic. Core ML decides whether to run on Neural Engine, GPU, or CPU based on model characteristics and device state. The Neural Engine provides excellent efficiency for supported operations.

Apple's Foundation Models framework, introduced in 2025, provides a free API for on-device language model inference without requiring developers to bring their own models. The framework runs Apple's own 3B parameter model entirely on-device, achieving approximately 30 tokens per second on iPhone 15 Pro. The model uses mixed 2-4 bit quantization with an average of 3.7 bits per weight, achieving aggressive compression while maintaining quality for common tasks like text generation, summarization, and conversation.

Traditional Core ML model conversion uses coremltools to convert from PyTorch, TensorFlow, or ONNX. LLM conversion requires care—attention mechanisms and generation loops need proper handling.

Integration with Apple ecosystem is seamless. Core ML models integrate with SwiftUI, UIKit, and other Apple frameworks. On-device training (personalization) is supported for some model types.

Platform exclusivity limits Core ML and Foundation Models to Apple devices. For iOS and macOS deployment, they're often the best choice. For cross-platform applications, they represent one target among several.

MediaPipe and TensorFlow Lite

Google's MediaPipe and TensorFlow Lite provide inference on mobile devices with a focus on Android.

TensorFlow Lite has evolved into LiteRT (Lite Runtime), continuing to serve as the standard for on-device ML on Android. It supports quantized models, GPU acceleration via delegates, and integration with Android's neural network API.

The LiteRT QNN Accelerator represents a major leap for Qualcomm-powered Android devices. By leveraging Qualcomm's NPU directly, the accelerator achieves up to 100x speedup over CPU inference and 10x speedup over GPU. This transforms what's possible on Android—models that were impractically slow become responsive.

On the latest Snapdragon 8 Elite Gen 5 chipset, this translates to over 100 tokens per second decode speed for appropriately sized models, bringing edge inference on flagship Android devices into the same league as desktop hardware.

MediaPipe builds on LiteRT to provide ready-to-use ML solutions including LLM inference. The MediaPipe LLM Inference API simplifies deploying language models on Android.

Model support in MediaPipe includes Gemma and other models converted to the required format. The abstraction handles tokenization, generation, and efficient inference.

iOS support exists but Android is the primary focus. For Android-first applications, MediaPipe provides a well-supported path.

WebLLM and Transformers.js

Browser-based inference runs models directly in the web browser, requiring no installation or native code.

WebLLM from the MLC project enables running LLMs in the browser using WebGPU acceleration. Performance has reached approximately 80% of native inference speed—a significant achievement that makes browser-based LLMs practical for real applications. Models up to 7B parameters run with usable generation speeds on capable hardware.

Recent advances in WebGPU inference have pushed performance even further. The WeInfer framework demonstrates up to 3.76x faster execution than standard WebGPU implementations through techniques like efficient memory access patterns and optimized shader compilation.

WebGPU support has expanded significantly. Chrome 113+, Edge 113+, and Safari 17+ support WebGPU on desktop. Safari 26 now supports WebGPU on iOS, opening mobile browser deployment on Apple devices. Firefox support is in development.

Transformers.js from Hugging Face brings the Transformers library to JavaScript. It supports a wide range of models and tasks, running via WebAssembly or WebGPU.

WebAssembly fallback enables Transformers.js to work in browsers without WebGPU, though at reduced performance. This provides broader compatibility at the cost of speed.

Model download happens in the browser, with models cached for subsequent use. Initial load can be slow for large models, but caching via IndexedDB makes subsequent uses fast.

MLC LLM

MLC LLM (Machine Learning Compilation for LLMs) provides universal deployment across platforms through compiler-based optimization.

The approach differs from other frameworks. MLC LLM compiles models specifically for target hardware, generating optimized code rather than interpreting a general format.

Platform support includes iOS, Android, web (via WebGPU), and various desktop and server platforms. The same model can be compiled for different targets.

Performance is competitive with platform-specific solutions while maintaining portability. The compilation approach enables hardware-specific optimizations without separate implementations.

The MLC ecosystem includes WebLLM (browser deployment) and MLCChat (mobile apps demonstrating capability).

Platform-Specific Deployment

Each platform has unique characteristics affecting how models deploy and perform.

iOS Deployment

iOS provides excellent hardware for edge AI through the Neural Engine and unified memory architecture. Deploying effectively requires understanding the platform's capabilities and constraints.

Hardware landscape varies across devices. Recent iPhones (the 14 Pro and later, including the 15 and 16 series) have powerful Neural Engines and substantial RAM (6-8GB). Older devices and base models have less capability. iPad Pro models often exceed iPhone capability.

Memory management on iOS requires care. The system can terminate apps using too much memory. LLMs must compete with other apps and system processes for limited RAM. Models around 2-3B parameters (1-2GB quantized) work reliably; larger models risk termination under memory pressure.

Apple's Foundation Models framework offers the simplest path for iOS LLM integration. The framework provides a free, on-device API to Apple's 3B parameter model without developers needing to bundle or download model weights. On iPhone 15 Pro, the API achieves approximately 30 tokens per second—fast enough for real-time conversational interfaces. The system handles all optimization, memory management, and hardware scheduling automatically.

Core ML remains the path for custom models. Convert models to Core ML format, integrate with your app, and let the system manage hardware utilization. The Neural Engine accelerates attention and feed-forward operations efficiently.

llama.cpp with Metal provides maximum flexibility. Metal acceleration runs inference on the GPU with good performance. This path offers more control than Core ML or Foundation Models but requires more integration work.

Practical recommendations for iOS: use Apple's Foundation Models for general-purpose text generation when their capabilities suffice, use Core ML for custom or specialized models, use models in the 2-3B range for reliability across devices, test on older supported devices to ensure broad compatibility, and implement memory warning handlers to release resources under pressure.

Android Deployment

Android's device diversity creates both opportunity and challenge. The range of hardware is enormous—from entry-level phones to flagship devices with desktop-class performance.

Hardware heterogeneity is the defining characteristic. Snapdragon flagship chips have capable GPUs and NPUs. MediaTek chips vary widely in ML capability. Older devices may lack meaningful acceleration. Your model must handle this range.

GPU acceleration via Vulkan provides the most consistent acceleration path. Most Android devices support Vulkan, and it offers good performance for LLM inference.

NPU utilization has improved dramatically with the LiteRT QNN Accelerator. On Qualcomm devices, this accelerator achieves up to 100x speedup over CPU and 10x over GPU by leveraging the NPU directly. The latest Snapdragon 8 Elite Gen 5 demonstrates the potential: over 100 tokens per second decode speed brings flagship Android on par with desktop inference. MediaTek's APU and Samsung's NPU offer similar potential with their respective frameworks.

MediaPipe LLM Inference provides a managed experience that handles device heterogeneity. It selects appropriate backends and model configurations based on device capability.

llama.cpp on Android works via NDK integration. The Vulkan backend provides GPU acceleration. This offers more control than MediaPipe but requires more work to handle device diversity.

Practical recommendations for Android: test on a range of devices including older ones, provide graceful degradation for less capable devices, leverage LiteRT QNN Accelerator on Qualcomm devices for best performance, use MediaPipe for simpler deployment or llama.cpp for more control, and consider model size tiers—smaller models for basic devices, larger for capable ones.

Browser Deployment

Browser deployment eliminates installation friction—users access AI features by visiting a URL. The tradeoffs are constrained resources and dependency on browser capabilities.

WebGPU is the key technology enabling browser LLM inference. It provides low-level GPU access from JavaScript, enabling compute shaders that execute model operations efficiently.

WebGPU support has expanded significantly. Chrome 113+, Edge 113+, and Safari 17+ support WebGPU on desktop. Safari 26 now supports WebGPU on iOS, opening mobile browser deployment on iPhones and iPads. Firefox support is in development. For unsupported browsers, WebAssembly provides a fallback with reduced performance.

Model loading happens over the network. Large models mean long initial loads. Caching via IndexedDB or Cache API stores models locally for subsequent visits. Consider progressive loading—start with a smaller model, upgrade to larger if the user engages.

Memory limits in browsers are less predictable than native apps. Browsers may limit memory allocation, and the limits vary by browser and system. Models in the 2-4B range are generally safe; larger models may hit limits.

Web Worker isolation runs inference off the main thread, keeping the UI responsive. This is essential—running inference on the main thread blocks the UI.

WebLLM provides the most complete solution for browser LLM deployment. It handles WebGPU utilization, model loading, caching, and inference management. Performance has reached approximately 80% of native inference speed—making browser-based LLMs practical for real applications.

Practical recommendations for browser: require WebGPU for best performance (with graceful messaging for unsupported browsers), cache models after first load, run inference in Web Workers, start with smaller models and offer larger models as options for capable devices.

Desktop Deployment

Desktop deployment offers the most resources while still avoiding cloud dependency. Laptops and desktops have more memory, faster processors, and often discrete GPUs.

Hardware capability typically exceeds mobile by significant margins. 16-32GB RAM is common, discrete GPUs are frequent, and multi-core CPUs handle compute well. Models up to 13B are practical; 30B+ is possible on high-end systems.

Integration paths vary by platform. Native applications can use llama.cpp directly, linked as a library or via subprocess. Electron apps can spawn inference processes. Cross-platform frameworks like Qt or Flutter can integrate native inference.

GPU acceleration improves performance dramatically. CUDA on NVIDIA GPUs, Metal on Apple Silicon, ROCm on AMD GPUs—each provides significant speedups over CPU inference.

Local server pattern runs inference as a local HTTP server, with the application communicating via localhost. This simplifies integration and allows updating the inference component independently.

Ollama provides a turnkey solution for desktop deployment. It manages model downloads, provides an HTTP API, and handles GPU detection automatically. Applications can use Ollama as a local backend.
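
A minimal sketch of that pattern using Ollama's local HTTP API; the model name is an example and assumes it has already been pulled with the Ollama CLI.

```python
# Query a locally running Ollama instance over its HTTP API (default port 11434).
# Assumes a model such as "llama3.2" has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Give me three uses for on-device LLMs.",
        "stream": False,   # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```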

LM Studio offers a user-friendly interface for running local models on desktop, with built-in model discovery, download, and a local server mode that applications can use.

Practical recommendations for desktop: leverage available GPU acceleration, consider Ollama or LM Studio for simpler deployment, larger models (7-13B) provide notably better quality than mobile-sized models, and balance download size against quality for model selection.

Embedded and IoT Deployment

Embedded deployment spans a huge range—from microcontrollers with minimal resources to powerful modules that rival desktops.

At the constrained end, hardware limitations are severe. Memory might be measured in hundreds of megabytes, not gigabytes. Compute capability is limited. Power budgets are tight. These constraints require the smallest models and most aggressive optimization.

Specialized hardware like NVIDIA Jetson provides a middle ground—embedded form factor with meaningful GPU capability. The Jetson Orin series supports full LLM inference with respectable performance. The newest Jetson T4000 (Blackwell architecture) represents a quantum leap: 1200 FP4 TFLOPs of compute and 64GB of unified memory enable running models previously limited to datacenters. This blurs traditional boundaries between edge and cloud capability.

The TensorRT Edge-LLM SDK optimizes specifically for Jetson deployment, providing inference paths tuned for the platform's unique characteristics—unified memory, power constraints, and thermal limitations.

Model selection must match hardware capability. SmolLM (135M-1.7B parameters), SmolLM3-3B, and Phi-4-mini with aggressive quantization serve constrained devices. The Jetson T4000 can run much larger models.

Use case focus compensates for limited capability on smaller devices. Rather than general-purpose chat, embedded LLMs might handle command classification, simple generation, or text processing. A well-tuned small model can excel at specific tasks.

TinyML approaches from the broader embedded ML field apply—quantization-aware training, architecture search for efficiency, knowledge distillation from larger models.

Practical recommendations for embedded: match model to specific use case rather than seeking general capability, use the smallest model that handles the task, consider Jetson T4000 for applications requiring larger model capability in embedded form factor, consider distilling a capable model's behavior into a smaller model fine-tuned for the specific application, and test power consumption carefully for battery-powered deployments.

Performance Optimization

Getting acceptable performance on edge devices requires attention to multiple optimization opportunities.

Model Architecture Efficiency

Model architecture affects inference efficiency independent of size.

Attention efficiency varies by implementation. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory bandwidth compared to full Multi-Head Attention. Models using these variants (Llama 3.2, Mistral) run inference more efficiently.

Feed-forward structure impacts compute. Standard FFN layers are straightforward to optimize. Mixture of Experts adds routing overhead but activates fewer parameters per token.

Context length affects memory usage. KV cache memory grows linearly with context length. On memory-constrained devices, limiting context length may be necessary. Models with efficient attention patterns (like RWKV) scale differently.
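
A rough estimate of that growth, using representative (not exact) layer and head counts for a 3B-class model with grouped-query attention:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each of shape
# [context_len, n_kv_heads * head_dim], stored at the cache precision.
# The configuration values below are representative assumptions.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# A Llama-3.2-3B-like configuration (28 layers, 8 KV heads, head dim 128):
print(f"{kv_cache_gb(28, 8, 128, 8192):.2f} GB at 8K context with an FP16 cache")
print(f"{kv_cache_gb(28, 8, 128, 8192, bytes_per_elem=1):.2f} GB with an 8-bit KV cache")
```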

Vocabulary size affects embedding memory. Larger vocabularies require more memory for embedding tables. This is typically a small fraction of total model size but matters at the margins.

Inference Optimization Techniques

Beyond choosing an efficient model, inference can be optimized through various techniques.

Batching amortizes per-inference overhead across multiple requests. For servers handling multiple users, batching is essential. For single-user edge deployment, batching is less relevant unless processing multiple documents.

Continuous batching processes requests as they arrive without waiting for batch completion. This reduces latency for individual requests while maintaining throughput benefits.

KV cache management affects memory efficiency. Efficient cache allocation, reuse across similar prompts (prompt caching), and memory mapping enable handling longer contexts.

Speculative decoding uses a small draft model to predict multiple tokens, verified by the main model. On edge devices, the draft model adds overhead that may not be offset by verification speedup—test for your specific case.
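
For intuition, here is a conceptual sketch of the greedy variant. The draft and target objects and their methods are hypothetical stand-ins, not any particular framework's API; runtimes such as llama.cpp implement this internally with batched verification.

```python
# Conceptual sketch of greedy speculative decoding (hypothetical model objects).

def speculative_step(target, draft, tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft.greedy_next(tokens + proposal))

    # 2. The target model checks all k positions at once; verify() returns the
    #    target's own greedy prediction at each proposed position.
    target_preds = target.verify(tokens, proposal)

    # 3. Keep the longest agreeing prefix, then take the target's token at the
    #    first disagreement, so every step emits at least one verified token.
    accepted = []
    for proposed, predicted in zip(proposal, target_preds):
        if proposed != predicted:
            accepted.append(predicted)
            break
        accepted.append(proposed)
    return tokens + accepted
```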

Platform-Specific Optimization

Each platform offers optimization opportunities specific to its hardware.

Apple Neural Engine optimization requires Core ML models structured for Neural Engine compatibility. Not all operations execute on the Neural Engine—incompatible operations fall back to GPU or CPU.

NVIDIA optimization leverages tensor cores for matrix operations. TensorRT applies operator fusion, quantization calibration, and kernel auto-tuning.

Memory mapping reduces load time and enables running models larger than available RAM (with performance penalties from paging).

Thread pinning and NUMA awareness matter for multi-socket systems or heterogeneous cores (like Apple's performance and efficiency cores).

Profiling and Measurement

Optimization requires measurement. Profile before optimizing to identify actual bottlenecks.

Tokens per second is the primary metric for generation performance. Measure prompt processing (prefill) and generation (decode) separately—they have different characteristics.

Time to first token matters for user experience. Users wait for this before seeing any response.
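
A simple measurement sketch using llama-cpp-python's streaming interface, where time to first token approximates prefill cost and the remaining tokens measure decode speed:

```python
# Measure time-to-first-token and decode speed with llama-cpp-python streaming.
# `llm` is a loaded llama_cpp.Llama instance (see the earlier snippet).
import time

def benchmark(llm, prompt, max_tokens=128):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _chunk in llm.create_completion(prompt, max_tokens=max_tokens, stream=True):
        if first_token_at is None:
            first_token_at = time.perf_counter()   # prefill done, first token emitted
        n_tokens += 1
    end = time.perf_counter()
    decode_time = max(end - first_token_at, 1e-9)
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"decode speed: {(n_tokens - 1) / decode_time:.1f} tokens/s")
```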

Memory usage should be tracked throughout inference—peak usage may exceed steady-state.

Power consumption matters for mobile and embedded deployment. iOS Instruments and Android Profiler provide power measurement capabilities.

Compare across models and quantization levels to find the best tradeoffs for your use case.

Common Use Cases and Patterns

Different applications suit edge AI differently. Understanding common patterns helps select appropriate approaches.

On-Device Assistants

Voice assistants and chatbots running locally provide immediate response and privacy. Siri, Google Assistant, and Alexa have moved toward on-device processing for initial understanding, with cloud escalation for complex queries.

This pattern uses small models for quick classification and simple responses, escalates to larger on-device models for moderate complexity, and reserves cloud calls for tasks requiring maximum capability.

Implementation typically involves a small model always loaded and ready, with larger models loaded on demand based on detected complexity.
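
A sketch of that tiering logic, with hypothetical model handles and a placeholder complexity heuristic standing in for whatever signal your application actually uses:

```python
# Tiered-assistant routing sketch. `small_llm`, `large_llm`, and `cloud_client`
# are hypothetical handles; the thresholds and heuristic are assumptions.

def estimate_complexity(prompt: str) -> float:
    # Placeholder heuristic: long prompts and explicit reasoning requests score higher.
    score = min(len(prompt) / 2000, 1.0)
    if "step by step" in prompt.lower():
        score = max(score, 0.8)
    return score

def route(prompt, small_llm, large_llm, cloud_client, allow_cloud=True):
    complexity = estimate_complexity(prompt)
    if complexity < 0.3:
        return small_llm.generate(prompt)    # always resident, instant response
    if complexity < 0.7 or not allow_cloud:
        return large_llm.generate(prompt)    # loaded on demand
    return cloud_client.generate(prompt)     # escalate only when needed
```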

Writing Assistance

Autocomplete, grammar correction, and writing suggestions benefit from edge deployment. Latency is critical—suggestions must appear as the user types, not seconds later.

Text completion uses small models effectively. Generating a few tokens to complete a sentence requires less capability than generating full articles.

Grammar and style checking can use classification approaches with small models or embedding-based retrieval with correction rules.

Summarization of local documents keeps content private while providing useful features.

Code Assistance

Developer tools increasingly incorporate AI assistance. On-device inference keeps code private and works offline.

Code completion at the function or line level works well with edge models. Longer completions benefit from larger models.

Documentation lookup and code search can use embeddings generated locally.

Refactoring suggestions require understanding code structure—larger models help, but focused fine-tuning can make smaller models viable.

Translation and Language

Translation benefits from edge deployment for privacy and offline capability.

On-device translation handles common language pairs well with appropriately trained models.

Phrasebooks and contextual assistance provide useful functionality with small models.

Multilingual capability varies by model—select models trained on relevant language pairs.

Document Processing

Processing documents locally maintains confidentiality while enabling AI features.

Text extraction and understanding classifies content, extracts entities, and summarizes documents.

Search over local documents uses embeddings for semantic search without sending content to the cloud.
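
A minimal sketch of local semantic search with an on-device embedding model; the model name is a common small choice, not a requirement.

```python
# Local semantic search over documents with an on-device embedding model.
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small model that runs fine on CPU

docs = [
    "Q3 revenue grew 12% year over year.",
    "The new privacy policy takes effect in March.",
    "Server maintenance is scheduled for Saturday night.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query = "When does the updated privacy policy start?"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(docs[int(scores.argmax())])   # expected: the privacy policy document
```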

Form processing extracts structured data from unstructured documents.

Production Considerations

Deploying edge AI in production requires attention to several operational aspects.

Model Updates and Versioning

Unlike cloud models that update transparently, edge models require explicit update mechanisms.

Download management handles large model files efficiently. Delta updates, compression, and background downloading minimize user impact.

Version tracking identifies which model version each user runs. This matters for debugging and capability assumptions.

Graceful rollback enables reverting if new models cause problems. Maintain ability to push updated models or switch versions remotely.

A/B testing of models requires more setup than cloud but enables measuring impact of model changes.

Monitoring and Telemetry

Edge deployment doesn't mean flying blind. Appropriate telemetry provides visibility while respecting privacy.

Performance metrics like inference speed, memory usage, and error rates help identify problems.

Quality signals like user corrections, regeneration requests, and feature abandonment indicate model quality.

Device capability distribution informs decisions about model requirements and optimization priorities.

Privacy-preserving approaches aggregate metrics without collecting content. Track that inference succeeded, not what was inferred.

Resource Management

Edge devices serve many purposes beyond AI inference. Resource management ensures AI features don't degrade device experience.

Memory limits should be respected. Monitor for memory pressure signals and release resources when needed.

Power awareness avoids excessive battery drain. Reduce inference frequency or capability when battery is low.

Thermal management matters for sustained inference. Devices throttle when hot, degrading performance.

Background versus foreground prioritization gives user-facing requests priority over background processing.

Error Handling and Fallback

Edge inference can fail—out of memory, unsupported hardware, corrupted models. Robust error handling maintains functionality.

Graceful degradation provides reduced functionality rather than complete failure. If the large model can't load, try a smaller one.

Cloud fallback escalates to cloud inference when local fails, if appropriate for your use case and privacy requirements.
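
A sketch of that degradation chain, with hypothetical loader and cloud-client functions standing in for your actual integration:

```python
# Graceful degradation sketch: try the preferred local model, fall back to smaller
# ones, and only then (optionally) to the cloud. Loader and cloud functions plus
# the model filenames are hypothetical placeholders.

def load_best_available_model():
    for model_file in ("model-7b-q4.gguf", "model-3b-q4.gguf", "model-1b-q4.gguf"):
        try:
            return load_local_model(model_file)   # hypothetical loader
        except (MemoryError, OSError):
            continue                              # OOM, missing, or corrupted file
    return None

def generate(prompt, cloud_allowed=False):
    model = load_best_available_model()
    if model is not None:
        return model.generate(prompt)
    if cloud_allowed:
        return cloud_generate(prompt)             # hypothetical cloud fallback
    raise RuntimeError("On-device inference unavailable; inform the user clearly.")
```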

User communication explains limitations clearly. "Running on device for privacy" sets appropriate expectations.

Retry logic handles transient failures without frustrating users.


