Multimodal LLMs: Vision, Audio, and Beyond
A comprehensive guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.
The Multimodal Revolution
LLMs are no longer text-only. Modern multimodal models process images, videos, audio, and text in unified architectures, enabling capabilities that were science fiction just two years ago.
From research: "Models have become smaller yet more powerful, with the rise of new architectures and capabilities including reasoning, agency, and long video understanding. Entirely new paradigms such as multimodal Retrieval Augmented Generation (RAG) and multimodal agents have taken shape."
This post provides a comprehensive guide to multimodal LLMs in 2025—architectures, major models, implementation patterns, and production deployment.
How Multimodal Models Work
The Core Architecture
Most multimodal models follow a similar pattern:
[Non-text input] → Encoder → Projection → LLM Backbone → [Output]
Components:
- Modality Encoder: Converts raw input (image, audio, video) into embeddings
  - Vision: ViT (Vision Transformer), SigLIP, EVA-CLIP
  - Audio: Whisper encoder, wav2vec
  - Video: Frame sampling + vision encoder
- Projection Layer: Aligns modality embeddings with the text embedding space
  - MLP projector (simple, fast)
  - Cross-attention (more expressive)
  - Q-Former (query-based alignment)
- LLM Backbone: Processes unified embeddings and generates output
  - Standard transformer decoder
  - Receives interleaved text and modality tokens
Vision-Language Architecture Example
Understanding the architecture helps you make better decisions about which model to use and how to optimize inference. The code below shows the conceptual flow—production implementations add many optimizations but follow the same pattern.
# Simplified VLM forward pass (conceptual; ViT, MLP, and LLM are placeholders)
import torch

class VisionLanguageModel:
    def __init__(self):
        self.vision_encoder = ViT()          # e.g., SigLIP
        self.projection = MLP(vision_dim=1024, text_dim=4096)
        self.llm = LLM()                     # e.g., Qwen, Llama

    def forward(self, image, text_tokens):
        # Encode image into patch embeddings
        image_features = self.vision_encoder(image)         # [1, num_patches, 1024]

        # Project into the LLM's embedding space
        image_embeddings = self.projection(image_features)  # [1, num_patches, 4096]

        # Embed the text tokens
        text_embeddings = self.llm.embed(text_tokens)        # [1, seq_len, 4096]

        # Concatenate: [image_tokens, text_tokens]
        combined = torch.cat([image_embeddings, text_embeddings], dim=1)

        # Generate the response from the combined sequence
        output = self.llm.generate(combined)
        return output
Understanding the key steps:
Vision encoding: The ViT (Vision Transformer) divides the image into patches (typically 14x14 or 16x16 pixels) and encodes each patch into a vector. A 224x224 image with 14x14 patches produces 256 patch embeddings. Higher resolution images produce more patches and thus more tokens.
Projection alignment: The vision encoder produces embeddings in its own space (e.g., 1024 dimensions for SigLIP). The LLM expects embeddings in its space (e.g., 4096 dimensions for Qwen). The projection layer learns to translate between these spaces. Simple MLPs work surprisingly well—cross-attention projectors can be more expressive but add latency.
Concatenation strategy: Image tokens are typically prepended to text tokens, so the LLM "sees" the image first and can reference it when processing the question. Some architectures interleave image and text tokens for finer-grained reference. The concatenation point affects both capability and efficiency.
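To sanity-check the patch arithmetic, here is a tiny sketch that computes patch counts for a few resolutions. It ignores model-specific resizing, tiling, and token merging, so treat the numbers as rough orders of magnitude rather than exact token bills.

```python
# Rough patch count for a ViT-style encoder (ignores model-specific resizing,
# tiling, and token merging, so these are order-of-magnitude figures).
def vit_patch_count(width: int, height: int, patch_size: int = 14) -> int:
    return (width // patch_size) * (height // patch_size)

print(vit_patch_count(224, 224))    # 256 patches, as in the example above
print(vit_patch_count(1024, 1024))  # ~5,300 patches before any merging
```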
Token Costs for Multimodal
Understanding token consumption is critical for cost management.
Why image tokens dominate costs: A text prompt might use 100-500 tokens. A single high-resolution image can use 1,000-2,000 tokens. If you're processing documents with many images, the image tokens quickly dominate your costs. This is why resolution controls and smart image preprocessing matter so much for production multimodal applications.
The resolution-quality tradeoff: Higher resolution means more patches, which means more tokens and higher costs. But lower resolution means losing fine details—you can't read small text or identify small objects. Most APIs let you control resolution: use lower resolution for general scene understanding, higher resolution for document OCR or detailed analysis.
| Model | Image Token Cost | Notes |
|---|---|---|
| GPT-4o | 85-1700 tokens/image | Depends on resolution |
| Claude | ~1,000 tokens/image | Fixed cost |
| Gemini | ~258 tokens/image | Efficient encoding |
| Qwen-VL | Variable | Dynamic resolution |
Video cost: tokens = frames × tokens_per_frame
A 1-minute video at 1 FPS = 60 frames × ~1000 tokens = ~60,000 tokens.
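To make these budgets concrete, here is a rough estimator built from the figures above. The per-image and per-frame defaults are the ballpark values from the table, not exact API accounting.

```python
# Back-of-the-envelope multimodal token budget (rough figures from the table
# above; real APIs compute image tokens from resolution and tiling).
def estimate_tokens(text_tokens: int,
                    num_images: int = 0,
                    tokens_per_image: int = 1000,
                    video_seconds: float = 0,
                    fps: float = 1.0,
                    tokens_per_frame: int = 1000) -> int:
    frames = int(video_seconds * fps)
    return text_tokens + num_images * tokens_per_image + frames * tokens_per_frame

# A 300-token prompt with one image and a 1-minute video at 1 FPS:
print(estimate_tokens(300, num_images=1, video_seconds=60))  # ~61,300 tokens
```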
Types of Multimodal Models
Vision-Language Models (VLMs)
The most mature category: models understanding both images and text.
Capabilities:
- Image captioning and description
- Visual question answering (VQA)
- Document and chart understanding
- OCR and text extraction
- GUI understanding for agents
- Object detection and localization
- Image-based reasoning
Key architectures:
| Architecture | Examples | Approach |
|---|---|---|
| Encoder-Decoder | Qwen-VL, LLaVA | Separate vision encoder + LLM |
| Native Multimodal | GPT-4o, Gemini | Unified from pretraining |
| Dual-Encoder | CLIP, SigLIP | Contrastive learning |
Audio-Language Models
Models processing speech and audio alongside text:
Capabilities:
- Automatic speech recognition (ASR)
- Audio understanding (music, environmental sounds)
- Real-time voice conversation
- Speech synthesis / TTS
- Speaker identification
- Emotion detection from voice
Architecture:
[Audio] → Audio Encoder (Whisper) → Projection → LLM → [Text/Speech]
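To make the diagram concrete, here is a minimal sketch of the encoder-plus-projection half of that pipeline using Hugging Face's Whisper model as the audio encoder. The projection dimension (4096) and the downstream LLM are placeholders rather than any specific model's actual wiring.

```python
# Minimal audio-encoding sketch: Whisper encoder -> projection into an LLM's
# embedding space. The 4096-dim projection is a placeholder assumption.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

def encode_audio(waveform, sampling_rate: int = 16000) -> torch.Tensor:
    """Turn raw audio into encoder embeddings (the 'Audio Encoder' box above)."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        encoder_out = whisper.encoder(inputs.input_features)
    return encoder_out.last_hidden_state  # [1, frames, hidden_dim]

# A learned projection maps audio embeddings into the LLM's token space,
# exactly as the MLP projector does for images in the vision example.
projection = torch.nn.Linear(whisper.config.d_model, 4096)
```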
Video-Language Models
Extended understanding across temporal sequences:
From research: "Qwen2.5-VL can comprehend videos of over 1 hour."
Capabilities:
- Video summarization
- Temporal reasoning (what happened before/after)
- Action recognition
- Long-form video Q&A
- Event detection
- Video captioning
Challenges:
- Token explosion (many frames)
- Temporal coherence
- Long-range dependencies
- Efficient sampling strategies
Any-to-Any Models
The frontier: models taking any modality input and generating any modality output.
From research: "Any-to-any models, as the name suggests, are models that can take in any modality and output any modality (image, text, audio)."
Examples:
- Janus-Pro: Image understanding AND image generation
- Qwen2.5-Omni: Text, image, audio, video → text, speech
- GPT-4o: Native multimodal in/out (limited generation)
Major Multimodal Models
Qwen VL Series (Alibaba)
The leading open-source multimodal family.
Qwen2.5-VL
From research: "Qwen 2.5 VL integrates a vision transformer with a language model, enabling advanced image and text understanding capabilities. It can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."
Specifications:
| Variant | Parameters | Context | VRAM Required |
|---|---|---|---|
| Qwen2.5-VL-3B | 3B | 32K | ~8 GB (FP16) |
| Qwen2.5-VL-7B | 7B | 32K | ~16 GB (FP16) |
| Qwen2.5-VL-72B | 72B | 32K | ~40 GB (INT4) |
Benchmarks (72B):
| Benchmark | Score | Description |
|---|---|---|
| MMMU | 70.2 | Multimodal understanding |
| MathVista | 74.8 | Visual math reasoning |
| MMStar | 70.8 | Multi-image reasoning |
| DocVQA | 96.4 | Document understanding |
| ChartQA | 88.3 | Chart comprehension |
Strengths:
- Best open-source VLM quality
- Native video understanding (1+ hour)
- Dynamic resolution (handles any image size)
- Strong multilingual vision-language
- Visual agent capabilities
Qwen3-VL
From research: "Qwen3-VL features text–vision fusion for unified comprehension, Interleaved-MRoPE for enhanced long-horizon video reasoning, and DeepStack technology that fuses multi-level ViT features to capture fine-grained details."
Key innovations:
- DeepStack: Multi-level feature fusion
- Interleaved-MRoPE: Better positional encoding for video
- Improved long-context visual reasoning
Qwen2.5-Omni
From research: "An end-to-end multimodal model designed for comprehensive multimodal perception, seamlessly processing text, images, audio, and video, while delivering real-time streaming responses through both text generation and natural speech synthesis."
Capabilities:
- Input: Text, image, audio, video
- Output: Text, natural speech
- Real-time streaming
- 7B parameters
Janus-Pro (DeepSeek)
Unified understanding AND generation.
From research: "Janus-Pro-7B, introduced by DeepSeek AI, is a unified multimodal model that excels in both understanding and generating content across modalities. It features a decoupled visual encoding architecture, separating the processes for understanding and generation."
Architecture Innovation:
Understanding path: Image → Understanding Encoder → LLM
Generation path: LLM → Generation Encoder → Image
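Conceptually, the decoupling looks like the sketch below. The class and method names are illustrative only, not DeepSeek's actual implementation.

```python
# Conceptual sketch of a decoupled understanding/generation design (names are
# illustrative placeholders, not Janus-Pro's real code).
class DecoupledMultimodalModel:
    def __init__(self, understanding_encoder, generation_tokenizer, llm, image_decoder):
        self.und_encoder = understanding_encoder   # e.g., a SigLIP-style ViT
        self.gen_tokenizer = generation_tokenizer  # e.g., a VQ-style discrete image tokenizer
        self.llm = llm
        self.image_decoder = image_decoder         # turns predicted image tokens into pixels

    def understand(self, image, prompt_tokens):
        # Understanding path: continuous vision features -> LLM -> text
        vision_feats = self.und_encoder(image)
        return self.llm.generate_text(vision_feats, prompt_tokens)   # placeholder method

    def generate_image(self, prompt_tokens):
        # Generation path: LLM predicts discrete image tokens -> decoder -> pixels
        image_tokens = self.llm.generate_image_tokens(prompt_tokens)  # placeholder method
        return self.image_decoder(image_tokens)
```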
Key achievement: From research: "This model generates images and beats OpenAI's DALL-E 3 and Stable Diffusion across multiple benchmarks."
Benchmarks:
| Task | Janus-Pro-7B | DALL-E 3 | SD-XL |
|---|---|---|---|
| GenEval | 0.80 | 0.67 | 0.55 |
| DPG-Bench | 84.2 | 83.5 | 74.7 |
Why it matters: First open model competitive with proprietary image generators while also understanding images.
MiniCPM-o 2.6
Compact but powerful multimodal.
From research: "MiniCPM-o 2.6 is an 8B parameter multimodal model capable of understanding and generating content across vision, speech, and language modalities."
Architecture: From research: "An architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model boasts a total of 8 billion parameters."
| Component | Size | Function |
|---|---|---|
| SigLIP | 400M | Vision encoding |
| Whisper-medium | 300M | Audio encoding |
| ChatTTS | 200M | Speech synthesis |
| Qwen2.5 | 7B | Language backbone |
Features:
- Real-time speech conversation
- Multimodal streaming support
- ~5.5 GB model size
- 32K context window
- Runs on consumer GPUs
Best for: Edge deployment, real-time applications, resource-constrained environments.
GPT-4V / GPT-4o (OpenAI)
The commercial benchmark.
GPT-4V:
- First frontier vision-language model
- Strong at complex reasoning over images
- Document and chart understanding
- Released March 2023
GPT-4o:
- Native multimodal (trained from scratch with all modalities)
- Real-time voice mode
- 128K context window
- Faster inference
- Released May 2024
Strengths:
- Best overall quality
- Excellent instruction following
- Strong safety alignment
- Comprehensive documentation
Limitations:
- Proprietary, API-only
- High cost at scale
- No fine-tuning
- No on-premise deployment
Claude Vision (Anthropic)
From research: "Claude Sonnet 4.5 / Opus 4.5 is recommended when conservative, audit-friendly tool use and stable long agents matter more than maximum raw scores."
Strengths:
- Computer use (screen understanding)
- Document analysis
- Structured output from images
- Safety and honesty focus
- Long context (200K tokens)
Best for: Enterprise applications, document processing, computer use agents.
Gemini (Google)
From research: "Unlike other LLMs, Gemini was designed to be multimodal, meaning it could process multiple types of data simultaneously, including text, images, audio, video, and computer code."
Gemini 3 (November 2025): Google's most intelligent multimodal model family:
- Gemini 3 Pro (November 18, 2025): State-of-the-art reasoning for complex problems
- Gemini 3 Flash (December 17, 2025): Frontier intelligence at 3x speed of 2.5 Pro
From Google: "Gemini 3 Flash delivers frontier performance on GPQA Diamond (90.4%) and Humanity's Last Exam (33.7%), rivaling larger frontier models while using 30% fewer tokens on average than 2.5 Pro."
| Model | GPQA Diamond | SWE-bench | Speed vs 2.5 Pro |
|---|---|---|---|
| Gemini 3 Flash | 90.4% | 78% | 3x faster |
| Gemini 3 Pro | 88.1% | 72% | 1.5x faster |
Gemini 3 Deep Think: Enhanced reasoning mode (safety evaluation in progress).
Gemini 2.5 Pro:
- 1 million token context window
- Native multimodality
- "Thinking model" with step-by-step reasoning
- Strong video understanding
Best for: Very long contexts, video analysis, Google Cloud integration.
InternVL3
Strong open-source alternative.
From research: "InternVL3-78B excels in multimodal perception and reasoning with enhanced capabilities including tool usage, GUI agents, industrial image analysis, and 3D vision perception. It achieves a score of 72.2 on the MMMU benchmark."
Specialties:
- Industrial image analysis
- 3D vision perception
- GUI agent capabilities
- Tool usage
GLM-4.6V (Zhipu AI)
From research: "GLM-4.6V is the latest open-source multimodal model featuring native multimodal tool use, stronger visual reasoning, and a 128K context window. With a 128K context window, GLM-4.6V can handle high-information-density inputs, such as multi-document financial reports, research papers, 200-page presentation decks, and hour-long videos."
Strengths:
- 128K context window
- Native tool use
- Long document processing
- Hour-long video understanding
Comprehensive Comparison
Benchmark Comparison
| Model | MMMU | MathVista | DocVQA | TextVQA | Open |
|---|---|---|---|---|---|
| GPT-4o | 69.1 | 63.8 | 92.8 | 77.4 | No |
| Gemini 2.5 Pro | 72.4 | 73.4 | 94.2 | 78.6 | No |
| Claude 3.5 Sonnet | 68.3 | 67.7 | 95.2 | 77.7 | No |
| Qwen2.5-VL-72B | 70.2 | 74.8 | 96.4 | 84.9 | Yes |
| InternVL3-78B | 72.2 | 70.1 | 95.8 | 83.4 | Yes |
| Qwen2.5-VL-7B | 62.0 | 67.5 | 94.5 | 81.3 | Yes |
Capability Matrix
| Model | Params | Vision | Audio | Video | Generate | Open |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-72B | 72B | ✅ | ❌ | ✅ | ❌ | ✅ |
| Qwen2.5-Omni | 7B | ✅ | ✅ | ✅ | Speech | ✅ |
| Qwen3-VL-235B | 235B | ✅ | ❌ | ✅ | ❌ | ✅ |
| Janus-Pro-7B | 7B | ✅ | ❌ | ❌ | Image | ✅ |
| MiniCPM-o 2.6 | 8B | ✅ | ✅ | ✅ | Speech | ✅ |
| GPT-4o | ? | ✅ | ✅ | ✅ | ❌ | ❌ |
| Claude 4.5 | ? | ✅ | ❌ | ❌ | ❌ | ❌ |
| Gemini 3 Flash | ? | ✅ | ✅ | ✅ | ❌ | ❌ |
| Gemini 2.5 | ? | ✅ | ✅ | ✅ | ❌ | ❌ |
| InternVL3-78B | 78B | ✅ | ❌ | ✅ | ❌ | ✅ |
Implementation Patterns
Image Understanding with OpenAI
from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def analyze_image(
    image_path: str,
    question: str,
    detail: str = "high"  # "low", "high", or "auto"
) -> str:
    """Analyze image with GPT-4o."""
    image_data = encode_image(image_path)
    extension = Path(image_path).suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }.get(extension, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{media_type};base64,{image_data}",
                        "detail": detail
                    }
                }
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Examples
result = analyze_image("invoice.png", "Extract all line items with prices")
result = analyze_image("chart.png", "What trends does this chart show?")
result = analyze_image("screenshot.png", "Describe the UI elements")
Multiple Images
def compare_images(image_paths: list[str], question: str) -> str:
    """Compare multiple images."""
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image(path)}",
                "detail": "high"
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content

# Example: Compare before/after
result = compare_images(
    ["before.png", "after.png"],
    "What changed between these two images?"
)
Structured Output from Images
import json
from pydantic import BaseModel
from typing import List

class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class ExtractedInvoice(BaseModel):
    vendor: str
    invoice_number: str
    date: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

def extract_invoice(image_path: str) -> ExtractedInvoice:
    """Extract structured data from invoice image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all information from this invoice as structured JSON."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
                }
            ]
        }],
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return ExtractedInvoice(**data)
Video Understanding with Qwen-VL
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

class QwenVideoAnalyzer:
    # Note: recent transformers releases expose a dedicated Qwen2_5_VL* class and
    # separate video preprocessing helpers for Qwen2.5-VL checkpoints; adjust the
    # class and preprocessing to match your installed version.
    def __init__(self, model_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            attn_implementation="flash_attention_2"
        )
        self.processor = AutoProcessor.from_pretrained(model_name)

    def analyze_video(
        self,
        video_path: str,
        question: str,
        fps: float = 1.0,
        max_frames: int = 64
    ) -> str:
        """Analyze video content."""
        messages = [{
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "fps": fps,
                    "max_frames": max_frames
                },
                {"type": "text", "text": question}
            ]
        }]

        text = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.processor(
            text=text,
            videos=[video_path],
            return_tensors="pt"
        ).to(self.model.device)

        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.7
            )
        return self.processor.batch_decode(
            output_ids[:, inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )[0]

# Usage
analyzer = QwenVideoAnalyzer()
summary = analyzer.analyze_video(
    "meeting.mp4",
    "Summarize the key points discussed in this meeting"
)
Multimodal RAG
import os
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

class MultimodalRAG:
    def __init__(self):
        self.mm_llm = OpenAIMultiModal(
            model="gpt-4o",
            max_new_tokens=1000
        )

    def load_documents(self, directory: str):
        """Load text and image documents."""
        # Load text documents
        text_docs = SimpleDirectoryReader(
            directory,
            required_exts=[".txt", ".pdf", ".md"]
        ).load_data()

        # Load images separately
        image_docs = []
        for filename in os.listdir(directory):
            if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(directory, filename)
                image_docs.append(ImageDocument(image_path=image_path))

        return text_docs, image_docs

    def query_with_images(
        self,
        query: str,
        text_context: str,
        images: list[ImageDocument]
    ) -> str:
        """Query with both text and image context."""
        prompt = f"""Based on the following context and images, answer the question.

Text Context:
{text_context}

Question: {query}"""

        response = self.mm_llm.complete(
            prompt=prompt,
            image_documents=images
        )
        return response.text

# Usage
rag = MultimodalRAG()
text_docs, images = rag.load_documents("./reports/")
answer = rag.query_with_images(
    "What trends do the charts show?",
    "Q3 2024 sales report...",
    images[:3]  # First 3 images
)
Real-Time Audio with MiniCPM-o
# Conceptual example - check MiniCPM-o docs for exact API
from minicpm_o import MiniCPMO
import sounddevice as sd
import numpy as np

class RealtimeVoiceAssistant:
    def __init__(self):
        self.model = MiniCPMO.from_pretrained("openbmb/MiniCPM-o-2.6")
        self.sample_rate = 16000

    def process_audio_stream(self, audio_chunk: np.ndarray) -> str:
        """Process audio and get text response."""
        # Model handles speech-to-text internally
        response = self.model.generate(
            audio=audio_chunk,
            sample_rate=self.sample_rate,
            output_type="text"
        )
        return response

    def generate_speech(self, text: str) -> np.ndarray:
        """Generate speech from text."""
        audio = self.model.synthesize_speech(text)
        return audio

    def run_conversation(self):
        """Run real-time conversation loop."""
        print("Listening... (Ctrl+C to stop)")
        while True:
            # Record a 3-second audio chunk
            audio = sd.rec(
                int(3 * self.sample_rate),
                samplerate=self.sample_rate,
                channels=1
            )
            sd.wait()

            # Process and respond
            response = self.process_audio_stream(audio.flatten())
            print(f"Response: {response}")

            # Speak response
            audio_response = self.generate_speech(response)
            sd.play(audio_response, self.sample_rate)
            sd.wait()
Production Deployment
Hardware Requirements
| Model | VRAM (FP16) | VRAM (INT4) | Inference Speed |
|---|---|---|---|
| Qwen2.5-VL-3B | 8 GB | 3 GB | Fast |
| Qwen2.5-VL-7B | 16 GB | 6 GB | Medium |
| MiniCPM-o 2.6 | 12 GB | 5.5 GB | Fast |
| Janus-Pro-7B | 16 GB | 6 GB | Medium |
| Qwen2.5-VL-72B | 150 GB | 40 GB | Slow |
| InternVL3-78B | 160 GB | 45 GB | Slow |
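A rough way to sanity-check these figures: weights alone need roughly parameters × bytes-per-parameter, plus overhead for the vision tower, KV cache, and activations. The sketch below applies that rule of thumb; the 20% overhead factor is an assumed ballpark, not a measured value.

```python
# Rough VRAM rule of thumb: weights plus ~20% overhead for the vision tower,
# KV cache, and activations (the overhead factor is an assumed ballpark).
def estimate_vram_gb(params_billion: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_param / 8  # GB needed for weights alone
    return weight_gb * overhead

print(round(estimate_vram_gb(7, 16), 1))   # ~16.8 GB, in line with the FP16 column
print(round(estimate_vram_gb(72, 4), 1))   # ~43.2 GB, in line with the INT4 column
```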
Serving with vLLM
from vllm import LLM, SamplingParams
from PIL import Image

# Vision-language model serving
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=32768
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512
)

# Prompt with an image placeholder; the exact placeholder tokens must match
# the model's chat template (check the vLLM docs for your model and version).
prompt = "<image>\nDescribe this image in detail."

# Recent vLLM versions accept a PIL image directly under multi_modal_data;
# older releases used wrapper types such as ImagePixelData.
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {
            "image": Image.open("image.png")
        }
    }],
    sampling_params
)
Optimization Strategies
Image preprocessing:
def optimize_image_for_vlm(image_path: str, max_size: int = 1024) -> bytes:
    """Resize and compress image for efficient VLM processing."""
    from PIL import Image
    import io

    img = Image.open(image_path)

    # Resize if too large
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Convert to RGB if needed
    if img.mode != 'RGB':
        img = img.convert('RGB')

    # Compress
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return buffer.getvalue()
Video frame sampling:
def sample_video_frames(
    video_path: str,
    target_frames: int = 32,
    strategy: str = "uniform"
) -> list:
    """Sample frames from video efficiently."""
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    if strategy == "uniform":
        # Evenly spaced frames
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
    elif strategy == "keyframe":
        # Scene detection; detect_keyframes is an application-supplied helper
        indices = detect_keyframes(video_path, target_frames)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames
Cost Optimization
class CostAwareMultimodalProcessor:
    def __init__(self):
        self.token_costs = {
            "gpt-4o": {"input": 5.0, "output": 15.0},  # USD per 1M tokens
            "claude-3-5-sonnet": {"input": 3.0, "output": 15.0}
        }

    def estimate_image_cost(
        self,
        image_path: str,
        model: str = "gpt-4o",
        detail: str = "high"
    ) -> tuple[float, int]:
        """Estimate (cost, tokens) for analyzing one image. Approximate only."""
        import math
        from PIL import Image

        img = Image.open(image_path)
        width, height = img.size

        if detail == "low":
            tokens = 85
        else:
            # High-detail approximation: 512x512 tiles plus a base cost.
            # Real APIs also rescale images first, so treat this as an estimate.
            tiles = math.ceil(width / 512) * math.ceil(height / 512)
            tokens = 85 + (tiles * 170)

        cost = (tokens / 1_000_000) * self.token_costs[model]["input"]
        return cost, tokens

    def batch_process_efficiently(
        self,
        images: list[str],
        questions: list[str],
        budget: float
    ) -> list[dict]:
        """Process images within a budget, dropping to low detail when needed."""
        results = []
        spent = 0.0

        for image, question in zip(images, questions):
            detail = "high"
            cost, tokens = self.estimate_image_cost(image, detail=detail)

            if spent + cost > budget:
                # Fall back to low detail; skip the image if even that busts the budget
                detail = "low"
                cost, tokens = self.estimate_image_cost(image, detail=detail)
                if spent + cost > budget:
                    continue

            # process_image is the application's own call to the chosen vision API
            result = self.process_image(image, question, detail=detail)
            results.append(result)
            spent += cost

        return results
Use Cases
Document Intelligence
Invoice and receipt processing:
- Extract line items, totals, vendor info
- Validate against expected formats
- Flag anomalies (see the validation sketch after this list)
Contract analysis:
- Identify key clauses
- Extract dates, parties, obligations
- Compare across documents
Form extraction:
- OCR + structure understanding
- Handle handwritten fields
- Multi-page documents
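Building on the ExtractedInvoice model from the implementation section, a minimal validation pass for the invoice workflow might look like the following. The tolerance and checks are illustrative, not a production policy.

```python
# Minimal post-extraction validation for the ExtractedInvoice model defined
# earlier; the tolerance and rules are illustrative, not a production policy.
def validate_invoice(inv: "ExtractedInvoice", tolerance: float = 0.01) -> list[str]:
    issues = []

    # Line items should sum to the subtotal
    items_total = sum(item.total for item in inv.items)
    if abs(items_total - inv.subtotal) > tolerance:
        issues.append(f"Line items sum to {items_total:.2f}, subtotal says {inv.subtotal:.2f}")

    # Subtotal + tax should equal the grand total
    if abs(inv.subtotal + inv.tax - inv.total) > tolerance:
        issues.append("Subtotal + tax does not match total")

    # Per-item consistency: quantity * unit_price should match the line total
    for item in inv.items:
        if abs(item.quantity * item.unit_price - item.total) > tolerance:
            issues.append(f"Inconsistent math on line item: {item.description}")

    return issues  # an empty list means nothing to flag

# issues = validate_invoice(extract_invoice("invoice.png"))
```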
Visual Agents
From research: "Qwen2.5-VL can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."
Computer use (see the conceptual agent loop after this list):
- Screen understanding
- UI navigation
- Task automation
- Testing
Phone use:
- Mobile app interaction
- Accessibility assistance
- Automated workflows
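Conceptually, a visual agent runs a screenshot, reason, act loop. The sketch below shows that loop with placeholder helpers: take_screenshot, parse_action, and execute_action are hypothetical, not any specific framework's API.

```python
# Conceptual screenshot -> reason -> act loop for a GUI agent; all helper
# functions here are hypothetical placeholders, not a specific framework's API.
def run_gui_agent(vlm, task: str, max_steps: int = 20):
    for step in range(max_steps):
        screenshot = take_screenshot()            # capture the current screen
        response = vlm.ask(
            image=screenshot,
            prompt=f"Task: {task}\nDecide the next UI action "
                   f"(click/type/scroll/done) and its target."
        )
        action = parse_action(response)           # e.g., {"type": "click", "x": 412, "y": 88}
        if action["type"] == "done":
            return action.get("summary", "task complete")
        execute_action(action)                    # drive the mouse/keyboard or device
    return "step limit reached"
```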
Video Analysis
Meeting summarization:
- Key points extraction
- Action item identification
- Speaker attribution
Content moderation:
- Policy violation detection
- Age-appropriate filtering
- Brand safety
Security monitoring:
- Anomaly detection
- Event recognition
- Alert generation
Real-Time Interaction
Voice assistants with vision:
- Describe surroundings
- Read documents aloud
- Visual Q&A
Customer service:
- Screen sharing analysis
- Visual troubleshooting
- Product identification
Emerging Trends
Multimodal Reasoning
From research: "Until 2025, there was only one open-source multimodal reasoning model, QVQ-72B-preview by Qwen."
Emerging capabilities:
- Mathematical diagram solving
- Scientific figure analysis
- Complex visual puzzles
- Step-by-step visual reasoning
Unified Generation
Models like Janus-Pro that both understand AND generate across modalities—closing the loop between perception and creation.
Real-Time Streaming
From MiniCPM-o: Real-time speech conversation and multimodal streaming support, enabling fluid human-AI interaction.
World Models
Moving toward models that understand physical world dynamics, cause-and-effect, and can simulate outcomes.
Open vs Proprietary
From research: "Proprietary models (GPT-4o, Gemini, Claude) lead in absolute performance but lock you into API pricing and prevent fine-tuning. Use them when accuracy justifies costs, and you don't need data privacy or custom adaptations."
From research: "Open-source models like Qwen2.5-VL-72B and InternVL3-78B now perform within 5-10% of proprietary models. You control deployment, fine-tune on proprietary data, and eliminate per-call costs at scale."
| Factor | Proprietary | Open-Source |
|---|---|---|
| Quality | Best | 5-10% behind |
| Cost at scale | High | Low |
| Privacy | Data leaves | Data stays |
| Fine-tuning | No | Yes |
| Latency | Network | Local possible |
| Support | Official | Community |
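For the "per-call costs at scale" point, here is a rough break-even calculation. The API figures come from the tables earlier in this post ($5 per 1M input tokens for GPT-4o, roughly 1,000 tokens per image); the GPU rental rate and throughput are assumptions for illustration only.

```python
# Rough API-vs-self-hosted break-even; the GPU rate and throughput are
# hypothetical assumptions, the API figures come from the tables above.
api_cost_per_image = 1000 / 1_000_000 * 5.0      # ~$0.005 per image (input tokens only)

gpu_hourly_rate = 2.0                            # assumed $/hour for a rented GPU
images_per_hour = 2000                           # assumed self-hosted throughput
self_hosted_cost_per_image = gpu_hourly_rate / images_per_hour  # $0.001

break_even_images_per_hour = gpu_hourly_rate / api_cost_per_image
print(f"Self-hosting wins above ~{break_even_images_per_hour:.0f} images/hour")  # ~400
```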
Conclusion
Multimodal LLMs have evolved from text-with-images to unified perception systems:
- Qwen VL series leads open-source vision-language
- Janus-Pro pioneers unified understanding + generation
- MiniCPM-o proves multimodal can be efficient (~8B params)
- GPT-4o/Gemini set commercial benchmarks
- Any-to-any models are the emerging frontier
Recommendations:
- Reliability first: GPT-4o or Claude
- Open-source quality: Qwen2.5-VL-72B
- Efficient/edge: MiniCPM-o 2.6 or Qwen2.5-VL-7B
- Image generation: Janus-Pro-7B
- Video understanding: Qwen2.5-VL or Gemini
Start with proprietary for prototyping, evaluate open models for production cost and privacy.
Related Articles
Video Generation AI 2025: Sora 2 vs Veo 3 vs Runway Complete Guide
A comprehensive guide to AI video generation in 2025—Sora 2, Veo 3, Runway Gen-4, Kling, and more. Capabilities, pricing, API access, and practical implementation.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.