
Multimodal LLMs: Vision, Audio, and Beyond

A comprehensive guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.

12 min read

The Multimodal Revolution

LLMs are no longer text-only. Modern multimodal models process images, videos, audio, and text in unified architectures, enabling capabilities that were science fiction just two years ago.

From research: "Models have become smaller yet more powerful, with the rise of new architectures and capabilities including reasoning, agency, and long video understanding. Entirely new paradigms such as multimodal Retrieval Augmented Generation (RAG) and multimodal agents have taken shape."

This post provides a comprehensive guide to multimodal LLMs in 2025—architectures, major models, implementation patterns, and production deployment.

How Multimodal Models Work

The Core Architecture

Most multimodal models follow a similar pattern:

Code
[Non-text input] → Encoder → Projection → LLM Backbone → [Output]

Components:

  1. Modality Encoder: Converts raw input (image, audio, video) into embeddings

    • Vision: ViT (Vision Transformer), SigLIP, EVA-CLIP
    • Audio: Whisper encoder, wav2vec
    • Video: Frame sampling + vision encoder
  2. Projection Layer: Aligns modality embeddings with text embedding space

    • MLP projector (simple, fast)
    • Cross-attention (more expressive)
    • Q-Former (query-based alignment)
  3. LLM Backbone: Processes unified embeddings and generates output

    • Standard transformer decoder
    • Receives interleaved text and modality tokens

Vision-Language Architecture Example

Understanding the architecture helps you make better decisions about which model to use and how to optimize inference. The code below shows the conceptual flow—production implementations add many optimizations but follow the same pattern.

Python
# Simplified VLM forward pass (conceptual: ViT, MLP, and LLM stand in for real implementations)
import torch

class VisionLanguageModel:
    def __init__(self):
        self.vision_encoder = ViT()  # e.g., SigLIP
        self.projection = MLP(vision_dim=1024, text_dim=4096)
        self.llm = LLM()  # e.g., Qwen, Llama

    def forward(self, image, text_tokens):
        # Encode image
        image_features = self.vision_encoder(image)  # [1, num_patches, 1024]

        # Project to LLM embedding space
        image_embeddings = self.projection(image_features)  # [1, num_patches, 4096]

        # Get text embeddings
        text_embeddings = self.llm.embed(text_tokens)  # [1, seq_len, 4096]

        # Concatenate: [image_tokens, text_tokens]
        combined = torch.cat([image_embeddings, text_embeddings], dim=1)

        # Generate response
        output = self.llm.generate(combined)
        return output

Understanding the key steps:

Vision encoding: The ViT (Vision Transformer) divides the image into patches (typically 14x14 or 16x16 pixels) and encodes each patch into a vector. A 224x224 image with 14x14 patches produces 256 patch embeddings. Higher resolution images produce more patches and thus more tokens.
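
A quick way to sanity-check token budgets is to compute the patch count directly. The helper below is illustrative (integer division of the image dimensions by the patch size), not any specific model's tokenizer.

Python
def patch_count(width: int, height: int, patch_size: int = 14) -> int:
    """Number of patch embeddings a ViT produces for a given image size."""
    return (width // patch_size) * (height // patch_size)

print(patch_count(224, 224))  # 256 patches, matching the example above
print(patch_count(448, 448))  # 1024 patches -- 4x the tokens at 2x the resolution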

Projection alignment: The vision encoder produces embeddings in its own space (e.g., 1024 dimensions for SigLIP). The LLM expects embeddings in its space (e.g., 4096 dimensions for Qwen). The projection layer learns to translate between these spaces. Simple MLPs work surprisingly well—cross-attention projectors can be more expressive but add latency.

Concatenation strategy: Image tokens are typically prepended to text tokens, so the LLM "sees" the image first and can reference it when processing the question. Some architectures interleave image and text tokens for finer-grained reference. The concatenation point affects both capability and efficiency.

Token Costs for Multimodal

Understanding token consumption is critical for cost management.

Why image tokens dominate costs: A text prompt might use 100-500 tokens. A single high-resolution image can use 1,000-2,000 tokens. If you're processing documents with many images, the image tokens quickly dominate your costs. This is why resolution controls and smart image preprocessing matter so much for production multimodal applications.

The resolution-quality tradeoff: Higher resolution means more patches, which means more tokens and higher costs. But lower resolution means losing fine details—you can't read small text or identify small objects. Most APIs let you control resolution: use lower resolution for general scene understanding, higher resolution for document OCR or detailed analysis.

| Model | Image Token Cost | Notes |
|---|---|---|
| GPT-4o | 85-1,700 tokens/image | Depends on resolution |
| Claude | ~1,000 tokens/image | Fixed cost |
| Gemini | ~258 tokens/image | Efficient encoding |
| Qwen-VL | Variable | Dynamic resolution |

Video cost: tokens = frames × tokens_per_frame

A 1-minute video at 1 FPS = 60 frames × ~1000 tokens = ~60,000 tokens.
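
A back-of-the-envelope estimator makes this concrete. The per-frame figure below is an assumption (roughly 1,000 tokens per frame); actual counts vary by model and resolution.

Python
def estimate_video_tokens(duration_seconds: float, fps: float = 1.0,
                          tokens_per_frame: int = 1000) -> int:
    """Rough video cost: sampled frames x tokens per frame."""
    frames = int(duration_seconds * fps)
    return frames * tokens_per_frame

print(estimate_video_tokens(60))       # ~60,000 tokens for a 1-minute video at 1 FPS
print(estimate_video_tokens(60, 0.5))  # halving the sampling rate halves the token bill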

Types of Multimodal Models

Vision-Language Models (VLMs)

The most mature category: models that understand both images and text.

Capabilities:

  • Image captioning and description
  • Visual question answering (VQA)
  • Document and chart understanding
  • OCR and text extraction
  • GUI understanding for agents
  • Object detection and localization
  • Image-based reasoning

Key architectures:

| Architecture | Examples | Approach |
|---|---|---|
| Encoder-Decoder | Qwen-VL, LLaVA | Separate vision encoder + LLM |
| Native Multimodal | GPT-4o, Gemini | Unified from pretraining |
| Dual-Encoder | CLIP, SigLIP | Contrastive learning |

Audio-Language Models

Models processing speech and audio alongside text:

Capabilities:

  • Automatic speech recognition (ASR)
  • Audio understanding (music, environmental sounds)
  • Real-time voice conversation
  • Speech synthesis / TTS
  • Speaker identification
  • Emotion detection from voice

Architecture:

Code
[Audio] → Audio Encoder (Whisper) → Projection → LLM → [Text/Speech]
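
A minimal sketch of the encoder half of this pipeline, using the Whisper encoder from Hugging Face transformers. The checkpoint and the 4096-dimensional projection target are illustrative assumptions, not any particular audio-language model's configuration.

Python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Whisper-small's encoder produces 768-dimensional hidden states
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

def encode_audio(waveform, sample_rate: int = 16000) -> torch.Tensor:
    """Convert raw audio into Whisper encoder hidden states."""
    features = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        return whisper.encoder(features.input_features).last_hidden_state  # [1, frames, 768]

# Hypothetical projection into a 4096-dim LLM embedding space
projector = nn.Linear(768, 4096)

audio_states = encode_audio(torch.randn(16000).numpy())  # 1 second of dummy audio at 16 kHz
audio_embeddings = projector(audio_states)               # ready to interleave with text tokens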

Video-Language Models

Extended understanding across temporal sequences:

From research: "Qwen2.5-VL can comprehend videos of over 1 hour."

Capabilities:

  • Video summarization
  • Temporal reasoning (what happened before/after)
  • Action recognition
  • Long-form video Q&A
  • Event detection
  • Video captioning

Challenges:

  • Token explosion (many frames)
  • Temporal coherence
  • Long-range dependencies
  • Efficient sampling strategies

Any-to-Any Models

The frontier: models taking any modality input and generating any modality output.

From research: "Any-to-any models, as the name suggests, are models that can take in any modality and output any modality (image, text, audio)."

Examples:

  • Janus-Pro: Image understanding AND image generation
  • Qwen2.5-Omni: Text, image, audio, video → text, speech
  • GPT-4o: Native multimodal in/out (limited generation)

Major Multimodal Models

Qwen VL Series (Alibaba)

The leading open-source multimodal family.

Qwen2.5-VL

From research: "Qwen 2.5 VL integrates a vision transformer with a language model, enabling advanced image and text understanding capabilities. It can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."

Specifications:

| Variant | Parameters | Context | VRAM Required |
|---|---|---|---|
| Qwen2.5-VL-3B | 3B | 32K | ~8GB (INT4) |
| Qwen2.5-VL-7B | 7B | 32K | ~16GB (INT4) |
| Qwen2.5-VL-72B | 72B | 32K | ~40GB (INT4) |

Benchmarks (72B):

| Benchmark | Score | Description |
|---|---|---|
| MMMU | 70.2 | Multimodal understanding |
| MathVista | 74.8 | Visual math reasoning |
| MMStar | 70.8 | Multi-image reasoning |
| DocVQA | 96.4 | Document understanding |
| ChartQA | 88.3 | Chart comprehension |

Strengths:

  • Best open-source VLM quality
  • Native video understanding (1+ hour)
  • Dynamic resolution (handles any image size)
  • Strong multilingual vision-language
  • Visual agent capabilities

Qwen3-VL

From research: "Qwen3-VL features text–vision fusion for unified comprehension, Interleaved-MRoPE for enhanced long-horizon video reasoning, and DeepStack technology that fuses multi-level ViT features to capture fine-grained details."

Key innovations:

  • DeepStack: Multi-level feature fusion
  • Interleaved-MRoPE: Better positional encoding for video
  • Improved long-context visual reasoning

Qwen2.5-Omni

From research: "An end-to-end multimodal model designed for comprehensive multimodal perception, seamlessly processing text, images, audio, and video, while delivering real-time streaming responses through both text generation and natural speech synthesis."

Capabilities:

  • Input: Text, image, audio, video
  • Output: Text, natural speech
  • Real-time streaming
  • 7B parameters

Janus-Pro (DeepSeek)

Unified understanding AND generation.

From research: "Janus-Pro-7B, introduced by DeepSeek AI, is a unified multimodal model that excels in both understanding and generating content across modalities. It features a decoupled visual encoding architecture, separating the processes for understanding and generation."

Architecture Innovation:

Code
Understanding path: Image → Understanding Encoder → LLM
Generation path:    LLM → Generation Encoder → Image

Key achievement: From research: "This model generates images and beats OpenAI's DALL-E 3 and Stable Diffusion across multiple benchmarks."

Benchmarks:

| Task | Janus-Pro-7B | DALL-E 3 | SD-XL |
|---|---|---|---|
| GenEval | 0.80 | 0.67 | 0.55 |
| DPG-Bench | 84.2 | 83.5 | 74.7 |

Why it matters: First open model competitive with proprietary image generators while also understanding images.

MiniCPM-o 2.6

A compact but powerful multimodal model.

From research: "MiniCPM-o 2.6 is an 8B parameter multimodal model capable of understanding and generating content across vision, speech, and language modalities."

Architecture: From research: "An architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model boasts a total of 8 billion parameters."

| Component | Size | Function |
|---|---|---|
| SigLIP | 400M | Vision encoding |
| Whisper-medium | 300M | Audio encoding |
| ChatTTS | 200M | Speech synthesis |
| Qwen2.5 | 7B | Language backbone |

Features:

  • Real-time speech conversation
  • Multimodal streaming support
  • ~5.5 GB model size
  • 32K context window
  • Runs on consumer GPUs

Best for: Edge deployment, real-time applications, resource-constrained environments.

GPT-4V / GPT-4o (OpenAI)

The commercial benchmark.

GPT-4V:

  • First frontier vision-language model
  • Strong at complex reasoning over images
  • Document and chart understanding
  • Announced with GPT-4 in March 2023

GPT-4o:

  • Native multimodal (trained from scratch with all modalities)
  • Real-time voice mode
  • 128K context window
  • Faster inference
  • Released May 2024

Strengths:

  • Best overall quality
  • Excellent instruction following
  • Strong safety alignment
  • Comprehensive documentation

Limitations:

  • Proprietary, API-only
  • High cost at scale
  • No fine-tuning
  • No on-premise deployment

Claude Vision (Anthropic)

From research: "Claude Sonnet 4.5 / Opus 4.5 is recommended when conservative, audit-friendly tool use and stable long agents matter more than maximum raw scores."

Strengths:

  • Computer use (screen understanding)
  • Document analysis
  • Structured output from images
  • Safety and honesty focus
  • Long context (200K tokens)

Best for: Enterprise applications, document processing, computer use agents.
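
For reference, a minimal image question with the Anthropic Python SDK looks roughly like the sketch below; the model string is a placeholder, so substitute the Claude version you are targeting.

Python
import base64
import anthropic

client = anthropic.Anthropic()

def ask_claude_about_image(image_path: str, question: str,
                           model: str = "claude-sonnet-4-5") -> str:
    """Send one image plus a question to Claude and return the text answer."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    message = client.messages.create(
        model=model,  # placeholder -- use the current Claude model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return message.content[0].text

# Example
# print(ask_claude_about_image("contract.png", "List the parties and key dates in this contract."))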

Gemini (Google)

From research: "Unlike other LLMs, Gemini was designed to be multimodal, meaning it could process multiple types of data simultaneously, including text, images, audio, video, and computer code."

Gemini 3 (November 2025): Google's most intelligent multimodal model family:

  • Gemini 3 Pro (November 18, 2025): State-of-the-art reasoning for complex problems
  • Gemini 3 Flash (December 17, 2025): Frontier intelligence at 3x speed of 2.5 Pro

From Google: "Gemini 3 Flash delivers frontier performance on GPQA Diamond (90.4%) and Humanity's Last Exam (33.7%), rivaling larger frontier models while using 30% fewer tokens on average than 2.5 Pro."

| Model | GPQA Diamond | SWE-bench | Speed vs 2.5 Pro |
|---|---|---|---|
| Gemini 3 Flash | 90.4% | 78% | 3x faster |
| Gemini 3 Pro | 88.1% | 72% | 1.5x faster |

Gemini 3 Deep Think: Enhanced reasoning mode (safety evaluation in progress).

Gemini 2.5 Pro:

  • 1 million token context window
  • Native multimodality
  • "Thinking model" with step-by-step reasoning
  • Strong video understanding

Best for: Very long contexts, video analysis, Google Cloud integration.
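
A minimal sketch with the google-generativeai SDK; the API key and model name are placeholders, and longer audio/video files would go through the Files API rather than inline.

Python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Placeholder model name -- pick the current Gemini multimodal model
model = genai.GenerativeModel("gemini-2.5-pro")

# Images can be mixed with text in a single request
response = model.generate_content([
    Image.open("chart.png"),
    "What trends does this chart show? Answer in three bullet points."
])
print(response.text)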

InternVL3

Strong open-source alternative.

From research: "InternVL3-78B excels in multimodal perception and reasoning with enhanced capabilities including tool usage, GUI agents, industrial image analysis, and 3D vision perception. It achieves a score of 72.2 on the MMMU benchmark."

Specialties:

  • Industrial image analysis
  • 3D vision perception
  • GUI agent capabilities
  • Tool usage

GLM-4.6V (Zhipu AI)

From research: "GLM-4.6V is the latest open-source multimodal model featuring native multimodal tool use, stronger visual reasoning, and a 128K context window. With a 128K context window, GLM-4.6V can handle high-information-density inputs, such as multi-document financial reports, research papers, 200-page presentation decks, and hour-long videos."

Strengths:

  • 128K context window
  • Native tool use
  • Long document processing
  • Hour-long video understanding

Comprehensive Comparison

Benchmark Comparison

| Model | MMMU | MathVista | DocVQA | TextVQA | Open |
|---|---|---|---|---|---|
| GPT-4o | 69.1 | 63.8 | 92.8 | 77.4 | No |
| Gemini 2.5 Pro | 72.4 | 73.4 | 94.2 | 78.6 | No |
| Claude 3.5 Sonnet | 68.3 | 67.7 | 95.2 | 77.7 | No |
| Qwen2.5-VL-72B | 70.2 | 74.8 | 96.4 | 84.9 | Yes |
| InternVL3-78B | 72.2 | 70.1 | 95.8 | 83.4 | Yes |
| Qwen2.5-VL-7B | 62.0 | 67.5 | 94.5 | 81.3 | Yes |

Capability Matrix

| Model | Params | Vision | Audio | Video | Generate | Open |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-72B | 72B | ✓ | — | ✓ | — | Yes |
| Qwen2.5-Omni | 7B | ✓ | ✓ | ✓ | Speech | Yes |
| Qwen3-VL-235B | 235B | ✓ | — | ✓ | — | Yes |
| Janus-Pro-7B | 7B | ✓ | — | — | Image | Yes |
| MiniCPM-o 2.6 | 8B | ✓ | ✓ | ✓ | Speech | Yes |
| GPT-4o | ? | ✓ | ✓ | — | Limited | No |
| Claude 4.5 | ? | ✓ | — | — | — | No |
| Gemini 3 Flash | ? | ✓ | ✓ | ✓ | — | No |
| Gemini 2.5 | ? | ✓ | ✓ | ✓ | — | No |
| InternVL3-78B | 78B | ✓ | — | ✓ | — | Yes |

Implementation Patterns

Image Understanding with OpenAI

Python
from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def analyze_image(
    image_path: str,
    question: str,
    detail: str = "high"  # "low", "high", or "auto"
) -> str:
    """Analyze image with GPT-4o."""
    image_data = encode_image(image_path)
    extension = Path(image_path).suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }.get(extension, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{media_type};base64,{image_data}",
                        "detail": detail
                    }
                }
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Examples
result = analyze_image("invoice.png", "Extract all line items with prices")
result = analyze_image("chart.png", "What trends does this chart show?")
result = analyze_image("screenshot.png", "Describe the UI elements")

Multiple Images

Python
def compare_images(image_paths: list[str], question: str) -> str:
    """Compare multiple images."""
    content = [{"type": "text", "text": question}]

    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image(path)}",
                "detail": "high"
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content

# Example: Compare before/after
result = compare_images(
    ["before.png", "after.png"],
    "What changed between these two images?"
)

Structured Output from Images

Python
import json

from pydantic import BaseModel
from typing import List

class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class ExtractedInvoice(BaseModel):
    vendor: str
    invoice_number: str
    date: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

def extract_invoice(image_path: str) -> ExtractedInvoice:
    """Extract structured data from invoice image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all information from this invoice as structured JSON."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
                }
            ]
        }],
        response_format={"type": "json_object"}
    )

    data = json.loads(response.choices[0].message.content)
    return ExtractedInvoice(**data)

Video Understanding with Qwen-VL

Python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # frame loading/sampling helper from the Qwen-VL repo
import torch

class QwenVideoAnalyzer:
    def __init__(self, model_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"):
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            attn_implementation="flash_attention_2"
        )
        self.processor = AutoProcessor.from_pretrained(model_name)

    def analyze_video(
        self,
        video_path: str,
        question: str,
        fps: float = 1.0,
        max_frames: int = 64
    ) -> str:
        """Analyze video content."""
        messages = [{
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "fps": fps,
                    "max_frames": max_frames
                },
                {"type": "text", "text": question}
            ]
        }]

        text = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # qwen_vl_utils loads and samples the video frames referenced in the messages
        image_inputs, video_inputs = process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        ).to(self.model.device)

        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7
            )

        return self.processor.batch_decode(
            output_ids[:, inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )[0]

# Usage
analyzer = QwenVideoAnalyzer()
summary = analyzer.analyze_video(
    "meeting.mp4",
    "Summarize the key points discussed in this meeting"
)

Multimodal RAG

Python
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import os

class MultimodalRAG:
    def __init__(self):
        self.mm_llm = OpenAIMultiModal(
            model="gpt-4o",
            max_new_tokens=1000
        )

    def load_documents(self, directory: str):
        """Load text and image documents."""
        # Load text documents
        text_docs = SimpleDirectoryReader(
            directory,
            required_exts=[".txt", ".pdf", ".md"]
        ).load_data()

        # Load images separately
        image_docs = []
        for filename in os.listdir(directory):
            if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(directory, filename)
                image_docs.append(ImageDocument(image_path=image_path))

        return text_docs, image_docs

    def query_with_images(
        self,
        query: str,
        text_context: str,
        images: list[ImageDocument]
    ) -> str:
        """Query with both text and image context."""
        prompt = f"""Based on the following context and images, answer the question.

Text Context:
{text_context}

Question: {query}"""

        response = self.mm_llm.complete(
            prompt=prompt,
            image_documents=images
        )
        return response.text

# Usage
rag = MultimodalRAG()
text_docs, images = rag.load_documents("./reports/")
answer = rag.query_with_images(
    "What trends do the charts show?",
    "Q3 2024 sales report...",
    images[:3]  # First 3 images
)

Real-Time Audio with MiniCPM-o

Python
# Conceptual example - check MiniCPM-o docs for exact API
from minicpm_o import MiniCPMO
import sounddevice as sd
import numpy as np

class RealtimeVoiceAssistant:
    def __init__(self):
        self.model = MiniCPMO.from_pretrained("openbmb/MiniCPM-o-2.6")
        self.sample_rate = 16000

    def process_audio_stream(self, audio_chunk: np.ndarray) -> str:
        """Process audio and get text response."""
        # Model handles speech-to-text internally
        response = self.model.generate(
            audio=audio_chunk,
            sample_rate=self.sample_rate,
            output_type="text"
        )
        return response

    def generate_speech(self, text: str) -> np.ndarray:
        """Generate speech from text."""
        audio = self.model.synthesize_speech(text)
        return audio

    def run_conversation(self):
        """Run real-time conversation loop."""
        print("Listening... (Ctrl+C to stop)")
        while True:
            # Record audio
            audio = sd.rec(
                int(3 * self.sample_rate),
                samplerate=self.sample_rate,
                channels=1
            )
            sd.wait()

            # Process and respond
            response = self.process_audio_stream(audio.flatten())
            print(f"Response: {response}")

            # Speak response
            audio_response = self.generate_speech(response)
            sd.play(audio_response, self.sample_rate)
            sd.wait()

Production Deployment

Hardware Requirements

| Model | VRAM (FP16) | VRAM (INT4) | Inference Speed |
|---|---|---|---|
| Qwen2.5-VL-3B | 8 GB | 3 GB | Fast |
| Qwen2.5-VL-7B | 16 GB | 6 GB | Medium |
| MiniCPM-o 2.6 | 12 GB | 5.5 GB | Fast |
| Janus-Pro-7B | 16 GB | 6 GB | Medium |
| Qwen2.5-VL-72B | 150 GB | 40 GB | Slow |
| InternVL3-78B | 160 GB | 45 GB | Slow |
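
To reach the INT4 footprints above on a single consumer GPU, 4-bit loading via bitsandbytes is a common route. The sketch below uses the transformers quantization API; the NF4 settings are a reasonable default, but check each model card for its recommended setup.

Python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# 4-bit NF4 quantization keeps the 7B model in roughly the ~6 GB range
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")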

Serving with vLLM

Python
from vllm import LLM, SamplingParams

# Vision-language model serving
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=32768
)

# Process image + text: recent vLLM versions accept a PIL image directly in multi_modal_data
from PIL import Image

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512
)

# Create prompt with the model's image placeholder
# (the exact placeholder/chat template depends on the model and vLLM version)
prompt = "<image>\nDescribe this image in detail."

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {
            "image": Image.open("image.png")
        }
    }],
    sampling_params
)

Optimization Strategies

Image preprocessing:

Python
def optimize_image_for_vlm(image_path: str, max_size: int = 1024) -> bytes:
    """Resize and compress image for efficient VLM processing."""
    from PIL import Image
    import io

    img = Image.open(image_path)

    # Resize if too large
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Convert to RGB if needed
    if img.mode != 'RGB':
        img = img.convert('RGB')

    # Compress
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return buffer.getvalue()

Video frame sampling:

Python
def sample_video_frames(
    video_path: str,
    target_frames: int = 32,
    strategy: str = "uniform"
) -> list:
    """Sample frames from video efficiently."""
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    if strategy == "uniform":
        # Evenly spaced frames
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
    elif strategy == "keyframe":
        # Scene/keyframe detection (requires a detect_keyframes helper, not shown here)
        indices = detect_keyframes(video_path, target_frames)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames

Cost Optimization

Python
class CostAwareMultimodalProcessor:
    def __init__(self):
        self.token_costs = {
            "gpt-4o": {"input": 5.0, "output": 15.0},  # per 1M tokens
            "claude-3-5-sonnet": {"input": 3.0, "output": 15.0}
        }

    def estimate_image_cost(
        self,
        image_path: str,
        model: str = "gpt-4o",
        detail: str = "high"
    ) -> tuple[float, int]:
        """Estimate (cost, tokens) for analyzing one image."""
        from PIL import Image
        img = Image.open(image_path)
        width, height = img.size

        if detail == "low":
            tokens = 85
        else:
            # Rough high-detail estimate: base cost plus 512px tiles
            # (providers' exact formulas also rescale the image first)
            tiles = ((width // 512) + 1) * ((height // 512) + 1)
            tokens = 85 + (tiles * 170)

        cost = (tokens / 1_000_000) * self.token_costs[model]["input"]
        return cost, tokens

    def batch_process_efficiently(
        self,
        images: list[str],
        questions: list[str],
        budget: float
    ) -> list[dict]:
        """Process images within budget constraint."""
        results = []
        spent = 0

        for image, question in zip(images, questions):
            cost, tokens = self.estimate_image_cost(image)
            if spent + cost > budget:
                # Use lower detail or skip
                cost, tokens = self.estimate_image_cost(image, detail="low")

            if spent + cost <= budget:
                # process_image (not shown) would call the chosen API at the selected detail level
                result = self.process_image(image, question, detail="low" if cost < 0.001 else "high")
                results.append(result)
                spent += cost

        return results

Use Cases

Document Intelligence

Invoice and receipt processing:

  • Extract line items, totals, vendor info
  • Validate against expected formats
  • Flag anomalies

Contract analysis:

  • Identify key clauses
  • Extract dates, parties, obligations
  • Compare across documents

Form extraction:

  • OCR + structure understanding
  • Handle handwritten fields
  • Multi-page documents

Visual Agents

From research: "Qwen2.5-VL can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."

Computer use:

  • Screen understanding
  • UI navigation
  • Task automation
  • Testing

Phone use:

  • Mobile app interaction
  • Accessibility assistance
  • Automated workflows

Video Analysis

Meeting summarization:

  • Key points extraction
  • Action item identification
  • Speaker attribution

Content moderation:

  • Policy violation detection
  • Age-appropriate filtering
  • Brand safety

Security monitoring:

  • Anomaly detection
  • Event recognition
  • Alert generation

Real-Time Interaction

Voice assistants with vision:

  • Describe surroundings
  • Read documents aloud
  • Visual Q&A

Customer service:

  • Screen sharing analysis
  • Visual troubleshooting
  • Product identification

Multimodal Reasoning

From research: "Until 2025, there was only one open-source multimodal reasoning model, QVQ-72B-preview by Qwen."

Emerging capabilities:

  • Mathematical diagram solving
  • Scientific figure analysis
  • Complex visual puzzles
  • Step-by-step visual reasoning

Unified Generation

Models like Janus-Pro that both understand AND generate across modalities—closing the loop between perception and creation.

Real-Time Streaming

From MiniCPM-o: Real-time speech conversation and multimodal streaming support, enabling fluid human-AI interaction.

World Models

Moving toward models that understand physical world dynamics, cause-and-effect, and can simulate outcomes.

Open vs Proprietary

From research: "Proprietary models (GPT-4o, Gemini, Claude) lead in absolute performance but lock you into API pricing and prevent fine-tuning. Use them when accuracy justifies costs, and you don't need data privacy or custom adaptations."

From research: "Open-source models like Qwen2.5-VL-72B and InternVL3-78B now perform within 5-10% of proprietary models. You control deployment, fine-tune on proprietary data, and eliminate per-call costs at scale."

| Factor | Proprietary | Open-Source |
|---|---|---|
| Quality | Best | 5-10% behind |
| Cost at scale | High | Low |
| Privacy | Data leaves | Data stays |
| Fine-tuning | No | Yes |
| Latency | Network | Local possible |
| Support | Official | Community |

Conclusion

Multimodal LLMs have evolved from text-with-images to unified perception systems:

  1. Qwen VL series leads open-source vision-language
  2. Janus-Pro pioneers unified understanding + generation
  3. MiniCPM-o proves multimodal can be efficient (~8B params)
  4. GPT-4o/Gemini set commercial benchmarks
  5. Any-to-any models are the emerging frontier

Recommendations:

  • Reliability first: GPT-4o or Claude
  • Open-source quality: Qwen2.5-VL-72B
  • Efficient/edge: MiniCPM-o 2.6 or Qwen2.5-VL-7B
  • Image generation: Janus-Pro-7B
  • Video understanding: Qwen2.5-VL or Gemini

Start with proprietary for prototyping, evaluate open models for production cost and privacy.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
