
Multimodal LLMs: Vision, Audio, and Beyond

A comprehensive guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.

12 min read

The Multimodal Revolution

LLMs are no longer text-only. Modern multimodal models process images, videos, audio, and text in unified architectures, enabling capabilities that were science fiction just two years ago.

From research: "Models have become smaller yet more powerful, with the rise of new architectures and capabilities including reasoning, agency, and long video understanding. Entirely new paradigms such as multimodal Retrieval Augmented Generation (RAG) and multimodal agents have taken shape."

This post provides a comprehensive guide to multimodal LLMs in 2025—architectures, major models, implementation patterns, and production deployment.

How Multimodal Models Work

The Core Architecture

Most multimodal models follow a similar pattern:

Code
[Non-text input] → Encoder → Projection → LLM Backbone → [Output]

Components:

  1. Modality Encoder: Converts raw input (image, audio, video) into embeddings

    • Vision: ViT (Vision Transformer), SigLIP, EVA-CLIP
    • Audio: Whisper encoder, wav2vec
    • Video: Frame sampling + vision encoder
  2. Projection Layer: Aligns modality embeddings with text embedding space

    • MLP projector (simple, fast)
    • Cross-attention (more expressive)
    • Q-Former (query-based alignment)
  3. LLM Backbone: Processes unified embeddings and generates output

    • Standard transformer decoder
    • Receives interleaved text and modality tokens

Vision-Language Architecture Example

Understanding the architecture helps you make better decisions about which model to use and how to optimize inference. The code below shows the conceptual flow—production implementations add many optimizations but follow the same pattern.

Python
# Simplified VLM forward pass (conceptual: ViT, MLP, and LLM stand in for real implementations)
import torch

class VisionLanguageModel:
    def __init__(self):
        self.vision_encoder = ViT()  # e.g., SigLIP
        self.projection = MLP(vision_dim=1024, text_dim=4096)
        self.llm = LLM()  # e.g., Qwen, Llama

    def forward(self, image, text_tokens):
        # Encode image
        image_features = self.vision_encoder(image)  # [1, num_patches, 1024]

        # Project to LLM embedding space
        image_embeddings = self.projection(image_features)  # [1, num_patches, 4096]

        # Get text embeddings
        text_embeddings = self.llm.embed(text_tokens)  # [1, seq_len, 4096]

        # Concatenate: [image_tokens, text_tokens]
        combined = torch.cat([image_embeddings, text_embeddings], dim=1)

        # Generate response
        output = self.llm.generate(combined)
        return output

Understanding the key steps:

Vision encoding: The ViT (Vision Transformer) divides the image into patches (typically 14x14 or 16x16 pixels) and encodes each patch into a vector. A 224x224 image with 14x14 patches produces 256 patch embeddings. Higher resolution images produce more patches and thus more tokens.
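
A quick way to sanity-check token budgets is to compute the patch count directly. The helper below is illustrative (integer division of the image dimensions by the patch size), not any specific model's tokenizer.

Python
def patch_count(width: int, height: int, patch_size: int = 14) -> int:
    """Number of patch embeddings a ViT produces for a given image size."""
    return (width // patch_size) * (height // patch_size)

print(patch_count(224, 224))  # 256 patches, matching the example above
print(patch_count(448, 448))  # 1024 patches -- 4x the tokens at 2x the resolution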

Projection alignment: The vision encoder produces embeddings in its own space (e.g., 1024 dimensions for SigLIP). The LLM expects embeddings in its space (e.g., 4096 dimensions for Qwen). The projection layer learns to translate between these spaces. Simple MLPs work surprisingly well—cross-attention projectors can be more expressive but add latency.

Concatenation strategy: Image tokens are typically prepended to text tokens, so the LLM "sees" the image first and can reference it when processing the question. Some architectures interleave image and text tokens for finer-grained reference. The concatenation point affects both capability and efficiency.

Token Costs for Multimodal

Understanding token consumption is critical for cost management.

Why image tokens dominate costs: A text prompt might use 100-500 tokens. A single high-resolution image can use 1,000-2,000 tokens. If you're processing documents with many images, the image tokens quickly dominate your costs. This is why resolution controls and smart image preprocessing matter so much for production multimodal applications.

The resolution-quality tradeoff: Higher resolution means more patches, which means more tokens and higher costs. But lower resolution means losing fine details—you can't read small text or identify small objects. Most APIs let you control resolution: use lower resolution for general scene understanding, higher resolution for document OCR or detailed analysis.

| Model | Image Token Cost | Notes |
|---|---|---|
| GPT-4o | 85-1,700 tokens/image | Depends on resolution |
| Claude | ~1,000 tokens/image | Fixed cost |
| Gemini | ~258 tokens/image | Efficient encoding |
| Qwen-VL | Variable | Dynamic resolution |

Video cost: tokens = frames × tokens_per_frame

A 1-minute video at 1 FPS = 60 frames × ~1000 tokens = ~60,000 tokens.
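
A back-of-the-envelope estimator makes this concrete. The per-frame figure below is an assumption (roughly 1,000 tokens per frame); actual counts vary by model and resolution.

Python
def estimate_video_tokens(duration_seconds: float, fps: float = 1.0,
                          tokens_per_frame: int = 1000) -> int:
    """Rough video cost: sampled frames x tokens per frame."""
    frames = int(duration_seconds * fps)
    return frames * tokens_per_frame

print(estimate_video_tokens(60))       # ~60,000 tokens for a 1-minute video at 1 FPS
print(estimate_video_tokens(60, 0.5))  # halving the sampling rate halves the token bill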

Types of Multimodal Models

Vision-Language Models (VLMs)

The most mature category: models that understand both images and text.

Capabilities:

  • Image captioning and description
  • Visual question answering (VQA)
  • Document and chart understanding
  • OCR and text extraction
  • GUI understanding for agents
  • Object detection and localization
  • Image-based reasoning

Key architectures:

| Architecture | Examples | Approach |
|---|---|---|
| Encoder-Decoder | Qwen-VL, LLaVA | Separate vision encoder + LLM |
| Native Multimodal | GPT-4o, Gemini | Unified from pretraining |
| Dual-Encoder | CLIP, SigLIP | Contrastive learning |

Audio-Language Models

Models processing speech and audio alongside text:

Capabilities:

  • Automatic speech recognition (ASR)
  • Audio understanding (music, environmental sounds)
  • Real-time voice conversation
  • Speech synthesis / TTS
  • Speaker identification
  • Emotion detection from voice

Architecture:

Code
[Audio] → Audio Encoder (Whisper) → Projection → LLM → [Text/Speech]
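
A minimal sketch of the encoder half of this pipeline, using the Whisper encoder from Hugging Face transformers. The checkpoint and the 4096-dimensional projection target are illustrative assumptions, not any particular audio-language model's configuration.

Python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Whisper-small's encoder produces 768-dimensional hidden states
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

def encode_audio(waveform, sample_rate: int = 16000) -> torch.Tensor:
    """Convert raw audio into Whisper encoder hidden states."""
    features = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        return whisper.encoder(features.input_features).last_hidden_state  # [1, frames, 768]

# Hypothetical projection into a 4096-dim LLM embedding space
projector = nn.Linear(768, 4096)

audio_states = encode_audio(torch.randn(16000).numpy())  # 1 second of dummy audio at 16 kHz
audio_embeddings = projector(audio_states)               # ready to interleave with text tokens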

Video-Language Models

Extended understanding across temporal sequences:

From research: "Qwen2.5-VL can comprehend videos of over 1 hour."

Capabilities:

  • Video summarization
  • Temporal reasoning (what happened before/after)
  • Action recognition
  • Long-form video Q&A
  • Event detection
  • Video captioning

Challenges:

  • Token explosion (many frames)
  • Temporal coherence
  • Long-range dependencies
  • Efficient sampling strategies

Any-to-Any Models

The frontier: models taking any modality input and generating any modality output.

From research: "Any-to-any models, as the name suggests, are models that can take in any modality and output any modality (image, text, audio)."

Examples:

  • Janus-Pro: Image understanding AND image generation
  • Qwen2.5-Omni: Text, image, audio, video → text, speech
  • GPT-4o: Native multimodal in/out (limited generation)

Major Multimodal Models

Qwen VL Series (Alibaba)

The leading open-source multimodal family.

Qwen2.5-VL

From research: "Qwen 2.5 VL integrates a vision transformer with a language model, enabling advanced image and text understanding capabilities. It can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."

Specifications:

| Variant | Parameters | Context | VRAM Required |
|---|---|---|---|
| Qwen2.5-VL-3B | 3B | 32K | ~8GB (INT4) |
| Qwen2.5-VL-7B | 7B | 32K | ~16GB (INT4) |
| Qwen2.5-VL-72B | 72B | 32K | ~40GB (INT4) |

Benchmarks (72B):

| Benchmark | Score | Description |
|---|---|---|
| MMMU | 70.2 | Multimodal understanding |
| MathVista | 74.8 | Visual math reasoning |
| MMStar | 70.8 | Multi-image reasoning |
| DocVQA | 96.4 | Document understanding |
| ChartQA | 88.3 | Chart comprehension |

Strengths:

  • Best open-source VLM quality
  • Native video understanding (1+ hour)
  • Dynamic resolution (handles any image size)
  • Strong multilingual vision-language
  • Visual agent capabilities

Qwen3-VL

From research: "Qwen3-VL features text–vision fusion for unified comprehension, Interleaved-MRoPE for enhanced long-horizon video reasoning, and DeepStack technology that fuses multi-level ViT features to capture fine-grained details."

Key innovations:

  • DeepStack: Multi-level feature fusion
  • Interleaved-MRoPE: Better positional encoding for video
  • Improved long-context visual reasoning

Qwen2.5-Omni

From research: "An end-to-end multimodal model designed for comprehensive multimodal perception, seamlessly processing text, images, audio, and video, while delivering real-time streaming responses through both text generation and natural speech synthesis."

Capabilities:

  • Input: Text, image, audio, video
  • Output: Text, natural speech
  • Real-time streaming
  • 7B parameters

Janus-Pro (DeepSeek)

Unified understanding AND generation.

From research: "Janus-Pro-7B, introduced by DeepSeek AI, is a unified multimodal model that excels in both understanding and generating content across modalities. It features a decoupled visual encoding architecture, separating the processes for understanding and generation."

Architecture Innovation:

Code
Understanding path: Image → Understanding Encoder → LLM
Generation path:    LLM → Generation Encoder → Image

Key achievement: From research: "This model generates images and beats OpenAI's DALL-E 3 and Stable Diffusion across multiple benchmarks."

Benchmarks:

| Task | Janus-Pro-7B | DALL-E 3 | SD-XL |
|---|---|---|---|
| GenEval | 0.80 | 0.67 | 0.55 |
| DPG-Bench | 84.2 | 83.5 | 74.7 |

Why it matters: First open model competitive with proprietary image generators while also understanding images.

MiniCPM-o 2.6

A compact but powerful multimodal model.

From research: "MiniCPM-o 2.6 is an 8B parameter multimodal model capable of understanding and generating content across vision, speech, and language modalities."

Architecture: From research: "An architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model boasts a total of 8 billion parameters."

| Component | Size | Function |
|---|---|---|
| SigLIP | 400M | Vision encoding |
| Whisper-medium | 300M | Audio encoding |
| ChatTTS | 200M | Speech synthesis |
| Qwen2.5 | 7B | Language backbone |

Features:

  • Real-time speech conversation
  • Multimodal streaming support
  • ~5.5 GB model size
  • 32K context window
  • Runs on consumer GPUs

Best for: Edge deployment, real-time applications, resource-constrained environments.

GPT-4V / GPT-4o (OpenAI)

The commercial benchmark.

GPT-4V:

  • First frontier vision-language model
  • Strong at complex reasoning over images
  • Document and chart understanding
  • Announced with GPT-4 in March 2023

GPT-4o:

  • Native multimodal (trained from scratch with all modalities)
  • Real-time voice mode
  • 128K context window
  • Faster inference
  • Released May 2024

Strengths:

  • Best overall quality
  • Excellent instruction following
  • Strong safety alignment
  • Comprehensive documentation

Limitations:

  • Proprietary, API-only
  • High cost at scale
  • No fine-tuning
  • No on-premise deployment

Claude Vision (Anthropic)

From research: "Claude Sonnet 4.5 / Opus 4.5 is recommended when conservative, audit-friendly tool use and stable long agents matter more than maximum raw scores."

Strengths:

  • Computer use (screen understanding)
  • Document analysis
  • Structured output from images
  • Safety and honesty focus
  • Long context (200K tokens)

Best for: Enterprise applications, document processing, computer use agents.
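
For reference, a minimal image question with the Anthropic Python SDK looks roughly like the sketch below; the model string is a placeholder, so substitute the Claude version you are targeting.

Python
import base64
import anthropic

client = anthropic.Anthropic()

def ask_claude_about_image(image_path: str, question: str,
                           model: str = "claude-sonnet-4-5") -> str:
    """Send one image plus a question to Claude and return the text answer."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    message = client.messages.create(
        model=model,  # placeholder -- use the current Claude model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return message.content[0].text

# Example
# print(ask_claude_about_image("contract.png", "List the parties and key dates in this contract."))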

Gemini (Google)

From research: "Unlike other LLMs, Gemini was designed to be multimodal, meaning it could process multiple types of data simultaneously, including text, images, audio, video, and computer code."

Gemini 3 (November 2025): Google's most intelligent multimodal model family:

  • Gemini 3 Pro (November 18, 2025): State-of-the-art reasoning for complex problems
  • Gemini 3 Flash (December 17, 2025): Frontier intelligence at 3x speed of 2.5 Pro

From Google: "Gemini 3 Flash delivers frontier performance on GPQA Diamond (90.4%) and Humanity's Last Exam (33.7%), rivaling larger frontier models while using 30% fewer tokens on average than 2.5 Pro."

| Model | GPQA Diamond | SWE-bench | Speed vs 2.5 Pro |
|---|---|---|---|
| Gemini 3 Flash | 90.4% | 78% | 3x faster |
| Gemini 3 Pro | 88.1% | 72% | 1.5x faster |

Gemini 3 Deep Think: Enhanced reasoning mode (safety evaluation in progress).

Gemini 2.5 Pro:

  • 1 million token context window
  • Native multimodality
  • "Thinking model" with step-by-step reasoning
  • Strong video understanding

Best for: Very long contexts, video analysis, Google Cloud integration.
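
A minimal sketch with the google-generativeai SDK; the API key and model name are placeholders, and longer audio/video files would go through the Files API rather than inline.

Python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Placeholder model name -- pick the current Gemini multimodal model
model = genai.GenerativeModel("gemini-2.5-pro")

# Images can be mixed with text in a single request
response = model.generate_content([
    Image.open("chart.png"),
    "What trends does this chart show? Answer in three bullet points."
])
print(response.text)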

InternVL3

Strong open-source alternative.

From research: "InternVL3-78B excels in multimodal perception and reasoning with enhanced capabilities including tool usage, GUI agents, industrial image analysis, and 3D vision perception. It achieves a score of 72.2 on the MMMU benchmark."

Specialties:

  • Industrial image analysis
  • 3D vision perception
  • GUI agent capabilities
  • Tool usage

GLM-4.6V (Zhipu AI)

From research: "GLM-4.6V is the latest open-source multimodal model featuring native multimodal tool use, stronger visual reasoning, and a 128K context window. With a 128K context window, GLM-4.6V can handle high-information-density inputs, such as multi-document financial reports, research papers, 200-page presentation decks, and hour-long videos."

Strengths:

  • 128K context window
  • Native tool use
  • Long document processing
  • Hour-long video understanding

Comprehensive Comparison

Benchmark Comparison

| Model | MMMU | MathVista | DocVQA | TextVQA | Open |
|---|---|---|---|---|---|
| GPT-4o | 69.1 | 63.8 | 92.8 | 77.4 | No |
| Gemini 2.5 Pro | 72.4 | 73.4 | 94.2 | 78.6 | No |
| Claude 3.5 Sonnet | 68.3 | 67.7 | 95.2 | 77.7 | No |
| Qwen2.5-VL-72B | 70.2 | 74.8 | 96.4 | 84.9 | Yes |
| InternVL3-78B | 72.2 | 70.1 | 95.8 | 83.4 | Yes |
| Qwen2.5-VL-7B | 62.0 | 67.5 | 94.5 | 81.3 | Yes |

Capability Matrix

| Model | Params | Vision | Audio | Video | Generate | Open |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-72B | 72B | ✓ | — | ✓ | — | Yes |
| Qwen2.5-Omni | 7B | ✓ | ✓ | ✓ | Speech | Yes |
| Qwen3-VL-235B | 235B | ✓ | — | ✓ | — | Yes |
| Janus-Pro-7B | 7B | ✓ | — | — | Image | Yes |
| MiniCPM-o 2.6 | 8B | ✓ | ✓ | ✓ | Speech | Yes |
| GPT-4o | ? | ✓ | ✓ | — | Limited | No |
| Claude 4.5 | ? | ✓ | — | — | — | No |
| Gemini 3 Flash | ? | ✓ | ✓ | ✓ | — | No |
| Gemini 2.5 | ? | ✓ | ✓ | ✓ | — | No |
| InternVL3-78B | 78B | ✓ | — | ✓ | — | Yes |

Implementation Patterns

Image Understanding with OpenAI

Python
from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def analyze_image(
    image_path: str,
    question: str,
    detail: str = "high"  # "low", "high", or "auto"
) -> str:
    """Analyze image with GPT-4o."""
    image_data = encode_image(image_path)
    extension = Path(image_path).suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }.get(extension, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{media_type};base64,{image_data}",
                        "detail": detail
                    }
                }
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Examples
result = analyze_image("invoice.png", "Extract all line items with prices")
result = analyze_image("chart.png", "What trends does this chart show?")
result = analyze_image("screenshot.png", "Describe the UI elements")

Multiple Images

Python
def compare_images(image_paths: list[str], question: str) -> str:
    """Compare multiple images."""
    content = [{"type": "text", "text": question}]

    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image(path)}",
                "detail": "high"
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content

# Example: Compare before/after
result = compare_images(
    ["before.png", "after.png"],
    "What changed between these two images?"
)

Structured Output from Images

Python
import json

from pydantic import BaseModel
from typing import List

class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class ExtractedInvoice(BaseModel):
    vendor: str
    invoice_number: str
    date: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

def extract_invoice(image_path: str) -> ExtractedInvoice:
    """Extract structured data from invoice image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all information from this invoice as structured JSON."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
                }
            ]
        }],
        response_format={"type": "json_object"}
    )

    data = json.loads(response.choices[0].message.content)
    return ExtractedInvoice(**data)

Video Understanding with Qwen-VL

Python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # frame loading/sampling helper from the Qwen-VL repo
import torch

class QwenVideoAnalyzer:
    def __init__(self, model_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"):
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            attn_implementation="flash_attention_2"
        )
        self.processor = AutoProcessor.from_pretrained(model_name)

    def analyze_video(
        self,
        video_path: str,
        question: str,
        fps: float = 1.0,
        max_frames: int = 64
    ) -> str:
        """Analyze video content."""
        messages = [{
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "fps": fps,
                    "max_frames": max_frames
                },
                {"type": "text", "text": question}
            ]
        }]

        text = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # qwen_vl_utils loads and samples the video frames referenced in the messages
        image_inputs, video_inputs = process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        ).to(self.model.device)

        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7
            )

        return self.processor.batch_decode(
            output_ids[:, inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )[0]

# Usage
analyzer = QwenVideoAnalyzer()
summary = analyzer.analyze_video(
    "meeting.mp4",
    "Summarize the key points discussed in this meeting"
)

Multimodal RAG

Python
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import os

class MultimodalRAG:
    def __init__(self):
        self.mm_llm = OpenAIMultiModal(
            model="gpt-4o",
            max_new_tokens=1000
        )

    def load_documents(self, directory: str):
        """Load text and image documents."""
        # Load text documents
        text_docs = SimpleDirectoryReader(
            directory,
            required_exts=[".txt", ".pdf", ".md"]
        ).load_data()

        # Load images separately
        image_docs = []
        for filename in os.listdir(directory):
            if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(directory, filename)
                image_docs.append(ImageDocument(image_path=image_path))

        return text_docs, image_docs

    def query_with_images(
        self,
        query: str,
        text_context: str,
        images: list[ImageDocument]
    ) -> str:
        """Query with both text and image context."""
        prompt = f"""Based on the following context and images, answer the question.

Text Context:
{text_context}

Question: {query}"""

        response = self.mm_llm.complete(
            prompt=prompt,
            image_documents=images
        )
        return response.text

# Usage
rag = MultimodalRAG()
text_docs, images = rag.load_documents("./reports/")
answer = rag.query_with_images(
    "What trends do the charts show?",
    "Q3 2024 sales report...",
    images[:3]  # First 3 images
)

Real-Time Audio with MiniCPM-o

Python
# Conceptual example - check MiniCPM-o docs for exact API
from minicpm_o import MiniCPMO
import sounddevice as sd
import numpy as np

class RealtimeVoiceAssistant:
    def __init__(self):
        self.model = MiniCPMO.from_pretrained("openbmb/MiniCPM-o-2.6")
        self.sample_rate = 16000

    def process_audio_stream(self, audio_chunk: np.ndarray) -> str:
        """Process audio and get text response."""
        # Model handles speech-to-text internally
        response = self.model.generate(
            audio=audio_chunk,
            sample_rate=self.sample_rate,
            output_type="text"
        )
        return response

    def generate_speech(self, text: str) -> np.ndarray:
        """Generate speech from text."""
        audio = self.model.synthesize_speech(text)
        return audio

    def run_conversation(self):
        """Run real-time conversation loop."""
        print("Listening... (Ctrl+C to stop)")
        while True:
            # Record audio
            audio = sd.rec(
                int(3 * self.sample_rate),
                samplerate=self.sample_rate,
                channels=1
            )
            sd.wait()

            # Process and respond
            response = self.process_audio_stream(audio.flatten())
            print(f"Response: {response}")

            # Speak response
            audio_response = self.generate_speech(response)
            sd.play(audio_response, self.sample_rate)
            sd.wait()

Production Deployment

Hardware Requirements

| Model | VRAM (FP16) | VRAM (INT4) | Inference Speed |
|---|---|---|---|
| Qwen2.5-VL-3B | 8 GB | 3 GB | Fast |
| Qwen2.5-VL-7B | 16 GB | 6 GB | Medium |
| MiniCPM-o 2.6 | 12 GB | 5.5 GB | Fast |
| Janus-Pro-7B | 16 GB | 6 GB | Medium |
| Qwen2.5-VL-72B | 150 GB | 40 GB | Slow |
| InternVL3-78B | 160 GB | 45 GB | Slow |
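
To reach the INT4 footprints above on a single consumer GPU, 4-bit loading via bitsandbytes is a common route. The sketch below uses the transformers quantization API; the NF4 settings are a reasonable default, but check each model card for its recommended setup.

Python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# 4-bit NF4 quantization keeps the 7B model in roughly the ~6 GB range
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")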

Serving with vLLM

Python
from vllm import LLM, SamplingParams

# Vision-language model serving
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=32768
)

# Process image + text: recent vLLM versions accept a PIL image directly in multi_modal_data
from PIL import Image

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512
)

# Create prompt with the model's image placeholder
# (the exact placeholder/chat template depends on the model and vLLM version)
prompt = "<image>\nDescribe this image in detail."

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {
            "image": Image.open("image.png")
        }
    }],
    sampling_params
)

Optimization Strategies

Image preprocessing:

Python
def optimize_image_for_vlm(image_path: str, max_size: int = 1024) -> bytes:
    """Resize and compress image for efficient VLM processing."""
    from PIL import Image
    import io

    img = Image.open(image_path)

    # Resize if too large
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Convert to RGB if needed
    if img.mode != 'RGB':
        img = img.convert('RGB')

    # Compress
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return buffer.getvalue()

Video frame sampling:

Python
def sample_video_frames(
    video_path: str,
    target_frames: int = 32,
    strategy: str = "uniform"
) -> list:
    """Sample frames from video efficiently."""
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    if strategy == "uniform":
        # Evenly spaced frames
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
    elif strategy == "keyframe":
        # Scene/keyframe detection (requires a detect_keyframes helper, not shown here)
        indices = detect_keyframes(video_path, target_frames)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames

Cost Optimization

Python
class CostAwareMultimodalProcessor:
    def __init__(self):
        self.token_costs = {
            "gpt-4o": {"input": 5.0, "output": 15.0},  # per 1M tokens
            "claude-3-5-sonnet": {"input": 3.0, "output": 15.0}
        }

    def estimate_image_cost(
        self,
        image_path: str,
        model: str = "gpt-4o",
        detail: str = "high"
    ) -> tuple[float, int]:
        """Estimate (cost, tokens) for analyzing one image."""
        from PIL import Image
        img = Image.open(image_path)
        width, height = img.size

        if detail == "low":
            tokens = 85
        else:
            # Rough high-detail estimate: base cost plus 512px tiles
            # (providers' exact formulas also rescale the image first)
            tiles = ((width // 512) + 1) * ((height // 512) + 1)
            tokens = 85 + (tiles * 170)

        cost = (tokens / 1_000_000) * self.token_costs[model]["input"]
        return cost, tokens

    def batch_process_efficiently(
        self,
        images: list[str],
        questions: list[str],
        budget: float
    ) -> list[dict]:
        """Process images within budget constraint."""
        results = []
        spent = 0

        for image, question in zip(images, questions):
            cost, tokens = self.estimate_image_cost(image)
            if spent + cost > budget:
                # Use lower detail or skip
                cost, tokens = self.estimate_image_cost(image, detail="low")

            if spent + cost <= budget:
                # process_image (not shown) would call the chosen API at the selected detail level
                result = self.process_image(image, question, detail="low" if cost < 0.001 else "high")
                results.append(result)
                spent += cost

        return results

Use Cases

Document Intelligence

Invoice and receipt processing:

  • Extract line items, totals, vendor info
  • Validate against expected formats
  • Flag anomalies

Contract analysis:

  • Identify key clauses
  • Extract dates, parties, obligations
  • Compare across documents

Form extraction:

  • OCR + structure understanding
  • Handle handwritten fields
  • Multi-page documents

Visual Agents

From research: "Qwen2.5-VL can directly play as a visual agent that can reason and dynamically direct tools, capable of computer use and phone use."

Computer use:

  • Screen understanding
  • UI navigation
  • Task automation
  • Testing

Phone use:

  • Mobile app interaction
  • Accessibility assistance
  • Automated workflows

Video Analysis

Meeting summarization:

  • Key points extraction
  • Action item identification
  • Speaker attribution

Content moderation:

  • Policy violation detection
  • Age-appropriate filtering
  • Brand safety

Security monitoring:

  • Anomaly detection
  • Event recognition
  • Alert generation

Real-Time Interaction

Voice assistants with vision:

  • Describe surroundings
  • Read documents aloud
  • Visual Q&A

Customer service:

  • Screen sharing analysis
  • Visual troubleshooting
  • Product identification

Multimodal Reasoning

From research: "Until 2025, there was only one open-source multimodal reasoning model, QVQ-72B-preview by Qwen."

Emerging capabilities:

  • Mathematical diagram solving
  • Scientific figure analysis
  • Complex visual puzzles
  • Step-by-step visual reasoning

Unified Generation

Models like Janus-Pro that both understand AND generate across modalities—closing the loop between perception and creation.

Real-Time Streaming

From MiniCPM-o: Real-time speech conversation and multimodal streaming support, enabling fluid human-AI interaction.

World Models

Moving toward models that understand physical world dynamics, cause-and-effect, and can simulate outcomes.

Open vs Proprietary

From research: "Proprietary models (GPT-4o, Gemini, Claude) lead in absolute performance but lock you into API pricing and prevent fine-tuning. Use them when accuracy justifies costs, and you don't need data privacy or custom adaptations."

From research: "Open-source models like Qwen2.5-VL-72B and InternVL3-78B now perform within 5-10% of proprietary models. You control deployment, fine-tune on proprietary data, and eliminate per-call costs at scale."

| Factor | Proprietary | Open-Source |
|---|---|---|
| Quality | Best | 5-10% behind |
| Cost at scale | High | Low |
| Privacy | Data leaves | Data stays |
| Fine-tuning | No | Yes |
| Latency | Network | Local possible |
| Support | Official | Community |

Conclusion

Multimodal LLMs have evolved from text-with-images to unified perception systems:

  1. Qwen VL series leads open-source vision-language
  2. Janus-Pro pioneers unified understanding + generation
  3. MiniCPM-o proves multimodal can be efficient (~8B params)
  4. GPT-4o/Gemini set commercial benchmarks
  5. Any-to-any models are the emerging frontier

Recommendations:

  • Reliability first: GPT-4o or Claude
  • Open-source quality: Qwen2.5-VL-72B
  • Efficient/edge: MiniCPM-o 2.6 or Qwen2.5-VL-7B
  • Image generation: Janus-Pro-7B
  • Video understanding: Qwen2.5-VL or Gemini

Start with proprietary for prototyping, evaluate open models for production cost and privacy.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
