Video Generation AI 2025: Sora 2 vs Veo 3 vs Runway Complete Guide

A comprehensive guide to AI video generation in 2025—Sora 2, Veo 3, Runway Gen-4, Kling, and more. Capabilities, pricing, API access, and practical implementation.


The Video Generation Revolution

2025 has been the breakthrough year for AI video generation. What was impossible in 2023 is now routine: generating photorealistic videos from text prompts, with consistent characters, physics simulation, and even native audio.

Why video generation is fundamentally harder than image generation: Images are static—each pixel is independent of time. Video adds a temporal dimension: objects must move consistently, physics must be respected, and visual coherence must persist across hundreds of frames. A minor glitch that would be invisible in a single image becomes jarring when repeated across 30 frames per second. This is why video generation lagged image generation by 2+ years despite similar underlying architectures.
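
A quick back-of-the-envelope calculation makes the scale jump concrete:

Python
# Rough scale comparison: one 1080p image vs. a short 1080p clip
width, height, channels = 1920, 1080, 3
fps, seconds = 30, 10

values_per_image = width * height * channels          # ~6.2 million values
values_per_clip = values_per_image * fps * seconds    # ~1.9 billion values

print(f"Single frame: {values_per_image:,} values")
print(f"10-second clip: {values_per_clip:,} values "
      f"across {fps * seconds} frames that must stay mutually consistent")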

The commercial inflection point: What changed in 2025 wasn't just quality—it was consistency and controllability. Early models produced impressive clips but couldn't maintain character identity, follow physics accurately, or handle complex prompts reliably. The new generation (Sora 2, Veo 3) achieves "production quality" for many use cases: ads, social content, prototyping. The Disney deal signals that Hollywood sees these tools as complementary to human creators, not replacements.

The scale of investment tells the story: Disney's $1 billion deal with OpenAI for Sora 2 character rights, Google's Veo 3 integration into YouTube Shorts, and Runway powering Hollywood productions.

This guide covers everything you need to know about AI video generation in 2025—from consumer tools to API integration.

Quick Comparison

| Tool | Max Length | Resolution | Audio | Price | Best For |
|------|------------|------------|-------|-------|----------|
| Sora 2 | 60 seconds | 1080p | No | $20-200/month | Realism, storytelling |
| Veo 3 | 8 seconds | 4K | Native | $20-250/month | Audio sync, YouTube |
| Runway Gen-4 | 16 seconds | 4K | Add-on | $15-95/month | Editing, filmmaking |
| Kling 2.0 | 5 minutes | 1080p | Via Kling Audio | Free-$66/month | Long-form, faces |
| Minimax Hailuo | 6 seconds | 720p | No | Free | Quick experiments |
| Pika 2.0 | 15 seconds | 1080p | Yes | $8-58/month | Motion effects |

OpenAI Sora 2

Overview

Sora 2, released September 2025, is OpenAI's second-generation video model. It produces the most photorealistic videos with remarkable understanding of physics and object permanence.

Key capabilities:

  • Up to 60-second clips at 1080p
  • Text-to-video and image-to-video
  • Excellent physics simulation
  • Character consistency across scenes
  • Support for 200+ Disney characters (licensed)

Technical Details

Sora uses a diffusion transformer architecture:

Code
Text Prompt → Text Encoder → Diffusion Transformer → Video Frames
                                    ↑
                              Noise Schedule
                              (iterative denoising)

Architecture highlights:

  • Spacetime patches (3D video tokens; see the sketch below)
  • Variable duration, resolution, aspect ratio
  • Trained on internet-scale video data
  • Recaptioning with detailed descriptions
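
To make "spacetime patches" concrete, here is a minimal NumPy sketch of cutting a video tensor into 3D tokens. The patch sizes are arbitrary and the real model pairs patchification with a learned latent encoder, so treat this as a conceptual illustration rather than Sora's actual tokenizer.

Python
import numpy as np

def to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Split a video tensor (T, H, W, C) into flattened spacetime patches."""
    T, H, W, C = video.shape
    # Truncate so each dimension divides evenly into patches
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = video.reshape(
        T // patch_t, patch_t,
        H // patch_h, patch_h,
        W // patch_w, patch_w, C
    ).transpose(0, 2, 4, 1, 3, 5, 6)
    # Each token is one flattened 3D block of pixels
    return patches.reshape(-1, patch_t * patch_h * patch_w * C)

# A 2-second, 24 fps, 256x256 RGB clip becomes a sequence of 3,072 tokens
clip = np.random.rand(48, 256, 256, 3).astype(np.float32)
print(to_spacetime_patches(clip).shape)  # (3072, 3072)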

Using Sora

Web Interface (ChatGPT)

Code
Prompt: "A serene underwater scene of a coral reef. Colorful
tropical fish swim lazily through crystal-clear water. Sunlight
filters down from the surface, creating dancing light patterns
on the sandy bottom. A sea turtle glides gracefully through the
frame from left to right."

Settings:
- Duration: 10 seconds
- Aspect ratio: 16:9
- Resolution: 1080p (Pro users)

Prompt Engineering for Sora

Why video prompts differ from image prompts: Image prompts describe a static scene—composition, lighting, style. Video prompts must also describe motion: what moves, how fast, in what direction. They need camera instructions: is the camera static, panning, tracking a subject? And they need temporal structure: does the scene evolve? The best video prompts read like mini screenplays, specifying what the viewer should experience over time.

The anatomy of an effective video prompt: Top-performing prompts have five elements: (1) scene description (what's visible), (2) camera movement (how we view it), (3) visual style (aesthetic treatment), (4) mood (emotional tone), and (5) motion description (what happens over time). Missing any element leaves the model to guess—and guesses create inconsistency.

Python
def create_sora_prompt(
    scene: str,
    camera: str = None,
    style: str = None,
    mood: str = None,
    motion: str = None
) -> str:
    """
    Structure prompts for best Sora results.

    Key elements:
    1. Scene description (what's in frame)
    2. Camera movement (if any)
    3. Visual style
    4. Mood/atmosphere
    5. Motion/action
    """
    prompt_parts = []

    # Scene is required
    prompt_parts.append(scene)

    # Add camera movement
    if camera:
        camera_terms = {
            "static": "Static camera shot",
            "pan_left": "Camera slowly pans from right to left",
            "pan_right": "Camera slowly pans from left to right",
            "zoom_in": "Camera gradually zooms in",
            "zoom_out": "Camera pulls back slowly",
            "dolly": "Camera moves forward through the scene",
            "crane": "Camera rises upward revealing the scene",
            "handheld": "Slight handheld camera movement",
            "drone": "Aerial drone shot moving forward",
            "tracking": "Camera tracks alongside the subject"
        }
        prompt_parts.append(camera_terms.get(camera, camera))

    # Add style
    if style:
        style_terms = {
            "cinematic": "Cinematic quality, film grain, dramatic lighting",
            "documentary": "Documentary style, natural lighting",
            "anime": "Anime style animation",
            "photorealistic": "Photorealistic, high detail",
            "vintage": "Vintage film look, warm colors, soft focus",
            "noir": "Film noir style, high contrast, dramatic shadows"
        }
        prompt_parts.append(style_terms.get(style, style))

    # Add mood
    if mood:
        prompt_parts.append(f"The mood is {mood}")

    # Add motion description
    if motion:
        prompt_parts.append(motion)

    return ". ".join(prompt_parts)


# Example usage
prompt = create_sora_prompt(
    scene="A lone astronaut walks across the surface of Mars, red dust swirling around their boots",
    camera="tracking",
    style="cinematic",
    mood="isolated and contemplative",
    motion="The astronaut moves slowly, deliberately, pausing to look at the horizon"
)

print(prompt)
# Output: "A lone astronaut walks across the surface of Mars, red dust
# swirling around their boots. Camera tracks alongside the subject.
# Cinematic quality, film grain, dramatic lighting. The mood is isolated
# and contemplative. The astronaut moves slowly, deliberately, pausing
# to look at the horizon."

Sora Storyboard Mode

Create connected scenes:

Python
storyboard = [
    {
        "scene": 1,
        "prompt": "A woman sits alone at a cafe table in Paris, looking pensively out the window at the rain",
        "duration": 8,
        "camera": "static"
    },
    {
        "scene": 2,
        "prompt": "Close-up of the woman's hands holding a coffee cup, rain visible through window reflection",
        "duration": 5,
        "camera": "static",
        "transition": "cut"
    },
    {
        "scene": 3,
        "prompt": "The woman stands and walks toward the cafe door, putting on her coat",
        "duration": 7,
        "camera": "tracking",
        "transition": "dissolve"
    },
    {
        "scene": 4,
        "prompt": "Wide shot of the woman stepping out into the rainy Paris street, Eiffel Tower visible in distance",
        "duration": 10,
        "camera": "crane",
        "transition": "cut"
    }
]

# Character consistency is maintained across scenes through:
# 1. Consistent character descriptions
# 2. Reference images (image-to-video for first scene)
# 3. Sora's internal character tracking

Disney Character Access

With the Disney partnership:

Code
# Licensed characters available:
- Disney Animation: Mickey, Elsa, Moana, etc.
- Pixar: Woody, Buzz, Nemo, etc.
- Marvel: Iron Man, Spider-Man, etc.
- Star Wars: Darth Vader, Yoda, etc.

# Prompt example:
"Buzz Lightyear flying through space, stars streaking past,
dramatic lighting, Pixar animation style"

Pricing

| Plan | Price | Features |
|------|-------|----------|
| ChatGPT Plus | $20/month | 720p, 5-sec limit, 50 videos/month |
| ChatGPT Pro | $200/month | 1080p, 20-sec limit, unlimited |

Limitations

  • No audio generation (must add separately)
  • Sometimes struggles with text in videos
  • Human hands and complex interactions can be glitchy
  • Generation time: 30 seconds to 5 minutes per clip
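
Because a single clip can take minutes, production code usually submits the generation job and polls asynchronously instead of blocking. A minimal sketch, assuming a hypothetical client with a get_status call rather than any specific SDK:

Python
import asyncio

async def wait_for_video(client, job_id: str, timeout_s: int = 600, poll_s: int = 10):
    """Poll a (hypothetical) video-generation job until it completes or times out."""
    elapsed = 0
    while elapsed < timeout_s:
        status = await client.get_status(job_id)  # hypothetical API call
        if status.state == "completed":
            return status.video_url
        if status.state == "failed":
            raise RuntimeError(f"Generation failed: {status.error}")
        await asyncio.sleep(poll_s)
        elapsed += poll_s
    raise TimeoutError(f"Job {job_id} did not finish within {timeout_s}s")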

Google Veo 3

Overview

Veo 3, released December 2025, is Google's flagship video model. Its standout feature is native audio generation—synchronized sound effects, ambient audio, and even music.

Key capabilities:

  • Up to 8-second clips (extendable by stitching segments together; see the sketch after this list)
  • Native 4K resolution
  • Native audio generation (unique feature)
  • YouTube Shorts integration
  • Veo 3 Fast mode for quick iteration
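
Because individual clips top out at 8 seconds, longer sequences are usually produced by generating segments separately and stitching them together. A minimal sketch using moviepy (file names are placeholders):

Python
import moviepy.editor as mpe

def stitch_clips(paths, output_path="stitched.mp4", crossfade_s=0.5):
    """Concatenate generated segments into one video with short crossfades."""
    clips = [mpe.VideoFileClip(p) for p in paths]
    # Crossfade every clip after the first for smoother transitions
    faded = [clips[0]] + [c.crossfadein(crossfade_s) for c in clips[1:]]
    final = mpe.concatenate_videoclips(faded, method="compose", padding=-crossfade_s)
    final.write_videofile(output_path)
    return output_path

stitch_clips(["segment_1.mp4", "segment_2.mp4", "segment_3.mp4"])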

Native Audio Generation

Veo 3's killer feature is synchronized audio:

Python
# Veo 3 generates both video AND matching audio

prompt = """
A thunderstorm over a mountain lake. Lightning illuminates
the peaks. Rain falls heavily on the water surface.
Thunder rumbles in the distance.
"""

# Output includes:
# - Video: The visual scene
# - Audio: Rain sounds, thunder, ambient wind
# - Perfectly synchronized to visual events

Using Veo 3

Google AI Studio

Python
# Note: the calls below are illustrative; check the current Google GenAI SDK
# documentation for the exact video-generation method names.
import google.generativeai as genai
from google.generativeai import types

# Configure
genai.configure(api_key="YOUR_API_KEY")

# Generate video
response = genai.generate_video(
    model="veo-3",
    prompt="""
    A cozy coffee shop interior. Soft jazz plays in the background.
    Steam rises from a freshly poured latte. Rain patters against
    the window. A barista moves in the background, preparing drinks.
    """,
    config=types.GenerateVideoConfig(
        duration_seconds=8,
        aspect_ratio="16:9",
        resolution="1080p",
        generate_audio=True,
        audio_style="ambient"
    )
)

# Download the generated video
with open("coffee_shop.mp4", "wb") as f:
    f.write(response.video_bytes)

print(f"Video generated: {response.duration}s")
print(f"Audio included: {response.has_audio}")

Veo 3 Fast Mode

For quick iterations (YouTube Shorts):

Python
# Fast mode: ~10 seconds generation, lower quality
response = genai.generate_video(
    model="veo-3-fast",
    prompt="A cute puppy running through autumn leaves",
    config=types.GenerateVideoConfig(
        duration_seconds=8,
        fast_mode=True  # Enables Veo 3 Fast
    )
)

Prompt Techniques for Veo 3

Python
class Veo3PromptBuilder:
    """Builder for optimized Veo 3 prompts."""

    def __init__(self):
        self.visual_elements = []
        self.audio_elements = []
        self.camera = None
        self.style = None

    def add_visual(self, element: str):
        """Add visual element to scene."""
        self.visual_elements.append(element)
        return self

    def add_sound(self, sound: str):
        """Add sound element (Veo 3 will generate matching audio)."""
        self.audio_elements.append(sound)
        return self

    def set_camera(self, movement: str):
        """Set camera movement."""
        self.camera = movement
        return self

    def set_style(self, style: str):
        """Set visual style."""
        self.style = style
        return self

    def build(self) -> str:
        """Build the final prompt."""
        parts = []

        # Visual scene
        if self.visual_elements:
            parts.append(" ".join(self.visual_elements))

        # Audio cues (Veo 3 understands these)
        if self.audio_elements:
            audio_desc = "Sounds include: " + ", ".join(self.audio_elements)
            parts.append(audio_desc)

        # Camera
        if self.camera:
            parts.append(f"Camera: {self.camera}")

        # Style
        if self.style:
            parts.append(f"Style: {self.style}")

        return ". ".join(parts)


# Example usage
prompt = (
    Veo3PromptBuilder()
    .add_visual("A busy Tokyo street at night")
    .add_visual("Neon signs reflect on wet pavement")
    .add_visual("People with umbrellas walk past")
    .add_sound("City ambiance")
    .add_sound("Rain on umbrellas")
    .add_sound("Distant traffic")
    .add_sound("Japanese pop music from a nearby store")
    .set_camera("Slow tracking shot following pedestrians")
    .set_style("Cinematic, Blade Runner aesthetic")
    .build()
)

YouTube Shorts Integration

Veo 3 is integrated into YouTube Create:

Code
YouTube Create App:
1. Open YouTube Create
2. Select "AI Video"
3. Enter prompt
4. Choose "Veo 3 Fast" for quick generation
5. Edit in timeline
6. Publish directly to Shorts

Features:
- SynthID watermarking (imperceptible watermark identifying AI-generated content)
- Direct upload to channel
- Built-in editing tools
- Music library integration

Pricing

| Plan | Price | Features |
|------|-------|----------|
| Google AI Pro | $20/month | 1,000 credits, watermarked |
| Google AI Ultra | $250/month | 12,500 credits, no watermark |
| API | Pay-per-use | $0.50/second generated |

Runway Gen-4

Overview

Runway has been the creative professional's choice since Gen-1. Gen-4, along with their Aleph model, offers the most comprehensive editing toolkit alongside generation.

Key capabilities:

  • Up to 16-second clips
  • 4K resolution
  • Advanced editing tools (inpainting, outpainting)
  • Motion brush for precise control
  • Multi-clip projects with transitions

Runway's Tool Suite

Runway isn't just a generator—it's a complete video AI platform:

Code
Generation Tools:
├── Gen-4 (Text-to-Video)
├── Gen-4 Turbo (Fast generation)
├── Aleph (Editing & transformation)
└── Image-to-Video

Editing Tools:
├── Inpainting (remove/replace objects)
├── Outpainting (extend frame)
├── Motion Brush (control movement)
├── Super Resolution (upscale)
├── Frame Interpolation (slow motion)
└── Background Removal

Audio Tools:
├── Audio Sync (lip sync)
├── Sound Effects
└── Music Generation

Using Runway API

Python
# Note: illustrative client usage; check Runway's developer docs for the
# current SDK package and method names.
import runway

# Initialize client
client = runway.Client(api_key="YOUR_API_KEY")

# Text-to-Video
task = client.text_to_video.create(
    prompt="""
    A majestic eagle soars over snow-capped mountains.
    Golden hour lighting. Dramatic clouds in background.
    Camera follows the eagle in flight.
    """,
    model="gen4",
    duration=10,
    aspect_ratio="16:9",
    resolution="1080p"
)

# Wait for completion
result = task.wait()
print(f"Video URL: {result.url}")

# Download
video_bytes = client.download(result.url)
with open("eagle.mp4", "wb") as f:
    f.write(video_bytes)

Image-to-Video

Start from an image for more control:

Python
# Upload reference image
image = client.uploads.create(
    file=open("hero_character.png", "rb")
)

# Generate video from image
task = client.image_to_video.create(
    image=image.id,
    prompt="The character turns and walks toward the camera, confident stride",
    model="gen4",
    duration=8,
    motion_strength=0.7  # 0.0-1.0, higher = more motion
)

result = task.wait()

Motion Brush

Precise control over what moves:

Python
# Motion brush defines regions and their movement

motion_config = {
    "regions": [
        {
            "mask": "mask_clouds.png",  # Mask image
            "direction": "right",
            "speed": 0.3
        },
        {
            "mask": "mask_water.png",
            "direction": "oscillate",
            "speed": 0.5
        },
        {
            "mask": "mask_character.png",
            "direction": "forward",
            "speed": 0.8
        }
    ],
    "static_regions": ["mask_buildings.png"]  # These don't move
}

task = client.image_to_video.create(
    image=image.id,
    prompt="Scene comes to life",
    motion_config=motion_config,
    duration=8
)

Aleph: Advanced Editing

Runway's Aleph model specializes in video transformation:

Python
# Remove object from video
task = client.aleph.inpaint(
    video="input_video.mp4",
    mask="object_mask.mp4",  # Mask video marking object to remove
    prompt="Clean background, seamless removal"
)

# Style transfer
task = client.aleph.style_transfer(
    video="input_video.mp4",
    style_image="anime_style_reference.jpg",
    strength=0.8
)

# Extend video (outpainting in time)
task = client.aleph.extend(
    video="short_clip.mp4",
    direction="forward",  # or "backward"
    duration=5,  # seconds to add
    prompt="Continue the scene naturally"
)

Pricing

| Plan | Price | Credits/month | Features |
|------|-------|---------------|----------|
| Free | $0 | 125 (one-time) | Watermarked |
| Standard | $15/month | 625 | Gen-4, no watermark |
| Pro | $35/month | 2,250 | Priority, 4K |
| Unlimited | $95/month | Unlimited | All features |

Kling 2.0

Overview

Kling (by Kuaishou) excels at long-form generation and realistic human faces. It's the go-to for character-driven content.

Key capabilities:

  • Up to 5 minutes per clip (industry-leading)
  • Excellent facial consistency
  • Lip sync with Kling Audio
  • Powerful motion control

Long-Form Generation

Python
# Kling handles long narratives
story_scenes = [
    {
        "prompt": "A young woman wakes up in a small apartment, morning sunlight streaming through curtains",
        "duration": 20
    },
    {
        "prompt": "She makes coffee, looking thoughtfully out the window at the city below",
        "duration": 15,
        "character_ref": "scene_1"  # Maintain character consistency
    },
    {
        "prompt": "Close-up of her face as she receives a surprising phone call",
        "duration": 10,
        "character_ref": "scene_1"
    },
    {
        "prompt": "She rushes to get ready, putting on a coat and grabbing her keys",
        "duration": 20,
        "character_ref": "scene_1"
    },
    {
        "prompt": "She runs down busy city streets, weaving through crowds",
        "duration": 25,
        "character_ref": "scene_1"
    }
]

# Total: 90 seconds of consistent narrative

Lip Sync with Kling Audio

Python
# Generate video with matching lip sync
# (illustrative pseudocode for Kling's API; API access is available on the
# Enterprise plan and the exact calls may differ)

# 1. Create the visual
video_task = kling.create_video(
    prompt="A news anchor delivers breaking news in a professional studio",
    duration=30,
    character_style="realistic",
    lip_sync_ready=True  # Prepares for audio overlay
)

# 2. Add voice and lip sync
audio_task = kling.add_audio(
    video_id=video_task.id,
    audio_source="tts",  # or "upload" for custom audio
    text="""
    Good evening. Tonight's top story: Scientists have made
    a breakthrough discovery that could change everything we
    know about renewable energy. Our correspondent has more.
    """,
    voice="news_anchor_female",
    lip_sync=True  # Adjusts mouth movements to match
)

Pricing

| Plan | Price | Features |
|------|-------|----------|
| Free | $0 | 6-sec clips, watermarked |
| Standard | $8/month | 30-sec clips, no watermark |
| Pro | $28/month | 2-min clips |
| Enterprise | $66/month | 5-min clips, API access |

Practical Implementation

Video Generation Pipeline

Python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import asyncio

class VideoProvider(Enum):
    SORA = "sora"
    VEO = "veo"
    RUNWAY = "runway"
    KLING = "kling"

@dataclass
class VideoRequest:
    prompt: str
    duration: int = 8
    resolution: str = "1080p"
    aspect_ratio: str = "16:9"
    style: Optional[str] = None
    audio: bool = False
    reference_image: Optional[str] = None

@dataclass
class VideoResult:
    url: str
    duration: float
    resolution: str
    has_audio: bool
    provider: VideoProvider
    cost: float

class VideoGenerator:
    """
    Unified interface for multiple video generation providers.
    """

    def __init__(self):
        self.providers = {}
        self._init_providers()

    def _init_providers(self):
        """Initialize available providers."""
        if os.getenv("OPENAI_API_KEY"):
            from openai import OpenAI
            self.providers[VideoProvider.SORA] = OpenAI()

        if os.getenv("GOOGLE_API_KEY"):
            import google.generativeai as genai
            genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
            self.providers[VideoProvider.VEO] = genai

        if os.getenv("RUNWAY_API_KEY"):
            import runway
            self.providers[VideoProvider.RUNWAY] = runway.Client(
                api_key=os.getenv("RUNWAY_API_KEY")
            )

    def select_provider(self, request: VideoRequest) -> VideoProvider:
        """
        Select best provider based on request requirements.
        """
        # Need audio? Veo 3 is the only native option
        if request.audio and VideoProvider.VEO in self.providers:
            return VideoProvider.VEO

        # Long duration? Kling excels
        if request.duration > 30 and VideoProvider.KLING in self.providers:
            return VideoProvider.KLING

        # Need editing/control? Runway
        if request.reference_image and VideoProvider.RUNWAY in self.providers:
            return VideoProvider.RUNWAY

        # Default to Sora for quality
        if VideoProvider.SORA in self.providers:
            return VideoProvider.SORA

        # Fall back to whatever is available
        return list(self.providers.keys())[0]

    async def generate(
        self,
        request: VideoRequest,
        provider: Optional[VideoProvider] = None
    ) -> VideoResult:
        """Generate video using specified or auto-selected provider."""

        if provider is None:
            provider = self.select_provider(request)

        if provider == VideoProvider.SORA:
            return await self._generate_sora(request)
        elif provider == VideoProvider.VEO:
            return await self._generate_veo(request)
        elif provider == VideoProvider.RUNWAY:
            return await self._generate_runway(request)
        elif provider == VideoProvider.KLING:
            # Kling isn't wired up in _init_providers above; add initialization
            # and a _generate_kling method to enable this branch.
            return await self._generate_kling(request)
        raise ValueError(f"No generator available for provider: {provider}")

    async def _generate_sora(self, request: VideoRequest) -> VideoResult:
        """Generate with Sora."""
        client = self.providers[VideoProvider.SORA]

        response = await client.video.generate(
            model="sora-2",
            prompt=request.prompt,
            duration=min(request.duration, 20),  # Sora max
            resolution=request.resolution
        )

        return VideoResult(
            url=response.url,
            duration=response.duration,
            resolution=request.resolution,
            has_audio=False,
            provider=VideoProvider.SORA,
            cost=self._calculate_cost(VideoProvider.SORA, response.duration)
        )

    async def _generate_veo(self, request: VideoRequest) -> VideoResult:
        """Generate with Veo 3."""
        genai = self.providers[VideoProvider.VEO]

        response = genai.generate_video(
            model="veo-3",
            prompt=request.prompt,
            config={
                "duration_seconds": min(request.duration, 8),
                "resolution": request.resolution,
                "generate_audio": request.audio
            }
        )

        return VideoResult(
            url=response.url,
            duration=response.duration,
            resolution=request.resolution,
            has_audio=request.audio,
            provider=VideoProvider.VEO,
            cost=self._calculate_cost(VideoProvider.VEO, response.duration)
        )

    async def _generate_runway(self, request: VideoRequest) -> VideoResult:
        """Generate with Runway Gen-4."""
        client = self.providers[VideoProvider.RUNWAY]

        if request.reference_image:
            task = client.image_to_video.create(
                image=request.reference_image,
                prompt=request.prompt,
                model="gen4",
                duration=min(request.duration, 16)
            )
        else:
            task = client.text_to_video.create(
                prompt=request.prompt,
                model="gen4",
                duration=min(request.duration, 16)
            )

        result = task.wait()

        return VideoResult(
            url=result.url,
            duration=result.duration,
            resolution=request.resolution,
            has_audio=False,
            provider=VideoProvider.RUNWAY,
            cost=self._calculate_cost(VideoProvider.RUNWAY, result.duration)
        )

    def _calculate_cost(self, provider: VideoProvider, duration: float) -> float:
        """Estimate cost for generation."""
        rates = {
            VideoProvider.SORA: 0.10,  # ~$0.10 per second
            VideoProvider.VEO: 0.50,   # $0.50 per second (API)
            VideoProvider.RUNWAY: 0.05, # ~$0.05 per second
            VideoProvider.KLING: 0.03   # ~$0.03 per second
        }
        return rates.get(provider, 0.10) * duration


# Usage example
async def main():
    generator = VideoGenerator()

    # Auto-select provider
    result = await generator.generate(VideoRequest(
        prompt="A serene Japanese garden with koi fish swimming in a pond",
        duration=10,
        audio=True  # Will select Veo 3
    ))

    print(f"Generated with {result.provider.value}")
    print(f"URL: {result.url}")
    print(f"Cost: ${result.cost:.2f}")

asyncio.run(main())

Batch Processing

Python
import asyncio
from typing import List

async def batch_generate(
    prompts: List[str],
    generator: VideoGenerator,
    max_concurrent: int = 3
) -> List[VideoResult]:
    """
    Generate multiple videos with concurrency control.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def generate_one(prompt: str) -> VideoResult:
        async with semaphore:
            return await generator.generate(VideoRequest(prompt=prompt))

    tasks = [generate_one(prompt) for prompt in prompts]
    return await asyncio.gather(*tasks)


# Example: Generate a video series
async def create_video_series():
    generator = VideoGenerator()

    prompts = [
        "Episode 1: A mysterious letter arrives at an old mansion",
        "Episode 2: The detective examines clues in the library",
        "Episode 3: A secret passage is discovered behind the bookshelf",
        "Episode 4: The truth is revealed in a dramatic confrontation"
    ]

    results = await batch_generate(prompts, generator)

    for i, result in enumerate(results):
        print(f"Episode {i+1}: {result.url}")

Adding Audio Post-Generation

For providers without native audio:

Python
import os

from elevenlabs import ElevenLabs
import moviepy.editor as mpe

class AudioAdder:
    """Add audio to generated videos."""

    def __init__(self):
        self.elevenlabs = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

    def add_narration(
        self,
        video_path: str,
        text: str,
        voice: str = "narrator",
        output_path: str = "output.mp4"
    ):
        """Add TTS narration to video."""

        # Generate speech
        audio = self.elevenlabs.generate(
            text=text,
            voice=voice,
            model="eleven_turbo_v2"
        )

        # Save temp audio (the SDK may return raw bytes or streamed chunks)
        audio_bytes = audio if isinstance(audio, (bytes, bytearray)) else b"".join(audio)
        with open("temp_audio.mp3", "wb") as f:
            f.write(audio_bytes)

        # Combine video and audio
        video = mpe.VideoFileClip(video_path)
        audio_clip = mpe.AudioFileClip("temp_audio.mp3")

        # Match durations
        if audio_clip.duration > video.duration:
            audio_clip = audio_clip.subclip(0, video.duration)

        final = video.set_audio(audio_clip)
        final.write_videofile(output_path)

        # Cleanup
        os.remove("temp_audio.mp3")

        return output_path

    def add_music(
        self,
        video_path: str,
        music_path: str,
        volume: float = 0.3,
        output_path: str = "output.mp4"
    ):
        """Add background music to video."""

        video = mpe.VideoFileClip(video_path)
        music = mpe.AudioFileClip(music_path)

        # Loop music if needed
        if music.duration < video.duration:
            music = mpe.concatenate_audioclips([music] * int(video.duration / music.duration + 1))

        # Trim and adjust volume
        music = music.subclip(0, video.duration).volumex(volume)

        # Mix with original audio if exists
        if video.audio:
            final_audio = mpe.CompositeAudioClip([video.audio, music])
        else:
            final_audio = music

        final = video.set_audio(final_audio)
        final.write_videofile(output_path)

        return output_path


# Usage
audio_adder = AudioAdder()

# Add narration to Sora-generated video
audio_adder.add_narration(
    video_path="sora_output.mp4",
    text="In a world where dreams become reality, one person dared to imagine the impossible...",
    voice="dramatic_narrator"
)

Best Practices

Prompt Engineering Tips

Python
# DO: Be specific about visual details
good_prompt = """
A golden retriever puppy sits in a sun-dappled garden.
Soft afternoon light filters through oak leaves.
The puppy tilts its head curiously, ears perked up.
Shallow depth of field, background softly blurred.
Shot on 35mm film, warm color grading.
"""

# DON'T: Be vague or contradictory
bad_prompt = """
A dog in a nice place looking cute.
"""

# DO: Describe camera movement explicitly
good_camera = """
Camera slowly dollies forward while tilting up,
revealing the full height of the ancient redwood tree.
"""

# DO: Include temporal descriptions
good_temporal = """
The flower blooms in accelerated time-lapse,
petals unfurling one by one over 5 seconds.
"""

Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Inconsistent characters | Use image-to-video with a reference |
| Unnatural motion | Reduce motion strength, be specific |
| Bad hands/faces | Use models optimized for humans (Kling) |
| Wrong aspect ratio | Specify explicitly in prompt |
| Artifacts | Try different seed, reduce duration |
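
When a clip comes back with artifacts, the cheapest fix is often to regenerate several takes of the same prompt and keep the best one. A small sketch reusing the VideoGenerator and batch_generate helpers defined earlier (assuming they are in scope):

Python
async def generate_variants(prompt: str, n: int = 4):
    """Generate several takes of one prompt for manual review."""
    generator = VideoGenerator()
    results = await batch_generate([prompt] * n, generator, max_concurrent=2)
    for i, result in enumerate(results, start=1):
        print(f"Variant {i}: {result.url} (${result.cost:.2f})")
    return results

# asyncio.run(generate_variants("A hummingbird hovering at a red flower, macro shot"))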

Conclusion

AI video generation in 2025 has reached production quality. The choice of tool depends on your specific needs:

  • Sora 2: Best overall quality, storytelling
  • Veo 3: Only option for native audio
  • Runway: Best editing tools, creative control
  • Kling: Best for long-form and faces

For most projects, you'll likely use multiple tools—Sora or Veo for generation, Runway for editing, and external tools for audio when needed.
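
As a closing sketch, here is how the pieces in this guide could fit together in one workflow, using the VideoGenerator and AudioAdder classes defined above (prompt, paths, and narration are placeholders):

Python
import asyncio
import urllib.request

async def produce_clip():
    generator = VideoGenerator()

    # 1. Generate the visuals (provider auto-selected; no native audio requested)
    result = await generator.generate(VideoRequest(
        prompt="A lighthouse on a rocky coast at dusk, waves crashing below",
        duration=10
    ))
    print(f"Generated with {result.provider.value}: {result.url}")

    # 2. Download the clip locally
    urllib.request.urlretrieve(result.url, "lighthouse.mp4")

    # 3. Layer narration on top with the AudioAdder helper
    AudioAdder().add_narration(
        video_path="lighthouse.mp4",
        text="At the edge of the world, the light never goes out.",
        voice="narrator",
        output_path="lighthouse_final.mp4"
    )

asyncio.run(produce_clip())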

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
