OpenRouter
© 2026 OpenRouter, Inc


Video Generation Models

Model rankings updated May 2026 based on real usage data.

OpenRouter provides access to video generation models through a single, unified API gateway. Generate videos from text prompts and reference images via an asynchronous API — compare pricing, capabilities, and supported resolutions to find the best fit for your use case. Video generation is a new modality on OpenRouter, and available models are improving quickly. Learn more about video generation on OpenRouter.

Video Generation Models on OpenRouter

Kling: Video v3.0 Pro

Kling v3.0 Pro is Kuaishou's premium video generation model, offering higher visual quality than the Standard tier. It supports text-to-video and image-to-video workflows, with first-frame and last-frame control for precise scene composition. Clips range from 3 to 15 seconds in 16:9, 9:16, or 1:1 aspect ratios. Native audio generation is available as an option.

by kwaivgi
Kling: Video v3.0 Standard

Kling v3.0 Standard is a video generation model from Kuaishou. It supports text-to-video and image-to-video workflows, with first-frame and last-frame control for guided scene composition. Clips range from 3 to 15 seconds in 16:9, 9:16, or 1:1 aspect ratios. Native audio generation is available as an option.

by kwaivgi
Google: Veo 3.1 Fast

Google's mid-tier video generation model balancing speed and quality. Veo 3.1 Fast generates high-quality video from text or image prompts with native synchronized audio, offering faster turnaround than Veo 3.1 at lower cost. Supports first-frame and last-frame conditioning, multiple resolutions and aspect ratios, and SynthID watermarking.

by google
Google: Veo 3.1 Lite

Google's most cost-effective video generation model, designed for high-volume applications and rapid iteration. Veo 3.1 Lite generates 720p and 1080p video from text or image prompts with native synchronized audio at less than 50% of the cost of Veo 3.1 Fast. Supports 4–8 second clips in landscape (16:9) and portrait (9:16) formats, with SynthID watermarking. Ideal for content platforms, short-form video creation, and automated media generation.

by google
Kling: Video O1

Kling Video O1 is a video generation model from Kuaishou. It supports text and image inputs with video output, enabling text-to-video and image-to-video workflows. It is suited for cinematic content production, with first-frame and last-frame control for precise scene composition. It generates 5 or 10 second clips in 16:9, 9:16, or 1:1 aspect ratios.

by kwaivgi
MiniMax: Hailuo 2.3

Hailuo 2.3 is a video generation model from MiniMax. It accepts text prompts and reference images as input and generates video output, supporting both text-to-video and image-to-video workflows. It is suited for creative content production, cinematic scene generation, and character animation, with a focus on realistic motion and expressive character rendering.

by minimax
ByteDance: Seedance 2.0

Seedance 2.0 is a video generation model from ByteDance. It supports text-to-video, image-to-video with first- and last-frame control, and multimodal reference-to-video. It is particularly strong at preserving character consistency, visual style, and camera movement from reference material. Token usage is computed as (output video height × output video width × duration × 24) / 1024, where duration is in seconds and 24 is the frame rate.

by bytedance
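The token formula above is easy to check with a quick calculation — a minimal sketch; the function name is ours, and the 24 is the output frame rate stated in the formula:

```python
def seedance_tokens(width: int, height: int, duration_s: float) -> float:
    """Token count for Seedance 2.0 output, per the pricing note:
    (height * width * duration * 24) / 1024."""
    return (height * width * duration_s * 24) / 1024


# A 5-second 1280x720 clip:
# (720 * 1280 * 5 * 24) / 1024 = 108000.0 tokens
print(seedance_tokens(1280, 720, 5))
```

So longer clips and higher resolutions scale token usage linearly in duration and in pixel count.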
Alibaba: Wan 2.7

Wan 2.7 is a video generation model from Alibaba. It supports text-to-video, image-to-video with first- and last-frame control, and reference-to-video, where multiple reference images guide the style and content of the generated scene.

by alibaba
ByteDance: Seedance 2.0 Fast

Seedance 2.0 Fast is a video generation model from ByteDance. It supports text-to-video, image-to-video with first- and last-frame control, and multimodal reference-to-video. It prioritizes generation speed and lower cost over maximum output quality. Token usage is computed the same way as for Seedance 2.0: (output video height × output video width × duration × 24) / 1024.

by bytedance
Alibaba: Wan 2.6

Alibaba's most advanced video generation model, supporting over 10 visual creation capabilities in a unified system. Wan 2.6 generates 1080p video at 24fps from text, images, reference videos, or audio, with native audio-visual synchronization and precise lip-sync. Key features include reference-to-video (insert a character's appearance and voice into new scenes), multi-shot storytelling from simple prompts, synchronized sound effects and music, and support for 16:9, 9:16, and 1:1 aspect ratios with clips up to 15 seconds.

by alibaba