Wan 2.2 is one of the most capable open-source video generation models available today. ComfyUI is the go-to interface for running it with full control over every parameter. This guide covers what the model is, how to install it, and how to troubleshoot common errors.
To skip setup entirely, Thunder Compute offers a one-click ComfyUI template on an RTX A6000 (48 GB VRAM) at $0.35/hr.
What Is Wan 2.2?
Wan 2.2 is an open-source AI video generation model released by Alibaba in July 2025. It generates short video clips from text prompts or images at a quality level that competes with commercial tools.
This model is released under the Apache 2.0 license, so you can download the weights from Hugging Face and use them commercially.
What sets it apart is its Mixture-of-Experts (MoE) architecture, which meaningfully changes how the model processes video.
The MoE Architecture
Mixture of Experts is a machine learning technique where a neural network is broken up into specialized sub-networks. A gating network chooses which expert sub-networks to use for an input, which can increase model capacity without activating every parameter at once.
Applied to video generation, this approach uses specialized sub-networks for different stages of the denoising process, rather than a single monolithic transformer for every step.
The result is better motion coherence, sharper detail, and more reliable instruction following, without proportionally higher compute costs.
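The general gating idea can be sketched in a few lines of plain Python. This is a generic top-k router for illustration only, not Wan 2.2's own routing code; the gate weights and the two toy experts (`double`, `negate`) are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, gate_weights, experts, top_k=1):
    """Send input x through the top_k experts picked by a linear gate."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Only the chosen experts run; the others stay idle for this input.
    output = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        output = [o + (probs[i] / norm) * yi for o, yi in zip(output, y)]
    return output, chosen

# Hypothetical toy experts standing in for real sub-networks.
def double(x):
    return [2.0 * v for v in x]

def negate(x):
    return [-v for v in x]

# Hypothetical gate: expert 0 responds to the first feature, expert 1 to the second.
GATE = [[1.0, 0.0], [0.0, 1.0]]
```

Calling `route([3.0, 1.0], GATE, [double, negate])` runs only the first expert, which is the point of the technique: capacity grows with the number of experts while per-input compute stays close to that of a single one.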
How It Generates Video
Like Stable Diffusion, Wan 2.2 is a diffusion model. It starts from random noise and progressively denoises toward a coherent video guided by your text or image input.
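To make "start from noise and progressively denoise" concrete, here is a toy loop. It assumes a simple linear noise-to-data path and uses an oracle in place of the trained network, so it illustrates the shape of the process rather than Wan 2.2's actual sampler; `toy_velocity` and `denoise` are hypothetical names.

```python
import random

def toy_velocity(x_t, t, target):
    """Stand-in for the trained network: an oracle that already knows the clean target."""
    return [(xi - ti) / t for xi, ti in zip(x_t, target)]

def denoise(target, num_steps=20, seed=0):
    """Walk from pure noise (t = 1) toward the target (t = 0) in equal Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt                    # current noise level, never 0
        v = toy_velocity(x, t, target)         # direction pointing away from the target
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model the oracle is replaced by a network conditioned on the text or image input, and `target` is unknown; the loop structure is what carries over.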
Unlike image generation, which only has to compose a single frame in space, video generation must also work out how every pixel moves over the duration of the clip. In other words, a core challenge in video synthesis is temporal consistency.
Wan 2.2 addresses this at the diffusion level with two expert models:
- A high-noise expert that handles the early denoising steps and establishes overall layout and motion structure.
- A low-noise expert that takes over for later steps to refine texture and fine details.
The handoff is determined by the signal-to-noise ratio at each timestep. This two-stage specialization is why Wan 2.2 produces visually cleaner results than a standard single-network approach.
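The handoff rule can be sketched as a threshold on signal-to-noise ratio. The snippet assumes a linear noise schedule where t = 1 is pure noise, and the `boundary` value of 0.875 is an illustrative assumption rather than a setting taken from this guide.

```python
def snr(t):
    """Signal-to-noise ratio under an assumed path x_t = (1 - t) * x0 + t * noise."""
    if t <= 0.0:
        return float("inf")
    return ((1.0 - t) / t) ** 2

def select_expert(t, boundary=0.875):
    """Pick the expert for noise level t; boundary is a hypothetical switch point."""
    return "high_noise" if snr(t) < snr(boundary) else "low_noise"

def plan_steps(num_steps, boundary=0.875):
    """List (noise level, expert) pairs for evenly spaced steps from t = 1 toward 0."""
    levels = [1.0 - i / num_steps for i in range(num_steps)]
    return [(round(t, 4), select_expert(t, boundary)) for t in levels]
```

Because SNR only rises as denoising proceeds, the plan switches experts once and never switches back, which is why the 14B workflows can load the two checkpoints as two consecutive sampling stages.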
Wan 2.2 vs. Wan 2.1: What Changed
Wan 2.1 launched in February 2025 and quickly became the open-source benchmark for AI video. It used a dense diffusion transformer: a single large network that handled every denoising step.
Despite its high quality, the model had clear weaknesses:
- Motion artifacts in complex scenes
- Inconsistent character appearance
- Poor responsiveness to camera instruction prompts
Wan 2.2 replaces that dense transformer with the MoE architecture. Training data was also expanded significantly, with 65.6% more images and 83.2% more videos than Wan 2.1. Together, these changes target the weaknesses of its predecessor listed above.
What About Wan 2.5? Where Does It Fit In?
Wan 2.5 was released by Alibaba in September 2025. It adds native audio-visual generation (synchronized sound and video in one pass), 1080p output at 24fps, and clips up to 10 seconds long.
However, Wan 2.5 weights are not publicly available for local deployment. Access is through managed API endpoints only.
For ComfyUI users running locally or on their own cloud GPU, Wan 2.2 is still the most capable version with full open-source weights. If you need native audio or 1080p output and are comfortable with API access, Wan 2.5 is worth investigating separately.
What Wan 2.2 Can Do: Wan Video Generation Modes
Text-to-Video (T2V)
Text-to-video is the most direct way to use Wan 2.2. You write a prompt, set resolution and frame count, and the model generates a video from scratch. T2V is available in both model variants:
- The 14B variant delivers higher motion fidelity and follows complex prompts more accurately.
- The 5B variant runs on modest hardware and is a solid starting point.
A practical starting prompt is 80 to 120 words. Lead with what the camera sees first, then describe motion and camera movement. Cinematic terms like "volumetric lighting," "anamorphic bokeh," "rack focus," and "teal-and-orange grade" can help steer the model, but visual reliability still depends on the checkpoint, seed, resolution, and workflow settings.
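A small helper can keep prompts in the suggested order and length. This is only a convenience sketch; the function names and the four-part split (subject, motion, camera, style terms) are assumptions made for illustration, not a format the model requires.

```python
def build_prompt(subject, motion, camera, style_terms):
    """Join the parts in the suggested order: what the camera sees, then motion, then camera move."""
    parts = [subject.strip(), motion.strip(), camera.strip(), ", ".join(style_terms)]
    return " ".join(p if p.endswith(".") else p + "." for p in parts if p)

def word_count(prompt):
    """Count whitespace-separated words."""
    return len(prompt.split())

def length_advice(prompt, low=80, high=120):
    """Compare the prompt length with the 80 to 120 word starting range."""
    n = word_count(prompt)
    if n < low:
        return f"{n} words: consider adding detail"
    if n > high:
        return f"{n} words: consider trimming"
    return f"{n} words: within the suggested range"
```

For example, `length_advice(build_prompt(...))` flags a one-line prompt as too short before you spend GPU time on it.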
Image-to-Video (I2V)
Image-to-video takes a static image as input and animates it into a short clip. This is useful for bringing AI-generated or photographic images to life without re-describing the full scene. The model uses the image as the first frame and generates coherent motion from there. A text prompt can still guide the type of motion, such as a camera pan or a character's movement.
First Frame to Last Frame (FLF2V)
FLF2V lets you provide both the opening and closing frames. The model generates the motion that connects them, treating both images as fixed ground truths inside the diffusion process.
This mode is well suited for transitions, morphing effects, and controlled transformation sequences. It gives you deterministic start and end points that text prompting alone cannot reliably guarantee.
Speech-to-Video (S2V)
S2V takes a static image and an audio clip to generate a video of the character speaking, singing, or performing in sync with the audio. This mode is the most relevant for digital human and talking-head applications.
It supports real people, cartoons, digital humans, and animals, in portrait, half-body, and full-body formats. Text prompts can also control background and movement.
Wan 2.2 VRAM Requirements: Can Your GPU Run It?
VRAM Requirements by Model Variant: 5B vs. 14B
Wan 2.2 ships in two primary model sizes:
- The TI2V-5B is a unified hybrid model that handles both text-to-video and image-to-video in a single checkpoint file.
- The A14B variant uses separate high-noise and low-noise checkpoint files for T2V and I2V workflows.
| Variant | Parameters | Precision | Min VRAM (480p) | Recommended VRAM (720p) |
|---|---|---|---|---|
| TI2V-5B | 5B (active) | FP16 | 8 GB | 12 GB |
| TI2V-5B | 5B (active) | FP8 | 6 GB | 8 GB |
| T2V / I2V 14B | 14B (active) | FP8 | 24 GB | 40+ GB |
| T2V / I2V 14B | 14B (active) | GGUF Q5 | 12 GB | 24 GB |
| T2V / I2V 14B | 14B (active) | GGUF Q3 | 6 GB | 12 GB |

Note: VRAM figures are for the diffusion model only. Add approximately 9 GB for the UMT5-XXL text encoder if loaded in VRAM. Most workflows offload it to system RAM, which requires at least 24 GB.
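The table can be turned into a quick lookup that lists which variants fit a given GPU. The numbers are copied from the table above ("40+" is treated as 40); the function name and data layout are hypothetical, and the text encoder overhead from the note is not included.

```python
# (variant, precision, min VRAM at 480p in GB, recommended VRAM at 720p in GB)
VRAM_TABLE = [
    ("TI2V-5B", "FP16", 8, 12),
    ("TI2V-5B", "FP8", 6, 8),
    ("T2V / I2V 14B", "FP8", 24, 40),
    ("T2V / I2V 14B", "GGUF Q5", 12, 24),
    ("T2V / I2V 14B", "GGUF Q3", 6, 12),
]

def options_for(vram_gb, resolution="480p"):
    """Return (variant, precision) pairs whose table figure fits in vram_gb."""
    if resolution not in ("480p", "720p"):
        raise ValueError("resolution must be '480p' or '720p'")
    column = 2 if resolution == "480p" else 3
    return [(row[0], row[1]) for row in VRAM_TABLE if row[column] <= vram_gb]
```

With a 12 GB card, `options_for(12)` returns every row except the 14B FP8 checkpoint, which matches the table's 24 GB minimum for that variant.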
To generate videos on a budget, Thunder Compute offers a one-click ComfyUI template starting at $0.35/hr for an RTX A6000 with 48 GB of VRAM, which is enough to run the 14B model at 720p from day one.
Quantization and Offloading Options for Low-VRAM Setups
At a minor quality cost, FP8 quantization reduces the model's memory footprint by 20 to 40% compared to BF16/FP16. For the 14B model, fitting onto a 24 GB GPU often depends on whether FP8 weights are available. The official Wan 2.2 release includes FP8-scaled checkpoint files for this reason.
GGUF quantization goes further, pushing the 14B model onto GPUs with as little as 6 GB of VRAM via CPU offloading. Q5_K_M and Q6 variants offer the best quality-to-size balance; Q3_K is the minimum viable option.
Keep in mind that more offloading means slower generation, so GGUF on a 6 GB GPU is a learning tool, not a production setup.
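The storage needed for the weights alone follows from parameter count times bits per weight. The sketch below covers only that term; activations, the VAE, and other buffers are not modeled, so whole-pipeline savings are typically smaller than the weights-only difference. The `overhead_gb` parameter is a hypothetical allowance you would have to measure yourself.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return params_billions * bits_per_weight / 8

def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=0.0):
    """Check whether the weights plus an assumed overhead fit without offloading."""
    return weight_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb
```

Under these assumptions, a 14B-parameter checkpoint needs about 28 GB at 16 bits and about 14 GB at 8 bits, which is why FP8 is often the deciding factor on a 24 GB card. GGUF formats use fewer bits per weight still; pass the effective figure for your chosen quantization rather than relying on a guess.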
How to Install Wan 2.2 in ComfyUI

Updating ComfyUI Before You Start
Update ComfyUI before installing any Wan 2.2 model files. Wan 2.2 received Day-0 native support at launch, but the workflow templates require a recent build (0.3.46 or later).
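Version strings should be compared numerically, not as text, because "0.3.100" sorts before "0.3.46" alphabetically. A minimal check, assuming you already have the installed version as a string (how you obtain it depends on your install and is not covered here):

```python
def parse_version(text):
    """Turn '0.3.46' or 'v0.3.46' into a tuple of integers for numeric comparison."""
    core = text.strip().lstrip("v")
    return tuple(int(part) for part in core.split(".")[:3])

def meets_minimum(installed, minimum="0.3.46"):
    """True if the installed build is at least the minimum named in this guide."""
    return parse_version(installed) >= parse_version(minimum)
```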
Downloading the Wan 2.2 Model Files
You need three categories of files: the text encoder, the VAE, and the diffusion model checkpoints.
- Text encoder: download umt5_xxl_fp8_e4m3fn_scaled.safetensors.
- VAE:
  - The 14B model uses wan_2.1_vae.safetensors.
  - The 5B model uses wan2.2_vae.safetensors.
- Diffusion models (ComfyUI/models/diffusion_models/):
  - The 5B variant is a single file (wan2.2_ti2v_5B_fp16.safetensors).
  - The 14B T2V variant requires two files: wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors and wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors.
  - The I2V 14B variant follows the same two-file pattern with i2v in the filename.
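Before loading a workflow, it can help to confirm that the files are where ComfyUI will look for them. The sketch below checks the 14B T2V set. The diffusion_models folder comes from the list above; the text_encoders and vae folder names are assumptions based on the usual ComfyUI layout, so adjust them if your install differs.

```python
from pathlib import Path

# Folder names under <ComfyUI>/models/. Only diffusion_models is stated in this guide;
# text_encoders and vae are assumed.
EXPECTED_14B_T2V = {
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["wan_2.1_vae.safetensors"],
    "diffusion_models": [
        "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors",
        "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors",
    ],
}

def missing_files(comfy_root, expected=EXPECTED_14B_T2V):
    """Return 'folder/filename' entries that are not present under comfy_root/models."""
    models = Path(comfy_root) / "models"
    return [
        f"{folder}/{name}"
        for folder, names in expected.items()
        for name in names
        if not (models / folder / name).is_file()
    ]
```

An empty list from `missing_files("/path/to/ComfyUI")` means all four files were found; anything else names exactly what still needs downloading.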
Loading the Official Wan 2.2 Workflow
ComfyUI includes official workflow templates for TI2V-5B, T2V 14B, I2V 14B, and FLF2V. Access them from the workflow templates panel in the interface, or download the Wan 2.2 workflow JSON files and drag them directly onto the canvas.
The templates handle all node connections automatically, including the dual model loader required for the 14B MoE architecture.
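If you export a workflow in ComfyUI's API format, you can check for the dual loader programmatically. The sketch assumes the API-format layout (a JSON object mapping node IDs to entries with a class_type field) and assumes the diffusion model loader's class name is UNETLoader; the `SAMPLE` workflow is a hypothetical, heavily trimmed stand-in, not an official template.

```python
import json
from collections import Counter

def count_node_types(workflow_json):
    """Count nodes by class_type in an API-format workflow export."""
    workflow = json.loads(workflow_json)
    return Counter(
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    )

def has_dual_loader(workflow_json, loader_type="UNETLoader"):
    """True if at least two diffusion model loader nodes are present (assumed class name)."""
    return count_node_types(workflow_json)[loader_type] >= 2

# Hypothetical minimal workflow for illustration only.
SAMPLE = json.dumps({
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors"}},
    "2": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors"}},
    "3": {"class_type": "CLIPTextEncode", "inputs": {"text": "a fox on a snowy ridge"}},
})
```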
How to Use Wan 2.2 in ComfyUI
Configuring Your First Text-to-Video Generation
Load the Wan 2.2 T2V workflow template and confirm that both diffusion model nodes (high-noise and low-noise) have loaded their checkpoint files.
In the CLIP Text Encode node, write your prompt: aim for 80 to 120 words and lead with subject and camera position before describing motion. Add negative prompts such as "blurry motion," "flickering," and "inconsistent lighting" when you want the sampler to steer away from common video artifacts.
Set resolution and frame count before queuing. A 480p output (832x480) at 81 frames is a practical starting point that balances quality, speed, and VRAM usage.
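Frame count translates into clip length through the playback frame rate. The sketch below assumes two things not stated in this guide: that valid frame counts have the form 4k + 1 (81 fits this pattern), and a default of 16 fps. Treat both as assumptions and check them against your workflow's nodes.

```python
def snap_frames(requested):
    """Round down to the nearest frame count of the form 4k + 1 (assumed constraint)."""
    if requested < 1:
        raise ValueError("frame count must be at least 1")
    return ((requested - 1) // 4) * 4 + 1

def clip_seconds(frames, fps=16):
    """Clip duration in seconds at the given frame rate (16 fps is an assumed default)."""
    return frames / fps
```

Under these assumptions, the 81-frame starting point works out to roughly five seconds of video, and a request for 84 frames would snap back down to 81.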
Running Image-to-Video with Wan 2.2
Load the I2V workflow template and connect your source image to the Load Image node. Keep your prompt focused on the motion you want rather than re-describing the image content; the model already sees the image directly. For best results, use images with clear compositional logic so the model can infer natural motion directions.
For FLF2V, connect two Load Image nodes: one for the first frame and one for the last. The model generates the motion between them, guided by your text prompt. FLF2V works best for smooth, continuous transformations; complex multi-step actions across distant keyframes may produce more conservative results.
Using Wan 2.2 with ControlNet in ComfyUI
Wan 2.2 Fun Control is a ControlNet-enabled variant that lets you drive motion from a reference video or image sequence. It supports Canny edge, depth maps, OpenPose, and MLSD geometric edge signals. To use it, install the VideoX-Fun custom node via ComfyUI Manager, then use the WanFunControl to Video node in your workflow.
ControlNet conditioning locks the motion structure to your reference while the prompt and image handle appearance. This is useful for transferring a performance to a new character or constraining motion to a specific path.
To avoid local VRAM limits, Thunder Compute offers cost-effective cloud GPUs starting at $0.35/hr. You can also start instances from ready-to-launch templates for ComfyUI and Forge Neo.
