Z-Image Turbo is an open-weight text-to-image model released by Alibaba Tongyi Lab in November 2025. It quickly gained popularity for its speed, image fidelity, and low hardware requirements.
The Z-Image Model Explained: 6B Parameters, Sub-Second Speed
Z-Image Turbo uses a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. It processes text, visual semantic, and image VAE tokens together as a single input sequence, which lets it achieve strong output quality at only 6 billion parameters. On an RTX 4090, it generates a 1024×1024 image in roughly 2.3 seconds with 8 inference steps.
The full training workflow took 314K H800 GPU hours. According to the developers, "qualitative and quantitative experiments demonstrate [...] performance comparable to or surpassing that of leading competitors across various dimensions".
The model excels at photorealistic portrait generation, cinematic lighting, natural skin textures, and bilingual text rendering. It is fully open-weight under the Apache 2.0 license, making it suitable for commercial use with few restrictions.
Z-Image Turbo vs. Other AI Image Generation Models
Z-Image Turbo competes directly with FLUX.1, SDXL, and Midjourney, standing out through parameter efficiency and inference speed.
| Model | Parameters | Min VRAM | Steps | License | Photorealism |
|---|---|---|---|---|---|
| Z-Image Turbo | 6B | ~6GB (GGUF) | 8 | Apache 2.0 | Excellent |
| FLUX.1 Dev | 12B | ~16GB | 20–50 | FLUX Non-Commercial | Excellent |
| SDXL | 3.5B | ~8GB | 20–40 | CreativeML Open RAIL++-M | Good |
| Midjourney v7 | Closed | Cloud only | N/A | Subscription | Excellent |

VRAM figures are approximate and vary by resolution and quantization format.
Z-Image Turbo delivers output quality comparable to FLUX.1 at roughly 80% less compute, taking parameters times steps from the table as a rough proxy (6B × 8 versus 12B × 20). For users who want local image generation without a subscription, it is one of the most accessible options available today.
What You Need Before Getting Started
Before placing model files in folders, confirm your environment is ready. The two main variables to consider are your GPU's VRAM capacity and the tool you will use to run the model.
ComfyUI is a flexible and powerful image and video creation tool. It comes packed with templates, including many for Z-Image workflows.
Starting at $0.35/hr, Thunder Compute lets you spin up an instance using a ComfyUI template. Start using Z-Image Turbo in minutes.
System Requirements and VRAM Considerations
Z-Image Turbo in standard BF16 precision requires 14 to 16GB of VRAM. It runs comfortably on top-tier consumer GPUs from the RTX 30 series onward, or on professional GPUs like the RTX A6000. For lower-end cards, the FP8 variant runs in 8GB, and the GGUF variant runs in as little as 5 to 6GB.
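The VRAM tiers above can be turned into a small lookup. This is a minimal sketch: the function name `pick_variant` and the threshold table are illustrative, and the cutoffs are simply the approximate figures quoted in this section, not measured limits.

```python
# Approximate minimum VRAM per variant, taken from the figures quoted above.
# Ordered from highest quality to lowest memory use.
VARIANT_MIN_VRAM_GB = [
    ("bf16", 14.0),
    ("fp8", 8.0),
    ("gguf", 5.0),
]


def pick_variant(vram_gb: float) -> str:
    """Return the first variant whose approximate minimum fits in vram_gb."""
    for name, min_gb in VARIANT_MIN_VRAM_GB:
        if vram_gb >= min_gb:
            return name
    raise ValueError(f"{vram_gb} GB is below the ~5 GB GGUF minimum")


if __name__ == "__main__":
    for card_gb in (24, 12, 8, 6):
        print(f"{card_gb} GB -> {pick_variant(card_gb)}")
```

Because the real footprint depends on resolution and quantization format, treat the result as a starting point rather than a guarantee.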
Beyond VRAM, you will also need:
- Python 3.10 or higher
- CUDA 12.x (recommended)
- At least 30GB of free disk space
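A quick way to confirm the non-VRAM requirements is a short standard-library script. This is a sketch under assumptions: `check_environment` is a hypothetical helper, and finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is installed. It does not verify the CUDA version.

```python
import shutil
import sys


def check_environment(install_path: str = ".", min_free_gb: float = 30.0) -> dict:
    """Check the Python version and free disk space at install_path."""
    # Decimal gigabytes (10**9 bytes), matching how download sizes are usually quoted.
    free_gb = shutil.disk_usage(install_path).free / 1e9
    return {
        "python_ok": sys.version_info >= (3, 10),
        "disk_ok": free_gb >= min_free_gb,
        "free_gb": round(free_gb, 1),
        # Presence of the driver tool only; the CUDA version is not checked here.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```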
Downloading Z-Image Turbo: Models, Text Encoders, and VAE
Z-Image Turbo requires three model files hosted on Hugging Face. Each goes into a specific subdirectory inside your ComfyUI installation.
| File | Type | Destination Folder | Size (approx.) |
|---|---|---|---|
| z_image_turbo_bf16.safetensors | Diffusion model | ComfyUI/models/diffusion_models/ | ~12GB |
| qwen_3_4b.safetensors | Text encoder | ComfyUI/models/text_encoders/ | ~7GB |
| ae.safetensors | VAE | ComfyUI/models/vae/ | ~335MB |

All files are hosted on the official Hugging Face repository.
If your GPU has 8GB or less of VRAM, download z_image_turbo_fp8.safetensors instead of the BF16 version. The text encoder and VAE files stay the same regardless of the diffusion model variant.
To avoid local VRAM limits, Thunder Compute offers cost-effective cloud GPUs starting at $0.35/hr.
How to Install Z-Image Turbo in ComfyUI
With your files downloaded, installation is a matter of placing everything in the right folders and loading the workflow. The process takes under five minutes once the downloads are complete.
Setting Up Your ComfyUI Directory Structure
After placing the three model files, your ComfyUI folder structure should look like this:
ComfyUI/
└── models/
    ├── diffusion_models/
    │   └── z_image_turbo_bf16.safetensors
    ├── text_encoders/
    │   └── qwen_3_4b.safetensors
    └── vae/
        └── ae.safetensors
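To confirm the layout above before launching ComfyUI, a short check like the following can help. It is a sketch: `missing_model_files` is a hypothetical helper, and the file names are the BF16 set listed in this guide, so swap in the FP8 file name if that is the variant you downloaded.

```python
from pathlib import Path

# Subfolder of ComfyUI/models/ -> expected file name (BF16 set from this guide).
EXPECTED_FILES = {
    "diffusion_models": "z_image_turbo_bf16.safetensors",
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "ae.safetensors",
}


def missing_model_files(comfy_root, expected=EXPECTED_FILES) -> list:
    """Return the expected model paths that do not exist under comfy_root."""
    models_dir = Path(comfy_root) / "models"
    return [
        models_dir / folder / name
        for folder, name in expected.items()
        if not (models_dir / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All three model files are in place.")
```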
Before continuing, make sure ComfyUI is updated to the latest version. Open ComfyUI Manager, click Update ComfyUI in the top toolbar, then restart. An outdated installation is the most common reason Z-Image nodes fail to appear or show errors on load.
Loading the Z-Image Turbo Workflow in ComfyUI
The official Z-Image Turbo workflow JSON is maintained by Comfy-Org and comes with all nodes pre-wired. Download it from the Comfy-Org GitHub repository and drag the JSON file onto the ComfyUI canvas.
If the canvas shows nodes highlighted in red, ComfyUI needs to install a missing custom node package. Install what is missing, then restart and reload the workflow. In most cases, the official workflow needs no extra custom nodes beyond a current ComfyUI installation.
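If nodes load in red or a loader points at a file you do not have, it can help to inspect the workflow JSON directly. The sketch below assumes the UI-format workflow layout, with a top-level `nodes` list whose entries carry `type` and `widgets_values` fields. `summarize_workflow` is a hypothetical helper, and nodes nested inside subgraphs are not traversed.

```python
import json
from pathlib import Path

MODEL_SUFFIXES = (".safetensors", ".gguf")


def summarize_workflow(path):
    """Return (sorted node types, sorted model file names) found in a workflow JSON."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    node_types = set()
    model_files = set()
    for node in data.get("nodes", []):
        node_types.add(node.get("type", "unknown"))
        widgets = node.get("widgets_values") or []
        # Some nodes store widget values as a dict rather than a list.
        values = widgets.values() if isinstance(widgets, dict) else widgets
        for value in values:
            if isinstance(value, str) and value.endswith(MODEL_SUFFIXES):
                model_files.add(value)
    return sorted(node_types), sorted(model_files)


if __name__ == "__main__":
    # "workflow.json" is a placeholder for the file you downloaded.
    types, files = summarize_workflow("workflow.json")
    print("Node types:", ", ".join(types))
    print("Model files referenced:", ", ".join(files))
```

Comparing the listed file names against your `models/` folders shows at a glance whether a loader expects a file you have not downloaded.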
Configuring Nodes: Sampler, Steps, and CFG Settings
Z-Image Turbo is distilled for 8-step inference. Start with 8 steps and a CFG (classifier-free guidance) scale between 1.5 and 2.0. Unlike non-distilled models, it produces worse results at high CFG values (4 and above).
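The recommended settings can be captured as a small helper that guards against the high-CFG mistake. This is a sketch: the keys mirror the widget names on ComfyUI's KSampler node, the `euler` sampler and `simple` scheduler are placeholder assumptions rather than values from this guide, and the node's link inputs (model, conditioning, latent) are omitted.

```python
def turbo_sampler_inputs(seed: int = 0, steps: int = 8,
                         cfg: float = 1.5, denoise: float = 1.0) -> dict:
    """Build KSampler-style widget values using the settings suggested above."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if cfg >= 4.0:
        raise ValueError("CFG of 4 or higher degrades results on this distilled model")
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in the range (0, 1]")
    return {
        "seed": seed,
        "steps": steps,
        "cfg": cfg,
        "sampler_name": "euler",  # assumption: pick whatever the official workflow uses
        "scheduler": "simple",    # assumption: same caveat as above
        "denoise": denoise,
    }


if __name__ == "__main__":
    print(turbo_sampler_inputs(seed=42))
```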
For resolution, choose 1024×1024 for the best quality, as going directly to 2K can introduce distortion. If you need a higher resolution, use an upscale node followed by a second KSampler pass at a denoise value of around 0.3 to stay close to the original image. Increase denoise toward 0.6 to 0.7 if you want more creative variation in the upscaled result.
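The two-pass approach comes down to choosing a target size and a second-pass strength. The sketch below makes two assumptions that are not from this guide: `second_pass_plan` is a hypothetical helper, and target dimensions are snapped to a multiple of 16. The 0.65 value is simply the midpoint of the 0.6 to 0.7 range mentioned above.

```python
def second_pass_plan(width: int, height: int, scale: float = 2.0,
                     creative: bool = False, multiple: int = 16) -> dict:
    """Return upscaled dimensions and a second-pass strength for the sampler."""

    def snap(value: int) -> int:
        # Scale, then round to the nearest multiple (assumed safe latent size).
        return max(multiple, int(round(value * scale / multiple)) * multiple)

    return {
        "width": snap(width),
        "height": snap(height),
        # ~0.3 keeps the result close to the source; ~0.65 allows more variation.
        "denoise": 0.65 if creative else 0.3,
    }


if __name__ == "__main__":
    print(second_pass_plan(1024, 1024))
    print(second_pass_plan(1024, 768, scale=1.5, creative=True))
```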
How to Use Z-Image Turbo in ComfyUI
Once the workflow is loaded and nodes are configured, you are ready to generate. The workflow is intentionally simple, with a single primary node handling most of the generation logic.
Running Your First Text-to-Image Generation
Enter your prompt in the text conditioning node and click Run. Z-Image Turbo does not need the heavy prompt engineering that older models like SDXL require.
Natural-language descriptions outperform keyword lists; skip terms like "masterpiece, best quality, 8k" since the model already understands stylistic intent from context.
For portraits, describe the lighting style (for example, "soft window light"), skin texture, and background to give the model strong anchors. Make sure all three model files are selected in their loader nodes. Otherwise, the generation will fail or produce a gray, noisy output.
Tips for Getting the Best Results
When writing prompts, quality beats length. Write focused descriptions that state the lighting and environment explicitly, and avoid contradictory instructions. For portraits, using photography terms (for example, "85mm portrait lens, shallow depth of field") produces more grounded, photorealistic results.
To push image quality, stack up to three LoRA weights through ComfyUI's LoRA loader nodes. Use LoRAs trained specifically on Z-Image, as weights trained for SDXL or FLUX will not transfer effectively.
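One way to keep prompts focused is to assemble them from the few elements this guide recommends stating explicitly. This is a sketch with a hypothetical helper, `build_prompt`. The sentence template is just one reasonable phrasing, not a format the model requires.

```python
def build_prompt(subject: str, environment: str, lighting: str, camera: str = "") -> str:
    """Compose a short natural-language prompt from explicit scene elements."""
    fields = {"subject": subject, "environment": environment, "lighting": lighting}
    cleaned = {name: text.strip() for name, text in fields.items()}
    missing = [name for name, text in cleaned.items() if not text]
    if missing:
        raise ValueError(f"missing prompt elements: {', '.join(missing)}")

    subject_text = cleaned["subject"]
    prompt = (
        f"{subject_text[0].upper()}{subject_text[1:]} in {cleaned['environment']}, "
        f"lit by {cleaned['lighting']}."
    )
    # Photography terms are optional and appended as a second sentence.
    if camera.strip():
        prompt += f" Shot with {camera.strip()}."
    return prompt


if __name__ == "__main__":
    print(build_prompt(
        subject="an elderly fisherman",
        environment="a weathered boathouse",
        lighting="soft window light",
        camera="an 85mm portrait lens, shallow depth of field",
    ))
```

Requiring the subject, environment, and lighting up front makes it harder to fall back on keyword lists, and a single template makes contradictory instructions easier to spot.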
Last Thoughts on Z-Image Turbo
Z-Image Turbo combines speed, accessibility, and high-fidelity output. By running a 6-billion-parameter architecture in just 8 inference steps, it offers FLUX-level photorealism without requiring enterprise-grade compute.
Whether you run it locally on a consumer card via GGUF or spin it up on a cloud template from a provider like Thunder Compute, setting it up in ComfyUI is straightforward.
