Llama 4 is the most capable open-weight model family shipped by Meta. Released on April 5, 2025, it introduced native multimodal understanding, a Mixture-of-Experts (MoE) architecture, and a 10 million token context window.
This guide covers the full model lineup, benchmarks, and how to run Llama 4.

How does Llama 4 improve over Llama 3?
Llama 4 is Meta's fourth-generation family of open-weight LLMs. Unlike the text-only Llama 3, Llama 4 is natively multimodal, processing text, images, and video. The family was trained on over 30 trillion tokens spanning 200 languages (more than 100 of them with over 1 billion tokens each), more than doubling Llama 3's pre-training mix.
| Feature | Llama 3 | Llama 4 |
|---|---|---|
| Architecture | Dense Transformer | Mixture-of-Experts (MoE) |
| Modality | Text-only | Natively Multimodal (Text, Image, Video) |
| Context Window | 128K tokens | 1M-10M tokens |
| Training Tokens | ~15T (estimated) | >30T |
| Languages | Multilingual support | >100 languages |
| Knowledge Cutoff | December 2023 | March 2025 |
| Refusal Rate | Standard | <2% |
| Training Hardware | ~16,000 H100s | 32,000 H100s |
The architectural shift matters as much as the scale. Llama 3 was a dense text-only transformer with a 128K context window. Llama 4 replaces that with Mixture-of-Experts (MoE) layers, iRoPE positional embeddings, and FP8 precision training across 32,000 H100 GPUs, double the cluster used for Llama 3.
Llama 4 uses a Mixture-of-Experts design where only a subset of parameters activates per token during inference. As a result, a model can store far more total knowledge than a dense model with the same per-token compute budget.
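To make the "subset of parameters per token" idea concrete, here is a minimal sketch of a generic top-k gate. It is an illustration only, not Meta's router: the scores are random, and `route_token` and `top_k` are hypothetical names chosen for this example.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=1):
    """Return (expert_index, weight) pairs for the experts chosen for one token.

    Only the top_k highest-scoring experts are selected. The remaining
    experts are skipped entirely, which keeps the active parameter count low.
    """
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical router scores for one token over 16 experts.
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(16)]
selected = route_token(scores, top_k=1)
print(f"experts evaluated: {len(selected)} of {len(scores)}")
```

Every expert's weights still have to be stored somewhere, which is why total parameter count drives memory needs while active parameter count drives per-token compute.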
Llama 4 also received 10x more multilingual training tokens than Llama 3, covering over 100 languages. The knowledge cutoff advanced to March 2025, and refusal rates on debated political and social topics fell to under 2%.
Llama 4 Release Date and Context Window Size
Llama 4 Scout and Llama 4 Maverick launched publicly on April 5, 2025. They are available via the official Llama website and Hugging Face under the Llama 4 Community License, which allows free use in products with fewer than 700 million monthly active users.
The Llama 4 Model Lineup
Llama 4 consists of three models: Scout, Maverick, and Behemoth. Scout and Maverick are publicly available. Behemoth is unreleased but served as the teacher model for the other two through a process called codistillation.
| Feature | Llama 4 Scout | Llama 4 Maverick | Llama 4 Behemoth |
|---|---|---|---|
| Total Parameters | 109 Billion | 400 Billion | ~2 Trillion |
| Active Parameters | 17 Billion | 17 Billion | 288 Billion |
| Expert Count | 16 Experts | 128 Experts | 16 Experts |
| Max Context Window | 10 Million tokens | 1 Million tokens | Not Publicly Specified |
| Primary Use Case | Long-context retrieval & document analysis | General reasoning, coding & assistant tasks | Teacher/Distillation model & Advanced STEM |
| Deployment Status | Generally Available | Generally Available | Research Preview (Not Publicly Released) |
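As a quick illustration of how sparse each model is, the snippet below computes the share of parameters that is active per token. It uses only the figures from the table above; the Behemoth total is approximate, so its ratio is too.

```python
# Parameter counts in billions, taken from the lineup table above.
MODELS = {
    "Scout": {"total": 109, "active": 17},
    "Maverick": {"total": 400, "active": 17},
    "Behemoth": {"total": 2000, "active": 288},  # ~2T total is approximate
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


for name, params in MODELS.items():
    share = active_fraction(params["total"], params["active"])
    print(f"{name}: {share:.1%} of parameters active per token")
```

Maverick comes out as the sparsest of the three: it matches Scout's active parameter count while spreading its total weights across far more experts.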
Llama 4 Scout: Parameters and Hardware Requirements
Llama 4 Scout is the most accessible member of the family. It has 17 billion active parameters across 16 experts, with 109 billion total parameters.
The 10 million token context window suits repository-level code analysis, multi-document summarization, and other long-context tasks. However, the usable context size depends on available VRAM; 32K to 128K is a realistic working window on consumer hardware.
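A rough way to see why VRAM limits the usable context is to estimate the key-value (KV) cache, which grows linearly with the number of tokens. The sketch below is an upper bound that assumes every layer attends over the full context and stores FP16 values. The layer count, KV head count, and head dimension are illustrative assumptions, not confirmed Scout settings.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Upper-bound KV cache size for one sequence, in bytes.

    Each token stores one key and one value vector (the factor of 2)
    per layer and per KV head. Assumes full attention in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Illustrative architecture values (assumptions, not confirmed Scout settings).
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context in (32_768, 131_072, 10_000_000):
    gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{context:>10,} tokens -> {gib:,.1f} GiB of KV cache (FP16)")
```

Under these assumptions the cache alone grows from single-digit gibibytes at 32K tokens to well beyond any single GPU at 10 million tokens. Architectures that limit some layers to local attention, or runtimes that quantize the cache, would come in below this bound.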
Scout is the most practical model for self-hosting. At Q4_K_M quantization, it needs roughly 20–24 GB of VRAM, within reach of a single RTX 4090 or an Apple Silicon Mac with 32 GB.
Llama 4 Maverick: Parameter Size, Benchmarks, and Release Date
Llama 4 Maverick launched on April 5, 2025 alongside Scout. It shares Scout's 17 billion active parameters per token but routes through 128 experts instead of 16, for 400 billion total parameters. That larger knowledge base explains why Maverick outperforms Scout on reasoning, coding, and multimodal tasks at the same inference cost.
Maverick reached an Elo score of 1417 on LMArena, and it beat GPT-4o and Gemini 2.0 Flash on Meta's reported multimodal and reasoning benchmarks. It also matches DeepSeek V3 on coding tasks while activating fewer parameters.
Maverick fits on a single H100 host (a multi-GPU server rather than a single card), so it needs no multi-node coordination, making it a viable production choice for cloud GPU users.
Llama 4 Behemoth: What We Know So Far
Llama 4 Behemoth was previewed alongside Scout and Maverick in April 2025 but was still in training at the time of the public release.
It was designed with approximately 2 trillion total parameters and 288 billion active parameters across 16 experts. This scale puts it in direct competition with closed frontier models on STEM benchmarks.
As of June 2026, public weights are unlikely to become available. Behemoth served its primary purpose as a teacher model for Scout and Maverick, and Meta's subsequent launch of its closed-weight Muse Spark model in April 2026 has reduced the urgency around a public Behemoth release.
Llama 4 Scout vs Maverick: Which Should You Run?
Choosing between Scout and Maverick is a matter of hardware and use case:
- Scout is right if:
  - You need a very long context window; its 10M token limit is unique among locally runnable models.
  - Your GPU has under 80GB of VRAM.
- Maverick is right if:
  - You need the best output quality on reasoning, coding, and complex multimodal tasks.
  - You have access to server-grade hardware, such as a multi-GPU setup, to run it locally at full precision.
For most individual developers, Scout on a single GPU or a Thunder Compute A100 instance is the practical starting point.
Teams building production-grade assistants or inference APIs will find Maverick worth the extra compute, especially given its benchmark parity with GPT-4o at a fraction of the API cost.
Llama 4 Benchmarks: How Does It Compare?
Benchmark comparisons for Llama 4 have to be read carefully. Meta's internal benchmarks used model variants that differed from the ones publicly released, and the AI landscape has shifted considerably since April 2025.
The comparisons below reflect the public models against the competitors they were benchmarked against at launch.
| Model | Active Params | Context Window | Multimodal | Open Weight | LMArena Elo |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B (109B total) | 10M tokens | Yes | Yes | N/A |
| Llama 4 Maverick | 17B (400B total) | 1M tokens | Yes | Yes | 1417 |
| GPT-4o | ~200B (est.) | 128K tokens | Yes | No | ~1380 |
| Gemini 2.0 Flash | Unknown | 1M tokens | Yes | No | ~1350 |
| Llama 3.1 405B | 405B (dense) | 128K tokens | No | Yes | ~1260 |
Llama 4 vs ChatGPT
When comparing Llama 4 against ChatGPT (which runs on GPT-4o and newer variants), the most honest framing is that Maverick benchmarks comparably to GPT-4o across multimodal, reasoning, and coding tasks, while costing significantly less to run via API. Maverick's DocVQA score of 91.6 and its MATH performance suggest near-parity on the tasks GPT-4o has historically led. The key difference is deployment freedom: Llama 4 weights are downloadable and self-hostable, while ChatGPT is a closed API with no ability to fine-tune the base weights or run inference on your own infrastructure.
Llama 4 vs Gemini
The comparison against Gemini is similarly nuanced. Maverick outperforms Gemini 2.0 Flash on Meta's benchmarks in multimodal and reasoning tasks, which is a meaningful result given that Gemini 2.0 Flash was one of Google's most competitive efficiency-oriented models at the time. Gemini 2.5 Pro narrows or reverses that gap on several tasks, but it comes with a substantially higher per-token cost and zero ability to self-host. For teams that want to run inference privately or tune the model on proprietary data, Llama 4 offers something neither Gemini model can match.
Llama 4 vs Llama 3
The improvement from Llama 3 to Llama 4 is large enough that it represents a different class of model rather than a straightforward upgrade. Llama 3's best open-weight option, the 405B dense model, had a 128K context window, no native multimodal capability, and a knowledge cutoff of December 2023. Llama 4 Scout beats it on multimodal benchmarks despite having far fewer total parameters, and does so while fitting on a single GPU with quantization. On LiveCodeBench, Maverick scores 43.4 versus Llama 3.1 405B's 27.7 (a 57% relative improvement on real-world coding tasks).
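The relative improvement quoted for LiveCodeBench can be checked with a short calculation. The helper name `relative_improvement` is just for illustration; the two scores are the ones cited above.

```python
def relative_improvement(new, old):
    """Fractional gain of `new` over `old`."""
    return (new - old) / old


# LiveCodeBench scores cited above: Maverick vs Llama 3.1 405B.
gain = relative_improvement(43.4, 27.7)
print(f"{gain:.0%}")  # prints 57%
```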
Running Llama 4 locally is feasible for Scout on high-end consumer hardware, but Maverick requires server-grade GPUs that most developers don't have at home. Thunder Compute solves this through on-demand access to A100 and H100 instances at rates up to 80% lower than AWS, Azure, and GCP.
How Much VRAM Do You Need for Llama 4?
VRAM requirements for Llama 4 depend on which model you're running and what quantization level you choose. The table below covers the most practical configurations.
| Model | Quantization | VRAM Required | Recommended Hardware |
|---|---|---|---|
| Llama 4 Scout | Q4_K_M (4-bit) | ~20–24 GB | RTX 4090 / A100 40GB |
| Llama 4 Scout | Q8_0 (8-bit) | ~55 GB | A100 80GB |
| Llama 4 Maverick | Q4_K_M (4-bit) | ~200 GB | Multi-GPU / H100 host |
| Llama 4 Maverick | 1.78-bit (Unsloth quant) | ~100 GB | H100 80GB x2 |
| Llama 4 Scout (CPU only) | Q4_K_M | 32 GB system RAM | CPU offload (very slow) |
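As a rough cross-check on figures like those in the table above, the weights-only footprint can be estimated from the total parameter count and the bits stored per weight. The sketch below assumes every expert is resident in memory and uses nominal bit widths; real quantization formats carry some extra overhead per weight, and the KV cache and runtime buffers are not included.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


# Total parameter counts from the lineup table.
MODELS = {"Scout": 109e9, "Maverick": 400e9}
# Nominal bit widths (assumed; real formats store slightly more per weight).
NOMINAL_BITS = (4, 8, 16)

for name, params in MODELS.items():
    for bits in NOMINAL_BITS:
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.0f} GB for weights alone")
```

Treat the result as a lower bound for keeping the whole model on the GPU. Setups that offload expert layers to system RAM, or that use a lower effective bit width, can need less GPU memory than this estimate, usually at the cost of speed or quality.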
How to Install and Run Llama 4 on a Thunder Compute GPU
Thunder Compute provides pre-configured Ollama instance templates that handle GPU driver setup, Ollama installation, and model configuration.
Rather than spending an hour on environment setup, you can have a Llama 4 model answering prompts in a few minutes.
To go deeper into local LLM tooling options, see our guides on running LM Studio and Unsloth.
Step 1: Install the Thunder Compute CLI
Download and install tnr for Windows or macOS.
Run this command for Linux:
curl -fsSL https://raw.githubusercontent.com/Thunder-Compute/thunder-cli/main/scripts/install.sh | bash
Step 2: Login
tnr login
Step 3: Launch an Ollama instance
tnr create --template ollama
Pick the hardware configuration for your instance.
Step 4: Connect to your instance
Establish a connection once the instance is created.
tnr connect 0
Start the Ollama UI:
start-ollama
This takes around a minute. Once it has finished loading, click the link printed in the terminal to open Ollama in a browser.
You'll be prompted to create an account.
Step 5: Load the desired model
- In the Ollama UI, click "Select a model".
- Paste the URL of the model from its Ollama library page.
- Click "Pull [MODEL_URL]" in the dropdown.

Your download will start.
A few good variants, ordered from lightest to heaviest:
- https://ollama.com/library/llama4:17b-scout-16e-instruct-q4_K_M
- https://ollama.com/library/llama4:17b-scout-16e-instruct-q8_0
- https://ollama.com/library/llama4:17b-scout-16e-instruct-fp16
- https://ollama.com/library/llama4:17b-maverick-128e-instruct-q4_K_M
- https://ollama.com/library/llama4:17b-maverick-128e-instruct-q8_0
- https://ollama.com/library/llama4:17b-maverick-128e-instruct-fp16
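If you prefer to script the download, the model tag is the part of each URL after `/library/`. The helper below is a small sketch for extracting it; `ollama_tag_from_url` is a hypothetical name, and whether the `ollama` command is on the instance's PATH is an assumption.

```python
from urllib.parse import urlparse


def ollama_tag_from_url(url):
    """Turn an Ollama library URL into the model tag that follows '/library/'."""
    path = urlparse(url).path
    prefix = "/library/"
    if not path.startswith(prefix):
        raise ValueError(f"not an Ollama library URL: {url}")
    return path[len(prefix):]


url = "https://ollama.com/library/llama4:17b-scout-16e-instruct-q4_K_M"
print(ollama_tag_from_url(url))  # prints llama4:17b-scout-16e-instruct-q4_K_M
```

The resulting tag is what the Ollama CLI expects, for example `ollama pull llama4:17b-scout-16e-instruct-q4_K_M`.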
Step 6: Start chatting
Once the model is downloaded, you can start interacting with it. Keep in mind that the first response will take significantly longer because the model is still being loaded into GPU memory.
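Beyond the browser UI, a downloaded model can also be queried programmatically through Ollama's HTTP API. The sketch below assumes the Ollama server is reachable at its default address, `http://localhost:11434`, from wherever the script runs; that may not match how the template exposes it, so adjust `OLLAMA_URL` as needed. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed default Ollama address; change it if your instance differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=600):
    """Send the prompt and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("llama4:17b-scout-16e-instruct-q4_K_M", "Summarize MoE in one sentence.")` would return the model's reply as a string. The generous timeout allows for the slow first response while the model loads.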
Thunder Compute: The Fastest Way to Get Started with Llama 4
Thunder Compute's A100 80GB instances give you enough VRAM to run Llama 4 Scout at Q8_0 quality or at Q4_K_M with generous context headroom. Multi-GPU H100 instances are available for Maverick workloads.
The platform is built specifically for ML engineers who want to iterate quickly without infrastructure overhead. Instances spin up in seconds, VS Code integration turns your cloud GPU into a local development environment, and you only pay for what you use.
Get started with Thunder Compute and have Llama 4 running in minutes.
