
How to Fine-tune Llama 4

Fine‑tune Llama 4 on a single A100 GPU, with exact commands, runtimes, and cost math.

Published: Apr 19, 2025 | Last updated: Aug 13, 2025

Why this guide

Meta’s Llama 4 Scout uses a Mixture-of-Experts design (17B active parameters, 16 experts, 109B total) and supports a very long context window (up to 10M tokens). With QLoRA and Unsloth, you can fine-tune it on a single A100 80 GB. This walkthrough gives exact commands, runtimes, and cost math, with no infra expertise required.

Prerequisites

  • Thunder Compute account: fast access to an A100 80 GB at $0.78/hr

  • VS Code + Thunder Compute extension: one-click instance launch and an integrated terminal

  • Python 3.10 + Conda: a clean, reproducible environment

  • Hugging Face account + Llama 4 access: pulls the model and dataset from the Hub

Tip: Follow the Thunder Compute Quick Start to install the VS Code extension. Most prerequisites come pre-installed in Thunder Compute instances.

1) Launch an A100 80 GB instance

  • Console: New Instance → A100 80 GB

  • VS Code: Thunder tab → A100 80 GB

  • Disk: 300 GB (room for model, checkpoints, dataset)

2) Connect from VS Code

Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote-SSH add-on needed.
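
Once connected, it is worth confirming that the terminal really is running on the A100 before installing anything. A quick check (PyTorch is typically pre-installed on Thunder Compute instances; nvidia-smi gives the same information):

import torch

# Expect an A100 with roughly 80 GB of total memory
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB total VRAM")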

3) Request model access

Request Llama 4 access via llama.com or the official Meta Hugging Face org. Approvals are usually quick.
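
Because the Scout weights are a large download, it is worth confirming that your token actually has access before starting. A small check using huggingface_hub (assumes you have already run huggingface-cli login, as in the next step):

from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
try:
    HfApi().model_info(MODEL_ID)  # raises GatedRepoError if access is not yet approved
    print("Access granted")
except GatedRepoError:
    print("Still gated - request or wait for approval on Hugging Face")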

4) Minimal QLoRA project (Unsloth)

Why Unsloth? It’s currently the most stable stack for Llama 4 QLoRA—~71 GB VRAM for Scout at micro-batch size 1, 2k context, fitting on an A100 80 GB.
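
The ~71 GB figure is easy to sanity-check with back-of-envelope arithmetic. The overhead numbers below are assumptions for illustration, not measurements:

# Rough QLoRA memory estimate for Llama 4 Scout (109B total parameters)
total_params = 109e9
weights_4bit = total_params * 0.5 / 1e9     # 4-bit weights = 0.5 bytes/param -> ~54.5 GB
quant_overhead = weights_4bit * 0.10        # assumed ~10% for quantization scales/zero-points
lora_plus_optimizer = 2.0                   # assumed: LoRA adapters + Adam states, a few GB
activations_and_buffers = 8.0               # assumed: batch 1, 2k context, gradient checkpointing
estimate_gb = weights_4bit + quant_overhead + lora_plus_optimizer + activations_and_buffers
print(f"~{estimate_gb:.0f} GB")             # ~70 GB, in line with the observed ~71 GB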

Shell:

conda create -y -n llama4-qlora python=3.10
conda activate llama4-qlora

pip install -U unsloth trl datasets accelerate bitsandbytes transformers peft
pip install -U huggingface_hub
huggingface-cli login

train_llama4_qlora.py:

# Import Unsloth first so its patches are applied before transformers/trl load
from unsloth import FastLanguageModel, is_bfloat16_supported

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
ds = load_dataset("mlabonne/guanaco-llama2-1k")

max_seq_len = 2048
use_bf16 = is_bfloat16_supported()

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=max_seq_len,
    dtype=None,          # auto-detect (bf16 on A100)
    load_in_4bit=True,   # 4-bit base weights (QLoRA)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)

args = TrainingArguments(
    output_dir="llama4-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=use_bf16,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_len,
    packing=True,
    args=args,
)

trainer.train()

model.save_pretrained("llama4-qlora-adapter")
tokenizer.save_pretrained("llama4-qlora-adapter")

Run:

python train_llama4_qlora.py
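
When the run finishes, a quick sanity check is to reload the saved adapter with Unsloth and generate a few tokens. A minimal sketch, assuming the "llama4-qlora-adapter" directory saved by the script above:

from unsloth import FastLanguageModel

# Loads the 4-bit base model and applies the saved LoRA adapter on top
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llama4-qlora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's faster inference mode

# Prompt format mirrors the guanaco-llama2 dataset used for training
prompt = "[INST] Explain QLoRA in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))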

5) VRAM & runtime

  • Llama 4 Scout (QLoRA, 4-bit, Unsloth): ~70–75 GB VRAM on A100 80 GB

  • Llama 3 8B (QLoRA, 4-bit): < 20 GB VRAM

Cost example: 2 hours × $0.78/hr = $1.56
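
To confirm these numbers on your own run, wrap the existing trainer.train() call with a timer and read peak allocation from PyTorch. The $0.78/hr figure is the Thunder Compute rate quoted above:

import time
import torch

start = time.time()
trainer.train()  # the existing call in train_llama4_qlora.py
elapsed_hours = (time.time() - start) / 3600

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM: {peak_gb:.1f} GB")
print(f"Approx. GPU cost at $0.78/hr: ${elapsed_hours * 0.78:.2f}")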

6) Track spend & shut down

Use the Thunder console to monitor costs. Stopping the instance halts GPU billing; disk persists at storage rates.

7) Next steps

  • Swap in your own dataset (see the sketch after this list)

  • Increase num_train_epochs until validation loss plateaus

  • If you’re fine-tuning a smaller model and VRAM allows, set load_in_4bit=False to train a standard 16-bit LoRA instead of QLoRA
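
To swap in your own data, the only requirement of this script is a dataset with a "text" column holding fully formatted prompt/response strings. A minimal sketch for a local JSONL file (the file name and field names here are hypothetical placeholders):

from datasets import load_dataset

# Each line of train.jsonl is assumed to look like:
# {"prompt": "What is QLoRA?", "response": "QLoRA is ..."}
ds = load_dataset("json", data_files={"train": "train.jsonl"})

def to_text(example):
    # Mirror the [INST] formatting of the guanaco-llama2 dataset used above
    return {"text": f"[INST] {example['prompt']} [/INST] {example['response']}"}

ds = ds.map(to_text)
# Pass ds["train"] to SFTTrainer exactly as in train_llama4_qlora.py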

Carl Peterson
