
How to Fine-tune Llama 4

August 13, 2025
7 mins read

Why this guide

Meta’s Llama 4 Scout uses a Mixture-of-Experts design (17B active params, 16 experts, 109B total) and supports long context (up to 10M tokens). With QLoRA and Unsloth, you can fine-tune it on a single A100 80 GB. This walkthrough gives the commands, runtimes, and cost math, with no infra expertise required.

Prerequisites

  • Thunder Compute account: fast access to an A100 80 GB at $0.78/hr
  • VS Code + Thunder Compute extension: one-click instance launch plus an integrated terminal
  • Python 3.10 + Conda: a clean, reproducible environment
  • Hugging Face account with Llama 4 access: model and dataset hub
Tip: Follow the Thunder Compute Quick Start to install the VS Code extension. Most prerequisites come pre-installed in Thunder Compute instances.

1. Launch an A100 80 GB instance

  • Console: New Instance → A100 80 GB
  • VS Code: Thunder tab → A100 80 GB
  • Disk: 300 GB (room for model, checkpoints, dataset)

2. Connect from VS Code

Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote-SSH add-on needed.

3. Request model access

Request Llama 4 access via llama.com or the official Meta Hugging Face org. Approvals are usually quick.
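Once approved, you can confirm access from Python before downloading any weights. This is a minimal check using huggingface_hub (run it after logging in, as shown in the next step); if your request hasn’t been approved yet, the call fails with a gated-repo error:

from huggingface_hub import model_info

# Raises an error if the gated repo hasn't been approved for your account yet.
info = model_info("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(info.id)  # prints the repo ID on success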

4. Minimal QLoRA project (Unsloth)

Why Unsloth? It’s currently the most stable stack for Llama 4 QLoRA—~71 GB VRAM for Scout at micro-batch size 1, 2k context, fitting on an A100 80 GB.
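For a rough sense of where that ~71 GB comes from, here is a back-of-envelope estimate (the split is approximate, not a measurement):

# Back-of-envelope VRAM estimate for Scout under QLoRA (approximate).
total_params = 109e9                    # 109B total parameters (all 16 experts stay in memory)
weights_gb = total_params * 0.5 / 1e9   # 4-bit quantization is roughly 0.5 bytes per parameter
print(f"4-bit base weights: ~{weights_gb:.0f} GB")  # ~55 GB
# The rest (~15-20 GB) goes to LoRA adapters and their optimizer states,
# activations at 2k context (with gradient checkpointing), dequantization
# buffers, and CUDA allocator overhead.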

Shell:

conda create -y -n llama4-qlora python=3.10
conda activate llama4-qlora

pip install -U unsloth trl datasets accelerate bitsandbytes transformers peft
pip install -U huggingface_hub
huggingface-cli login
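Optionally, verify the environment sees the GPU before writing the training script (a quick interactive check; the device name shown is just an example):

import torch

# Sanity check: the A100 should be visible and bfloat16 supported.
print(torch.cuda.is_available())        # True
print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA A100 80GB PCIe"
print(torch.cuda.is_bf16_supported())   # True on A100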

train_llama4_qlora.py:

# Import unsloth before transformers/trl so its optimizations patch them correctly.
from unsloth import FastLanguageModel, is_bfloat16_supported

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
ds = load_dataset("mlabonne/guanaco-llama2-1k")

max_seq_len = 2048
use_bf16 = is_bfloat16_supported()

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=max_seq_len,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

args = TrainingArguments(
    output_dir="llama4-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=use_bf16,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_len,
    packing=True,
    args=args,
)

trainer.train()

model.save_pretrained("llama4-qlora-adapter")
tokenizer.save_pretrained("llama4-qlora-adapter")

Run:

python train_llama4_qlora.py
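Once training finishes, you can smoke-test the saved adapter with a quick generation. This is a minimal sketch: Unsloth can load the adapter directory saved above, and for_inference switches the model into generation mode (the prompt is only illustrative):

from unsloth import FastLanguageModel

# Load the 4-bit base weights plus the LoRA adapter saved by the training script.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llama4-qlora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable inference mode

prompt = "Explain QLoRA in two sentences."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))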

5. VRAM & runtime

  • Llama 4 Scout (QLoRA, 4-bit, Unsloth): ~70–75 GB VRAM on A100 80 GB
  • Llama 3-8B (QLoRA, 4-bit): < 20 GB VRAM

Cost example: 2 hours × $0.78/hr = $1.56
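The same math scales linearly with run length; a tiny sketch using the $0.78/hr rate quoted above (your actual rate may differ):

# GPU cost scales linearly with runtime at a fixed hourly rate.
rate_per_hour = 0.78
for hours in (2, 6, 24):
    print(f"{hours:>2} h x ${rate_per_hour:.2f}/hr = ${hours * rate_per_hour:.2f}")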

6. Track spend & shut down

Use the Thunder console to monitor costs. Stopping the instance halts GPU billing; disk persists at storage rates.

7. Next steps

  • Swap in your own dataset (see the formatting sketch below)
  • Increase num_train_epochs until validation loss plateaus
  • If you have more VRAM available (a larger GPU or several of them), set load_in_4bit=False to train in 16-bit precision instead of 4-bit
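The script above expects a dataset with a single "text" column. Here is a minimal sketch of mapping your own instruction/response pairs into that shape (the file name and column names are hypothetical placeholders):

from datasets import load_dataset

# Hypothetical input: a JSONL file with "instruction" and "response" columns.
ds = load_dataset("json", data_files="my_data.jsonl")

def to_text(example):
    # Collapse each pair into the single "text" field that SFTTrainer reads.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['response']}"
    }

ds = ds.map(to_text)
# Then pass ds["train"] to SFTTrainer exactly as in train_llama4_qlora.py.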
