
How to Fine-tune Llama 4

Fine‑tune Llama 4 on a single A100 GPU, with exact commands, runtimes, and cost math.

Published: Apr 19, 2025 | Last updated: Aug 13, 2025

Why this guide

Meta’s Llama 4 Scout uses a Mixture-of-Experts design (17B active parameters, 16 experts, 109B total) and supports a very long context window (up to 10M tokens). With QLoRA and Unsloth, you can fine-tune it on a single A100 80 GB. This walkthrough gives exact commands, runtimes, and cost math, with no infra expertise required.

Prerequisites

  • Thunder Compute account: fast access to an A100 80 GB at $0.78/hr

  • VS Code + Thunder Compute extension: one-click instance launch and an integrated terminal

  • Python 3.10 + Conda: a clean, reproducible environment

  • Hugging Face account + Llama 4 access: pulls the model and dataset from the Hub

Tip: Follow the Thunder Compute Quick Start to install the VS Code extension. Most prerequisites come pre-installed in Thunder Compute instances.

1) Launch an A100 80 GB instance

  • Console: New Instance → A100 80 GB

  • VS Code: Thunder tab → A100 80 GB

  • Disk: 300 GB (room for model, checkpoints, dataset)

2) Connect from VS Code

Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote-SSH add-on needed.
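
Once connected, it is worth confirming that the terminal really is running on the A100 before installing anything. A quick check (PyTorch is typically pre-installed on Thunder Compute instances; nvidia-smi gives the same information):

import torch

# Expect an A100 with roughly 80 GB of total memory
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB total VRAM")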

3) Request model access

Request Llama 4 access via llama.com or the official Meta Hugging Face org. Approvals are usually quick.
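
Because the Scout weights are a large download, it is worth confirming that your token actually has access before starting. A small check using huggingface_hub (assumes you have already run huggingface-cli login, as in the next step):

from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
try:
    HfApi().model_info(MODEL_ID)  # raises GatedRepoError if access is not yet approved
    print("Access granted")
except GatedRepoError:
    print("Still gated - request or wait for approval on Hugging Face")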

4) Minimal QLoRA project (Unsloth)

Why Unsloth? It’s currently the most stable stack for Llama 4 QLoRA—~71 GB VRAM for Scout at micro-batch size 1, 2k context, fitting on an A100 80 GB.
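
The ~71 GB figure is easy to sanity-check with back-of-envelope arithmetic. The overhead numbers below are assumptions for illustration, not measurements:

# Rough QLoRA memory estimate for Llama 4 Scout (109B total parameters)
total_params = 109e9
weights_4bit = total_params * 0.5 / 1e9     # 4-bit weights = 0.5 bytes/param -> ~54.5 GB
quant_overhead = weights_4bit * 0.10        # assumed ~10% for quantization scales/zero-points
lora_plus_optimizer = 2.0                   # assumed: LoRA adapters + Adam states, a few GB
activations_and_buffers = 8.0               # assumed: batch 1, 2k context, gradient checkpointing
estimate_gb = weights_4bit + quant_overhead + lora_plus_optimizer + activations_and_buffers
print(f"~{estimate_gb:.0f} GB")             # ~70 GB, in line with the observed ~71 GB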

Shell:

conda create -y -n llama4-qlora python=3.10
conda activate llama4-qlora

pip install -U unsloth trl datasets accelerate bitsandbytes transformers peft
pip install -U huggingface_hub
huggingface-cli login

train_llama4_qlora.py:

# Import Unsloth first so its patches are applied before transformers/trl load
from unsloth import FastLanguageModel, is_bfloat16_supported

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
ds = load_dataset("mlabonne/guanaco-llama2-1k")

max_seq_len = 2048
use_bf16 = is_bfloat16_supported()

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=max_seq_len,
    dtype=None,          # auto-detect (bf16 on A100)
    load_in_4bit=True,   # 4-bit base weights (QLoRA)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)

args = TrainingArguments(
    output_dir="llama4-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=use_bf16,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_len,
    packing=True,
    args=args,
)

trainer.train()

model.save_pretrained("llama4-qlora-adapter")
tokenizer.save_pretrained("llama4-qlora-adapter")

Run:

python train_llama4_qlora.py
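
When the run finishes, a quick sanity check is to reload the saved adapter with Unsloth and generate a few tokens. A minimal sketch, assuming the "llama4-qlora-adapter" directory saved by the script above:

from unsloth import FastLanguageModel

# Loads the 4-bit base model and applies the saved LoRA adapter on top
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llama4-qlora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's faster inference mode

# Prompt format mirrors the guanaco-llama2 dataset used for training
prompt = "[INST] Explain QLoRA in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))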

5) VRAM & runtime

  • Llama 4 Scout (QLoRA, 4-bit, Unsloth): ~70–75 GB VRAM on A100 80 GB

  • Llama 3 8B (QLoRA, 4-bit): < 20 GB VRAM

Cost example: 2 hours × $0.78/hr = $1.56
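
To confirm these numbers on your own run, wrap the existing trainer.train() call with a timer and read peak allocation from PyTorch. The $0.78/hr figure is the Thunder Compute rate quoted above:

import time
import torch

start = time.time()
trainer.train()  # the existing call in train_llama4_qlora.py
elapsed_hours = (time.time() - start) / 3600

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM: {peak_gb:.1f} GB")
print(f"Approx. GPU cost at $0.78/hr: ${elapsed_hours * 0.78:.2f}")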

6) Track spend & shut down

Use the Thunder console to monitor costs. Stopping the instance halts GPU billing; disk persists at storage rates.

7) Next steps

  • Swap in your own dataset (see the sketch after this list)

  • Increase num_train_epochs until validation loss plateaus

  • If you’re fine-tuning a smaller model and VRAM allows, set load_in_4bit=False to train a standard 16-bit LoRA instead of QLoRA
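
To swap in your own data, the only requirement of this script is a dataset with a "text" column holding fully formatted prompt/response strings. A minimal sketch for a local JSONL file (the file name and field names here are hypothetical placeholders):

from datasets import load_dataset

# Each line of train.jsonl is assumed to look like:
# {"prompt": "What is QLoRA?", "response": "QLoRA is ..."}
ds = load_dataset("json", data_files={"train": "train.jsonl"})

def to_text(example):
    # Mirror the [INST] formatting of the guanaco-llama2 dataset used above
    return {"text": f"[INST] {example['prompt']} [/INST] {example['response']}"}

ds = ds.map(to_text)
# Pass ds["train"] to SFTTrainer exactly as in train_llama4_qlora.py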

Carl Peterson
