How to Fine-tune Llama 4

Why this guide
Meta’s Llama 4 Scout uses a Mixture-of-Experts design (17B active params, 16 experts, 109B total) and supports long context. With QLoRA and Unsloth, you can fine-tune it on a single A100 80 GB. This walkthrough gives commands, runtimes, and cost math—no infra expertise required.
Prerequisites
Tip: Follow the Thunder Compute Quick Start to install the VS Code extension. Most prerequisites come pre-installed in Thunder Compute instances.
1. Launch an A100 80 GB instance
- Console: New Instance → A100 80 GB
- VS Code: Thunder tab → + → A100 80 GB
- Disk: 300 GB (room for model, checkpoints, dataset)
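Once you are connected (step 2), you can confirm the disk is sized right with a quick Python check; the "/" mount point is an assumption, so point it at whatever mount holds your checkpoints:
import shutil

# "/" is an assumption; change it if your data lives on a separate mount.
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk: {free_gb:.0f} GB (aim for 300 GB or more)")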
2. Connect from VS Code
Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote-SSH add-on needed.
3. Request model access
Request Llama 4 access via llama.com or the official Meta Hugging Face org. Approvals are usually quick.
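If you want to confirm the approval went through before downloading tens of gigabytes of weights, a small check with huggingface_hub works (a sketch that assumes you have already run huggingface-cli login; the call errors out on a gated repo until access is granted):
from huggingface_hub import model_info

# Fails with a gated-repo / authorization error until Meta approves your request.
info = model_info("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(info.id, "- access granted")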
4. Minimal QLoRA project (Unsloth)
Why Unsloth? It’s currently the most stable stack for Llama 4 QLoRA—~71 GB VRAM for Scout at micro-batch size 1, 2k context, fitting on an A100 80 GB.
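As a rough sanity check on that ~71 GB figure: 4-bit weights for roughly 109B total parameters already occupy about 55-60 GB, and LoRA adapters, gradients, optimizer state, and 2k-context activations add the rest. A back-of-envelope sketch with illustrative numbers (not a measurement; real overheads vary by Unsloth and transformers version):
# Illustrative VRAM estimate for 4-bit QLoRA on Llama 4 Scout.
total_params = 109e9        # Scout: 17B active, ~109B total across 16 experts
bytes_per_param = 0.5       # 4-bit quantized weights
quant_overhead = 1.1        # rough allowance for quantization scales (assumption)

weights_gb = total_params * bytes_per_param * quant_overhead / 1e9
print(f"4-bit weights: ~{weights_gb:.0f} GB")
# LoRA adapters, gradients, optimizer state, and 2k-context activations
# add roughly another 10-15 GB in practice, landing near the ~71 GB figure above.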
Shell:
conda create -y -n llama4-qlora python=3.10
conda activate llama4-qlora
pip install -U unsloth trl datasets accelerate bitsandbytes transformers peft
pip install -U huggingface_hub
huggingface-cli login
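Before writing the training script, it is worth confirming the GPU is visible from the new environment (a minimal PyTorch check you can run as a one-off):
import torch

# Confirm the A100 80 GB is visible before kicking off training.
assert torch.cuda.is_available(), "No CUDA device visible"
props = torch.cuda.get_device_properties(0)
print(props.name, f"{props.total_memory / 1e9:.0f} GB")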
train_llama4_qlora.py:
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Small instruction-tuning dataset with a ready-made "text" column
ds = load_dataset("mlabonne/guanaco-llama2-1k")

max_seq_len = 2048
use_bf16 = is_bfloat16_supported()

# Load the base model in 4-bit for QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=max_seq_len,
    dtype=None,            # auto-detect (bf16 on A100)
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

args = TrainingArguments(
    output_dir="llama4-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=use_bf16,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_len,
    packing=True,                    # pack short examples into full-length sequences
    args=args,
)

trainer.train()

model.save_pretrained("llama4-qlora-adapter")
tokenizer.save_pretrained("llama4-qlora-adapter")
Run:
python train_llama4_qlora.py
5. VRAM & runtime
- Llama 4 Scout (QLoRA, 4-bit, Unsloth): ~70–75 GB VRAM on A100 80 GB
- Llama 3-8B (QLoRA, 4-bit): < 20 GB VRAM
Cost example: 2 hours × $0.78/hr = $1.56
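To compare your own run against these figures, print PyTorch's peak allocation once training finishes (add this after trainer.train() in the script above; nvidia-smi will report slightly more because of the CUDA context and allocator fragmentation):
import torch

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM allocated by PyTorch: {peak_gb:.1f} GB")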
6. Track spend & shut down
Use the Thunder console to monitor costs. Stopping the instance halts GPU billing; disk persists at storage rates.
7. Next steps
- Swap in your dataset
- Increase num_train_epochs until validation loss plateaus
- If VRAM allows, set load_in_4bit=False to train at higher precision
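Once training finishes, a quick smoke test is to reload the saved adapter with Unsloth and generate a response. A minimal sketch, assuming Unsloth resolves the 4-bit base model from the adapter's config; the prompt is a placeholder, so match the formatting your training data used:
from unsloth import FastLanguageModel

# Reload the LoRA adapter saved by the training script.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llama4-qlora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to faster inference mode

prompt = "[INST] Explain QLoRA in two sentences. [/INST]"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))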