How to Fine-tune Llama 4

Fine‑tune Llama 4 on a single A100 GPU, with exact commands, runtimes, and cost math.

Published: Apr 19, 2025 | Last updated: Jun 17, 2025

Why this guide?

Meta’s Llama 4 Scout packs serious performance into 17 B active parameters (a 16‑expert MoE), yet you can still fine‑tune it cheaply by combining QLoRA with a single A100 80 GB. This walkthrough shows the exact commands, runtimes, and cost math so you can reproduce the results without any infra expertise.

Prerequisites

| What you need | Why |
| --- | --- |
| Thunder Compute account | Fast access to an A100 80 GB at $0.78 / hr |
| VS Code + Thunder Compute extension | One‑click instance creation & remote workspace |
| Python 3.10 + Conda | Clean, reproducible environment |
| Hugging Face account | Model & dataset hub |

Tip: Follow the Thunder Compute Quick Start to install the VS Code extension.

1. Launch an A100 80 GB instance

  • Console → New Instance › A100 80 GB
  • VS Code → Thunder tab → A100 80 GB

  • Set disk = 300 GB (fits model + dataset)

2. Connect from VS Code

Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote‑SSH add‑on needed.

3. Prepare the environment

# CUDA drivers come pre‑installed on the template image
conda create -y -n l4 python=3.10
conda activate l4
# Llama 4 support landed in transformers v4.51, so pin at least that
pip install --upgrade "transformers>=4.51" datasets accelerate bitsandbytes peft trl
huggingface-cli login   # paste your token
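
Optional, but worth doing before pulling tens of gigabytes of weights: a one‑liner to confirm PyTorch (installed as a dependency of accelerate) can see the A100.

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"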

Access permissions: request Llama 4 access on the model’s Hugging Face page (meta-llama/Llama-4-Scout-17B-16E-Instruct); approval typically takes < 5 min.
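
If you want to confirm the gate is lifted before starting a long download, recent versions of huggingface_hub (0.25+) ship an auth_check helper:

# Raises GatedRepoError if your token lacks access to the gated repo
from huggingface_hub import auth_check

auth_check("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print("Access granted")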

4. Minimal QLoRA script

Create train_llama_qLoRA.py:

import logging
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          BitsAndBytesConfig, TrainingArguments,
                          logging as hf_logging)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

hf_logging.set_verbosity_info()
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 2% of the dataset keeps the demo run short; SFTTrainer trains on a "text" column
ds = load_dataset("Abirate/english_quotes", split="train[:2%]")
ds = ds.map(lambda row: {"text": row["quote"]})

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token

# NF4 4-bit quantization (QLoRA); passing load_in_4bit directly is deprecated
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained(MODEL,
                                            quantization_config=bnb,
                                            device_map="auto")

# Low-rank adapters on the attention projections; the quantized base stays frozen
lora = LoraConfig(r=64, lora_alpha=16,
                  target_modules=["q_proj","k_proj","v_proj","o_proj"],
                  lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size of 16
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
)

trainer = SFTTrainer(model=model,
                     train_dataset=ds,
                     tokenizer=tok,   # renamed processing_class in newer trl
                     args=args)

trainer.train()

Run it:
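
python train_llama_qLoRA.py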

5. Runtime & VRAM

| Model | Steps (≈ 1 epoch on 2 % data) | Time | Peak VRAM |
| --- | --- | --- | --- |
| Llama 3‑8B (4‑bit) | ~1 500 | ~2 h | 42 GB |
| Llama 4 Scout 17B (4‑bit) | ~1 500 | ~2 h | 79 GB |

Need Llama 4 Maverick? Spin up at least 4× A100 80 GB (see FAQ) and launch with torchrun --nproc_per_node $N ....

6. Track spend & shut down

Use the Thunder Compute console to monitor cost. At $0.78 / hr, the ~2 h Scout run above comes to roughly $1.56. Stopping the instance halts GPU billing while keeping the disk.
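
If you prefer scripting shutdowns, Thunder Compute also ships a CLI (the tnr tool from its Quick Start docs); assuming your instance ID is 0, stopping would look like this:

tnr stop 0   # GPU billing stops; the 300 GB disk persists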

7. Next steps

  • Swap in your own dataset (see the sketch after this list)

  • Increase num_train_epochs until validation loss plateaus

  • If VRAM allows, swap the quantization config to load_in_8bit=True for 8‑bit precision
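
Here is a minimal sketch of the dataset swap, assuming a local JSONL file; my_data.jsonl and the prompt/completion field names are placeholders for your own schema:

from datasets import load_dataset

# Placeholder path and field names; adapt to your schema
ds = load_dataset("json", data_files="my_data.jsonl", split="train")
# SFTTrainer trains on the "text" column by default, so map your fields onto it
ds = ds.map(lambda row: {"text": row["prompt"] + "\n" + row["completion"]})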

FAQ

Why QLoRA instead of full fine‑tuning?

QLoRA quantizes the frozen base model to 4‑bit and trains small low‑rank adapters on top, letting even 70 B+ checkpoints fit on one GPU with minimal quality loss (see the QLoRA paper).
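
You can verify how small the trainable slice is with peft's built-in counter; for a weight matrix of shape d_out × d_in, LoRA adds only r × (d_in + d_out) parameters:

# On the PeftModel from the script above; prints trainable vs. total counts
model.print_trainable_parameters()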

How much does an A100 80 GB cost?

$0.78 / hr at Thunder Compute—price checked June 2025.

Does Llama 4 Maverick fit on one GPU?

No. Even in 4‑bit it needs ~300 GB VRAM; launch at least 4× A100 80 GB or similar.
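
Back-of-envelope math behind that figure, assuming Maverick's roughly 400 B total parameters (17 B active, 128 experts):

params = 400e9          # approximate total parameter count
bytes_per_param = 0.5   # 4-bit weights
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~200 GB for weights alone;
# KV cache, activations, and adapter/optimizer state push the total toward ~300 GB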

Author

Carl Peterson, former NVIDIA solutions architect with 10+ years building large‑scale ML infra. Follow me on LinkedIn or X.


Ready to build?

Create a free Thunder Compute account and start training in minutes.
