
How to Fine-tune Llama 4

Last update: December 7, 2025

Why this guide

Meta's Llama 4 Scout uses a Mixture-of-Experts design (17B active params, 16 experts, 109B total) and supports long context. With QLoRA and Unsloth, you can fine-tune it on a single A100 80 GB. This walkthrough gives commands, runtimes, and cost math—no infra expertise required.

Prerequisites

<table><thead><tr><th>What you need</th><th>Why</th></tr></thead><tbody><tr><td><a href="https://www.thundercompute.com" rel="noopener" target="_blank">Thunder Compute account</a></td><td>Fast access to an A100 80 GB at $0.78/hr</td></tr><tr><td><a href="https://code.visualstudio.com" rel="noopener" target="_blank">VS Code</a> + <a href="https://marketplace.visualstudio.com/items?itemName=ThunderCompute.thunder-compute" rel="noopener" target="_blank">Thunder Compute extension</a></td><td>One-click instance + integrated terminal</td></tr><tr><td><a href="https://www.python.org/downloads/release/python-3100/" rel="noopener" target="_blank">Python 3.10</a> + <a href="https://docs.conda.io/en/latest/" rel="noopener" target="_blank">Conda</a></td><td>Clean, reproducible env</td></tr><tr><td><a href="https://huggingface.co" rel="noopener" target="_blank">Hugging Face account</a> + <a href="https://www.llama.com/" rel="noopener" target="_blank">Llama 4 access</a></td><td>Model &amp; dataset hub</td></tr></tbody></table>

Tip: Follow the Thunder Compute Quick Start to install the VS Code extension. Most prerequisites come pre-installed in Thunder Compute instances.

1. Launch an A100 80 GB instance

<ul><li><strong>Console:</strong> New Instance → <strong>A100 80 GB</strong></li><li><strong>VS Code:</strong> Thunder Compute tab → <strong>+</strong> → <strong>A100 80 GB</strong></li><li><strong>Disk:</strong> 300 GB (room for model, checkpoints, dataset)</li></ul>

2. Connect from VS Code

Open Command Palette → Thunder Compute: Connect (or click ⇄). The integrated terminal now runs on the GPU box—no Remote-SSH add-on needed.

3. Request model access

Request Llama 4 access via llama.com or the official Meta Hugging Face org. Approvals are usually quick.
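Once your request is approved, authenticate the instance with a Hugging Face token so model downloads work. The `huggingface-cli` tool ships with the `huggingface_hub` package:

```shell
# Install the Hugging Face CLI if it isn't already present
pip install -U "huggingface_hub[cli]"

# Paste a token with "read" scope when prompted
huggingface-cli login
```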

4. Minimal QLoRA project (Unsloth)

Why Unsloth? It's currently the most stable stack for Llama 4 QLoRA—~71 GB VRAM for Scout at micro-batch size 1, 2k context, fitting on an A100 80 GB.

Shell:
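A minimal environment setup might look like the following; the conda env name is an assumption, and Unsloth pulls in torch, transformers, peft, trl, and bitsandbytes as dependencies:

```shell
# Create an isolated Python 3.10 environment
conda create -n llama4 python=3.10 -y
conda activate llama4

# Install Unsloth and its training dependencies
pip install -U unsloth
```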

train_llama4_qlora.py:
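A minimal QLoRA training script with Unsloth could look like the sketch below. The model ID, the `train.jsonl` dataset path (expected to contain a `text` field), and the hyperparameters are assumptions; swap in your own values.

```python
# train_llama4_qlora.py -- QLoRA fine-tune of Llama 4 Scout with Unsloth.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

MAX_SEQ_LEN = 2048  # 2k context keeps VRAM around 71 GB on an A100 80 GB

# Load the model in 4-bit (QLoRA). The model ID here is an assumption;
# use the checkpoint you were granted access to.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-4-Scout-17B-16E-Instruct",
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical dataset file with one {"text": "..."} record per line.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        per_device_train_batch_size=1,   # micro-batch 1 fits in 80 GB
        gradient_accumulation_steps=16,  # effective batch size of 16
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        output_dir="outputs",
    ),
)

trainer.train()
model.save_pretrained("llama4-scout-qlora")  # saves only the LoRA adapter
```

The micro-batch size of 1 with gradient accumulation keeps peak VRAM inside the 80 GB budget while preserving a reasonable effective batch size.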

Run:
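Launch training from the integrated terminal (the filename matches the script above):

```shell
# Kick off training; logs stream to the terminal
python train_llama4_qlora.py

# Optional: in a second terminal, watch VRAM usage during the run
watch -n 5 nvidia-smi
```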

5. VRAM & runtime

<ul><li><strong>Llama 4 Scout (QLoRA, 4-bit, Unsloth):</strong> ~70–75 GB VRAM on A100 80 GB</li><li><strong>Llama 3 8B (QLoRA, 4-bit):</strong> &lt; 20 GB VRAM</li></ul>

Cost example: 2 hours × $0.78/hr = $1.56
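The arithmetic generalizes to longer runs; a quick sketch for estimating cost at the quoted A100 rate:

```python
def training_cost(hours: float, rate_per_hour: float = 0.78) -> float:
    """Estimate GPU cost in dollars for a fine-tuning run."""
    return round(hours * rate_per_hour, 2)

print(training_cost(2))  # 2 h on an A100 80 GB -> 1.56
```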

6. Track spend & shut down

Use the Thunder Compute console to monitor costs. Stopping the instance halts GPU billing; disk persists at storage rates.

7. Next steps

<ul><li>Swap in your dataset</li><li>Increase num_train_epochs until validation loss plateaus</li><li>If VRAM allows (e.g. on a multi-GPU instance), set load_in_4bit=False to load the model unquantized in 16-bit precision</li></ul>
