Weights & Biases (wandb) is an experiment tracking and model management platform that’s particularly useful when training large models on Cloud GPUs. It helps you:
  • Track training runs, hyperparameters, and metrics
  • Monitor GPU/CPU utilization in real time
  • Version datasets and model checkpoints
  • Run large-scale hyperparameter sweeps across many GPU instances
On Thunder Compute, wandb helps you monitor GPU utilization, identify bottlenecks, and track training metrics.

Prerequisites

  • A Thunder Compute GPU instance created and connected
  • Python environment set up on your instance
  • A Weights & Biases account (https://wandb.ai)

Installation

Install wandb on your Thunder Compute instance:
pip install wandb
Or add to a requirements.txt:
echo "wandb" >> requirements.txt
pip install -r requirements.txt

Authentication

Authenticate with:
wandb login
Or via environment variable:
export WANDB_API_KEY="your_api_key"
wandb login --relogin
Running wandb login without the environment variable set will prompt for your key:
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
Enter your API key, which you can find at https://wandb.ai/authorize (linked in the output above) after creating an account. Once entered, you will see:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /home/ubuntu/.netrc
wandb: Currently logged in as: username (entity-name) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
For shared or production Thunder instances, environment variables or secret managers are preferred over pasting API keys directly.
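For example, a minimal non-interactive login in Python, assuming WANDB_API_KEY has already been exported (for instance by your secret manager):
import os
import wandb

# Reads the key from the environment instead of an interactive prompt
wandb.login(key=os.environ["WANDB_API_KEY"])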

Getting Started

Follow these steps to run your first wandb experiment on your Thunder Compute instance.

Step 1 — Create a Training File

Create a new Python file on your instance:
nano train.py
Or create the file in an IDE connected to the instance over SSH.

Step 2 — Paste Minimal Working Example

Copy this minimal example into your train.py file:
import wandb
import time

# Initialize wandb
wandb.init(
    project="thunder-resnet",
    name="quick-test",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 5,
    },
)

# Simple training loop simulation
for epoch in range(5):
    # Simulate training metrics
    train_loss = 1.0 / (epoch + 1)
    train_acc = 0.5 + epoch * 0.1

    # Log metrics to wandb
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "train/accuracy": train_acc,
    })

    time.sleep(0.5)  # Simulate work

wandb.finish()

Step 3 — Run the Script

Execute your training script:
python train.py

Step 4 — Expected Output

You should see output similar to:
wandb: Currently logged in as: your-username (entity-name) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.23.0
wandb: Run data is saved locally in /home/ubuntu/wandb/run-20251120_135726-abcd
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run quick-test
wandb: ⭐️ View project at https://wandb.ai/entity-name/thunder-resnet
wandb: 🚀 View run at https://wandb.ai/entity-name/thunder-resnet/runs/abcd
wandb:
wandb: Run history:
wandb:          epoch ▁▃▅▆█
wandb: train/accuracy ▁▃▅▆█
wandb:     train/loss █▄▂▁▁
wandb:
wandb: Run summary:
wandb:          epoch 4
wandb: train/accuracy 0.9
wandb:     train/loss 0.2
wandb:
wandb: 🚀 View run quick-test at: https://wandb.ai/entity-name/thunder-resnet/runs/abcd
wandb: ⭐️ View project at: https://wandb.ai/entity-name/thunder-resnet
wandb: Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20251120_135726-abcd/logs

Step 5 — View Your Results

  1. View your dashboard: Click the link in the output or visit https://wandb.ai and navigate to your project
  2. View in Table view: From your wandb homepage, go to Projects > thunder-resnet > Table to see all your runs in a tabular format
  3. Compare runs: Run the script multiple times with different configurations to compare results
  4. Add artifacts: See the Model Checkpointing with Weights & Biases Artifacts section to version checkpoints and datasets
  5. Scale to multi-GPU: Check out Distributed Training for multi-GPU setups
  6. Run sweeps: Use Hyperparameter Sweeps for automated hyperparameter search

Viewing Results

  1. Visit https://wandb.ai
  2. Select your project
  3. Explore:
    • Metrics charts
    • GPU utilization
    • Model checkpoints
    • Dataset artifacts
    • Sweep dashboards

Core Concepts for Cloud GPU Workloads

When using remote GPUs, these wandb features matter most:
  1. Run tracking — metrics, hyperparameters, logs
  2. GPU/system monitoring — GPU utilization, power, memory, CPU load
  3. Artifacts — versioned checkpoints and datasets
  4. Sweeps — distributed hyperparameter search
  5. Groups & jobs — organize multi-GPU/distributed training

Basic Usage

Initialize a Run

import wandb

wandb.init(
    project="my-thunder-project",
    name="baseline-resnet50",
    config={
        "learning_rate": 3e-4,
        "batch_size": 64,
        "epochs": 20,
        "optimizer": "adamw",
        "precision": "fp16",
    },
)

Log Metrics

# loss, acc, and step come from your training loop
wandb.log({
    "train/loss": loss,
    "train/accuracy": acc,
    "step": step,
})

Best Logging Practices

  • Log every N steps (e.g., 10–50) to minimize overhead; see the sketch after this list
  • Avoid logging huge tensors every step
  • Use artifacts for large files
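A minimal sketch of step-throttled logging; dataloader, train_step, and the interval are placeholders for your own code:
import wandb

log_every = 25  # hypothetical interval; tune to your workload

for step, batch in enumerate(dataloader):
    loss = train_step(batch)  # your training step
    if step % log_every == 0:
        wandb.log({"train/loss": loss, "step": step})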

GPU & System Monitoring

Wandb automatically collects:
  • GPU utilization
  • GPU memory usage
  • GPU temperature and power
  • CPU usage
  • RAM usage
  • Disk and network I/O
Use these graphs to diagnose:
  • GPU-bound workloads
  • Data-bound workloads
  • Bottlenecks due to I/O or preprocessing
  • Too-small batch sizes

Improving GPU Utilization

  • Increase batch size until GPU memory is near capacity
  • Use mixed precision (torch.cuda.amp); see the sketch after this list
  • Increase dataloader workers
  • Preload/augment data on the GPU
  • Reduce unnecessary synchronizations
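For example, raising the dataloader worker count and enabling mixed precision with torch.cuda.amp; a sketch that assumes dataset, model, and optimizer already exist:
import torch
from torch.utils.data import DataLoader

# More workers and pinned memory keep the GPU fed; tune num_workers to your CPU count
loader = DataLoader(dataset, batch_size=128, num_workers=8, pin_memory=True)

scaler = torch.cuda.amp.GradScaler()
for images, labels in loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # mixed-precision forward pass
        loss = torch.nn.functional.cross_entropy(model(images), labels)
    scaler.scale(loss).backward()    # scaled backward pass to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()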

Model Checkpointing with Weights & Biases Artifacts

When you train on Thunder Compute GPU instances, it’s important that your model checkpoints are not tied to a single machine. Weights & Biases Artifacts provide a simple way to:
  • Persist checkpoints even if the instance is deleted
  • Move checkpoints between different Thunder instances (or GPU types)
  • Share models with your team
  • Reproduce and resume long-running training jobs
This section provides a walkthrough of how to do checkpointing with wandb.

Why use Artifacts for checkpoints?

Saving checkpoints only to the local filesystem is risky:
  • Thunder instances may be stopped or recreated
  • You may want to resume training on a different GPU (A100 → H100)
  • Your team may need to reuse your model
  • You may want versioned, reproducible training history
Artifacts solve this by storing checkpoints in W&B’s managed, versioned storage.

Step 1 — Save a checkpoint locally during training

Inside your real training loop, periodically save a checkpoint.
For real projects (PyTorch):
import os
import torch

os.makedirs("checkpoints", exist_ok=True)

# ... inside your training loop ...
if (epoch + 1) % 5 == 0:
    ckpt_path = f"checkpoints/model_epoch_{epoch+1}.pt"
    torch.save(model.state_dict(), ckpt_path)
It is best practice to save checkpoints inside a dedicated checkpoints/ folder.

Step 2 — Log the checkpoint as a W&B Artifact

Right after saving your file:
import wandb

artifact = wandb.Artifact(
    name=f"resnet50-epoch-{epoch+1}",
    type="model",
    metadata={
        "epoch": epoch + 1,
        "val_loss": float(val_loss),
        "val_accuracy": float(val_acc),
    },
)

artifact.add_file(ckpt_path)
wandb.log_artifact(artifact)
This uploads your checkpoint to W&B and keeps a permanent copy.

Step 3 — View & manage checkpoints in the W&B UI

  1. Go to your wandb project
  2. Open the Artifacts tab
  3. Click your model artifact
  4. You can now:
    • View version history (v0, v1, v2…)
    • Open the metrics/metadata
    • Download the checkpoint
    • Use it as an input for new runs

Step 4 — Restore a checkpoint on another Thunder instance

On a fresh machine:
import wandb
import torch

run = wandb.init(project="my-thunder-project", job_type="restore")

artifact = run.use_artifact(
    "wato/my-thunder-project/resnet50-epoch-10:latest",
    type="model",
)
artifact_dir = artifact.download()

# Recreate the model architecture first, then load the restored weights
checkpoint = torch.load(f"{artifact_dir}/model_epoch_10.pt", map_location="cuda")
model.load_state_dict(checkpoint)
model.to("cuda")
You now have the exact model weights from your previous run — even if the original instance is gone.

Step 5 — Resume training

model.load_state_dict(checkpoint)
model.to("cuda")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Resume from the epoch after the restored checkpoint; `config` is your
# hyperparameter object (e.g. wandb.config), and train_one_epoch/validate
# are your own training and validation functions.
start_epoch = 10
for epoch in range(start_epoch, config.epochs):
    train_one_epoch(...)
    validate(...)
    wandb.log({"epoch": epoch})

Example: Adding Checkpointing to a Minimal train.py

Here is a working example using the simple training script from the Getting Started section. This example simulates a checkpoint file (JSON), but the workflow is identical for real model weights.
import wandb
import time
import json
import os

# Initialize wandb
wandb.init(
    project="thunder-resnet",
    name="quick-test",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 5,
    },
)

os.makedirs("checkpoints", exist_ok=True)

for epoch in range(5):
    # Simulate training metrics
    train_loss = 1.0 / (epoch + 1)
    train_acc = 0.5 + epoch * 0.1

    # Log metrics to wandb
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "train/accuracy": train_acc,
    })

    # ---- Checkpointing Example ----
    # In a real project this would be torch.save(model.state_dict(), ...)
    checkpoint_path = f"checkpoints/epoch_{epoch}.json"
    with open(checkpoint_path, "w") as f:
        json.dump({
            "epoch": epoch,
            "train_loss": train_loss,
            "train_accuracy": train_acc,
        }, f)

    # Log checkpoint as an artifact
    artifact = wandb.Artifact(
        name=f"quick-test-epoch-{epoch}",
        type="model",
        metadata={
            "epoch": epoch,
            "train_loss": train_loss,
            "train_accuracy": train_acc
        },
    )
    artifact.add_file(checkpoint_path)
    wandb.log_artifact(artifact)
    # --------------------------------

    time.sleep(0.5)

wandb.finish()
This example demonstrates:
  • how checkpoint files are created
  • how they are logged as Artifacts
  • how each epoch becomes a tracked, versioned checkpoint
These appear in the Artifacts tab of your project.

Quick Reference: Other Artifact Types

Artifacts aren’t just for model checkpoints. You can also version datasets:
# Logging a Dataset
dataset = wandb.Artifact("imagenet-subset", type="dataset")
dataset.add_dir("data/imagenet_subset")
wandb.log_artifact(dataset)
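A later run, possibly on a different Thunder instance, can then pull that exact dataset version back down (assuming it was logged to the same project):
import wandb

run = wandb.init(project="my-thunder-project", job_type="train")
dataset = run.use_artifact("imagenet-subset:latest", type="dataset")
data_dir = dataset.download()  # local directory containing the versioned files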

Hyperparameter Sweeps (Multi‑GPU, Multi‑Instance)

Sweeps allow large-scale hyperparameter search across many Thunder Compute instances.

Step 1 — Create sweep.yaml

program: train.py
project: thunder-resnet
method: bayes

metric:
  name: val/accuracy
  goal: maximize

parameters:
  learning_rate:
    min: 0.00001
    max: 0.001
  batch_size:
    values: [32, 64, 128]
  weight_decay:
    min: 0.0
    max: 0.1
  augment:
    values: ["none", "light", "heavy"]

Step 2 — Initialize the sweep:

wandb sweep sweep.yaml
You should see output similar to:
wandb: Creating sweep from: sweep.yaml
wandb: Creating sweep with ID: fgbkmk3q
wandb: View sweep at: https://wandb.ai/entity-name/thunder-resnet/sweeps/fgbkmk3q
wandb: Run sweep agent with: wandb agent entity-name/thunder-resnet/fgbkmk3q

Step 3 — Run agents on Thunder GPU instances:

wandb agent <entity>/<project>/<sweep_id>
Each agent pulls new hyperparameters and launches a run automatically.
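For the agents to do anything useful, train.py must read its hyperparameters from wandb.config and log the metric named in sweep.yaml. A sketch, where evaluate_model stands in for your real training and validation code:
import wandb

def main():
    wandb.init(project="thunder-resnet")
    config = wandb.config  # filled in by the sweep agent

    for epoch in range(10):
        # The logged key must match the metric in sweep.yaml (val/accuracy)
        val_acc = evaluate_model(
            lr=config.learning_rate,
            batch_size=config.batch_size,
            weight_decay=config.weight_decay,
            augment=config.augment,
        )
        wandb.log({"epoch": epoch, "val/accuracy": val_acc})

if __name__ == "__main__":
    main()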

Distributed Training (DDP, Lightning, DeepSpeed)

PyTorch DDP Example

wandb.init(
    project="thunder-ddp",
    group="llama7b-a100x4",
    job_type="training",
)
Set run names per rank:
wandb.run.name = f"gpu-{rank}"
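Putting both pieces together for a torchrun-style launch; a sketch in which RANK is set by the launcher and the logged value is illustrative:
import os
import wandb

rank = int(os.environ.get("RANK", 0))  # torchrun sets RANK per process

run = wandb.init(
    project="thunder-ddp",
    group="llama7b-a100x4",  # all ranks share one group in the UI
    job_type="training",
    name=f"gpu-{rank}",
)

# Optionally log scalar metrics from rank 0 only to avoid duplicate charts
if rank == 0:
    wandb.log({"train/loss": 1.23})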

PyTorch Lightning Example

from lightning.pytorch import Trainer
from lightning.pytorch.loggers import WandbLogger

wandb_logger = WandbLogger(project="thunder-lightning-demo")

trainer = Trainer(
    logger=wandb_logger,
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    max_epochs=50,
)

trainer.fit(model)
Lightning automatically:
  • Logs metrics (call wandb_logger.watch(model) to also log gradients)
  • Tracks checkpoints (see the sketch below for uploading them as artifacts)
  • Handles multi-GPU logging
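To also upload Lightning checkpoints to W&B as artifacts, the logger can be configured with log_model; a sketch that assumes your LightningModule logs a val/loss metric:
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import WandbLogger

# log_model="all" uploads every checkpoint saved by ModelCheckpoint as a W&B artifact
wandb_logger = WandbLogger(project="thunder-lightning-demo", log_model="all")
checkpoint_cb = ModelCheckpoint(monitor="val/loss", mode="min", save_top_k=2)

trainer = Trainer(
    logger=wandb_logger,
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    max_epochs=50,
    callbacks=[checkpoint_cb],
)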

Offline Mode (Air‑Gapped or Firewalled Environments)

Thunder instances may have intermittent or restricted internet access.

Run in offline mode:

export WANDB_MODE=offline
python train.py

Sync later:

wandb sync /path/to/wandb/run-folder

Fully disable wandb:

export WANDB_MODE=disabled
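The same modes can also be set per run in code rather than through the environment:
import wandb

# Equivalent to WANDB_MODE=offline for this run; sync later with `wandb sync`
run = wandb.init(project="thunder-resnet", mode="offline")

# mode="disabled" turns all wandb calls into no-ops instead
# run = wandb.init(project="thunder-resnet", mode="disabled")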

Best Practices for Thunder Compute GPU Instances

Run Management

  • Use meaningful run names that include dataset + model + GPU type (see the example after this list)
  • Log all hyperparameters in wandb.config
  • Track system metrics to diagnose bottlenecks
  • Organize multi-GPU runs using group
  • Reduce logging overhead by batching logs
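For example, an initialization that follows these conventions (names and tags are illustrative):
import wandb

wandb.init(
    project="my-thunder-project",
    name="imagenet-resnet50-a100",  # dataset + model + GPU type
    group="resnet50-ddp",           # groups related multi-GPU runs
    tags=["a100", "fp16"],
    config={"learning_rate": 3e-4, "batch_size": 64, "epochs": 20},
)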

Artifacts & Checkpointing

  • Use meaningful artifact names (e.g. llama7b-a100-epoch20)
  • Attach useful metadata (epoch, val metrics, dataset version)
  • Log fewer but higher-quality checkpoints
  • Always use artifacts for long or expensive runs
  • Use use_artifact(...).download() to restore weights anywhere
  • Use artifacts for datasets and checkpoints

Experimentation

  • Use sweeps for expensive experiments
  • Compare runs systematically using the dashboard
  • Monitor GPU utilization to optimize batch sizes

Troubleshooting

Authentication Issues

wandb login --relogin

GPU Metrics Not Showing

  • Ensure nvidia-smi works inside the environment
  • Use GPU-enabled containers (--gpus all)
  • Call wandb.init() early

Connection Issues

  • Verify outbound internet access
  • Firewalls must allow connections to *.wandb.ai
  • Use offline mode if required

Large File Uploads

  • Always use artifacts for multi-GB files
  • Compress large checkpoints
  • Prune old versions

Need Help?