- Track training runs, hyperparameters, and metrics
- Monitor GPU/CPU utilization in real time
- Version datasets and model checkpoints
- Run large-scale hyperparameter sweeps across many GPU instances
Prerequisites
- A Thunder Compute GPU instance created and connected
- Python environment set up on your instance
- A Weights & Biases account (https://wandb.ai)
Installation
Install wandb on your Thunder Compute instance, or add it to your requirements.txt:
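A minimal sketch of both options; pin the exact version you want in real projects:
```bash
# Install directly into the active environment
pip install wandb

# Or add it to requirements.txt and install from there
echo "wandb" >> requirements.txt
pip install -r requirements.txt
```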
Authentication
Authenticate with the wandb CLI, shown below. For shared or production Thunder instances, prefer environment variables or a secrets manager over pasting API keys directly.
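For example (the API key value is a placeholder):
```bash
# Interactive login; prompts for the API key from https://wandb.ai/authorize
wandb login

# Non-interactive alternative for shared or production instances
export WANDB_API_KEY="<your-api-key>"
```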
Getting Started
Follow these steps to run your first wandb experiment on your Thunder Compute instance.
Step 1 — Create a Training File
Create a new Python file on your instance named train.py.
Step 2 — Paste Minimal Working Example
Copy this minimal example into your train.py file:
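A sketch of such a script; the hyperparameters and simulated metrics are illustrative:
```python
import math
import random

import wandb

# Start a run; the project name matches the example project used below
run = wandb.init(
    project="thunder-resnet",
    config={"learning_rate": 0.01, "epochs": 10, "batch_size": 64},
)

# Simulate a short training loop and log metrics once per epoch
for epoch in range(run.config.epochs):
    loss = 2.0 * math.exp(-0.3 * epoch) + random.uniform(0, 0.05)
    accuracy = 1.0 - loss / 2.5
    wandb.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})

run.finish()
```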
Step 3 — Run the Script
Execute your training script:
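Assuming the file is named train.py (as in Step 1):
```bash
python train.py
```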
Step 4 — Expected Output
When the run starts, wandb prints a link to the run page on wandb.ai; when the script finishes, it prints a summary of the logged metrics.
Step 5 — View Your Results
- View your dashboard: Click the link in the output or visit https://wandb.ai and navigate to your project
- View in Table view: Go to (Project Name) > Projects > thunder-resnet > Table to see all your runs in a tabular format
- Compare runs: Run the script multiple times with different configurations to compare results
- Add artifacts: See the Model Checkpointing with Weights & Biases Artifacts section to version checkpoints and datasets
- Scale to multi-GPU: Check out Distributed Training for multi-GPU setups
- Run sweeps: Use Hyperparameter Sweeps for automated hyperparameter search
Viewing Results
- Visit https://wandb.ai
- Select your project
- Explore:
- Metrics charts
- GPU utilization
- Model checkpoints
- Dataset artifacts
- Sweep dashboards
Core Concepts for Cloud GPU Workloads
When using remote GPUs, these wandb features matter most:
- Run tracking — metrics, hyperparameters, logs
- GPU/system monitoring — GPU utilization, power, memory, CPU load
- Artifacts — versioned checkpoints and datasets
- Sweeps — distributed hyperparameter search
- Groups & jobs — organize multi-GPU/distributed training
Basic Usage
Initialize a Run
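A minimal sketch; the project name, run name, and config values are placeholders:
```python
import wandb

run = wandb.init(
    project="my-project",           # placeholder project name
    name="resnet50-a100-run1",      # optional, human-readable run name
    config={"learning_rate": 3e-4, "batch_size": 128, "epochs": 20},
)
```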
Log Metrics
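A sketch of logging inside a training loop, assuming a dataloader and a train_step function already exist:
```python
for step, batch in enumerate(dataloader):          # dataloader assumed to exist
    loss = train_step(batch)                       # hypothetical training step
    if step % 10 == 0:                             # log every N steps to limit overhead
        wandb.log({"train/loss": loss, "step": step})
```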
Best Logging Practices
- Log every N steps (e.g., 10–50) to minimize overhead
- Avoid logging huge tensors every step
- Use artifacts for large files
GPU & System Monitoring
Wandb automatically collects:
- GPU utilization
- GPU memory usage
- GPU temperature and power
- CPU usage
- RAM usage
- Disk and network I/O
These system metrics help you identify:
- GPU-bound workloads
- Data-bound workloads
- Bottlenecks due to I/O or preprocessing
- Too-small batch sizes
Improving GPU Utilization
- Increase batch size until GPU memory is near capacity
- Use mixed precision (torch.cuda.amp); a sketch follows this list
- Increase dataloader workers
- Preload/augment data on the GPU
- Reduce unnecessary synchronizations
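A minimal mixed-precision sketch using torch.cuda.amp; the model, optimizer, criterion, and dataloader are assumed to be defined already:
```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:                 # dataloader assumed to exist
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()                          # optimizer assumed to exist
    with torch.cuda.amp.autocast():                # forward pass in mixed precision
        loss = criterion(model(inputs), targets)   # model/criterion assumed to exist
    scaler.scale(loss).backward()                  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```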
Model Checkpointing with Weights & Biases Artifacts
When you train on Thunder Compute GPU instances, it’s important that your model checkpoints are not tied to a single machine. Weights & Biases Artifacts provide a simple way to:
- Persist checkpoints even if the instance is deleted
- Move checkpoints between different Thunder instances (or GPU types)
- Share models with your team
- Reproduce and resume long-running training jobs
Why use Artifacts for checkpoints?
Saving checkpoints only to the local filesystem is risky:
- Thunder instances may be stopped or recreated
- You may want to resume training on a different GPU (A100 → H100)
- Your team may need to reuse your model
- You may want versioned, reproducible training history
Step 1 — Save a checkpoint locally during training
Inside your real training loop, periodically save a checkpoint. For real projects (PyTorch):
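A sketch of one common pattern; model, optimizer, and epoch come from your own training loop:
```python
import os

import torch

os.makedirs("checkpoints", exist_ok=True)
checkpoint_path = f"checkpoints/model-epoch{epoch}.pt"

# model, optimizer, and epoch are assumed to exist in the surrounding loop
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    checkpoint_path,
)
```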
It is best practice to save checkpoints inside a dedicated checkpoints/ folder.
Step 2 — Log the checkpoint as a W&B Artifact
Right after saving your file, log it as a W&B Artifact:
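A sketch; the artifact name and metadata fields are examples:
```python
import wandb

artifact = wandb.Artifact(
    name="resnet50-checkpoint",                    # example artifact name
    type="model",
    metadata={"epoch": epoch, "val_accuracy": val_accuracy},  # values from your loop
)
artifact.add_file(checkpoint_path)
wandb.log_artifact(artifact)
```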
Step 3 — View & manage checkpoints in the W&B UI
- Go to your wandb project
- Open the Artifacts tab
- Click your model artifact
- You can now:
- View version history (v0, v1, v2…)
- Open the metrics/metadata
- Download the checkpoint
- Use it as an input for new runs
Step 4 — Restore a checkpoint on another Thunder instance
On a fresh machine, download the checkpoint artifact from your project.
Step 5 — Resume training
Load the downloaded checkpoint into your model and optimizer, restore the epoch counter, and continue training and logging as before; the sketch below covers both steps.
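A sketch of restoring and resuming; the artifact name, project, and the way the model and optimizer are rebuilt are assumptions:
```python
import glob

import torch
import wandb

run = wandb.init(project="thunder-resnet", job_type="resume-training")

# Download the latest version of the checkpoint artifact
artifact = run.use_artifact("resnet50-checkpoint:latest", type="model")
artifact_dir = artifact.download()

# Load the most recent checkpoint file from the downloaded directory
checkpoint_file = sorted(glob.glob(f"{artifact_dir}/*.pt"))[-1]
checkpoint = torch.load(checkpoint_file)

# model and optimizer are assumed to be rebuilt the same way as in training
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
total_epochs = 20                                  # example target epoch count

# Resume the loop from the restored epoch and keep logging to wandb
for epoch in range(start_epoch, total_epochs):
    loss = train_one_epoch(model, optimizer)       # hypothetical helper
    wandb.log({"epoch": epoch, "loss": loss})
```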
Example: Adding Checkpointing to a Minimal train.py
Here is a working example using the simple training script from the Getting Started section.
This example simulates a checkpoint file (JSON), but the workflow is identical for real model weights.
The full script below shows:
- how checkpoint files are created
- how they are logged as Artifacts
- how each epoch becomes a tracked, versioned checkpoint
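A sketch of such a script; it extends the minimal example above, and the project name and simulated metrics are illustrative:
```python
import json
import math
import os
import random

import wandb

run = wandb.init(
    project="thunder-resnet",
    config={"learning_rate": 0.01, "epochs": 5},
)
os.makedirs("checkpoints", exist_ok=True)

for epoch in range(run.config.epochs):
    # Simulated training metrics
    loss = 2.0 * math.exp(-0.3 * epoch) + random.uniform(0, 0.05)
    wandb.log({"epoch": epoch, "loss": loss})

    # Write a simulated (JSON) checkpoint; real projects would use torch.save here
    checkpoint_path = f"checkpoints/checkpoint-epoch{epoch}.json"
    with open(checkpoint_path, "w") as f:
        json.dump({"epoch": epoch, "loss": loss}, f)

    # Log each checkpoint as a new version of the same artifact
    artifact = wandb.Artifact(
        name="thunder-resnet-checkpoint",
        type="model",
        metadata={"epoch": epoch, "loss": loss},
    )
    artifact.add_file(checkpoint_path)
    wandb.log_artifact(artifact)

run.finish()
```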
Quick Reference: Other Artifact Types
Artifacts aren’t just for model checkpoints. You can also version datasets:
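For example (the dataset name and local directory are placeholders):
```python
import wandb

run = wandb.init(project="thunder-resnet", job_type="dataset-upload")

dataset = wandb.Artifact(
    name="training-data",                          # placeholder dataset name
    type="dataset",
    metadata={"notes": "cleaned and deduplicated"},
)
dataset.add_dir("data/")                           # add a whole directory
run.log_artifact(dataset)
run.finish()
```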
Hyperparameter Sweeps (Multi‑GPU, Multi‑Instance)
Sweeps allow large-scale hyperparameter search across many Thunder Compute instances.
Step 1 — Create sweep.yaml
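A sketch of a sweep config; the program name, metric, and parameter ranges are examples:
```yaml
program: train.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
```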
Step 2 — Initialize the sweep:
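For example:
```bash
wandb sweep sweep.yaml
```
This prints a sweep ID to use with wandb agent.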
Step 3 — Run agents on Thunder GPU instances:
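On each instance, run an agent using the entity, project, and sweep ID printed by the previous command:
```bash
wandb agent <entity>/<project>/<sweep_id>
```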
Distributed Training (DDP, Lightning, DeepSpeed)
PyTorch DDP Example
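A minimal sketch of wandb logging in a DDP script; it assumes launching with torchrun, the project and group names are examples, and only rank 0 talks to wandb:
```python
import os

import torch
import torch.distributed as dist
import wandb


def train_step():
    # Placeholder for a real forward/backward pass
    return torch.rand(1).item()


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Only rank 0 logs to wandb to avoid duplicate runs
    if rank == 0:
        wandb.init(
            project="thunder-ddp",
            group="ddp-example",
            config={"world_size": dist.get_world_size()},
        )

    for step in range(100):
        loss = train_step()
        if rank == 0 and step % 10 == 0:
            wandb.log({"loss": loss, "step": step})

    if rank == 0:
        wandb.finish()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
Launch it with, for example, torchrun --nproc_per_node=2 train_ddp.py (the filename is a placeholder).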
PyTorch Lightning Example
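A sketch using Lightning's WandbLogger; the project name is an example, and the LightningModule and datamodule are assumed to be defined elsewhere:
```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="thunder-lightning", log_model=True)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                 # use both GPUs on the instance
    strategy="ddp",
    logger=wandb_logger,
    max_epochs=10,
)
trainer.fit(model, datamodule=datamodule)   # model and datamodule assumed to be defined
```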
With this logger in place, the integration:
- Logs metrics and gradients
- Tracks checkpoints
- Handles multi-GPU logging
Offline Mode (Air‑Gapped or Firewalled Environments)
Thunder instances may have intermittent or restricted internet access.
Run in offline mode:
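For example:
```bash
# Either via the CLI...
wandb offline

# ...or via an environment variable
export WANDB_MODE=offline
```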
Sync later:
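Once the instance has connectivity again, sync the locally stored runs (the run directory path may differ on your machine):
```bash
wandb sync wandb/offline-run-*
```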
Fully disable wandb:
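For example:
```bash
export WANDB_MODE=disabled
```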
Best Practices for Thunder Compute GPU Instances
Run Management
- Use meaningful run names that include dataset + model + GPU type
- Log all hyperparameters in wandb.config
- Track system metrics to diagnose bottlenecks
- Organize multi-GPU runs using group
- Reduce logging overhead by batching logs
Artifacts & Checkpointing
- Use meaningful artifact names (e.g. llama7b-a100-epoch20)
- Attach useful metadata (epoch, val metrics, dataset version)
- Log fewer but higher-quality checkpoints
- Always use artifacts for long or expensive runs
- Use use_artifact(...).download() to restore weights anywhere
- Use artifacts for datasets and checkpoints
Experimentation
- Use sweeps for expensive experiments
- Compare runs systematically using the dashboard
- Monitor GPU utilization to optimize batch sizes
Troubleshooting
Authentication Issues
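If login fails or the wrong account is active, re-authenticating usually resolves it; for example (the API key value is a placeholder):
```bash
# Re-run the interactive login
wandb login --relogin

# Or set the key explicitly for the current shell
export WANDB_API_KEY="<your-api-key>"
```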
GPU Metrics Not Showing
- Ensure nvidia-smi works inside the environment
- Use GPU-enabled containers (--gpus all)
- Call wandb.init() early
Connection Issues
- Verify outbound internet access
- Firewalls must allow connections to *.wandb.ai
- Use offline mode if required
Large File Uploads
- Always use artifacts for multi-GB files
- Compress large checkpoints
- Prune old versions
Need Help?
- W&B Docs: https://docs.wandb.ai
- Thunder Compute Discord: https://discord.gg/nwuETS9jJK
- Email support: [email protected]