GPU Spot Instance Interruption Rates (May 2025): Should You Risk Them for ML Training?
Fresh numbers on how often Spot GPUs are reclaimed, what that means for notebooks and checkpoints, and a safer alternative if you need zero downtime.
Published: Apr 16, 2025
Last updated: May 1, 2025

1 Why interruption rate matters
For Spot VMs, cost is predictable; availability is not. AWS can reclaim a Spot Instance with two minutes' notice, ending your process or stopping the VM. The interruption rate tells you how often instances of a given type were reclaimed over the trailing 30 days. Under 5% often feels safe; above 10%, you should plan for at least one cut-off during a day-long training run. (Source: Amazon Web Services)
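The two-minute notice is only useful if your job notices it. Below is a minimal sketch of how a training process might check for a pending interruption through the EC2 instance metadata service (IMDSv2). The function names `fetch_instance_action` and `parse_notice` are illustrative, not part of any AWS SDK, and the sketch assumes the instance allows metadata access from your process.

```python
import json
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw spot/instance-action body, or None if no notice is pending.

    Only works on an EC2 instance; elsewhere the request fails or times out.
    """
    # IMDSv2 requires a short-lived session token obtained with a PUT request.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption is scheduled
            return None
        raise


def parse_notice(body):
    """Turn the notice JSON into (action, time); None when there is no notice."""
    if not body:
        return None
    data = json.loads(body)
    return data["action"], data["time"]
```

A training loop could call `parse_notice(fetch_instance_action())` every few seconds and trigger an immediate checkpoint as soon as it returns something other than `None`.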
2 Latest GPU Spot numbers (May 2025)
GPU family | Typical interruption band* | Notes |
---|---|---|
A100 (p4d) | 5–10% | Stable in most regions |
L4 | <5% | Good budget option |
H100 (p5) | 10–20% | Capacity scarce; queues common |
*Data pulled from AWS Spot Instance Advisor on 1 May 2025. Individual availability zones vary.
AWS also states that “95% of Spot instances run to completion” across all types, but high-end GPUs tend to fall in the remaining 5%. (Source: nOps)
3 When Spot is a good deal
Long training runs with checkpoints – Restarting from the last save costs minutes, not hours.
Stateless batch inference – Interruptions just re-queue the next batch.
CI/CD test jobs – Failures are retried automatically.
Configure automatic checkpointing every 15 minutes and use persistent Spot requests so that AWS relaunches your instance when capacity returns. See the AWS interruption guide for the exact flags. (Source: AWS Documentation)
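The 15-minute checkpoint cadence can be sketched in a few lines. This is a minimal, framework-agnostic illustration, not a definitive implementation: `TimedCheckpointer` is a hypothetical helper, `pickle` stands in for whatever save routine your training framework provides, and the checkpoint path is assumed to live on storage that outlives the instance.

```python
import os
import pickle
import tempfile
import time


class TimedCheckpointer:
    """Save training state at most once per interval (default: 15 minutes)."""

    def __init__(self, path, interval_s=15 * 60, clock=time.monotonic):
        self.path = path
        self.interval_s = interval_s
        self.clock = clock
        self._last = clock()

    def maybe_save(self, state):
        """Save only if the interval has elapsed; return True when a save happened."""
        if self.clock() - self._last < self.interval_s:
            return False
        self.save(state)
        return True

    def save(self, state):
        """Write to a temp file, then rename, so a cut-off never leaves a half-written file."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, self.path)  # atomic rename within one filesystem
        except BaseException:
            if os.path.exists(tmp):
                os.remove(tmp)
            raise
        self._last = self.clock()

    def load(self, default=None):
        """Return the last saved state, or `default` on a fresh start."""
        try:
            with open(self.path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return default
```

In a training loop you would call `load()` once at startup to resume, then `maybe_save(state)` after each step; calling `save(state)` directly is the natural response to an interruption notice.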
4 When Spot backfires
Workflow trait | Risk | Why |
---|---|---|
Interactive notebooks | High | Kernel dies; unsaved code is lost. |
Rapid prototyping loops | High | Waiting to reacquire the same GPU stalls iteration. |
Tight launch deadlines | Medium | Unplanned restarts inflate wall-clock time. |
If you burn hours waiting for capacity, any headline “90% savings” quickly erodes.
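To see how quickly the headline discount erodes, here is a rough back-of-the-envelope model. Everything in it is an assumption made for illustration: the function `effective_savings`, the idea of pricing waiting time at a per-hour cost (for example, an engineer's time), and all of the numbers in the usage note. None of these figures come from AWS.

```python
def effective_savings(
    on_demand_rate,
    spot_discount,
    compute_hours,
    rework_hours=0.0,
    wait_hours=0.0,
    idle_cost_per_hour=0.0,
):
    """Fraction saved versus on-demand once rework and waiting are counted.

    on_demand_rate     -- on-demand price per GPU-hour (hypothetical input)
    spot_discount      -- headline Spot discount, e.g. 0.9 for "90% savings"
    compute_hours      -- useful hours the job needs
    rework_hours       -- hours recomputed after interruptions (billed at Spot rate)
    wait_hours         -- hours spent waiting for capacity (not billed by AWS)
    idle_cost_per_hour -- assumed cost of an hour of waiting, e.g. staff time
    """
    on_demand_cost = on_demand_rate * compute_hours
    spot_rate = on_demand_rate * (1 - spot_discount)
    spot_cost = spot_rate * (compute_hours + rework_hours)
    spot_cost += idle_cost_per_hour * wait_hours
    return 1 - spot_cost / on_demand_cost
```

With made-up inputs of a 10-per-hour on-demand rate, a 90% discount, and a 24-hour job, `effective_savings(10, 0.9, 24)` returns about 0.90. Add 2 hours of rework and 3 hours of waiting valued at 60 per hour, and `effective_savings(10, 0.9, 24, 2, 3, 60)` drops to about 0.14.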
5 A middle path: idle-time reuse
Network-attached GPUs from Thunder Compute are loaned to your process, then automatically returned to a shared pool within seconds of going idle. In practice, teams switching from Spot to Thunder Compute report 40–60% total savings without code changes and with jobs that never go dark. Use it when you want Spot-level pricing but can’t tolerate interruptions.
6 Decision checklist
Can you restart from a checkpoint? → Spot.
Need real-time notebooks, live prototyping, or demos? → Thunder Compute.
GPU is rare (e.g., H100) and queue times hurt delivery dates? → Thunder Compute or full on-demand.
Cost per hour is the only priority and deadlines are loose? → Spot.
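The checklist can also be written down as a small function, which is handy if you want to record the choice per project. This is only a sketch: `recommend` and its parameter names are hypothetical, and the precedence (interruption-sensitive conditions are checked first) is an assumption, since the checklist itself does not say which rule wins when several apply.

```python
from typing import Optional


def recommend(
    *,
    can_restart_from_checkpoint: bool,
    needs_interactive: bool,
    scarce_gpu_hurts_deadline: bool,
    cost_first_loose_deadline: bool,
) -> Optional[str]:
    """Map the checklist answers to a suggestion; None if no rule applies."""
    # Assumed precedence: anything that cannot tolerate interruptions wins.
    if needs_interactive:
        return "Thunder Compute"
    if scarce_gpu_hurts_deadline:
        return "Thunder Compute or full on-demand"
    if can_restart_from_checkpoint or cost_first_loose_deadline:
        return "Spot"
    return None
```

For example, a checkpointed overnight training run with no notebook work and no deadline pressure maps to `"Spot"`, while the same run on a scarce GPU with a fixed delivery date maps to `"Thunder Compute or full on-demand"`.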
Key takeaways
Most mainstream GPU Spot SKUs are interrupted less than 10% of the time, but H100 rates now run at 10–20%, up to double that.
Spot shines for restart-tolerant jobs; interactive work suffers.
Thunder Compute offers on-demand availability with the savings from improved utilization, giving you predictable uptime at Spot pricing.
Keep an eye on the AWS Spot Instance Advisor before every new project, and pick the model that protects your timeline as well as your budget.

Carl Peterson