Why does the AWS Spot interruption rate matter?

Spot VMs can be reclaimed with only two minutes’ notice. The interruption rate shows how often that happens; higher rates mean more unexpected shutdowns and lost progress.

What are the latest GPU Spot interruption rates (May 2025)?

As of 1 May 2025: A100 (p4d) ≈ 5–10 %, L4 < 5 %, H100 (p5) ≈ 10–20 %. Actual values differ by availability zone.

When is AWS Spot a good deal?

Use Spot for long training with checkpoints, stateless batch inference, or CI/CD tests—jobs that restart cheaply and tolerate delays.

When can AWS Spot backfire?

Interactive notebooks, fast prototyping loops, and tight launch deadlines suffer because interruptions kill kernels, stall iteration, and extend wall-clock time.

How does Thunder Compute avoid Spot interruptions?

Thunder Compute attaches GPUs only while your code is busy, then returns them to a pool. You keep continuous uptime at Spot-level pricing without any reclaim events.

Spot vs. Thunder Compute—how should I choose?

If you can safely restart from checkpoints, choose Spot. If you need real-time notebooks, live prototyping, demos, or scarce GPUs with guaranteed availability, choose Thunder Compute—the best option for uninterrupted development.


GPU Spot Instance Interruption Rates (May 2025): Should You Risk Them for ML Training?

Fresh numbers on how often Spot GPUs are reclaimed, what that means for notebooks and checkpoints, and a safer alternative if you need zero downtime.

Published: Apr 16, 2025

Last updated: May 1, 2025

1 Why interruption rate matters

For Spot VMs, cost is predictable; availability is not. AWS can reclaim a Spot Instance with two minutes’ notice, stopping or terminating the VM and ending your process with it. The interruption rate reported by the AWS Spot Instance Advisor tells you how often that happened over the last 30 days. Under 5% often feels safe; above 10%, you should plan for cut-offs during long training runs. (Source: Amazon Web Services)

2 Latest GPU Spot numbers (May 2025)

GPU family	Typical interruption band*	Notes
A100 (p4d)	5–10 %	Stable in most regions
L4	<5 %	Good budget option
H100 (p5)	10–20 %	Capacity scarce; queues common

*Data pulled from AWS Spot Instance Advisor on 1 May 2025. Individual availability zones vary.

AWS also states that “95% of Spot instances run to completion” across all instance types, but high-end GPUs are over-represented in the remaining 5%. (Source: nOps)

3 When Spot is a good deal

Long training runs with checkpoints – Restarting from the last save costs minutes, not hours.
Stateless batch inference – Interruptions just re-queue the next batch.
CI/CD test jobs – Failures are retried automatically.

Configure automatic checkpointing every 15 minutes and use persistent Spot requests so AWS relaunches your instance when capacity returns. See the AWS Spot interruption guide for the exact flags. (Source: AWS Documentation)
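The checkpoint-plus-notice pattern can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an AWS reference implementation: `run_step` and `save_checkpoint` are hypothetical callbacks you would supply, and the 5-second polling interval is an assumption. The `spot/instance-action` metadata path and the IMDSv2 token headers are the EC2 instance metadata endpoints; the path returns 404 until an interruption is scheduled.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the scheduled Spot action as a dict, or None if none is pending."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise
    except OSError:  # endpoint unreachable, e.g. not running on EC2
        return None


def train(total_steps, run_step, save_checkpoint,
          get_action=fetch_instance_action, clock=time.monotonic,
          checkpoint_every_s=15 * 60, poll_every_s=5):
    """Run steps, checkpointing on a timer and on an interruption notice.

    Returns the step at which an interruption was seen, or None if the
    loop finished. run_step and save_checkpoint are hypothetical callbacks.
    """
    last_save = last_poll = clock()
    for step in range(total_steps):
        run_step(step)
        now = clock()
        if now - last_poll >= poll_every_s:
            last_poll = now
            if get_action() is not None:
                save_checkpoint(step)  # final save inside the notice window
                return step
        if now - last_save >= checkpoint_every_s:
            save_checkpoint(step)
            last_save = now
    return None
```

The metadata lookup and the clock are passed in as parameters so the loop can be exercised off EC2 with stand-ins. On a real instance, the final save has to finish inside the two-minute notice window, so checkpoint size matters.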

4 When Spot backfires

Workflow trait	Risk	Why
Interactive notebooks	High	Kernel dies; unsaved code is lost.
Rapid prototyping loops	High	Waiting to reacquire the same GPU stalls iteration.
Tight launch deadlines	Medium	Unplanned restarts inflate wall-clock time.

If you burn hours waiting for capacity, any headline “90% savings” quickly erodes.
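To see how that erosion works, here is a rough back-of-the-envelope sketch. Every input is an assumption for illustration: the rates are made up, `rework_hours_each` stands for compute repeated after each interruption, and `wait_cost_per_hour` stands for whatever an hour of waiting on capacity costs your team. None of these figures come from AWS or Thunder Compute.

```python
def effective_savings(on_demand_rate, spot_discount, job_hours,
                      interruptions=0, rework_hours_each=0.0,
                      wait_hours=0.0, wait_cost_per_hour=0.0):
    """Fraction saved versus running the same job on-demand without interruptions."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    # Billed Spot time includes work repeated since the last checkpoint.
    billed_hours = job_hours + interruptions * rework_hours_each
    # Waiting for capacity is not billed by AWS, but it still costs the team.
    spot_total = spot_rate * billed_hours + wait_hours * wait_cost_per_hour
    on_demand_total = on_demand_rate * job_hours
    return 1 - spot_total / on_demand_total


# Hypothetical 24-hour job at a 90% headline discount.
clean_run = effective_savings(10.0, 0.9, 24)
bumpy_run = effective_savings(10.0, 0.9, 24, interruptions=4,
                              rework_hours_each=0.25,
                              wait_hours=6, wait_cost_per_hour=20.0)
print(f"clean: {clean_run:.0%}, bumpy: {bumpy_run:.0%}")
```

With these invented numbers, four interruptions and six hours of waiting pull the effective saving from 90% down to roughly 40%. The waiting term dominates; the repeated compute is cheap by comparison.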

5 A middle path: idle-time reuse

Network-attached GPUs from Thunder Compute are loaned to your process, then returned to a pool within seconds when idle, without you having to think about it. In practice, teams moving from Spot to Thunder Compute report 40–60 % total savings without code changes and with jobs that never go dark. Use it when you want Spot-level pricing but can’t tolerate interruptions.

6 Decision checklist

Can you restart from a checkpoint? → Spot.
Need real-time notebooks, live prototyping, or demos? → Thunder Compute.
GPU is rare (e.g., H100) and queue times hurt delivery dates? → Thunder Compute or full on-demand.
Pure cost-per-hour rules and deadlines are loose? → Spot.
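The checklist can also be written as a small function. This is only a sketch: the checklist does not say which rule wins when several apply, so the precedence below (interactive needs first, then GPU scarcity, then restartability) and the fallback are assumptions, and `Workload` and `recommend` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    can_restart_from_checkpoint: bool
    needs_interactive: bool    # real-time notebooks, live prototyping, demos
    gpu_is_scarce: bool        # e.g. H100, where queue times hurt delivery dates
    deadlines_are_loose: bool  # pure cost-per-hour rules


def recommend(w: Workload) -> str:
    """Map the checklist answers to a recommendation (assumed precedence)."""
    if w.needs_interactive:
        return "Thunder Compute"
    if w.gpu_is_scarce and not w.deadlines_are_loose:
        return "Thunder Compute or full on-demand"
    if w.can_restart_from_checkpoint or w.deadlines_are_loose:
        return "Spot"
    # Assumed fallback: no restart path and a firm deadline.
    return "Thunder Compute or full on-demand"


print(recommend(Workload(True, False, False, True)))
```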

Key takeaways

Most mainstream GPU Spot SKUs are interrupted less than 10 % of the time, but H100 rates now run 10–20 %, up to double that.
Spot shines for restart-tolerant jobs; interactive work suffers.
Thunder Compute offers on-demand availability with the savings from improved utilization, giving you predictable uptime at Spot pricing.

Keep an eye on the AWS Spot Instance Advisor before every new project, and pick the model that protects your timeline as well as your budget.

Carl Peterson
