GPU Spot Instance Interruption Rates (May 2025): Should You Risk Them for ML Training?

Fresh numbers on how often Spot GPUs are reclaimed, what that means for notebooks and checkpoints, and a safer alternative if you need zero downtime.

Published: Apr 16, 2025 | Last updated: May 1, 2025

Spot instances

1 Why interruption rate matters

For Spot VMs, cost is predictable; availability is not. AWS can reclaim a Spot Instance with two minutes’ notice, after which the instance is terminated, stopped, or hibernated, depending on how the request is configured. The interruption rate tells you how often that happened over the trailing 30 days. Under 5% often feels safe; above 10%, plan for at least one cut-off during a day-long training run. Amazon Web Services
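To make the two-minute notice concrete, here is a minimal sketch of how a training process might check for a pending interruption and work out how much time is left. The endpoint paths and the JSON fields (`action`, `time`) follow the EC2 instance metadata service as AWS documents it, but they are not taken from this article, so treat them as assumptions and check the current docs. The function names are illustrative.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumed base URL of the EC2 instance metadata service (IMDSv2).
IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str, now: datetime) -> Tuple[str, float]:
    """Return (action, seconds_remaining) from a spot/instance-action document."""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return doc["action"], (deadline - now).total_seconds()


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice, or None when no interruption is scheduled (HTTP 404).

    Only works when called from inside an EC2 instance.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption pending
            return None
        raise
```

A training loop could call `fetch_instance_action()` every few seconds and, once it returns a notice, use the remaining seconds to flush a final checkpoint before the instance goes away.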

2 Latest GPU Spot numbers (May 2025)

| GPU family | Typical interruption band* | Notes |
| --- | --- | --- |
| A100 (p4d) | 5–10% | Stable in most regions |
| L4 | <5% | Good budget option |
| H100 (p5) | 10–20% | Capacity scarce; queues common |

*Data pulled from AWS Spot Instance Advisor on 1 May 2025. Individual availability zones vary.

AWS also states that “95% of Spot instances run to completion” across all instance types, but high-end GPUs fall disproportionately into the remaining 5%. nOps

3 When Spot is a good deal

  1. Long training runs with checkpoints – Restarting from the last save costs minutes, not hours.

  2. Stateless batch inference – Interruptions just re-queue the next batch.

  3. CI/CD test jobs – Failures are retried automatically.

Configure automatic checkpointing every 15 minutes and use persistent Spot requests so AWS restarts your VM when capacity returns. See the interruption guide for exact flags. AWS Documentation
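The checkpoint-and-resume pattern described above can be sketched in a few lines. This is a framework-agnostic illustration, not the flags from the AWS guide: the `state` dictionary and the arithmetic inside the loop stand in for real model and optimizer state, and the function names are hypothetical. The checkpoint path is assumed to live on storage that outlives the instance.

```python
import os
import pickle
import tempfile
import time

CHECKPOINT_EVERY_S = 15 * 60  # the 15-minute cadence suggested above


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, then rename, so an interruption mid-write
    never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic swap on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path: str, total_steps: int,
          interval_s: float = CHECKPOINT_EVERY_S,
          clock=time.monotonic) -> dict:
    state = load_checkpoint(path)  # resume where the last run stopped
    last_save = clock()
    while state["step"] < total_steps:
        state["step"] += 1
        state["total"] += state["step"]  # placeholder for a real training step
        if clock() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = clock()
    save_checkpoint(state, path)  # final save at the end of the run
    return state
```

If the instance is reclaimed and later relaunched, calling `train()` again with the same path picks up from the most recent save, so the work lost is at most one checkpoint interval.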

4 When Spot backfires

| Workflow trait | Risk | Why |
| --- | --- | --- |
| Interactive notebooks | High | Kernel dies; unsaved code is lost. |
| Rapid prototyping loops | High | Waiting to reacquire the same GPU stalls iteration. |
| Tight launch deadlines | Medium | Unplanned restarts inflate wall-clock time. |

If you burn hours waiting for capacity, any headline “90% savings” quickly erodes.
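A back-of-the-envelope model shows where the erosion comes from. Everything here is an assumption made for illustration: the hourly rate, the number of interruptions, the idea that each interruption loses half a checkpoint interval on average, and the `delay_cost_per_h` parameter, which puts a dollar value on hours your team spends waiting. None of these figures come from AWS or from measured data.

```python
from dataclasses import dataclass


@dataclass
class SpotRun:
    on_demand_rate: float            # $/hour (hypothetical)
    spot_discount: float             # 0.90 for a headline "90% savings"
    useful_hours: float              # compute the job actually needs
    interruptions: int
    checkpoint_interval_h: float
    restart_overhead_h: float        # billed time to boot and reload state
    wait_h_per_interruption: float   # unbilled time waiting for capacity
    delay_cost_per_h: float = 0.0    # assumed value of an hour of delay

    def billed_hours(self) -> float:
        # Assumption: each interruption loses half a checkpoint interval
        # on average, plus the restart overhead.
        redo = self.checkpoint_interval_h / 2 + self.restart_overhead_h
        return self.useful_hours + self.interruptions * redo

    def wall_clock_hours(self) -> float:
        waiting = self.interruptions * self.wait_h_per_interruption
        return self.billed_hours() + waiting

    def effective_savings(self) -> float:
        """Savings versus running the same job uninterrupted on-demand."""
        spot_rate = self.on_demand_rate * (1 - self.spot_discount)
        compute = self.billed_hours() * spot_rate
        delay = self.delay_cost_per_h * (self.wall_clock_hours() - self.useful_hours)
        on_demand = self.useful_hours * self.on_demand_rate
        return 1 - (compute + delay) / on_demand


run = SpotRun(10.0, 0.90, 24, 3, 0.25, 0.25, 2.0)
print(f"compute-only savings: {run.effective_savings():.1%}")
print(f"wall-clock hours:     {run.wall_clock_hours():.2f}")
```

With these made-up inputs the compute bill alone still shows roughly 89.5% savings, but the 24-hour job takes about 31 hours of wall-clock time. Setting `delay_cost_per_h=20.0` drops the effective savings to roughly 30%, which is the erosion the paragraph above describes.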

5 A middle path: idle-time reuse

Network-attached GPUs from Thunder Compute are loaned to your process and returned to a shared pool within seconds of going idle, with no action required on your part. In practice, teams moving from Spot to Thunder report 40–60% total savings without code changes and with jobs that never go dark. Use it when you want Spot-level pricing but can’t tolerate interruptions.

6 Decision checklist

  • Can you restart from a checkpoint? → Spot.

  • Need real-time notebooks, live prototyping, or demos? → Thunder Compute.

  • GPU is rare (e.g., H100) and queue times hurt delivery dates? → Thunder Compute or full on-demand.

  • Pure cost-per-hour rules and deadlines are loose? → Spot.
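The checklist can also be written down as a small function, which is handy if you want to apply it consistently across many jobs. The order of the checks is an assumption: the list above does not rank its items, so this sketch lets the constraints Spot cannot satisfy win first. The final `"on-demand"` fallback is also an assumption, since the checklist does not say what to do when no item applies.

```python
def recommend(can_resume: bool, interactive: bool,
              scarce_gpu_tight_deadline: bool, cost_first: bool) -> str:
    """Map the four checklist questions to a suggested option."""
    # Assumed precedence: interactivity and delivery risk outrank cost.
    if interactive:
        return "Thunder Compute"
    if scarce_gpu_tight_deadline:
        return "Thunder Compute or on-demand"
    if can_resume or cost_first:
        return "Spot"
    return "on-demand"  # assumption: no explicit fallback in the checklist


# A checkpointed overnight training run with loose deadlines:
print(recommend(can_resume=True, interactive=False,
                scarce_gpu_tight_deadline=False, cost_first=True))
```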

Key takeaways

  • Most mainstream GPU Spot SKUs are interrupted less than 10% of the time, but H100 rates now sit at 10–20%, up to double that.

  • Spot shines for restart-tolerant jobs; interactive work suffers.

  • Thunder Compute offers on-demand availability with the savings from improved utilization, giving you predictable uptime at Spot pricing.

Keep an eye on the AWS Spot Instance Advisor before every new project, and pick the model that protects your timeline as well as your budget.

Carl Peterson
