NVIDIA H100 Specs: Full Guide (2026) — All Variants, Benchmarks and Pricing

Carl Peterson | June 24, 2026 | 17 min read

Released in 2022, the NVIDIA H100 debuted the Hopper architecture and delivered step-change gains in tensor performance, memory bandwidth, and efficiency over the A100.

It remains the practical workhorse for most AI projects nearly four years later, having been used to train 28 notable AI models in 2025 alone, according to the 2026 AI Index Report.

Newer GPUs like the B200 have since entered the market, but they are expensive, scarce, and hard to rent. The H100 fills that gap. GH100 is the die name, but the products built around it vary significantly in memory, bandwidth, power, and interconnect. This guide covers all of it: form factors, full specifications, precision formats, MIG partitioning, real-world performance, and comparisons to the A100, H200, and Blackwell.

For pricing and provider comparisons, see the companion posts: NVIDIA H100 pricing and NVIDIA A100 vs H100.

H100 Form Factors

The H100 comes in two physical form factors (PCIe and SXM) plus a third specialized configuration (NVL):

PCIe: For standard servers; no NVLink bridge, lower bandwidth, simpler deployment.
NVL: A pre-configured dual-H100 PCIe setup with NVLink and more total memory per pair.
SXM: Designed for high-performance systems; supports full NVLink connectivity and the highest bandwidth.

NVIDIA H100 - GH100 die

H100 PCIe

The H100 PCIe fits into existing PCIe Gen5 server slots and prioritizes flexible deployment over peak performance. It draws around 350W and is typically deployed without an NVLink bridge, making it the most practical option for single-GPU inference or fine-tuning.

The H100 PCIe Product Brief confirms a 350W TDP and memory bandwidth exceeding 2,000 GB/s.

Lower power consumption (350W TDP)
No NVLink bridge
Ideal for inference and single-GPU workloads

NVIDIA H100 PCIe GPU

H100 NVL

The H100 NVL pairs two H100 PCIe GPUs through NVLink into a tightly coupled configuration for large-scale inference. It provides 94 GB HBM3 per GPU at 3.9 TB/s bandwidth, making it well suited for LLM serving and recommendation systems that push beyond what a single PCIe card can handle.

The H100 NVL Product Brief lists a 400W TDP and memory bandwidth of nearly 4,000 GB/s.

Dual-GPU configuration
94 GB HBM3 per GPU at 3.9 TB/s bandwidth
High-bandwidth GPU-to-GPU interconnect
Designed for LLM serving and recommendation systems

4 NVIDIA H100 NVL GPUs connected with NVLink in a server

H100 SXM

The SXM form factor is built for maximum performance. It operates at up to 700W and enables full NVLink 4.0 connectivity, allowing GPUs to communicate at 900 GB/s with minimal latency. This is the variant used in DGX and HGX systems for large-scale AI training.

Up to 700W TDP
Full NVLink 4.0 support (900 GB/s per GPU)
Used in HGX and DGX systems
Best for distributed, multi-GPU training

NVIDIA H100 SXM module for high-performance servers

H100 Specifications

All H100 variants share the same underlying Hopper architecture and GH100 die. Specifications diverge across form factors in memory type, bandwidth, CUDA core count, and interconnect.

Common Features For All H100 GPUs
Architecture	Hopper
Die	GH100
Tensor Core Generation	4th
Compatibility	TensorRT, cuDNN, NCCL
AI Frameworks	PyTorch, TensorFlow, JAX

NVIDIA's official datasheet leads with sparsity-enabled figures. Dense (non-sparsity) values are half the listed TFLOPS and reflect what you get without explicit 2:4 structured sparsification of the model weights. Most production workloads use dense values.

Specification	H100 PCIe	H100 NVL	H100 SXM
Streaming Multiprocessors	114 SMs	132 SMs per GPU	132 SMs
CUDA Cores	14,592	16,896	16,896
Tensor Cores (4th Gen)	456	528	528
VRAM	80 GB HBM2e	94 GB HBM3	80 GB HBM3
Memory Bandwidth	2.0 TB/s	~3.9 TB/s	~3.35 TB/s
L2 Cache	50 MB	50 MB	50 MB
FP64 (TFLOPS)	26	30	34
FP64 Tensor (TFLOPS)	51	60	67
FP32 (TFLOPS)	51	60	67
TF32 Tensor* (TFLOPS)	756	835	989
BF16 Tensor* (TFLOPS)	1,513	1,671	1,979
FP16 Tensor* (TFLOPS)	1,513	1,671	1,979
FP8 Tensor* (TFLOPS)	3,026	3,341	3,958
INT8 Tensor* (TOPS)	3,026	3,341	3,958
Power Consumption	~350W	~350–400W	Up to 700W
Interconnect	PCIe Gen5 only	NVLink + PCIe Gen5	NVLink 4.0 + PCIe Gen5
NVLink Bandwidth	N/A	~600 GB/s	900 GB/s per GPU
MIG Support	Up to 7 instances	Up to 7 instances	Up to 7 instances
* Shown with sparsity. Dense values are half the listed figure.
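
The halving rule in the footnote can be applied mechanically. The sketch below copies the sparsity-enabled Tensor Core rows from the table and derives the dense figures; the dictionary layout and function name are illustrative choices, not part of any NVIDIA tool.

```python
# Sparsity-enabled Tensor Core figures copied from the table above
# (TFLOPS, except INT8 which is TOPS).
SPARSE_TFLOPS = {
    "TF32": {"PCIe": 756, "NVL": 835, "SXM": 989},
    "BF16": {"PCIe": 1513, "NVL": 1671, "SXM": 1979},
    "FP16": {"PCIe": 1513, "NVL": 1671, "SXM": 1979},
    "FP8": {"PCIe": 3026, "NVL": 3341, "SXM": 3958},
    "INT8": {"PCIe": 3026, "NVL": 3341, "SXM": 3958},
}


def dense_throughput(sparse_value: float) -> float:
    """Dense throughput is half the sparsity-enabled figure."""
    return sparse_value / 2


# Build the dense table with the same shape as the sparse one.
DENSE_TFLOPS = {
    fmt: {variant: dense_throughput(value) for variant, value in row.items()}
    for fmt, row in SPARSE_TFLOPS.items()
}

for fmt, row in DENSE_TFLOPS.items():
    print(fmt, row)
```

The FP64 and FP32 rows are left out on purpose: they carry no asterisk in the table, so the halving rule does not apply to them.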

Sources: NVIDIA H100 Datasheet | NVIDIA H100 Product Page

Thunder Compute offers H100 PCIe GPUs starting at $1.38/hr, with no long-term commitments and no infrastructure setup required.

H100 Supported Precisions and the Transformer Engine

The H100 expands precision support through its 4th-generation Tensor Cores and a dedicated Transformer Engine.

The Transformer Engine is the most important architectural addition in Hopper. It automatically selects between FP8 and FP16 precision on a per-operation basis during training, with no manual intervention required. On transformer-based models like GPT and LLaMA, this delivers 3-4x the throughput of the A100, according to the NVIDIA H100 product page.

Frameworks including vLLM, TensorRT-LLM, and the transformer_engine PyTorch package unlock this gain automatically; stacks that fall back to BF16 do not.
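
In PyTorch, the FP8 path is usually switched on by wrapping the forward pass in the transformer_engine autocast context. The sketch below shows one way that might look, assuming a CUDA GPU with the torch and transformer_engine packages installed. The layer size, batch size, and the helper names supports_fp8 and fp8_forward are illustrative assumptions. The imports are deferred so the capability check can run on any machine.

```python
def supports_fp8(compute_capability: tuple[int, int]) -> bool:
    """FP8 Tensor Cores need compute capability 8.9 or newer; the H100 reports 9.0."""
    return compute_capability >= (8, 9)


def fp8_forward(batch_size: int = 16, hidden: int = 768):
    """Run one FP8 forward pass through a Transformer Engine linear layer.

    Needs a CUDA GPU plus the torch and transformer_engine packages,
    so the imports are deferred until the function is called.
    """
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    if not supports_fp8(torch.cuda.get_device_capability()):
        raise RuntimeError("This GPU has no FP8 Tensor Cores")

    layer = te.Linear(hidden, hidden, bias=True)  # placed on the GPU by default
    inp = torch.randn(batch_size, hidden, device="cuda")

    # HYBRID uses E4M3 in the forward pass and E5M2 for gradients.
    fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        return layer(inp)


print(supports_fp8((9, 0)))  # H100 (Hopper)
print(supports_fp8((8, 0)))  # A100 (Ampere)
```

Serving stacks such as vLLM and TensorRT-LLM hide this step behind their own configuration, so the explicit context manager mostly matters for custom training loops.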

The full precision stack the H100 supports:

FP64: Scientific and HPC workloads
FP32: Traditional deep learning compute
TF32: Drop-in replacement for FP32 with higher throughput on Tensor Cores
FP16: Standard deep learning training
BF16: Numerically stable alternative to FP16 for training
FP8: Highest-throughput format for LLM training and inference (introduced with Hopper; not available on the A100)
INT8: Quantized inference

FP8 delivers throughput roughly double that of FP16 while maintaining acceptable accuracy for most LLM workloads. FP8 tooling on H100 is production-proven across PyTorch, vLLM, and TensorRT-LLM, which is a maturity advantage over newer Blackwell hardware.

Chart of H100 CUDA cores, tensor cores, and supported precision formats

H100 vs A100: Key Differences

The H100 is a substantial generational leap over the A100, not a minor refresh. The table below covers the most important differences for teams evaluating both.

Specification	H100 SXM	A100 SXM	Notes
Architecture	Hopper (GH100)	Ampere (GA100)	Different die and process
VRAM	80 GB HBM3	80 GB HBM2e	Same capacity, faster type on H100
Memory Bandwidth	3.35 TB/s	2.0 TB/s	68% faster on H100
FP16 TFLOPS (dense)	989	312	~3.2x higher on H100
FP8 TFLOPS (dense)	1,979	N/A	A100 has no FP8 support
Transformer Engine	Yes (FP8/FP16 auto-switching)	No	Defines H100's LLM advantage
NVLink Generation	NVLink 4.0 (900 GB/s)	NVLink 3.0 (600 GB/s)	50% more inter-GPU bandwidth on H100
MIG Support	Up to 7 instances	Up to 7 instances	Equivalent partition count
Confidential Computing	Yes (hardware TEE)	No	H100 is first GPU with on-die TEE
Purchase price (2026)¹	$25,000–$40,000	$10,000–$15,000	Secondary market estimates
¹ Secondary market pricing as of Q2 2026. Subject to change.
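
The bandwidth percentages in the Notes column follow directly from the two bandwidth rows. A minimal sketch of the arithmetic, using only the figures in the table (the helper name is illustrative):

```python
def uplift_percent(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return (new / old - 1) * 100


# Figures from the comparison table (H100 SXM vs A100 SXM).
memory_bw = uplift_percent(3.35, 2.0)  # TB/s
nvlink_bw = uplift_percent(900, 600)   # GB/s

print(f"Memory bandwidth: +{memory_bw:.1f}%")
print(f"NVLink bandwidth: +{nvlink_bw:.1f}%")
```

The memory figure comes out at about 67.5%, which the table rounds to 68%.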

For a full breakdown, see the NVIDIA A100 vs H100 comparison guide.

H100 vs H200: What Changed

The H200 uses the same GH100 compute die as the H100. CUDA core count, Tensor Core count, and FP8/FP16 compute performance are identical. The upgrade is entirely in the memory subsystem.

Specification	H100 SXM	H200 SXM
Die	GH100	GH100
VRAM	80 GB HBM3	141 GB HBM3e
Memory Bandwidth	3.35 TB/s	~4.8 TB/s
FP8 TFLOPS (with sparsity)	3,958	3,958
CUDA Cores	16,896	16,896
Transformer Engine	Yes	Yes
Power (TDP)	700W	700W

For workloads that fit within 80 GB VRAM, H100 and H200 perform nearly identically. The H200's 141 GB matters for models that exceed 80 GB at full precision, very long context windows, or large-batch inference where capacity is the binding constraint.
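
A rough way to judge which side of the 80 GB line a model falls on is to multiply parameter count by bytes per parameter. The sketch below does that for weights only. It assumes 1 GB = 1e9 bytes and ignores KV cache, activations, and framework overhead, so real headroom is smaller than it suggests. The model sizes are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}
VRAM_GB = {"H100 SXM": 80, "H200 SXM": 141}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB, ignoring KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, gpu: str) -> bool:
    """True if the weights alone fit in the GPU's memory."""
    return weights_gb(params_billion, precision) <= VRAM_GB[gpu]


for params, precision in [(13, "FP16"), (70, "FP16"), (70, "FP8")]:
    size = weights_gb(params, precision)
    verdict = {gpu: fits(params, precision, gpu) for gpu in VRAM_GB}
    print(f"{params}B @ {precision}: {size:.0f} GB -> {verdict}")
```

Under these assumptions a 70B model at FP16 needs about 140 GB for weights alone, which rules out a single H100 and leaves almost no room on an H200. The same model at FP8 fits on either card.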

MIG: Running Multiple Workloads on One H100

Multi-Instance GPU (MIG) partitions a single H100 into up to 7 fully isolated GPU instances. Each instance receives a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM memory with hardware-enforced isolation. MIG instances appear as separate GPU devices to the OS, so each runs its own container independently.

On the H100 SXM5, a full 7-way partition gives each instance approximately 10 GB HBM3 and 2,048 CUDA cores. The two most common production patterns are:

7x 1g.10gb: Multi-tenant inference of small models (Mistral 7B, Llama 3 8B) with per-job cost efficiency
2x 3g.40gb: Two simultaneous 13B model servers with balanced memory and compute
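
The two patterns above can be sanity-checked against the GPU's budget of 7 compute slices and 80 GB. The sketch below is a simplified check: it parses profile names of the form Ng.Mgb and sums the slices and memory. Real MIG placement rules are stricter than a simple sum, so treat a passing result as necessary rather than sufficient. The function names are illustrative.

```python
import re

TOTAL_COMPUTE_SLICES = 7  # maximum MIG instances on one H100
TOTAL_MEMORY_GB = 80      # H100 80 GB variants


def parse_profile(name: str) -> tuple[int, int]:
    """Split a profile name such as '3g.40gb' into (compute slices, memory GB)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if match is None:
        raise ValueError(f"Unrecognized MIG profile: {name!r}")
    return int(match.group(1)), int(match.group(2))


def layout_fits(profiles: list[str]) -> bool:
    """Simplified check: total compute slices and memory stay within one GPU."""
    parsed = [parse_profile(p) for p in profiles]
    compute = sum(c for c, _ in parsed)
    memory = sum(m for _, m in parsed)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB


print(layout_fits(["1g.10gb"] * 7))  # seven small instances
print(layout_fits(["3g.40gb"] * 2))  # two mid-size instances
print(layout_fits(["3g.40gb"] * 3))  # exceeds both budgets
```

Note that the 2x 3g.40gb layout uses all 80 GB but only 6 of the 7 compute slices.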

MIG is most valuable for inference teams where a full 80 GB GPU is more than any single request needs. At Thunder Compute, H100 PCIe instances are available from $1.38/hr with no minimum commitment, making MIG-style workload splitting accessible for most inference use cases.

H100 Confidential Computing

The H100 was the first GPU to implement hardware-enforced confidential computing, using an on-die Trusted Execution Environment (TEE). Data transfers between CPU and GPU are encrypted using AES-256, executed on-die. A peer-reviewed benchmark study (arXiv 2409.03992) found the average overhead is below 7%, and efficiency approaches 99% for large models like Llama 3.1-70B, where GPU compute time dominates over I/O overhead.

This is particularly relevant for regulated industries: healthcare (HIPAA-sensitive model training), finance (proprietary model IP), and government (sensitive inference workloads). For most developer use cases, confidential computing is not required, but no GPU generation before Hopper offers a comparable hardware feature.

H100 Chips

GH100: GPU Chip

The GH100 is the core silicon behind all H100 GPUs, built specifically for large-scale AI and high-performance computing workloads. It combines massive parallelism with a Transformer Engine, enabling efficient execution of modern deep learning models across a range of precisions.

Built on TSMC 4N process
~80 billion transistors
Optimized for tensor operations and parallel compute

Given that the GH100 die powers all H100 variants, differences in performance between them come from power limits, memory, and interconnects.

GH200: The Grace Hopper Superchip

The GH200 combines a GH100 GPU with NVIDIA's Grace ARM CPU into a single superchip, connected via NVLink-C2C at 900 GB/s. This design enables a unified memory architecture that dramatically increases the effective bandwidth available to applications running across both CPU and GPU.

Unified memory architecture
Designed for memory-intensive AI and HPC workloads
Ideal for models requiring terabyte-scale memory access

While NVIDIA had previously developed CPUs for mobile and embedded systems, Grace represents its first major push into data center CPUs, purpose-built for AI and high-performance computing workloads.

NVIDIA Grace Hopper GH200 superchip

H100 Real-World Performance

The H100's advantage over the A100 comes from three compounding improvements: the Transformer Engine's FP8 precision, 68% higher memory bandwidth on the SXM variant, and NVLink 4.0's 50% increase in inter-GPU bandwidth. In practice, H100 clusters have reduced large-model training times by 2 to 3 times compared to equivalent A100 configurations. For inference, gains of 2 to 4 times on large transformer models are common, depending on model size and batch configuration.

According to SemiAnalysis's InferenceX benchmarks from April 2026, H100 delivers inference at approximately $0.09 per 1M tokens for a 120B parameter model using vLLM. At Thunder Compute's $1.38/hr rate, this makes H100 one of the most cost-efficient inference options currently available.
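
Cost per token and hourly price are linked by sustained throughput. The sketch below shows the conversion in both directions. The throughput it prints is implied arithmetic from the two figures above, not a measured benchmark result, and the hourly cost the benchmark itself assumed is not stated here. The function names are illustrative.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per 1M generated tokens for a GPU billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def required_tokens_per_second(hourly_rate_usd: float, target_usd_per_million: float) -> float:
    """Sustained throughput needed to hit a target cost per 1M tokens."""
    return hourly_rate_usd / target_usd_per_million * 1_000_000 / 3600


needed = required_tokens_per_second(1.38, 0.09)
print(f"Throughput needed for $0.09 per 1M tokens at $1.38/hr: {needed:,.0f} tokens/s")
```

At $1.38/hr, reaching $0.09 per 1M tokens requires roughly 4,260 tokens per second of sustained aggregate throughput. Lower utilization raises the effective cost per token proportionally.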

The H100 was used to train 28 notable AI models in 2025 alone, a figure that continues to rise as it replaces A100 in new deployments, per the 2026 AI Index Report.

Bar chart: number of notable AI models trained on each GPU accelerator (A100, H100, and others), with the H100 trending upward

H100 Systems

H100 GPUs are typically deployed as part of integrated systems that combine multiple GPUs with high-speed interconnects, optimized power delivery, and coordinated cooling.

DGX H100

The DGX H100 integrates 8 H100 SXM GPUs with dual Intel Xeon CPUs, optimized networking, and storage in a fully interconnected, turnkey configuration.

It's designed as a single-node foundation for large language model training. A complete DGX H100 system costs approximately $300,000 to $400,000 new, based on 2026 market pricing data.

8x H100 SXM GPUs
Dual Intel Xeon CPUs
Fully interconnected with NVLink
Petaflop-scale AI performance

NVIDIA DGX H100 system with eight H100 SXM GPUs

HGX H100

HGX is a modular platform used by cloud providers and OEMs. Unlike DGX systems, which are fully integrated, HGX provides the core GPU baseboard that partners build around with their hardware and infrastructure.

Configurations with 4, 8, or more H100 GPUs
NVLink-enabled GPU fabric

HGX platforms form the backbone of most modern AI cloud infrastructure, and most H100 instances available from cloud GPU providers are built on HGX-class hardware.

NVIDIA HGX H100 platform for multi-GPU AI infrastructure

H100 Pricing and Cloud Access in 2026

Purchasing H100 hardware outright is expensive. H100 PCIe cards sell for approximately $25,000 to $30,000 through authorized resellers; full DGX H100 systems run $300,000 to $400,000. Secondary market SXM units trade between $6,000 and $15,000 as Blackwell supply expands, but lead times vary significantly.

For most developers and teams, renting on-demand is the practical path. Cloud rental rates have dropped from a 2023 peak of $8 to $12/hr to a 2026 market median around $2.29/hr, with hyperscalers still pricing in the $6 to $11 range.

Provider Tier	H100 Rental Rate (2026)¹	Notes
Thunder Compute	From $1.38/hr (PCIe)	No minimum commitment, VS Code + Cursor extensions, one-click templates
Market median (independent providers)	~$2.29/hr	Fluence, Spheron, Vast.ai, RunPod range
Hyperscalers (AWS, GCP, Azure)	$6–$11/hr	Higher availability guarantees, higher overhead cost
¹ Last update: June 24, 2026. Spot instances available at lower rates with interruption risk.
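
To put the buy-versus-rent decision in numbers, the sketch below computes how many rental hours equal the lower-bound $25,000 PCIe purchase price quoted above. It deliberately ignores power, hosting, staffing, and resale value, all of which shift the real break-even point, and the function name is illustrative.

```python
def break_even_hours(purchase_price_usd: float, hourly_rate_usd: float) -> float:
    """Rental hours whose total cost equals the purchase price."""
    return purchase_price_usd / hourly_rate_usd


HOURS_PER_YEAR = 24 * 365

for label, rate in [("Thunder Compute PCIe", 1.38), ("Market median", 2.29)]:
    hours = break_even_hours(25_000, rate)
    print(f"{label}: {hours:,.0f} h (~{hours / HOURS_PER_YEAR:.1f} years at 24/7 use)")
```

At $1.38/hr the card only pays for itself after about two years of continuous use, which is why renting tends to win for intermittent workloads.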

Beyond pricing, the development workflow matters. Thunder Compute provides VS Code and Cursor extensions that connect you to an H100 instance directly from your IDE, plus one-click templates for image generation and local LLM serving. It is the fastest path to a running H100 environment without any infrastructure setup.

See Thunder Compute's current H100 availability and pricing →

Why the H100 Still Matters in 2026

The Blackwell B200 offers higher raw throughput, but it requires liquid cooling, draws significantly more power, and commands a substantial price premium.

Most existing training pipelines were built on H100, and migrating to Blackwell carries real engineering cost. H100 rental rates have also fallen 64 to 75% from their 2023 peak, making the hardware more cost-accessible than at any point since launch.

For teams whose models fit in 80 GB and whose workloads run efficiently on the existing CUDA/FP8 stack, the H100 is the clear practical choice in 2026. For frontier-scale training runs that require maximum throughput per GPU-hour, B200 may justify the premium, but availability remains constrained.

Last Thoughts on NVIDIA H100 Specs

The H100 remains the de facto standard for large-model training and production inference. Its Transformer Engine, 80 GB of high-bandwidth memory, and NVLink 4.0 interconnect cover everything from single-GPU fine-tuning to distributed training at scale.

At $1.38/hr on Thunder Compute, with VS Code and Cursor IDE integration and one-click templates, getting from zero to a running H100 environment takes minutes.

Get started with an H100 on Thunder Compute →

To match the right GPU to your workload, see the GPU selection guide for AI workflows.

FAQ

What are the main form factors of the NVIDIA H100?

The H100 comes in three configurations: PCIe (standard servers, no NVLink, 350W), NVL (dual PCIe GPUs bridged via NVLink, 94 GB HBM3 per GPU), and SXM (high-performance module with full NVLink 4.0, 700W TDP, used in DGX and HGX systems).

What is the difference between H100 PCIe and H100 SXM?

The PCIe variant uses HBM2e at ~2 TB/s bandwidth, draws 350W, and fits in standard servers without NVLink support. The SXM variant uses HBM3 at 3.35 TB/s, draws up to 700W, and supports NVLink 4.0 at 900 GB/s for multi-GPU training. PCIe is the right choice for inference and single-GPU workloads; SXM is the right choice for distributed training at scale.

How does the H100 Transformer Engine work?

The Transformer Engine automatically selects between FP8 and FP16 precision on a per-operation basis during training, without manual tuning. For transformer-based models, this delivers 3 to 4 times the throughput of the A100 at FP16. Framework support as of 2026 includes vLLM (--quantization fp8), TensorRT-LLM (native FP8 engine), and the transformer_engine package for PyTorch training.

What is MIG on the H100 and what workloads benefit?

MIG (Multi-Instance GPU) partitions one H100 into up to 7 isolated instances, each with a dedicated slice of CUDA cores, Tensor Cores, L2 cache, and HBM memory. Common patterns are 7x 1g.10gb for multi-tenant inference of small models and 2x 3g.40gb for two simultaneous 13B model servers. MIG is most useful for inference teams where a full 80 GB GPU is more than any single request needs.

How does the H100 compare to the H200?

The H200 uses the identical GH100 compute die, so CUDA cores, Tensor Cores, and FP8 throughput are the same. The upgrade is memory: 141 GB HBM3e at ~4.8 TB/s versus 80 GB HBM3 at 3.35 TB/s on the SXM H100. For workloads that fit in 80 GB, performance is nearly identical. The H200 adds value only for models exceeding 80 GB at full precision, very long context windows, or memory-bound large-batch inference.

How does the H100 compare to the A100?

The H100 delivers 3 to 4 times the throughput on transformer workloads via FP8 and the Transformer Engine, 68% more memory bandwidth on the SXM variant (3.35 vs 2.0 TB/s), and 50% more NVLink bandwidth (900 vs 600 GB/s). The A100 remains cost-competitive for workloads that don't require FP8 and where the H100 price premium is not justified. See the full A100 vs H100 comparison for a detailed breakdown.

What is the H100's inference cost per token?

According to SemiAnalysis's InferenceX benchmarks from April 2026, the H100 delivers inference at approximately $0.09 per 1M tokens for a 120B parameter model using vLLM. At Thunder Compute's $1.38/hr rate, this makes H100 one of the most cost-efficient inference options available from a cloud provider.

What is the H100's confidential computing capability?

The H100 was the first GPU with hardware-enforced confidential computing via an on-die Trusted Execution Environment (TEE). Data transfers between CPU and GPU are AES-256 encrypted on-die. A peer-reviewed benchmark study found the average overhead is below 7%, approaching zero for large models like Llama 3.1-70B. This makes it suitable for HIPAA, finance, and government workloads.