What are Tensor Cores?
Specialized GPU cores for matrix multiply-accumulate operations
Tensor Cores are specialized processing units on NVIDIA GPUs (Volta and later) that accelerate matrix multiply-accumulate operations — the core computation in deep learning.
Example
import torch

# Tensor Cores are used automatically with mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # The body of the `with` block must be indented to be valid Python
    x = torch.randn(512, 512, device="cuda")
    w = torch.randn(512, 512, device="cuda")
    y = x @ w  # matrix multiply; autocast runs it in FP16, which is eligible for Tensor Cores
An Overview of Tensor Cores
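- Perform a small matrix multiply-accumulate (D = A × B + C) per clock cycle; on first-generation Volta cores each operand is a 4x4 tile, and later generations do more work per clock
- Operate on specific reduced-precision data types, most commonly FP16, BF16, TF32, and INT8; newer generations add FP8 and FP4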
- Dramatically accelerate training and inference when used with mixed precision
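The multiply-accumulate described in the list above can be illustrated with a small CPU-side emulation. This is only a sketch of the numerics, not how Tensor Cores are programmed: it assumes FP16 inputs with an FP32 accumulator (one common Tensor Core mode), and the helper names `to_fp16`, `to_fp32`, and `mma_4x4` are made up for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4(a, b, c):
    """Emulate D = A @ B + C for 4x4 tiles: FP16 inputs, FP32 accumulator."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # the accumulator starts from C, held in FP32
            for k in range(n):
                # Inputs are rounded to FP16; the running sum is kept in FP32.
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d

# Hypothetical sample data: multiply by the identity so the effect of
# rounding the inputs to FP16 is easy to see in the result.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

d = mma_4x4(a, identity, zeros)
print(d[0][1])  # close to 0.1, but rounded to FP16 precision
```

Keeping the accumulator in FP32 while the inputs are FP16 is the reason mixed precision can lose little accuracy in practice: rounding error enters once per input, not once per addition.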
Tensor Core Generations
NVIDIA has revised Tensor Cores with each GPU architecture, adding lower-precision formats and raising deep learning throughput. The most recent generations are listed below, newest first.
- Blackwell (5th Gen): Featured in the RTX PRO 6000, delivering up to 4,000 AI TOPS and introducing support for FP4 precision to maximize throughput for massive LLMs.
- Hopper (4th Gen): Introduced the Transformer Engine in the H100, specifically designed to dynamically scale precision for Transformer-based models using FP8.
- Ada Lovelace (4th Gen): Found in the RTX 6000 Ada Generation and RTX 4090. These cores include an enhanced 8-bit floating point (FP8) engine to double throughput over the previous generation.
- Ampere (3rd Gen): Found in the A100, RTX A6000, and RTX 3090, this generation introduced TF32 (Tensor Float 32), providing speedups on FP32 workloads without requiring code changes.
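To make the TF32 format mentioned above more concrete, here is a minimal sketch that emulates its precision on the CPU. TF32 keeps FP32's 8 exponent bits but only 10 mantissa bits, so the sketch clears the low 13 mantissa bits of an FP32 value. The function name `to_tf32` is hypothetical, and truncation is used for simplicity; the rounding performed by actual hardware may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Truncate a value to TF32 precision (8 exponent bits, 10 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= 0xFFFFE000  # clear the low 13 of FP32's 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

small_step = 1.0 + 2**-20  # exactly representable in FP32
tf32_step = 1.0 + 2**-10   # smallest step above 1.0 that TF32 keeps

print(to_tf32(small_step))  # the 2**-20 term is below TF32 resolution
print(to_tf32(tf32_step))   # survives unchanged
print(to_tf32(3.0e38))      # large values stay finite: same range as FP32
```

Whether a framework actually uses TF32 for FP32 matrix multiplies depends on the library and version. In PyTorch, for example, the behavior is controlled by settings such as `torch.backends.cuda.matmul.allow_tf32` and `torch.set_float32_matmul_precision`, so it is worth checking the defaults for the version in use.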
NVIDIA GPU Tensor Core Comparison
| Graphics Card | Architecture | Tensor Cores | AI TOPS | CUDA Cores | FP32 TFLOPS |
|---|---|---|---|---|---|
| RTX PRO 6000 | NVIDIA Blackwell | 5th Gen | 4,000 | 24,064 | 125.0 |
| RTX 6000 | NVIDIA Ada Lovelace | 4th Gen | 1,457 | 18,176 | 91.1 |
| RTX A6000 | NVIDIA Ampere | 3rd Gen | 309.7 | 10,752 | 38.7 |
| A100 80GB | NVIDIA Ampere | 3rd Gen | 624 | 6,912 | 19.5 |
| H100 PCIe | NVIDIA Hopper | 4th Gen | 1,513 | 14,592 | 51.2 |
| H200 NVL | NVIDIA Hopper | 4th Gen | 3,341 | 16,896 | 60.3 |