What are Tensor Cores?

Tensor Cores are specialized processing units on NVIDIA GPUs (Volta and later) that accelerate matrix multiply-accumulate operations — the core computation in deep learning.

Example

import torch

# Under autocast, eligible ops such as matmul run in FP16,
# which lets them use Tensor Cores on supported GPUs.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    x = torch.randn(512, 512, device="cuda")  # created as FP32
    w = torch.randn(512, 512, device="cuda")  # created as FP32
    y = x @ w  # inputs are cast to FP16 for the matrix multiply

print(y.dtype)  # torch.float16
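Whether a matrix multiply can reach Tensor Cores depends on the GPU. The following is a minimal sketch, not an official detection API: it assumes that CUDA compute capability 7.0 corresponds to Volta, the first architecture with Tensor Cores. The helper names `may_have_tensor_cores` and `describe_current_device` are made up for illustration. The check is only a heuristic, because some consumer parts (for example the GTX 16 series) report capability 7.5 without having Tensor Cores.

```python
def may_have_tensor_cores(capability):
    """Heuristic: True if a (major, minor) compute capability is 7.0 (Volta) or newer.

    Assumption: capability >= 7.0 is treated as "Volta and later". This can
    give false positives, e.g. GTX 16-series cards report 7.5 but have no
    Tensor Cores.
    """
    return tuple(capability) >= (7, 0)


def describe_current_device():
    """Summarize whether the default CUDA device may have Tensor Cores."""
    try:
        import torch  # optional: the pure check above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device is visible"
    capability = torch.cuda.get_device_capability()  # (major, minor) tuple
    name = torch.cuda.get_device_name()
    verdict = "may have" if may_have_tensor_cores(capability) else "lacks"
    return f"{name} (compute capability {capability[0]}.{capability[1]}) {verdict} Tensor Cores"


if __name__ == "__main__":
    print(describe_current_device())
```

Keeping the capability comparison in a pure function means it can be checked on a machine without a GPU; only `describe_current_device` touches PyTorch.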

An Overview of Tensor Cores

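On Volta, each Tensor Core performs one 4x4 matrix multiply-accumulate (D = A x B + C) per clock cycle
Require specific data types, with support depending on the generation: FP16, BF16, TF32, INT8, and on newer architectures FP8 and FP4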
Dramatically accelerate training and inference when used with mixed precision
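The multiply-accumulate at the heart of a Tensor Core can be sketched in plain Python. This is an illustrative emulation, not how the hardware is programmed: it assumes FP16 inputs with FP32 accumulation (one common mixed-precision mode), and the helper names `to_fp16`, `to_fp32`, and `mma_4x4` are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (inputs must fit FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Compute D = A x B + C for 4x4 nested lists.

    Assumption: A and B are rounded to FP16 and the running sum is rounded
    to FP32 after every addition; real hardware may round differently.
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(c[i][j])  # start from the accumulator input
            for k in range(4):
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
    c = [[0.5] * 4 for _ in range(4)]
    for row in mma_4x4(identity, b, c):
        print(row)
```

The rounding helpers make the precision trade-off visible: a value such as 0.1 changes slightly when stored as FP16, while the FP32 accumulator limits how much of that error builds up across the sum.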

Tensor Core Generations

NVIDIA has revised Tensor Cores with each GPU architecture, and each generation has added lower-precision data types and higher peak throughput for deep learning. The generations below are listed newest first.

Blackwell (5th Gen): Featured in the RTX PRO 6000, delivering up to 4,000 AI TOPS and introducing support for FP4 precision to maximize throughput for massive LLMs.
Hopper (4th Gen): Introduced the Transformer Engine in the H100, specifically designed to dynamically scale precision for Transformer-based models using FP8.
Ada Lovelace (4th Gen): Found in the RTX 6000 and RTX 4090, these cores include an enhanced 8-bit floating point (FP8) engine to double throughput over the previous generation.
Ampere (3rd Gen): Found in the A100, RTX A6000, and RTX 3090, this generation introduced TF32 (TensorFloat-32), which speeds up FP32 workloads with little or no code change, although some frameworks require opting in.
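To make the TF32 trade-off concrete, the sketch below emulates TF32 precision in plain Python. It assumes the commonly documented TF32 layout (FP32's 8-bit exponent with a 10-bit mantissa) and uses simple truncation; actual hardware rounding may differ, and the helper name `to_tf32` is hypothetical.

```python
import struct


def to_tf32(x):
    """Truncate a value to TF32 precision: 8-bit exponent, 10-bit mantissa.

    Assumption: the 13 low mantissa bits of the FP32 encoding are simply
    cleared; real Tensor Core rounding may behave differently.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw FP32 bit pattern
    bits &= 0xFFFFE000  # keep sign, exponent, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 1.0 + 2**-10, 1.0 + 2**-11, 3.14159265):
        print(f"{value!r:>22} -> {to_tf32(value)!r}")
```

Because the exponent width matches FP32, the representable range is unchanged and only fine-grained precision is lost. In PyTorch, TF32 matrix multiplies are controlled by the `torch.backends.cuda.matmul.allow_tf32` flag.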

NVIDIA GPU Tensor Core Comparison

GPU	Architecture	Tensor Core Generation	Peak AI TOPS	CUDA Cores	FP32 TFLOPS
RTX PRO 6000	NVIDIA Blackwell	5th Gen	4,000	24,064	125.0
RTX 6000	NVIDIA Ada Lovelace	4th Gen	1,457	18,176	91.1
RTX A6000	NVIDIA Ampere	3rd Gen	309.7	10,752	38.7
A100 80GB	NVIDIA Ampere	3rd Gen	624	6,912	19.5
H100 PCIe	NVIDIA Hopper	4th Gen	1,513	14,592	51.2
H200 NVL	NVIDIA Hopper	4th Gen	3,341	16,896	60.3
