What is GPU virtualization?

GPU virtualization creates virtual representations of physical GPU hardware so workloads can use GPU resources through an abstraction layer instead of being tied directly to one physical device.

What are the main types of GPU virtualization?

The main types are single-node GPU sharing, dedicated GPU passthrough, and network-based GPU pooling. The first two operate inside one server, while network-based pooling operates across a cluster.

How is NVIDIA MIG different from GPU passthrough?

NVIDIA MIG partitions one physical GPU into multiple smaller fixed GPU instances. GPU passthrough assigns an entire physical GPU or MIG partition to one workload through a virtual interface.

Why does network-based GPU virtualization improve utilization?

Network-based virtualization lets any workload use any GPU on the same network fabric, giving schedulers more flexibility to fill idle capacity across the whole fleet.

Is network-based GPU virtualization similar to storage virtualization?

Yes. Conceptually, network-based GPU virtualization is similar to storage virtualization systems like Ceph or Storage Area Networks because it pools physical resources across a network and allocates them dynamically.


GPU Virtualization: Approaches and Tradeoffs

Carl Peterson | June 24, 2026 | 3 min read

Introducing GPU virtualization

Virtualization is the computer science concept of creating virtual representations of physical hardware. It is most commonly associated with CPUs, through hardware features such as Intel VT-x, but it extends to other domains, including GPUs. Virtualization is essential for efficient hardware resource allocation, which is more relevant than ever given the massive hardware buildouts for AI. However, the term is loaded and often misunderstood, especially for GPUs, where it can have multiple meanings.

Existing types of GPU virtualization

GPU virtualization currently exists in three main forms:

Single-node GPU sharing
Dedicated GPU passthrough
Network-based GPU pooling, which is Thunder Compute's approach

The first two operate within a single physical server and are widely used today. Thunder Compute is pioneering the third approach, which operates across a cluster of servers.

Single-node GPU sharing, such as NVIDIA MIG

Diagram of single-node GPU sharing with NVIDIA MIG partitions.

The current class-leading approach to sharing a single physical GPU is NVIDIA's MIG, or Multi-Instance GPU, which partitions a GPU into multiple smaller virtual GPUs. MIG lets several workloads use the same GPU simultaneously, each with a fixed partition of the compute cores and memory. MIG is a form of static partitioning, in contrast to dynamic partitioning, where workloads share the whole GPU on demand. Dynamic partitioning makes the most efficient use of the full GPU, but can lead to resource contention between non-cooperative workloads. Static partitioning guarantees each workload its compute and memory allocation, with the drawback that each workload can only use a fixed fraction of the GPU. Modern AI workloads tend to need more compute, not less, so GPU partitioning is less common in production clusters than other types of virtualization.
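The static versus dynamic tradeoff can be made concrete with a toy model. The sketch below is illustrative only: the function names and demand numbers are hypothetical, a GPU is treated as a single unit of compute, and memory limits and scheduling overhead are ignored.

```python
def static_partition(demands, num_slices):
    """Each workload gets a fixed 1/num_slices share and cannot borrow idle capacity."""
    share = 1.0 / num_slices
    return [min(d, share) for d in demands]


def dynamic_share(demands):
    """Workloads share the whole GPU. If total demand exceeds capacity,
    every workload is scaled back proportionally (contention)."""
    total = sum(demands)
    if total <= 1.0:
        return list(demands)
    return [d / total for d in demands]


# Fraction of one GPU each workload would like to use (hypothetical numbers).
demands = [0.6, 0.1, 0.1, 0.1]

static = static_partition(demands, num_slices=4)
dynamic = dynamic_share(demands)

print("static  utilization:", round(sum(static), 2))   # 0.55
print("dynamic utilization:", round(sum(dynamic), 2))  # 0.9

# Contention case: a greedy neighbor squeezes the second workload
# below the 0.5 it would have been guaranteed by a two-way static split.
print("contended shares:", dynamic_share([0.9, 0.5]))
```

In this model the static split caps the busy workload at a quarter of the GPU while three quarter-slices sit mostly idle, whereas dynamic sharing serves all demand but offers no guarantee once total demand exceeds the device.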

Dedicated GPU passthrough, such as NVIDIA vGPU

Diagram of dedicated GPU passthrough with NVIDIA vGPU.

GPU passthrough assigns an entire physical GPU or MIG partition to a single workload. Although this does not split the GPU, it is considered virtualization because the VM controls the GPU through a virtual interface. GPU passthrough is the industry standard in cloud environments and lets providers manage workloads with orchestration software such as Kubernetes and Slurm, along with various hypervisors.

That said, GPU passthrough provides no efficiency gain over bare-metal allocation, and it is typically used alongside other types of virtualization.

A new approach: network-based virtualization

Diagram of network-based GPU pooling across a cluster.

Network-based GPU virtualization creates more flexibility within a data center by allowing any workload to use any GPU on the same network fabric. This flexibility enables efficiency: a scheduler is aware of all workloads and all GPUs, and can dynamically assign workloads to GPUs in a way that maximizes utilization across the fleet.

This lets far more workloads fit on a fleet of GPUs by filling gaps that would otherwise go unused. Conceptually, this type of GPU virtualization is very similar to storage virtualization systems such as Ceph or Storage Area Networks.
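To see why pooling fills gaps, consider a minimal sketch that counts how many single-GPU requests can be served when each request is tied to the GPUs in its own host, compared with when any request can reach any GPU. The function names and the fleet numbers are hypothetical, and the model ignores network latency, GPU memory, and multi-GPU jobs.

```python
def served_host_bound(gpus_per_host, requests_per_host):
    """Each request can only use a GPU attached to its own host."""
    return sum(min(g, r) for g, r in zip(gpus_per_host, requests_per_host))


def served_pooled(gpus_per_host, requests_per_host):
    """Any request can use any GPU reachable over the network fabric."""
    return min(sum(gpus_per_host), sum(requests_per_host))


# Four hosts with four GPUs each, and uneven demand (hypothetical numbers).
gpus = [4, 4, 4, 4]
requests = [7, 1, 0, 5]

total_gpus = sum(gpus)
bound = served_host_bound(gpus, requests)
pooled = served_pooled(gpus, requests)

print(f"host-bound: {bound} served, {bound / total_gpus:.0%} utilization")
print(f"pooled:     {pooled} served, {pooled / total_gpus:.0%} utilization")
```

With these numbers, host-bound placement serves 9 of the 13 requests because demand is concentrated on two hosts, while pooled placement serves all 13 on the same 16 GPUs.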

The key enabler is the ability to extend physical PCIe connections with virtual connections over a network. Doing so adds latency, and the impact varies by workload, but it is often far outweighed by cluster-scale efficiency improvements.
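A back-of-the-envelope model shows why the latency impact depends so heavily on the workload. The sketch below assumes that every synchronous call pays a fixed added round trip and that nothing overlaps with compute, which is a pessimistic simplification. All numbers, including the added round-trip time and the utilization figures, are hypothetical and not measurements of any real system.

```python
def slowdown(compute_ms, sync_calls, added_rtt_ms):
    """Ratio of remote step time to local step time.

    Assumes each synchronous call pays the full added round trip
    and none of it is hidden behind compute.
    """
    return (compute_ms + sync_calls * added_rtt_ms) / compute_ms


def effective_throughput(utilization, slowdown_factor):
    """Useful work per GPU: fraction of time busy, discounted by the slowdown."""
    return utilization / slowdown_factor


ADDED_RTT_MS = 0.05  # hypothetical extra latency per remote call

# Long GPU kernels with few synchronization points.
coarse = slowdown(compute_ms=200.0, sync_calls=10, added_rtt_ms=ADDED_RTT_MS)
# Many tiny kernels with a synchronization after each one.
chatty = slowdown(compute_ms=2.0, sync_calls=40, added_rtt_ms=ADDED_RTT_MS)

print(f"coarse-grained slowdown: {coarse:.4f}x")
print(f"chatty slowdown:         {chatty:.4f}x")

# Hypothetical fleet: 56% utilization when host-bound, 81% when pooled.
print("local          :", round(effective_throughput(0.56, 1.0), 3))
print("pooled (coarse):", round(effective_throughput(0.81, coarse), 3))
print("pooled (chatty):", round(effective_throughput(0.81, chatty), 3))
```

Under these assumed numbers, the coarse-grained workload slows down by a fraction of a percent, so the utilization gain from pooling dominates. The chatty workload roughly doubles its step time, and pooling would be a net loss for it.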

The future of GPU virtualization

In production systems, multiple types of virtualization are often used together. For example, MIG can slice GPUs into smaller shapes, which are then passed to a network virtualization layer and allocated to workloads.
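The layering described here can be sketched as a cluster-wide pool of slices with a simple allocator on top. This is an illustration under stated assumptions, not a description of any real scheduler: the `Slice` type, the `allocate` function, the host names, and the slice sizes are all hypothetical, and a best-fit policy is just one possible choice.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """One GPU slice (for example, a MIG partition) exposed to the pool."""
    host: str
    gpu_index: int
    memory_gb: int
    in_use: bool = False


def allocate(pool, needed_gb):
    """Best fit: pick the smallest free slice, on any host, that is large enough."""
    candidates = [s for s in pool if not s.in_use and s.memory_gb >= needed_gb]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: s.memory_gb)
    best.in_use = True
    return best


# Slices from two hosts, gathered into one pool (hypothetical sizes).
pool = [
    Slice("host-a", 0, 40),
    Slice("host-a", 0, 20),
    Slice("host-a", 0, 20),
    Slice("host-b", 0, 80),
]

for request_gb in (16, 30, 60, 60):
    chosen = allocate(pool, request_gb)
    if chosen is None:
        print(f"{request_gb} GB: no free slice large enough")
    else:
        print(f"{request_gb} GB: {chosen.memory_gb} GB slice on {chosen.host}")
```

Because the allocator sees slices from every host, a small request lands on a small slice and leaves the full 80 GB device free for the workload that needs it.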

As network-based virtualization continues to improve and cluster utilization becomes increasingly important for cloud economics, we expect it to become standard, much as network-based storage virtualization has.

If you manage a fleet of GPUs and would like to learn more about network virtualization for your cluster, contact us.