Why is standardized cloud terminology important for GPU selection?

Standardized terminology allows users to make apples-to-apples comparisons between different providers. Understanding the underlying compute definitions ensures that performance metrics and billing structures are interpreted correctly across various platforms.

What is the difference between bare metal and virtualized compute?

Bare metal refers to physical servers dedicated to a single user without a virtualization layer, offering maximum performance. Virtualized compute uses a hypervisor to split physical hardware into multiple isolated environments, providing greater flexibility and faster deployment times.

How does compute "availability" impact cloud workloads?

Availability is the percentage of time a compute resource is operational and accessible. In cloud infrastructure, providers typically commit to an availability target in a Service Level Agreement (SLA), which matters most for mission-critical AI training and deployment.
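As a rough illustration of what an availability percentage means in practice, the sketch below converts an availability target into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day period are assumptions made for this example; they are not taken from any provider's SLA terms.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the maximum downtime, in minutes, that an availability target permits.

    availability_pct: target such as 99.9 (percent of the period the resource is up).
    period_days: length of the measurement window; 30 days is an assumed default.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)
```

Under these assumptions, a 99.9% target leaves about 43 minutes of permitted downtime per 30 days, while 99% leaves about 7.2 hours, which is why a single extra "nine" can matter for long training runs.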


GPU Cloud Terminology: Key Terms Explained

Carl Peterson | July 2, 2026 | 7 min read

To get started with cloud infrastructure, it's essential to understand its nomenclature. Terminology can vary by provider; however, the terms listed below are widely used across the industry.

This guide is a cloud computing 101, breaking down key terms you'll encounter when working with modern GPU clouds.

Cloud Computing Terms

Term	Definition	Further Reading
Hyperscaler	Large cloud providers offering massive global infrastructure (e.g., AWS, Azure, Google Cloud).
Instance	A virtual machine running in the cloud with allocated compute resources.
On-Demand Instances	Instances billed by usage (per second, minute, or hour, depending on the provider) with no long-term commitment.	Read more
Reserved Instances	Instances purchased at a discount in advance and for a certain period.
Spot Instances	Discounted instances using spare capacity, which can be interrupted with little notice.	Read more
GPU-as-a-Service (GPUaaS)	Cloud model where GPUs are rented on demand for compute workloads.	Read more
Neocloud	A newer category of GPU cloud providers focused on AI workloads.	Read more
Egress Fees	Charges for transferring data out of a cloud provider’s network.	Read more
LLM Inference	The process of running a trained large language model to generate outputs.
Supervised Fine-Tuning (SFT)	Training a model on labeled data to improve task-specific performance.	Read more
Computer Vision	A field of AI that enables machines to interpret and analyze images and video.	Read more
Bare Metal	A physical server dedicated to a single tenant with no virtualization layer, giving direct access to the underlying hardware for maximum performance and predictable throughput.
Persistent Storage	Storage that survives instance stops, restarts, and terminations. Unlike ephemeral disk attached to a running instance, persistent storage retains data between sessions and can be remounted on a new instance.
Snapshot	A point-in-time copy of an instance or volume that captures the full environment, including installed packages, model weights, and configuration. Used to save progress, clone environments, or resume work on a different instance type.
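To show how on-demand pricing, spot pricing, and egress fees combine into a single bill, here is a minimal sketch. Everything in it is hypothetical: the function `estimate_job_cost`, the `rework_fraction` parameter (extra hours repeated after spot interruptions), and all of the rates are placeholders for illustration, not real price quotes from any provider.

```python
def estimate_job_cost(
    gpu_hours: float,
    hourly_rate: float,
    egress_gb: float = 0.0,
    egress_rate_per_gb: float = 0.0,
    rework_fraction: float = 0.0,
) -> float:
    """Estimate total cost: billed GPU time plus data egress.

    rework_fraction models work repeated after spot interruptions
    (0.15 means 15% extra hours). It is an illustrative knob, not a
    measured value.
    """
    values = (gpu_hours, hourly_rate, egress_gb, egress_rate_per_gb, rework_fraction)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    billed_hours = gpu_hours * (1.0 + rework_fraction)
    return billed_hours * hourly_rate + egress_gb * egress_rate_per_gb


# Placeholder rates for illustration only, not real price quotes.
on_demand_cost = estimate_job_cost(
    100, hourly_rate=2.00, egress_gb=500, egress_rate_per_gb=0.09
)
spot_cost = estimate_job_cost(
    100, hourly_rate=0.80, egress_gb=500, egress_rate_per_gb=0.09, rework_fraction=0.15
)
```

With these made-up numbers the spot run stays cheaper even after repeating 15% of the work, but the egress charge is identical in both cases, so it takes up a larger share of the smaller bill.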

Graphics Processing Unit (GPU) Concepts

Term	Definition	Further Reading
CUDA Cores	Programmable processing units in NVIDIA GPUs used for general computation.	Read more
DGX	NVIDIA's line of purpose-built AI supercomputing systems that integrate multiple GPUs, high-speed interconnects, and optimized software into a single turnkey platform.	Read more
HGX	NVIDIA's reference server board platform that allows OEM partners to build GPU-accelerated servers by integrating multiple high-end GPUs with NVLink and NVSwitch.	Read more
NVLink	NVIDIA's high-bandwidth, low-latency GPU-to-GPU interconnect that enables faster data sharing between GPUs than PCIe allows.	Read more
NVSwitch	NVIDIA's high-speed switch chip that enables full all-to-all NVLink connectivity across multiple GPUs within a node or across nodes.	Read more
PCIe	A standard high-speed interface used to connect GPUs, storage, and other peripherals to a computer's CPU and motherboard.	Read more
ROCm	AMD's open-source software platform for GPU computing, providing an alternative to NVIDIA's CUDA ecosystem for running AI and HPC workloads on AMD GPUs.	Read more
Transformer Engine	A dedicated hardware and software component in NVIDIA GPUs that was first introduced with the Hopper architecture (H100). It accelerates transformer-based AI models by automatically switching between FP8 and FP16 precision.	Read more
SXM	NVIDIA's high-bandwidth socket form factor for mounting data center GPUs directly onto a baseboard, enabling faster NVLink connections compared to standard PCIe cards.	Read more
Shannon Entropy	A mathematical measure of the average uncertainty, randomness, or information content inherent in a data source or probability distribution.	Read more
Tensor Cores	Specialized GPU cores optimized for AI and matrix operations.	Read more
Taiwan Semiconductor Manufacturing Company (TSMC)	The world's largest contract chip manufacturer (foundry), reported to hold nearly 70% of the global market for advanced chips.	Read more
VRAM (Video RAM)	Dedicated high-speed memory on a GPU used to store model weights, activations, and intermediate tensors during computation. VRAM capacity is typically the primary constraint when deciding which models can run on a given GPU.
HBM (High Bandwidth Memory)	A stacked memory architecture used in data center GPUs that delivers significantly higher bandwidth than standard GDDR memory. Variants include HBM2e (A100 80GB) and HBM3 (H100 SXM), with newer generations offering greater bandwidth for memory-bound workloads.	Read more
MIG (Multi-Instance GPU)	An NVIDIA feature that partitions a single GPU into up to seven hardware-isolated instances, each with dedicated CUDA cores, Tensor Cores, and HBM memory. Useful for running multiple smaller models concurrently on one physical GPU.	Read more
InfiniBand	A high-speed network interconnect used to link multiple GPU nodes for distributed training. Delivers the low latency and high bandwidth (for example, 400 Gb/s per port with NDR) needed for efficient gradient synchronization across nodes.	Read more
FP8 / Mixed Precision	FP8 is an 8-bit floating-point format; mixed precision combines lower-precision formats (such as FP8 or FP16) with higher-precision ones during training or inference to cut memory use and raise throughput. FP8 is supported natively by the H100's Transformer Engine, where peak FP8 Tensor Core throughput is roughly twice that of FP16, usually with little loss of accuracy.	Read more
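The VRAM and precision entries above can be tied together with a back-of-the-envelope calculation: the memory needed for model weights is roughly the parameter count times the bytes per parameter. The sketch below is an assumption-laden estimate, not a sizing tool; the names `BYTES_PER_PARAM` and `weight_memory_gb` are made up for this example, and the result covers weights only (no activations, KV cache, or optimizer state).

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def weights_fit(num_params: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM; real workloads need headroom."""
    return weight_memory_gb(num_params, precision) <= vram_gb
```

For example, a 7-billion-parameter model needs about 14 GB for FP16 weights and about 3.5 GB at INT4, while a 70-billion-parameter model at FP16 needs about 140 GB, more than a single 80 GB GPU can hold.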

AI Concepts

Term	Definition	Further Reading
Latency	The time elapsed between submitting a request to a model and receiving a response, typically measured in milliseconds. For LLMs it is often reported as time to first token (TTFT). Lower latency is critical for real-time, interactive applications.
Model	A machine learning algorithm trained on data to recognize patterns, make predictions, or generate outputs.
Model Context Protocol (MCP)	An open standard that allows AI models to securely interface with external data sources, developer tools, and client environments.	Read more
Mixture of Experts (MoE)	An architecture that routes inputs to specialized sub-networks ("experts"), activating only a fraction of total parameters at once to scale capacity efficiently.
Open-source model	An AI model whose source code, core architecture, and training pipelines are fully public and free to use, modify, and redistribute.	Read more
Open-weights model	A model whose pre-trained numerical weights are publicly accessible for deployment, though its exact training data or source code may remain proprietary.	Read more
Parameter	The internal numerical variables (weights and biases) that a model optimizes during training to define its behavior and overall capacity.
Throughput	The number of tokens a model generates per second across all concurrent requests, indicating how efficiently an inference system utilizes hardware under load.
Quantization	A compression technique that converts model weights from high-precision formats (like FP16) to lower-bit formats (like FP8 or INT4) to save VRAM, usually at a small cost in accuracy.	Read more
LoRA (Low-Rank Adaptation)	A parameter-efficient fine-tuning technique that inserts small trainable weight matrices into a frozen base model instead of updating all parameters. Dramatically reduces VRAM requirements and training time compared to full fine-tuning.	Read more
RLHF (Reinforcement Learning from Human Feedback)	A training technique that uses human preference data to optimize a model beyond what supervised fine-tuning alone can achieve. Typically applied after SFT to improve helpfulness, accuracy, and safety through a reward model trained on human rankings.	Read more
RAG (Retrieval-Augmented Generation)	An inference architecture that augments a model's response by retrieving relevant documents from an external knowledge base at query time. Allows models to answer questions about data they were not trained on without requiring fine-tuning.
Distributed Training	A training strategy that splits a model, dataset, or both across multiple GPUs or nodes to handle workloads that exceed the capacity or throughput of a single GPU. Requires high-bandwidth interconnects such as NVLink or InfiniBand to synchronize gradients efficiently.	Read more
Checkpoint	A saved copy of a model's weights and optimizer state at a point during training. Used to resume interrupted runs, evaluate intermediate model quality, or roll back to an earlier training state.
vLLM	An open-source inference engine optimized for serving large language models at scale. Implements PagedAttention for efficient KV cache management and continuous batching to maximize GPU utilization across concurrent requests.	Read more
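To make the LoRA entry above more concrete, the sketch below counts the trainable parameters a low-rank adapter adds to one weight matrix, compared with updating the full matrix. It assumes the common formulation in which a frozen weight of shape d_out x d_in is paired with two small matrices of shapes d_out x r and r x d_in; the function names and the 4096 x 4096 layer size are illustrative assumptions, not figures from any specific model.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    The adapter is two matrices: A with shape (rank x d_in) and
    B with shape (d_out x rank), so the total is rank * (d_in + d_out).
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the full matrix's parameter count that the adapter trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_finetune_params(d_in, d_out)
```

For a hypothetical 4096 x 4096 layer with rank 8, the adapter trains 65,536 parameters instead of roughly 16.8 million, or about 0.4% of the layer, which is where the VRAM and training-time savings come from.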

Takeaway

This glossary is just the starting point. For deeper dives into pricing, infrastructure, and performance comparisons, explore the Thunder Compute blog.

By building a strong foundation in cloud computing terminology, you'll be better equipped to navigate the rapidly evolving landscape of GPU cloud infrastructure.