NVIDIA L40 vs A100: Full comparison (July 2026)

Carl Peterson | July 1, 2026 | 12 min read

Choosing between the NVIDIA L40 and A100 comes down to more than raw specs. Each GPU targets different AI workloads, and picking the wrong one means running out of resources or paying for unused performance.

This guide breaks down the key architectural differences, benchmark results, and practical use cases to help you decide.

Quick Comparison Table

Specification	NVIDIA L40	NVIDIA A100 (80 GB)
Architecture	Ada Lovelace	Ampere
VRAM	48 GB GDDR6	80 GB HBM2e
CUDA cores	18,176	6,912
Tensor cores	568	432
Memory Bandwidth	864 GB/s	2,039 GB/s (SXM4) / 1,935 GB/s (PCIe)
FP16 Tensor Performance	181 TFLOPS (362 with sparsity)	312 TFLOPS (624 with sparsity)
TDP	300 W	400 W (SXM4) / 300 W (PCIe)
Interconnect	PCIe Gen 4	PCIe Gen 4 / NVLink (SXM4)
NVLink Support	No	Yes (SXM variant)
Form Factor	PCIe	PCIe or SXM
Primary Use Case	Small-scale training, inference	Large-scale training, HPC

A Note on L40 vs L40S

This guide covers the NVIDIA L40, the original 2022 Ada Lovelace release designed for visualization and moderate AI inference. If you are evaluating the L40S (the 2023 successor), the key differences are higher rated Tensor Core throughput with Transformer Engine support for FP8 and a higher 350 W TDP (vs 300 W). Both cards include NVENC/NVDEC with AV1 encoding. The L40S is the more relevant card for pure AI inference in 2026; the L40 remains available on cloud marketplaces at lower hourly rates.

NVIDIA L40 vs A100 Specs

Understanding the technical foundation of each GPU makes it easier to predict real-world behavior before running a single workload.

Architecture: Ampere vs. Ada Lovelace

The A100 is built on NVIDIA's Ampere architecture (2020) and remains a workhorse for large-scale training. The L40 uses the newer Ada Lovelace architecture (2022), which improves tensor core efficiency and introduces 4th-gen Tensor Cores with FP8 support.

The L40 also includes RT Cores for hardware-accelerated ray tracing and supports DLSS, which the A100 does not. These are irrelevant for pure AI workloads, but they reflect the generational advancement of the underlying silicon.

Memory Bandwidth

This is where the A100 holds its most decisive lead. The A100 80 GB SXM4 uses HBM2e memory with 2,039 GB/s of bandwidth, more than double the L40's 864 GB/s from GDDR6. The PCIe A100 80 GB delivers 1,935 GB/s.

Memory bandwidth is often the primary bottleneck in transformer workloads. Large weight matrices must be streamed from memory repeatedly during forward and backward passes, and when that streaming, not arithmetic, sets the pace, the A100's bandwidth advantage translates directly into faster iteration times.
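The bandwidth-bound case is easiest to see in single-stream decoding, where generating each token means reading every weight once. The sketch below is a back-of-the-envelope ceiling, not a benchmark: it assumes weight streaming is the only cost and ignores KV-cache traffic, kernel overhead, and compute limits. The function name and the 14 GB example (about 7B parameters in FP16) are illustrative choices; the bandwidth figures come from the comparison table.

```python
# Rough ceiling on single-stream decode speed when weight streaming dominates.
# Assumption: every generated token reads all model weights exactly once.

BANDWIDTH_GB_S = {
    "L40": 864,
    "A100 80 GB PCIe": 1935,
    "A100 80 GB SXM4": 2039,
}


def max_decode_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling: GB readable per second / GB read per token."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 14  # ~7B parameters at 2 bytes each (FP16)
    for gpu, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = max_decode_tokens_per_s(bandwidth, weights_gb)
        print(f"{gpu}: ~{ceiling:.0f} tokens/s ceiling")
```

Real throughput lands below these ceilings, and batching changes the picture because one pass over the weights then serves several requests at once.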

Latency and Concurrency

The L40 is better positioned for low-latency, high-concurrency inference. Its Ada Lovelace architecture improves the efficiency of smaller, parallelized operations common in serving scenarios.

At moderate batch sizes with multiple concurrent requests, the L40's architecture can sustain throughput more efficiently than the A100, even with its lower raw bandwidth.

PCIe vs SXM

The L40 is PCIe-only, making it easy to deploy in standard server configurations. The A100 comes in PCIe or SXM4 form; the SXM4 variant adds NVLink for high-bandwidth multi-GPU setups.

In single-node PCIe deployments, the two GPUs are on more equal footing. For multi-GPU training, the SXM4 A100 with NVLink is in a different tier.

Multi-Instance GPU (MIG) Support

The A100 supports MIG partitioning, which splits a single GPU into up to seven isolated instances, each with dedicated memory and compute. This makes it the preferred choice for multi-tenant cloud environments and shared inference infrastructure.

The L40 does not support MIG. Each workload occupies the full GPU, which is fine for single-tenant deployments but limits flexibility in shared or bursty serving scenarios.

Power Consumption and TCO

The L40's 300 W TDP matches the A100 PCIe 80 GB's 300 W and is 25% lower than the A100 SXM4's 400 W. The power savings compound in large-scale SXM4 deployments. At typical data center rates ($0.08/kWh, PUE 1.4), eight L40 GPUs running 24/7 at full TDP draw 800 W less than eight A100 SXM4s, about 1.12 kW at the facility level, which works out to roughly $65 per month.
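The power arithmetic can be reproduced with a few lines. This is a minimal sketch under simplifying assumptions that are not measurements: every GPU draws its full TDP around the clock, a month is taken as 730 hours, and PUE scales the GPU power difference linearly. The function name and parameters are illustrative.

```python
def monthly_power_savings_usd(
    watts_saved_per_gpu: float,
    gpu_count: int,
    usd_per_kwh: float,
    pue: float,
    hours_per_month: float = 730,  # assumption: average month length
) -> float:
    """Facility-level electricity savings per month from a lower per-GPU TDP."""
    kw_saved_at_gpu = watts_saved_per_gpu * gpu_count / 1000
    kw_saved_at_facility = kw_saved_at_gpu * pue  # PUE adds cooling and overhead
    return kw_saved_at_facility * hours_per_month * usd_per_kwh


if __name__ == "__main__":
    # 400 W (A100 SXM4) vs 300 W (L40), eight GPUs, $0.08/kWh, PUE 1.4
    savings = monthly_power_savings_usd(400 - 300, 8, 0.08, 1.4)
    print(f"Estimated savings: ${savings:.2f} per month")
```

Changing the electricity rate or PUE scales the result proportionally, so it is worth plugging in your own facility's numbers.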

Lower power draw also reduces cooling requirements and can lower per-hour cloud costs.

Performance

Spec comparisons set the stage, but benchmarks reveal how each GPU performs under real workloads.

Large Language Model (LLM) Training

The A100 is the stronger choice for LLM training because of its memory bandwidth and NVLink support. Training requires frequent, high-volume data movement between memory and compute, and the A100's HBM2e architecture is purpose-built for this.

In multi-GPU NVLink configurations, the gap is decisive: benchmarks on GPT-style training at the 7B and 13B parameter scale consistently show the A100 SXM4 delivering 30 to 50% higher throughput than the L40.

For single-GPU training of smaller models, the gap narrows. The L40's 4th-gen Tensor Cores with native FP8 support offset some of the bandwidth deficit, and its higher CUDA core count (18,176 vs 6,912) improves parallelism on smaller batches. The A100's advantage grows with model size and GPU count, driven by its third-generation NVLink architecture.

Inference Throughput and Cost Per Token

The picture shifts for inference, especially at the batch sizes common in production APIs.

The L40's 4th-gen Tensor Cores support FP8 natively, while the A100 lacks native FP8 Tensor Core support. Enabling FP8 on the L40 can significantly increase token throughput compared to the FP16 baseline. The A100's inference throughput is also highly sensitive to concurrency: in single-stream or small-batch scenarios, much of its capacity goes unused.

The A100's 1,935 to 2,039 GB/s bandwidth gives it higher raw token throughput in memory-bound regimes. However, at a lower hourly rate, the L40 often delivers better cost-per-million-tokens for inference workloads where the full A100 bandwidth is not saturated. Benchmarks from MLPerf and independent evaluations of models like LLaMA-2 7B show the L40 achieving competitive or superior tokens-per-second at batch sizes of 1 to 16 vs the PCIe A100.
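Cost per million tokens follows directly from the hourly rate and sustained throughput. The sketch below shows the conversion and the break-even speedup a pricier GPU needs in order to match a cheaper one. The hourly rates and tokens-per-second values in the example are hypothetical placeholders, not benchmark results; substitute numbers measured on your own workload.

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_s: float) -> float:
    """Convert an hourly GPU rate and sustained throughput to $ per 1M tokens."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


def breakeven_speedup(expensive_usd_per_hour: float, cheap_usd_per_hour: float) -> float:
    """Throughput ratio the pricier GPU must exceed to win on cost per token."""
    return expensive_usd_per_hour / cheap_usd_per_hour


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    l40_cost = usd_per_million_tokens(usd_per_hour=0.80, tokens_per_s=1000)
    a100_cost = usd_per_million_tokens(usd_per_hour=1.10, tokens_per_s=1500)
    print(f"L40:  ${l40_cost:.3f} per 1M tokens")
    print(f"A100: ${a100_cost:.3f} per 1M tokens")
    print(f"A100 must be >{breakeven_speedup(1.10, 0.80):.2f}x faster to break even")
```

The break-even ratio is the useful number: if the A100 is not faster than the L40 by at least its price premium on your workload, the L40 is cheaper per token.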

Scaling with NVLink and NVSwitch

Multi-GPU scaling is where the A100 SXM4 becomes difficult to match. NVLink 3.0 delivers up to 600 GB/s of GPU-to-GPU bandwidth per GPU in an eight-way configuration, enabling near-linear scaling for tensor parallelism.

The L40 relies on PCIe Gen 4 for inter-GPU communication, which limits scaling efficiency for models that exceed single-GPU VRAM. For training at 30B parameters and above across multiple GPUs, the A100 SXM4 with NVSwitch is in a different category.
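To get a feel for the interconnect gap, the sketch below estimates the time for one ring all-reduce, the collective commonly used to synchronize gradients. It uses the textbook approximation that each GPU sends 2(n-1)/n times the payload, and it ignores latency, topology, and overlap with compute. The per-direction bandwidths are assumptions: 300 GB/s for NVLink 3.0 (half of the 600 GB/s bidirectional figure) and about 32 GB/s for PCIe Gen 4 x16.

```python
def ring_allreduce_seconds(payload_gb: float, gpu_count: int, link_gb_s: float) -> float:
    """Bandwidth-only estimate: each GPU sends 2*(n-1)/n of the payload."""
    if gpu_count < 2:
        return 0.0  # nothing to exchange on a single GPU
    traffic_gb = 2 * (gpu_count - 1) / gpu_count * payload_gb
    return traffic_gb / link_gb_s


if __name__ == "__main__":
    payload_gb = 14  # e.g. FP16 gradients for a ~7B-parameter model
    gpus = 8
    nvlink = ring_allreduce_seconds(payload_gb, gpus, link_gb_s=300)  # assumed
    pcie = ring_allreduce_seconds(payload_gb, gpus, link_gb_s=32)     # assumed
    print(f"NVLink 3.0: {nvlink:.3f} s per all-reduce")
    print(f"PCIe Gen 4: {pcie:.3f} s per all-reduce")
    print(f"Ratio: {pcie / nvlink:.1f}x")
```

Under these assumptions the ratio depends only on link bandwidth, which is why the gap persists regardless of model size once communication becomes the limiting factor.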

Which Models Run on the L40 vs A100

VRAM capacity determines which models you can run without quantization; memory bandwidth determines how fast they run. The table below maps common models to each GPU.

Model	NVIDIA L40 (48 GB)	NVIDIA A100 (80 GB)
Llama 3.1 8B (FP16, ~16 GB)	Yes	Yes
Mistral 7B (FP16, ~14 GB)	Yes	Yes
Llama 2 13B (FP16, ~26 GB)	Yes	Yes
Mixtral 8x7B (FP16, ~90 GB)	No (fits at INT4, ~24 GB)	No (fits at INT4, ~24 GB)
Llama 3.1 70B (FP16, ~140 GB)	No (fits at INT4, ~35 GB)	No (fits at INT4, ~35 GB)
Llama 3.1 70B (INT4, ~35 GB)	Yes	Yes
Stable Diffusion XL (~7 GB)	Yes	Yes

The FP16 ceiling on a single card is roughly 20 to 24B parameters for the L40 and roughly 35 to 40B for the A100 80 GB, before leaving room for the KV cache. Beyond that, INT4 or INT8 quantization is required, or you need multi-GPU tensor parallelism. Only the A100 SXM4 with NVLink handles multi-GPU tensor parallelism efficiently.
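The footprints in the table follow from a simple rule of thumb: parameter count times bytes per parameter. The sketch below applies it and checks the result against each card's VRAM. It counts weights only, so the `headroom_gb` parameter (an illustrative name) stands in for KV cache, activations, and quantization metadata, all of which vary by workload and serving stack.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
VRAM_GB = {"L40": 48, "A100 80 GB": 80}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom_gb: float = 0.0) -> bool:
    """True if the weights plus a chosen headroom fit in the given VRAM."""
    return weights_gb(params_billion, precision) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params, precision in [(8, "FP16"), (70, "FP16"), (70, "INT4")]:
        size = weights_gb(params, precision)
        verdict = {gpu: fits(params, precision, vram) for gpu, vram in VRAM_GB.items()}
        print(f"{params}B {precision}: ~{size:.0f} GB -> {verdict}")
```

Passing a nonzero `headroom_gb` gives a more conservative answer for long-context or high-concurrency serving, where the KV cache can consume many gigabytes.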

For serving Llama 3.1 70B or Mixtral 8x7B at production latency, two A100 80 GB cards in NVLink configuration are the cleaner option over single-GPU quantized inference.

L40 vs A100 Cloud Pricing Comparison

The table below shows on-demand pricing for providers that list both the A100 80 GB and L40 as of June 2026.

Provider	A100 80 GB ($/hr)	L40 ($/hr)
Thunder Compute	$1.09	$0.79
Hyperstack	$1.35	$1.00
RunPod	$1.39	$0.99
Vast.ai	$1.94	$0.56
CoreWeave	$2.70 (8-GPU nodes)	$1.25 (8-GPU nodes)

Thunder Compute's A100 80 GB at $1.09/hr is the lowest listed A100 rate among providers in this comparison, while Vast.ai lists the lowest L40 rate among marketplace-style offers. For sustained inference or fine-tuning workloads, the savings against higher-priced providers add up quickly at scale.

Use Cases

Both GPUs handle AI workloads well. The decision narrows based on the specific demands of your pipeline.

When to Choose the L40

The L40 is the right choice for inference throughput and cost efficiency. Teams running production inference APIs, fine-tuning smaller models, or keeping costs low will find the L40 a strong fit. It combines Ada Lovelace efficiency, FP8 support, and lower power draw for strong performance per dollar.

It is also a natural fit for single-GPU workloads where NVLink scaling is not a factor. Organizations deploying large GPU fleets benefit from the L40's lower thermal and power footprint at scale.

When to Choose the A100

The A100 is the better option for large-scale distributed training, especially using the SXM4 with NVLink. Teams training foundational models, running scientific simulations, or needing more than 48 GB of VRAM per GPU will find the A100's 80 GB HBM2e and multi-GPU bandwidth indispensable.

It is also the right choice for multi-tenant environments that rely on MIG partitioning to isolate workloads on shared infrastructure. The A100's robust FP64 support makes it appropriate for scientific computing, molecular dynamics, and HPC workloads where double-precision accuracy is required.

Last Thoughts on NVIDIA L40 vs A100

The A100 wins for large-scale distributed training, MIG-based multi-tenancy, and workloads that need more than 48 GB of VRAM. The L40 wins for inference-heavy pipelines and cost-sensitive deployments where FP8 throughput and lower hourly rates reduce cost per token.

Thunder Compute offers the A100 80 GB from $1.09/hr on-demand with no minimum commitment. The VS Code and Cursor extensions let you connect directly from your IDE without configuring SSH or remote containers. See current GPU availability and pricing on Thunder Compute.

To match the right hardware to your workload, see our GPU selection guide for AI workflows.

Frequently Asked Questions

What Is the Difference Between the L40 and L40S?

The L40S is the 2023 successor to the L40. It offers higher rated Tensor Core throughput with Transformer Engine support for FP8, and a higher 350 W TDP (vs 300 W). Both cards include NVENC/NVDEC with AV1 encoding. The L40 (2022) was designed for visualization, digital twins, and moderate AI inference. In 2026, the L40S is the more relevant card for production AI deployments; the L40 remains available at lower hourly rates on cloud marketplaces.

Does the NVIDIA L40 Support MIG?

No. MIG partitioning is available on the A100 and H100 but not on the L40 or L40S. MIG lets a single A100 be divided into up to seven isolated instances, each with dedicated memory and compute. This makes the A100 the stronger fit for multi-tenant inference serving and shared research infrastructure.

Can I Run Llama 3 70B on the NVIDIA L40?

Not in FP16. Llama 3 70B requires roughly 140 GB of VRAM at full precision, well beyond the L40's 48 GB. With 4-bit quantization (INT4), the model footprint drops to around 35 GB, which fits on a single L40. Note that the A100 80 GB also cannot run Llama 3 70B in FP16 on a single card; both GPUs require quantization or multi-GPU setups for this model. For unquantized inference at scale, two A100 SXM4 cards with NVLink are the cleaner solution.

Is the A100 Faster Than the L40 for LLM Training?

Yes, particularly in multi-GPU NVLink configurations. In distributed training at the 7B to 13B parameter scale, the A100 SXM4 delivers 30 to 50% higher throughput than the L40. In single-GPU training of smaller models, the gap narrows because the L40's 4th-gen Tensor Cores and higher CUDA core count partially compensate for lower memory bandwidth.

How Much Does an A100 vs L40 Cost Per Hour on Thunder Compute?

On Thunder Compute, the A100 80 GB starts from $1.09/hr on-demand, one of the lowest tracked rates across major GPU cloud providers. L40 availability and pricing can be checked on the Thunder Compute pricing page.

Is the L40 Good for Inference with vLLM?

Yes, within its VRAM ceiling. The L40's Ada Lovelace architecture and FP8 Tensor Core support work with vLLM. For models up to roughly 20 to 24B parameters in FP16, or up to 70B with INT4 quantization, a single L40 serves production traffic efficiently at low hourly cost. For long-context RAG workloads or large batches that saturate memory bandwidth, the A100's 1,935 to 2,039 GB/s HBM2e becomes the deciding factor.

Which GPU Is Better for FP8 Inference?

The L40 has an advantage for FP8 inference. Its Ada Lovelace 4th-gen Tensor Cores support native FP8 computation, which can increase token throughput compared to the FP16 baseline. The A100 lacks native FP8 Tensor Core support and cannot take the same quantization shortcut. For production inference pipelines where FP8 precision is acceptable, the L40 delivers better throughput per dollar.

Does the A100 Support FP64 for Scientific Computing?

Yes. The A100 delivers 9.7 TFLOPS of FP64 performance, making it appropriate for scientific computing, molecular dynamics simulations, and HPC workloads that require double-precision accuracy. The L40 runs FP64 at only a small fraction of its FP32 rate, so it is not a suitable replacement for the A100 in these workloads.