How to Run GLM-5.2 Locally with Unsloth: Requirements and Setup

Carl Peterson | June 30, 2026 | 10 min read

GLM-5.2 is Z.ai's flagship open-weight model. It lands within a few points of Claude Opus 4.8 on long-horizon coding benchmarks while staying open under an MIT license. The catch is hardware.

At 744B parameters, GLM-5.2 doesn't fit on a laptop. It also doesn't run locally through Ollama the way smaller open models do.

This guide covers the actual GLM-5.2 requirements by quantization level. It explains why Ollama only offers a cloud passthrough for this model, and shows how to self-host it for real using Unsloth's quantized weights on a rented GPU instance.

Does Ollama Run GLM-5.2 Locally?

No. Ollama lists GLM-5.2, but the only available tag is glm-5.2:cloud. That tag routes your prompts through Z.ai's hosted infrastructure instead of loading weights on your machine.

Running ollama pull glm-5.2 without the cloud suffix fails with a manifest error. There's no local download tag published for this model.

This matches how Ollama has handled other frontier-scale open models, including Kimi K2. The cloud tag gives you familiar command syntax, but it's an API wrapper, not local inference. For genuine on-device or self-hosted execution, you need a different toolchain.

Unsloth publishes dynamic GGUF quantizations of GLM-5.2 built specifically for local inference through llama.cpp. That combination is the realistic path to running this model outside of Ollama's cloud tag or Z.ai's own API.

GLM-5.2 VRAM and GPU Requirements

GLM-5.2 is a mixture of experts model with roughly 744B total parameters and about 40B active per token.
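
The usual weights-memory estimate is parameter count times bits per weight, divided by eight. Here is a minimal sketch of that arithmetic, assuming the roughly 744B total-parameter figure above and ignoring KV cache, activations, and file-format overhead. The function name is illustrative, not part of any library.

```python
def weights_size_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return total_params_billions * bits_per_weight / 8


for label, bits in [("BF16", 16), ("FP8", 8), ("uniform 4-bit", 4)]:
    print(f"{label}: ~{weights_size_gb(744, bits):,.0f} GB")
```

This prints roughly 1,488 GB, 744 GB, and 372 GB. Treat these as floors: published GGUF files can come out larger than a uniform-bit estimate, since dynamic quants keep some layers at higher precision.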

The table below breaks down system requirements by precision and quantization level, based on Unsloth's published GGUF sizes and standard weights-memory math for a model this size.

| Precision / Quant | Approximate Size¹ | Minimum Hardware to Run It | Notes |
| --- | --- | --- | --- |
| BF16 / FP16 (full precision) | ~1,488GB | 8x H200 or larger multi-node cluster | Best quality, data center territory only |
| FP8 | ~744GB | 8x H200 | Z.ai's own served precision |
| GGUF 8-bit | ~810GB | 8x H100 PCIe on Thunder Compute (960GB RAM) at $23.12/hr | Near lossless, serving-grade speed |
| GGUF UD-Q4_K_XL (4-bit) | ~372 to 475GB | 4x H100 PCIe on Thunder Compute (480GB RAM) at $11.56/hr | Quality sweet spot per Unsloth's KL divergence testing |
| GGUF 3-bit | ~290 to 360GB | 4x A100 80GB on Thunder Compute (420GB RAM) at $5.96/hr | Realistic cost/quality balance |
| GGUF UD-IQ2_M (2-bit dynamic) | ~245GB | 4x A100 80GB on Thunder Compute (420GB RAM) at $5.96/hr | Most accessible path, ~82% accuracy retention² |
| GGUF 1-bit dynamic | ~223GB | 4x A100 80GB on Thunder Compute (420GB RAM) at $5.96/hr | Usable but a meaningful quality drop |

¹ Figures reflect total memory for weights only, not KV cache for long context sessions, which adds significantly at GLM-5.2's 1M token window.
² Unsloth's Dynamic 2.0 quantization applies higher precision to sensitive layers, so accuracy retention beats a naive quant at the same bit depth.

No single GPU can run GLM-5.2 on its own: even the most aggressive 1-bit quantization needs ~223GB, more than any single card offers.

The entry point is a 4x A100 80GB instance, which covers the 1-bit through 3-bit quant tiers with room to spare. The 4-bit tier needs a 4x H100 PCIe instance for its 360GB of VRAM and 480GB system RAM pool, and the 8-bit tier requires 8x H100 PCIe.
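
To make the tier-to-instance mapping concrete, here is a small sketch that lists which GGUF tiers fit a given memory pool. The sizes are the upper-bound figures from the table above. The `reserve_gb` parameter is a hypothetical knob for holding memory back for KV cache and the OS, which the table does not account for.

```python
# Upper-bound weight sizes in GB, copied from the requirements table.
QUANT_SIZES_GB = {
    "1-bit dynamic": 223,
    "UD-IQ2_M (2-bit)": 245,
    "3-bit": 360,
    "UD-Q4_K_XL (4-bit)": 475,
    "8-bit": 810,
}


def tiers_that_fit(memory_gb: float, reserve_gb: float = 0.0) -> list[str]:
    """Return the quant tiers whose weights fit in memory_gb minus reserve_gb."""
    usable = memory_gb - reserve_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= usable]


print(tiers_that_fit(420))  # memory pool listed for the 4x A100 80GB instance
print(tiers_that_fit(480))  # memory pool listed for the 4x H100 PCIe instance
```

With no reserve, the 420GB pool admits the 1-bit through 3-bit tiers and the 480GB pool adds the 4-bit tier. Setting a non-zero `reserve_gb` shows how quickly long-context headroom can push a tier out of reach.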

The Realistic Hardware Paths

Consumer GPU Rig

A 4x RTX 3090 or 4x RTX 4090 rig with 192 to 256GB of system RAM should run the 2-bit dynamic GGUF using CPU and GPU hybrid offloading. But that's assuming you already own or are willing to buy four consumer GPUs before confirming the model fits your workflow.
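
As a rough illustration of what hybrid offloading means for this rig, the sketch below splits the weights between VRAM and system RAM. It assumes 24GB per card (true of both the RTX 3090 and RTX 4090) and the ~245GB 2-bit size from the table. It counts weights only, so real headroom for the OS and KV cache will be smaller. The function is illustrative, not a llama.cpp API.

```python
def offload_split(model_gb, num_gpus, vram_per_gpu_gb, system_ram_gb, vram_reserve_gb=0.0):
    """Return (gb_on_gpu, gb_in_system_ram, fits) for a weights-only estimate."""
    usable_vram = max(num_gpus * vram_per_gpu_gb - vram_reserve_gb, 0)
    on_gpu = min(model_gb, usable_vram)   # fill VRAM first
    in_ram = model_gb - on_gpu            # the remainder spills to system RAM
    return on_gpu, in_ram, in_ram <= system_ram_gb


on_gpu, in_ram, fits = offload_split(245, num_gpus=4, vram_per_gpu_gb=24, system_ram_gb=192)
print(f"GPU: {on_gpu:.0f} GB, RAM: {in_ram:.0f} GB, fits: {fits}")
```

Under these assumptions about 96GB of weights sit on the GPUs and about 149GB in system RAM, so well over half the model is served from the slower memory tier.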

Apple Silicon

A Mac Studio with an M3 Ultra or M4 Ultra chip and 256GB+ of unified memory is one of the cleanest local paths. The 2-bit dynamic quant fits directly within that pool, and because CPU and GPU share it, there is no PCIe bottleneck between them.

The tradeoff is upfront cost: a maxed-out Mac Studio runs around $15,000 before you've run a single inference.

Rented Cloud GPUs

Buying 4 consumer GPUs or a high-memory Mac Studio is a real commitment if you're not sure GLM-5.2 fits your workflow yet. Renting GPU instances by the hour lets you test at data center grade, at higher precision and faster throughput, without the upfront cost.

Thunder Compute's 4x A100 80GB instance ($5.96/hr) is the entry point for genuine GPU-speed inference on the 2-bit or 3-bit quants. All instances spin up in under a minute and bill by the minute, so a short testing session costs a few dollars rather than the thousands a hardware purchase would.

Running GLM-5.2 with Unsloth and llama.cpp

The following steps assume a Linux GPU instance with CUDA drivers installed. The setup is the same whether you're running on your own workstation or a rented instance.
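
Before downloading several hundred gigabytes, it can help to confirm how much VRAM the instance actually exposes. The sketch below sums the per-GPU totals reported by `nvidia-smi`. The helper names are illustrative, and it assumes `nvidia-smi` is on the PATH, which normally follows from the CUDA driver install.

```python
import subprocess


def parse_total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum per-GPU memory totals (one MiB value per line)."""
    return sum(int(line) for line in nvidia_smi_output.splitlines() if line.strip())


def query_total_vram_mib() -> int:
    """Ask nvidia-smi for each GPU's total memory and add the values up."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_vram_mib(result.stdout)
```

On the GPU instance, `query_total_vram_mib() / 1024` gives the combined VRAM in GiB, which you can compare against the quant size you plan to download.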

First, install the Hugging Face CLI tools and download the quantized weights. Pick the quant tier that fits your available memory from the table above.

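```bash
# Install the Hugging Face CLI and the accelerated transfer backend
pip install -U huggingface_hub hf_transfer

# Download only the files for the chosen quant tier
hf download unsloth/GLM-5.2-GGUF \
  --local-dir unsloth/GLM-5.2-GGUF \
  --include "*UD-Q3_K_XL*"
```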
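
If you would rather script the download, the same filter can be expressed with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument takes glob patterns. This is a sketch that mirrors the repository name and pattern from the command above. The `matches_quant` helper is hypothetical and only there to preview which filenames a pattern would select.

```python
from fnmatch import fnmatch

REPO_ID = "unsloth/GLM-5.2-GGUF"
PATTERN = "*UD-Q3_K_XL*"


def matches_quant(filename: str, pattern: str = PATTERN) -> bool:
    """Preview whether a repository file would be selected by the glob pattern."""
    return fnmatch(filename, pattern)


def download_quant(repo_id: str = REPO_ID, pattern: str = PATTERN,
                   local_dir: str = "unsloth/GLM-5.2-GGUF") -> str:
    """Download only the files matching pattern and return the local path."""
    # Imported lazily so the preview helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir,
                             allow_patterns=[pattern])
```

Calling `download_quant()` starts the transfer. Swapping `PATTERN` for another tier's name selects a different quant from the same repository.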

Next, build llama.cpp from source. Set -DGGML_CUDA=ON for NVIDIA GPUs, or -DGGML_CUDA=OFF on Apple Silicon, where Metal support is enabled by default.

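```bash
# Fetch the source and configure a static CUDA build
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON

# Build only the CLI and server binaries
cmake --build llama.cpp/build --config Release -j --clean-first \
  --target llama-cli llama-server

# Copy the binaries next to the source tree for shorter paths
cp llama.cpp/build/bin/llama-* llama.cpp
```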

Once the build finishes, launch the model with llama-server to expose an OpenAI-compatible endpoint. The temperature and top_p values below match Unsloth's recommended default settings for most use cases.

```bash
# Point --model at the first shard; llama.cpp picks up the remaining shards
./llama.cpp/llama-server \
  --model unsloth/GLM-5.2-GGUF/UD-Q3_K_XL/GLM-5.2-UD-Q3_K_XL-00001-of-00006.gguf \
  --alias "unsloth/GLM-5.2" \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --ctx-size 32768 \
  --port 8001
```
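
Loading hundreds of gigabytes of weights takes a while, and requests sent before the model is ready will fail. One way to handle this is to poll llama-server's `/health` route until it returns HTTP 200. The sketch below assumes the port from the command above. The timeout and interval defaults are arbitrary placeholders.

```python
import time
import urllib.request


def wait_until_ready(base_url: str = "http://127.0.0.1:8001",
                     timeout_s: float = 1800.0,
                     interval_s: float = 5.0) -> bool:
    """Poll the health route until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP errors such as
            # a 503 returned while the model is still loading.
            pass
        time.sleep(interval_s)
    return False
```

A script can call `wait_until_ready()` right after starting the server and only begin sending prompts once it returns `True`.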

With the server running, call it from any OpenAI SDK compatible client by pointing the base URL at your instance.

```python
from openai import OpenAI

# llama-server does not check the key, but the SDK requires a non-empty value
client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required",
)

# The model name must match the --alias passed to llama-server
completion = client.chat.completions.create(
    model="unsloth/GLM-5.2",
    messages=[{"role": "user", "content": "Outline a plan to refactor a FastAPI service to async I/O."}],
)
print(completion.choices[0].message.content)
```
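
For environments where installing the OpenAI SDK is not an option, the same call can be made with the standard library alone. This is a sketch of the request an OpenAI-compatible chat endpoint expects, assuming the base URL and alias used above. The helper names are illustrative, and the sampling values simply repeat the ones passed to llama-server.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://127.0.0.1:8001/v1",
                       model: str = "unsloth/GLM-5.2") -> urllib.request.Request:
    """Build a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("...")` against a running server returns the reply as a string, mirroring `completion.choices[0].message.content` in the SDK version.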

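This same endpoint can serve as a custom model target for coding agents. Cursor accepts a custom OpenAI-compatible base URL in place of a hosted provider. Claude Code is configured through ANTHROPIC_BASE_URL and speaks Anthropic's Messages API, so it needs either a llama-server build that exposes an Anthropic-compatible route or a translation proxy in front of the endpoint.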

Self-Hosting vs the API: The Real Cost Math

Self-hosting only makes sense once you compare it against what Z.ai already charges. GLM-5.2's API pricing runs $1.40 per M input tokens and $4.40 per M output tokens, the bar a self-hosted setup needs to beat.

A 4x A100 80GB instance on Thunder Compute running the UD-IQ2_M quant puts your effective cost per M tokens well under the API rate, once usage is high enough to keep the GPUs busy.

The breakeven point depends on your actual request volume, since GPU rental bills by the hour whether or not tokens are flowing.
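
The breakeven arithmetic can be sketched directly. The hourly rate and API prices come from the figures above. The throughput you plug in is an assumption you would need to measure on your own workload, since this guide does not publish tokens-per-second numbers.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Effective self-hosted cost per 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def breakeven_tokens_per_second(hourly_rate_usd: float, api_price_per_million_usd: float) -> float:
    """Sustained throughput at which self-hosting matches the API price."""
    return hourly_rate_usd / api_price_per_million_usd * 1_000_000 / 3600


# $5.96/hr instance vs. the $4.40 per M output-token API price
print(round(breakeven_tokens_per_second(5.96, 4.40)))  # about 376 tokens/s
```

Against the output-token price alone, the instance has to sustain roughly 376 tokens per second around the clock, summed across all concurrent requests, before it undercuts the API. Idle hours raise the required rate further.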

| Provider | GPU | Hourly Rate | Best Fit For |
| --- | --- | --- | --- |
| Thunder Compute | 4x A100 80GB | $5.96/hr | Entry config for GLM-5.2 (2-bit/3-bit quants); no long-term contract, per-minute billing |
| RunPod | A100 80GB | $1.39/hr | Community cloud experimentation |
| Lambda Labs | A100 80GB | $2.79/hr | Teams that want managed support, often tied to longer commitments |

For lower volume or exploratory use, the API is simpler and cheaper. Self-hosting is better for:

Hard data residency requirements
High-volume usage that keeps GPUs busy
Fine-tuning on proprietary data

GLM-5 vs GLM-5.1 vs GLM-5.2

Z.ai has shipped three releases in the GLM-5 line within about four months. First, GLM-5 was launched as the initial scaling step up from GLM-4.5, with a 200K token context window.

GLM-5.1 followed in April 2026 with notable coding gains over the base model. GLM-5.2, released June 13, 2026, is the current flagship and the version most third-party providers and community benchmarks reference today. The headline change in 5.2 is a 1M token context window, up from roughly 200K in 5.1, plus improved long-horizon coding performance. Like its predecessors, it ships under an MIT license.

GLM-5.2 is the default unless you have a specific reason to use an earlier release. It carries forward the architecture and licensing of its predecessors while adding meaningfully more usable context.

Last Thoughts on GLM-5.2

GLM-5.2 is open weight and capable, but Ollama's cloud tag isn't the local inference shortcut it appears to be. Real self-hosting runs through Unsloth's GGUF quantizations and llama.cpp, and it demands real hardware regardless of which path you take.

Renting GPUs lets you test GLM-5.2 at full GPU speed, and gives you a clear way to compare self-hosted cost per token against Z.ai's API pricing.

Learn more about Unsloth and how to use it to run the latest models.

Frequently Asked Questions

Can you run GLM-5.2 with Ollama?

Not locally. Ollama only offers a glm-5.2:cloud tag, which routes requests through Z.ai's hosted infrastructure. For real local or self-hosted inference, use Unsloth's GGUF quantizations with llama.cpp instead.

How much VRAM does GLM-5.2 need?

It depends on precision. Full BF16 needs roughly 1,488GB, FP8 needs about 744GB, and Unsloth's GGUF quants range from about 810GB at 8-bit down to 223 to 245GB at 1-bit and 2-bit.

Can GLM-5.2 run on a single GPU?

No. Even 1-bit quantization needs over 220GB of combined memory, more than any single consumer or workstation GPU offers. A realistic setup needs at least 4 GPUs, or a high-capacity unified memory system like a Mac Studio.

Is GLM-5.2 actually open weight?

Yes. Z.ai released the full weights on Hugging Face under an MIT license on June 16, 2026. The license allows commercial use, fine-tuning, and redistribution with no regional restrictions.

What's the cheapest way to test GLM-5.2 without buying hardware?

Rent a GPU instance by the hour. A 4x A100 80GB instance on Thunder Compute runs $5.96/hr and handles the 2-bit or 3-bit quant tiers at full GPU speed. Shut it down when you're done.

Do I need datacenter GPUs, or can I use consumer cards?

Consumer cards like the RTX 3090 or 4090 can run GLM-5.2 in a 4-card setup with system RAM offloading, but expect single-digit tokens per second. Datacenter GPUs like the A100 run the same quant tiers much faster and rent by the hour.

What's the difference between GLM-5, GLM-5.1, and GLM-5.2?

GLM-5 launched as the initial scaling step up from GLM-4.5 with a 200K context window. GLM-5.1 followed in April 2026 with coding gains. GLM-5.2, released June 13, is the current flagship with a 1M token context and an MIT license.

Can GLM-5.2 connect to coding agents like Claude Code or Cursor?

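Yes. Once llama-server is running, it exposes an OpenAI-compatible endpoint. Point Cursor at that base URL the same way you'd point it at any custom model provider. Claude Code expects an Anthropic-style endpoint set through ANTHROPIC_BASE_URL, so confirm your llama-server build exposes one or put a translation proxy in front of it.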