Kimi K2 is an open-weight large language model developed by Moonshot AI, a Beijing-based AI lab. It launched in July 2025 and quickly ranked first among open-source models on the LMSYS Arena leaderboard (July 17, 2025). The model is optimized for agentic tasks: tool use, multi-step planning, and autonomous code execution.
Kimi K2 focuses on software engineering and long-horizon reasoning. This means K2 was designed for agentic and coding contexts, making it a serious alternative to proprietary frontier models for developer workloads.
Kimi K2 Model Versions
Since the original July 2025 release, Moonshot AI has shipped multiple iterations of the K2 series:
| Version | Release Date | Key Addition |
|---|---|---|
| Kimi K2 Base | July 10, 2025 | Original open-weight release, 128K context |
| Kimi K2 Instruct | September 5, 2025 | Improved agentic coding, 256K context |
| Kimi K2 Thinking | November 6, 2025 | Extended chain-of-thought reasoning mode |
| Kimi K2.5 | January 27, 2026 | Native multimodal (vision + text), Agent Swarm |
| Kimi K2.6 | April 20, 2026 | 300-agent swarm, production-grade coding |
This guide focuses on the core Kimi K2 Instruct model. It's the version available through Ollama and the most relevant starting point for self-hosted inference.
Understanding the Kimi K2 Model
Kimi K2 uses a Mixture-of-Experts (MoE) architecture, which lets it scale to massive parameter counts without proportionally increasing inference cost.
A routing mechanism selects a small subset of specialized "expert" networks for each input, rather than activating every parameter. This is how K2 has 1 trillion parameters while computing through only 32 billion per forward pass.
The model was trained on 15.5 trillion tokens using the MuonClip optimizer, a method Moonshot AI developed to stabilize large-scale MoE training. Moonshot AI released two variants: Kimi-K2-Base (the foundation model) and Kimi-K2-Instruct (the post-trained version). The instruct variant is the right choice for chat, agentic use, and tool-calling workflows.
Kimi K2 Context Window Size
In July 2025, the original Kimi K2 Base release shipped with a 128K context window, the Instruct version extended this to 256K tokens in September. This expanded context affects agentic workloads as it enables longer codebases, multi-file refactors, and extended tool-use sessions.
For local inference with llama.cpp or Ollama, context window size directly affects VRAM requirements. Configure it conservatively unless you have ample memory headroom.
Kimi K2 Benchmarks: How It Compares to GPT and Claude
Kimi K2's benchmark proves it is strong in software engineering and agentic tool-use. The numbers below are from the official technical report and third-party evaluations, all under non-thinking (standard inference) settings.
| Benchmark | Kimi K2.6 | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|---|
| SWE-Bench Verified | 80.2% | ~83%1 | 87.6% |
| SWE-Bench Pro | 58.6%2 | 58.6% | 64.3% |
| LiveCodeBench v6 | 89.6% | — | — |
| BrowseComp | 83.2% / 86.3%3 | 84.4% | 79.3% |
| HLE with tools | 54.0% | 41.4% | 54.7% |
How Good Is Kimi K2?
Kimi K2 Strengths
Kimi K2 is genuinely strong in the areas it was designed for. SWE-Bench results show it consistently outperforms GPT-4.1 and DeepSeek-V3 on software engineering tasks.
On LiveCodeBench, which tests real coding scenarios that can't be memorized from training data, K2 leads the open-source field by a meaningful margin.
Kimi K2 Weaknesses
Where K2 is less competitive is in broad multimodal tasks. The base K2 model is text-only; vision capabilities were added in K2.5 (January 2026).
Inference speed on quantized local builds is also a real constraint: the 1.8-bit quant maxes out at a few tokens per second without enterprise multi-GPU hardware.
Kimi K2 Use Cases
K2 is best suited for:
- Developers who need a high-quality open-weight coding and agentic model
- Teams that want to self-host for data privacy reasons
- Workflows that can route inference through cloud GPUs like Thunder Compute when local hardware falls short
Why Kimi K2 Needs a Cloud GPU to Run
Running Kimi K2 locally is technically possible but practically difficult and expensive. The model's size means even the most aggressively quantized versions require hardware investments of $100K-$200K.
The Hardware Requirements at 1T Parameters
K2's MoE architecture keeps inference memory requirements lower than a dense 1T model, but they're still substantial. The model weights in FP8 format take up about 1TB on disk. Even the 1.8-bit quantized GGUF comes in around 250GB, requiring at least 247GB of combined RAM and VRAM for usable throughput.
A Q4 quantization runs to approximately 584GB. Running it effectively requires at least 600GB of combined RAM and VRAM, meaning multi-GPU server configurations like 8x H100 or H200 nodes. A 24GB consumer GPU like an RTX 4090 can handle the 1.8-bit variant by offloading MoE layers to system RAM, but inference speed drops to roughly 1–2 tokens per second — too slow for practical use.
Why Thunder Compute Is a Viable Option
Thunder Compute is a Y Combinator-backed cloud GPU provider offering on-demand access to enterprise GPUs like A100s and H100s at a fraction of the cost of owning hardware.
Crucially for this workflow, Thunder Compute ships a pre-configured Ollama template that handles the entire stack installation automatically. This removes the biggest friction point when working with large models in the cloud.
| Configuration | VRAM | Quantization | Expected Speed | Storage | Hourly Price |
|---|---|---|---|---|---|
| 4× A100 80GB1 | 320 GB | Q2 GGUF (~340 GB, CPU offload required) | 1–5 tok/s | 400GB | $7.28 |
| 4× H100 PCIe 80GB1 | 320 GB | Q2 GGUF (~340 GB, CPU offload required) | 3–8 tok/s3 | 400GB | $9.96 |
| 8× A100 80GB | 640 GB | Native INT4 (~594 GB) | 10–20 tok/s | 700GB | $14.56 |
| 8× H100 PCIe 80GB | 640 GB | Native INT4 (~594 GB) | 25–40 tok/s3 | 700GB | $19.92 |
- https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF
- https://huggingface.co/moonshotai/Kimi-K2-Instruct
How to Run Kimi K2 with Ollama on Thunder Compute
The steps below walk through the complete setup from account creation to a live Ollama API endpoint serving Kimi K2. The process takes around 10 to 15 minutes depending on model download speed.
Step 1: Install the Thunder Compute CLI
Download and install tnr for Windows, or macOS.
Run this command for Linux:
curl -fsSL https://raw.githubusercontent.com/Thunder-Compute/thunder-cli/main/scripts/install.sh | bash
Step 2: Login
tnr login
Step 3: Launch and connect to an Ollama instance
tnr create --template ollama
Pick the hardware configuration for your instance. Refer to the table above for recommended specs for each version.
Step 4: Connect to Your Instance and Start Ollama
Establish a connection once the instance is created.
tnr connect 0
Start the Ollama UI. This will take around a minute. Once it's done loading, click the link provided by the terminal to open Ollama in a browser.
You'll be prompted to create an account.
start-ollama
Step 5: Load the desired model
- In the Ollama UI, click "Select a model".
- Add the URL of the model from the Ollama page
- Click "Pull [MODEL_URL]" in the dropdown.

Your download will start.
A few good variants ordered from lightest to heaviest include:
- https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF
- https://huggingface.co/moonshotai/Kimi-K2-Instruct
Note on Ollama compatibility: Current versions of Ollama (0.9.x) require a manual code change to
llama-hparams.hto raiseLLAMA_MAX_EXPERTSfrom 256 to 384 before recompiling, because Kimi K2 uses 384 experts. Thehuihui_ai/kimi-k2model on the Ollama library includes this patch.
Step 6: Start chatting
Once the model is downloaded you can start interacting with it. Keep in mind that the first response will take significantly longer because the GPU is loading the model onto memory.
Kimi K2 Pricing: API Costs and Cloud vs. Self-Hosted
Understanding the cost landscape helps you decide between the Kimi K2 API, a third-party inference provider, or a self-hosted setup on cloud GPUs like Thunder Compute.
API Pricing by Provider
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Moonshot AI API | Kimi K2 Instruct | ~$0.55 | ~$2.20 |
| OpenRouter | Kimi K2 (0711) | $0.57 | $2.30 |
| OpenRouter | Kimi K2 (0905) | $0.60 | $2.50 |
| Thunder Compute | Self-hosted via Ollama | GPU-hour billing (from $0.35/hr) | Included |
When does self-hosting on Thunder Compute make sense?
For most users, the API is the better default because managed providers are cheaper per token, faster, and require no infrastructure work.
At $7.28–$19.92/hr, Thunder Compute only makes sense for a few scenarios:
- Data sovereignty: If your workload involves regulated data (healthcare, legal, finance) that cannot leave your infrastructure, a self-hosted deployment keeps inference in a controlled environment that third-party API providers don't offer.
- Very high sustained throughput: If you're running a production service generating tens of millions of tokens daily with consistently high GPU utilization, the fixed hourly rate can eventually undercut per-token pricing.
- Fine-tuning and custom weights: If you need to modify the model itself — fine-tuning on proprietary data, merging adapters, or running a custom checkpoint — you need direct access to the weights, which only self-hosting provides.
FAQ
What Is Kimi K2?
Kimi K2 is an open-weight large language model released by Moonshot AI in July 2025. It uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion active parameters per forward pass. The model is optimized for agentic tasks, coding, and long-context reasoning, and it ranked among the highest-scoring open-source models on multiple benchmarks at release.
How Do I Run Kimi K2 Locally?
Running Kimi K2 locally requires at least 247GB of combined RAM and VRAM for the smallest usable quantization (1.8-bit), or upward of 600GB for Q4 quality. Most developers don't have this hardware available. Thunder Compute's Ollama template lets you spin up a ready-to-use inference environment in minutes.
How Do I Use Kimi K2?
There are three main ways to use Kimi K2:
- Kimi chat interface at kimi.com
- API access via the Moonshot AI platform or OpenRouter, both exposing an OpenAI-compatible endpoint
- Self-hosted inference on a Thunder Compute GPU with Ollama
How Good Is Kimi K2?
Kimi K2 is among the strongest open-source models available for coding and agentic tasks. On SWE-Bench Verified, it achieves 65.8% pass@1 in non-thinking mode, exceeding both DeepSeek-V3 and GPT-4.1. On LiveCodeBench v6, it scores 53.7%, leading the open-source field at time of publication. For general reasoning and math olympiad problems, dedicated reasoning models with extended thinking modes score higher.
Who Made Kimi K2?
Kimi K2 was made by Moonshot AI, a Chinese AI company founded in 2023 and headquartered in Beijing. Backed by Alibaba and other investors, the company's valuation has risen rapidly; from $4.8 billion in January 2026 to $20 billion by May 2026 following a $2B Series D.
