Go back

Best Open Source LLMs (July 2026)

Open source large language models have closed the gap with proprietary systems faster than most researchers expected.

Models that required billion-dollar training runs are now downloadable, modifiable, and deployable by any team with access to the right hardware.

Understanding which models lead today's landscape, and how to evaluate them, is essential for anyone building on top of LLMs.

What Makes an LLM "Open Source"?

The term "open source" means something different in the LLM world than in traditional software. A fully open model can be replicated because it publishes weights, training data, architecture code, and training pipeline. It does so under a license allowing free use, modification, and redistribution.

In reality, most models called "open source" are really "open weight": only the model weights are publicly available, but the training data and pipeline remain proprietary.

The Open Source Initiative published its Open Source AI Definition (OSAID) to formalize these distinctions. By its strictest reading, models like DeepSeek R1 and Llama 4 are open weight rather than fully open source, because their training datasets are not released.

For most, the definition is simpler. A model is open source if it lets you:

  • Download weights
  • Run them locally
  • Use them commercially

LLM Model Licences

MIT and Apache 2.0 licenses grant unrestricted use. Custom community licenses, like Meta's Llama 4 license or the Modified MIT used by Kimi K2, add constraints such as user-count thresholds or attribution requirements.

Always read the license before deploying a model commercially.

LLM Evaluation Metrics

Comparing models requires a shared vocabulary of benchmarks. The most commonly cited ones evaluate different capabilities, and no single score captures overall quality.

Benchmarks

The table below outlines the core benchmarks used to stress-test frontier models across advanced logic, autonomous software engineering, and multi-domain reasoning.

Benchmark What It Measures Why It Matters Reference
GPQA Diamond Graduate-level science reasoning Resists memorization; tests true understanding Rein et al. (arXiv 2023)
AIME Multi-step math problem solving Measures structured logical reasoning Mathematical Association of America
SWE-Bench Verified Real-world code issue resolution Best proxy for agentic software engineering OpenAI & SWE-bench Team (2024)
Humanity's Last Exam Extremely hard cross-domain questions Near-ceiling test for general intelligence Center for AI Safety & Scale AI (2025)
MMMLU Multilingual reasoning Evaluates language breadth beyond English Alibaba / EvalScope Project
LiveCodeBench Live coding problem solving Tests coding with contamination-resistant problems Jain et al. (2024)

Inference Performance & Efficiency

A model's practical viability is defined by its operational efficiency. A model that aces every reasoning test but requires 8x H100 GPUs to serve a single user is not production-ready.

When evaluating models, teams must consider:

  • Throughput (Tokens per Second): Measures the volume of text the system can generate concurrently. High throughput is vital for background processing, large-scale data analysis, and keeping infrastructure costs sustainable under heavy user loads.

  • Latency (Time to First Token): Measures how quickly the model begins its response after receiving a prompt. Low latency is critical for user-facing applications, where even a two-second delay breaks the illusion of conversation.

The optimal choice often means sacrificing some accuracy to achieve the speed and cost efficiency your application requires.

Open Source LLM Leaderboard

There is no uncontested leaderboard featuring only open source LLMs. Although now archived, the Open LLM Leaderboard by the Hugging Face community was the reference for comparing open-weight models on standardized benchmarks.

The models explored below hold the top spots in any open source LLM leaderboard.

Model Parameters (Active) GPQA Diamond SWE-Bench AIME 2025 Humanity's Last Exam Live Code Bench Source
GPT-5* N/A 85.7 55.3 99.6 41.7 87.0 OpenAI
Sonnet 4.5 (Thinking) N/A 83.4 68.0 100.0 32.0 64.0 Anthropic
Kimi K2.5 1T (32B) 87.6 76.8 96.1 85.0 Moonshot AI (2026)
Kimi K2 Thinking 1T (32B active) 84.5 71.3 99.1 44.9 83.1 Moonshot AI (2025)
GLM-5.2 744B (40B active) 91.2 62.1† 99.2‡ 40.5 Z.ai (2026)
DeepSeek-R1 671B (37B active) 71.5 49.2 74.0 65.9 DeepSeek AI (2025)
DeepSeek V3 0324 685B (37B active) 59.1 42.0 58.1 20.3 74.1 DeepSeek AI (2025)
Llama 4 Maverick 400B (17B active) 69.8 65.0 43.4 Meta AI (2025)
Llama 4 Scout 109B (17B active) 73.7 68.0 6.7 33.3 Meta AI (2025)
Nemotron Ultra 253B 253B (dense) 76.0 72.5 NVIDIA (2025)

Kimi K2

Moonshot AI's Kimi K2 was one of the most significant open-weight releases of 2025. It demonstrated that a Chinese lab could release models to compete with or beat GPT-5 on key benchmarks.

Both models in the family use a Mixture-of-Experts (MoE) architecture with 1T total parameters and 32B active parameters per inference. They were released under a Modified MIT License that allows broad commercial use.

Learn how to run Kimi K2 with Ollama.

Kimi K2.5

Kimi K2.5 is a native multimodal agentic model. Multimodality means it can process different types of data like text, images, video, and audio. Agentic models have the ability to reason, use tools, and act autonomously.

It was built through continual pretraining on approximately 15T mixed visual and text tokens.

It leads in reasoning (87.6% GPQA Diamond), agentic coding (76.8% SWE-Bench Verified), and visual reasoning (ARC-AGI 2). This arguably makes it the most capable open-weight model available today.

Kimi K2 Thinking

Kimi K2 Thinking is the reasoning-focused variant of the K2 family, optimized for test-time scaling by expanding thinking tokens and tool call rounds simultaneously.

It tops the AIME 2025 math benchmark at 99.1% and leads Humanity's Last Exam at 44.9% with tool use enabled.

Kimi K2 can perform 200 to 300 consecutive tool calls without manual intervention, making it particularly powerful for complex agentic workflows.

GLM-5.2

Z.ai's GLM-5.2 is the strongest open-weight model on several long-horizon coding benchmarks as of its June 2026 release. It's a 744B parameter mixture-of-experts model with 40B parameters active per token, released under an MIT license.

Its GPQA Diamond score of 91.2% and AIME 2026 score of 99.2% place it at the top of the open-weight field on reasoning. On Terminal-Bench 2.1 (81.0%), it trails Claude Opus 4.8 by only 4 points while outperforming every other open-weight model by a wide margin.

GLM-5.2 does not currently support image understanding, which limits it on visual analysis workflows. It requires significant multi-GPU infrastructure for self-hosted inference: the 2-bit quantized variant alone needs roughly 245GB of combined memory.

Learn how to run GLM-5.2 with Unsloth and see the full hardware requirements and cost breakdown.

DeepSeek

DeepSeek's January 2025 release of R1 was a turning point for open-source AI. With a reported training cost of under $6M, it challenged the biggest players. This forced the industry to reassess how much compute was required to reach top-tier performance. Both major DeepSeek models are released under the MIT License.

DeepSeek-R1

DeepSeek-R1 is a 671B parameter model focused on advanced reasoning through reinforcement learning post-training.

It achieves 74% on AIME 2025 and 49.2% on SWE-Bench Verified, performing comparably to models that cost far more to build. Its distilled variants, ranging from 1.5B to 70B parameters, bring strong reasoning to hardware ranging from consumer laptops to mid-range workstations.

Read a full guide on DeepSeek R1 and learn to run it locally.

DeepSeek V3 0324

Released in March 2025, DeepSeek-V3-0324 updates the original V3 with an improved post-training pipeline that borrows reinforcement learning techniques from R1.

The update brought the parameter count to 685B. At release, it outperformed GPT-4.5 in math and coding evaluations, making it a strong general-purpose open-weight baseline.

Llama 4

Meta's Llama family is the infrastructure layer of the open-weight ecosystem. The Llama 4 generation, released in April 2025, introduced MoE architecture and added native multimodality across text, images, and video.

Both Scout and Maverick are available under Meta's Llama 4 Community License, which permits free commercial use for products serving under 700M monthly active users.

Learn how to run Llama 4 with Ollama.

Llama 4 Maverick

Llama 4 Maverick is the flagship generalist model of the Llama 4 family, with 400B total parameters and 17B active per token across 128 experts.

Its 1M-token context window and multimodal capabilities position it as a direct competitor to GPT-4o and Gemini 2.0 Flash. On the multilingual reasoning benchmark MMMLU, Maverick scores 84.6%, and its inference cost and speed make it a practical choice for production deployments. See the Thunder Compute guide to fine-tuning Llama 4 for a step-by-step walkthrough using a single A100 80GB.

Llama 4 Scout

Llama 4 Scout is an efficiency-focused sibling with 109B total parameters and 17B active per token distributed across 16 experts.

Equipped with a 192K-token context window and native multimodal capabilities for text and vision, Scout is engineered for high-speed inference on a significantly smaller GPU footprint. It excels at multi-document summarization, live tool-calling, and parsing dense codebases, making it well suited for local development, edge deployments, and high-volume agentic pipelines.

Nemotron

NVIDIA's Nemotron Ultra 253B takes a different approach from the large MoE models above. It uses Neural Architecture Search to optimize a dense transformer derived from Llama 3.1-405B, achieving competitive benchmark scores with a smaller, more hardware-efficient architecture.

Nemotron Ultra 253B

Released in April 2025, Nemotron Ultra 253B is a dense decoder-only transformer with 253B parameters optimized for reasoning, RAG, and tool-calling tasks.

It operates on a single 8x H100 node using FP8 precision. With reasoning mode enabled, it jumps from approximately 80% to over 97% accuracy on MATH-500 and achieves 76% on GPQA Diamond. The model is released under the NVIDIA Open Model License and supports commercial use.

Open Source LLM Platforms

Choosing a model is the first step. The platform you use to run, fine-tune, or serve it greatly impacts developer experience. Three tools dominate this space: Ollama, Unsloth, and vLLM.

Ollama

Ollama is the fastest way to run open-weight LLMs. A single command like ollama run llama4:maverick downloads and launches the model, handling quantization and hardware detection automatically.

It supports a wide range of model families, exposes an OpenAI-compatible API endpoint, and is the go-to tool for developers who want to prototype quickly without managing infrastructure.

Ollama official website homepage displaying the main interface with navigation menu, logo, and information about running large language models locally on personal computers.

Read the Thunder Compute guide to running Ollama.

Unsloth

Unsloth is an open-source fine-tuning library that rewrites attention and backpropagation kernels in Triton to deliver faster training with lower VRAM consumption.

It supports supervised fine-tuning, QLoRA, and LoRA workflows on models from Llama 4 to Qwen to DeepSeek, and integrates with Hugging Face Transformers for dataset loading and model export. The Thunder Compute Unsloth guide covers installation and first fine-tuning runs in detail.

Unsloth Studio interface for configuring datasets, models, and fine-tuning jobs.

vLLM

vLLM is the production inference engine of choice for serving open-weight models at scale. It implements PagedAttention for efficient KV cache management, continuous batching to maximize GPU utilization, and an OpenAI-compatible server API.

vLLM is the standard choice when a model needs to handle multiple concurrent users or sustained throughput. It supports FP8 and GPTQ quantization, making it practical for deploying large models like Nemotron Ultra 253B on a single 8x H100 node.

Run and Fine-Tune the Best Open Source LLMs on Thunder Compute

Many of the models covered share one requirement: GPU hardware.

Kimi K2.5, GLM-5.2, and DeepSeek R1 require multi-GPU setups for full-precision inference. Even distilled or quantized variants of these models benefit from A100 or H100-class GPUs to deliver acceptable throughput.

Thunder Compute provides on-demand access to A100 and H100 GPUs billed by the minute, with pre-configured templates for both Ollama and Unsloth.

Thunder Compute homepage with GPU templates for Ollama and Unsloth workflows.

Spin up an Ollama instance to run Llama 4 in minutes, or launch an Unsloth environment to fine-tune DeepSeek-R1 on your own dataset without configuring CUDA, PyTorch, or dependency conflicts from scratch.

For fine-tuning specifics, the supervised fine-tuning guide walks through dataset preparation, training configuration, and evaluation.

Last Thoughts on Open Source LLMs

The open-weight field has never moved faster. Kimi K2.5, GLM-5.2, and DeepSeek R1 all reached frontier-level benchmark scores in 2025 and 2026, often on a fraction of the compute budget of their closed-source counterparts. The main bottleneck now is hardware, not model quality.

Thunder Compute's on-demand A100 and H100 instances let you run or fine-tune any of the models above without a long-term infrastructure commitment.

FAQ

Is ChatGPT an LLM?

Yes. ChatGPT is a conversational interface built on GPT-4o and related models developed by OpenAI. ChatGPT is the product; the GPT models are the LLMs.

Is DeepSeek open source?

DeepSeek publishes weights and inference code for R1, V3, and V3 0324 under the MIT License. However, training datasets and the full training pipeline are not released, so it does not meet the OSI's Open Source AI Definition.

What is the difference between open source and open weight LLMs?

A fully open source model publishes weights, training data, architecture code, and training pipeline under a permissive license. An open weight model only releases the weights. Most models marketed as open source are actually open weight.

What is the best open source LLM for coding?

GLM-5.2 leads open-weight models on SWE-bench Pro (62.1%) and Terminal-Bench 2.1 (81.0%) as of June 2026. Kimi K2.5 leads on SWE-Bench Verified at 76.8%. The best choice depends on whether you need self-hosted weights or API access.

What GPU do I need to run large open-weight models?

It depends on the model. Kimi K2.5 and GLM-5.2 require multi-GPU setups with A100 or H100-class GPUs. Smaller distilled variants like DeepSeek-R1 1.5B to 70B can run on consumer GPUs. Thunder Compute offers on-demand A100 and H100 instances billed by the minute.

What platform should I use to run open-weight LLMs locally?

Ollama is the fastest option for local prototyping. Unsloth is the go-to for fine-tuning with lower VRAM usage. vLLM handles production serving at scale with continuous batching and an OpenAI-compatible API.