Open source large language models have closed the gap with proprietary systems faster than most researchers expected. Models that previously required multi-billion-dollar training runs are now downloadable, modifiable, and deployable by any team with access to the right hardware.
Understanding which models lead today's landscape, and how to evaluate them, is essential for anyone building on top of LLMs.
What Makes an LLM Open Source?
The term "open source" is different in the LLM world than in traditional software. A fully open model can be replicated as it publishes: weights, training data, architecture code, and training pipeline. It does so under a license allowing free use, modification, and redistribution.
In practice, most models called "open source" are really "open weight": only the model weights are publicly available, but the training data and pipeline remain proprietary.
The Open Source Initiative published its Open Source AI Definition (OSAID) to formalize these distinctions. By its strictest reading, models like DeepSeek R1 and Llama 4 are open weight rather than fully open source, because their training datasets are not released.
For most, the definition is simpler and a model is open source if it lets you:
<ul><li>Download weights</li><li>Run them locally</li><li>Use them commercially</li></ul>
LLM Model Licences
License types matter, MIT and Apache 2.0 licenses grant unrestricted use. Custom community licenses, like Meta's Llama 4 license or the Modified MIT used by Kimi K2, add constraints such as user-count thresholds or attribution requirements.
Always read the license when considering a model, especially for commercial use.
LLM Evaluation Metrics
Comparing models requires a shared vocabulary of benchmarks. The most commonly cited ones evaluate different capabilities, and no single score captures overall quality.
Benchmarks
To effectively navigate these trade-offs, the industry relies on a suite of rigorous, specialized evaluations. The table below outlines the core benchmarks currently used to stress-test frontier models across advanced logic, autonomous software engineering, and multi-domain reasoning.
| Benchmark | What It Measures | Why It Matters | Reference |
|---|---|---|---|
| GPQA Diamond | Graduate-level science reasoning | Resists memorization; tests true understanding | Rein et al. (arXiv 2023) |
| AIME | Multi-step math problem solving | Measures structured logical reasoning | Mathematical Association of America |
| SWE-Bench Verified | Real-world code issue resolution | Best proxy for agentic software engineering | OpenAI & SWE-bench Team (2024) |
| Humanity's Last Exam | Extremely hard cross-domain questions | Near-ceiling test for general intelligence | Center for AI Safety & Scale AI (2025) |
| MMMLU | Multilingual reasoning | Evaluates language breadth beyond English | Alibaba / EvalScope Project |
| LiveCodeBench | Live coding problem solving | Tests coding with contamination-resistant problems | Jain et al. (2024) |
Inference Performance & Efficiency
A model's practical viability is defined by its operational efficiency. A model that aces every logical reasoning test but requires 8 H100 GPUs to serve a single user is not meant for production.
When evaluating models, teams must consider two performance metrics:
<ul><li><strong>Throughput (Tokens per Second)</strong>: Measures the volume of text the system can generate concurrently. High throughput is vital for background processing, large-scale data analysis, and keeping infrastructure costs sustainable under heavy user loads.</li><li><strong>Latency (Time to First Token)</strong>: Measures how quickly the model begins its response after receiving a prompt. Low latency is important for user-facing applications, where even a two-second delay breaks the illusion of conversation.</li></ul>
Choosing the right model should factor in cognitive depth and computational expense. The optimal choice often means sacrificing accuracy to achieve speed and cost efficiency that your application requires.
Open Source LLM Leaderboard
There is no uncontested Leaderboard that features only open source LLMs. Although now archived, the Open LLM Leaderboard by Hugging Face community was the reference for comparing open-weight models on standardized benchmarks.
Regardless, the models explored below hold the top spots in any open source LLM leaderboard.
| Model | Parameters (Active) | GPQA Diamond | SWE-Bench | AIME 2025 | Humanity's Last Exam | Live Code Bench | Source |
|---|---|---|---|---|---|---|---|
| GPT-5* | N/A | 85.7 | 55.3 | 99.6 | 41.7 | 87.0 | OpenAI |
| Sonnet 4.5 (Thinking) | N/A | 83.4 | 68.0 | 100.0 | 32.0 | 64.0 | Anthropic |
| Kimi K2.5 | 1T (32B) | 87.6 | 76.8 | 96.1 | – | 85.0 | Moonshot AI (2026) |
| Kimi K2 Thinking | 1T (32B active) | 84.5 | 71.3 | 99.1 | 44.9 | 83.1 | Moonshot AI (2025) |
| DeepSeek-R1 | 671B (37B active) | 71.5 | 49.2 | 74.0 | – | 65.9 | DeepSeek AI (2025) |
| DeepSeek V3 0324 | 685B (37B active) | 59.1 | 42.0 | 58.1 | 20.3 | 74.1 | DeepSeek AI (2025) |
| Llama 4 Maverick | 400B (17B active) | 69.8 | 65.0 | — | — | 43.4 | Meta AI (2025) |
| Llama 4 Scout | 109B (17B active) | 73.7 | 68.0 | 6.7 | — | 33.3 | Meta AI (2025) |
| Nemotron Ultra 253B | 253B (dense) | 76.0 | — | 72.5 | — | — | NVIDIA (2025) |
Kimi K2
Moonshot AI's Kimi K2 was one of the most significant open-weight releases of 2025. It demonstrated that a Chinese lab could release models to compete with or beat GPT-5 on key benchmarks.
Both models in the family use a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion active parameters per inference. They were released under a Modified MIT License that allows broad commercial use.
Kimi K2.5
Kimi K2.5 is a native multimodal agentic model. Multimodality means it can process different types of data like text, images, video and audio. Agentic models have the ability to reason, use tools and act autonomously.
I was built through continual pretraining on approximately 15 trillion mixed visual and text tokens.
It leads in reasoning (87.6% GPQA Diamond), agentic coding (76.8% SWE-Bench Verified), and visual reasoning (ARC-AGI 2). This arguably makes it the most capable open-weight model available today.
Kimi K2 Thinking
Kimi K2 Thinking is the reasoning-focused variant of the K2 family, optimized for test-time scaling by expanding thinking tokens and tool call rounds simultaneously.
It tops the AIME 2025 math benchmark at 99.1% and leads Humanity's Last Exam at 44.9% with tool use enabled.
It can perform 200 to 300 consecutive tool calls without manual intervention, making it particularly powerful for complex agentic workflows.
DeepSeek
DeepSeek's January 2025 release of R1 was a turningpoint for open-source AI. With a reported training cost of under $6 million, it challenged the biggest players. This eruption, forced the industry to a reassess how much compute was required to reach top-tier performance. Both major DeepSeek models are released under the MIT License.
DeepSeek-R1
DeepSeek-R1 is a 671B parameter model focused on advanced reasoning through reinforcement learning post-training.
It achieves 74% on AIME 2025 and 49.2% on SWE-Bench Verified, performing comparably to models that cost far more to build.
Its distilled variants, ranging from 1.5B to 70B parameters, bring strong reasoning to hardware that ranges from consumer laptops to mid-range workstations.
DeepSeek V3 0324
Released in March 2025, DeepSeek-V3-0324 updates the original V3 with an improved post-training pipeline that borrows reinforcement learning techniques from R1.
The update brought the parameter count to 685B. At release, it outperformed GPT-4.5 in math and coding evaluations, making it a strong general-purpose open-weight baseline.
Llama 4
Meta's Llama family is the infrastructure layer of the open-weight ecosystem. The Llama 4 generation, released in April 2025, introduced MoE architecture to the lineup and added native multimodality across text, images, and video.
Both Scout and Maverick are available under Meta's Llama 4 Community License, which permits free commercial use for products serving under 700 million monthly active users.
Llama 4 Maverick
Llama 4 Maverick is the flagship generalist model of the Llama 4 family, with 400 billion total parameters and 17 billion active per token across 128 experts.
Its 1-million-token context window and multimodal capabilities position it as a direct competitor to GPT-4o and Gemini 2.0 Flash.
On the multilingual reasoning benchmark MMMLU, Maverick scores 84.6%, and its inference cost and speed make it a practical choice for production deployments. See the Thunder Compute guide to fine-tuning Llama 4 for a step-by-step walkthrough using a single A100 80GB.
Llama 4 Scout
Llama 4 Scout is an efficiency-focused, and highly accessible sibling. With a total of 109B parameters and 17B active parameters per token distributed across 16 experts.
Equipped with a 192,000-token context window and native multimodal capabilities for both text and vision, Scout is engineered for high-speed inference and optimized to run on a significantly smaller GPU footprint.
The model is exceptionally strong at multi-document summarization, live tool-calling, and parsing dense codebases. Its lower hardware requirements and high throughput makes it good for local development, edge deployments, and high-volume agentic pipelines.
Nemotron
NVIDIA's entry into the open-weight leaderboard takes a different approach from the trillion-parameter MoE models. Nemotron Ultra 253B uses Neural Architecture Search to optimize a dense transformer derived from Llama 3.1-405B, achieving competitive benchmark scores with a smaller, more hardware-efficient architecture.
Nemotron Ultra 253B
Released in April 2025, Nemotron Ultra 253B is a dense decoder-only transformer with 253B parameters optimized for reasoning, RAG, and tool-calling tasks.
It operates on a single 8x H100 node using FP8 precision. With reasoning mode enabled, it jumps from approximately 80% to over 97% accuracy on MATH-500 and achieves a 76% score on GPQA Diamond. The model is released under the NVIDIA Open Model License and supports commercial use.
Open Source LLM Platforms
Choosing a model is the first step. The platform you use to run, fine-tune, or serve it greatly impacts developer experience. Three tools dominate this space for open-weight LLMs: Ollama, Unsloth, and vLLM.
Ollama
Ollama is the fastest way to run open-weight LLMs. A single command like ollama run llama4:maverick downloads and launches the model, handling quantization and hardware detection automatically.
It supports a wide range of model families, exposes an OpenAI-compatible API endpoint, and is the go-to tool for developers who want to prototype quickly without managing infrastructure. For a full walkthrough.

Read the Thunder Compute guide to running Ollama.
Unsloth
Unsloth is an open-source fine-tuning library that rewrites attention and backpropagation kernels in Triton to deliver dramatically faster training with lower VRAM consumption.
It supports supervised fine-tuning, QLoRA, and LoRA workflows on models from Llama 4 to Qwen to DeepSeek, and integrates with Hugging Face Transformers for dataset loading and model export. The Thunder Compute Unsloth guide covers installation and first fine-tuning runs in detail.

vLLM
vLLM is the production inference engine of choice for serving open-weight models at scale. It implements PagedAttention for efficient KV cache management, continuous batching to maximize GPU utilization, and an OpenAI-compatible server API.
When a model needs to handle multiple concurrent users or sustained throughput requirements, vLLM is the standard choice. It supports FP8 and GPTQ quantization, making it practical for deploying large models like Nemotron Ultra 253B on a single 8x H100 node.
Run and Fine-Tune the Best Open Source LLMs on Thunder Compute
Many of the models covered have a common requirement: GPU hardware.
Kimi K2.5 and DeepSeek R1 require multi-GPU setups for full-precision inference. Even distilled or quantized variants of these models benefit from A100 or H100-class GPUs to deliver acceptable throughput.
Thunder Compute provides on-demand access to A100 and H100 GPUs billed by the minute, with pre-configured templates for both Ollama and Unsloth.

Spin up an Ollama instance to run Llama 4 in minutes, or launch an Unsloth environment to fine-tune DeepSeek-R1 on your own dataset without configuring CUDA, PyTorch, or dependency conflicts from scratch.
For fine-tuning specifics, the supervised fine-tuning guide walks through dataset preparation, training configuration, and evaluation.
