AI Workflows

Best Open Source LLMs (May 2026)

Last update:
May 28, 2026
10 mins read

Open source large language models have closed the gap with proprietary systems faster than most researchers expected. Models that previously required multi-billion-dollar training runs are now downloadable, modifiable, and deployable by any team with access to the right hardware.

Understanding which models lead today's landscape, and how to evaluate them, is essential for anyone building on top of LLMs.

What Makes an LLM Open Source?

The term "open source" is different in the LLM world than in traditional software. A fully open model can be replicated as it publishes: weights, training data, architecture code, and training pipeline. It does so under a license allowing free use, modification, and redistribution.

In practice, most models called "open source" are really "open weight": only the model weights are publicly available, but the training data and pipeline remain proprietary.

The Open Source Initiative published its Open Source AI Definition (OSAID) to formalize these distinctions. By its strictest reading, models like DeepSeek R1 and Llama 4 are open weight rather than fully open source, because their training datasets are not released.

For most, the definition is simpler and a model is open source if it lets you:

<ul><li>Download weights</li><li>Run them locally</li><li>Use them commercially</li></ul>

LLM Model Licences

License types matter, MIT and Apache 2.0 licenses grant unrestricted use. Custom community licenses, like Meta's Llama 4 license or the Modified MIT used by Kimi K2, add constraints such as user-count thresholds or attribution requirements.

Always read the license when considering a model, especially for commercial use.

LLM Evaluation Metrics

Comparing models requires a shared vocabulary of benchmarks. The most commonly cited ones evaluate different capabilities, and no single score captures overall quality.

Benchmarks

To effectively navigate these trade-offs, the industry relies on a suite of rigorous, specialized evaluations. The table below outlines the core benchmarks currently used to stress-test frontier models across advanced logic, autonomous software engineering, and multi-domain reasoning.

Benchmark What It Measures Why It Matters Reference
GPQA Diamond Graduate-level science reasoning Resists memorization; tests true understanding Rein et al. (arXiv 2023)
AIME Multi-step math problem solving Measures structured logical reasoning Mathematical Association of America
SWE-Bench Verified Real-world code issue resolution Best proxy for agentic software engineering OpenAI & SWE-bench Team (2024)
Humanity's Last Exam Extremely hard cross-domain questions Near-ceiling test for general intelligence Center for AI Safety & Scale AI (2025)
MMMLU Multilingual reasoning Evaluates language breadth beyond English Alibaba / EvalScope Project
LiveCodeBench Live coding problem solving Tests coding with contamination-resistant problems Jain et al. (2024)

Inference Performance & Efficiency

A model's practical viability is defined by its operational efficiency. A model that aces every logical reasoning test but requires 8 H100 GPUs to serve a single user is not meant for production.

When evaluating models, teams must consider two performance metrics:

<ul><li><strong>Throughput (Tokens per Second)</strong>: Measures the volume of text the system can generate concurrently. High throughput is vital for background processing, large-scale data analysis, and keeping infrastructure costs sustainable under heavy user loads.</li><li><strong>Latency (Time to First Token)</strong>: Measures how quickly the model begins its response after receiving a prompt. Low latency is important for user-facing applications, where even a two-second delay breaks the illusion of conversation.</li></ul>

Choosing the right model should factor in cognitive depth and computational expense. The optimal choice often means sacrificing accuracy to achieve speed and cost efficiency that your application requires.

Open Source LLM Leaderboard

There is no uncontested Leaderboard that features only open source LLMs. Although now archived, the Open LLM Leaderboard by Hugging Face community was the reference for comparing open-weight models on standardized benchmarks.

Regardless, the models explored below hold the top spots in any open source LLM leaderboard.

Model Parameters (Active) GPQA Diamond SWE-Bench AIME 2025 Humanity's Last Exam Live Code Bench Source
GPT-5* N/A 85.7 55.3 99.6 41.7 87.0 OpenAI
Sonnet 4.5 (Thinking) N/A 83.4 68.0 100.0 32.0 64.0 Anthropic
Kimi K2.5 1T (32B) 87.6 76.8 96.1 85.0 Moonshot AI (2026)
Kimi K2 Thinking 1T (32B active) 84.5 71.3 99.1 44.9 83.1 Moonshot AI (2025)
DeepSeek-R1 671B (37B active) 71.5 49.2 74.0 65.9 DeepSeek AI (2025)
DeepSeek V3 0324 685B (37B active) 59.1 42.0 58.1 20.3 74.1 DeepSeek AI (2025)
Llama 4 Maverick 400B (17B active) 69.8 65.0 43.4 Meta AI (2025)
Llama 4 Scout 109B (17B active) 73.7 68.0 6.7 33.3 Meta AI (2025)
Nemotron Ultra 253B 253B (dense) 76.0 72.5 NVIDIA (2025)

Kimi K2

Moonshot AI's Kimi K2 was one of the most significant open-weight releases of 2025. It demonstrated that a Chinese lab could release models to compete with or beat GPT-5 on key benchmarks.

Both models in the family use a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion active parameters per inference. They were released under a Modified MIT License that allows broad commercial use.

Kimi K2.5

Kimi K2.5 is a native multimodal agentic model. Multimodality means it can process different types of data like text, images, video and audio. Agentic models have the ability to reason, use tools and act autonomously.

I was built through continual pretraining on approximately 15 trillion mixed visual and text tokens.

It leads in reasoning (87.6% GPQA Diamond), agentic coding (76.8% SWE-Bench Verified), and visual reasoning (ARC-AGI 2). This arguably makes it the most capable open-weight model available today.

Kimi K2 Thinking

Kimi K2 Thinking is the reasoning-focused variant of the K2 family, optimized for test-time scaling by expanding thinking tokens and tool call rounds simultaneously.

It tops the AIME 2025 math benchmark at 99.1% and leads Humanity's Last Exam at 44.9% with tool use enabled.

It can perform 200 to 300 consecutive tool calls without manual intervention, making it particularly powerful for complex agentic workflows.

DeepSeek

DeepSeek's January 2025 release of R1 was a turningpoint for open-source AI. With a reported training cost of under $6 million, it challenged the biggest players. This eruption, forced the industry to a reassess how much compute was required to reach top-tier performance. Both major DeepSeek models are released under the MIT License.

DeepSeek-R1

DeepSeek-R1 is a 671B parameter model focused on advanced reasoning through reinforcement learning post-training.

It achieves 74% on AIME 2025 and 49.2% on SWE-Bench Verified, performing comparably to models that cost far more to build.

Its distilled variants, ranging from 1.5B to 70B parameters, bring strong reasoning to hardware that ranges from consumer laptops to mid-range workstations.

DeepSeek V3 0324

Released in March 2025, DeepSeek-V3-0324 updates the original V3 with an improved post-training pipeline that borrows reinforcement learning techniques from R1.

The update brought the parameter count to 685B. At release, it outperformed GPT-4.5 in math and coding evaluations, making it a strong general-purpose open-weight baseline.

Llama 4

Meta's Llama family is the infrastructure layer of the open-weight ecosystem. The Llama 4 generation, released in April 2025, introduced MoE architecture to the lineup and added native multimodality across text, images, and video.

Both Scout and Maverick are available under Meta's Llama 4 Community License, which permits free commercial use for products serving under 700 million monthly active users.

Llama 4 Maverick

Llama 4 Maverick is the flagship generalist model of the Llama 4 family, with 400 billion total parameters and 17 billion active per token across 128 experts.

Its 1-million-token context window and multimodal capabilities position it as a direct competitor to GPT-4o and Gemini 2.0 Flash.

On the multilingual reasoning benchmark MMMLU, Maverick scores 84.6%, and its inference cost and speed make it a practical choice for production deployments. See the Thunder Compute guide to fine-tuning Llama 4 for a step-by-step walkthrough using a single A100 80GB.

Llama 4 Scout

Llama 4 Scout is an efficiency-focused, and highly accessible sibling. With a total of 109B parameters and 17B active parameters per token distributed across 16 experts.

Equipped with a 192,000-token context window and native multimodal capabilities for both text and vision, Scout is engineered for high-speed inference and optimized to run on a significantly smaller GPU footprint.

The model is exceptionally strong at multi-document summarization, live tool-calling, and parsing dense codebases. Its lower hardware requirements and high throughput makes it good for local development, edge deployments, and high-volume agentic pipelines.

Nemotron

NVIDIA's entry into the open-weight leaderboard takes a different approach from the trillion-parameter MoE models. Nemotron Ultra 253B uses Neural Architecture Search to optimize a dense transformer derived from Llama 3.1-405B, achieving competitive benchmark scores with a smaller, more hardware-efficient architecture.

Nemotron Ultra 253B

Released in April 2025, Nemotron Ultra 253B is a dense decoder-only transformer with 253B parameters optimized for reasoning, RAG, and tool-calling tasks.

It operates on a single 8x H100 node using FP8 precision. With reasoning mode enabled, it jumps from approximately 80% to over 97% accuracy on MATH-500 and achieves a 76% score on GPQA Diamond. The model is released under the NVIDIA Open Model License and supports commercial use.

Open Source LLM Platforms

Choosing a model is the first step. The platform you use to run, fine-tune, or serve it greatly impacts developer experience. Three tools dominate this space for open-weight LLMs: Ollama, Unsloth, and vLLM.

Ollama

Ollama is the fastest way to run open-weight LLMs. A single command like ollama run llama4:maverick downloads and launches the model, handling quantization and hardware detection automatically.

It supports a wide range of model families, exposes an OpenAI-compatible API endpoint, and is the go-to tool for developers who want to prototype quickly without managing infrastructure. For a full walkthrough.

Ollama official website homepage displaying the main interface with navigation menu, logo, and information about running large language models locally on personal computers.

Read the Thunder Compute guide to running Ollama.

Unsloth

Unsloth is an open-source fine-tuning library that rewrites attention and backpropagation kernels in Triton to deliver dramatically faster training with lower VRAM consumption.

It supports supervised fine-tuning, QLoRA, and LoRA workflows on models from Llama 4 to Qwen to DeepSeek, and integrates with Hugging Face Transformers for dataset loading and model export. The Thunder Compute Unsloth guide covers installation and first fine-tuning runs in detail.

Unsloth Studio interface for configuring datasets, models, and fine-tuning jobs.

vLLM

vLLM is the production inference engine of choice for serving open-weight models at scale. It implements PagedAttention for efficient KV cache management, continuous batching to maximize GPU utilization, and an OpenAI-compatible server API.

When a model needs to handle multiple concurrent users or sustained throughput requirements, vLLM is the standard choice. It supports FP8 and GPTQ quantization, making it practical for deploying large models like Nemotron Ultra 253B on a single 8x H100 node.

Run and Fine-Tune the Best Open Source LLMs on Thunder Compute

Many of the models covered have a common requirement: GPU hardware.

Kimi K2.5 and DeepSeek R1 require multi-GPU setups for full-precision inference. Even distilled or quantized variants of these models benefit from A100 or H100-class GPUs to deliver acceptable throughput.

Thunder Compute provides on-demand access to A100 and H100 GPUs billed by the minute, with pre-configured templates for both Ollama and Unsloth.

Thunder Compute homepage with GPU templates for Ollama and Unsloth workflows.

Spin up an Ollama instance to run Llama 4 in minutes, or launch an Unsloth environment to fine-tune DeepSeek-R1 on your own dataset without configuring CUDA, PyTorch, or dependency conflicts from scratch.

For fine-tuning specifics, the supervised fine-tuning guide walks through dataset preparation, training configuration, and evaluation.

Get the world's
cheapest GPUs

Low prices, developer-first features, simple UX. Start building today.