We ❤️ Open Source

A community education resource

The rise of small language models and why open source is winning

Why AI in production is smaller, cheaper, and finally yours.

Even when the comparison that matters is “better,” “cheaper,” and “yours” versus “bigger,” the AI discussion is still very much about large language models. Large parameter counts, large context windows, and large API bills continue to dominate the frontier. Meanwhile, small language models (roughly defined as those with fewer than 10 billion parameters) are successfully handling production workloads that most teams believed required a model ten times the size.

In fact, according to the Stanford HAI 2025 AI Index, the inference cost for a system performing at the level of GPT-3.5 decreased by more than 280x between November 2022 and October 2024, primarily due to improvements in capabilities among small models and open-weight alternatives that narrowed the performance gap relative to their closed counterparts.

This shift away from larger models toward smaller ones is no longer speculative; it’s happening now, and open source solutions are driving it.

Why small language models are winning now

For a long time, the assumption was that a higher parameter count meant better performance. The Chinchilla paper challenged that directly. By showing that a smaller model trained on more data could outperform a much larger one under the same compute budget, the authors made the case that how compute is allocated between model size and training data matters significantly more than the number of parameters alone.
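As a rough illustration of that trade-off, the sketch below relies on two widely quoted approximations that do not come from this article: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 training tokens per parameter. Both constants are assumptions, and the function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb compute-optimal ratio


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget at the assumed ratio."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / (6 * 20))
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


for budget in (1e21, 1e22, 1e23):
    n, d = compute_optimal_split(budget)
    print(f"{budget:.0e} FLOPs -> {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Under these assumptions, a fixed budget buys a modest parameter count paired with a large token count, which is the intuition behind preferring a well-trained small model over an under-trained large one.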

The economics reinforce this trend. The AI Index reports that training compute for notable models doubles approximately every five months, dataset sizes double every eight months, and the power required for training doubles annually. In this environment, you simply cannot afford to design a system without considering efficiency.

The capability evidence is also difficult to refute. Microsoft’s Phi-2, at 2.7 billion parameters, matches or outperforms much larger models on complex benchmarks, with the gains attributed to advances in training methodology and carefully curated training data rather than scale alone. Microsoft’s Phi-3-mini, at 3.8 billion parameters, was intentionally designed to run efficiently on a modern smartphone while producing output quality comparable to or better than much larger systems.

Mistral 7B outperformed Llama 2 13B across all tested benchmarks, using architectural choices such as grouped-query attention and sliding window attention, both designed to decrease inference costs. Google DeepMind’s Gemma line ships models at both 2B and 7B parameters, each in pre-trained and instruction-tuned versions.
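To make the two attention techniques concrete, here is a minimal sketch in plain Python. It builds a causal sliding-window mask and estimates key-value cache size when fewer key-value heads are shared across query heads. The function names and the configuration values are illustrative assumptions, not figures from any model card.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and limited to the most recent `window` positions.
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that attention has to score."""
    return sum(sum(row) for row in mask)


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   cached_positions: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one sequence."""
    return 2 * layers * kv_heads * head_dim * cached_positions * bytes_per_value


# Illustrative configuration only: 32 query heads sharing 8 key-value heads.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, cached_positions=4096)
grouped = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, cached_positions=4096)
print("KV cache reduction factor:", full // grouped)
print("pairs scored, window=3 vs full causal:",
      attended_pairs(sliding_window_mask(6, 3)),
      attended_pairs(sliding_window_mask(6, 6)))
```

The window caps how many past positions each token scores, and sharing key-value heads shrinks the cache in proportion to the head ratio. Both reduce the memory and compute spent per generated token.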

These examples demonstrate that small models are not inferior models; they represent deliberate engineering decisions.

Why open source controls the small language model stack

The open source ecosystem currently controls nearly every layer of the stack required to make small models viable in production.

For serving, vLLM introduced PagedAttention to reduce the memory wasted on the attention key-value cache and significantly increase throughput. For local inference, llama.cpp has normalized running models on commodity hardware, and formats such as GGUF (GPT-Generated Unified Format) were developed precisely for this kind of use. Ollama bundles the entire experience into an easy-to-use package that lets developers get a model running locally in minutes. This is how most engineers actually use small models today.
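The memory argument behind paged key-value caching can be shown with a toy accounting model. This is a sketch of the general idea only, not vLLM’s implementation. The block size, maximum length, and sequence lengths are hypothetical.

```python
import math


def contiguous_slots(seq_lens: list[int], max_len: int) -> int:
    """Slots reserved when every sequence preallocates max_len positions."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens: list[int], block_size: int) -> int:
    """Slots reserved when each sequence holds only the blocks it has started."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved: int, used: int) -> float:
    """Share of reserved cache slots that hold no tokens."""
    return (reserved - used) / reserved


seq_lens = [100, 700, 33]  # hypothetical tokens generated so far per request
used = sum(seq_lens)
print(f"contiguous waste: {waste_fraction(contiguous_slots(seq_lens, 2048), used):.1%}")
print(f"paged waste:      {waste_fraction(paged_slots(seq_lens, 16), used):.1%}")
```

With fixed-size blocks, the only unused slots are in the last block of each sequence, so the memory that is freed can hold more concurrent requests.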

For specialization, open source approaches have lowered the barriers to creating small models that are truly useful in particular domains. LoRA freezes the base model weights and trains lightweight adapters, reducing the number of trainable parameters by orders of magnitude and cutting memory requirements substantially. QLoRA builds on this approach by quantizing the frozen base model to 4 bits and training LoRA adapters on top, enabling fine-tuning on a single consumer GPU. Post-training quantization methods such as GPTQ and AWQ shrink the memory footprint while largely preserving accuracy. As a result, “specialize the model you can afford” is rapidly becoming standard engineering practice, and open source is where most teams implement it.
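The arithmetic behind these savings is simple enough to sketch. The example below counts trainable parameters for one weight matrix with and without a low-rank adapter, and estimates weight memory at different bit widths. The layer size, rank, and 7-billion-parameter figure are illustrative assumptions, and the memory estimate ignores quantization metadata, activations, and optimizer state.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the whole d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two adapter matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


d, rank = 4096, 8  # hypothetical layer width and adapter rank
ratio = full_trainable_params(d, d) // lora_trainable_params(d, d, rank)
print(f"trainable parameters reduced by {ratio}x for this matrix")

for bits in (16, 8, 4):
    print(f"7B weights at {bits}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

The adapter grows linearly with the layer width while the full matrix grows quadratically, which is why the reduction becomes larger as models get wider.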

Increasingly, the models themselves are open as well. Mistral 7B ships under the Apache 2.0 license. Gemma’s weights are openly available at practical sizes. It is worth being precise about terms, though. Many widely discussed “open” models are better characterized as open-weight, meaning that the model weights are publicly available but the training data, training code, or complete pipeline may not be. The Open Source Initiative’s AI Definition is an attempt to formalize this distinction. Open-weight and open source are related but not identical, and the distinction matters.

Finally, there is a trust argument. Open models lower the barrier to inspecting a model’s behavior, reproducing its results, and subjecting it to community scrutiny. With governance pressure increasing and reliance on frameworks such as the NIST AI Risk Management Framework growing, the ability to examine what a model does will become less of a preference and more of a requirement.

Where small language models outperform large ones in production

Smaller models offer even more compelling advantages in agent-based systems. In a 2025 paper, researchers from NVIDIA argued that agentic AI systems rely heavily on repeated, specialized tasks, which makes small language models more economical for many of the invocations agents require, with selective routing to larger models only when needed.
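One way to picture selective routing is a small-model-first policy that escalates only when the small model is unsure. The sketch below is an assumption about how such a router could look, not the method from the NVIDIA paper. The `Prediction` type, the confidence score, and the threshold are all hypothetical, and both models are passed in as plain callables.

```python
from typing import Callable, NamedTuple


class Prediction(NamedTuple):
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]


def route(prompt: str,
          small_model: Callable[[str], Prediction],
          large_model: Callable[[str], str],
          threshold: float = 0.8) -> tuple[str, str]:
    """Try the small model first; call the large model only below the threshold."""
    first = small_model(prompt)
    if first.confidence >= threshold:
        return "small", first.text
    return "large", large_model(prompt)


# Stub models standing in for real inference calls.
def toy_small(prompt: str) -> Prediction:
    return Prediction("refund", 0.95 if "refund" in prompt else 0.40)


def toy_large(prompt: str) -> str:
    return "handled by the larger model"


print(route("I want a refund", toy_small, toy_large))
print(route("Something unusual happened", toy_small, toy_large))
```

The cost savings depend on how often the small model clears the threshold, so the threshold is best tuned against labeled traffic rather than picked by hand.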

Consider the types of tasks that agents commonly execute in production environments.

Data cleansing and normalization: taking disorganized inputs and transforming them into structured and organized forms. These are pattern recognition and rule-following activities, not generative reasoning. A fine-tuned 3B model consistently produces reliable results in milliseconds, while a frontier model would add significant latency and cost for no tangible improvement in quality.

Structured extraction: identifying names, dates, amounts, and other entities from unstructured text (such as email messages, invoices, or support requests). A small model fine-tuned on your specific schema can match or outperform a general-purpose large model on this task, since it is trained to look for exactly the fields you need and ignore everything else.

Input classification and routing: determining which workflow an incoming request should follow. Should this customer message be handed to a human, start a refund process, or trigger a knowledge base search? This is a straightforward classification problem. A small model fine-tuned on your workflow categories performs it more quickly and cheaply than sending every input through a frontier API.

Grounded chat over a known corpus: answering questions against a fixed set of documentation where answers are inherently present in the source materials. The model does not need to generate new information; it simply needs to locate and summarize relevant information within a predetermined scope.

Automatic user preference selection: learning patterns of user behavior to determine default values, suggest options, or configure workflows based upon user preferences. These are lightweight inference operations applied against structured signal sources, not open-ended generation.
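For tasks like structured extraction, much of the reliability comes from checking the model’s output against the schema before anything downstream consumes it. The sketch below assumes the model was asked to return JSON with exactly three fields. The `Invoice` type and field names are hypothetical examples.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional

EXPECTED_FIELDS = {"vendor", "issued", "amount"}  # hypothetical schema


@dataclass(frozen=True)
class Invoice:
    vendor: str
    issued: date
    amount: Decimal


def parse_invoice(raw: str) -> Optional[Invoice]:
    """Return an Invoice if the model output matches the schema, else None."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
            return None
        if not isinstance(data["vendor"], str) or not isinstance(data["issued"], str):
            return None
        return Invoice(
            vendor=data["vendor"],
            issued=date.fromisoformat(data["issued"]),  # rejects malformed dates
            amount=Decimal(str(data["amount"])),        # rejects non-numeric amounts
        )
    except (ValueError, TypeError, InvalidOperation):
        return None


print(parse_invoice('{"vendor": "Acme", "issued": "2024-03-01", "amount": "19.99"}'))
print(parse_invoice("Sure! Here is the invoice you asked for."))
```

Outputs that fail validation can be retried or escalated, which keeps a small model’s occasional formatting mistakes from reaching the rest of the pipeline.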

Each of these cases demonstrates why model effectiveness is determined more by what surrounds it (prompt structure, fine-tuning strategy, evaluation pipeline architecture) than its size. For almost all agent workloads, focus and structure win over model size every time. There is considerable supporting research behind this claim, and there is also considerable open source tooling available to support it.
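An evaluation harness does not have to be elaborate to be useful. The sketch below scores candidate models on a labeled set and picks the cheapest one that clears an accuracy bar. The candidate tuple layout, the cost figures, and the stub predictors are assumptions made for illustration.

```python
from typing import Callable, Optional

Example = tuple[str, str]                            # (input text, expected label)
Candidate = tuple[str, float, Callable[[str], str]]  # (name, cost per call, predict)


def accuracy(predict: Callable[[str], str], examples: list[Example]) -> float:
    """Fraction of examples where the prediction equals the expected label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    return sum(predict(text) == label for text, label in examples) / len(examples)


def cheapest_passing(candidates: list[Candidate], examples: list[Example],
                     min_accuracy: float) -> Optional[str]:
    """Name of the lowest-cost candidate that meets the bar, or None."""
    passing = [(cost, name) for name, cost, predict in candidates
               if accuracy(predict, examples) >= min_accuracy]
    return min(passing)[1] if passing else None


# Stub predictors standing in for a fine-tuned small model and a frontier API.
def toy_small(text: str) -> str:
    return "refund" if "refund" in text else "search"


def toy_large(text: str) -> str:
    return "refund" if "refund" in text or "money back" in text else "search"


labeled = [("refund my order", "refund"), ("where is the manual", "search")]
print(cheapest_passing([("large", 10.0, toy_large), ("small", 1.0, toy_small)],
                       labeled, min_accuracy=0.9))
```

Running the same labeled set against every candidate turns “is the small model good enough?” into a measurement rather than a guess.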

What the small language model shift means for open source builders

All three elements exist: tooling, models, and fine-tuning and compression techniques. What is emerging is a pattern in which open source solutions control the entire stack, from model weights to serving infrastructure to specialization pipelines.

Therefore, for open source maintainers and builders, the opportunity is not in designing the next frontier model. It is in developing the scaffolding that makes small models effective in specific contexts: the orchestration layer, the fine-tuning workflows, the evaluation harnesses, and the deployment tooling. That is where lasting value creation occurs, and that is where the open source community excels.

Finally, there is an environmental consideration that deserves serious treatment. The Green AI position paper and Strubell et al. (2019) both showed that compute carries real environmental costs as well as economic ones. Efficiency is now an integral part of how we evaluate AI systems, and smaller models represent a direct path toward that goal.

Ultimately, the future of AI for most real-world applications won’t be larger. It will be smaller, specialized, open, and running on infrastructure you control. The open source community is already building it.

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.
