We ❤️ Open Source
A community education resource
Everything you need to know about running LLMs locally
From Ollama to vLLM: A practical guide to selecting, deploying, and scaling local LLMs for privacy, control, and cost savings.
We’re at a really interesting point in our interactions with AI models. Sure, they’re integrated into our developer environments to help write code (or vibe code, if you live on the edge). We use them to research, to ask questions (as I did with my 65-page lease recently), and so much more. Thing is, I think we should consider what it means to rely on a 3rd-party service or API, and I’d like to show you the wonderful world of open, local LLMs you can run on your own hardware right now. If you’d like to see this in video form, please check out the conference recording below (with fun live demos) at All Things Open ‘25.
Why run your own local & private AI models?
There’s a reason that engineers, and even non-technical folks, are quickly adopting technologies like Ollama (which has over 150k stars on GitHub). It’s primarily about the freedom, control, and cost savings that running any other type of software has for the average person.

To start, monthly subscription services for ChatGPT, Claude, and Gemini typically start at $20/person for individual accounts and $25/person for enterprises, but quickly can become more expensive. Or, for example, if you pay by token via an API, you’re paying a fixed rate for AI applications with increasingly larger prompt/context requirements.
With those 3rd party models, you also have limited control over your unique, sensitive data. Want to fine-tune the model to understand medical terminology (ex. SoB means shortness of breath, not a sad person)? That means sending your data to their servers, which is regulated by HIPAA and the EU AI Act, among others. This also applies to AI use cases like RAG (retrieval augmented generation, which pulls in additional data at query time) and agentic AI (using standards like Model Context Protocol to link databases, CRM’s, and more).

Finally, although it’s less talked about, is the maintenance and control over AI deployments. Suppose a proprietary model provider moves from v4 to v5 and depreciates the old model without notice. In that case, you’re in a tricky situation trying to ensure your app and its functionality can adapt too. Or, let’s say for example, a global DNS outage is occurring; running your own local model would be pretty helpful. But, with over 2 million models available on Hugging Face (the “GitHub for AI”), which model should you be using?
Read more: Usage rules: Making AI coding tools accessible to everyone
Selecting the right model for your use case
Approaching this question with “what problem am I trying to solve?” can save you TONS of time. Generally, though, you can split the ecosystem of models into mainly text models, with embedding and vision models also be used depending on the use case.

The text, or instruction-based models, are the ones we use daily to chat back and forth, and augmenting their prompts allows us to chat with our databases (using model context protocol as an example), pull in website information, and help us write code in our IDE.
Thing is, language models can’t “search” our files, so we use embedding models to convert text into numerical representations that can be used for retrieval augmented generation (RAG) for each query we make, pulling in and adding that additional information to the original prompt.
Finally, if we have images, videos, graphs, and other visual elements we want to use with AI, vision models combine both text and visual components to allow us to extract details from an invoice or requisition, for example.

If you’re looking for a recommendation to get started with, I’ll share my local stack: OpenAI’s gpt-oss for typical text work or agentic AI, Qwen3-Coder as a coding assistant, Hugging Face’s all-MiniLM-L6-v2 for storing my data in a vector database, and Google’s Gemma 3 for processing images.
Now’s the exciting part: how do we actually download and run one of these models!?
Running your own local LLMs
I’ve been using the graphic below in some of my talks recently, as in Super Smash Bros. (one of the greatest games of all time), each character had a unique quality or trait that made them desirable to players. With AI, the playing field is quite similar, because there are tools for small, consumer-grade AI deployments, there are tools for app builders or just for power users, and production-ready inference servers, and you’ll get to know all of them in this article.

Simple model serving via the CLI or a GUI
For most use cases, all you need is an easy interface to chat with a model, as well as serve it as an API (using the established OpenAI-compatible spec).
In that case, a tool like Ollama will get you up and running in no time, using familiar docker pull & docker run types of commands. For example, the command below will both pull and run the Llama 3.2 model in the 3-billion-parameter (approx. 1.5 GB) size.
ollama run llama3.2

Read more: Getting started with Ollama
However, you may be concerned about running models in isolated environments (with read-only mode) or ensuring the model has no outbound connectivity for security concerns. That’s why another similar project, Ramalama, uses your container runtime of choice (like Docker or Podman) to locally run and serve models through the approach of containers.
ramalama serve gpt-oss —port 8000
While we could just run gpt-oss, here we’re serving an OpenAI-compatible HTTP API so that any tool or application that can talk to ChatGPT or remote models could just use our local endpoint instead.

If you prefer a more visual-based interface, while the popular LM Studio isn’t open source, there are similar alternatives like Goose, AnythingLLM, and Jan.ai, which provide a ChatGPT-style UI but with open source models.

For building applications that use local LLMs
There’s one, de facto open source project that’s known for making it simple to scaffold and build AI capabilities into your application: LangChain. Whether you’re using Python or even Java, there’s a LangChain capability that easily lets you call “lang” (or “language”) models and “chain” together steps.
Although you may not have AI experience, the community and documentation are extensive and can help build RAG or agentic AI applications. What’s more is that Podman Desktop, the open source GUI for running containers and deploying to Kubernetes, has an “AI Lab” extension of various one-click demo apps that you can learn and build from. All the application source code using LangChain, container-ready, and in various languages like Python, NodeJS, and Java.

Scaling things up on Linux or Kubernetes
Let’s say you’re a platform engineer or need to serve one model to an entire team! Tools like Ollama default to either 1 or 4 concurrent streams per model, so open source projects like vLLM are specifically designed for higher concurrency and throughput (even on the exact same hardware). It’s a pip install vllm to get things working, and you can check out some of our Red Hat-optimized models from leading model labs that we’ve compressed and benchmarked to ensure they’re enterprise-ready.

What are the best use cases for local LLMs?
I’m sure answers will vary, as I’ve heard everything from using AI to summarize one’s personal Obsidian notes to apps that can help determine insurance billing codes. However, as a developer myself, I typically use local LLMs for AI-assisted coding (being able to get responses faster than tools like Cursor or Claude Code is sweet) as well as automating my life through Model Context Protocal (MCP) servers and agents. Let me share how these are set up.
Read more: Deep dive into the Model Context Protocol
Using a local model in my development environment
These days, I always hear about the latest AI-powered fork of VS Code. Both AWS Kiro and Google’s Antigravity are examples, and while they do a good job, my two requirements are: speed (there are many tasks I can do faster than waiting for a 3rd party model) and MCP server support (to fetch new docs, to create a GitHub issue, etc). That’s why I’ve been using tools such as Roo Code (a visual, in-line extension) or OpenCode (a terminal-based, Claude-like code editor). Both of these provide similar functionality as their non-open source and paid counterparts, such as the ability to “plan” new features, “spec” them into a product requirement document, and “implement” them using tools like the Context7 MCP, which pulls in the newest library documentation.

Automating tedious tasks with local LLMs and agentic AI through MCP Servers
Although I’m a developer, a large portion of my day is non-coding activities, things like perusing GitHub issues, responding and status reports on Slack, and debugging issues once I push a container to our Kubernetes cluster. Fortunately, there are integrations, or MCP Servers, to connect my local AI model to all of these services I use every day!
While I’ve detailed a guide to installing these here, make sure you’re running these MCP Servers with limited permissions (read-only is best) and using authorization methods like OAuth where possible. But it’s really fun to have my AI Agent go off and pull Kubernetes pod logs, extract the exact error, and Slack my team the information. Feeling like the 10x developer I always knew I was!

Wrapping up & next steps
First, thank you for making it all this way. For AI in particular, the open source world has been making leaps and bounds, and you’re now equipped with the knowledge of models, tools, and use cases to start running your own AI today! From more transparency and privacy to cost control and more, I’m happy to see the local LLM community growing, and it’s all thanks to open technology. Big thanks to the team at All Things Open! Feel free to reach out to me on LinkedIn or X (formerly Twitter) to nerd out about anything open source or AI-related. Cheers!
More from We Love Open Source
- Getting started with Ollama
- Want to get into AI? Start with this.
- Deep dive into the Model Context Protocol
- Usage rules: Making AI coding tools accessible to everyone
- How I use NotebookLM as my personal research assistant
The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.