Be the first to know and get exclusive access to offers by signing up for our mailing list(s).

Subscribe

We ❤️ Open Source

A community education resource

10 min read

Everything you need to know about running LLMs locally

From Ollama to vLLM: A practical guide to selecting, deploying, and scaling local LLMs for privacy, control, and cost savings.

We’re at a really interesting point in our interactions with AI models. Sure, they’re integrated into our developer environments to help write code (or vibe code, if you live on the edge). We use them to research, to ask questions (as I did with my 65-page lease recently), and so much more. Thing is, I think we should consider what it means to rely on a 3rd-party service or API, and I’d like to show you the wonderful world of open, local LLMs you can run on your own hardware right now. If you’d like to see this in video form, please check out the conference recording below (with fun live demos) at All Things Open ‘25.

Subscribe to our All Things Open YouTube channel to get notifications when new videos are available.

Why run your own local & private AI models?

There’s a reason that engineers, and even non-technical folks, are quickly adopting technologies like Ollama (which has over 150k stars on GitHub). It’s primarily about the freedom, control, and cost savings that running any other type of software has for the average person.

Running local models benefits for developers and organizations
Running local models provides a wide range of benefits for both developers and organizations.

To start, monthly subscription services for ChatGPT, Claude, and Gemini typically start at $20/person for individual accounts and $25/person for enterprises, but quickly can become more expensive. Or, for example, if you pay by token via an API, you’re paying a fixed rate for AI applications with increasingly larger prompt/context requirements.

With those 3rd party models, you also have limited control over your unique, sensitive data. Want to fine-tune the model to understand medical terminology (ex. SoB means shortness of breath, not a sad person)? That means sending your data to their servers, which is regulated by HIPAA and the EU AI Act, among others. This also applies to AI use cases like RAG (retrieval augmented generation, which pulls in additional data at query time) and agentic AI (using standards like Model Context Protocol to link databases, CRM’s, and more).

Retrieval Augmented Generation workflow
With RAG or Retrieval Augmented Generation as an example, your private and unique data needs to be consumed by an LLM before being output to the user.

Finally, although it’s less talked about, is the maintenance and control over AI deployments. Suppose a proprietary model provider moves from v4 to v5 and depreciates the old model without notice. In that case, you’re in a tricky situation trying to ensure your app and its functionality can adapt too. Or, let’s say for example, a global DNS outage is occurring; running your own local model would be pretty helpful. But, with over 2 million models available on Hugging Face (the “GitHub for AI”), which model should you be using?

Read more: Usage rules: Making AI coding tools accessible to everyone

Selecting the right model for your use case

Approaching this question with “what problem am I trying to solve?” can save you TONS of time. Generally, though, you can split the ecosystem of models into mainly text models, with embedding and vision models also be used depending on the use case.

Local LLM instruct vs embedded model
The two most common types of models process either text, or numerical representations of text (for RAG applications).

The text, or instruction-based models, are the ones we use daily to chat back and forth, and augmenting their prompts allows us to chat with our databases (using model context protocol as an example), pull in website information, and help us write code in our IDE.

Thing is, language models can’t “search” our files, so we use embedding models to convert text into numerical representations that can be used for retrieval augmented generation (RAG) for each query we make, pulling in and adding that additional information to the original prompt.

Finally, if we have images, videos, graphs, and other visual elements we want to use with AI, vision models combine both text and visual components to allow us to extract details from an invoice or requisition, for example.

Image and text input example with Hugging Face
Vision models, like this SmolVLM from the Hugging Face team, can input two different modalities (image and text) to process an output

If you’re looking for a recommendation to get started with, I’ll share my local stack: OpenAI’s gpt-oss for typical text work or agentic AI, Qwen3-Coder as a coding assistant, Hugging Face’s all-MiniLM-L6-v2 for storing my data in a vector database, and Google’s Gemma 3 for processing images.

Now’s the exciting part: how do we actually download and run one of these models!?

Running your own local LLMs

I’ve been using the graphic below in some of my talks recently, as in Super Smash Bros. (one of the greatest games of all time), each character had a unique quality or trait that made them desirable to players. With AI, the playing field is quite similar, because there are tools for small, consumer-grade AI deployments, there are tools for app builders or just for power users, and production-ready inference servers, and you’ll get to know all of them in this article.

Different options in the AI ecosystem shown in a video game format
The AI ecosystem has various tools and frameworks for almost any use case you can think of! It’s all about using the right one to solve your challenge.

Simple model serving via the CLI or a GUI

For most use cases, all you need is an easy interface to chat with a model, as well as serve it as an API (using the established OpenAI-compatible spec).

In that case, a tool like Ollama will get you up and running in no time, using familiar docker pull & docker run types of commands. For example, the command below will both pull and run the Llama 3.2 model in the 3-billion-parameter (approx. 1.5 GB) size.

ollama run llama3.2
Example of Ollama 3.2 running on a system
A basic example of pulling, and running, and small language model like Llama

Read more: Getting started with Ollama

However, you may be concerned about running models in isolated environments (with read-only mode) or ensuring the model has no outbound connectivity for security concerns. That’s why another similar project, Ramalama, uses your container runtime of choice (like Docker or Podman) to locally run and serve models through the approach of containers.

ramalama serve gpt-oss —port 8000

While we could just run gpt-oss, here we’re serving an OpenAI-compatible HTTP API so that any tool or application that can talk to ChatGPT or remote models could just use our local endpoint instead.

Example of Ramalama running on a system
Tools like Ramalama and Ollama allow you to run and serve local models with a API-endpoint, accessible via cURL or POST requests.

If you prefer a more visual-based interface, while the popular LM Studio isn’t open source, there are similar alternatives like Goose, AnythingLLM, and Jan.ai, which provide a ChatGPT-style UI but with open source models.

Example open source tool AnythingLLM
Open source tools like AnythingLLM have pre-built integrations to search the web, or even pull in custom data using connectors via the application’s settings.

For building applications that use local LLMs

There’s one, de facto open source project that’s known for making it simple to scaffold and build AI capabilities into your application: LangChain. Whether you’re using Python or even Java, there’s a LangChain capability that easily lets you call “lang” (or “language”) models and “chain” together steps.

Although you may not have AI experience, the community and documentation are extensive and can help build RAG or agentic AI applications. What’s more is that Podman Desktop, the open source GUI for running containers and deploying to Kubernetes, has an “AI Lab” extension of various one-click demo apps that you can learn and build from. All the application source code using LangChain, container-ready, and in various languages like Python, NodeJS, and Java.

Podman Desktop’s AI Lab
Using the Podman Desktop’s AI Lab makes it easier to start from a template app with AI capabilities and containerize it for deployment

Scaling things up on Linux or Kubernetes

Let’s say you’re a platform engineer or need to serve one model to an entire team! Tools like Ollama default to either 1 or 4 concurrent streams per model, so open source projects like vLLM are specifically designed for higher concurrency and throughput (even on the exact same hardware). It’s a pip install vllm to get things working, and you can check out some of our Red Hat-optimized models from leading model labs that we’ve compressed and benchmarked to ensure they’re enterprise-ready.

Llama 3.1 model for summarization and data extraction
Models such as the Llama 3.1 are regularly used for summarization and data extraction, but can be deployed on the vLLM with support for hardware acceleration and distributed inference

What are the best use cases for local LLMs?

I’m sure answers will vary, as I’ve heard everything from using AI to summarize one’s personal Obsidian notes to apps that can help determine insurance billing codes. However, as a developer myself, I typically use local LLMs for AI-assisted coding (being able to get responses faster than tools like Cursor or Claude Code is sweet) as well as automating my life through Model Context Protocal (MCP) servers and agents. Let me share how these are set up.

Read more: Deep dive into the Model Context Protocol

Using a local model in my development environment

These days, I always hear about the latest AI-powered fork of VS Code. Both AWS Kiro and Google’s Antigravity are examples, and while they do a good job, my two requirements are: speed (there are many tasks I can do faster than waiting for a 3rd party model) and MCP server support (to fetch new docs, to create a GitHub issue, etc). That’s why I’ve been using tools such as Roo Code (a visual, in-line extension) or OpenCode (a terminal-based, Claude-like code editor). Both of these provide similar functionality as their non-open source and paid counterparts, such as the ability to “plan” new features, “spec” them into a product requirement document, and “implement” them using tools like the Context7 MCP, which pulls in the newest library documentation.

Exploring Model Context Protocol (MCP) servers
With support for Model Context Protocol servers such as Context7, I’m more confident knowing that my model won’t write code using old documentation

Automating tedious tasks with local LLMs and agentic AI through MCP Servers

Although I’m a developer, a large portion of my day is non-coding activities, things like perusing GitHub issues, responding and status reports on Slack, and debugging issues once I push a container to our Kubernetes cluster. Fortunately, there are integrations, or MCP Servers, to connect my local AI model to all of these services I use every day! 

While I’ve detailed a guide to installing these here, make sure you’re running these MCP Servers with limited permissions (read-only is best) and using authorization methods like OAuth where possible. But it’s really fun to have my AI Agent go off and pull Kubernetes pod logs, extract the exact error, and Slack my team the information. Feeling like the 10x developer I always knew I was!

Goose MCP server example
Using a tool like Goose to connect multiple MCP Servers together and automate parts of my life, such as debugging container logs

Wrapping up & next steps

First, thank you for making it all this way. For AI in particular, the open source world has been making leaps and bounds, and you’re now equipped with the knowledge of models, tools, and use cases to start running your own AI today! From more transparency and privacy to cost control and more, I’m happy to see the local LLM community growing, and it’s all thanks to open technology. Big thanks to the team at All Things Open! Feel free to reach out to me on LinkedIn or X (formerly Twitter) to nerd out about anything open source or AI-related. Cheers!

More from We Love Open Source

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.

Working on something worth sharing? Write for us.

Contribute to We ❤️ Open Source

Help educate our community by contributing a blog post, tutorial, or how-to.

Two World-class Events

If you didn't make it to All Things AI, check out the event summary, and make plans to join us October 19-20 for All Things Open.

Open Source Meetups

We host some of the most active open source meetups in the U.S. Get more info and RSVP to an upcoming event.