23  Local LLMs

Unlike cloud APIs such as OpenAI's, where every request sends your data to an external party, local LLMs run entirely on your own machine.

Reasons to prefer a local LLM:

  • Privacy: your data never leaves your machine, which matters for confidential documents
  • Cost: no per-token charges; once the weights are downloaded, inference is free
  • Offline use: no internet connection or API key required
  • No rate limits: throughput is bounded only by your hardware

The 2025 landscape:

Tool                            Best for
Ollama                          Easiest setup; production-grade local serving; OpenAI-compatible API
HuggingFace Transformers        Programmatic model access, custom pipelines, research
llama.cpp / llama-cpp-python    Maximum efficiency on CPU; GGUF quantised models
GPT4All                         Beginner-friendly GUI + simple Python API

Hardware reality check: as a rule of thumb, a 4-bit quantised model needs roughly 0.5–0.75 GB of memory per billion parameters. A 3B model runs comfortably on a laptop CPU with 8 GB of RAM, a 7B model wants 8–16 GB, and 13B+ models are only pleasant with 16 GB+ of RAM or a dedicated GPU. A GPU is optional but typically speeds up generation by an order of magnitude.

The recommended starting point for most practitioners in 2025 is Ollama — it handles model downloads, quantisation, and serving behind a local REST API with zero configuration.

23.1 Ollama

Ollama is the easiest way to run large language models locally in 2025. It bundles the model runtime, quantisation, and a local API server into a single installation.

Installation:

  • Windows / Mac: download the installer from https://ollama.com
  • Linux: curl -fsSL https://ollama.com/install.sh | sh

Pull a model from the command line (first time only):

ollama pull llama3.2          # 3B parameter model, ~2 GB
ollama pull mistral           # Mistral 7B, ~4 GB
ollama pull gemma2            # Gemma 2 9B by Google, ~5 GB
ollama pull phi4              # Microsoft Phi-4 14B, ~9 GB
ollama pull qwen2.5:7b        # Alibaba Qwen 2.5 7B
ollama pull llava             # LLaVA multimodal (vision + text)
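
Once a model is pulled, the local server (started automatically by the desktop app, or manually with ollama serve) listens on http://localhost:11434. Before touching any SDK you can sanity-check it with a raw HTTP request; a minimal sketch using the requests library against the native /api/generate endpoint:

Code
# pip install requests
import requests

# Non-streaming one-shot generation against the local Ollama server
resp = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3.2', 'prompt': 'Say hello in five words.', 'stream': False}
)
resp.raise_for_status()
print(resp.json()['response'])  # the generated text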

Python SDK: pip install ollama

Code
# pip install ollama
import ollama

# Simple single-turn chat
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the capital of China?'}]
)
print(response['message']['content'])

23.1.1 Available models

You can list models you have already pulled:

Code
import ollama

# List locally available models
models = ollama.list()
for m in models['models']:
    size_gb = m['size'] / 1e9
    print(f"{m['name']:40s}  {size_gb:.1f} GB")

23.1.2 Streaming responses

For long responses, streaming lets you display tokens as they are generated rather than waiting for the full reply.

Code
import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a short poem about data science.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline at end

23.1.3 Multi-turn conversation

Ollama, like all chat APIs, is stateless: context is maintained by passing the full message history with every request. Build a history list and append each exchange.

Code
import ollama

history = []

def chat(user_input, model='llama3.2'):
    history.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model=model, messages=history)
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(chat('My name is Alex. What is a good first programming language to learn?'))
print()
print(chat('What is my name?'))  # model should remember from prior turn
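
Because the full history is re-sent on every call, long conversations eventually exceed the model's context window. A minimal sketch of one common mitigation, keeping only the most recent messages (the cap of 10 is an arbitrary choice):

Code
import ollama

MAX_MESSAGES = 10  # arbitrary cap; tune to your model's context window

def chat_trimmed(history, user_input, model='llama3.2'):
    history.append({'role': 'user', 'content': user_input})
    # Send only the most recent messages to bound the prompt length
    response = ollama.chat(model=model, messages=history[-MAX_MESSAGES:])
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply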

23.1.4 System prompts

A system message sets the model’s persona and constraints. It is sent ahead of all user messages and, in a chat application, is typically hidden from the end user.

Code
import ollama

messages = [
    {'role': 'system', 'content': 'You are a terse data analyst who answers every question '
                                  'in one sentence and always cites a number.'},
    {'role': 'user',   'content': 'Why is Python popular for data science?'}
]

response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

23.1.5 Text summarisation with a local LLM

One of the highest-value use cases for local LLMs is processing confidential documents that cannot be sent to a cloud API.

Code
import ollama

# Read a text file — replace with your document
with open('op_ed.txt', 'r', encoding='utf-8') as f:
    document = f.read()

word_count = len(document.split())
print(f'Document: {word_count:,} words')

# Summarise with local Llama
prompt = f"""Summarise the following article in 3 bullet points.  
Be concise and focus on the main insights.

Article:
{document[:4000]}"""  # trim to fit context window of small models

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])
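
The [:4000] slice simply drops everything past the first few thousand characters. For documents that do not fit in the context window, a common pattern is map-reduce summarisation: summarise fixed-size chunks, then summarise the concatenated summaries. A minimal sketch, reusing document from above (the 4,000-character chunk size is an arbitrary assumption):

Code
import ollama

def summarise(text, model='llama3.2'):
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': f'Summarise in 3 bullet points:\n\n{text}'}]
    )
    return response['message']['content']

# Map: summarise each fixed-size chunk independently
CHUNK = 4000  # characters; a crude stand-in for the model's context window
chunks = [document[i:i + CHUNK] for i in range(0, len(document), CHUNK)]
partials = [summarise(c) for c in chunks]

# Reduce: summarise the combined partial summaries
print(summarise('\n\n'.join(partials)))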

23.1.6 OpenAI-compatible endpoint

Ollama exposes the same REST API as OpenAI (http://localhost:11434/v1). This means you can swap base_url and use existing OpenAI client code unchanged — useful when migrating cloud code to a local model.

Code
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client but not validated by Ollama
)

completion = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain gradient descent in one paragraph.'}]
)
print(completion.choices[0].message.content)
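
The compatibility layer also covers streaming, so existing OpenAI streaming code ports over unchanged:

Code
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Stream tokens through the standard OpenAI streaming interface
stream = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain gradient descent in one paragraph.'}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()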

23.1.7 Vision: describing images with LLaVA

LLaVA (Large Language and Vision Assistant) is a multimodal model that can answer questions about images. Pull it once: ollama pull llava

Code
import ollama

# Replace with an image file on your machine
IMAGE_PATH = '20240321_194345.jpg'

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image.',
        'images': [IMAGE_PATH]
    }]
)
print(response['message']['content'])

23.2 HuggingFace Transformers

The transformers library gives you direct programmatic access to model weights — useful when you need custom pre/post-processing, fine-tuning, or access to models not yet in Ollama.

Key models available on HuggingFace (2025):

Model            HF repo                                  Notes
Llama 3.2 3B     meta-llama/Llama-3.2-3B-Instruct         Gated — requires HF account + access request
Llama 3.1 8B     meta-llama/Meta-Llama-3.1-8B-Instruct    Gated
Mistral 7B v0.3  mistralai/Mistral-7B-Instruct-v0.3       Open
Gemma 2 9B       google/gemma-2-9b-it                     Gated
Phi-4            microsoft/phi-4                          Open
Qwen2.5 7B       Qwen/Qwen2.5-7B-Instruct                 Open
SmolLM2 1.7B     HuggingFaceTB/SmolLM2-1.7B-Instruct      Open; runs on laptop CPU

Required packages:

pip install transformers torch accelerate bitsandbytes

Code
# Authenticate with HuggingFace (required for gated models)
# Store your token in the environment variable HF_TOKEN, not hardcoded here
import os
from huggingface_hub import login

# login(token=os.environ['HF_TOKEN'])  # uncomment and set env var
# Or run once from a terminal: huggingface-cli login

23.2.1 Running Llama 3 with the pipeline API

The HuggingFace pipeline() function is the simplest way to load a model for inference. Use device_map='auto' to spread the model across available GPUs/CPU automatically.

Code
from transformers import pipeline
import torch

# SmolLM2 1.7B — small enough to run on CPU, no login required
model_id = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'

# For larger gated models swap in:
# model_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

pipe = pipeline(
    'text-generation',
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user',   'content': 'Explain overfitting in one paragraph.'}
]

output = pipe(messages, max_new_tokens=256)
print(output[0]['generated_text'][-1]['content'])
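
Generation behaviour is controlled through keyword arguments that pipeline() forwards to the underlying generate() call. Reusing pipe and messages from above, a sketch with sampling enabled (the values are illustrative, not tuned):

Code
# Continuing from the pipeline example above (pipe, messages)
output = pipe(
    messages,
    max_new_tokens=256,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # lower values make output more deterministic
    top_p=0.9          # nucleus sampling cutoff
)
print(output[0]['generated_text'][-1]['content'])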

23.2.2 4-bit quantisation with BitsAndBytes

4-bit quantisation (the NF4 scheme popularised by QLoRA) reduces a 7B model from ~14 GB of VRAM (7B parameters × 2 bytes in fp16) to roughly 4 GB (× 0.5 byte per parameter, plus overhead) with minimal quality loss. Requires bitsandbytes and a CUDA GPU.

Code
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model_id = 'Qwen/Qwen2.5-7B-Instruct'  # open model, no login required

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto'
)

messages = [{'role': 'user', 'content': 'What are decision trees?'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=200)
# Decode only the new tokens (not the prompt)
new_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0])
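
To verify the savings, Transformers can report the loaded model's actual memory footprint (continuing from the block above):

Code
# Confirm how much memory the quantised weights occupy
footprint_gb = model.get_memory_footprint() / 1e9
print(f'Model footprint: {footprint_gb:.1f} GB')  # expect roughly 4-6 GB for a 4-bit 7B model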

23.3 GPT4All

GPT4All (pip install gpt4all) is a beginner-friendly library that downloads and runs GGUF-format models locally. Its Python API is minimal but easy to use. Ollama generally provides better performance and a wider model selection, but GPT4All has the advantage of a GUI desktop app.

Source: https://docs.gpt4all.io

Code
# pip install gpt4all
from gpt4all import GPT4All

# Modern GPT4All models (2025); list at https://gpt4all.io/index.html
# GPT4All will download the model on first run (~2–4 GB)
model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    reply = model.generate('Explain random forests in one paragraph.', max_tokens=300)
    print(reply)

23.3.1 Streaming with GPT4All

As with Ollama, pass streaming=True to receive tokens one at a time:

Code
from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    tokens = []
    for token in model.generate('Write a haiku about machine learning.', streaming=True):
        tokens.append(token)              # accumulate the full reply
        print(token, end='', flush=True)
    full_reply = ''.join(tokens)          # complete text, available after streaming
print()

23.4 Key Takeaways

Approach                    When to use
Ollama                      Default choice — easiest setup, best performance, OpenAI-compatible API
HF Transformers             When you need fine-tuning, custom architecture, or research-grade access
4-bit quantisation (BnB)    Run 7–14B models on consumer GPUs (8–16 GB VRAM)
GPT4All                     Quick demos, GUI, CPU-only environments

Model recommendations (2025):

  • General purpose / chat: llama3.2 (3B) or llama3.1:8b via Ollama
  • Coding: qwen2.5-coder:7b or deepseek-coder-v2
  • Reasoning: phi4 (14B) — strong benchmark performance per GB of model
  • Multilingual: qwen2.5:7b — 29 languages
  • Vision: llava or llama3.2-vision
  • Embedding: nomic-embed-text via ollama pull nomic-embed-text (see the sketch after this list)
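
Embeddings use a dedicated call rather than chat. A minimal sketch with the SDK's embeddings function (the input sentence is arbitrary):

Code
import ollama

# Requires: ollama pull nomic-embed-text
result = ollama.embeddings(model='nomic-embed-text', prompt='Local LLMs keep data on-device.')
vector = result['embedding']
print(len(vector))   # embedding dimensionality (768 for nomic-embed-text)
print(vector[:5])    # first few components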