Unlike OpenAI, where data must be sent to an external party, local LLMs run entirely on your machine.
Reasons to prefer a local LLM:

- Privacy: your data never leaves your machine, so confidential material can be processed safely.
- Cost: no per-token API fees; inference is free once you own the hardware.
- Availability: works offline, with no rate limits or dependence on a provider's uptime.
- Control: you pick the model and version, and it cannot be deprecated or changed under you.
The 2025 landscape:
| Tool | Best for |
|---|---|
| Ollama | Easiest setup; production-grade local serving; OpenAI-compatible API |
| HuggingFace Transformers | Programmatic model access, custom pipelines, research |
| llama.cpp / llama-cpp-python | Maximum efficiency on CPU; GGUF quantised models (sketch after this table) |
| GPT4All | Beginner-friendly GUI + simple Python API |
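As a taste of the llama.cpp route, here is a minimal llama-cpp-python sketch. The model path is a placeholder — point it at any chat-tuned GGUF file you have downloaded (for example from a HuggingFace GGUF repo):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path: any chat-tuned GGUF model file works here
llm = Llama(model_path='./models/llama-3.2-3b-instruct.Q4_K_M.gguf', n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': 'What is the capital of China?'}],
    max_tokens=100,
)
print(out['choices'][0]['message']['content'])
```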
Hardware reality check:

- A 4-bit-quantised 3B model (~2 GB on disk) runs on a typical laptop with 8 GB of RAM, CPU-only.
- 7–9B models (~4–5 GB) are comfortable with 16 GB of RAM; a GPU with 8+ GB of VRAM makes generation much faster.
- 14B models such as Phi-4 (~9 GB) really want a GPU with 12–16 GB of VRAM.
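A quick way to check which bracket your machine falls into (psutil is an extra dependency used only for this check; torch is installed later for the transformers examples):

```python
import psutil   # pip install psutil (only needed for this check)
import torch

# Total system RAM
ram_gb = psutil.virtual_memory().total / 1e9
print(f'System RAM: {ram_gb:.0f} GB')

# GPU memory, if a CUDA device is present
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'GPU: {torch.cuda.get_device_name(0)} ({props.total_memory / 1e9:.0f} GB VRAM)')
else:
    print('No CUDA GPU detected: prefer quantised 3B-class models')
```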
The recommended starting point for most practitioners in 2025 is Ollama — it handles model downloads, quantisation, and serving behind a local REST API with zero configuration.
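That REST API is Ollama's native /api/chat endpoint. Once Ollama is installed and a model pulled (both covered next), you can call it with nothing but the standard library; a minimal sketch:

```python
import json
import urllib.request

# Ollama's native chat endpoint; 'stream': False returns a single JSON object
payload = {
    'model': 'llama3.2',
    'messages': [{'role': 'user', 'content': 'Say hello in five words.'}],
    'stream': False,
}
req = urllib.request.Request(
    'http://localhost:11434/api/chat',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())['message']['content'])
```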
Ollama is the easiest way to run large language models locally in 2025. It bundles the model runtime, quantisation, and a local API server into a single installation.
Installation:

- Windows / Mac: download the installer from https://ollama.com
- Linux: `curl -fsSL https://ollama.com/install.sh | sh`
Pull a model from the command line (first time only):

```bash
ollama pull llama3.2     # 3B parameter model, ~2 GB
ollama pull mistral      # Mistral 7B, ~4 GB
ollama pull gemma2       # Gemma 2 9B by Google, ~5 GB
ollama pull phi4         # Microsoft Phi-4 14B, ~9 GB
ollama pull qwen2.5:7b   # Alibaba Qwen 2.5 7B
ollama pull llava        # LLaVA multimodal (vision + text)
```

Python SDK: `pip install ollama`
```python
import ollama

# Simple single-turn chat
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the capital of China?'}]
)
print(response['message']['content'])
```

You can list models you have already pulled:
```python
import ollama

# List locally available models
models = ollama.list()
for m in models['models']:
    size_gb = m['size'] / 1e9
    print(f"{m['model']:40s} {size_gb:.1f} GB")  # older SDK versions named this field 'name'
```

For long responses, streaming lets you display tokens as they are generated rather than waiting for the full reply.
```python
import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a short poem about data science.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline at end
```

Like all chat APIs, Ollama maintains context by passing the full message history: build a history list and append each exchange.
```python
import ollama

history = []

def chat(user_input, model='llama3.2'):
    history.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model=model, messages=history)
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(chat('My name is Alex. What is a good first programming language to learn?'))
print()
print(chat('What is my name?'))  # the model should remember this from the prior turn
```

A system message sets the model's persona and constraints — it is prepended before all user messages and is not visible to the user.
```python
import ollama

messages = [
    {'role': 'system', 'content': 'You are a terse data analyst who answers every question '
                                  'in one sentence and always cites a number.'},
    {'role': 'user', 'content': 'Why is Python popular for data science?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
```

One of the highest-value use cases for local LLMs is processing confidential documents that cannot be sent to a cloud API.
```python
import ollama

# Read a text file — replace with your document
with open('op_ed.txt', 'r', encoding='utf-8') as f:
    document = f.read()

word_count = len(document.split())
print(f'Document: {word_count:,} words')

# Summarise with local Llama; trim the input to fit the context window of small models
prompt = f"""Summarise the following article in 3 bullet points.
Be concise and focus on the main insights.

Article:
{document[:4000]}"""

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])
```

Ollama also exposes an OpenAI-compatible REST API at http://localhost:11434/v1. This means you can swap in base_url and use existing OpenAI client code unchanged — useful when migrating cloud code to a local model.
```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client but not validated by Ollama
)
completion = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain gradient descent in one paragraph.'}]
)
print(completion.choices[0].message.content)
```

llava (Large Language and Vision Assistant) is a multimodal model that can answer questions about images. Pull it once: `ollama pull llava`
```python
import ollama

# Replace with an image file on your machine
IMAGE_PATH = '20240321_194345.jpg'

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image.',
        'images': [IMAGE_PATH]
    }]
)
print(response['message']['content'])
```

The transformers library gives you direct programmatic access to model weights — useful when you need custom pre/post-processing, fine-tuning, or access to models not yet in Ollama.
Key models available on HuggingFace (2025):
| Model | HF repo | Notes |
|---|---|---|
| Llama 3.2 3B | meta-llama/Llama-3.2-3B-Instruct | Gated — requires HF account + access request |
| Llama 3.1 8B | meta-llama/Meta-Llama-3.1-8B-Instruct | Gated |
| Mistral 7B v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | Open |
| Gemma 2 9B | google/gemma-2-9b-it | Gated |
| Phi-4 | microsoft/phi-4 | Open |
| Qwen2.5 7B | Qwen/Qwen2.5-7B-Instruct | Open |
| SmolLM2 1.7B | HuggingFaceTB/SmolLM2-1.7B-Instruct | Open; runs on laptop CPU |
Required packages:

```bash
pip install transformers torch accelerate bitsandbytes
```

Authenticate with HuggingFace (required for gated models), storing your token in the environment variable HF_TOKEN rather than hardcoding it:

```python
import os
from huggingface_hub import login

# login(token=os.environ['HF_TOKEN'])  # uncomment after setting the env var
# Or run once from a terminal: huggingface-cli login
```

The HuggingFace pipeline() function is the simplest way to load a model for inference. Use device_map='auto' to spread the model across available GPUs/CPU automatically.
```python
from transformers import pipeline
import torch

# SmolLM2 1.7B — small enough to run on CPU, no login required
model_id = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
# For larger gated models swap in:
# model_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

pipe = pipeline(
    'text-generation',
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Explain overfitting in one paragraph.'}
]
output = pipe(messages, max_new_tokens=256)
print(output[0]['generated_text'][-1]['content'])
```

4-bit quantisation (the scheme behind QLoRA) reduces a 7B model from ~14 GB to ~4 GB of VRAM with minimal quality loss. It requires bitsandbytes and a CUDA GPU.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model_id = 'Qwen/Qwen2.5-7B-Instruct'  # open model, no login required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto'
)

messages = [{'role': 'user', 'content': 'What are decision trees?'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=200)

# Decode only the new tokens (not the prompt)
new_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0])
```

GPT4All (`pip install gpt4all`) is a beginner-friendly library that downloads and runs GGUF-format models locally. Its Python API is minimal but easy to use. Ollama generally provides better performance and a wider model selection, but GPT4All has the advantage of a GUI desktop app.
Source: https://docs.gpt4all.io
```python
# pip install gpt4all
from gpt4all import GPT4All

# Modern GPT4All models (2025); list at https://gpt4all.io/index.html
# GPT4All downloads the model on first run (~2–4 GB)
model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    reply = model.generate('Explain random forests in one paragraph.', max_tokens=300)
print(reply)
```

Streaming works here too: pass streaming=True to generate() and iterate over the tokens.
```python
from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    tokens = []  # collect tokens in case you need the full text afterwards
    for token in model.generate('Write a haiku about machine learning.', streaming=True):
        tokens.append(token)
        print(token, end='', flush=True)
print()
```

| Approach | When to use |
|---|---|
| Ollama | Default choice — easiest setup, best performance, OpenAI-compatible API |
| HF Transformers | When you need fine-tuning, custom architecture, or research-grade access |
| 4-bit quantisation (BnB) | Run 7–14B models on consumer GPUs (8–16 GB VRAM) |
| GPT4All | Quick demos, GUI, CPU-only environments |
Model recommendations (2025):

- General chat: llama3.2 (3B) or llama3.1:8b via Ollama
- Coding: qwen2.5-coder:7b or deepseek-coder-v2
- Reasoning: phi4 (14B) — strong benchmark performance per GB of model
- Multilingual: qwen2.5:7b — 29 languages
- Vision: llava or llama3.2-vision
- Embeddings: nomic-embed-text via ollama pull nomic-embed-text (see the sketch below)
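The embedding model deserves a quick demonstration. A minimal sketch, assuming nomic-embed-text has been pulled and using the SDK's embeddings() call; the two sentences are arbitrary examples:

```python
import ollama

# Embed two sentences and compare them with cosine similarity
a = ollama.embeddings(model='nomic-embed-text', prompt='Local models protect privacy.')
b = ollama.embeddings(model='nomic-embed-text', prompt='Running LLMs on-device keeps data private.')

va, vb = a['embedding'], b['embedding']
dot = sum(x * y for x, y in zip(va, vb))
norm = (sum(x * x for x in va) ** 0.5) * (sum(x * x for x in vb) ** 0.5)
print(f'Cosine similarity: {dot / norm:.3f}')
```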