Ollama: The local LLM solution
Using Ollama
Introduction
Ollama is a popular platform for running, managing, and interacting with Large Language Models (LLMs) locally. For Python programmers, Ollama offers seamless integration and robust features for querying, manipulating, and deploying LLMs. In this post I will explore how Python developers can leverage Ollama for powerful and efficient AI-based workflows.
1. What is Ollama?
Ollama is a tool designed to enable local hosting and interaction with LLMs. Unlike cloud-based APIs, Ollama prioritizes privacy and speed by running models directly on your machine. Key benefits include:
- Local Model Execution: Once a model has been downloaded, no internet connection is needed to run it.
- Privacy: Data never leaves your machine.
- Python Integration: Easily interact with Ollama’s models through Python scripts.
- Access to Many Powerful Models: Ollama gives you access to a wide range of open models, such as Llama, Mistral, and Phi.
- No Cost: Ollama and the open models it serves are free to use; there are no per-request fees, only the cost of your own hardware.
For Python programmers, Ollama unlocks possibilities for integrating AI models into projects without relying on external services.
2. Installing and Setting Up Ollama
Step 1: Install Ollama
Download Ollama from its official website (ollama.com) and follow the installation instructions for your operating system.
Step 2: Verify Installation
After installation, verify that Ollama is working by running:
ollama --version
Step 3: Install Python SDK
Ollama can be accessed from Python through its REST API or client libraries. Ensure you have Python installed and create a virtual environment for your project:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install requests for interacting with the Ollama API:
pip install requests
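If you prefer a higher-level client, there is also an official ollama Python package (pip install ollama) that wraps the same API. The rest of this post uses raw REST calls with requests, but a minimal sketch with the official package, assuming a model such as llama3.2 is already installed, looks like this:
import ollama

# Quick sanity check using the official ollama Python client
# (assumes: pip install ollama, a running local server, and the llama3.2 model pulled)
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response["message"]["content"])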
3. Chat with an Ollama Model
Here’s how to interact with an Ollama model using Python:
import requests
import json

# Ollama server URL
OLLAMA_BASE_URL = "http://localhost:11434"

def chat_with_ollama(text, model_name):
    """Chat with Ollama."""
    try:
        url = f"{OLLAMA_BASE_URL}/api/chat"
        data = {
            "model": model_name,
            "messages": [{"role": "user", "content": text}],
            "stream": False
        }
        response = requests.post(url, json=data)
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the JSON response
            chat_response = response.json()
            print("Chat Response:")
            pretty_json = json.dumps(chat_response, indent=4)
            print(pretty_json)
            return chat_response
        else:
            print(f"Failed to chat with the model. Status code: {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return None
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
text = "What is recursion in Python?"
model_name = "llama3.2"
res = chat_with_ollama(text, model_name)
print(res)
Output Example:
Recursion in Python is a process where a function calls itself to solve smaller instances of a problem...
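The function returns the full JSON response; for the /api/chat endpoint the generated text is nested under the message key, so you can pull out just the answer with a small sketch based on the non-streaming chat response shape:
# Extract only the assistant's reply from the chat response dictionary
if res is not None:
    print(res["message"]["content"])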
4. List Ollama models installed
import requests
import json

OLLAMA_BASE_URL = "http://localhost:11434"

def filter_response(models: list[dict], filter_keys: list[str]) -> list[dict]:
    """Keep only the requested keys for each model and print them as a table."""
    filtered = [{key: model.get(key) for key in filter_keys} for model in models]
    for model in filtered:
        print("| " + " | ".join(str(value) for value in model.values()) + " |")
    return filtered

def list_ollama_models():
    """List installed models from the Ollama server."""
    try:
        url = f"{OLLAMA_BASE_URL}/api/tags"
        response = requests.get(url)
        if response.status_code == 200:
            print("Installed Ollama Models:")
            filtered_models = filter_response(response.json()["models"], ["name", "model"])
            return filtered_models
        else:
            print(f"Failed to retrieve models. Status code: {response.status_code}")
            print("Response:", response.text)
            return []
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return []
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return []
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

list_ollama_models()
The result will look something like this:
Installed Ollama Models:
| phi:latest | phi:latest |
| nomic-embed-text:latest | nomic-embed-text:latest |
| phi3:latest | phi3:latest |
| llama3.2:latest | llama3.2:latest |
| mxbai-embed-large:latest | mxbai-embed-large:latest |
| mistral:latest | mistral:latest |
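The /api/tags response contains more than just the names; fields such as size and modified_at are also returned for each model, so you can, for example, report how much disk space each one uses. A quick sketch, assuming those fields are present in your Ollama version:
def list_model_sizes():
    """Print each installed model with its approximate size on disk in GB."""
    response = requests.get(f"{OLLAMA_BASE_URL}/api/tags")
    response.raise_for_status()
    for model in response.json()["models"]:
        # "size" is reported in bytes by the /api/tags endpoint
        size_gb = model.get("size", 0) / (1024 ** 3)
        print(f"{model['name']}: {size_gb:.2f} GB")

list_model_sizes()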
5. Show information on a local model
This will give you detailed information on a model, including its license, Modelfile, parameters, and quantization details. I have found this particularly useful when calling a model with prompts that use parameters.
def show_model(model_name):
    """Show model details."""
    try:
        url = f"{OLLAMA_BASE_URL}/api/show"
        data = {
            "model": model_name
        }
        response = requests.post(url, json=data)
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the JSON response
            model_details = response.json()
            print("Show Model Response:")
            pretty_json = json.dumps(model_details, indent=4)
            print(pretty_json)
            return model_details
        else:
            print(f"Failed to retrieve model details. Status code: {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return None
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
model_name = "llama3.2"
res = show_model(model_name)
print(res)
This will generate a response like the following (truncated here):
{
    "license": "LLAMA 3.2 COMMUNITY LICENSE AGREEMENT Llama 3.2 ...",
    "modelfile": "# Modelfile generated by \"ollama show\" ...",
    "parameters": "stop \"<|start_header_id|>\"\nstop \"<|end_header_id|>\"\nstop \"<|eot_id|>\"",
    "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
            "llama"
        ],
        "parameter_size": "3.2B",
        "quantization_level": "Q4_K_M"
    },
    "model_info": {
        "general.architecture": "llama",
        "general.basename": "Llama-3.2",
        "general.file_type": 15,
        "general.finetune": "Instruct",
        "general.languages": [
            "en",
            "de",
            "fr",
            ...
        ],
        ...
    }
}
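Since the most useful fields are usually the parameters, prompt template, and details, a small helper can pull just those out of the /api/show response. This is a sketch; the template field is an assumption and may not be present for every model:
def show_model_summary(model_name):
    """Print only the parameters, prompt template and details for a model."""
    details = show_model(model_name)
    if details is not None:
        print("Parameters:", details.get("parameters"))
        print("Template:", details.get("template"))
        print("Details:", details.get("details"))

show_model_summary("llama3.2")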
6. Generate Embeddings
Language model embeddings are a way of representing text as numerical vectors (arrays of numbers) in a mathematical space. These vectors capture the meaning and relationships between the text elements in a way that computers can process and analyze.
It is like having multiple database indexes on the text, where the indexes can be on anything we want. This allows us to group similar items together and then use those groupings to extract more information from the data.
def generate_embeddings(text, model_name):
    """Generate embeddings for the given text using the specified model."""
    try:
        # Send a POST request to generate embeddings
        url = f"{OLLAMA_BASE_URL}/api/embeddings"
        data = {
            "prompt": text,
            "model": model_name
        }
        response = requests.post(url, json=data)
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the JSON response
            embeddings = response.json()
            print("Embeddings:")
            pretty_json = json.dumps(embeddings, indent=4)
            print(pretty_json)
            return embeddings
        else:
            print(f"Failed to generate embeddings. Status code: {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return None
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
text = "Hello, world!"
model_name = "nomic-embed-text"
res = generate_embeddings(text, model_name)
print(res)
The response will be something like:
Embeddings:
{
"embedding": [
0.3587897717952728,
-0.019634757190942764,
-3.578793525695801,
....
-1.039941668510437,
0.17351241409778595
]
}
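To see why these vectors are useful, you can compare two embeddings with cosine similarity; semantically similar texts produce vectors that point in similar directions. A minimal sketch using only the standard library and the generate_embeddings function above:
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Similar sentences should score closer to 1.0 than unrelated ones
emb_cat = generate_embeddings("Cats are small domesticated animals.", "nomic-embed-text")
emb_dog = generate_embeddings("Dogs are loyal household pets.", "nomic-embed-text")
if emb_cat and emb_dog:
    print(cosine_similarity(emb_cat["embedding"], emb_dog["embedding"]))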
7. Pull a new Model
def pull_model(model_name):
    """Pull ollama model."""
    try:
        url = f"{OLLAMA_BASE_URL}/api/pull"
        data = {
            "name": model_name
        }
        response = requests.post(url, json=data)
        if response.status_code == 200:
            print(f"Pull Model Successful! Response: {response}")
        else:
            print(f"Failed to pull model. Status code: {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return None
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
model_name = "mistral-small"
res = pull_model(model_name)
If successful, the result will look like this:
Pull Model Successful! Response: <Response [200]>
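By default /api/pull streams its progress as JSON lines, so instead of waiting silently for the final response you can print the status as the download proceeds. A sketch, assuming the streaming behaviour of the pull endpoint:
def pull_model_with_progress(model_name):
    """Pull a model and print the streamed progress status lines."""
    url = f"{OLLAMA_BASE_URL}/api/pull"
    with requests.post(url, json={"name": model_name}, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                # Each streamed line is a JSON object with a "status" field,
                # e.g. "pulling manifest", download progress, or "success"
                status = json.loads(line.decode("utf-8"))
                print(status.get("status"))

pull_model_with_progress("mistral-small")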
8. Delete a model
def delete_model(model_name):
    """Delete ollama model."""
    try:
        url = f"{OLLAMA_BASE_URL}/api/delete"
        data = {
            "name": model_name
        }
        # The delete endpoint expects the HTTP DELETE method
        response = requests.delete(url, json=data)
        if response.status_code == 200:
            print(f"Delete Model Successful! Response: {response}")
        else:
            print(f"Failed to delete model. Status code: {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.ConnectionError:
        print("Failed to connect to the Ollama server. Make sure it is running locally and the URL is correct.")
        return None
    except json.JSONDecodeError:
        print("Failed to parse JSON response from Ollama server.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
model_name = "mistral-small"
res = delete_model(model_name)
The result if successful will be:
Delete Model Successful! Response: <Response [200]>
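Since deleting is destructive, it can be worth checking that the model is actually installed first. A small sketch combining the list and delete helpers above; the delete_if_installed name is just for illustration:
def delete_if_installed(model_name):
    """Delete a model only if it shows up in the installed model list."""
    installed = [m["name"] for m in requests.get(f"{OLLAMA_BASE_URL}/api/tags").json()["models"]]
    # Installed names usually include a tag, e.g. "mistral-small:latest"
    if any(name == model_name or name.startswith(model_name + ":") for name in installed):
        return delete_model(model_name)
    print(f"Model {model_name} is not installed; nothing to delete.")
    return None

delete_if_installed("mistral-small")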
9. Advanced Topics
A. Streaming Responses
To handle large outputs, Ollama supports streaming responses. Here’s an example:
def query_ollama_stream(model, prompt):
    headers = {"Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True
    }
    url = f"{OLLAMA_BASE_URL}/api/chat"
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        if response.status_code == 200:
            # The chat endpoint streams one JSON object per line
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line.decode("utf-8"))
                    print(chunk.get("message", {}).get("content", ""), end="")
        else:
            raise Exception(f"Error: {response.status_code}, {response.text}")

# Example usage
query_ollama_stream("llama2", "Write a Python script for a to-do list application.")
B. Fine-Tuning and Custom Models
While Ollama doesn’t directly support model fine-tuning, you can define custom prompts and system behaviors to tailor the model’s output. Use the system message role:
payload = {
"model": "llama2",
"messages": [
{"role": "system", "content": "You are an expert Python programmer."},
{"role": "user", "content": "How do I implement a decorator in Python?"}
]
}
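The payload above can be sent to the chat endpoint exactly like the earlier examples. A minimal sketch reusing OLLAMA_BASE_URL and requests from section 3:
# Send the system+user payload to the chat endpoint (non-streaming)
payload["stream"] = False
response = requests.post(f"{OLLAMA_BASE_URL}/api/chat", json=payload)
response.raise_for_status()
print(response.json()["message"]["content"])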
C. Batch Processing
For handling multiple queries at once, Python’s concurrent.futures can be used:
from concurrent.futures import ThreadPoolExecutor

def batch_query_ollama(model, prompts):
    # Reuse chat_with_ollama from section 3 to run the prompts in parallel
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(lambda p: chat_with_ollama(p, model), prompts))
    return results

# Example usage
prompts = ["Explain recursion.", "What is a Python generator?", "How do I use asyncio?"]
responses = batch_query_ollama("llama2", prompts)
for resp in responses:
    print(resp)
10. Deploying Ollama Models in Applications
A. Flask Integration
Create a Flask API endpoint that serves responses from an Ollama model:
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route("/api/query", methods=["POST"])
def query():
    data = request.json
    model = data.get("model")
    prompt = data.get("prompt")
    try:
        # Reuse chat_with_ollama from section 3
        response = chat_with_ollama(prompt, model)
        return jsonify({"response": response})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(port=5000)
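Once the Flask app is running on port 5000, the endpoint can be exercised with a simple client call. A sketch, assuming the server from the snippet above is running locally:
import requests

# Call the local Flask endpoint defined above
payload = {"model": "llama3.2", "prompt": "What is a Python list comprehension?"}
response = requests.post("http://localhost:5000/api/query", json=payload)
print(response.json())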
B. Deploying with Docker
Wrap your application in a Docker container alongside Ollama for portable deployments.
11. Best Practices for Python Programmers
- Local Model Management:
  - Periodically update your models for the latest improvements.
  - Use lightweight models for quick prototyping and heavier models for production.
- Error Handling:
  - Always handle connection errors and model-specific issues gracefully.
- Streaming and Memory Efficiency:
  - Stream results for large outputs to avoid blocking your application.
- System Resource Monitoring:
  - Monitor system resources when running large models locally to prevent overloading your machine (see the sketch after this list).
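As one way to monitor resources, a small check with the psutil package (an assumption here; install it with pip install psutil) can warn you before you load a large model:
import psutil

def check_memory(min_free_gb=8):
    """Warn if available RAM is below a threshold before loading a large model."""
    # psutil is a third-party package; the 8 GB threshold is just an example value
    free_gb = psutil.virtual_memory().available / (1024 ** 3)
    if free_gb < min_free_gb:
        print(f"Warning: only {free_gb:.1f} GB of RAM available; consider a smaller model.")
    return free_gb

check_memory()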
Code examples
The snippets throughout this post provide example code for using Ollama.