
From Localhost to Production: A Guide to Deploying Ollama on Kubernetes
Running large language models (LLMs) on your own hardware has never been more accessible, thanks to tools like Ollama. It’s a fantastic way to experiment, develop, and maintain privacy. But what happens when your project is ready to move beyond your local machine? How do you build a scalable, resilient, and production-ready service? The answer for many modern infrastructure teams is Kubernetes.
This guide will walk you through the essential steps to deploy Ollama on a Kubernetes cluster, transforming your local setup into a robust, production-grade LLM serving platform.
The Foundation: Containerizing Ollama
Before we can run Ollama on Kubernetes, we need a container image. While Ollama provides an official image, creating your own Dockerfile gives you more control for production environments. This allows you to pre-load specific models or add custom configurations.
A basic Dockerfile for a production-ready Ollama instance might look like this:
# Use the official Ollama base image
FROM ollama/ollama
# Expose the default Ollama port
EXPOSE 11434
# You can add a custom entrypoint script here to pre-pull models.
# For example, create an entrypoint.sh script and wire it in with:
#
# COPY entrypoint.sh /entrypoint.sh
# ENTRYPOINT ["/entrypoint.sh"]
#
# Where entrypoint.sh starts the server, waits for it to come up,
# and then pulls the models you need:
#
# #!/bin/sh
# /bin/ollama serve &
# PID=$!
# sleep 5            # give the server a moment to start before pulling
# ollama pull llama3
# ollama pull mistral
# wait $PID
# For a basic setup, the default entrypoint is often sufficient.
This simple file provides the blueprint for our deployment. For a true production setup, you would likely expand this with a custom entrypoint script to pre-pull the models you need, ensuring the container is ready to serve requests immediately upon starting.
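If you do build a custom image, push it to a registry your cluster can pull from and reference it in the Deployment below instead of ollama/ollama:latest. The registry name here is just a placeholder:
# Build the image from the Dockerfile above and push it to your registry
docker build -t registry.example.com/ollama-custom:latest .
docker push registry.example.com/ollama-custom:latest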
Your First Kubernetes Deployment
With a container image ready, we can define how it runs in Kubernetes using two core resources: a Deployment and a Service.
- Deployment: This resource manages your application’s pods. It ensures that a specified number of replicas (copies) of your Ollama container are always running. If a pod crashes, the Deployment automatically replaces it.
- Service: This provides a stable network endpoint (a single, consistent IP address and DNS name) to access the Ollama pods. It load-balances requests across all the running replicas.
Here is a basic deployment.yaml to get started:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
And the corresponding service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP # Exposes the service only within the cluster
Applying these files to your cluster will create a running Ollama instance, but it’s not yet optimized for performance.
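Assuming you saved the manifests as deployment.yaml and service.yaml, applying and verifying them looks like this:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Confirm the pod is running and the service exists inside the cluster
kubectl get pods -l app=ollama
kubectl get service ollama-service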
Unleashing Performance with GPUs
LLMs are computationally intensive and perform dramatically better with GPUs. To leverage GPU power in Kubernetes, your cluster nodes must have NVIDIA drivers installed and the NVIDIA device plugin for Kubernetes deployed. This plugin allows Kubernetes to discover and assign GPUs to pods.
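See the NVIDIA k8s-device-plugin project for installation instructions. Once the plugin is running, a quick way to confirm that Kubernetes actually sees the GPUs is to check the resources advertised by your nodes:
# Nodes with working drivers and the device plugin advertise nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"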
Once the prerequisites are met, you can request a GPU for your Ollama pod by adding a resources section to your Deployment:
# ... (inside spec.template.spec.containers)
resources:
  limits:
    nvidia.com/gpu: 1 # Requesting one GPU
This is the single most important change for achieving high-performance inference. Without it, your models will run on the CPU, resulting in extremely slow response times.
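One way to sanity-check this, assuming a model has already been pulled into the running pod, is to load it and ask Ollama where it is running; ollama ps reports whether a loaded model sits on the GPU or the CPU:
# Run a one-off prompt, then check the PROCESSOR column of ollama ps
kubectl exec deploy/ollama-deployment -- ollama run llama3 "hello"
kubectl exec deploy/ollama-deployment -- ollama ps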
Essential Production-Grade Enhancements
A running service with GPU support is great, but a production environment demands more. Here are the key areas to focus on for a robust deployment.
1. Persistent Storage for Models
When a pod restarts, any data inside it is lost. For Ollama, this means downloaded models are deleted, and the new pod must download them all over again. This leads to slow startup times and wasted bandwidth.
To solve this, we use Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) to provide durable storage.
First, create a PersistentVolumeClaim to request storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi # Adjust size based on your model needs
Then, update your Deployment to mount this volume into the Ollama container at the /root/.ollama directory, where models are stored.
# ... (inside spec.template.spec)
volumes:
  - name: ollama-data
    persistentVolumeClaim:
      claimName: ollama-pvc
containers:
  - name: ollama
    # ... other container config
    volumeMounts:
      - name: ollama-data
        mountPath: /root/.ollama
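After applying the updated manifests, it is worth confirming that the claim actually bound and the volume is mounted before relying on it:
# The PVC should report STATUS Bound, and the pod should list the mount
kubectl get pvc ollama-pvc
kubectl describe pod -l app=ollama | grep -A 3 "Mounts:"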
2. Secure External Access with Ingress
Exposing your service directly to the internet with a NodePort or LoadBalancer service type isn't ideal for production. A Kubernetes Ingress is the standard method for managing external access: it acts as an intelligent router, providing SSL/TLS termination, path-based routing, and a single entry point for your HTTP traffic.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  annotations:
    # Annotations for your specific ingress controller (e.g., NGINX, Traefik)
    # cert-manager.io/cluster-issuer: "letsencrypt-prod" # For automatic SSL
spec:
  rules:
    - host: ollama.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-service
                port:
                  number: 11434
  # tls section for HTTPS (see the sketch below)
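As a sketch of that tls section, assuming your certificate and key live in a Secret named ollama-tls (for example, one issued automatically by cert-manager via the annotation above), it could look like this:
# ... (inside spec, alongside rules)
  tls:
    - hosts:
        - ollama.yourdomain.com
      secretName: ollama-tls # hypothetical Secret holding the TLS certificate and key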
3. Critical Security Hardening Tips
Security should never be an afterthought. Here are two actionable tips to secure your Ollama deployment:
- Use Network Policies: By default, any pod in a Kubernetes cluster can talk to any other pod. Use Network Policies to act as a firewall for your pods. Create a policy that only allows traffic to your Ollama pods from trusted sources, such as your application’s backend services. This prevents unauthorized access from other compromised pods within the cluster. A minimal example policy is sketched after this list.
- Run as a Non-Root User: Running containers as the root user is a security risk. Use a securityContext in your pod specification to run the Ollama process as a dedicated, non-privileged user; see the securityContext sketch after this list.
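Here is a minimal sketch of the first tip: a policy that only admits traffic to the Ollama pods on port 11434 from pods labeled app: backend in the same namespace. The app: backend label is an assumption; replace it with the labels of your real client workloads. Note that NetworkPolicies are only enforced if your CNI plugin supports them (e.g., Calico or Cilium).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-backend
spec:
  podSelector:
    matchLabels:
      app: ollama          # the policy applies to the Ollama pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend # hypothetical label of your trusted client pods
      ports:
        - protocol: TCP
          port: 11434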
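And a minimal securityContext sketch for the second tip. The UID/GID values are illustrative, and you should verify that the image you use can run as a non-root user and still write to its model directory (adjusting OLLAMA_MODELS or the volume mount path if necessary):
# ... (inside spec.template.spec)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000   # illustrative non-root UID
  runAsGroup: 1000
  fsGroup: 1000     # lets the pod write to the mounted model volume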
Monitoring Your LLM Service
Finally, you can’t manage what you can’t monitor. Ollama exposes a /metrics endpoint compatible with Prometheus, the industry standard for metrics and alerting in the cloud-native world.
By adding a few simple annotations to your Service or Pod (see the sketch after this list), you can configure Prometheus to automatically discover and scrape these metrics. You can then use Grafana to build dashboards that visualize key indicators like:
- GPU utilization and temperature
- Memory usage
- Inference latency
- Request rate and error counts
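A common convention, used by many Prometheus scrape configurations (for example, the community Helm chart defaults), is to annotate the pod template so Prometheus discovers the endpoint automatically. Treat these annotation names as that convention rather than a Kubernetes built-in, and adjust the port and path if your metrics are exposed elsewhere:
# ... (inside spec.template.metadata of the Deployment)
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "11434"
  prometheus.io/path: "/metrics"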
By taking these steps, you can confidently move your Ollama service from a local development environment to a scalable, secure, and observable production system on Kubernetes, ready to power the next generation of AI-driven applications.
Source: https://collabnix.com/ollama-production-deployment-on-kubernetes-3/