
From Localhost to Production: A Guide to Deploying Ollama on Kubernetes
Running large language models (LLMs) on your own hardware has never been more accessible, thanks to tools like Ollama. It’s a fantastic way to experiment, develop, and maintain privacy. But what happens when your project is ready to move beyond your local machine? How do you build a scalable, resilient, and production-ready service? The answer for many modern infrastructure teams is Kubernetes.
This guide will walk you through the essential steps to deploy Ollama on a Kubernetes cluster, transforming your local setup into a robust, production-grade LLM serving platform.
The Foundation: Containerizing Ollama
Before we can run Ollama on Kubernetes, we need a container image. While Ollama provides an official image, creating your own Dockerfile gives you more control for production environments. This allows you to pre-load specific models or add custom configurations.
A basic Dockerfile for a production-ready Ollama instance might look like this:
# Use the official Ollama base image
FROM ollama/ollama
# Expose the default Ollama port
EXPOSE 11434
# You can add a custom entrypoint script here to pre-pull models.
# For example, create an entrypoint.sh script and wire it in with:
#
# COPY entrypoint.sh /entrypoint.sh
# ENTRYPOINT ["/entrypoint.sh"]
#
# Where entrypoint.sh starts the server, waits for it to come up,
# and then pulls the models you need:
#
# #!/bin/sh
# /bin/ollama serve &
# PID=$!
# sleep 5            # give the server a moment to start before pulling
# ollama pull llama3
# ollama pull mistral
# wait $PID
# For a basic setup, the default entrypoint is often sufficient.
This simple file provides the blueprint for our deployment. For a true production setup, you would likely expand this with a custom entrypoint script to pre-pull the models you need, ensuring the container is ready to serve requests immediately upon starting.
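If you do build a custom image, push it to a registry your cluster can pull from and reference it in the Deployment below instead of ollama/ollama:latest. The registry name here is just a placeholder:
# Build the image from the Dockerfile above and push it to your registry
docker build -t registry.example.com/ollama-custom:latest .
docker push registry.example.com/ollama-custom:latest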
Your First Kubernetes Deployment
With a container image ready, we can define how it runs in Kubernetes using two core resources: a Deployment and a Service.
- Deployment: This resource manages your application’s pods. It ensures that a specified number of replicas (copies) of your Ollama container are always running. If a pod crashes, the Deployment automatically replaces it.
- Service: This provides a stable network endpoint (a single, consistent IP address and DNS name) to access the Ollama pods. It load-balances requests across all the running replicas.
Here is a basic deployment.yaml to get started:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
And the corresponding service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP # Exposes the service only within the cluster
Applying these files to your cluster will create a running Ollama instance, but it’s not yet optimized for performance.
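Assuming you saved the manifests as deployment.yaml and service.yaml, applying and verifying them looks like this:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Confirm the pod is running and the service exists inside the cluster
kubectl get pods -l app=ollama
kubectl get service ollama-service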
Unleashing Performance with GPUs
LLMs are computationally intensive and perform dramatically better with GPUs. To leverage GPU power in Kubernetes, your cluster nodes must have NVIDIA drivers installed and the NVIDIA device plugin for Kubernetes deployed. This plugin allows Kubernetes to discover and assign GPUs to pods.
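See the NVIDIA k8s-device-plugin project for installation instructions. Once the plugin is running, a quick way to confirm that Kubernetes actually sees the GPUs is to check the resources advertised by your nodes:
# Nodes with working drivers and the device plugin advertise nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"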
Once the prerequisites are met, you can request a GPU for your Ollama pod by adding a resources section to your Deployment:
# ... (inside spec.template.spec.containers)
resources:
  limits:
    nvidia.com/gpu: 1 # Requesting one GPU
This is the single most important change for achieving high-performance inference. Without it, your models will run on the CPU, resulting in extremely slow response times.
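One way to sanity-check this, assuming a model has already been pulled into the running pod, is to load it and ask Ollama where it is running; ollama ps reports whether a loaded model sits on the GPU or the CPU:
# Run a one-off prompt, then check the PROCESSOR column of ollama ps
kubectl exec deploy/ollama-deployment -- ollama run llama3 "hello"
kubectl exec deploy/ollama-deployment -- ollama ps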
Essential Production-Grade Enhancements
A running service with GPU support is great, but a production environment demands more. Here are the key areas to focus on for a robust deployment.
1. Persistent Storage for Models
When a pod restarts, any data inside it is lost. For Ollama, this means downloaded models are deleted, and the new pod must download them all over again. This leads to slow startup times and wasted bandwidth.
To solve this, we use Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) to provide durable storage.
First, create a PersistentVolumeClaim to request storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi # Adjust size based on your model needs
Then, update your Deployment to mount this volume into the Ollama container at the /root/.ollama directory, where models are stored.
# ... (inside spec.template.spec)
volumes:
  - name: ollama-data
    persistentVolumeClaim:
      claimName: ollama-pvc
containers:
  - name: ollama
    # ... other container config
    volumeMounts:
      - name: ollama-data
        mountPath: /root/.ollama
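After applying the updated manifests, it is worth confirming that the claim actually bound and the volume is mounted before relying on it:
# The PVC should report STATUS Bound, and the pod should list the mount
kubectl get pvc ollama-pvc
kubectl describe pod -l app=ollama | grep -A 3 "Mounts:"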
2. Secure External Access with Ingress
Exposing your service directly to the internet with a NodePort or LoadBalancer service type isn't ideal for production. A Kubernetes Ingress is the standard method for managing external access: it acts as an intelligent router, providing SSL/TLS termination, path-based routing, and a single entry point for your HTTP traffic.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  annotations:
    # Annotations for your specific ingress controller (e.g., NGINX, Traefik)
    # cert-manager.io/cluster-issuer: "letsencrypt-prod" # For automatic SSL
spec:
  rules:
    - host: ollama.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-service
                port:
                  number: 11434
  # tls section for HTTPS (see the sketch below)
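As a sketch of that tls section, assuming your certificate and key live in a Secret named ollama-tls (for example, one issued automatically by cert-manager via the annotation above), it could look like this:
# ... (inside spec, alongside rules)
  tls:
    - hosts:
        - ollama.yourdomain.com
      secretName: ollama-tls # hypothetical Secret holding the TLS certificate and key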
3. Critical Security Hardening Tips
Security should never be an afterthought. Here are two actionable tips to secure your Ollama deployment:
- Use Network Policies: By default, any pod in a Kubernetes cluster can talk to any other pod. Use Network Policies to act as a firewall for your pods. Create a policy that only allows traffic to your Ollama pods from trusted sources, such as your application’s backend services. This prevents unauthorized access from other compromised pods within the cluster. A minimal example policy is sketched after this list.
- Run as a Non-Root User: Running containers as the root user is a security risk. Use a securityContext in your pod specification to run the Ollama process as a dedicated, non-privileged user; see the securityContext sketch after this list.
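Here is a minimal sketch of the first tip: a policy that only admits traffic to the Ollama pods on port 11434 from pods labeled app: backend in the same namespace. The app: backend label is an assumption; replace it with the labels of your real client workloads. Note that NetworkPolicies are only enforced if your CNI plugin supports them (e.g., Calico or Cilium).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-backend
spec:
  podSelector:
    matchLabels:
      app: ollama          # the policy applies to the Ollama pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend # hypothetical label of your trusted client pods
      ports:
        - protocol: TCP
          port: 11434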
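And a minimal securityContext sketch for the second tip. The UID/GID values are illustrative, and you should verify that the image you use can run as a non-root user and still write to its model directory (adjusting OLLAMA_MODELS or the volume mount path if necessary):
# ... (inside spec.template.spec)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000   # illustrative non-root UID
  runAsGroup: 1000
  fsGroup: 1000     # lets the pod write to the mounted model volume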
Monitoring Your LLM Service
Finally, you can’t manage what you can’t monitor. Ollama exposes a /metrics endpoint compatible with Prometheus, the industry standard for metrics and alerting in the cloud-native world.
By adding a few simple annotations to your Service or Pod (see the sketch after this list), you can configure Prometheus to automatically discover and scrape these metrics. You can then use Grafana to build dashboards that visualize key indicators like:
- GPU utilization and temperature
- Memory usage
- Inference latency
- Request rate and error counts
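A common convention, used by many Prometheus scrape configurations (for example, the community Helm chart defaults), is to annotate the pod template so Prometheus discovers the endpoint automatically. Treat these annotation names as that convention rather than a Kubernetes built-in, and adjust the port and path if your metrics are exposed elsewhere:
# ... (inside spec.template.metadata of the Deployment)
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "11434"
  prometheus.io/path: "/metrics"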
By taking these steps, you can confidently move your Ollama service from a local development environment to a scalable, secure, and observable production system on Kubernetes, ready to power the next generation of AI-driven applications.
Source: https://collabnix.com/ollama-production-deployment-on-kubernetes-3/