
How to Run Multiple Ollama Models on Kubernetes: A Practical Guide
Running large language models (LLMs) locally has become significantly easier thanks to tools like Ollama. When combined with the power of Kubernetes for orchestration, you can build scalable and robust AI infrastructure. However, a common challenge arises when you need to run multiple, distinct Ollama models concurrently. A single Ollama instance is designed to load and serve one model at a time, which can be a bottleneck for applications requiring different specialized models.
This guide provides a clear, production-ready strategy for deploying and managing multiple Ollama models on a Kubernetes cluster. By isolating each model into its own deployment, you gain granular control, independent scalability, and simplified management.
The Challenge: Concurrent Model Serving
By default, an Ollama server loads models into memory on demand. If you send a request for Llama3 and then another for Mistral, and there is not enough memory to hold both, the server unloads the first model to make room for the second. This switching introduces latency and is inefficient for applications that need to query different models simultaneously. The goal is to have dedicated, always-on endpoints for each model you need to serve.
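To see the problem concretely, here is a sketch of what happens when two requests for different models hit the same single-instance endpoint (the address assumes Ollama's default port on a locally reachable server; the prompts are placeholders):
# Both requests land on the same Ollama server.
# If memory is tight, serving mistral forces llama3 to be unloaded first.
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'
curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Why is the sky blue?", "stream": false}'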
The most effective solution is to move away from a single, monolithic Ollama instance and embrace a microservice-style architecture. The core principle is simple: one model per Kubernetes Deployment.
This approach provides several key advantages:
- Isolation: Each model runs in its own dedicated pod(s), preventing resource conflicts.
- Scalability: You can scale the deployment for a high-demand model (e.g., Llama3) independently of a less-used one (see the kubectl scale example after this list).
- Clear Endpoints: Each model gets its own stable internal DNS name and IP address through a Kubernetes Service.
- Efficient Resource Management: You can assign specific CPU, memory, and GPU resources to each model based on its unique requirements.
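For example, once the Deployments defined later in this guide exist, scaling one model up has no effect on the others (the deployment name matches the manifest in Step 2):
kubectl scale deployment ollama-llama3-deployment --replicas=3
Keep in mind that with a ReadWriteOnce volume (Step 1), all replicas sharing the PVC must be scheduled onto the same node.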
Step 1: Preparing for Persistence with a PersistentVolumeClaim
LLMs are large, and you don’t want your pods to re-download a multi-gigabyte model every time they restart. To solve this, we’ll use a Kubernetes PersistentVolumeClaim (PVC) to create a reusable storage volume. This volume will store the downloaded models.
It is crucial to create a single, shared PVC that all your Ollama pods can mount. This ensures that a model downloaded by one pod is immediately available to any other pod that needs it, saving significant time and bandwidth. Note that the ReadWriteOnce access mode used below only allows the volume to be attached to one node at a time; if your model pods may be scheduled across multiple nodes, choose a storage class that supports ReadWriteMany and adjust the access mode accordingly.
Create a file named ollama-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi # Adjust size based on the models you plan to store
Apply this configuration to your cluster:
kubectl apply -f ollama-pvc.yaml
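Before moving on, it is worth confirming that the claim was provisioned and is bound (the exact status and timing depend on your storage class):
kubectl get pvc ollama-models-pvc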
Step 2: Deploying Your First Model (Llama3)
Now, let’s create a Kubernetes Deployment for our first model, Llama3. This manifest will define a pod that runs the Ollama server and is configured to load Llama3 on startup.
The key to this setup is the command field in the container spec. It starts the Ollama server in the background, gives it a few seconds to come up, and then runs ollama run llama3, which pulls the model (if it is not already on the shared volume) and loads it, so this instance is ready to serve requests shortly after startup.
Create a file named deployment-llama3.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-llama3
  template:
    metadata:
      labels:
        app: ollama-llama3
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama run llama3 & wait"]
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
Next, we need to expose this deployment within the cluster using a Service. This gives us a stable endpoint to interact with the Llama3 model.
Create a file named service-llama3.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ollama-llama3-service
spec:
  selector:
    app: ollama-llama3
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
Apply both files to deploy the model:
kubectl apply -f deployment-llama3.yaml
kubectl apply -f service-llama3.yaml
Your cluster now has a running instance of Ollama dedicated exclusively to serving the Llama3 model, accessible at the DNS name ollama-llama3-service.
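To sanity-check the endpoint, you can query it from any pod inside the cluster using Ollama's REST API (this assumes everything runs in the default namespace; otherwise append the namespace to the service name):
curl http://ollama-llama3-service:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'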
Step 3: Deploying a Second Model (Phi-3)
Deploying another model is as simple as duplicating and modifying the previous configurations. Let’s deploy Phi-3.
First, create deployment-phi3.yaml. Notice the changes in metadata.name, selector, labels, and the command:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-phi3-deployment # Changed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-phi3 # Changed
  template:
    metadata:
      labels:
        app: ollama-phi3 # Changed
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama run phi3 & wait"] # Changed model
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
Now, create service-phi3.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ollama-phi3-service # Changed
spec:
  selector:
    app: ollama-phi3 # Changed
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
Apply the new configurations:
kubectl apply -f deployment-phi3.yaml
kubectl apply -f service-phi3.yaml
You now have a second, independent Ollama instance running Phi-3, accessible at ollama-phi3-service. You can repeat this process for as many models as your cluster resources allow.
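Because each model now has its own dedicated instance, you can query both endpoints at the same time without triggering any model swapping. A quick check from a pod inside the cluster (default namespace assumed):
curl http://ollama-llama3-service:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}' &
curl http://ollama-phi3-service:11434/api/generate -d '{"model": "phi3", "prompt": "Why is the sky blue?", "stream": false}' &
wait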
Security and Performance Best Practices
To move this setup closer to a production environment, consider the following actionable tips:
- Define Resource Requests and Limits: To prevent a single model from consuming all cluster resources, it is essential to set CPU and memory requests and limits in your Deployment manifest. If you have GPUs, be sure to request them appropriately. This ensures stability and fair resource sharing (see the example resources block after this list).
- Use an Ingress Controller: For external access, avoid exposing services directly. Instead, deploy an Ingress controller (like NGINX or Traefik) to manage and route traffic from a single entry point to the correct model service based on the path (e.g., api.yourapi.com/llama3 routes to ollama-llama3-service).
- Implement Readiness Probes: Add a readiness probe to your Deployment spec to ensure Kubernetes only sends traffic to pods that have successfully loaded the model and are ready to serve requests. This prevents connection errors during startup.
- Consider Namespace Isolation: For better organization and security, deploy your Ollama models within a dedicated Kubernetes namespace. This helps separate your AI workloads from other applications running in the cluster.
For the resource limits, the block looks like this:
# Add this inside the container spec
resources:
  requests:
    memory: "16Gi"
    cpu: "4"
  limits:
    memory: "24Gi"
    cpu: "6"
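As a starting point for the readiness probe, here is a minimal sketch that checks Ollama's HTTP endpoint (the timing values are assumptions you should tune for your model's load time, and a 200 from the root path only confirms the server is running, not that the model has fully loaded):
# Add alongside ports and volumeMounts in the container spec
readinessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 10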
By adopting this isolated deployment strategy, you can build a powerful, scalable, and easy-to-manage inference platform on Kubernetes, capable of serving multiple specialized LLMs concurrently to power your most demanding AI applications.
Source: https://collabnix.com/running-multiple-ollama-models-on-kubernetes-complete-guide/


