

How to Run Multiple Ollama Models on Kubernetes: A Practical Guide

Running large language models (LLMs) locally has become significantly easier thanks to tools like Ollama. When combined with the power of Kubernetes for orchestration, you can build scalable and robust AI infrastructure. However, a common challenge arises when you need to run multiple, distinct Ollama models concurrently. With its default settings, a single Ollama instance swaps models in and out of memory on demand, which can become a bottleneck for applications that rely on several specialized models at once.

This guide provides a clear, production-ready strategy for deploying and managing multiple Ollama models on a Kubernetes cluster. By isolating each model into its own deployment, you gain granular control, independent scalability, and simplified management.


The Challenge: Concurrent Model Serving

By default, an Ollama server loads models into memory on demand. If you send a request for Llama3 and then another for Mistral, the server will typically unload the first model to make room for the second once memory gets tight. This switching introduces latency and is inefficient for applications that need to query different models simultaneously. The goal is to have dedicated, always-on endpoints for each model you need to serve.

The most effective solution is to move away from a single, monolithic Ollama instance and embrace a microservice-style architecture. The core principle is simple: one model per Kubernetes Deployment.

This approach provides several key advantages:

  • Isolation: Each model runs in its own dedicated pod(s), preventing resource conflicts.
  • Scalability: You can scale the deployment for a high-demand model (e.g., Llama3) independently of a less-used one.
  • Clear Endpoints: Each model gets its own stable internal DNS name and IP address through a Kubernetes Service.
  • Efficient Resource Management: You can assign specific CPU, memory, and GPU resources to each model based on its unique requirements.

Step 1: Preparing for Persistence with a PersistentVolumeClaim

LLMs are large, and you don’t want your pods to re-download a multi-gigabyte model every time they restart. To solve this, we’ll use a Kubernetes PersistentVolumeClaim (PVC) to create a reusable storage volume. This volume will store the downloaded models.

Ideally, all of your Ollama pods mount a single, shared PVC, so a model downloaded by one pod is immediately available to any other pod that needs it, saving significant time and bandwidth. Be aware of the access mode, though: sharing one volume across pods on different nodes requires ReadWriteMany, which not every storage class supports. With ReadWriteOnce (used below), the volume can only be attached to a single node, so either ensure the pods are scheduled together or give each deployment its own PVC.

Create a file named ollama-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
spec:
  accessModes:
    - ReadWriteOnce # use ReadWriteMany instead if your storage class supports it and pods may run on different nodes
  resources:
    requests:
      storage: 200Gi # Adjust size based on the models you plan to store

Apply this configuration to your cluster:

kubectl apply -f ollama-pvc.yaml
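
You can confirm the claim was created with kubectl. Depending on your storage class's volume binding mode, it may stay Pending until the first pod mounts it, which is normal:

kubectl get pvc ollama-models-pvc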

Step 2: Deploying Your First Model (Llama3)

Now, let’s create a Kubernetes Deployment for our first model, Llama3. This manifest will define a pod that runs the Ollama server and is configured to load Llama3 on startup.

The key to this setup is the command argument in the container spec. It starts the Ollama server with ollama serve and then runs ollama run llama3, which pulls the model onto the shared volume (if it is not already there) and loads it, so this instance serves that specific model from the moment it starts.

Create a file named deployment-llama3.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-llama3
  template:
    metadata:
      labels:
        app: ollama-llama3
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c", "ollama serve & ollama run llama3 & wait"]
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc

Next, we need to expose this deployment within the cluster using a Service. This gives us a stable endpoint to interact with the Llama3 model.

Create a file named service-llama3.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ollama-llama3-service
spec:
  selector:
    app: ollama-llama3
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434

Apply both files to deploy the model:

kubectl apply -f deployment-llama3.yaml
kubectl apply -f service-llama3.yaml

Your cluster now has a running instance of Ollama dedicated exclusively to serving the Llama3 model, accessible at the DNS name ollama-llama3-service.
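
To sanity-check the endpoint, you can call Ollama's HTTP API through the service from inside the cluster. The sketch below uses a throwaway curl pod; the pod name, image, and prompt are arbitrary choices, and the first request may take a while if the model is still downloading or loading:

kubectl run ollama-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://ollama-llama3-service:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Say hello in one sentence.", "stream": false}'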


Step 3: Deploying a Second Model (Phi-3)

Deploying another model is as simple as duplicating and modifying the previous configurations. Let’s deploy Phi-3.

First, create deployment-phi3.yaml. Notice the changes in metadata.name, selector, labels, and the command:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-phi3-deployment # Changed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-phi3 # Changed
  template:
    metadata:
      labels:
        app: ollama-phi3 # Changed
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c", "ollama serve & ollama run phi3 & wait"] # Changed model
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc

Now, create service-phi3.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ollama-phi3-service # Changed
spec:
  selector:
    app: ollama-phi3 # Changed
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434

Apply the new configurations:

kubectl apply -f deployment-phi3.yaml
kubectl apply -f service-phi3.yaml

You now have a second, independent Ollama instance running Phi-3, accessible at ollama-phi3-service. You can repeat this process for as many models as your cluster resources allow.
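
A quick way to check that both instances are running and independently addressable (the pod labels come from the deployment templates above):

kubectl get pods -l 'app in (ollama-llama3, ollama-phi3)'
kubectl get svc ollama-llama3-service ollama-phi3-service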


Security and Performance Best Practices

To move this setup closer to a production environment, consider the following actionable tips:

  1. Define Resource Requests and Limits: To prevent a single model from consuming all cluster resources, it is essential to set CPU and memory limits in your Deployment manifest. If you have GPUs, be sure to request them appropriately. This ensures stability and fair resource sharing.

    # Add this inside the container spec
    resources:
      requests:
        memory: "16Gi"
        cpu: "4"
      limits:
        memory: "24Gi"
        cpu: "6"
    
  2. Use an Ingress Controller: For external access, avoid exposing services directly. Instead, deploy an Ingress controller (like NGINX or Traefik) to manage and route traffic from a single entry point to the correct model service based on the path (e.g., api.yourapi.com/llama3 routes to ollama-llama3-service). A minimal routing sketch is shown after this list.

  3. Implement Readiness Probes: Add a readiness probe to your Deployment spec so Kubernetes only sends traffic to pods that are up and ready to serve requests. This prevents connection errors during startup; see the probe sketch after this list.

  4. Consider Namespace Isolation: For better organization and security, deploy your Ollama models within a dedicated Kubernetes namespace. This helps separate your AI workloads from other applications running in the cluster.
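
Here are the sketches referenced in items 2 and 3. First, a minimal path-based routing example, assuming the NGINX Ingress controller is installed and that api.yourapi.com resolves to it; the regex and rewrite annotations are specific to ingress-nginx and strip the model prefix before forwarding the request to the service:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
  - host: api.yourapi.com
    http:
      paths:
      - path: /llama3(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: ollama-llama3-service
            port:
              number: 11434
      - path: /phi3(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: ollama-phi3-service
            port:
              number: 11434

Second, a readiness probe sketch. Ollama answers GET / on port 11434 once the server is up, so this keeps traffic away from pods that are still starting; note that it does not prove the model itself has finished loading, and the timing values are illustrative and should be tuned to your startup times:

# Add this inside the container spec, alongside resources
readinessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 6

For item 4, the namespace setup is a one-liner: create it with kubectl create namespace ollama and apply every manifest with -n ollama (or set the namespace in each manifest's metadata).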

By adopting this isolated deployment strategy, you can build a powerful, scalable, and easy-to-manage inference platform on Kubernetes, capable of serving multiple specialized LLMs concurrently to power your most demanding AI applications.

Source: https://collabnix.com/running-multiple-ollama-models-on-kubernetes-complete-guide/
