Dynamic LLM Autoscaling with KEDA and KServe
Metrics-based autoscaling for LLM inference services on OpenShift AI using KEDA, KServe and vLLM metrics. Scale GPU workloads efficiently based on request queue depth.
This article is based on the Dynamic Model Autoscaling Repository, which contains Helm charts and demo scripts.
Why KEDA for LLM Autoscaling?
Deploying LLMs in production presents unique autoscaling challenges. Traditional CPU/memory-based autoscaling often falls short for LLM workloads because:
- Unpredictable resource needs: Processing time varies dramatically based on input/output sequence length
- GPU utilization is misleading: High GPU utilization might indicate efficient usage or a saturated, overloaded state
- Request queue depth matters: The number of waiting requests is a better indicator of when to scale
KEDA (Kubernetes Event-driven Autoscaling) solves this by scaling based on custom, application-specific metrics like vLLM's num_requests_waiting and num_requests_running.
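For illustration, vLLM exposes these as standard Prometheus gauges on its /metrics endpoint; a sample scrape might look like this (the values and model label are made up):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama3-2-3b"} 12.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama3-2-3b"} 4.0
```

Because these gauges reflect queue state directly, a threshold on num_requests_waiting captures saturation that GPU utilization alone would hide.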
Knative vs RawDeployment
Knative autoscaling is not available in KServe RawDeployment mode. KEDA provides autoscaling capabilities for RawDeployment-based InferenceServices.
Architecture

The autoscaling flow works as follows:
- vLLM exposes metrics (num_requests_waiting, num_requests_running)
- A ServiceMonitor scrapes the metrics into Prometheus, exposed through Thanos Querier
- KEDA monitors the metrics and triggers scaling when thresholds are exceeded
- HPA scales the deployment up or down based on KEDA's signals
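The scraping step can be sketched with a ServiceMonitor like the following (the name, label selector, and port name are assumptions for illustration; the repository's Helm charts create the real one):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics            # hypothetical name
  namespace: autoscaling-keda
spec:
  endpoints:
    - port: http                # assumed name of the port serving vLLM's /metrics
      path: /metrics
      interval: 30s
  selector:
    matchLabels:
      app: llama3-2-3b          # assumed label on the predictor Service
```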
Technology Preview
Metrics-based autoscaling is currently available in Red Hat OpenShift AI as a Technology Preview feature. For more information, see Technology Preview Features Support Scope.
Automatic KEDA Integration
Setting the annotation serving.kserve.io/autoscalerClass: keda on your InferenceService triggers odh-model-controller to automatically create:
- TriggerAuthentication, ServiceAccount, Role, RoleBinding, Secret
- ScaledObject with Prometheus trigger
No manual KEDA configuration is required.
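A minimal sketch of the annotation on an InferenceService (the model name, runtime format, and resource values are illustrative, not taken from the repository):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-2-3b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: keda   # triggers automatic KEDA resource creation
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: vLLM
      resources:
        limits:
          nvidia.com/gpu: "1"
```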
Key Metrics for Autoscaling
Query vLLM metrics in the OpenShift Console (Observe → Metrics):
# Requests waiting in queue (triggers scale-up)
sum(vllm:num_requests_waiting{model_name="<model-name>"})
# Active requests being processed
sum(vllm:num_requests_running{model_name="<model-name>"})

Expected behavior:
- Scale-up: Pods increase within ~30-60 seconds when request queue grows
- Scale-down: Pods return to minimum after ~5 minutes cooldown when load stops
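Under the hood, the generated ScaledObject roughly corresponds to a Prometheus trigger on the queue-depth metric. A hedged sketch follows; the resource names, serverAddress, and threshold are assumptions, and odh-model-controller fills in the real values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-2-3b-scaledobject       # hypothetical name
spec:
  scaleTargetRef:
    name: llama3-2-3b-predictor        # assumed deployment name
  minReplicaCount: 1
  maxReplicaCount: 3
  cooldownPeriod: 300                  # matches the ~5 minute scale-down above
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
        query: sum(vllm:num_requests_waiting{model_name="llama3-2-3b"})
        threshold: "5"                 # illustrative: scale up beyond 5 queued requests
      authenticationRef:
        name: llama3-2-3b-trigger-auth # assumed TriggerAuthentication name
```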

Getting Started
Quick Install
Clone the repository and deploy the components using Helm:
git clone https://github.com/rh-aiservices-bu/dynamic-model-autoscaling.git
cd dynamic-model-autoscaling
# 1. Install KEDA Operator
oc create namespace openshift-keda
oc label namespace openshift-keda openshift.io/cluster-monitoring=true
helm install keda-operator helm/keda-operator/ -n openshift-keda
# 2. Enable User Workload Monitoring
helm install uwm helm/uwm/ -n openshift-monitoring
# 3. Configure KEDA Controller
helm install keda helm/keda/ -n openshift-keda
# 4. Deploy a model with KEDA autoscaling enabled
export NAMESPACE=autoscaling-keda
oc new-project $NAMESPACE
helm install llama3-2-3b helm/llama3.2-3b/ \
--set keda.enabled=true \
--set inferenceService.maxReplicas=3 \
-n $NAMESPACE
Verify Autoscaling
# Check KEDA resources were created
oc get scaledobject,hpa,pods -n $NAMESPACE
# Run the included load test to trigger scaling
DURATION=60 RATE=20 NAMESPACE=$NAMESPACE ./scripts/basic-load-test.sh
Repository Contents
For complete documentation, additional models, and advanced configurations, see the Dynamic Model Autoscaling Repository. It includes:
- Helm charts for KEDA operator, controller, and model deployments
- Pre-configured InferenceServices for Llama 3.2-3B and Granite 3.3-8B
- Load testing scripts to verify autoscaling behavior
- KEDA HTTP Add-on setup for scale-to-zero
Scale-to-Zero
Standard Prometheus-based KEDA cannot scale to zero because when no pods are running, no metrics are generated. This creates a chicken-and-egg problem: KEDA needs metrics to scale up, but metrics only exist when pods are running.
For scale-to-zero capabilities, the KEDA HTTP Add-on provides a solution by intercepting HTTP requests before they reach the service:
- The HTTP Add-on interceptor sits in front of your service
- When pods are at zero, incoming requests are held by the interceptor
- The interceptor triggers KEDA to scale up from 0 → 1
- Once the pod is ready, requests are forwarded to the model
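The flow above is configured through an HTTPScaledObject; a sketch under assumed names (host, service, and port are placeholders, and field names vary across add-on versions — see the demo guide for the working configuration):

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: llama3-2-3b-http
spec:
  hosts:
    - llama3-2-3b.autoscaling-keda.example.com   # assumed external host
  scaleTargetRef:
    name: llama3-2-3b-predictor                  # assumed deployment name
    service: llama3-2-3b-predictor               # assumed service fronting the model
    port: 8080
  replicas:
    min: 0    # allow scale-to-zero
    max: 3
```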
Scale-to-Zero Trade-off
First request after scale-to-zero takes 60-90 seconds while the model loads. This is suitable for cost optimization but not for latency-sensitive workloads.
For detailed setup instructions, see the KEDA HTTP Add-on Demo Guide.