Install with Envoy AI Gateway
This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes, enabling advanced traffic management and AI-specific gateway features.
Architecture Overview
The deployment consists of:
- vLLM Semantic Router: Provides intelligent request routing and semantic understanding
- Envoy Gateway: Core gateway functionality and traffic management
- Envoy AI Gateway: An AI-focused gateway built on Envoy Gateway that manages connectivity to LLM providers
Benefits of Integration
Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:
1. Hybrid Model Selection
Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.
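As a rough sketch of what this can look like, the `AIGatewayRoute` below routes on the `x-ai-eg-model` header, which Envoy AI Gateway populates from the model name in the request body. The Gateway and backend names are hypothetical, and field names such as `parentRefs` vs. `targetRefs` have shifted between releases, so check them against the version you install:

```yaml
# Sketch: route by requested model, either to a cloud provider or to a
# self-hosted vLLM backend. All resource names here are placeholders.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: hybrid-routing
spec:
  schema:
    name: OpenAI                     # requests use the OpenAI API schema
  parentRefs:
    - name: ai-gateway               # older releases use targetRefs here
  rules:
    # Cloud-hosted model: forward to a backend for OpenAI.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend
    # Self-hosted model: keep traffic in-cluster on vLLM.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: llama-3.1-8b-instruct
      backendRefs:
        - name: self-hosted-vllm
```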
2. Token Rate Limiting
Protect your infrastructure and control costs with fine-grained rate limiting (see the configuration sketch after this list):
- Input token limits: Control request size to prevent abuse
- Output token limits: Manage response generation costs
- Total token limits: Set overall usage quotas per user/tenant
- Time-based windows: Configure limits per second, minute, or hour
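As a sketch of how these limits can be wired up, assuming the `AIGatewayRoute` declares `llmRequestCosts` (so the gateway emits token counts as dynamic metadata) and that callers are identified by an `x-user-id` header (an assumption for illustration), an Envoy Gateway `BackendTrafficPolicy` can charge each response against a per-user token budget:

```yaml
# Sketch: declare token costs on the route so they appear as metadata.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: hybrid-routing
spec:
  # ... routing rules as in the earlier sketch ...
  llmRequestCosts:
    - metadataKey: llm_total_token
      type: TotalToken
---
# Sketch: enforce a per-user token budget per hour. Numbers are illustrative.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-rate-limit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway               # hypothetical Gateway name
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct     # one bucket per distinct user
          limit:
            requests: 100000         # interpreted here as a token budget
            unit: Hour
          cost:
            request:
              from: Number
              number: 0              # the request itself costs nothing
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token # tokens consumed by this response
```

Swapping `TotalToken` for `InputToken` or `OutputToken` in `llmRequestCosts` gives the input- and output-specific limits described above.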
3. Model/Provider Failover
Ensure high availability with automatic failover mechanisms (a policy sketch follows the list):
- Detect unhealthy backends and route traffic to healthy instances
- Support for active-passive and active-active failover strategies
- Graceful degradation when primary models are unavailable
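There is more than one way to realize this; one sketch using Envoy Gateway primitives combines retries with passive health checking (outlier detection), so that endpoints returning repeated errors are ejected and requests are retried elsewhere. Resource names and thresholds below are illustrative:

```yaml
# Sketch: retry failed requests and eject unhealthy backends.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-failover
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway               # hypothetical Gateway name
  retry:
    numRetries: 2
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
    retryOn:
      httpStatusCodes:
        - 503
      triggers:
        - connect-failure
        - retriable-status-codes
  healthCheck:
    passive:
      consecutive5XxErrors: 3        # eject after repeated 5xx responses
      baseEjectionTime: 30s          # keep ejected endpoints out this long
      maxEjectionPercent: 100        # allow ejecting all bad endpoints
```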
4. Traffic Splitting & Canary Testing
Deploy new models safely with progressive rollout capabilities (a weighted-routing sketch follows the list):
- A/B Testing: Split traffic between model versions to compare performance
- Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
- Shadow Traffic: Send duplicate requests to new models without affecting production
- Weight-based routing: Fine-tune traffic distribution across model variants
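For instance, a canary step might look like the following sketch, splitting traffic 95/5 between a stable and a candidate backend (names hypothetical); advancing the rollout is then just a matter of editing the weights:

```yaml
# Sketch: 95/5 canary split between two backends serving the same model.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: canary-rollout
spec:
  schema:
    name: OpenAI
  parentRefs:
    - name: ai-gateway               # hypothetical Gateway name
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: my-model        # hypothetical model name
      backendRefs:
        - name: model-v1-backend     # stable version
          weight: 95
        - name: model-v2-backend     # canary version
          weight: 5
```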
5. LLM Observability & Monitoring
Gain deep insights into your LLM infrastructure (a scrape-config sketch follows the list):
- Request/Response Metrics: Track latency, throughput, token usage, and error rates
- Model Performance: Monitor accuracy, quality scores, and user satisfaction
- Cost Analytics: Analyze spending patterns across models and providers
- Distributed Tracing: End-to-end visibility with OpenTelemetry integration
- Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack
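If you run the Prometheus Operator, a `ServiceMonitor` is one way to start scraping gateway metrics; the selector labels, namespace, and port name below are assumptions to adapt to your deployment:

```yaml
# Sketch: scrape gateway metrics with the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-gateway-metrics
  namespace: monitoring              # where your Prometheus discovers monitors
spec:
  namespaceSelector:
    matchNames:
      - envoy-gateway-system         # namespace running the gateway pods
  selector:
    matchLabels:
      app.kubernetes.io/name: envoy  # illustrative label; match your Service
  endpoints:
    - port: metrics                  # illustrative port name
      interval: 15s
```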