Install with Envoy AI Gateway
This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes, enabling advanced traffic management and AI-specific gateway features.
Architecture Overview
The deployment consists of:
- vLLM Semantic Router: Provides intelligent request routing and semantic understanding
- Envoy Gateway: Core gateway functionality and traffic management
- Envoy AI Gateway: An AI-focused gateway built on Envoy Gateway that manages connectivity to LLM providers
Benefits of Integration
Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:
1. Hybrid Model Selection
Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.
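As a rough sketch of what this can look like, the `AIGatewayRoute` below routes on the `x-ai-eg-model` header, which Envoy AI Gateway populates from the model name in the request body. The Gateway and backend names are hypothetical, and field names such as `parentRefs` vs. `targetRefs` have shifted between releases, so check them against the version you install:

```yaml
# Sketch: route by requested model, either to a cloud provider or to a
# self-hosted vLLM backend. All resource names here are placeholders.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: hybrid-routing
spec:
  schema:
    name: OpenAI                     # requests use the OpenAI API schema
  parentRefs:
    - name: ai-gateway               # older releases use targetRefs here
  rules:
    # Cloud-hosted model: forward to a backend for OpenAI.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend
    # Self-hosted model: keep traffic in-cluster on vLLM.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: llama-3.1-8b-instruct
      backendRefs:
        - name: self-hosted-vllm
```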
2. Token Rate Limiting
Protect your infrastructure and control costs with fine-grained rate limiting (see the configuration sketch after this list):
- Input token limits: Control request size to prevent abuse
- Output token limits: Manage response generation costs
- Total token limits: Set overall usage quotas per user/tenant
- Time-based windows: Configure limits per second, minute, or hour
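As a sketch of how these limits can be wired up, assuming the `AIGatewayRoute` declares `llmRequestCosts` (so the gateway emits token counts as dynamic metadata) and that callers are identified by an `x-user-id` header (an assumption for illustration), an Envoy Gateway `BackendTrafficPolicy` can charge each response against a per-user token budget:

```yaml
# Sketch: declare token costs on the route so they appear as metadata.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: hybrid-routing
spec:
  # ... routing rules as in the earlier sketch ...
  llmRequestCosts:
    - metadataKey: llm_total_token
      type: TotalToken
---
# Sketch: enforce a per-user token budget per hour. Numbers are illustrative.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-rate-limit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway               # hypothetical Gateway name
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct     # one bucket per distinct user
          limit:
            requests: 100000         # interpreted here as a token budget
            unit: Hour
          cost:
            request:
              from: Number
              number: 0              # the request itself costs nothing
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token # tokens consumed by this response
```

Swapping `TotalToken` for `InputToken` or `OutputToken` in `llmRequestCosts` gives the input- and output-specific limits described above.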
3. Model/Provider Failover
Ensure high availability with automatic failover mechanisms (a policy sketch follows the list):
- Detect unhealthy backends and route traffic to healthy instances
- Support for active-passive and active-active failover strategies
- Graceful degradation when primary models are unavailable
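There is more than one way to realize this; one sketch using Envoy Gateway primitives combines retries with passive health checking (outlier detection), so that endpoints returning repeated errors are ejected and requests are retried elsewhere. Resource names and thresholds below are illustrative:

```yaml
# Sketch: retry failed requests and eject unhealthy backends.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-failover
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway               # hypothetical Gateway name
  retry:
    numRetries: 2
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
    retryOn:
      httpStatusCodes:
        - 503
      triggers:
        - connect-failure
        - retriable-status-codes
  healthCheck:
    passive:
      consecutive5XxErrors: 3        # eject after repeated 5xx responses
      baseEjectionTime: 30s          # keep ejected endpoints out this long
      maxEjectionPercent: 100        # allow ejecting all bad endpoints
```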
4. Traffic Splitting & Canary Testing
Deploy new models safely with progressive rollout capabilities (a weighted-routing sketch follows the list):
- A/B Testing: Split traffic between model versions to compare performance
- Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
- Shadow Traffic: Send duplicate requests to new models without affecting production
- Weight-based routing: Fine-tune traffic distribution across model variants
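For instance, a canary step might look like the following sketch, splitting traffic 95/5 between a stable and a candidate backend (names hypothetical); advancing the rollout is then just a matter of editing the weights:

```yaml
# Sketch: 95/5 canary split between two backends serving the same model.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: canary-rollout
spec:
  schema:
    name: OpenAI
  parentRefs:
    - name: ai-gateway               # hypothetical Gateway name
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: my-model        # hypothetical model name
      backendRefs:
        - name: model-v1-backend     # stable version
          weight: 95
        - name: model-v2-backend     # canary version
          weight: 5
```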
5. LLM Observability & Monitoring
Gain deep insights into your LLM infrastructure (a scrape-config sketch follows the list):
- Request/Response Metrics: Track latency, throughput, token usage, and error rates
- Model Performance: Monitor accuracy, quality scores, and user satisfaction
- Cost Analytics: Analyze spending patterns across models and providers
- Distributed Tracing: End-to-end visibility with OpenTelemetry integration
- Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack
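If you run the Prometheus Operator, a `ServiceMonitor` is one way to start scraping gateway metrics; the selector labels, namespace, and port name below are assumptions to adapt to your deployment:

```yaml
# Sketch: scrape gateway metrics with the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-gateway-metrics
  namespace: monitoring              # where your Prometheus discovers monitors
spec:
  namespaceSelector:
    matchNames:
      - envoy-gateway-system         # namespace running the gateway pods
  selector:
    matchLabels:
      app.kubernetes.io/name: envoy  # illustrative label; match your Service
  endpoints:
    - port: metrics                  # illustrative port name
      interval: 15s
```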