Skip to main content

Embedding Based Routing

This guide shows you how to route requests using semantic similarity with embedding models. Embedding routing matches user queries to predefined categories based on meaning rather than exact keywords, making it ideal for handling diverse phrasings and rapidly evolving categories.

Key Advantages​

  • Scalable: Handles unlimited categories without retraining models
  • Fast: 10-50ms inference with efficient embedding models (Qwen3, Gemma)
  • Flexible: Add/remove categories by updating keyword lists, no model retraining
  • Semantic: Captures meaning beyond exact keyword matching

What Problem Does It Solve?​

Keyword matching fails when users phrase questions differently. Embedding routing solves:

  • Paraphrase handling: "How to install?" matches "installation guide" without exact words
  • Intent detection: Routes based on semantic meaning, not surface patterns
  • Fuzzy matching: Handles typos, abbreviations, informal language
  • Dynamic categories: Add new categories without retraining classification models
  • Multilingual support: Embeddings capture cross-lingual semantics

When to Use​

  • Customer support with diverse query phrasings
  • Product inquiries where users ask the same thing many different ways
  • Technical support needing semantic understanding of error descriptions
  • Rapidly evolving categories where you need to add/update categories frequently
  • Moderate latency tolerance (10-50ms acceptable for better semantic accuracy)

Configuration​

Add embedding rules to your config.yaml:

classifier:
category_model:
model_id: "models/category_classifier_modernbert-base_model"
use_modernbert: true
threshold: 0.6
use_cpu: true
category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"

embedding_models:
qwen3_model_path: "models/Qwen3-Embedding-0.6B"
gemma_model_path: "models/embeddinggemma-300m"
use_cpu: true

embedding_rules:
- category: "technical_support"
threshold: 0.75
keywords:
- "how to configure the system"
- "installation guide"
- "troubleshooting steps"
- "error message explanation"
aggregation_method: "max"
model: "auto"
dimension: 768
quality_priority: 0.7
latency_priority: 0.3

- category: "product_inquiry"
threshold: 0.70
keywords:
- "product features and specifications"
- "pricing information"
- "availability and stock"
aggregation_method: "avg"
model: "gemma"
dimension: 768

- category: "account_management"
threshold: 0.72
keywords:
- "password reset"
- "account settings"
- "subscription management"
aggregation_method: "max"
model: "qwen3"
dimension: 1024

categories:
- name: technical_support
system_prompt: "You are a technical support specialist."
model_scores:
- model: qwen3
score: 0.9
use_reasoning: true
jailbreak_enabled: true
pii_detection_enabled: true

- name: product_inquiry
system_prompt: "You are a product specialist."
model_scores:
- model: qwen3
score: 0.85
use_reasoning: false

Embedding Models​

  • qwen3: High quality, 1024-dim, 32K context
  • gemma: Balanced, 768-dim, 8K context, Matryoshka support (128/256/512/768)
  • auto: Automatically selects based on quality/latency priorities

Aggregation Methods​

  • max: Uses highest similarity score
  • avg: Uses average similarity across keywords
  • any: Matches if any keyword exceeds threshold

Example Requests​

# Technical support query
curl -X POST http://localhost:8801/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [{"role": "user", "content": "How do I troubleshoot connection errors?"}]
}'

# Product inquiry
curl -X POST http://localhost:8801/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [{"role": "user", "content": "What are the pricing options?"}]
}'

Real-World Use Cases​

1. Customer Support (Scalable Categories)​

Problem: Need to add new support categories weekly without retraining models Solution: Add new categories by updating keyword lists, embeddings handle semantic matching Impact: Deploy new categories in minutes vs weeks for model retraining

2. E-commerce Support (Fast Semantic Matching)​

Problem: "Where's my order?" vs "track package" vs "shipping status" all mean the same Solution: Gemma embeddings (10-20ms) route all variations to order tracking category Impact: 95% accuracy with 10-20ms latency, handles 5K+ queries/sec

3. SaaS Product Inquiries (Flexible Routing)​

Problem: Users ask about pricing in 100+ different ways Solution: Semantic similarity matches all variations to "pricing information" keywords Impact: Single category handles all pricing queries without explicit rules

4. Startup Iteration (Rapid Category Updates)​

Problem: Product evolves rapidly, need to adjust categories daily Solution: Update embedding keywords in config, no model retraining required Impact: Category updates in seconds vs days for fine-tuning

5. Multilingual Platform (Semantic Understanding)​

Problem: Same question in English, Spanish, Chinese needs same routing Solution: Embeddings capture cross-lingual semantics automatically Impact: Single category definition works across languages

Model Selection Strategy​

model: "auto"
quality_priority: 0.7 # Favor accuracy
latency_priority: 0.3 # Accept some latency
  • Automatically selects Qwen3 (high quality) or Gemma (fast) based on priorities
  • Balances accuracy vs speed per request

Qwen3 (High Quality)​

model: "qwen3"
dimension: 1024
  • Best for: Complex queries, subtle distinctions, high-value interactions
  • Latency: ~30-50ms per query
  • Use case: Account management, financial queries

Gemma (Fast)​

model: "gemma"
dimension: 768 # or 512, 256, 128 for Matryoshka
  • Best for: High-throughput, simple categorization, cost-sensitive
  • Latency: ~10-20ms per query
  • Use case: Product inquiries, general support

Performance Characteristics​

ModelDimensionLatencyAccuracyMemory
Qwen3102430-50msHighest600MB
Gemma76810-20msHigh300MB
Gemma5128-15msMedium300MB
Gemma2565-10msLower300MB

Reference​

See embedding.yaml for complete configuration.