Embedding Based Routing

This guide shows you how to route requests using semantic similarity with embedding models. Embedding routing matches user queries to predefined categories based on meaning rather than exact keywords, making it ideal for handling diverse phrasings and rapidly evolving categories.

Key Advantages

Scalable: Handles unlimited categories without retraining models
Fast: 10-50ms inference with efficient embedding models (Qwen3, Gemma)
Flexible: Add/remove categories by updating keyword lists, no model retraining
Semantic: Captures meaning beyond exact keyword matching

What Problem Does It Solve?

Keyword matching fails when users phrase questions differently. Embedding routing solves:

Paraphrase handling: "How to install?" matches "installation guide" without exact words
Intent detection: Routes based on semantic meaning, not surface patterns
Fuzzy matching: Handles typos, abbreviations, informal language
Dynamic categories: Add new categories without retraining classification models
Multilingual support: Embeddings capture cross-lingual semantics

When to Use

Customer support with diverse query phrasings
Product inquiries where users ask the same thing many different ways
Technical support needing semantic understanding of error descriptions
Rapidly evolving categories where you need to add/update categories frequently
Moderate latency tolerance (10-50ms acceptable for better semantic accuracy)

Configuration

Add embedding rules to your config.yaml:

classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"

embedding_models:
  qwen3_model_path: "models/Qwen3-Embedding-0.6B"
  gemma_model_path: "models/embeddinggemma-300m"
  use_cpu: true

embedding_rules:
  - category: "technical_support"
    threshold: 0.75
    keywords:
      - "how to configure the system"
      - "installation guide"
      - "troubleshooting steps"
      - "error message explanation"
    aggregation_method: "max"
    model: "auto"
    dimension: 768
    quality_priority: 0.7
    latency_priority: 0.3
  
  - category: "product_inquiry"
    threshold: 0.70
    keywords:
      - "product features and specifications"
      - "pricing information"
      - "availability and stock"
    aggregation_method: "avg"
    model: "gemma"
    dimension: 768
  
  - category: "account_management"
    threshold: 0.72
    keywords:
      - "password reset"
      - "account settings"
      - "subscription management"
    aggregation_method: "max"
    model: "qwen3"
    dimension: 1024

categories:
  - name: technical_support
    system_prompt: "You are a technical support specialist."
    model_scores:
      - model: qwen3
        score: 0.9
        use_reasoning: true
    jailbreak_enabled: true
    pii_detection_enabled: true
  
  - name: product_inquiry
    system_prompt: "You are a product specialist."
    model_scores:
      - model: qwen3
        score: 0.85
        use_reasoning: false

Embedding Models

qwen3: High quality, 1024-dim, 32K context
gemma: Balanced, 768-dim, 8K context, Matryoshka support (128/256/512/768)
auto: Automatically selects based on quality/latency priorities

Aggregation Methods

max: Uses highest similarity score
avg: Uses average similarity across keywords
any: Matches if any keyword exceeds threshold

Example Requests

# Technical support query
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "How do I troubleshoot connection errors?"}]
  }'

# Product inquiry
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What are the pricing options?"}]
  }'

Real-World Use Cases

1. Customer Support (Scalable Categories)

Problem: Need to add new support categories weekly without retraining models Solution: Add new categories by updating keyword lists, embeddings handle semantic matching Impact: Deploy new categories in minutes vs weeks for model retraining

2. E-commerce Support (Fast Semantic Matching)

Problem: "Where's my order?" vs "track package" vs "shipping status" all mean the same Solution: Gemma embeddings (10-20ms) route all variations to order tracking category Impact: 95% accuracy with 10-20ms latency, handles 5K+ queries/sec

3. SaaS Product Inquiries (Flexible Routing)

Problem: Users ask about pricing in 100+ different ways Solution: Semantic similarity matches all variations to "pricing information" keywords Impact: Single category handles all pricing queries without explicit rules

4. Startup Iteration (Rapid Category Updates)

Problem: Product evolves rapidly, need to adjust categories daily Solution: Update embedding keywords in config, no model retraining required Impact: Category updates in seconds vs days for fine-tuning

5. Multilingual Platform (Semantic Understanding)

Problem: Same question in English, Spanish, Chinese needs same routing Solution: Embeddings capture cross-lingual semantics automatically Impact: Single category definition works across languages

Model Selection Strategy

Auto Mode (Recommended)

model: "auto"
quality_priority: 0.7  # Favor accuracy
latency_priority: 0.3  # Accept some latency

Automatically selects Qwen3 (high quality) or Gemma (fast) based on priorities
Balances accuracy vs speed per request

Qwen3 (High Quality)

model: "qwen3"
dimension: 1024

Best for: Complex queries, subtle distinctions, high-value interactions
Latency: ~30-50ms per query
Use case: Account management, financial queries

Gemma (Fast)

model: "gemma"
dimension: 768  # or 512, 256, 128 for Matryoshka

Best for: High-throughput, simple categorization, cost-sensitive
Latency: ~10-20ms per query
Use case: Product inquiries, general support

Performance Characteristics

Model	Dimension	Latency	Accuracy	Memory
Qwen3	1024	30-50ms	Highest	600MB
Gemma	768	10-20ms	High	300MB
Gemma	512	8-15ms	Medium	300MB
Gemma	256	5-10ms	Lower	300MB

Reference

See embedding.yaml for complete configuration.

Key Advantages​

What Problem Does It Solve?​

When to Use​

Configuration​

Embedding Models​

Aggregation Methods​

Example Requests​

Real-World Use Cases​

1. Customer Support (Scalable Categories)​

2. E-commerce Support (Fast Semantic Matching)​

3. SaaS Product Inquiries (Flexible Routing)​

4. Startup Iteration (Rapid Category Updates)​

5. Multilingual Platform (Semantic Understanding)​

Model Selection Strategy​

Auto Mode (Recommended)​

Qwen3 (High Quality)​

Gemma (Fast)​

Performance Characteristics​

Reference​