Middleware Configuration Guide¶
This guide provides comprehensive examples for configuring middleware for LLM models in the AI Ops App. Middleware allows you to enhance, modify, or monitor interactions with LLM models through a flexible priority-based execution system.
Overview¶
Middleware executes in a chain pattern with configurable priority:
Request → Pre-processing middleware → LLM → Post-processing middleware → Response
Each middleware can process requests before they reach the LLM and/or process responses before they're returned to the user.
Middleware Execution Order¶
Middleware priority determines execution order (sketched in code after this list):
- Lower numbers execute first (1-100)
- Ties broken alphabetically by middleware name
- Pre-processing: Priority 1 → 100 → LLM
- Post-processing: LLM → Priority 100 → 1
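A minimal sketch of this ordering, assuming simple middleware objects with a name, a priority, and the two processing hooks (illustrative only, not the app's actual internals):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Middleware:
    name: str
    priority: int
    process_request: Callable[[str], str]
    process_response: Callable[[str], str]

def run_chain(chain: List[Middleware], llm_call: Callable[[str], str], request: str) -> str:
    # Sort by priority; ties broken alphabetically by middleware name
    ordered = sorted(chain, key=lambda m: (m.priority, m.name))
    for mw in ordered:                    # pre-processing: priority 1 -> 100
        request = mw.process_request(request)
    response = llm_call(request)
    for mw in reversed(ordered):          # post-processing: priority 100 -> 1
        response = mw.process_response(response)
    return response
```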
Recommended Priority Ranges¶
| Priority Range | Purpose | Examples |
|---|---|---|
| 1-20 | Request validation, logging | ValidationMiddleware, LoggingMiddleware |
| 21-40 | Caching, PII redaction | CacheMiddleware, PIIMiddleware |
| 41-60 | Rate limiting, circuit breakers | RateLimitMiddleware |
| 61-80 | Retry logic, fallbacks | RetryMiddleware |
| 81-100 | Response processing, metrics | MetricsMiddleware |
Middleware Types¶
Built-in Middleware¶
Available through LangChain:
- CacheMiddleware - Response caching to reduce API calls
- RetryMiddleware - Automatic retry with exponential backoff
- RateLimitMiddleware - Request rate limiting
- LoggingMiddleware - Request/response logging
- ValidationMiddleware - Input/output validation
Custom Middleware¶
Implement custom middleware for:
- PII detection and redaction
- Custom authentication/authorization
- Domain-specific transformations
- Security scanning
- Cost tracking
- Custom metrics
Configuration Steps¶
Step 1: Create Middleware Type¶
Navigate to AI Platform > Configuration > Middleware Types
Click + Add to create a new middleware type.
Screenshot Placeholder:
[Screenshot: Middleware Types List View]
Step 2: Configure Middleware Instance¶
Navigate to AI Platform > Configuration > LLM Middleware
Click + Add to apply middleware to a model.
Screenshot Placeholder:
[Screenshot: LLM Middleware Configuration Form]
Step 3: Test Configuration¶
Use the AI Chat Assistant to verify that the middleware is functioning as expected.
Common Middleware Configurations¶
Cache Middleware¶
Reduces API costs by caching responses for identical requests.
Configuration¶
{
"cache_backend": "redis",
"max_entries": 10000,
"ttl_seconds": 3600,
"key_prefix": "llm_cache:",
"cache_responses_only": true,
"exclude_patterns": ["random", "uuid", "current_time"]
}
Setup in Nautobot¶
Create Middleware Type:
Name: CacheMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Caches LLM responses to reduce API calls and costs
Default Config:
{
"cache_backend": "redis",
"max_entries": 10000,
"ttl_seconds": 3600
}
Apply to Model:
LLM Model: gpt-4o
Middleware: CacheMiddleware
Config:
{
"cache_backend": "redis",
"max_entries": 10000,
"ttl_seconds": 3600,
"key_prefix": "prod_gpt4o_cache:"
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 20
Screenshot Placeholder:
[Screenshot: Cache Middleware Configuration]
Benefits¶
- ✓ Reduces API costs for repeated queries
- ✓ Improves response time for cached requests
- ✓ Decreases load on LLM provider
- ✓ Configurable TTL per use case
Best Practices¶
- Set TTL based on data freshness requirements
- Use cache for deterministic queries (temperature = 0)
- Higher cache TTL for stable content
- Lower cache TTL for dynamic content
- Monitor cache hit rate
- Clear cache when models are updated
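To make the interaction between key_prefix and exclude_patterns concrete, here is a hypothetical key-derivation helper (the actual middleware may derive keys differently):

```python
import hashlib
import json

def cache_key(config: dict, model: str, prompt: str) -> str | None:
    # Skip caching when the prompt matches a non-deterministic exclude pattern
    if any(p in prompt.lower() for p in config.get("exclude_patterns", [])):
        return None
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return config.get("key_prefix", "llm_cache:") + hashlib.sha256(payload.encode()).hexdigest()

key = cache_key({"key_prefix": "prod_gpt4o_cache:", "exclude_patterns": ["current_time"]},
                "gpt-4o", "What is the status of device SW-CORE-01?")
```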
Retry Middleware¶
Handles transient failures with exponential backoff.
Configuration¶
{
"max_retries": 3,
"initial_delay": 1.0,
"max_delay": 60.0,
"exponential_base": 2,
"retry_on": [
"rate_limit_error",
"timeout_error",
"connection_error",
"server_error"
],
"do_not_retry_on": [
"authentication_error",
"invalid_request_error"
]
}
Setup in Nautobot¶
Create Middleware Type:
Name: RetryMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Automatic retry with exponential backoff for transient failures
Default Config:
{
"max_retries": 3,
"initial_delay": 1.0,
"max_delay": 60.0,
"exponential_base": 2
}
Apply to Model:
LLM Model: gpt-4o
Middleware: RetryMiddleware
Config:
{
"max_retries": 3,
"initial_delay": 1.0,
"max_delay": 60.0,
"exponential_base": 2,
"retry_on": ["rate_limit_error", "timeout_error", "connection_error"]
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ✓ (Production environments)
Priority: 30
Screenshot Placeholder:
[Screenshot: Retry Middleware Configuration]
Retry Logic¶
Attempt 1: Wait 0s → Fail
Attempt 2: Wait 1s (1.0 * 2^0) → Fail
Attempt 3: Wait 2s (1.0 * 2^1) → Fail
Attempt 4: Wait 4s (1.0 * 2^2) → Success
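The schedule above follows directly from the configuration fields; a small helper that reproduces it (parameter names mirror the config keys):

```python
def backoff_delay(retry_number: int, initial_delay: float = 1.0,
                  exponential_base: float = 2.0, max_delay: float = 60.0) -> float:
    # retry_number is 1-based: the first retry waits initial_delay * base**0
    return min(initial_delay * exponential_base ** (retry_number - 1), max_delay)

# Matches the walkthrough: 1s, 2s, 4s, with later retries capped at max_delay
assert [backoff_delay(n) for n in (1, 2, 3)] == [1.0, 2.0, 4.0]
```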
Benefits¶
- ✓ Handles rate limiting automatically
- ✓ Recovers from transient network issues
- ✓ Prevents cascading failures
- ✓ Configurable retry strategy
Best Practices¶
- Mark as critical for production models
- Configure max_retries based on SLA requirements
- Set appropriate max_delay to avoid long waits
- Don't retry on authentication errors
- Monitor retry rates to identify issues
Logging Middleware¶
Logs requests and responses for debugging and monitoring.
Configuration¶
{
"log_level": "INFO",
"log_requests": true,
"log_responses": true,
"include_tokens": true,
"include_latency": true,
"include_model_info": true,
"log_to_file": true,
"file_path": "/var/log/ai_ops/llm_requests.log",
"max_log_length": 1000,
"mask_sensitive_data": true
}
Setup in Nautobot¶
Create Middleware Type:
Name: LoggingMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Logs LLM requests and responses for debugging and monitoring
Default Config:
{
"log_level": "INFO",
"log_requests": true,
"log_responses": true,
"include_tokens": true
}
Apply to Model:
LLM Model: gpt-4o
Middleware: LoggingMiddleware
Config:
{
"log_level": "INFO",
"log_requests": true,
"log_responses": true,
"include_tokens": true,
"include_latency": true,
"include_model_info": true,
"log_to_file": true,
"file_path": "/var/log/ai_ops/prod_llm.log",
"max_log_length": 1000
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 10
Screenshot Placeholder:
[Screenshot: Logging Middleware Configuration]
Log Format¶
[2024-01-09 12:34:56] INFO - Model: gpt-4o
Request: "What is the status of device SW-CORE-01?"
Response: "Device SW-CORE-01 is online and operational..."
Tokens: 125 (prompt) + 87 (completion) = 212 total
Latency: 1.23s
Benefits¶
- ✓ Debugging failed requests
- ✓ Performance monitoring
- ✓ Usage analytics
- ✓ Compliance and auditing
- ✓ Cost tracking
Best Practices¶
- Always enable for production environments
- Use INFO level for normal operations
- Use DEBUG level for troubleshooting
- Mask sensitive data in logs
- Rotate log files regularly (one approach is sketched below)
- Monitor log file sizes
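If the middleware writes through Python's standard logging module (an assumption; many deployments rotate with logrotate instead, and the ai_ops.llm logger name here is hypothetical), rotation can be wired up like this:

```python
import logging
from logging.handlers import RotatingFileHandler

# Keep five 50 MB files; path matches the file_path configured above
handler = RotatingFileHandler("/var/log/ai_ops/prod_llm.log",
                              maxBytes=50_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s - %(message)s"))
logging.getLogger("ai_ops.llm").addHandler(handler)
```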
Rate Limit Middleware¶
Controls request rate to prevent quota exhaustion.
Configuration¶
{
"requests_per_minute": 60,
"requests_per_hour": 1000,
"requests_per_day": 10000,
"burst_size": 10,
"strategy": "sliding_window",
"queue_requests": true,
"max_queue_size": 100
}
Setup in Nautobot¶
Create Middleware Type:
Name: RateLimitMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Controls request rate to prevent quota exhaustion
Default Config:
{
"requests_per_minute": 60,
"burst_size": 10
}
Apply to Model:
LLM Model: gpt-4o
Middleware: RateLimitMiddleware
Config:
{
"requests_per_minute": 60,
"requests_per_hour": 1000,
"burst_size": 10,
"strategy": "sliding_window",
"queue_requests": true,
"max_queue_size": 100
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 50
Screenshot Placeholder:
[Screenshot: Rate Limit Middleware Configuration]
Benefits¶
- ✓ Prevents quota exhaustion
- ✓ Controls costs
- ✓ Ensures fair resource allocation
- ✓ Prevents API provider throttling
Best Practices¶
- Set limits based on API tier/quota
- Leave headroom for bursts
- Monitor queue sizes
- Adjust limits based on usage patterns
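As an illustration of the sliding_window strategy configured above (a simplified single-process sketch, not the middleware's actual implementation):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Illustrative sliding-window limiter over a 60-second window."""
    def __init__(self, requests_per_minute: int = 60):
        self.limit = requests_per_minute
        self.stamps: deque = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.stamps and now - self.stamps[0] > 60:
            self.stamps.popleft()          # evict requests older than the window
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False                       # caller queues or rejects the request
```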
Validation Middleware¶
Validates inputs and outputs for security and correctness.
Configuration¶
{
"validate_input": true,
"validate_output": true,
"max_input_length": 10000,
"max_output_length": 20000,
"allowed_input_patterns": ["^[a-zA-Z0-9\\s\\.,!?-]+$"],
"blocked_output_patterns": ["<script>", "DROP TABLE", "rm -rf"],
"sanitize_html": true,
"reject_on_validation_failure": true
}
Setup in Nautobot¶
Create Middleware Type:
Name: ValidationMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Validates input and output for security and correctness
Default Config:
{
"validate_input": true,
"validate_output": true,
"max_input_length": 10000
}
Apply to Model:
LLM Model: gpt-4o
Middleware: ValidationMiddleware
Config:
{
"validate_input": true,
"validate_output": true,
"max_input_length": 10000,
"max_output_length": 20000,
"sanitize_html": true,
"reject_on_validation_failure": true
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ✓ (Security critical)
Priority: 15
Screenshot Placeholder:
[Screenshot: Validation Middleware Configuration]
Benefits¶
- ✓ Prevents injection attacks
- ✓ Ensures data quality
- ✓ Protects against malicious inputs
- ✓ Validates response format
Best Practices¶
- Enable for all production models
- Mark as critical for security
- Customize patterns for your domain
- Log validation failures
- Review blocked patterns regularly
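A sketch of the checks these settings imply (simplified; the real middleware also handles HTML sanitization and rejection semantics):

```python
import re

def validate_request(text: str, config: dict) -> bool:
    # Length check plus optional allow-list of input patterns
    if len(text) > config.get("max_input_length", 10_000):
        return False
    allowed = config.get("allowed_input_patterns", [])
    return not allowed or any(re.match(p, text) for p in allowed)

def validate_response(text: str, config: dict) -> bool:
    # Length check plus block-list of dangerous output fragments
    if len(text) > config.get("max_output_length", 20_000):
        return False
    blocked = config.get("blocked_output_patterns", [])
    return not any(b.lower() in text.lower() for b in blocked)
```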
Custom Middleware Examples¶
PII Redaction Middleware¶
Detects and redacts personally identifiable information.
Configuration¶
{
"pii_entities": [
"PERSON",
"EMAIL_ADDRESS",
"PHONE_NUMBER",
"SSN",
"CREDIT_CARD",
"IP_ADDRESS"
],
"redaction_strategy": "replace",
"replacement_text": "[REDACTED]",
"log_detections": true,
"confidence_threshold": 0.85
}
Setup in Nautobot¶
Create Middleware Type:
Name: PIIRedactionMiddleware
Is Custom: ✓
Description: Detects and redacts PII from requests and responses
Default Config:
{
"pii_entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
"redaction_strategy": "replace",
"replacement_text": "[REDACTED]"
}
Apply to Model:
LLM Model: gpt-4o
Middleware: PIIRedactionMiddleware
Config:
{
"pii_entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
"redaction_strategy": "replace",
"replacement_text": "[REDACTED]",
"log_detections": true,
"confidence_threshold": 0.85
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ✓
Priority: 25
Screenshot Placeholder:
[Screenshot: PII Redaction Middleware Configuration]
Implementation¶
A minimal sketch of the redaction logic (the regex patterns are illustrative; production code should use a dedicated PII detection library driven by the config):
# ai_ops/middleware/pii_redaction.py
import re
from ai_ops.middleware.base import BaseMiddleware

# Illustrative patterns for two of the configured entity types
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # EMAIL_ADDRESS
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # SSN
]

class PIIRedactionMiddleware(BaseMiddleware):
    """Redacts PII from requests and responses."""
    def _redact(self, text):
        for pattern in PII_PATTERNS:
            text = pattern.sub("[REDACTED]", text)
        return text
    def process_request(self, request):
        # Detect and redact PII before the request reaches the LLM
        return self._redact(request)
    def process_response(self, response):
        # Detect and redact PII before the response is returned to the user
        return self._redact(response)
Cost Tracking Middleware¶
Tracks token usage and estimated costs.
Configuration¶
{
"track_tokens": true,
"track_costs": true,
"cost_per_1k_input_tokens": 0.03,
"cost_per_1k_output_tokens": 0.06,
"log_to_database": true,
"alert_threshold": 100.0,
"alert_email": "admin@example.com"
}
Setup in Nautobot¶
Create Middleware Type:
Name: CostTrackingMiddleware
Is Custom: ✓
Description: Tracks token usage and estimated costs per request
Default Config:
{
"track_tokens": true,
"track_costs": true
}
Apply to Model:
LLM Model: gpt-4o
Middleware: CostTrackingMiddleware
Config:
{
"track_tokens": true,
"track_costs": true,
"cost_per_1k_input_tokens": 0.03,
"cost_per_1k_output_tokens": 0.06,
"log_to_database": true,
"alert_threshold": 100.0,
"alert_email": "ops@example.com"
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ☐
Priority: 85
Screenshot Placeholder:
[Screenshot: Cost Tracking Middleware Configuration]
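The per-request cost calculation implied by this configuration is straightforward; parameter names mirror the config keys:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  cost_per_1k_input_tokens: float = 0.03,
                  cost_per_1k_output_tokens: float = 0.06) -> float:
    return ((prompt_tokens / 1000) * cost_per_1k_input_tokens
            + (completion_tokens / 1000) * cost_per_1k_output_tokens)

# Using the token counts from the logging example: 125 prompt + 87 completion
print(f"${estimate_cost(125, 87):.4f}")  # -> $0.0090
```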
Circuit Breaker Middleware¶
Prevents cascading failures by stopping requests when error rate is high.
Configuration¶
{
"failure_threshold": 5,
"success_threshold": 2,
"timeout": 60,
"half_open_max_requests": 3,
"failure_rate_threshold": 0.5,
"window_size": 10
}
Setup in Nautobot¶
Create Middleware Type:
Name: CircuitBreakerMiddleware
Is Custom: ✓
Description: Circuit breaker pattern to prevent cascading failures
Default Config:
{
"failure_threshold": 5,
"timeout": 60
}
Apply to Model:
LLM Model: gpt-4o
Middleware: CircuitBreakerMiddleware
Config:
{
"failure_threshold": 5,
"success_threshold": 2,
"timeout": 60,
"half_open_max_requests": 3
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ☐
Priority: 55
Screenshot Placeholder:
[Screenshot: Circuit Breaker Middleware Configuration]
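A minimal sketch of the three-state breaker these settings describe (closed → open → half-open; simplified, ignoring half_open_max_requests and the failure-rate window):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 success_threshold: int = 2, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state, self.failures, self.successes = "closed", 0, 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open" and time.monotonic() - self.opened_at >= self.timeout:
            self.state, self.successes = "half_open", 0   # probe cautiously
        return self.state != "open"

    def record(self, success: bool) -> None:
        if success:
            self.successes += 1
            if self.state == "half_open" and self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0   # recovered
        else:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", time.monotonic()
                self.successes = 0
```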
Complete Middleware Stack Examples¶
Production Model (gpt-4o)¶
Comprehensive middleware stack for production workloads:
Priority 10: LoggingMiddleware - Request/response logging
Priority 15: ValidationMiddleware - Security validation
Priority 20: CacheMiddleware - Response caching (1 hour TTL)
Priority 25: PIIRedactionMiddleware - PII detection and redaction
Priority 30: RetryMiddleware - Exponential backoff retry
Priority 50: RateLimitMiddleware - Rate limiting
Priority 85: CostTrackingMiddleware - Cost monitoring
Screenshot Placeholder:
[Screenshot: Production Middleware Stack]
Development Model (ollama:llama2)¶
Minimal middleware for local development:
Priority 10: LoggingMiddleware - Verbose debugging
Priority 15: ValidationMiddleware - Input validation only
Screenshot Placeholder:
[Screenshot: Development Middleware Stack]
Non-Production Model (gpt-3.5-turbo)¶
Balanced middleware for testing:
Priority 10: LoggingMiddleware - Standard logging
Priority 20: CacheMiddleware - Extended caching (4 hour TTL)
Priority 30: RetryMiddleware - Basic retry logic
Priority 50: RateLimitMiddleware - Relaxed limits
Screenshot Placeholder:
[Screenshot: Non-Production Middleware Stack]
Middleware Management¶
Enable/Disable Middleware¶
Navigate to AI Platform > Configuration > LLM Middleware
- Toggle Is Active to enable/disable without deletion
- Disabled middleware remains configured for future use
Screenshot Placeholder:
[Screenshot: Middleware Enable/Disable Toggle]
Update Middleware Priority¶
- Click middleware instance to edit
- Update Priority field (1-100)
- Save changes
- Restart affected services
Screenshot Placeholder:
[Screenshot: Priority Update Form]
Critical vs Non-Critical¶
- Critical Middleware: Agent fails to start if middleware cannot load
- Non-Critical Middleware: Errors logged, agent continues
Use critical flag for security-related middleware (validation, PII redaction).
Monitoring Middleware¶
Performance Impact¶
Monitor middleware overhead:
# Check middleware execution time
from ai_ops.helpers.middleware import get_middleware_metrics
metrics = get_middleware_metrics(model_name="gpt-4o")
# {
# "LoggingMiddleware": {"avg_time": 0.002, "calls": 1234},
# "CacheMiddleware": {"avg_time": 0.015, "calls": 1234, "hit_rate": 0.45},
# "RetryMiddleware": {"avg_time": 0.001, "calls": 1234, "retry_rate": 0.03}
# }
Cache Hit Rates¶
Track cache effectiveness:
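For example, using the metrics helper shown above (field names as in the sample payload):

```python
from ai_ops.helpers.middleware import get_middleware_metrics

metrics = get_middleware_metrics(model_name="gpt-4o")
hit_rate = metrics["CacheMiddleware"]["hit_rate"]
if hit_rate < 0.2:
    print(f"Low cache hit rate ({hit_rate:.0%}); review ttl_seconds and exclude_patterns")
```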
Error Rates¶
Monitor middleware errors:
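For RetryMiddleware, the metrics payload above already exposes retry_rate. For a rough error check, one option is to scan the log file written by LoggingMiddleware (path as configured earlier; log format as shown in the Logging section):

```python
# Count ERROR entries in the middleware log (a coarse but dependency-free check)
with open("/var/log/ai_ops/prod_llm.log") as fh:
    error_count = sum(1 for line in fh if " ERROR " in line)
print(f"{error_count} middleware errors logged")
```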
Best Practices Summary¶
Priority Assignment¶
- 1-20: Logging, validation (foundation)
- 21-40: Caching, PII redaction (enhancement)
- 41-60: Rate limiting, circuit breaker (protection)
- 61-80: Retry logic (reliability)
- 81-100: Metrics, cost tracking (monitoring)
Configuration Management¶
- ✓ Version control middleware configurations
- ✓ Test middleware in non-production first
- ✓ Document custom middleware thoroughly
- ✓ Monitor middleware performance impact
- ✓ Review and update configurations quarterly
Security Considerations¶
- ✓ Always enable validation middleware
- ✓ Mark security middleware as critical
- ✓ Enable PII redaction for sensitive data
- ✓ Mask sensitive data in logs
- ✓ Audit middleware access regularly
Performance Optimization¶
- ✓ Minimize middleware count per model
- ✓ Use caching to reduce API calls
- ✓ Monitor middleware execution time
- ✓ Disable unused middleware
- ✓ Profile middleware chains regularly
Troubleshooting¶
Middleware Not Executing¶
Check:
- Is the middleware marked as Active?
- Is the model using the correct middleware?
- Are there any configuration errors in the logs?
- Is the middleware type properly registered?
High Latency¶
Check:
- Number of active middleware
- Middleware priority order
- Cache hit rates
- Retry rates
- Network latency
Configuration Errors¶
Common Issues:
- Invalid JSON in the config field
- Missing required parameters
- Incompatible config_version
- Middleware dependencies not met
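The first issue is easy to catch before saving the form; a quick standalone check (hypothetical helper, not part of the app):

```python
import json

def check_config(config_text: str) -> dict | None:
    try:
        return json.loads(config_text)
    except json.JSONDecodeError as err:
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None
```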
Next Steps¶
After configuring middleware:
- Set up MCP Servers - Extend capabilities
- Monitor Performance - Track middleware metrics
- Review Logs - Analyze middleware execution
- Optimize Configuration - Fine-tune settings
Related Documentation¶
- Models Reference - Middleware model details
- Provider Configuration - LLM provider setup
- Architecture Overview - Middleware architecture
- API Documentation - Middleware API endpoints