Middleware Configuration Guide¶

This guide provides comprehensive examples for configuring middleware with LLM models in the AI Ops App. Middleware allows you to enhance, modify, or monitor interactions with LLM models through a flexible priority-based execution system.

Overview¶

Middleware executes in a chain pattern with configurable priority:

Request → [Priority 10: Logging] → [Priority 20: Cache] → [Priority 30: Retry] → LLM → Response

Each middleware can process requests before they reach the LLM and/or process responses before they're returned to the user.

Middleware Execution Order¶

Middleware priority determines execution order:

Lower numbers execute first (1-100)
Ties broken alphabetically by middleware name
Pre-processing: Priority 1 → 100 → LLM
Post-processing: LLM → Priority 100 → 1

Recommended Priority Ranges¶

Priority Range	Purpose	Examples
1-20	Request validation, logging	ValidationMiddleware, LoggingMiddleware
21-40	Caching, PII redaction	CacheMiddleware, PIIMiddleware
41-60	Rate limiting, circuit breakers	RateLimitMiddleware
61-80	Retry logic, fallbacks	RetryMiddleware
81-100	Response processing, metrics	MetricsMiddleware

Middleware Types¶

Built-in Middleware¶

Available through LangChain:

CacheMiddleware - Response caching to reduce API calls
RetryMiddleware - Automatic retry with exponential backoff
RateLimitMiddleware - Request rate limiting
LoggingMiddleware - Request/response logging
ValidationMiddleware - Input/output validation

Custom Middleware¶

Implement custom middleware for:

PII detection and redaction
Custom authentication/authorization
Domain-specific transformations
Security scanning
Cost tracking
Custom metrics

Configuration Steps¶

Step 1: Create Middleware Type¶

Navigate to AI Platform > Configuration > Middleware Types

Click + Add to create a new middleware type.

Screenshot Placeholder:

[Screenshot: Middleware Types List View]

Step 2: Configure Middleware Instance¶

Navigate to AI Platform > Configuration > LLM Middleware

Click + Add to apply middleware to a model.

Screenshot Placeholder:

[Screenshot: LLM Middleware Configuration Form]

Step 3: Test Configuration¶

Use the AI Chat Assistant to verify middleware is functioning.

Common Middleware Configurations¶

Cache Middleware¶

Reduces API costs by caching responses for identical requests.

Configuration¶

{
  "cache_backend": "redis",
  "max_entries": 10000,
  "ttl_seconds": 3600,
  "key_prefix": "llm_cache:",
  "cache_responses_only": true,
  "exclude_patterns": ["random", "uuid", "current_time"]
}

Setup in Nautobot¶

Create Middleware Type:

Name: CacheMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Caches LLM responses to reduce API calls and costs
Default Config:
{
  "cache_backend": "redis",
  "max_entries": 10000,
  "ttl_seconds": 3600
}

Apply to Model:

LLM Model: gpt-4o
Middleware: CacheMiddleware
Config:
{
  "cache_backend": "redis",
  "max_entries": 10000,
  "ttl_seconds": 3600,
  "key_prefix": "prod_gpt4o_cache:"
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 20

Screenshot Placeholder:

[Screenshot: Cache Middleware Configuration]

Benefits¶

✓ Reduces API costs for repeated queries
✓ Improves response time for cached requests
✓ Decreases load on LLM provider
✓ Configurable TTL per use case

Best Practices¶

Set TTL based on data freshness requirements
Use cache for deterministic queries (temperature = 0)
Higher cache TTL for stable content
Lower cache TTL for dynamic content
Monitor cache hit rate
Clear cache when models are updated

Retry Middleware¶

Handles transient failures with exponential backoff.

Configuration¶

{
  "max_retries": 3,
  "initial_delay": 1.0,
  "max_delay": 60.0,
  "exponential_base": 2,
  "retry_on": [
    "rate_limit_error",
    "timeout_error",
    "connection_error",
    "server_error"
  ],
  "do_not_retry_on": [
    "authentication_error",
    "invalid_request_error"
  ]
}

Setup in Nautobot¶

Create Middleware Type:

Name: RetryMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Automatic retry with exponential backoff for transient failures
Default Config:
{
  "max_retries": 3,
  "initial_delay": 1.0,
  "max_delay": 60.0,
  "exponential_base": 2
}

Apply to Model:

LLM Model: gpt-4o
Middleware: RetryMiddleware
Config:
{
  "max_retries": 3,
  "initial_delay": 1.0,
  "max_delay": 60.0,
  "exponential_base": 2,
  "retry_on": ["rate_limit_error", "timeout_error", "connection_error"]
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ✓ (Production environments)
Priority: 30

Screenshot Placeholder:

[Screenshot: Retry Middleware Configuration]

Retry Logic¶

Attempt 1: Wait 0s → Fail
Attempt 2: Wait 1s (1.0 * 2^0) → Fail
Attempt 3: Wait 2s (1.0 * 2^1) → Fail
Attempt 4: Wait 4s (1.0 * 2^2) → Success

Benefits¶

✓ Handles rate limiting automatically
✓ Recovers from transient network issues
✓ Prevents cascading failures
✓ Configurable retry strategy

Best Practices¶

Mark as critical for production models
Configure max_retries based on SLA requirements
Set appropriate max_delay to avoid long waits
Don't retry on authentication errors
Monitor retry rates to identify issues

Logging Middleware¶

Logs requests and responses for debugging and monitoring.

Configuration¶

{
  "log_level": "INFO",
  "log_requests": true,
  "log_responses": true,
  "include_tokens": true,
  "include_latency": true,
  "include_model_info": true,
  "log_to_file": true,
  "file_path": "/var/log/ai_ops/llm_requests.log",
  "max_log_length": 1000,
  "mask_sensitive_data": true
}

Setup in Nautobot¶

Create Middleware Type:

Name: LoggingMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Logs LLM requests and responses for debugging and monitoring
Default Config:
{
  "log_level": "INFO",
  "log_requests": true,
  "log_responses": true,
  "include_tokens": true
}

Apply to Model:

LLM Model: gpt-4o
Middleware: LoggingMiddleware
Config:
{
  "log_level": "INFO",
  "log_requests": true,
  "log_responses": true,
  "include_tokens": true,
  "include_latency": true,
  "include_model_info": true,
  "log_to_file": true,
  "file_path": "/var/log/ai_ops/prod_llm.log",
  "max_log_length": 1000
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 10

Screenshot Placeholder:

[Screenshot: Logging Middleware Configuration]

Log Format¶

[2024-01-09 12:34:56] INFO - Model: gpt-4o
Request: "What is the status of device SW-CORE-01?"
Response: "Device SW-CORE-01 is online and operational..."
Tokens: 125 (prompt) + 87 (completion) = 212 total
Latency: 1.23s

Benefits¶

✓ Debugging failed requests
✓ Performance monitoring
✓ Usage analytics
✓ Compliance and auditing
✓ Cost tracking

Best Practices¶

Always enable for production environments
Use INFO level for normal operations
Use DEBUG level for troubleshooting
Mask sensitive data in logs
Rotate log files regularly
Monitor log file sizes

Rate Limit Middleware¶

Controls request rate to prevent quota exhaustion.

Configuration¶

{
  "requests_per_minute": 60,
  "requests_per_hour": 1000,
  "requests_per_day": 10000,
  "burst_size": 10,
  "strategy": "sliding_window",
  "queue_requests": true,
  "max_queue_size": 100
}

Setup in Nautobot¶

Create Middleware Type:

Name: RateLimitMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Controls request rate to prevent quota exhaustion
Default Config:
{
  "requests_per_minute": 60,
  "burst_size": 10
}

Apply to Model:

LLM Model: gpt-4o
Middleware: RateLimitMiddleware
Config:
{
  "requests_per_minute": 60,
  "requests_per_hour": 1000,
  "burst_size": 10,
  "strategy": "sliding_window",
  "queue_requests": true,
  "max_queue_size": 100
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ☐
Priority: 50

Screenshot Placeholder:

[Screenshot: Rate Limit Middleware Configuration]

Benefits¶

✓ Prevents quota exhaustion
✓ Controls costs
✓ Ensures fair resource allocation
✓ Prevents API provider throttling

Best Practices¶

Set limits based on API tier/quota
Leave headroom for bursts
Monitor queue sizes
Adjust limits based on usage patterns

Validation Middleware¶

Validates inputs and outputs for security and correctness.

Configuration¶

{
  "validate_input": true,
  "validate_output": true,
  "max_input_length": 10000,
  "max_output_length": 20000,
  "allowed_input_patterns": ["^[a-zA-Z0-9\\s\\.,!?-]+$"],
  "blocked_output_patterns": ["<script>", "DROP TABLE", "rm -rf"],
  "sanitize_html": true,
  "reject_on_validation_failure": true
}

Setup in Nautobot¶

Create Middleware Type:

Name: ValidationMiddleware
Is Custom: ☐ (Built-in LangChain)
Description: Validates input and output for security and correctness
Default Config:
{
  "validate_input": true,
  "validate_output": true,
  "max_input_length": 10000
}

Apply to Model:

LLM Model: gpt-4o
Middleware: ValidationMiddleware
Config:
{
  "validate_input": true,
  "validate_output": true,
  "max_input_length": 10000,
  "max_output_length": 20000,
  "sanitize_html": true,
  "reject_on_validation_failure": true
}
Config Version: 1.1.0
Is Active: ✓
Is Critical: ✓ (Security critical)
Priority: 15

Screenshot Placeholder:

[Screenshot: Validation Middleware Configuration]

Benefits¶

✓ Prevents injection attacks
✓ Ensures data quality
✓ Protects against malicious inputs
✓ Validates response format

Best Practices¶

Enable for all production models
Mark as critical for security
Customize patterns for your domain
Log validation failures
Review blocked patterns regularly

Custom Middleware Examples¶

PII Redaction Middleware¶

Detects and redacts personally identifiable information.

Configuration¶

{
  "pii_entities": [
    "PERSON",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "SSN",
    "CREDIT_CARD",
    "IP_ADDRESS"
  ],
  "redaction_strategy": "replace",
  "replacement_text": "[REDACTED]",
  "log_detections": true,
  "confidence_threshold": 0.85
}

Setup in Nautobot¶

Create Middleware Type:

Name: PIIRedactionMiddleware
Is Custom: ✓
Description: Detects and redacts PII from requests and responses
Default Config:
{
  "pii_entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "redaction_strategy": "replace",
  "replacement_text": "[REDACTED]"
}

Apply to Model:

LLM Model: gpt-4o
Middleware: PIIRedactionMiddleware
Config:
{
  "pii_entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
  "redaction_strategy": "replace",
  "replacement_text": "[REDACTED]",
  "log_detections": true,
  "confidence_threshold": 0.85
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ✓
Priority: 25

Screenshot Placeholder:

[Screenshot: PII Redaction Middleware Configuration]

Implementation¶

# ai_ops/middleware/pii_redaction.py
from ai_ops.middleware.base import BaseMiddleware

class PIIRedactionMiddleware(BaseMiddleware):
    """Redacts PII from requests and responses."""

    def process_request(self, request):
        # Detect and redact PII from request
        pass

    def process_response(self, response):
        # Detect and redact PII from response
        pass

Cost Tracking Middleware¶

Tracks token usage and estimated costs.

Configuration¶

{
  "track_tokens": true,
  "track_costs": true,
  "cost_per_1k_input_tokens": 0.03,
  "cost_per_1k_output_tokens": 0.06,
  "log_to_database": true,
  "alert_threshold": 100.0,
  "alert_email": "admin@example.com"
}

Setup in Nautobot¶

Create Middleware Type:

Name: CostTrackingMiddleware
Is Custom: ✓
Description: Tracks token usage and estimated costs per request
Default Config:
{
  "track_tokens": true,
  "track_costs": true
}

Apply to Model:

LLM Model: gpt-4o
Middleware: CostTrackingMiddleware
Config:
{
  "track_tokens": true,
  "track_costs": true,
  "cost_per_1k_input_tokens": 0.03,
  "cost_per_1k_output_tokens": 0.06,
  "log_to_database": true,
  "alert_threshold": 100.0,
  "alert_email": "ops@example.com"
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ☐
Priority: 85

Screenshot Placeholder:

[Screenshot: Cost Tracking Middleware Configuration]

Circuit Breaker Middleware¶

Prevents cascading failures by stopping requests when error rate is high.

Configuration¶

{
  "failure_threshold": 5,
  "success_threshold": 2,
  "timeout": 60,
  "half_open_max_requests": 3,
  "failure_rate_threshold": 0.5,
  "window_size": 10
}

Setup in Nautobot¶

Create Middleware Type:

Name: CircuitBreakerMiddleware
Is Custom: ✓
Description: Circuit breaker pattern to prevent cascading failures
Default Config:
{
  "failure_threshold": 5,
  "timeout": 60
}

Apply to Model:

LLM Model: gpt-4o
Middleware: CircuitBreakerMiddleware
Config:
{
  "failure_threshold": 5,
  "success_threshold": 2,
  "timeout": 60,
  "half_open_max_requests": 3
}
Config Version: 1.0.0
Is Active: ✓
Is Critical: ☐
Priority: 55

Screenshot Placeholder:

[Screenshot: Circuit Breaker Middleware Configuration]

Complete Middleware Stack Examples¶

Production Model (gpt-4o)¶

Comprehensive middleware stack for production workloads:

Priority 10: LoggingMiddleware - Request/response logging
Priority 15: ValidationMiddleware - Security validation
Priority 20: CacheMiddleware - Response caching (1 hour TTL)
Priority 25: PIIRedactionMiddleware - PII detection and redaction
Priority 30: RetryMiddleware - Exponential backoff retry
Priority 50: RateLimitMiddleware - Rate limiting
Priority 85: CostTrackingMiddleware - Cost monitoring

Screenshot Placeholder:

[Screenshot: Production Middleware Stack]

Development Model (ollama:llama2)¶

Minimal middleware for local development:

Priority 10: LoggingMiddleware - Verbose debugging
Priority 15: ValidationMiddleware - Input validation only

Screenshot Placeholder:

[Screenshot: Development Middleware Stack]

Non-Production Model (gpt-3.5-turbo)¶

Balanced middleware for testing:

Priority 10: LoggingMiddleware - Standard logging
Priority 20: CacheMiddleware - Extended caching (4 hour TTL)
Priority 30: RetryMiddleware - Basic retry logic
Priority 50: RateLimitMiddleware - Relaxed limits

Screenshot Placeholder:

[Screenshot: Non-Production Middleware Stack]

Middleware Management¶

Enable/Disable Middleware¶

Navigate to AI Platform > Configuration > LLM Middleware

Toggle Is Active to enable/disable without deletion
Disabled middleware remains configured for future use

Screenshot Placeholder:

[Screenshot: Middleware Enable/Disable Toggle]

Update Middleware Priority¶

Click middleware instance to edit
Update Priority field (1-100)
Save changes
Restart affected services

Screenshot Placeholder:

[Screenshot: Priority Update Form]

Critical vs Non-Critical¶

Critical Middleware: Agent fails to start if middleware cannot load
Non-Critical Middleware: Errors logged, agent continues

Use critical flag for security-related middleware (validation, PII redaction).

Monitoring Middleware¶

Performance Impact¶

Monitor middleware overhead:

# Check middleware execution time
from ai_ops.helpers.middleware import get_middleware_metrics

metrics = get_middleware_metrics(model_name="gpt-4o")
# {
#   "LoggingMiddleware": {"avg_time": 0.002, "calls": 1234},
#   "CacheMiddleware": {"avg_time": 0.015, "calls": 1234, "hit_rate": 0.45},
#   "RetryMiddleware": {"avg_time": 0.001, "calls": 1234, "retry_rate": 0.03}
# }

Cache Hit Rates¶

Track cache effectiveness:

# Redis CLI
redis-cli
> INFO stats
> KEYS llm_cache:*
> TTL llm_cache:some_key

Error Rates¶

Monitor middleware errors:

# Check logs
tail -f /var/log/ai_ops/llm_requests.log | grep ERROR

Best Practices Summary¶

Priority Assignment¶

10-20: Logging, validation (foundation)
21-40: Caching, PII redaction (enhancement)
41-60: Rate limiting, circuit breaker (protection)
61-80: Retry logic (reliability)
81-100: Metrics, cost tracking (monitoring)

Configuration Management¶

✓ Version control middleware configurations
✓ Test middleware in non-production first
✓ Document custom middleware thoroughly
✓ Monitor middleware performance impact
✓ Review and update configurations quarterly

Security Considerations¶

✓ Always enable validation middleware
✓ Mark security middleware as critical
✓ Enable PII redaction for sensitive data
✓ Mask sensitive data in logs
✓ Audit middleware access regularly

Performance Optimization¶

✓ Minimize middleware count per model
✓ Use caching to reduce API calls
✓ Monitor middleware execution time
✓ Disable unused middleware
✓ Profile middleware chains regularly

Troubleshooting¶

Middleware Not Executing¶

Check: - Is middleware marked as Active? - Is the model using the correct middleware? - Are there any configuration errors in logs? - Is the middleware type properly registered?

High Latency¶

Check: - Number of active middleware - Middleware priority order - Cache hit rates - Retry rates - Network latency

Configuration Errors¶

Common Issues: - Invalid JSON in config field - Missing required parameters - Incompatible config_version - Middleware dependencies not met

Next Steps¶

After configuring middleware:

Set up MCP Servers - Extend capabilities
Monitor Performance - Track middleware metrics
Review Logs - Analyze middleware execution
Optimize Configuration - Fine-tune settings

Models Reference - Middleware model details
Provider Configuration - LLM provider setup
Architecture Overview - Middleware architecture
API Documentation - Middleware API endpoints

Middleware Configuration Guide¶

Overview¶

Middleware Execution Order¶

Recommended Priority Ranges¶

Middleware Types¶

Built-in Middleware¶

Custom Middleware¶

Configuration Steps¶

Step 1: Create Middleware Type¶

Step 2: Configure Middleware Instance¶

Step 3: Test Configuration¶

Common Middleware Configurations¶

Cache Middleware¶

Configuration¶

Setup in Nautobot¶

Benefits¶

Best Practices¶

Retry Middleware¶

Configuration¶

Setup in Nautobot¶

Retry Logic¶

Benefits¶

Best Practices¶

Logging Middleware¶

Configuration¶

Setup in Nautobot¶

Log Format¶

Benefits¶

Best Practices¶

Rate Limit Middleware¶

Configuration¶

Setup in Nautobot¶

Benefits¶

Best Practices¶

Validation Middleware¶

Configuration¶

Setup in Nautobot¶

Benefits¶

Best Practices¶

Custom Middleware Examples¶

PII Redaction Middleware¶

Configuration¶

Setup in Nautobot¶

Implementation¶

Cost Tracking Middleware¶

Configuration¶

Setup in Nautobot¶

Circuit Breaker Middleware¶

Configuration¶

Setup in Nautobot¶

Complete Middleware Stack Examples¶

Production Model (gpt-4o)¶

Development Model (ollama:llama2)¶

Non-Production Model (gpt-3.5-turbo)¶

Middleware Management¶

Enable/Disable Middleware¶

Update Middleware Priority¶

Critical vs Non-Critical¶

Monitoring Middleware¶

Performance Impact¶

Cache Hit Rates¶

Error Rates¶

Best Practices Summary¶

Priority Assignment¶

Configuration Management¶

Security Considerations¶

Performance Optimization¶

Troubleshooting¶

Middleware Not Executing¶

High Latency¶

Configuration Errors¶

Next Steps¶

Related Documentation¶