MCP Server Health Checks¶
The AI Ops app includes automated health check monitoring for MCP (Model Context Protocol) servers to ensure reliable agent operations.
Overview¶
Health checks automatically verify the operational status of HTTP MCP servers by periodically querying their health endpoints. Servers that fail health checks are automatically marked as "Unhealthy" and excluded from agent operations until they recover.
Features¶
- Automated Scheduling: Health checks run every minute (configurable)
- Intelligent Retry Logic: Status changes require verification to prevent false positives
- Parallel Execution: Multiple servers checked simultaneously for efficiency
- Cache Invalidation: MCP client cache automatically cleared when server statuses change
- Selective Checking: Skips servers with "Vulnerable" status or "stdio" protocol
Health Check Process¶
Basic Flow¶
- Query: Fetch all HTTP MCP servers (excluding "Vulnerable" status)
- Check: Send HTTP GET request to each server's health endpoint
- Evaluate: HTTP 200 response = Healthy, otherwise = Unhealthy
- Update: Change server status if needed (with verification)
- Invalidate: Clear MCP client cache if any status changed
Retry Logic¶
To prevent false positives from temporary network issues, the health check uses a verification process before changing server status:
No Retry Scenarios (Status matches check result):
- Server status is "Healthy" AND health check passes → No change needed
- Server status is "Unhealthy" AND health check fails → No change needed
Verification Required (Status differs from check result):
1. Initial health check indicates status should change
2. Wait 5 seconds, perform verification check #1
3. Wait 5 seconds, perform verification check #2
4. Evaluate all 3 checks (initial + 2 verifications)
5. If at least 2 out of 3 checks confirm the new status, flip the status
6. Otherwise, keep current status
Example: If a "Healthy" server fails the initial check:
- Wait 5s → Check again
- Wait 5s → Check again
- If at least 2 out of 3 checks failed → Mark as "Unhealthy"
- If only 1 out of 3 checks failed → Keep "Healthy"
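The verification rule above can be sketched in a few lines. This is a minimal illustration, not the app's actual code: the function name `verify_status_change`, its parameters, and the boolean representation of status (True = Healthy) are all assumptions made for the example.

```python
import time
from typing import Callable


def verify_status_change(
    current_healthy: bool,
    check: Callable[[], bool],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return the status (True = Healthy) a server should have after one cycle.

    Hypothetical helper: `check` performs one health check and returns
    True when the check passes.
    """
    initial = check()
    if initial == current_healthy:
        # Check result matches the stored status: no retry, no change.
        return current_healthy

    # Status would change: run two verification checks, 5 seconds apart.
    results = [initial]
    for _ in range(2):
        sleep(delay_seconds)
        results.append(check())

    # Count the checks that disagree with the stored status.
    confirmations = sum(1 for passed in results if passed != current_healthy)
    if confirmations >= 2:
        return not current_healthy  # flip the status
    return current_healthy  # treat the initial result as a transient blip
```

Passing `sleep` as a parameter is only a convenience so the logic can be exercised without waiting 10 seconds; a real job would simply use the default.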
Parallel Execution¶
Health checks use parallel processing to efficiently handle multiple servers:
- Worker Count: Minimum 2 workers, maximum 50% of available CPU cores
- Execution: Uses Python's ThreadPoolExecutor for concurrent checks
- Timeout: Each individual check times out after 5 seconds
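Performance Impact:
- 10 servers with 4 CPU cores → Uses 2 workers (50% of 4 = 2)
- 20 servers with 16 CPU cores → Uses 8 workers (50% of 16 = 8)
- Worst-case runtime per server: 15 seconds of timeouts and waits as counted here (initial check + 2 × 5s verification); if both verification checks also run to the 5-second timeout, the total approaches 25 seconds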
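As a rough sketch of how the worker sizing and concurrent execution could fit together: the formula `max(2, cores // 2)` is an assumption that happens to reproduce the two examples above, and the names `worker_count` and `check_all` are hypothetical rather than taken from the app.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def worker_count(cpu_cores: Optional[int] = None) -> int:
    """Assumed sizing rule: half the CPU cores, but never fewer than 2."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    return max(2, cores // 2)


def check_all(
    servers: Iterable[str],
    check: Callable[[str], bool],
    cpu_cores: Optional[int] = None,
) -> Dict[str, bool]:
    """Run `check` for every server name concurrently and collect the results."""
    names = list(servers)
    with ThreadPoolExecutor(max_workers=worker_count(cpu_cores)) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(check, names))
    return dict(zip(names, results))
```

Threads suit this workload because each check spends most of its time waiting on the network rather than using the CPU.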
Configuration¶
Scheduling¶
The health check job is automatically scheduled during app migrations and runs every minute by default.
Current Schedule (POC/Testing): * * * * * (every minute)
Production Recommendation: For production environments, update the scheduled job to run less frequently:
- Navigate to Jobs > Scheduled Jobs
- Find "MCP Server Health Check"
- Edit the schedule
- Change crontab to: */5 * * * * (every 5 minutes)
- Save changes
Alternative Schedules:
- Every 5 minutes: */5 * * * *
- Every 10 minutes: */10 * * * *
- Every 15 minutes: */15 * * * *
- Every 30 minutes: */30 * * * *
Server Status Configuration¶
Health checks only process servers meeting these criteria:
- Protocol: Must be http (STDIO servers are skipped)
- Status: Must NOT be "Vulnerable" (these are excluded)
- Health Endpoint: Defaults to /health (configurable per server)
Status Meanings:
- Healthy: Server is operational and responding to health checks
- Unhealthy: Server failed health checks and is excluded from agent operations
- Vulnerable: Manually set status to exclude server from health checks entirely (e.g., known security issues)
Health Check Endpoint¶
Each MCP server can specify its health check endpoint path:
- Default: /health
- Configurable: Set custom path in MCP Server configuration
- URL Construction: {server.url.rstrip('/')}{server.health_check}
Example:
Server URL: https://mcp-server.internal.com
Health Check Path: /health
Final URL: https://mcp-server.internal.com/health
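The URL construction expression can be tried in isolation. The sketch below wraps it in a standalone function; `build_health_url` is a hypothetical name used only for illustration.

```python
def build_health_url(server_url: str, health_check: str = "/health") -> str:
    """Join a server URL and a health check path using the documented expression."""
    # rstrip('/') drops any trailing slash so the result never contains "//health".
    return f"{server_url.rstrip('/')}{health_check}"


print(build_health_url("https://mcp-server.internal.com"))
print(build_health_url("https://mcp-server.internal.com/"))
print(build_health_url("https://mcp-server.internal.com", "/status/ready"))
```

Because the expression is plain concatenation, a custom path should start with a leading slash; a path such as `status` would be appended directly to the hostname.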
SSL Verification¶
SSL certificate verification behavior depends on server type:
- Internal Servers (mcp_type="internal"): SSL verification disabled (for self-signed certs)
- External Servers (mcp_type="external"): SSL verification enabled
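To make the distinction concrete, here is one way the mapping from server type to verification behavior could look using only Python's standard `ssl` module. The app may instead pass a verify flag to its HTTP client; the function name `ssl_context_for` is an assumption for this sketch.

```python
import ssl


def ssl_context_for(mcp_type: str) -> ssl.SSLContext:
    """Build an SSL context matching the documented behavior per server type."""
    context = ssl.create_default_context()  # verifies certificates by default
    if mcp_type == "internal":
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Disabling verification accepts any certificate, including a forged one, so it is only appropriate for servers on a trusted internal network.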
Monitoring¶
Job Execution¶
Monitor health check execution through Nautobot's job system:
- Navigate to Jobs > Job Results
- Filter by job name: "MCP Server Health Check"
- View execution logs and results
Log Messages:
- ✅ Success: Health check completed: X checked, Y changed, Z failed
- ⚠️ Status Change: MCP Server status changed: {name} (Healthy → Unhealthy)
- ❌ Failure: Health check failed: {error}
Cache Invalidation¶
When server statuses change, the MCP client cache is automatically invalidated.
This ensures the agent immediately picks up the new server configuration without requiring a manual cache clear.
Manual Health Checks¶
In addition to automated health checks, you can manually check individual servers:
- Navigate to AI Platform > Configuration > MCP Servers
- Click on a specific MCP server
- Click the Check Health button
- View immediate health check results
Manual health checks:
- Do NOT trigger status changes
- Do NOT clear the MCP client cache
- Provide immediate feedback for troubleshooting
Troubleshooting¶
Health Check Not Running¶
Verify Scheduled Job:
1. Navigate to Jobs > Scheduled Jobs
2. Find "MCP Server Health Check"
3. Verify:
   - Status is "Enabled"
   - Crontab is correct
   - Start time is in the past
   - User is "JobRunner"
Check Job Registration:
- Verify job appears in Jobs list
- Module: ai_ops.jobs.mcp_health_check
- Class: MCPServerHealthCheckJob
Server Stuck in Unhealthy Status¶
Verify Server Accessibility: From the Nautobot host, send a GET request to the server's health check URL.
Expected response: HTTP 200
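If you prefer to test reachability from a Python shell on the Nautobot host, a small standard-library probe along these lines may help. It is a troubleshooting sketch, not the app's health check implementation; `probe_health` and its parameters are hypothetical, and the 5-second default only mirrors the timeout described earlier.

```python
import ssl
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 5.0, verify_ssl: bool = True) -> bool:
    """Return True only if a GET request to `url` answers with HTTP 200."""
    context = ssl.create_default_context()
    if not verify_ssl:
        # Mirrors the behavior described for internal servers.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts,
        # certificate failures, and malformed URLs.
        return False
```

For example, `probe_health("https://mcp-server.internal.com/health", verify_ssl=False)` would test the example server above. Note that `urlopen` follows redirects, so a redirect that ends in a 200 response also counts as success here.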
Common Issues:
- Server is actually down
- Network connectivity problems
- Firewall blocking Nautobot → MCP server
- Health endpoint path incorrect
- SSL certificate issues (for external servers)
Temporary Override: If server is actually healthy but status is stuck:
1. Edit the MCP server
2. Manually set status to "Healthy"
3. Save changes
4. Monitor next health check cycle
All Health Checks Failing¶
Check Nautobot Server:
- Network connectivity working?
- DNS resolution working?
- Firewall rules correct?
Check Celery Workers:
# View Celery worker status
nautobot-server celery inspect active
# Check for stuck tasks
nautobot-server celery inspect scheduled
Review Logs:
# Check Nautobot logs for health check errors
tail -f /var/log/nautobot/nautobot.log | grep "health check"
Performance Tuning¶
Adjust Worker Count¶
Worker count is calculated automatically from the available CPU cores (minimum 2 workers, maximum 50% of cores).
To modify (requires code change):
1. Edit ai_ops/celery_tasks.py
2. Find perform_mcp_health_checks() function
3. Adjust the worker calculation formula
4. Restart Nautobot
Adjust Check Frequency¶
For environments with many servers or slow networks, consider:
- Increase schedule interval: Every 10-15 minutes instead of every minute
- Reduce verification checks: Modify retry logic (requires code change)
- Increase timeout: Modify 5-second timeout (requires code change)
Exclude Servers from Checks¶
To permanently exclude a server from automated health checks:
- Edit the MCP server
- Set status to "Vulnerable"
- Save changes
The server will be skipped in all future automated health checks.
API Integration¶
Celery Task Invocation¶
You can programmatically trigger health checks:
from ai_ops.celery_tasks import perform_mcp_health_checks

# Trigger health checks
result = perform_mcp_health_checks()

# Check results
if result['success']:
    print(f"Checked: {result['checked_count']}")
    print(f"Changed: {result['changed_count']}")
    print(f"Failed: {result['failed_count']}")
    print(f"Cache cleared: {result['cache_cleared']}")
Single Server Check¶
Check a specific server:
from ai_ops.celery_tasks import check_mcp_server_health

# Check server by ID
result = check_mcp_server_health(server_id='uuid-here')

if result['success']:
    if result['status_changed']:
        print(f"Status changed: {result['old_status']} → {result['new_status']}")
    else:
        print(f"Status unchanged: {result['new_status']}")
Best Practices¶
- Production Scheduling: Use 5-15 minute intervals instead of every minute
- Monitor Failures: Review job results regularly for persistent failures
- Use Vulnerable Status: For servers with known issues that shouldn't be checked
- Test Health Endpoints: Manually verify health endpoints work before registering servers
- SSL Certificates: Use valid certificates for external servers, or mark as internal
- Network Requirements: Ensure Nautobot can reach all MCP server URLs
- Resource Planning: More servers means more concurrent work during checks (up to the worker cap of 50% of CPU cores) and longer runs beyond that cap