Deep Agent Architecture¶

Overview¶

The Deep Agent is an advanced AI agent implementation in ai-ops that adopts the deepagents framework from the network-agent project. It adds several capabilities to the standard agent: semantic caching, tool error retry, subagent delegation, a skills system, and cross-conversation memory.

Key Features¶

1. Langfuse LLM Observability¶

Full LLM call tracing and monitoring
Prompt management and versioning
Token usage and cost tracking
Performance analytics and debugging
Web UI at http://localhost:8000
Toggled with the ENABLE_LANGFUSE environment variable
Callbacks attached at the graph level, so they propagate to all child runnables

2. Semantic Caching with Embeddings¶

Caches final LLM responses using vector similarity
Reduces costs by reusing semantically similar answers
Configurable similarity threshold and TTL
Only caches final responses (not intermediate planning steps)
Requires Redis and an embedding model
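
The lookup described above can be sketched without Redis. This is a minimal in-memory illustration, not the actual SemanticCacheMiddleware: the class name ToySemanticCache, its methods, and the use of cosine distance are assumptions made for the example. It shows how a distance threshold and a TTL together decide whether a cached final response is reused.

```python
import math
import time


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory stand-in for a Redis-backed semantic cache."""

    def __init__(self, distance_threshold=0.05, ttl_seconds=3600):
        self.distance_threshold = distance_threshold
        self.ttl_seconds = ttl_seconds
        self._entries = []  # (embedding, response, expires_at)

    def store(self, embedding, response, now=None):
        """Cache a final response together with its prompt embedding."""
        now = time.time() if now is None else now
        self._entries.append((embedding, response, now + self.ttl_seconds))

    def lookup(self, embedding, now=None):
        """Return the closest unexpired response within the threshold, else None."""
        now = time.time() if now is None else now
        best = None
        for cached_embedding, response, expires_at in self._entries:
            if expires_at <= now:
                continue  # entry has outlived its TTL
            distance = cosine_distance(embedding, cached_embedding)
            if distance <= self.distance_threshold:
                if best is None or distance < best[0]:
                    best = (distance, response)
        return None if best is None else best[1]
```

In this sketch a lower threshold accepts only near-identical prompts, which matches the "lower = stricter" note on SEMANTIC_CACHE_THRESHOLD.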

3. Tool Error Retry with Backoff¶

Automatically retries transient tool errors
Identifies retriable errors (connection, timeout, parsing)
Configurable retry count and delay
Returns graceful error messages on failure
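
A rough sketch of this retry behavior follows. It is not the ToolErrorHandlerMiddleware itself: the function name call_with_retry, the base_delay parameter, the exponential delay schedule, and the choice of ValueError as a stand-in for parsing errors are all assumptions made for illustration.

```python
import time

# Assumed set of transient error types; ValueError stands in for parsing errors.
RETRIABLE = (ConnectionError, TimeoutError, ValueError)


def call_with_retry(tool, *args, max_retries=2, base_delay=0.5,
                    sleep=time.sleep, **kwargs):
    """Call `tool`, retrying transient errors with exponential backoff.

    Returns the tool result, or a readable error string once retries are
    exhausted. Non-retriable exceptions propagate unchanged.
    """
    attempts = max_retries + 1
    for attempt in range(attempts):
        try:
            return tool(*args, **kwargs)
        except RETRIABLE as exc:
            if attempt == attempts - 1:
                return f"Tool failed after {attempts} attempts: {exc}"
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * (2 ** attempt))
```

With max_retries=2 and base_delay=0.5, the worst case adds 0.5 + 1.0 = 1.5 seconds of waiting on top of the three tool calls.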

4. Subagent Delegation¶

Hierarchical agent system for specialized tasks
YAML-based subagent configuration
Each subagent can have its own tools and prompts
Example subagents: nautobot-query, network-analyzer

5. Skills System¶

Directory-based skills with markdown instructions
Skills provide domain-specific guidance to the agent
Example skill: nautobot-search for inventory queries
Skills loaded automatically from ai_ops/skills/ directory

6. Cross-Conversation Memory (Store)¶

Redis-based persistent memory across conversations
Stores user preferences, learned facts, context
Accessible via /memories/ path in backend
Falls back to InMemoryStore if Redis is unavailable

7. Connection Pooling¶

Efficient database connection management
Supports both PostgreSQL and Redis checkpointers
Automatic Azure AD token refresh for PostgreSQL
Configurable pool sizes and TTL
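
To illustrate what minimum and maximum pool sizes mean in practice, here is a deliberately simplified pool. It is a sketch only: SimplePool and its factory argument are hypothetical, it is not thread-safe, and it omits TTL handling and token refresh, which the real checkpointer factories would need.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Hypothetical connection pool with a warm minimum and a hard maximum."""

    def __init__(self, factory, min_size=2, max_size=10):
        if min_size > max_size:
            raise ValueError("min_size must not exceed max_size")
        self._factory = factory
        self._max_size = max_size
        self._idle = queue.LifoQueue()
        self.created = 0
        for _ in range(min_size):  # open the minimum number up front
            self._idle.put(self._new())

    def _new(self):
        self.created += 1
        return self._factory()

    @contextmanager
    def connection(self):
        """Borrow a connection and return it to the pool afterwards."""
        try:
            conn = self._idle.get_nowait()
        except queue.Empty:
            if self.created >= self._max_size:
                conn = self._idle.get()  # wait for a connection to be returned
            else:
                conn = self._new()
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

Sequential requests reuse the same connection object, so no new connection is opened per request.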

Architecture Diagram¶

┌─────────────────────────────────────────────────────────────┐
│                      Deep MCP Agent                          │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  create_deep_agent()                                  │   │
│  │  ├── LLM Model (from Django LLMModel)                │   │
│  │  ├── MCP Tools (from MCPServer with auth)            │   │
│  │  ├── Middleware                                        │   │
│  │  │   ├── SemanticCacheMiddleware (Redis + embeddings)│   │
│  │  │   └── ToolErrorHandlerMiddleware (retry logic)    │   │
│  │  ├── Checkpointer (Redis/PostgreSQL with pooling)    │   │
│  │  ├── Store (Redis/InMemory for cross-conv memory)    │   │
│  │  ├── Backend (CompositeBackend with routing)         │   │
│  │  │   ├── FilesystemBackend (skills, memory files)    │   │
│  │  │   └── StoreBackend (/memories/ → Redis)           │   │
│  │  ├── Skills (from ai_ops/skills/)                     │   │
│  │  ├── Memory (from ai_ops/prompts/)                    │   │
│  │  └── Subagents (from agents/subagents.yaml)          │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
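
The routing shown for the Backend entry can be pictured as a prefix match on the path. The sketch below is an assumption-laden illustration, not the deepagents CompositeBackend API: DictBackend and PrefixRouter are hypothetical classes that send /memories/ paths to one backend and everything else to a default.

```python
class DictBackend:
    """Hypothetical backend that keeps file contents in a dictionary."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def write(self, path, content):
        self.files[path] = content

    def read(self, path):
        return self.files[path]


class PrefixRouter:
    """Route each path to the backend with the longest matching prefix."""

    def __init__(self, default, routes):
        self.default = default
        # Check longer prefixes first so the most specific route wins
        self.routes = sorted(routes.items(), key=lambda item: len(item[0]), reverse=True)

    def backend_for(self, path):
        for prefix, backend in self.routes:
            if path.startswith(prefix):
                return backend
        return self.default

    def write(self, path, content):
        self.backend_for(path).write(path, content)

    def read(self, path):
        return self.backend_for(path).read(path)
```

Usage: with `PrefixRouter(default=filesystem, routes={"/memories/": store})`, a write to /memories/prefs.md lands in the store backend, while a write to /skills/nautobot-search/SKILL.md goes to the filesystem backend.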

Directory Structure¶

ai_ops/
├── agents/
│   ├── multi_mcp_agent.py      # Standard agent (existing)
│   ├── deep_mcp_agent.py        # Deep agent (new)
│   └── subagents.yaml           # Subagent configuration
├── helpers/
│   └── deep_agent/              # Deep agent utilities
│       ├── __init__.py
│       ├── checkpoint_factory.py    # Connection pooling
│       ├── store_factory.py         # Cross-conversation memory
│       ├── embedding_factory.py     # Embedding models
│       ├── middleware.py            # Semantic cache, tool retry
│       ├── mcp_tools_auth.py        # MCP tools with auth
│       ├── agents_loader.py         # Subagent configuration loader
│       └── backend_factory.py       # CompositeBackend setup
├── skills/                      # Skills directory
│   └── nautobot-search/
│       └── SKILL.md             # Skill instructions
└── prompts/                     # System prompts (memory files)
    └── *.md

Configuration¶

Environment Variables¶

# Langfuse Observability
ENABLE_LANGFUSE=true
LANGFUSE_PUBLIC_KEY=pk-lf-local-dev-key
LANGFUSE_SECRET_KEY=sk-lf-local-dev-secret
LANGFUSE_HOST=http://langfuse-web:3000

# Redis (required for caching, store, checkpointer)
REDIS_URL=redis://redis:6379

# Checkpointer settings
CHECKPOINT_TTL=3600              # 1 hour
CHECKPOINT_POOL_SIZE=10          # Max connections
CHECKPOINT_POOL_MIN_SIZE=2       # Min connections

# Semantic cache settings  
SEMANTIC_CACHE_TTL=3600          # 1 hour
SEMANTIC_CACHE_THRESHOLD=0.05    # Distance threshold (0-1, lower = stricter match)

# Tool retry settings
TOOL_MAX_RETRIES=2               # Retry attempts

# Embedding model (for semantic cache)
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_BASE_URL=http://ollama:11434

# Optional: Azure AD for PostgreSQL
DB_AUTH_METHOD=basic             # or "service_principal"
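
One way to consume these variables is to parse and validate them once at startup. The sketch below assumes a hypothetical DeepAgentSettings dataclass and load_settings helper. The default values simply mirror the example above and are not confirmed defaults of ai-ops.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default=False):
    """Interpret common truthy strings; anything else is False."""
    return env.get(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class DeepAgentSettings:
    enable_langfuse: bool
    redis_url: str
    checkpoint_ttl: int
    checkpoint_pool_size: int
    checkpoint_pool_min_size: int
    semantic_cache_ttl: int
    semantic_cache_threshold: float
    tool_max_retries: int


def load_settings(env=None):
    """Build settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    settings = DeepAgentSettings(
        enable_langfuse=_env_bool(env, "ENABLE_LANGFUSE"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379"),
        checkpoint_ttl=int(env.get("CHECKPOINT_TTL", "3600")),
        checkpoint_pool_size=int(env.get("CHECKPOINT_POOL_SIZE", "10")),
        checkpoint_pool_min_size=int(env.get("CHECKPOINT_POOL_MIN_SIZE", "2")),
        semantic_cache_ttl=int(env.get("SEMANTIC_CACHE_TTL", "3600")),
        semantic_cache_threshold=float(env.get("SEMANTIC_CACHE_THRESHOLD", "0.05")),
        tool_max_retries=int(env.get("TOOL_MAX_RETRIES", "2")),
    )
    if settings.checkpoint_pool_min_size > settings.checkpoint_pool_size:
        raise ValueError("CHECKPOINT_POOL_MIN_SIZE exceeds CHECKPOINT_POOL_SIZE")
    if not 0.0 <= settings.semantic_cache_threshold <= 1.0:
        raise ValueError("SEMANTIC_CACHE_THRESHOLD must be between 0 and 1")
    return settings
```

Failing fast on an inconsistent pool size or an out-of-range threshold surfaces configuration mistakes before the first request.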

Django Settings¶

No Django settings changes required. The deep agent uses existing:

LLMModel for LLM configuration
MCPServer for MCP tool discovery
SystemPrompt for system prompts
Standard Django database settings

Usage¶

Using Deep Agent via API¶

The deep agent will be available alongside the standard agent. Users can select which agent to use:

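# In API views (to be implemented)
from ai_ops.agents import deep_mcp_agent


# ask_deep_agent is an illustrative wrapper name; `await` is only valid
# inside a coroutine, such as an async Django view.
async def ask_deep_agent():
    response = await deep_mcp_agent.process_message(
        user_input="Find device RTR-NYC-01",
        thread_id="conversation_123",
        username="admin",
        user_token="Bearer token_here",
    )
    return response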

Subagent Configuration¶

Define subagents in ai_ops/agents/subagents.yaml:

nautobot-query:
  description: "Query Nautobot inventory"
  system_prompt: "You are a Nautobot query specialist..."
  tools:
    - mcp_tools

network-analyzer:
  description: "Analyze network topology"  
  system_prompt: "You are a network analysis specialist..."
  tools:
    - mcp_tools
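
Once parsed (for example with a YAML library), the file above becomes a mapping of subagent names to their settings. The following sketch validates such a mapping. The validate_subagents function and the choice of required keys are assumptions for illustration and may differ from what agents_loader.py actually enforces.

```python
# Keys assumed to be mandatory for every subagent entry
REQUIRED_KEYS = ("description", "system_prompt")


def validate_subagents(config):
    """Turn a parsed subagents mapping into a list of normalized dicts.

    Raises ValueError if an entry lacks a required key.
    """
    subagents = []
    for name, spec in config.items():
        missing = [key for key in REQUIRED_KEYS if not spec.get(key)]
        if missing:
            raise ValueError(f"Subagent {name!r} is missing: {', '.join(missing)}")
        subagents.append({
            "name": name,
            "description": spec["description"],
            "system_prompt": spec["system_prompt"],
            "tools": list(spec.get("tools", [])),  # tools are optional here
        })
    return subagents


# Mapping equivalent to the YAML example above
parsed = {
    "nautobot-query": {
        "description": "Query Nautobot inventory",
        "system_prompt": "You are a Nautobot query specialist...",
        "tools": ["mcp_tools"],
    },
}
subagents = validate_subagents(parsed)
```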

Creating Skills¶

Create a skill directory with SKILL.md:

ai_ops/skills/my-skill/
└── SKILL.md                    # Skill instructions and examples

The SKILL.md file should include:

Description of the skill
When to use it
Available tools
Best practices
Examples
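
Directory-based discovery of skills can be sketched with pathlib. The discover_skills function below is hypothetical and only shows the idea of treating each subdirectory that contains a SKILL.md as one skill. The real loader may behave differently.

```python
from pathlib import Path


def discover_skills(skills_root):
    """Map each skill directory name to the text of its SKILL.md file."""
    root = Path(skills_root)
    skills = {}
    if not root.is_dir():
        return skills  # no skills directory means no skills
    for skill_file in sorted(root.glob("*/SKILL.md")):
        skills[skill_file.parent.name] = skill_file.read_text(encoding="utf-8")
    return skills
```

Calling `discover_skills("ai_ops/skills")` on the layout above would return a dictionary keyed by "my-skill", while directories without a SKILL.md are ignored.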

Comparison: Standard Agent vs Deep Agent¶

Feature	Standard Agent	Deep Agent
Framework	`create_agent` (LangChain)	`create_deep_agent` (deepagents)
Caching	None	Semantic cache with embeddings
Tool Retry	None	Automatic retry with backoff
Subagents	No	Yes, YAML-configured
Skills	No	Yes, directory-based
Memory	Checkpointer only	Checkpointer + Store
Checkpointer	MemorySaver (in-memory)	Redis/PostgreSQL with pooling
MCP Tools	Cached	Fresh per request (for auth)
Backend	N/A	CompositeBackend with routing

When to Use Each Agent¶

Use Standard Agent When:¶

Simple conversational queries
No need for semantic caching
Basic tool usage without retry logic
No subagent delegation needed
Existing functionality is sufficient

Use Deep Agent When:¶

Complex multi-step workflows
High volume → semantic caching saves costs
Tools have transient failures → retry helps
Need specialized subagents
Want skills-based guidance
Need cross-conversation memory

Performance Considerations¶

Semantic Cache¶

Cache Hit: Sub-10ms response (no LLM call)
Cache Miss: Normal LLM latency + cache write
Storage: ~1KB per cached response in Redis
Recommendation: Use for production with high query volume

Connection Pooling¶

Without Pool: New connection per request (~100ms overhead)
With Pool: Reuse connection (~1ms overhead)
Recommendation: Always enable in production

Tool Retry¶

Benefit: Reduces failure rate for transient errors
Cost: Additional latency on retry (retry_delay * attempts)
Recommendation: Enable with 2-3 retries max

Running with Langfuse¶

Start the development environment with Langfuse:

cd development
docker-compose -f docker-compose.base.yml \
               -f docker-compose.postgres.yml \
               -f docker-compose.redis.yml \
               -f docker-compose.langfuse.yml up

Access Langfuse UI at: http://localhost:8000

First-time setup:

1. Open http://localhost:8000
2. Create an account (stored locally)
3. Create a project
4. Copy API keys to creds.env

Optional: Disable Langfuse

# In development.env or creds.env
ENABLE_LANGFUSE=false

Troubleshooting¶

Langfuse Not Receiving Traces¶

# Check Langfuse services are running
docker-compose ps langfuse-web langfuse-worker

# Check connection from agent
# Look for: "✓ Langfuse observability enabled"

# Verify environment variables
echo $LANGFUSE_PUBLIC_KEY
echo $LANGFUSE_SECRET_KEY
echo $LANGFUSE_HOST

# Check Langfuse logs
docker-compose logs langfuse-web
docker-compose logs langfuse-worker

Semantic Cache Not Working¶

# Check Redis connection
redis-cli -h redis ping

# Check cache initialization in logs
# Look for: "Semantic cache initialized successfully"

Subagents Not Loading¶

# Verify subagents.yaml exists
ls -la ai_ops/agents/subagents.yaml

# Check logs for subagent loading
# Look for: "Loaded N subagent(s) from..."

MCP Tools Authentication Failures¶

# Verify MCPServer status in Django admin
# Check that servers have status="Healthy"

# Verify auth token is being passed
# Look in logs for: "Loaded N tools from N MCP server(s) with fresh auth token"

Connection Pool Errors¶

# Check database connectivity
psql -h db -U nautobot -d nautobot

# Verify pool configuration
# CHECKPOINT_POOL_SIZE and CHECKPOINT_POOL_MIN_SIZE in .env

Migration Guide¶

To migrate from standard agent to deep agent:

Update dependencies: Run poetry install to get deepagents packages
Configure environment: Add deep agent settings to .env
Restart services: Required for new dependencies
Optional: Create subagents.yaml and skills
Test: Try deep agent via API with test queries
Monitor: Watch logs for caching, retry behavior
Optimize: Adjust cache threshold and retry settings based on usage

Future Enhancements¶

Planned improvements:

- [ ] RAG utilities for vector search (rag_utils.py)
- [ ] Database migration for LLMModel fields
- [ ] API endpoint for agent type selection
- [ ] Admin UI enhancements for deep agent config
- [ ] Performance metrics dashboard
- [ ] A/B testing framework for comparing agents
