Captain Claw

Multi-Modal AI Agent Framework - Comprehensive System Analysis

System Overview
  • 137 Python Files
  • 40+ Tools
  • 13 Agent Mixins
  • 50+ REST Endpoints

System Architecture
Core Agent Engine (13 Specialized Mixins)

🎯 Orchestration Mixin

Main turn-level request processing loop managing iteration budgets, progress tracking, and completion gating. Coordinates the entire agent execution flow.

🔧 Tool Loop Mixin

LLM tool call extraction, execution, and result management with duplicate detection and scale-aware guards. Executes function calls reliably.
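As a rough illustration of duplicate detection in a tool loop, the sketch below keys each call on its tool name plus canonicalized arguments. The class name `DuplicateCallGuard`, its `should_run` method, and the `max_repeats` parameter are hypothetical and not taken from the Captain Claw source.

```python
import json


class DuplicateCallGuard:
    """Hypothetical helper: refuses tool calls already seen too many times."""

    def __init__(self, max_repeats: int = 1):
        self.max_repeats = max_repeats
        self._counts: dict[str, int] = {}

    def should_run(self, tool: str, args: dict) -> bool:
        # sort_keys makes the key independent of argument ordering.
        key = f"{tool}:{json.dumps(args, sort_keys=True)}"
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.max_repeats
```

With the default `max_repeats=1`, a second identical call is rejected even if the arguments arrive in a different key order.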

✅ Completion Mixin

Multi-stage validation gates ensuring task requirements are met before response finalization. Ensures quality outputs.

🧠 Context Mixin

Dynamic system prompt construction, semantic memory integration, and intelligent message selection within token budgets.

💾 Session Mixin

Token-aware message handling, context compaction, and runtime configuration synchronization. Manages conversation state.

📁 File Operations Mixin

Script generation, execution, and structured result wrapping. Handles file-based workflows.

🛡️ Guard Mixin

Input/output content filtering and approval workflows. Ensures safe operation with human oversight.

🤖 Model Mixin

Runtime model selection and provider resolution. Supports multiple LLM providers.

⚙️ Pipeline Mixin

DAG-based task pipeline construction with dependency resolution and timeout management.

💭 Reasoning Mixin

Task contract generation, critic validation, and list-member extraction for complex reasoning.

🔍 Research Mixin

Multi-stage web research pipeline for entity extraction and content aggregation.

📊 Scale Detection Mixin

Large-scale list-processing task detection and advisory injection. Prevents context overflow.

🔄 Scale Loop Mixin

Per-item batch processing with constant-context isolation. Handles massive datasets efficiently.

Tool Ecosystem (40+ Tools)
Tools by Category

📄 File & Text Operations

  • Safe file reading
  • Sandboxed file writing
  • Surgical file editing
  • Pattern-based discovery

🌐 Web & Data Integration

  • HTTP content retrieval
  • Real-time web search
  • Google Drive/Workspace
  • Vector search indexing

📑 Document Processing

  • PDF/DOCX/XLSX extraction
  • OCR capabilities
  • Image generation
  • Batch summarization

🌍 Browser Automation

  • Playwright integration
  • Token-efficient trees
  • Workflow recording
  • API replay capability

💻 System & Hardware

  • Shell execution
  • GUI automation
  • Screen capture
  • Speech I/O

📌 Productivity & Context

  • Task management
  • Address book
  • Script registry
  • Email dispatch
Memory & Session Management
Multi-Layer Memory Architecture

Working Memory

In-turn context buffer with automatic compaction. Holds current conversation state and recent messages for immediate context.

Semantic Memory

SQLite FTS5 + vector embeddings with hybrid search. Session-scoped retrieval optimized for relevance and recency.

Deep Memory

Typesense-backed long-term archive with chunking and embedding. Persistent storage across sessions with vector similarity search.

File Registry

Logical-to-physical path mapping for cross-task artifact discovery. Enables downstream tasks to reference upstream outputs.

Web UI Infrastructure
API Endpoints Distribution

Core Server

aiohttp-based async web server with 100+ HTTP/WebSocket routes. Real-time callback routing to connected clients and multi-session state management.

WebSocket Communication

Chat message routing, session state sync, live speech-to-text streaming, and agent execution with concurrent task naming.

REST API (50+ Endpoints)

Session management, entity CRUD, datastore operations, cron scheduling, file browsing, configuration management, and OAuth flows.

Static Pages (23 Templates)

Chat interface, orchestration dashboard, workflow builder, memory explorer, settings panel, sessions manager, datastore UI, and more.

Platform Integrations
Supported Integration Platforms

Telegram

Per-user Agent instances with session isolation, user approval workflow, typing indicators, image support, slash commands, and cron job management.

Discord

DM-based polling interface with message normalization, bot mention detection, and audio file upload support.

Slack

DM-first polling with pagination, user caching, username resolution, and thread reply support.

Google OAuth

PKCE-based authorization flow with token lifecycle management and credential injection into Vertex AI provider.
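For readers unfamiliar with PKCE, the following minimal sketch shows how a code verifier and its S256 challenge are usually derived with the Python standard library. The function names are illustrative assumptions; the framework's own OAuth code may be organized differently.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    # 64 random bytes encode to 86 URL-safe characters,
    # inside the 43-128 character range PKCE allows for verifiers.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code cannot be redeemed on its own.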

BotPort Network

Distributed agent coordination with concern-based task delegation and multi-hop agent communication.

Hotkey Daemon

Global keyboard listener for voice activation with double/triple-tap state machine, screenshot capture, and clipboard detection.
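The double/triple-tap detection can be pictured as grouping key presses that fall within a short gap of each other. The sketch below is an assumption about how such a state machine might work; `classify_taps` and the 0.4 second `max_gap` are hypothetical values, not settings from the project.

```python
def classify_taps(timestamps: list[float], max_gap: float = 0.4) -> list[str]:
    """Group press timestamps (in seconds) into single/double/triple gestures."""
    names = {1: "single", 2: "double", 3: "triple"}
    gestures: list[str] = []
    count = 0
    last = None
    for t in timestamps:
        if last is not None and t - last <= max_gap and count < 3:
            count += 1  # press continues the current burst
        else:
            if count:
                gestures.append(names[count])  # close the previous burst
            count = 1
        last = t
    if count:
        gestures.append(names[count])
    return gestures
```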

Key Design Patterns

1. Mixin-Based Composition

The Agent class uses 13 mixins to provide distinct capabilities without deep inheritance hierarchies. Each mixin focuses on a specific concern (orchestration, tools, context, memory, etc.), enabling modular testing and feature toggling.
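A minimal sketch of this composition style, using three stand-in mixins rather than the real thirteen. All method names here (`run_turn`, `execute_tool`, `is_complete`) are hypothetical and only illustrate how mixins call into one another through the shared `Agent` instance.

```python
class OrchestrationMixin:
    """Stand-in for the turn-level loop with an iteration budget."""

    max_iterations = 3

    def run_turn(self, request: str) -> str:
        for _ in range(self.max_iterations):
            result = self.execute_tool(request)  # supplied by ToolLoopMixin
            if self.is_complete(result):         # supplied by CompletionMixin
                return result
        return "iteration budget exhausted"


class ToolLoopMixin:
    def execute_tool(self, request: str) -> str:
        return f"handled: {request}"


class CompletionMixin:
    def is_complete(self, result: str) -> bool:
        return result.startswith("handled")


class Agent(OrchestrationMixin, ToolLoopMixin, CompletionMixin):
    """Flat composition: each capability lives in exactly one mixin."""
```

Because each mixin is an ordinary class, a test can compose a reduced agent from only the mixins it needs.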

2. Callback-Driven Architecture

Agent execution feeds events through registered callbacks (status, thinking, tool_output, approval) that route to UI layers without tight coupling. Enables real-time monitoring and multi-client synchronization.
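The decoupling described here can be sketched with a small event bus. `CallbackBus`, `on`, and `emit` are illustrative names; only the event names (status, thinking, tool_output, approval) come from the description above.

```python
from collections import defaultdict
from typing import Any, Callable


class CallbackBus:
    """Hypothetical registry mapping event names to handler callbacks."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[[Any], None]) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: Any) -> int:
        """Deliver payload to every handler; return how many were notified."""
        handlers = self._handlers.get(event, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

The agent side only calls `emit`; whether zero, one, or several UI clients are listening is invisible to it.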

3. Async-First Design

All I/O operations use asyncio with non-blocking patterns. Long-running blocking operations (LLM calls, file I/O, network requests) run in thread pools so they do not stall the event loop.
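One common way to keep blocking work off the event loop is `asyncio.to_thread`, shown below as a sketch. The `blocking_read` function is a hypothetical placeholder for any synchronous call; the framework may use a different executor setup.

```python
import asyncio
import time


def blocking_read(delay: float) -> str:
    """Placeholder for a synchronous call such as file I/O."""
    time.sleep(delay)
    return "done"


async def read_concurrently() -> list[str]:
    # Each call runs in the default thread pool, so the two sleeps overlap
    # and the event loop stays free to service other coroutines.
    return list(
        await asyncio.gather(
            asyncio.to_thread(blocking_read, 0.05),
            asyncio.to_thread(blocking_read, 0.05),
        )
    )
```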

4. Token-Aware Context Management

The system tracks token consumption at multiple levels (message, turn, session) and implements intelligent context compaction, chunking, and message selection to stay within LLM context windows.
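As an illustration of message selection under a token budget, the sketch below keeps the most recent messages that fit. The four-characters-per-token estimate and both function names are assumptions made for the example, not the framework's actual tokenizer or API.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly four characters per token.
    return max(1, len(text) // 4)


def select_messages(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined estimate fits the budget."""
    selected: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        selected.append(message)
        used += cost
    selected.reverse()  # restore chronological order
    return selected
```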

5. Multi-Layer Memory

Working memory (in-turn), semantic memory (session-scoped), and deep memory (long-term archive) provide different retrieval patterns optimized for recency, relevance, and scale.

6. Guard Rails & Approval Workflows

Input/output guards with configurable levels (stop_suspicious, ask_for_approval) and tool execution approval callbacks enable safe autonomous operation with human oversight.
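A sketch of how the two guard levels named above might behave. The marker list, the function name `check_guard`, and the `approve` callback signature are all hypothetical; a real guard would likely use a more capable classifier than substring matching.

```python
from typing import Callable

# Hypothetical markers, for illustration only.
SUSPICIOUS_MARKERS = ("rm -rf", "ignore previous instructions")


def check_guard(text: str, level: str, approve: Callable[[str], bool]) -> bool:
    """Return True when the text may pass the guard."""
    suspicious = any(marker in text.lower() for marker in SUSPICIOUS_MARKERS)
    if not suspicious:
        return True
    if level == "stop_suspicious":
        return False  # block outright
    if level == "ask_for_approval":
        return approve(text)  # defer to a human decision
    raise ValueError(f"unknown guard level: {level}")
```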

7. Scale Detection & Micro-Loops

Automatic detection of large-scale list-processing tasks triggers a specialized micro-loop that processes items one-at-a-time with constant context, preventing context window overflow.
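The constant-context idea can be sketched as follows: every item is handled with a freshly built context of fixed size, so context does not grow with the list. The threshold of 20 and the names `is_large_scale` and `run_micro_loop` are assumptions for illustration.

```python
from typing import Callable


def is_large_scale(items: list[str], threshold: int = 20) -> bool:
    """Hypothetical detector: treat lists longer than threshold as large."""
    return len(items) > threshold


def run_micro_loop(
    items: list[str],
    instruction: str,
    handle_item: Callable[[list[str]], object],
) -> list[object]:
    results = []
    for item in items:
        # Fresh context per item: the instruction plus this one item only.
        context = [instruction, item]
        results.append(handle_item(context))
    return results
```

Passing `len` as the handler shows that each call sees a context of exactly two entries, regardless of how many items have already been processed.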

8. Orchestration & Parallelization

SessionOrchestrator decomposes complex requests into DAGs, executes tasks in parallel with dependency constraints, and manages timeout/retry policies for resilient multi-step workflows.
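Dependency-constrained parallel execution of this kind can be sketched with the standard library's `graphlib.TopologicalSorter` and asyncio. This is not SessionOrchestrator's code; `run_dag` and `demo_task` are hypothetical names, and retry handling is omitted.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Awaitable, Callable


async def demo_task(name: str) -> None:
    """Placeholder task body for the example."""
    await asyncio.sleep(0)


async def run_dag(
    deps: dict[str, set[str]],
    run_task: Callable[[str], Awaitable[None]],
    timeout: float = 5.0,
) -> list[str]:
    """Run tasks as soon as their dependencies finish; return completion order."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    finished: list[str] = []
    running: dict[asyncio.Task, str] = {}
    while sorter.is_active():
        for name in sorter.get_ready():
            task = asyncio.create_task(asyncio.wait_for(run_task(name), timeout))
            running[task] = name
        done, _ = await asyncio.wait(set(running), return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            task.result()  # re-raise task errors, including timeouts
            name = running.pop(task)
            finished.append(name)
            sorter.done(name)  # unblocks dependents
    return finished
```

Independent tasks are started together, while a task with unfinished dependencies is not handed out by `get_ready` until `done` has been called for each of them.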

9. File Registry & Cross-Task Sharing

FileRegistry maps logical paths to physical locations, enabling downstream tasks to discover and reference upstream artifacts without knowledge of session IDs or directory structures.
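The logical-to-physical mapping can be pictured with the small sketch below. `FileRegistrySketch` and its `register` and `resolve` methods are hypothetical stand-ins; the real FileRegistry interface is not shown in this analysis.

```python
from pathlib import Path


class FileRegistrySketch:
    """Hypothetical stand-in: maps logical names to physical paths."""

    def __init__(self) -> None:
        self._paths: dict[str, Path] = {}

    def register(self, logical: str, physical: str) -> None:
        self._paths[logical] = Path(physical)

    def resolve(self, logical: str) -> Path:
        try:
            return self._paths[logical]
        except KeyError:
            raise FileNotFoundError(f"no artifact registered as {logical!r}") from None
```

A downstream task asks only for the logical name and never needs to know which session directory produced the file.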

10. Configuration Hierarchy

Configuration is resolved through a chain of sources: environment variables → .env file → config.yaml (home) → config.yaml (local) → hardcoded defaults. Environment variables carry the security-sensitive overrides, such as secrets.
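One way to model such a chain is `collections.ChainMap`, where the first mapping that contains a key wins. The sketch assumes the chain is read left to right with the leftmost source taking priority; that reading, the function name, and the example keys are assumptions rather than confirmed behavior.

```python
from collections import ChainMap


def resolve_config(
    env: dict, dotenv: dict, home_yaml: dict, local_yaml: dict, defaults: dict
) -> ChainMap:
    # Lookup walks the maps in order and returns the first match,
    # so earlier sources override later ones (assumed priority order).
    return ChainMap(env, dotenv, home_yaml, local_yaml, defaults)
```

For example, a key present in both the environment and the .env mapping resolves to the environment value, while a key found only in the defaults still resolves.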

Notable Technical Achievements
🎯 Token-Efficient Browser Automation

PinchTab uses accessibility trees (~800 tokens) instead of screenshots (~2K+ tokens), cutting token usage for web automation tasks by roughly 2-3x.

📚 Chunked Processing Pipeline

Automatically detects context overflow and splits large documents into chunks, processes independently, and combines results via LLM synthesis.

🔀 Dual-Mode Orchestration

Supports both fast "loop" mode (direct tool execution) and "contracts" mode (planner + critic validation) for different task complexity levels.

📊 Intelligent Scale Detection

Automatically detects large-scale tasks and switches to per-item processing to prevent context explosion, with automatic list extraction after content fetch.

🔌 Multi-Provider LLM Abstraction

Unified interface supporting Ollama, OpenAI, Anthropic, Gemini, and xAI, with provider-specific quirks handled transparently.

🧠 Hybrid Memory Search

Combines full-text search (BM25) with vector embeddings and temporal decay scoring for intelligent context retrieval across multiple timescales.
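A hedged sketch of how the three signals could be blended into one ranking score. The weighting scheme, the one-day half-life, and the assumption that the BM25 score has already been normalized to the range 0 to 1 are illustrative choices, not values from the project.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(
    text_score: float,      # BM25 score, assumed pre-normalized to [0, 1]
    vector_score: float,    # e.g. cosine similarity of embeddings
    age_seconds: float,
    text_weight: float = 0.4,
    half_life: float = 86_400.0,
) -> float:
    # Exponential temporal decay: the score halves every half_life seconds.
    decay = 0.5 ** (age_seconds / half_life)
    blended = text_weight * text_score + (1 - text_weight) * vector_score
    return blended * decay
```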

⚡ Graceful Degradation

Every failure point has a fallback (chunk LLM failure → skip chunk; combine overflow → concatenate; vision failure → return screenshot path).
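Two of these fallbacks (skip a failed chunk, concatenate when combining fails) can be sketched together with the chunked pipeline. The function names and the `summarize` and `combine` callables are hypothetical placeholders for LLM calls; character-based chunking is a simplification.

```python
from typing import Callable


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def process_document(
    text: str,
    summarize: Callable[[str], str],
    combine: Callable[[list[str]], str],
    chunk_size: int = 1000,
) -> str:
    partials: list[str] = []
    for chunk in chunk_text(text, chunk_size):
        try:
            partials.append(summarize(chunk))
        except Exception:
            continue  # fallback: chunk failure, skip this chunk
    try:
        return combine(partials)
    except Exception:
        return "\n".join(partials)  # fallback: combine failure, concatenate
```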

LLM Provider Support
Supported LLM Providers

Ollama

Direct HTTP client for local models. Enables offline operation with open-source LLMs.

OpenAI/ChatGPT

Standard API + ChatGPT Responses API with SSE streaming for real-time responses.

Anthropic Claude

Full API support with prompt caching for cost-efficient long-context processing.

Google Gemini

Via LiteLLM with async/sync handling and vision capabilities.

xAI Grok

Via LiteLLM integration for advanced reasoning tasks.

Features

Token rate limiting, provider-specific message conversion, unified tool schema, streaming, and usage tracking.

Configuration & Utilities

Configuration (Pydantic v2)

YAML persistence with local/home precedence, environment variable overrides for secrets, and configuration for 30+ subsystems, including tools, skills, guards, memory, and UI platforms.

Logging (structlog)

Structured logging with dynamic sink routing to TUI system panel and stderr fallback. Enables real-time monitoring and debugging.

Cron System

Human-readable schedule parsing (interval, daily, weekly), job execution with trigger tracking, and history persistence for scheduling recurring tasks.

Instruction Management

Two-tier directory system (system defaults + personal overrides), markdown template rendering with partial substitution, and micro-template variants for context-specific prompts.
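Partial substitution means filling the placeholders that have values and leaving the rest intact for a later pass. The sketch below shows the idea with the standard library's `string.Template`; the `$name` placeholder syntax and the function name are assumptions, since the framework's template format is not shown here.

```python
from string import Template


def render_partial(template_text: str, values: dict[str, str]) -> str:
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(values)
```

Rendering `"Hello $user, today is $date"` with only `user` supplied keeps `$date` in the output, so a second rendering step can fill it in.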

Personality System

Agent and per-user personality profiles in markdown format, injected as prompt blocks to personalize agent behavior.

Visualization Styles

Design preference management with LLM-powered style extraction from images/documents and cache invalidation on updates.

Datastore

SQLite-backed relational database with multi-format import/export (CSV, XLSX, JSON), granular protection system, type inference, and schema evolution.

Session Orchestrator

DAG-based task decomposition and execution, parallel task activation with traffic-light gating, timeout management, and workflow persistence and templating.

Module Organization

Core Modules (20)

  • Agent & Mixins (13)
  • Memory Systems (3)
  • Session Management
  • Configuration
  • Logging
  • LLM Providers

Tools (40+)

  • File Operations (4)
  • Web Integration (8)
  • Document Processing (3)
  • Browser Automation (6)
  • System & Hardware (7)
  • Productivity (8)

Web Infrastructure (25+)

  • Core Server
  • WebSocket Handlers (3)
  • REST API Modules (20+)
  • Static Pages
  • OAuth Management

Platform Bridges (5)

  • Telegram Bridge
  • Discord Bridge
  • Slack Bridge
  • BotPort Client
  • Hotkey Daemon

Advanced Features (15+)

  • Task Graph & DAG
  • Skills System
  • File Tree Builder
  • Next Steps Extraction
  • Reflections & Learning
  • Session Export

CLI & Utilities (10+)

  • Terminal UI
  • Command Dispatch
  • Prompt Execution
  • Agent Pool
  • Runtime Context
  • Main Entry Point
System Metrics Summary
  • 3.2M+ Source Characters
  • 100+ HTTP/WS Routes
  • 5 LLM Providers
  • 3 Memory Layers