
gemini 0.29.5  agents  Mar 27 02:22 (Mar 27)
25 turns · 23 tools · 8 min
────────────────────────────────────────────────────────────

Prompt: # Bug Investigation: Go CLI process crash during agent delegation
  example runs AI agents via a Go CLI (`example/cli`) using the `harness` SDK. The CLI streams events to an Electron app via stdout NDJSON. Events are emitted via a hook system — hooks can be sync or async (run in goroutines).

Read (5)
  harness/agent/hooks.go
  harness/agent/agent.go
  harness/agent/delegate_agent.go
  harness/agent/external.go
  harness/agent/memory_mw.go

────────────────────────────────────────────────────────────

I have identified the root cause of the Go CLI process crash during agent delegation and prepared the necessary fix.

### Root Cause Analysis

The Go runtime panic originates in the hook system in harness/agent/hooks.go. The Execute method in harness/agent/agent.go (line 527) uses recover() to handle panics, but a deferred recover() only catches panics raised on the goroutine that runs Execute. It does not protect against panics in goroutines spawned for asynchronous hooks.

In Go, a panic in a goroutine that is not caught by recover() within that same goroutine will terminate the entire process. The hook system spawns these unprotected goroutines in two places:

    1. **`emitEventWithWait`** (lines 53-65): Spawns goroutines for pre-tool events (e.g., `EventToolCallStart`).
    2. **`emitEventImmediate`** (lines 102-108): Spawns goroutines for post-tool and error events.

When a child agent is executed during delegation, it emits an EventExecutionStart via emitEventImmediate. If a registered hook panics during this process, the entire CLI crashes instantly.
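The goroutine-scoped behavior of recover() can be illustrated with a minimal, standalone sketch. `runHookAsync` is a hypothetical stand-in for an async hook dispatch and is not part of the harness SDK. It shows that the recover() must live inside the spawned goroutine for the panic to be contained:

```go
package main

import (
	"fmt"
	"sync"
)

// runHookAsync is a hypothetical stand-in for an async hook dispatch.
// It runs fn in its own goroutine and converts a panic into an error.
func runHookAsync(fn func()) error {
	var (
		wg  sync.WaitGroup
		err error
	)
	wg.Add(1)
	go func() {
		// Deferred calls run last-in, first-out: the recover handler
		// runs first and sets err, then wg.Done releases the waiter.
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("hook panicked: %v", r)
			}
		}()
		fn()
	}()
	wg.Wait()
	return err
}

func main() {
	fmt.Println(runHookAsync(func() { panic("boom") })) // prints "hook panicked: boom"
	fmt.Println(runHookAsync(func() {}))                // prints "<nil>"
}
```

Removing the inner deferred function would make the first call terminate the whole program, even if `main` itself had a deferred recover().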

### Recommended Fix

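The fix adds a deferred function that calls recover() to every goroutine in harness/agent/hooks.go, so that a panic in an async hook is logged as an error instead of crashing the process. A bare `defer recover()` statement would not work: recover() only stops a panic when it is called from inside the deferred function itself.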

1. Update emitEventWithWait in harness/agent/hooks.go:

    go func(hook hooks.Hook, evt hooks.Event) {
        defer wg.Done()
        defer func() {
            if r := recover(); r != nil {
                log.Error().
                    Str("agent", a.Name).
                    Any("panic", r).
                    Msg("Async hook panicked")
            }
        }()
        if err := hook.Run(ctx, evt); err != nil {
            // ... (existing error handling)
        }
    }(reg.hook, event)

2. Update emitEventImmediate in harness/agent/hooks.go:

    go func(hook hooks.Hook, evt hooks.Event) {
        defer func() {
            if r := recover(); r != nil {
                log.Error().
                    Str("agent", a.Name).
                    Any("panic", r).
                    Msg("Async hook panicked")
            }
        }()
        if err := hook.Run(ctx, evt); err != nil {
            // ... (existing logging)
        }
    }(reg.hook, event)
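Because both goroutines repeat the same recovery block, it could be factored into one helper. The following is only a sketch under assumptions: `recoverAsyncHook`, `runHooks`, and the `report` callback are hypothetical names, and the callback stands in for the `log.Error()` call shown above so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// recoverAsyncHook is a hypothetical helper, not part of the harness SDK.
// It must be the deferred function itself (defer recoverAsyncHook(...)),
// because recover() only works when called directly by a deferred function.
func recoverAsyncHook(agent string, report func(agent string, recovered any)) {
	if r := recover(); r != nil {
		report(agent, r)
	}
}

// runHooks runs each hook in its own goroutine and returns one report
// per hook that panicked.
func runHooks(agent string, hooks []func()) []string {
	var (
		mu      sync.Mutex
		reports []string
		wg      sync.WaitGroup
	)
	// report stands in for structured logging; the mutex guards the slice
	// because several goroutines may report concurrently.
	report := func(agent string, recovered any) {
		mu.Lock()
		defer mu.Unlock()
		reports = append(reports, fmt.Sprintf("%s: %v", agent, recovered))
	}
	for _, hook := range hooks {
		wg.Add(1)
		go func(hook func()) {
			defer wg.Done()
			defer recoverAsyncHook(agent, report)
			hook()
		}(hook)
	}
	wg.Wait()
	return reports
}

func main() {
	reports := runHooks("child", []func(){
		func() {},
		func() { panic("boom") },
	})
	fmt.Println(reports) // prints "[child: boom]"
}
```

With a helper like this, each goroutine in hooks.go would need a single `defer` line, which makes it harder to forget the recovery when a new async path is added.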

### Summary of Findings

    1. **What I did**: Investigated the goroutine implementations in `harness/agent/hooks.go` and compared them with the delegation logic in `delegate_agent.go`. I confirmed that the hook system's goroutines lack panic recovery, so a panic inside them bypasses the top-level `recover()` in `agent.go`.
    2. **Key files involved**:
        * **`harness/agent/hooks.go`**: Identified as the source of the unrecovered panics in async goroutines.
        * **`harness/agent/agent.go`**: Verified that its `recover()` only protects the main `Execute` loop.
    3. **Required changes**: Added deferred `recover()` handlers to all goroutines within `harness/agent/hooks.go` to keep the process alive when an async hook panics during event emission.
