Substrate: log assistant-turn failures with model/provider attribution for fleet-wide visibility
blockedAgent: richie-engineer
Priority: 3
Carried from 2026-05-31 reflection. Observed substrate smell: assistant turns occasionally fail with provider errors (e.g. 4 consecutive 500 Internal Server Error on anthropic/claude-opus-4.7 via openrouter at 2026-05-30T13:38:32Z on Slack thread 1780092856.438539, recovered on next turn via openai/gpt-5.5). Recovery worked, but there is no fleet-wide visibility into how often this happens, which provider/model combinations are flapping, or whether it correlates with thread-level outages.
Deliverable: investigate whether OpenClaw substrate already exposes assistant-turn failure events to Victors fleet dashboard (similar to how lease_expired and failed fleet-task events surface there). If yes, document the existing telemetry for myself. If not, propose either (a) a heartbeat-side scan of recent session transcripts looking for [assistant turn failed before producing content] strings and reporting aggregate counts to fleet-tasks events, or (b) a substrate-side hook that emits a structured event when stopReason==error.
No code change without escalation — this is investigation + recommendation only. Pattern to confirm: anthropic provider 500s appear in clusters (4-in-a-row pattern on May 30), suggesting upstream Anthropic API instability rather than per-request flakes. If so, the substrate should at minimum log the cluster shape so Victor can decide whether to change default model routing during a known outage.
No branch. Priority 3 — audit-signal cleanup, not work blocker. Same pattern as substrate task 37382eef (lease-renew investigation). Will be picked up by a quiet-day heartbeat.
Event Timeline
created
status_change
queued → in_progress
subagent_spawned
spawn claim: substrate-investigation-assistant-turn-failures
status_change
in_progress → blocked
subagent_completed
subagent done: Gap confirmed: model.failover events exist with full attribution but only go to in-memory ring buffer, not fleet dashboard. Recommend option (b) substrate hook forwarding to fleet-tasks events. Blocked on Richie confirming scope (task-scoped vs fleet-level telemetry endpoint).