Substrate: add structured/capped failure-payload logging on producer publish failures in @texturehq/event-bus
blockedAgent: richie-engineer
Priority: 3
Carried from 2026-06-02 reflection. Tonight's event-bus graceful-shutdown verification (Slack DM to Richie at ~00:40 UTC) quantified pre-fix vs post-fix "Cannot send without awaiting connect()" errors across 9 services: ~54,400 events pre-fix, dropping to 0 post-fix across all known-good image SHAs.
The limit hit during verification: failed publishBatch logs include topic + messageCount, but NOT payload keys or manufacturerDeviceIds. As a result, the device-domain-relevant publish-failure upper bound (~9,963 device.detected* payloads in the 14d pre-fix window) could only be ESTIMATED, not reconciled row-for-row against the device-domain DB. If a future incident needs to identify exactly which devices lost a snapshot/state-transition during a deploy-loss window, current logs don't support it.
Proposal: in @texturehq/event-bus producer error path, when a publishBatch fails, emit a structured log with:
- topic
- messageCount
- payloadKeysSample (first N keys, capped to bound volume)
- errorCode / errorMessage head (leading portion only, truncated)
- timestamp
The sample cap (e.g., N=10) bounds log volume even on bulk-publish failures. With this in place, the next deploy-loss reconciliation can JOIN failed-publish logs to the device-domain DB or Kafka offsets and produce exact lost-row counts instead of upper-bound estimates, at least for failed batches that fit within the cap; for larger batches only the sampled keys are recoverable from the log.
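A minimal sketch of the proposed log record, to make the field list concrete. Everything here is an assumption for illustration, not the actual @texturehq/event-bus API: the `OutboundMessage` shape, the `buildPublishFailureLog` helper, the `publish_batch_failed` event name, and the 200-character cap on the error message head are all hypothetical. It also assumes the message key is where an identifier such as manufacturerDeviceId would be carried.

```typescript
// Hypothetical message shape; the real event-bus types may differ.
interface OutboundMessage {
  key?: string | null; // assumed to carry e.g. a manufacturerDeviceId
  value: unknown;
}

// Hypothetical structured record mirroring the fields in the proposal.
interface PublishFailureLog {
  event: "publish_batch_failed";
  topic: string;
  messageCount: number;
  payloadKeysSample: string[];
  errorCode: string | null;
  errorMessageHead: string;
  timestamp: string;
}

const KEY_SAMPLE_CAP = 10; // N from the proposal
const ERROR_MESSAGE_HEAD_CHARS = 200; // assumed cap, not from the ticket

function buildPublishFailureLog(
  topic: string,
  messages: readonly OutboundMessage[],
  error: unknown,
  now: Date = new Date(),
): PublishFailureLog {
  // Collect at most KEY_SAMPLE_CAP non-empty keys, in batch order.
  const payloadKeysSample: string[] = [];
  for (const message of messages) {
    if (payloadKeysSample.length >= KEY_SAMPLE_CAP) break;
    if (typeof message.key === "string" && message.key.length > 0) {
      payloadKeysSample.push(message.key);
    }
  }

  // Errors from client libraries often carry a `code` property; treat it as optional.
  const rawCode = (error as { code?: unknown } | null | undefined)?.code;
  const errorCode =
    typeof rawCode === "string" || typeof rawCode === "number"
      ? String(rawCode)
      : null;

  const fullMessage = error instanceof Error ? error.message : String(error);

  return {
    event: "publish_batch_failed",
    topic,
    messageCount: messages.length,
    payloadKeysSample,
    errorCode,
    errorMessageHead: fullMessage.slice(0, ERROR_MESSAGE_HEAD_CHARS),
    timestamp: now.toISOString(),
  };
}
```

The producer's error path would pass the returned object to whatever structured logger the service already uses. Comparing messageCount with payloadKeysSample.length shows whether the sample covers the whole failed batch.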
No branch yet. Priority 3: observability gap, not a work blocker. Same pattern as substrate task b6fdbdd2 (assistant-turn-failure attribution). Will be picked up by a quiet-day heartbeat.
Evidence reference: memory/2026-06-02.md verification section.
Event Timeline
created
subagent_spawned
spawn claim: event-bus structured failure-payload logging
status_change
queued → in_progress
subagent_completed
subagent done: released without spawn — no subagent runtime available and task is P3 observability cleanup, deferring to next heartbeat with capacity
status_change
in_progress → queued
status_change
queued → in_progress
status_change
in_progress → blocked