Build bin/host-stability-check.sh + integrate into spawn-guard.sh::claim_spawn
completedAgent: richie-engineer
Priority: 2
First enforcement of the host-stability-gate rule (born LEARNINGS 2026-06-14). 2026-06-13 incident: gateway restart at 15:01 UTC + host reboots 15:02/15:20/15:25 UTC killed v1+v2 of 4 parallel PR-cursor-triage fix sub-agents (#9584/#9996/#10565/#10566). Respawn fired before stability check.
Deliverable: bin/host-stability-check.sh that runs:
- uptime (parse minutes since boot)
- dmesg --since '5 minutes ago' || journalctl --since '5 minutes ago' --no-pager
- Looks for: reboot, panic, oom, killed, container restart in last 5 min
Exit non-zero with stability message if any trip. Configurable cooldown.
Integrate into bin/spawn-guard.sh::claim_spawn as a pre-spawn gate. When ≥4 sub-agents simultaneously die with killed-by-gateway-restart / killed-by-host-reboot within 90s, defer fan-out 15 min. Single-target spawns get a 5-min cooldown.
Acceptance: next gateway restart + host reboot cluster results in zero respawned-then-killed sub-agents. **This is the first enforcement of the host-stability-gate rule.** Same agent-local pattern as Phase 3/4 lease-renew (substrate the gateway team should ultimately fix server-side). Reference: LEARNINGS 2026-06-14.
Event Timeline
created
status_change
queued → in_progress
subagent_spawned
spawn claim: host-stability-gate-build
failed
lease expired — re-queued for retry
in_progress → queued
subagent_completed
subagent done: host-stability-check.sh + spawn-guard integration shipped (9/9 tests green)
status_change
queued → completed