IncidentGuard - Yincheng Zhou

The Problem

When production incidents hit, on-call engineers face immense pressure to act fast. But rushing a fix can make things worse — a hasty retry storm amplifies a partial outage into a full cascade. IncidentGuard introduces a simulate-before-execute paradigm: every candidate remediation action is run through a discrete-event simulation to predict its outcome before it touches production.

The system handles 4 failure scenarios: retry death spirals, payment gateway flapping, seat hold clogging, and regional saturation. For each, it generates a ranked action list using the Stratus counterfactual reasoning API, with predicted effects on latency, error rates, retry storms, and blast radius.

After execution, the pipeline closes the loop by querying Prometheus metrics to verify that the predicted recovery actually materialized, comparing drift scores between expected and observed outcomes.

Technical Highlights

Simulate Before Execute

Discrete-event simulation via SimPy predicts the impact of each candidate action. Unsafe plans are rejected before they reach production, preventing cascade failures.
Closed-Loop Pipeline

Plan → Execute → Verify. After action execution, Prometheus metrics are queried to compare predicted vs. actual recovery, producing a structured incident report.
3-Path Graceful Degradation

Primary: browser-controlled execution via OpenClaw. Fallback 1: saved-plan local state mutation. Fallback 2: deterministic keyword-based ranking if the Stratus API is unavailable.
46% Faster Recovery

Evaluated across 4 telehealth scheduling failure scenarios against a naive baseline. Stratus-ranked actions with simulation pre-screening consistently outperformed manual triage heuristics.

Under The Hood

Orchestration

The main pipeline runs through 5 phases: await-demo, plan, demo, fallback, and verify. An incident classifier recognizes failure patterns, generates shortlisted actions from a library of 15 healthcare scheduling remediation steps, and ranks them via the Stratus counterfactual reasoning API.

Metrics Verification

A Prometheus client queries live metrics (checkout latency p95, error rate, retry rate) both before and after action execution. Drift scores quantify the gap between predicted and actual recovery, surfacing when a remediation underperformed expectations.

Execution Surfaces

Four UI surfaces: an operations board for telehealth scheduling, a Guardrail Console with incident/decision/execution/verdict views, an OpenClaw browser automation panel, and a feature flags page for manual fallback controls.

The Problem

Technical Highlights

Simulate Before Execute

Closed-Loop Pipeline

3-Path Graceful Degradation

46% Faster Recovery

Under The Hood

Orchestration

Metrics Verification

Execution Surfaces