Multi-Agent Systems

IncidentGuard

A multi-agent incident remediation system that simulates outcomes before execution to reject unsafe plans, achieving 46% faster recovery with a closed-loop plan-execute-verify pipeline backed by Prometheus metrics.

Multi-Agent, Counterfactual Simulation, Prometheus, Graceful Degradation

The Problem

When production incidents hit, on-call engineers face immense pressure to act fast. But rushing a fix can make things worse: a hasty change that sets off a retry storm can amplify a partial outage into a full cascade. IncidentGuard takes a simulate-before-execute approach, running every candidate remediation action through a discrete-event simulation to predict its outcome before it touches production.
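To make the simulate-before-execute idea concrete, here is a minimal sketch of a discrete-event simulation of a retry storm. It uses only the standard library (a `heapq` event loop) as a stand-in for a simulation framework, and every name, parameter, and threshold in it (`simulate`, `is_safe`, the 1.5x amplification limit, the plan names) is a hypothetical illustration rather than IncidentGuard's actual model.

```python
import heapq
from collections import deque

DONE, TIMEOUT, ARRIVE = 0, 1, 2  # tie-break order for simultaneous events


def simulate(n_requests, interarrival, service_time, timeout, max_retries, backoff=0.0):
    """Single FIFO server; clients resend a request when an attempt times out."""
    events, seq = [], 0  # heap of (time, kind, seq, request_id, attempt, sent_at)

    def schedule(time, kind, request_id, attempt, sent_at):
        nonlocal seq
        heapq.heappush(events, (time, kind, seq, request_id, attempt, sent_at))
        seq += 1

    for i in range(n_requests):
        schedule(i * interarrival, ARRIVE, i, 0, i * interarrival)

    queue, busy, succeeded, attempts = deque(), False, set(), 0

    def start_next(now):
        nonlocal busy
        busy = bool(queue)
        if queue:
            request_id, attempt, sent_at = queue.popleft()
            schedule(now + service_time, DONE, request_id, attempt, sent_at)

    while events:
        now, kind, _, request_id, attempt, sent_at = heapq.heappop(events)
        if kind == ARRIVE:
            attempts += 1
            queue.append((request_id, attempt, now))
            schedule(now + timeout, TIMEOUT, request_id, attempt, now)
            if not busy:
                start_next(now)
        elif kind == DONE:
            # Work finished after the client gave up is wasted capacity.
            if now <= sent_at + timeout:
                succeeded.add(request_id)
            start_next(now)
        elif request_id not in succeeded and attempt < max_retries:
            schedule(now + backoff, ARRIVE, request_id, attempt + 1, now)

    return {
        "attempts": attempts,
        "succeeded": len(succeeded),
        "error_rate": 1 - len(succeeded) / n_requests,
        "amplification": attempts / n_requests,
    }


def is_safe(prediction, max_amplification=1.5):
    """Hypothetical gate: reject plans predicted to multiply load on the service."""
    return prediction["amplification"] <= max_amplification


# An overloaded service: requests arrive twice as fast as they can be served.
overload = dict(n_requests=40, interarrival=0.5, service_time=1.0, timeout=3.2)
plans = {"raise_retry_limit": {"max_retries": 3}, "disable_retries": {"max_retries": 0}}
verdicts = {name: is_safe(simulate(**overload, **knobs)) for name, knobs in plans.items()}
```

In this toy model the plan that raises the retry limit is rejected because the simulation predicts it would multiply traffic against a server that is already saturated, while the plan that disables retries passes the gate. A real gate would presumably weigh more signals (error rate, latency, blast radius) than this single criterion.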

The system handles 4 failure scenarios: retry death spirals, payment gateway flapping, seat hold clogging, and regional saturation. For each, it generates a ranked action list using the Stratus counterfactual reasoning API, with predicted effects on latency, error rates, retry storms, and blast radius.

After execution, the pipeline closes the loop by querying Prometheus metrics to verify that the predicted recovery actually materialized, comparing drift scores between expected and observed outcomes.
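The source does not define how a drift score is computed, so the following is only one plausible sketch: the mean relative error between predicted and observed values over the metrics both sides report. The function name, the tolerance, and the sample numbers are all assumptions for illustration, not measurements from the system.

```python
def drift_score(predicted, observed):
    """Mean relative error over the metrics present in both dicts (assumed definition)."""
    shared = predicted.keys() & observed.keys()
    if not shared:
        raise ValueError("no metrics in common between prediction and observation")
    errors = []
    for name in shared:
        scale = max(abs(predicted[name]), 1e-9)  # avoid dividing by zero
        errors.append(abs(observed[name] - predicted[name]) / scale)
    return sum(errors) / len(errors)


# Illustrative values only.
predicted = {"p95_latency_ms": 200.0, "error_rate": 0.02}
observed = {"p95_latency_ms": 250.0, "error_rate": 0.02}

score = drift_score(predicted, observed)
underperformed = score > 0.10  # hypothetical tolerance
```

Here latency came in 25% above the prediction and the error rate matched, so the score is 0.125 and the remediation would be flagged as underperforming under the assumed 10% tolerance.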

Technical Highlights

  • Simulate Before Execute

    Discrete-event simulation via SimPy predicts the impact of each candidate action. Unsafe plans are rejected before they reach production, preventing cascade failures.

  • Closed-Loop Pipeline

    Plan → Execute → Verify. After action execution, Prometheus metrics are queried to compare predicted vs. actual recovery, producing a structured incident report.

  • 3-Path Graceful Degradation

    Primary: browser-controlled execution via OpenClaw. Fallback 1: saved-plan local state mutation. Fallback 2: deterministic keyword-based ranking if the Stratus API is unavailable.

  • 46% Faster Recovery

    Evaluated across 4 telehealth scheduling failure scenarios against a naive baseline. Stratus-ranked actions with simulation pre-screening consistently outperformed manual triage heuristics.
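The last degradation path, deterministic keyword-based ranking when the ranking API is unavailable, could look roughly like the sketch below. The `remote_ranker` callable stands in for the Stratus call, whose real interface is not shown in the source; the scoring rule (count of words shared between the incident text and the action name) and the action names are assumptions as well.

```python
def keyword_rank(incident_text, actions):
    """Deterministic fallback: order actions by word overlap with the incident text."""
    words = set(incident_text.lower().split())

    def score(action):
        return len(words & set(action.lower().replace("_", " ").split()))

    # Ties break alphabetically so the same input always yields the same ranking.
    return sorted(actions, key=lambda action: (-score(action), action))


def rank_actions(incident_text, actions, remote_ranker=None):
    """Try the remote ranker first; fall back to keyword ranking on any failure."""
    if remote_ranker is not None:
        try:
            return remote_ranker(incident_text, actions), "remote"
        except Exception:
            # A sketch-level catch-all; real code would narrow this to network errors.
            pass
    return keyword_rank(incident_text, actions), "keyword"


def unavailable_ranker(incident_text, actions):
    """Hypothetical stand-in for a ranking API that cannot be reached."""
    raise ConnectionError("ranking API unreachable")


ranked, source = rank_actions(
    "retry storm on checkout",
    ["restart_gateway", "cap_retry_budget", "drain_checkout_queue"],
    remote_ranker=unavailable_ranker,
)
```

Returning the ranking source alongside the result lets a later report record which path actually produced the plan.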

Under The Hood

Orchestration

The main pipeline runs through 5 phases: await-demo, plan, demo, fallback, and verify. An incident classifier recognizes the failure pattern, shortlists actions from a library of 15 healthcare scheduling remediation steps, and ranks them via the Stratus counterfactual reasoning API.
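The five phase names come from the pipeline itself, but the transitions between them are not spelled out here. The sketch below assumes a simple ordering in which the fallback phase runs only when the demo phase fails; treat both the `next_phase` function and that branching rule as guesses.

```python
from enum import Enum


class Phase(Enum):
    AWAIT_DEMO = "await-demo"
    PLAN = "plan"
    DEMO = "demo"
    FALLBACK = "fallback"
    VERIFY = "verify"


def next_phase(current, demo_succeeded=True):
    """Assumed transition rule; returns None once the pipeline is finished."""
    if current is Phase.AWAIT_DEMO:
        return Phase.PLAN
    if current is Phase.PLAN:
        return Phase.DEMO
    if current is Phase.DEMO:
        return Phase.VERIFY if demo_succeeded else Phase.FALLBACK
    if current is Phase.FALLBACK:
        return Phase.VERIFY
    return None


def trace(demo_succeeded):
    """List the phases visited from start to finish."""
    phase, visited = Phase.AWAIT_DEMO, []
    while phase is not None:
        visited.append(phase.value)
        phase = next_phase(phase, demo_succeeded)
    return visited
```

Under these assumptions, `trace(True)` skips the fallback phase and `trace(False)` passes through it before verification.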

Metrics Verification

A Prometheus client queries live metrics (checkout latency p95, error rate, retry rate) both before and after action execution. Drift scores quantify the gap between predicted and actual recovery, surfacing when a remediation underperformed expectations.
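As a rough sketch of the before/after reads, the snippet below builds an instant query against the Prometheus HTTP API (`/api/v1/query`) and extracts a single value from the standard response shape. The metric names inside the PromQL expressions and the helper names are assumptions; the project's actual metric names are not given here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical PromQL expressions; the metric names are placeholders.
QUERIES = {
    "checkout_latency_p95": (
        "histogram_quantile(0.95, "
        "sum(rate(checkout_latency_seconds_bucket[5m])) by (le))"
    ),
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
}


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_scalar(payload):
    """Return the first sample's value from an instant-query response, or None."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    _timestamp, value = result[0]["value"]  # Prometheus encodes values as strings
    return float(value)


def fetch_metric(base_url, promql, timeout=5.0):
    """Run one instant query over HTTP and return its value."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as response:
        return parse_scalar(json.load(response))
```

Calling `fetch_metric` once before the action and once after it gives the two snapshots that a drift comparison needs.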

Execution Surfaces

The system exposes four UI surfaces: an operations board for telehealth scheduling, a Guardrail Console with incident, decision, execution, and verdict views, an OpenClaw browser automation panel, and a feature flags page for manual fallback controls.