Cookbook

Respond to an incident

Walk a drift alert from the moment it fires through to a signed post-mortem.

This recipe takes a real signal — a drift alert from the model registry — and walks it through the full incident lifecycle. It's a template you can adapt for complaints, regulator queries, or any other inbound signal.

Scenario

Your nightly drift job pulls GET /v1/models/drift and finds a new alert-severity signal: gpt-4-2024-11-20's rejection rate has jumped from 6% (baseline, last 90 days) to 17% (current, last 30 days), a delta of +11pp. Sample size is 142.

Step 1 — Open an incident

Incidents are categorical (MODEL_FAILURE, BAD_ADVICE, DATA_ISSUE, SLA_BREACH, SECURITY, OTHER) and severity-graded (LOW, MEDIUM, HIGH, CRITICAL). Logging an incident anchors a snapshot on the immutable ledger as an INCIDENT_LOGGED event.

bash

curl -X POST https://api.bedrockgovernance.com/v1/incidents \
  -H "X-Bedrock-Key: bk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "title": "gpt-4-2024-11-20 rejection rate +11pp",
    "description": "Drift alert: rejection rate jumped from 6% (90d baseline) to 17% (30d) on 142 samples.",
    "severity": "HIGH",
    "category": "MODEL_FAILURE",
    "affectedReviewJobIds": []
  }'

Step 2 — Triage

Identify the affected model: (provider, version)
Pull the timeline: GET /v1/models/openai/gpt-4-2024-11-20/timeline
Identify the affected jobs: every job in the current window with that model
Identify the customers: every clientReference on those jobs

Step 3 — Investigate

Look for a cause:

Did the provider push a new minor version?
Did your prompt change?
Did the input distribution change (new product, new customer segment)?
Are the rejections concentrated in one product or one adviser?

Step 4 — Remediate

If the model is the cause:

Pin advisers to the previous version in your back-office.
File an impact assessment for the new version.
Run a back-test on the previous month's rejected cases against the old version.
Reinstate the new version only after the impact assessment is signed off.

Track each step on the incident as a remediation action — they appear in the audit trail alongside the original incident snapshot:

bash

curl -X POST https://api.bedrockgovernance.com/v1/incidents/$INCIDENT_ID/remediations \
  -H "X-Bedrock-Key: bk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Pin advisers to gpt-4-2024-09-15 in back-office",
    "ownerName": "Jane Smith",
    "dueAt": "2026-04-12T17:00:00.000Z"
  }'

Step 5 — Resolve

Updating the incident's status to RESOLVED (or CLOSED) anchors a second snapshot on the ledger as an INCIDENT_RESOLVED event. The root cause is recorded verbatim and the full incident history (creation, every status change, every remediation) becomes immutable. The post-mortem is signed and stored as a certificate, addressable forever via verify.bedrockgovernance.com/c/....

bash

curl -X POST https://api.bedrockgovernance.com/v1/incidents/$INCIDENT_ID \
  -H "X-Bedrock-Key: bk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "status": "RESOLVED",
    "rootCause": "Provider pushed an undocumented minor with stricter risk-scoring. Pinned advisers to the previous build pending impact-assessment sign-off."
  }'

Respond to an incident

Scenario

Step 1 — Open an incident

Step 2 — Triage

Step 3 — Investigate

Step 4 — Remediate

Step 5 — Resolve

See also