# Incident Response Runbook — SentryAgent.ai AgentIdP

**Owner:** Security Engineering
**Last updated:** 2026-03-31
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---

## 1. Auth Failure Spike

### Detection

**Prometheus alert:** `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```

Triggers when the rate of HTTP 401 responses exceeds 0.5 per second, sustained over 2 minutes.

### Immediate Actions

1. Acknowledge the alert in PagerDuty / the alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, a deployment)
3. Check the Prometheus dashboard for the geographic distribution of the failing requests

### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query the audit log for recent auth failures.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:** Look for repeated failures from the same `client_id` or IP address.

3. **Check whether an agent's credentials expired:**

   ```bash
   # Find credentials still marked active but already past their expiry
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC
     LIMIT 20;"
   ```

4. **Check for key-compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - A single agent with a high failure rate → possible credential stuffing or misconfiguration

### Escalation Path

- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified; potential account-compromise investigation
- **If credential compromise is confirmed:** Revoke the affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`

---

## 2. Anomalous Token Issuance

### Detection

**Prometheus alert:** `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```

Triggers when the token issuance rate exceeds 10 per second for 5 continuous minutes.

### Immediate Actions

1. Acknowledge the alert
2. Determine whether a legitimate large-scale operation is underway (e.g. new customer onboarding, a load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested (a query sketch follows at the end of this section)

### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query the audit log for recent token issuances.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check the monthly token budget:** Each agent is limited to 10,000 tokens/month on the free tier. A single agent hitting the limit may indicate automation abuse.

3. **Check for abnormal scope combinations:** Tokens issued with `admin:orgs` or `audit:read` at high volume warrant immediate investigation.

4. **Check for a valid business reason:** Contact the organization owner for the top-issuing agents.

### Escalation Path

- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise is suspected:** Revoke the affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse is confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended` (a sketch of this call appears at the end of this section)
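Immediate Actions step 3 asks for the `scope` label breakdown. A minimal sketch of that check against the Prometheus HTTP API, assuming Prometheus is reachable at `$PROMETHEUS_URL` (a placeholder) and that the metric carries a `scope` label as described above:

```bash
# Top 10 scopes by current token-issuance rate.
# $PROMETHEUS_URL is a placeholder for the Prometheus base URL.
curl -s "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode \
  'query=topk(10, sum by (scope) (rate(agentidp_tokens_issued_total[5m])))' \
  | jq '.data.result[] | {scope: .metric.scope, per_second: .value[1]}'
```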
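For the last escalation step, a hedged sketch of the suspension call. The `PATCH /agents/:id` route and the `status: suspended` value come from the escalation path above; `$ADMIN_TOKEN` and `$AGENT_ID` are placeholders, and the exact JSON request shape is an assumption to confirm against the API reference:

```bash
# Suspend an abusive or compromised agent.
# $AGENT_ID and $ADMIN_TOKEN are placeholders; the JSON body shape is assumed.
curl -s -X PATCH \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}' \
  "https://api.sentryagent.ai/v1/agents/$AGENT_ID"
```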
---

## 3. Audit Chain Integrity Failure

### Detection

**Prometheus alert:** `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```

Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain. This is a **CRITICAL** security event — possible evidence of log tampering.

### Immediate Actions

1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify the CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:

   ```bash
   # Capture the verification result into secure incident storage.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit/verify" \
     | tee /secure/incident-$(date +%Y%m%d-%H%M).json
   ```

### Investigation Steps

1. **Determine the broken event:** The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
2. **Forensic analysis:** Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
3. **Check database access logs:** Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access (a query sketch follows after this list).
4. **Check application logs:** Look for any errors from the immutability trigger (`audit_events_immutable`).
5. **Check Vault audit logs:** Review whether any encryption-key access was abnormal.
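For step 3, a sketch of a live-connection check, assuming the application connects under a dedicated database role (here `agentidp_app`, a placeholder for the actual role name); any other role holding a connection to this database deserves scrutiny during a tampering investigation:

```bash
# List current connections that are NOT the expected application role.
# 'agentidp_app' is a placeholder for the app's actual database role.
psql "$DATABASE_URL" -c "
  SELECT pid, usename, client_addr, backend_start, state
  FROM pg_stat_activity
  WHERE datname = current_database()
    AND usename <> 'agentidp_app'
  ORDER BY backend_start;"
```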
### Escalation Path

- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per the incident response plan
- **Within 24 hours:** Determine the scope of compromise and any notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)

---

## 4. Webhook Dead-Letter Accumulation

### Detection

**Prometheus alert:** `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```

Fires when more than 10 webhook deliveries reach dead-letter status within an hour.

### Immediate Actions

1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead letters:

   ```bash
   # Top organizations by dead-letter growth over the last hour ($PROMETHEUS_URL is a placeholder)
   curl -s "$PROMETHEUS_URL/api/v1/query" \
     --data-urlencode 'query=topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))'
   ```

3. Check whether the destination endpoints are reachable:

   ```bash
   # Replace <webhook-endpoint-url> with the URL from the affected subscription
   curl -I "https://<webhook-endpoint-url>"
   ```

### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Subscriptions with the most dead-lettered deliveries in the last 2 hours
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id
     ORDER BY dead_letters DESC
     LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   # Breakdown of dead-lettered deliveries by HTTP status
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |---|---|---|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret (a signed test-delivery sketch follows at the end of this runbook) |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:** Contact the organization owner for high-volume dead-letter subscriptions.

### Escalation Path

- **Warning (10–50/hr):** Engineering notifies affected customers and investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + Platform Reliability team engaged
- **If a systemic delivery-infrastructure failure is confirmed:** Activate the incident bridge and escalate to VP Engineering
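For the 401/403 row in the causes table above, a hedged way to reproduce a signed delivery by hand. HMAC-SHA256 signing is implied by the table; the signature header name (`X-Webhook-Signature` here) and the payload format are assumptions to check against the webhook documentation:

```bash
# Send a hand-signed test event to a customer endpoint.
# $WEBHOOK_URL and $WEBHOOK_SECRET are placeholders taken from the subscription;
# the X-Webhook-Signature header name is an assumption.
payload='{"event":"test.ping","data":{}}'
signature=$(printf '%s' "$payload" \
  | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" \
  | awk '{print $NF}')
curl -s -X POST \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: sha256=$signature" \
  -d "$payload" \
  "$WEBHOOK_URL"
```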
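To place an active incident on the warning/critical scale above, the hourly total can be read directly from Prometheus (same `$PROMETHEUS_URL` placeholder as earlier in this runbook):

```bash
# Dead letters accumulated over the last hour, summed across all organizations
curl -s "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=sum(increase(agentidp_webhook_dead_letters_total[1h]))' \
  | jq -r '.data.result[0].value[1]'
```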