# Incident Response Runbook — SentryAgent.ai AgentIdP

**Owner:** Security Engineering
**Last updated:** 2026-03-31
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---

## 1. Auth Failure Spike

### Detection

**Prometheus alert:** `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```

Triggers when the rate of HTTP 401 responses exceeds 0.5 per second, sustained over 2 minutes.

### Immediate Actions

1. Acknowledge the alert in PagerDuty / the alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, a deployment)
3. Check the Prometheus dashboard for the geographic distribution of the failing requests

### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query the audit log for recent auth failures.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:** Look for repeated failures from the same `client_id` or IP address.

3. **Check whether an agent's credentials expired:**

   ```bash
   # Find credentials still marked active but already past their expiry
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC
     LIMIT 20;"
   ```

4. **Check for key-compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - A single agent with a high failure rate → possible credential stuffing or misconfiguration

### Escalation Path

- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified; potential account-compromise investigation
- **If credential compromise is confirmed:** Revoke the affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`

---

## 2. Anomalous Token Issuance

### Detection

**Prometheus alert:** `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```

Triggers when the token issuance rate exceeds 10 per second for 5 continuous minutes.

### Immediate Actions

1. Acknowledge the alert
2. Determine whether a legitimate large-scale operation is underway (e.g. new customer onboarding, a load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested (a query sketch follows at the end of this section)

### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query the audit log for recent token issuances.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check the monthly token budget:** Each agent is limited to 10,000 tokens/month on the free tier. A single agent hitting the limit may indicate automation abuse.

3. **Check for abnormal scope combinations:** Tokens issued with `admin:orgs` or `audit:read` at high volume warrant immediate investigation.

4. **Check for a valid business reason:** Contact the organization owner for the top-issuing agents.

### Escalation Path

- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise is suspected:** Revoke the affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse is confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended` (a sketch of this call appears at the end of this section)
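Immediate Actions step 3 asks for the `scope` label breakdown. A minimal sketch of that check against the Prometheus HTTP API, assuming Prometheus is reachable at `$PROMETHEUS_URL` (a placeholder) and that the metric carries a `scope` label as described above:

```bash
# Top 10 scopes by current token-issuance rate.
# $PROMETHEUS_URL is a placeholder for the Prometheus base URL.
curl -s "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode \
  'query=topk(10, sum by (scope) (rate(agentidp_tokens_issued_total[5m])))' \
  | jq '.data.result[] | {scope: .metric.scope, per_second: .value[1]}'
```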
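For the last escalation step, a hedged sketch of the suspension call. The `PATCH /agents/:id` route and the `status: suspended` value come from the escalation path above; `$ADMIN_TOKEN` and `$AGENT_ID` are placeholders, and the exact JSON request shape is an assumption to confirm against the API reference:

```bash
# Suspend an abusive or compromised agent.
# $AGENT_ID and $ADMIN_TOKEN are placeholders; the JSON body shape is assumed.
curl -s -X PATCH \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}' \
  "https://api.sentryagent.ai/v1/agents/$AGENT_ID"
```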
---

## 3. Audit Chain Integrity Failure

### Detection

**Prometheus alert:** `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```

Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain. This is a **CRITICAL** security event — possible evidence of log tampering.

### Immediate Actions

1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify the CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:

   ```bash
   # Capture the verification result into secure incident storage.
   # $ADMIN_TOKEN is a placeholder for an API token with audit read access.
   curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
     "https://api.sentryagent.ai/v1/audit/verify" \
     | tee /secure/incident-$(date +%Y%m%d-%H%M).json
   ```

### Investigation Steps

1. **Determine the broken event:** The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
2. **Forensic analysis:** Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
3. **Check database access logs:** Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access (a query sketch follows after this list).
4. **Check application logs:** Look for any errors from the immutability trigger (`audit_events_immutable`).
5. **Check Vault audit logs:** Review whether any encryption-key access was abnormal.
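For step 3, a sketch of a live-connection check, assuming the application connects under a dedicated database role (here `agentidp_app`, a placeholder for the actual role name); any other role holding a connection to this database deserves scrutiny during a tampering investigation:

```bash
# List current connections that are NOT the expected application role.
# 'agentidp_app' is a placeholder for the app's actual database role.
psql "$DATABASE_URL" -c "
  SELECT pid, usename, client_addr, backend_start, state
  FROM pg_stat_activity
  WHERE datname = current_database()
    AND usename <> 'agentidp_app'
  ORDER BY backend_start;"
```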
### Escalation Path

- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per the incident response plan
- **Within 24 hours:** Determine the scope of compromise and any notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)

---

## 4. Webhook Dead-Letter Accumulation

### Detection

**Prometheus alert:** `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```

Fires when more than 10 webhook deliveries reach dead-letter status within an hour.

### Immediate Actions

1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead letters:

   ```bash
   # Top organizations by dead-letter growth over the last hour ($PROMETHEUS_URL is a placeholder)
   curl -s "$PROMETHEUS_URL/api/v1/query" \
     --data-urlencode 'query=topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))'
   ```

3. Check whether the destination endpoints are reachable:

   ```bash
   # Replace <webhook-endpoint-url> with the URL from the affected subscription
   curl -I "https://<webhook-endpoint-url>"
   ```

### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Subscriptions with the most dead-lettered deliveries in the last 2 hours
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id
     ORDER BY dead_letters DESC
     LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   # Breakdown of dead-lettered deliveries by HTTP status
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |---|---|---|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret (a signed test-delivery sketch follows at the end of this runbook) |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:** Contact the organization owner for high-volume dead-letter subscriptions.

### Escalation Path

- **Warning (10–50/hr):** Engineering notifies affected customers and investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + Platform Reliability team engaged
- **If a systemic delivery-infrastructure failure is confirmed:** Activate the incident bridge and escalate to VP Engineering
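For the 401/403 row in the causes table above, a hedged way to reproduce a signed delivery by hand. HMAC-SHA256 signing is implied by the table; the signature header name (`X-Webhook-Signature` here) and the payload format are assumptions to check against the webhook documentation:

```bash
# Send a hand-signed test event to a customer endpoint.
# $WEBHOOK_URL and $WEBHOOK_SECRET are placeholders taken from the subscription;
# the X-Webhook-Signature header name is an assumption.
payload='{"event":"test.ping","data":{}}'
signature=$(printf '%s' "$payload" \
  | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" \
  | awk '{print $NF}')
curl -s -X POST \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: sha256=$signature" \
  -d "$payload" \
  "$WEBHOOK_URL"
```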
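To place an active incident on the warning/critical scale above, the hourly total can be read directly from Prometheus (same `$PROMETHEUS_URL` placeholder as earlier in this runbook):

```bash
# Dead letters accumulated over the last hour, summed across all organizations
curl -s "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=sum(increase(agentidp_webhook_dead_letters_total[1h]))' \
  | jq -r '.data.result[0].value[1]'
```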