# Incident Response Runbook — SentryAgent.ai AgentIdP
**Owner:** Security Engineering  
**Last updated:** 2026-03-31  
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---
## 1. Auth Failure Spike

### Detection

**Prometheus alert:** `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```

Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.

### Immediate Actions

1. Acknowledge the alert in PagerDuty / alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
3. Check the Prometheus dashboard for the geographic distribution of the failing requests

### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query audit log for recent auth failures
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:**

   Look for repeated failures from the same `client_id` or IP address.

3. **Check whether an agent's credentials expired:**

   ```bash
   # Look for expired credentials
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC LIMIT 20;"
   ```

4. **Check for key-compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - Single agent with a high failure rate → possible credential stuffing or misconfiguration
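The brute-force check in step 2 can be sketched against the audit-log response. This is a minimal sketch, assuming each returned event carries `client_id` and `source_ip` fields (the field names and threshold are illustrative, not the service's actual schema):

```python
from collections import Counter

def brute_force_suspects(events, threshold=10):
    """Return (client_id, failure_count) pairs at or above the threshold,
    sorted by failure count, highest first."""
    counts = Counter(e["client_id"] for e in events)
    return [(cid, n) for cid, n in counts.most_common() if n >= threshold]

# Hypothetical sample shaped like /v1/audit?action=auth.failed results
events = (
    [{"client_id": "agent-42", "source_ip": "203.0.113.7"}] * 12
    + [{"client_id": "agent-7", "source_ip": "198.51.100.2"}] * 3
)
print(brute_force_suspects(events))  # → [('agent-42', 12)]
```

The same grouping by `source_ip` instead of `client_id` surfaces distributed credential-stuffing from a single origin.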

### Escalation Path

- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified, potential account-compromise investigation
- **If credential compromise is confirmed:** Revoke the affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`
---
## 2. Anomalous Token Issuance

### Detection

**Prometheus alert:** `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```

Triggers when the token issuance rate exceeds 10 per second for 5 continuous minutes.

### Immediate Actions

1. Acknowledge the alert
2. Determine whether a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested

### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query audit log for recent token issuances
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check the monthly token budget:**

   Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.

3. **Check for abnormal scope combinations:**

   If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.

4. **Check for a valid business reason:**

   Contact the organization owner for the top-issuing agents.
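The budget check in step 2 can be automated once per-agent issuance counts are in hand. A minimal sketch, assuming the counts have already been pulled from the audit log or Prometheus (the helper name and warning ratio are illustrative):

```python
MONTHLY_TOKEN_BUDGET = 10_000  # free-tier limit per agent, per month

def near_budget_agents(issued_this_month, warn_ratio=0.8):
    """Return agent IDs at or above warn_ratio of the monthly budget,
    highest consumers first."""
    cutoff = MONTHLY_TOKEN_BUDGET * warn_ratio
    return sorted(
        (agent for agent, n in issued_this_month.items() if n >= cutoff),
        key=lambda a: -issued_this_month[a],
    )

counts = {"agent-1": 9_500, "agent-2": 1_200, "agent-3": 8_100}
print(near_budget_agents(counts))  # → ['agent-1', 'agent-3']
```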

### Escalation Path

- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise is suspected:** Revoke the affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse is confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`
---
## 3. Audit Chain Integrity Failure

### Detection

**Prometheus alert:** `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```

Fires immediately when `AuditChainVerificationJob` detects a break in the audit-event hash chain.
This is a **CRITICAL** security event — possible evidence of log tampering.
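For context, the verification walks the chain recomputing each event's hash from its fields plus the previous event's hash and comparing it against the stored value. The sketch below shows the idea only; the exact field set, ordering, and encoding are defined by `AuditVerificationService`, and the names here are illustrative:

```python
import hashlib

def event_hash(event, prev_hash):
    """SHA-256 over the event's chained fields (illustrative field set)."""
    material = "".join([
        event["event_id"], event["timestamp"], event["action"],
        event["outcome"], event["agent_id"], event["org_id"], prev_hash,
    ])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def verify_chain(events, genesis_hash=""):
    """Return the first broken event_id (cf. brokenAtEventId), or None if intact."""
    prev = genesis_hash
    for event in events:
        if event["hash"] != event_hash(event, prev):
            return event["event_id"]
        prev = event["hash"]
    return None
```

Because each stored hash folds in `prev_hash`, tampering with any row changes its recomputed hash and the walk stops at the first inconsistent event.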

### Immediate Actions

1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify the CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:

   ```bash
   curl -s -H "Authorization: Bearer <audit-token>" \
     "https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
   ```

### Investigation Steps

1. **Determine the broken event:**

   The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.

2. **Forensic analysis:**

   Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".

3. **Check database access logs:**

   Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.

4. **Check application logs:**

   Look for any errors from the immutability trigger (`audit_events_immutable`).

5. **Check Vault audit logs:**

   Review whether any encryption-key access was abnormal.

### Escalation Path

- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per the incident response plan
- **Within 24 hours:** Determine the scope of compromise and notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
---
## 4. Webhook Dead-Letter Accumulation

### Detection

**Prometheus alert:** `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```

Fires when more than 10 webhook deliveries reach dead-letter status within an hour.

### Immediate Actions

1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead-letters:

   ```bash
   # Prometheus query: top organizations by dead-letter rate
   # agentidp_webhook_dead_letters_total (by organization_id)
   ```

3. Check whether the destination endpoints are reachable:

   ```bash
   curl -I https://<webhook-destination-url>/
   ```

### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Query delivery records for dead-letter status
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id, s.organization_id, s.url
     ORDER BY dead_letters DESC
     LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |---|---|---|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:**

   Contact the organization owner for high-volume dead-letter subscriptions.
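For the 401 / 403 row above, it often helps to show the customer what correct verification looks like on their side. A minimal sketch, assuming an HMAC-SHA256 hex signature computed over the raw request body (the actual header name, secret format, and scheme are whatever the webhook documentation specifies, not defined here):

```python
import hashlib
import hmac

def verify_webhook_signature(raw_body: bytes, secret: str, received_sig: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)
```

Common customer-side mistakes: verifying against a re-serialized JSON body instead of the raw bytes received, or comparing signatures with `==` rather than a constant-time compare.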

### Escalation Path

- **Warning (10–50/hr):** Engineering notifies affected customers, investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + platform reliability team engaged
- **If systemic delivery-infrastructure failure:** Activate the incident bridge, escalate to VP Engineering