feat(phase-3): workstream 6 — SOC 2 Type II Preparation

Implements all 22 WS6 tasks, completing Phase 3 (Enterprise).

Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService
applied to credentials.secret_hash, credentials.vault_path,
webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path.
Backward compatible: an isEncrypted() guard skips decryption for existing
plaintext rows until the next read-write cycle re-encrypts them.
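
A minimal sketch of the guard pattern, assuming a versioned "enc:v1:" prefix
as the encryption marker (the prefix, method names, and ciphertext layout are
illustrative, not the actual implementation):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const PREFIX = "enc:v1:"; // assumed marker distinguishing ciphertext from legacy plaintext

export class EncryptionService {
  // key: 32-byte AES-256 key fetched from Vault at startup
  constructor(private key: Buffer) {}

  encrypt(plaintext: string): string {
    const iv = randomBytes(16);
    const cipher = createCipheriv("aes-256-cbc", this.key, iv);
    const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    return PREFIX + iv.toString("base64") + ":" + ct.toString("base64");
  }

  isEncrypted(value: string): boolean {
    return value.startsWith(PREFIX);
  }

  // Legacy plaintext rows pass through unchanged until the next write
  // re-encrypts them.
  decrypt(value: string): string {
    if (!this.isEncrypted(value)) return value;
    const [iv, ct] = value.slice(PREFIX.length).split(":");
    const decipher = createDecipheriv("aes-256-cbc", this.key, Buffer.from(iv, "base64"));
    return Buffer.concat([
      decipher.update(Buffer.from(ct, "base64")),
      decipher.final(),
    ]).toString("utf8");
  }
}
```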

Audit chain integrity (CC7.2): AuditRepository computes a SHA-256 chain hash
on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)).
AuditVerificationService walks the full chain verifying hash continuity.
AuditChainVerificationJob runs hourly and sets the agentidp_audit_chain_integrity
Prometheus gauge to 1 (pass) or 0 (fail).
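
An illustrative sketch of the chain computation and the hourly verification
walk (the GENESIS sentinel for the first event and the exact concatenation
are assumptions):

```typescript
import { createHash } from "crypto";
import { Gauge } from "prom-client";

interface AuditEvent {
  eventId: string;
  timestamp: string;
  action: string;
  outcome: string;
  agentId: string;
  orgId: string;
  prevHash: string;
  hash: string;
}

// hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)
function chainHash(e: AuditEvent): string {
  return createHash("sha256")
    .update(e.eventId + e.timestamp + e.action + e.outcome + e.agentId + e.orgId + e.prevHash)
    .digest("hex");
}

// Walk oldest-first; return the first broken eventId, or null if intact.
function verifyChain(events: AuditEvent[]): string | null {
  let prev = "GENESIS"; // assumed sentinel for the first event's prevHash
  for (const e of events) {
    if (e.prevHash !== prev || chainHash(e) !== e.hash) return e.eventId;
    prev = e.hash;
  }
  return null;
}

export const auditChainIntegrity = new Gauge({
  name: "agentidp_audit_chain_integrity",
  help: "1 if the audit hash chain verified intact, 0 if a break was found",
});

// The hourly job publishes the result as the Prometheus gauge.
export function runVerification(events: AuditEvent[]): void {
  auditChainIntegrity.set(verifyChain(events) === null ? 1 : 0);
}
```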

TLS enforcement (CC6.7): TLSEnforcementMiddleware is registered first in the
Express middleware stack; in production it 301-redirects any request whose
X-Forwarded-Proto header is not https.
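
A sketch of the redirect check using the standard Express middleware
signature (the host reconstruction is illustrative):

```typescript
import { Request, Response, NextFunction } from "express";

// Registered before all other middleware so no handler sees plain-HTTP traffic.
export function tlsEnforcement(req: Request, res: Response, next: NextFunction): void {
  const proto = req.headers["x-forwarded-proto"];
  if (process.env.NODE_ENV === "production" && proto !== "https") {
    res.redirect(301, `https://${req.headers.host}${req.originalUrl}`);
    return;
  }
  next();
}
```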

SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7
days; increments agentidp_credentials_expiring_soon_total.
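
A sketch of the scan, assuming a repository method findExpiringWithinDays
and prom-client for the counter:

```typescript
import { Counter } from "prom-client";

export const credentialsExpiringSoon = new Counter({
  name: "agentidp_credentials_expiring_soon_total",
  help: "Credentials detected expiring within the 7-day rotation window",
});

// Hourly job: count credentials expiring within 7 days and surface the
// result to Prometheus for the alerts.yml rules to act on.
export async function runSecretsRotationScan(repo: {
  findExpiringWithinDays(days: number): Promise<{ credentialId: string }[]>; // assumed repository API
}): Promise<void> {
  const expiring = await repo.findExpiringWithinDays(7);
  credentialsExpiringSoon.inc(expiring.length);
}
```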

ComplianceController + routes: GET /audit/verify (auth+audit:read scope,
30/min rate-limit); GET /compliance/controls (public, Cache-Control 60s).
ComplianceStatusStore: module-level map updated by jobs, consumed by controller.
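
Route wiring, sketched; requireAuth, requireScope, and the snapshot()
accessor are hypothetical stand-ins for the real middleware and
ComplianceStatusStore:

```typescript
import express, { RequestHandler } from "express";
import rateLimit from "express-rate-limit";

declare const requireAuth: RequestHandler; // hypothetical auth middleware
declare function requireScope(scope: string): RequestHandler; // hypothetical scope check
declare const complianceStatus: { snapshot(): unknown }; // stand-in for ComplianceStatusStore

export const router = express.Router();

router.get(
  "/audit/verify",
  requireAuth,
  requireScope("audit:read"),
  rateLimit({ windowMs: 60_000, limit: 30 }), // 30 requests per minute
  (_req, res) => { res.json(complianceStatus.snapshot()); },
);

router.get("/compliance/controls", (_req, res) => {
  res.set("Cache-Control", "public, max-age=60");
  res.json(complianceStatus.snapshot());
});
```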

Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total,
agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.

Compliance docs: soc2-controls-matrix.md, encryption-runbook.md,
audit-log-runbook.md, incident-response.md, secrets-rotation.md.

Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService,
AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/compliance/incident-response.md

# Incident Response Runbook — SentryAgent.ai AgentIdP

**Owner:** Security Engineering
**Last updated:** 2026-03-31
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---
## 1. Auth Failure Spike
### Detection
**Prometheus alert:** `AuthFailureSpike`
```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```
Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
### Immediate Actions
1. Acknowledge the alert in PagerDuty / alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
3. Check Prometheus dashboard for the geographic distribution of the failing requests
### Investigation Steps
1. **Identify source agents:**
```bash
# Query audit log for recent auth failures
curl -s -H "Authorization: Bearer <admin-token>" \
"https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
```
2. **Check for brute-force patterns:**
Look for repeated failures from the same `client_id` or IP address.
3. **Check if an agent's credentials expired:**
```bash
# Look for expired credentials
psql "$DATABASE_URL" -c "
SELECT credential_id, client_id, expires_at
FROM credentials
WHERE status = 'active' AND expires_at < NOW()
ORDER BY expires_at DESC LIMIT 20;"
```
4. **Check for key compromise signals:**
- Multiple agents failing simultaneously → possible key store issue
- Single agent with high failure rate → possible credential stuffing or misconfiguration
### Escalation Path
- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified, potential account compromise investigation
- **If credential compromise confirmed:** Revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`
---
## 2. Anomalous Token Issuance
### Detection
**Prometheus alert:** `AnomalousTokenIssuance`
```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```
Triggers when the token issuance rate exceeds 10 per second, sustained for 5 minutes.
### Immediate Actions
1. Acknowledge the alert
2. Determine whether a legitimate large-scale operation is underway (e.g. new customer onboarding, load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify what scopes are being requested
### Investigation Steps
1. **Identify top issuing agents:**
```bash
# Query audit log for recent token issuances
curl -s -H "Authorization: Bearer <admin-token>" \
"https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
```
2. **Check monthly token budget:**
Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.
3. **Check for abnormal scope combinations:**
If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.
4. **Check for valid business reason:**
Contact the organization owner for the top-issuing agents.
### Escalation Path
- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise suspected:** Revoke affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`
---
## 3. Audit Chain Integrity Failure
### Detection
**Prometheus alert:** `AuditChainIntegrityFailed`
```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```
Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain.
This is a **CRITICAL** security event — possible evidence of log tampering.
### Immediate Actions
1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:
```bash
curl -s -H "Authorization: Bearer <audit-token>" \
"https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
```
### Investigation Steps
1. **Determine the broken event:**
The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
2. **Forensic analysis:**
Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
3. **Check database access logs:**
Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.
4. **Check application logs:**
Look for any errors from the immutability trigger (`audit_events_immutable`).
5. **Check Vault audit logs:**
Review whether any encryption key access was abnormal.
### Escalation Path
- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per incident response plan
- **Within 24 hours:** Determine scope of compromise and notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
---
## 4. Webhook Dead-Letter Accumulation
### Detection
**Prometheus alert:** `WebhookDeadLetterAccumulating`
```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```
Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
### Immediate Actions
1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead-letters:
```bash
# Top organizations by dead-letter growth over the last hour
# (adjust the Prometheus endpoint for your environment)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode \
  'query=topk(5, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))'
```
3. Check if the destination endpoints are reachable:
```bash
curl -I https://<webhook-destination-url>/
```
### Investigation Steps
1. **List affected webhook subscriptions:**
```bash
# Query delivery records for dead-letter status
psql "$DATABASE_URL" -c "
SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
FROM webhook_subscriptions s
JOIN webhook_deliveries d ON d.subscription_id = s.id
WHERE d.status = 'dead_letter'
AND d.updated_at > NOW() - INTERVAL '2 hours'
GROUP BY s.id
ORDER BY dead_letters DESC
LIMIT 20;"
```
2. **Check delivery failure reasons:**
```bash
psql "$DATABASE_URL" -c "
SELECT http_status_code, COUNT(*) as count
FROM webhook_deliveries
WHERE status = 'dead_letter'
AND updated_at > NOW() - INTERVAL '2 hours'
GROUP BY http_status_code;"
```
3. **Common causes and resolutions:**
| HTTP Status | Likely Cause | Resolution |
|---|---|---|
| 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
| 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
| 404 | Endpoint URL changed | Customer to update webhook URL |
| 5xx | Recipient server error | Customer to investigate their endpoint |
| Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |
4. **Notify affected customers:**
Contact the organization owner for high-volume dead-letter subscriptions.
### Escalation Path
- **Warning (10-50/hr):** Engineering notifies affected customers, investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + Platform reliability team engaged
- **If systemic delivery infrastructure failure:** Activate incident bridge, escalate to VP Engineering