feat(phase-3): workstream 6 — SOC 2 Type II Preparation

Implements all 22 WS6 tasks, completing Phase 3 Enterprise.

- Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService, applied to credentials.secret_hash, credentials.vault_path, webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path. Backward-compatible: an isEncrypted() guard skips decryption for existing plaintext rows until the next read-write cycle.
- Audit chain integrity (CC7.2): AuditRepository computes a SHA-256 chain hash on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)). AuditVerificationService walks the full chain verifying hash continuity. AuditChainVerificationJob runs hourly and sets the agentidp_audit_chain_integrity Prometheus gauge to 1 (pass) or 0 (fail).
- TLS enforcement (CC6.7): TLSEnforcementMiddleware registered as the first middleware in the Express stack; 301 redirect on non-https X-Forwarded-Proto in production.
- SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7 days; increments agentidp_credentials_expiring_soon_total.
- ComplianceController + routes: GET /audit/verify (auth + audit:read scope, 30/min rate limit); GET /compliance/controls (public, Cache-Control 60s). ComplianceStatusStore: module-level map updated by jobs, consumed by the controller.
- Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total, agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.
- Compliance docs: soc2-controls-matrix.md, encryption-runbook.md, audit-log-runbook.md, incident-response.md, secrets-rotation.md.
- Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService, AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/compliance/incident-response.md
# Incident Response Runbook — SentryAgent.ai AgentIdP

**Owner:** Security Engineering
**Last updated:** 2026-03-31
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---
## 1. Auth Failure Spike

### Detection

**Prometheus alert:** `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```

Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.

### Immediate Actions

1. Acknowledge the alert in PagerDuty / alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
3. Check the Prometheus dashboard for the geographic distribution of the failing requests

### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query audit log for recent auth failures
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:**
   Look for repeated failures from the same `client_id` or IP address.

3. **Check whether an agent's credentials expired:**

   ```bash
   # Look for credentials still marked active but past their expiry
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC LIMIT 20;"
   ```

4. **Check for key-compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - Single agent with a high failure rate → possible credential stuffing or misconfiguration

### Escalation Path

- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified, potential account-compromise investigation
- **If credential compromise is confirmed:** Revoke the affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`

---
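The brute-force check in the investigation steps above can be scripted against a saved audit export. A minimal coreutils-only sketch follows; the JSON shape (an `events` array with a `client_id` field) is an assumption for illustration, not the documented API response:

```shell
# Hypothetical /v1/audit export; the "events"/"client_id" shape is assumed
sample='{"events":[{"client_id":"agent-a"},{"client_id":"agent-b"},{"client_id":"agent-a"},{"client_id":"agent-a"}]}'
# Coreutils-only breakdown: count auth failures per client_id, worst offender first
top=$(printf '%s' "$sample" | grep -o '"client_id":"[^"]*"' | sort | uniq -c | sort -rn | head -1)
echo "$top"
```

A single `client_id` dominating the counts points at credential stuffing against one agent; an even spread across many agents points at a systemic cause such as a key store issue.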
## 2. Anomalous Token Issuance

### Detection

**Prometheus alert:** `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```

Triggers when the token issuance rate exceeds 10 per second for 5 continuous minutes.

### Immediate Actions

1. Acknowledge the alert
2. Determine whether a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested

### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query audit log for recent token issuances
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check the monthly token budget:**
   Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.

3. **Check for abnormal scope combinations:**
   If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.

4. **Check for a valid business reason:**
   Contact the organization owner for the top-issuing agents.

### Escalation Path

- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise is suspected:** Revoke the affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse is confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`

---
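To put the alert threshold in context against the monthly budget above: at 10 tokens/s, a single agent would burn through its entire 10,000-token free-tier allowance in well under an hour, so sustained issuance at this rate is almost never legitimate free-tier traffic. A quick back-of-the-envelope check:

```shell
# At the alert threshold (10 tokens/s), how long until the 10,000/month budget is gone?
rate_per_s=10
monthly_budget=10000
secs=$(( monthly_budget / rate_per_s ))
echo "budget exhausted in ${secs}s (~$(( secs / 60 )) minutes)"
# → budget exhausted in 1000s (~16 minutes)
```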
## 3. Audit Chain Integrity Failure

### Detection

**Prometheus alert:** `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```

Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain.
This is a **CRITICAL** security event — possible evidence of log tampering.

### Immediate Actions

1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify the CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:

   ```bash
   curl -s -H "Authorization: Bearer <audit-token>" \
     "https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
   ```

### Investigation Steps

1. **Determine the broken event:**
   The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.

2. **Forensic analysis:**
   Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".

3. **Check database access logs:**
   Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.

4. **Check application logs:**
   Look for any errors from the immutability trigger (`audit_events_immutable`).

5. **Check Vault audit logs:**
   Review whether any encryption key access was abnormal.

### Escalation Path

- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per the incident response plan
- **Within 24 hours:** Determine the scope of compromise and notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)

---
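The continuity check performed by `AuditVerificationService` can be illustrated with a toy hash chain. This is a sketch of the mechanism only, not the service's actual code; the field order is taken from the chain definition (eventId+timestamp+action+outcome+agentId+orgId+prevHash), and the event data below is made up:

```shell
# Build a tiny two-event chain: each hash covers the event fields plus the previous hash
prev="GENESIS"
chain_file=$(mktemp)
for ev in "evt-1|2026-03-31T00:00:00Z|auth.failed|deny|agent-a|org-1" \
          "evt-2|2026-03-31T00:00:05Z|token.issued|allow|agent-b|org-1"; do
  hash=$(printf '%s' "$(echo "$ev" | tr -d '|')$prev" | sha256sum | cut -d' ' -f1)
  echo "$ev|$hash" >> "$chain_file"
  prev="$hash"
done

# Verify: recompute each hash from the stored fields and the previous hash
prev="GENESIS"
status=PASS
while IFS='|' read -r id ts action outcome agent org stored; do
  computed=$(printf '%s' "$id$ts$action$outcome$agent$org$prev" | sha256sum | cut -d' ' -f1)
  [ "$computed" = "$stored" ] || status="FAIL at $id"
  prev="$stored"
done < "$chain_file"
echo "$status"
```

Because each hash includes `prevHash`, editing or deleting any earlier row changes every hash after it, which is why a single break fails the whole chain and why the gauge flips to 0.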
## 4. Webhook Dead-Letter Accumulation

### Detection

**Prometheus alert:** `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```

Fires when more than 10 webhook deliveries reach dead-letter status within an hour.

### Immediate Actions

1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead letters:

   ```bash
   # Prometheus query: top organizations by dead-letter count over the last hour
   # topk(5, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))
   ```

3. Check whether the destination endpoints are reachable:

   ```bash
   curl -I https://<webhook-destination-url>/
   ```

### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Query delivery records for dead-letter status
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id
     ORDER BY dead_letters DESC
     LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |---|---|---|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:**
   Contact the organization owner for high-volume dead-letter subscriptions.

### Escalation Path

- **Warning (10-50/hr):** Engineering notifies affected customers and investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + Platform reliability team engaged
- **If there is a systemic delivery-infrastructure failure:** Activate the incident bridge and escalate to VP Engineering
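For the 401/403 row in the table above, a customer can recompute the signature locally to confirm their stored HMAC secret. A minimal sketch with `openssl`; the secret, payload, and exact signing scheme here are assumptions for illustration, and the authoritative scheme is whatever the webhook delivery documentation specifies:

```shell
# Hypothetical: recompute an HMAC-SHA256 webhook signature the way a recipient would.
# The secret value and payload below are made up for the example.
secret='whsec_example'
payload='{"event":"agent.updated"}'
sig=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
echo "computed signature: $sig"
```

Compare the computed value against the signature header on the received request; a mismatch means the secret the recipient stores and the one AgentIdP signs with have diverged, and the subscription secret should be rotated on both sides.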