feat(phase-3): workstream 6 — SOC 2 Type II Preparation

Implements all 22 WS6 tasks, completing Phase 3 (Enterprise).

Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService
applied to credentials.secret_hash, credentials.vault_path,
webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path.
Backward compatible: an isEncrypted() guard skips decryption for existing
plaintext rows until the next read-write cycle re-encrypts them.
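
A minimal sketch of the guard pattern, assuming a versioned "enc:v1:" prefix
as the encryption marker (the prefix, method names, and ciphertext layout are
illustrative, not the actual implementation):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const PREFIX = "enc:v1:"; // assumed marker distinguishing ciphertext from legacy plaintext

export class EncryptionService {
  // key: 32-byte AES-256 key fetched from Vault at startup
  constructor(private key: Buffer) {}

  encrypt(plaintext: string): string {
    const iv = randomBytes(16);
    const cipher = createCipheriv("aes-256-cbc", this.key, iv);
    const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    return PREFIX + iv.toString("base64") + ":" + ct.toString("base64");
  }

  isEncrypted(value: string): boolean {
    return value.startsWith(PREFIX);
  }

  // Legacy plaintext rows pass through unchanged until the next write
  // re-encrypts them.
  decrypt(value: string): string {
    if (!this.isEncrypted(value)) return value;
    const [iv, ct] = value.slice(PREFIX.length).split(":");
    const decipher = createDecipheriv("aes-256-cbc", this.key, Buffer.from(iv, "base64"));
    return Buffer.concat([
      decipher.update(Buffer.from(ct, "base64")),
      decipher.final(),
    ]).toString("utf8");
  }
}
```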

Audit chain integrity (CC7.2): AuditRepository computes a SHA-256 chain hash
on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)).
AuditVerificationService walks the full chain verifying hash continuity.
AuditChainVerificationJob runs hourly and sets the agentidp_audit_chain_integrity
Prometheus gauge to 1 (pass) or 0 (fail).
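
An illustrative sketch of the chain computation and the hourly verification
walk (the GENESIS sentinel for the first event and the exact concatenation
are assumptions):

```typescript
import { createHash } from "crypto";
import { Gauge } from "prom-client";

interface AuditEvent {
  eventId: string;
  timestamp: string;
  action: string;
  outcome: string;
  agentId: string;
  orgId: string;
  prevHash: string;
  hash: string;
}

// hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)
function chainHash(e: AuditEvent): string {
  return createHash("sha256")
    .update(e.eventId + e.timestamp + e.action + e.outcome + e.agentId + e.orgId + e.prevHash)
    .digest("hex");
}

// Walk oldest-first; return the first broken eventId, or null if intact.
function verifyChain(events: AuditEvent[]): string | null {
  let prev = "GENESIS"; // assumed sentinel for the first event's prevHash
  for (const e of events) {
    if (e.prevHash !== prev || chainHash(e) !== e.hash) return e.eventId;
    prev = e.hash;
  }
  return null;
}

export const auditChainIntegrity = new Gauge({
  name: "agentidp_audit_chain_integrity",
  help: "1 if the audit hash chain verified intact, 0 if a break was found",
});

// The hourly job publishes the result as the Prometheus gauge.
export function runVerification(events: AuditEvent[]): void {
  auditChainIntegrity.set(verifyChain(events) === null ? 1 : 0);
}
```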

TLS enforcement (CC6.7): TLSEnforcementMiddleware is registered first in the
Express middleware stack; in production it 301-redirects any request whose
X-Forwarded-Proto header is not https.
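
A sketch of the redirect check using the standard Express middleware
signature (the host reconstruction is illustrative):

```typescript
import { Request, Response, NextFunction } from "express";

// Registered before all other middleware so no handler sees plain-HTTP traffic.
export function tlsEnforcement(req: Request, res: Response, next: NextFunction): void {
  const proto = req.headers["x-forwarded-proto"];
  if (process.env.NODE_ENV === "production" && proto !== "https") {
    res.redirect(301, `https://${req.headers.host}${req.originalUrl}`);
    return;
  }
  next();
}
```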

SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7
days; increments agentidp_credentials_expiring_soon_total.
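
A sketch of the scan, assuming a repository method findExpiringWithinDays
and prom-client for the counter:

```typescript
import { Counter } from "prom-client";

export const credentialsExpiringSoon = new Counter({
  name: "agentidp_credentials_expiring_soon_total",
  help: "Credentials detected expiring within the 7-day rotation window",
});

// Hourly job: count credentials expiring within 7 days and surface the
// result to Prometheus for the alerts.yml rules to act on.
export async function runSecretsRotationScan(repo: {
  findExpiringWithinDays(days: number): Promise<{ credentialId: string }[]>; // assumed repository API
}): Promise<void> {
  const expiring = await repo.findExpiringWithinDays(7);
  credentialsExpiringSoon.inc(expiring.length);
}
```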

ComplianceController + routes: GET /audit/verify (auth+audit:read scope,
30/min rate-limit); GET /compliance/controls (public, Cache-Control 60s).
ComplianceStatusStore: module-level map updated by jobs, consumed by controller.
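
Route wiring, sketched; requireAuth, requireScope, and the snapshot()
accessor are hypothetical stand-ins for the real middleware and
ComplianceStatusStore:

```typescript
import express, { RequestHandler } from "express";
import rateLimit from "express-rate-limit";

declare const requireAuth: RequestHandler; // hypothetical auth middleware
declare function requireScope(scope: string): RequestHandler; // hypothetical scope check
declare const complianceStatus: { snapshot(): unknown }; // stand-in for ComplianceStatusStore

export const router = express.Router();

router.get(
  "/audit/verify",
  requireAuth,
  requireScope("audit:read"),
  rateLimit({ windowMs: 60_000, limit: 30 }), // 30 requests per minute
  (_req, res) => { res.json(complianceStatus.snapshot()); },
);

router.get("/compliance/controls", (_req, res) => {
  res.set("Cache-Control", "public, max-age=60");
  res.json(complianceStatus.snapshot());
});
```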

Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total,
agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.

Compliance docs: soc2-controls-matrix.md, encryption-runbook.md,
audit-log-runbook.md, incident-response.md, secrets-rotation.md.

Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService,
AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/compliance/incident-response.md

# Incident Response Runbook — SentryAgent.ai AgentIdP

**Owner:** Security Engineering
**Last updated:** 2026-03-31
**Applies to:** Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.

---
## 1. Auth Failure Spike
### Detection
**Prometheus alert:** `AuthFailureSpike`
```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```
Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
### Immediate Actions
1. Acknowledge the alert in PagerDuty / alerting system
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
3. Check Prometheus dashboard for the geographic distribution of the failing requests
### Investigation Steps
1. **Identify source agents:**
```bash
# Query audit log for recent auth failures
curl -s -H "Authorization: Bearer <admin-token>" \
"https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
```
2. **Check for brute-force patterns:**
Look for repeated failures from the same `client_id` or IP address.
3. **Check if an agent's credentials expired:**
```bash
# Look for expired credentials
psql "$DATABASE_URL" -c "
SELECT credential_id, client_id, expires_at
FROM credentials
WHERE status = 'active' AND expires_at < NOW()
ORDER BY expires_at DESC LIMIT 20;"
```
4. **Check for key compromise signals:**
- Multiple agents failing simultaneously → possible key store issue
- Single agent with high failure rate → possible credential stuffing or misconfiguration
### Escalation Path
- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
- **Critical (> 2 req/s sustained):** CISO notified, potential account compromise investigation
- **If credential compromise confirmed:** Revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`
---
## 2. Anomalous Token Issuance
### Detection
**Prometheus alert:** `AnomalousTokenIssuance`
```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```
Triggers when the token issuance rate exceeds 10 per second, sustained for 5 minutes.
### Immediate Actions
1. Acknowledge the alert
2. Determine whether a legitimate large-scale operation is underway (e.g. new customer onboarding, load test)
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify what scopes are being requested
### Investigation Steps
1. **Identify top issuing agents:**
```bash
# Query audit log for recent token issuances
curl -s -H "Authorization: Bearer <admin-token>" \
"https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
```
2. **Check monthly token budget:**
Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.
3. **Check for abnormal scope combinations:**
If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.
4. **Check for valid business reason:**
Contact the organization owner for the top-issuing agents.
### Escalation Path
- **Warning:** Engineering on-call investigates within 4 hours
- **If compromise suspected:** Revoke affected agent tokens via the Redis revocation list and rotate credentials
- **If systematic abuse confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`
---
## 3. Audit Chain Integrity Failure
### Detection
**Prometheus alert:** `AuditChainIntegrityFailed`
```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```
Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain.
This is a **CRITICAL** security event — possible evidence of log tampering.
### Immediate Actions
1. **Do NOT attempt to repair the broken chain** — preserve all evidence
2. Notify CISO and security team immediately
3. Page the on-call security engineer with P0 priority
4. Capture the current state:
```bash
curl -s -H "Authorization: Bearer <audit-token>" \
"https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
```
### Investigation Steps
1. **Determine the broken event:**
The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
2. **Forensic analysis:**
Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
3. **Check database access logs:**
Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.
4. **Check application logs:**
Look for any errors from the immutability trigger (`audit_events_immutable`).
5. **Check Vault audit logs:**
Review whether any encryption key access was abnormal.
### Escalation Path
- **Immediate:** CISO + Legal + Security Engineering
- **Within 1 hour:** Begin forensic preservation per incident response plan
- **Within 24 hours:** Determine scope of compromise and notification obligations
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
---
## 4. Webhook Dead-Letter Accumulation
### Detection
**Prometheus alert:** `WebhookDeadLetterAccumulating`
```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```
Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
### Immediate Actions
1. Acknowledge the alert
2. Check which `organization_id` labels are accumulating dead-letters:
```bash
# Top organizations by dead-letter growth over the last hour
# (adjust the Prometheus endpoint for your environment)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode \
  'query=topk(5, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))'
```
3. Check if the destination endpoints are reachable:
```bash
curl -I https://<webhook-destination-url>/
```
### Investigation Steps
1. **List affected webhook subscriptions:**
```bash
# Query delivery records for dead-letter status
psql "$DATABASE_URL" -c "
SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
FROM webhook_subscriptions s
JOIN webhook_deliveries d ON d.subscription_id = s.id
WHERE d.status = 'dead_letter'
AND d.updated_at > NOW() - INTERVAL '2 hours'
GROUP BY s.id
ORDER BY dead_letters DESC
LIMIT 20;"
```
2. **Check delivery failure reasons:**
```bash
psql "$DATABASE_URL" -c "
SELECT http_status_code, COUNT(*) as count
FROM webhook_deliveries
WHERE status = 'dead_letter'
AND updated_at > NOW() - INTERVAL '2 hours'
GROUP BY http_status_code;"
```
3. **Common causes and resolutions:**
| HTTP Status | Likely Cause | Resolution |
|---|---|---|
| 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
| 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
| 404 | Endpoint URL changed | Customer to update webhook URL |
| 5xx | Recipient server error | Customer to investigate their endpoint |
| Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |
4. **Notify affected customers:**
Contact the organization owner for high-volume dead-letter subscriptions.
### Escalation Path
- **Warning (10-50/hr):** Engineering notifies affected customers, investigates endpoint health
- **Critical (> 50/hr):** Engineering on-call + Platform reliability team engaged
- **If systemic delivery infrastructure failure:** Activate incident bridge, escalate to VP Engineering