# Incident Response Runbook — SentryAgent.ai AgentIdP

- **Owner:** Security Engineering
- **Last updated:** 2026-03-31
- **Applies to:** Production AgentIdP deployments
This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.
## 1. Auth Failure Spike

### Detection

Prometheus alert: `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```
Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
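To see the live rate the alert evaluates, you can run the same expression against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at `prometheus:9090` (a placeholder; adjust to your deployment):

```bash
# Current 401 rate, using the same expression as the alert.
# The Prometheus address is a placeholder for your environment.
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(agentidp_http_requests_total{status_code="401"}[5m])'
```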
### Immediate Actions
- Acknowledge the alert in PagerDuty / alerting system
- Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
- Check Prometheus dashboard for the geographic distribution of the failing requests
### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query audit log for recent auth failures
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:** Look for repeated failures from the same `client_id` or IP address (a tally sketch follows this list).

3. **Check if an agent's credentials expired:**

   ```bash
   # Look for expired credentials
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC LIMIT 20;"
   ```

4. **Check for key compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - Single agent with high failure rate → possible credential stuffing or misconfiguration
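The tally sketch referenced in step 2 counts recent `auth.failed` events per client. The response shape is an assumption (an `events` array whose items carry a `clientId` field); adjust the `jq` path to the audit API's actual schema.

```bash
# Hedged sketch: count recent auth failures per client_id.
# The JSON shape (.events[].clientId) is an assumption; verify against
# the audit API's actual response schema.
curl -s -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100" \
  | jq -r '.events[].clientId' | sort | uniq -c | sort -rn | head
```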
### Escalation Path
- Warning (< 2 req/s): Engineering on-call investigates within 1 hour
- Critical (> 2 req/s sustained): CISO notified, potential account compromise investigation
- If credential compromise confirmed: revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke` (example call below)
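A sketch of the revocation call, assuming the admin API accepts the same bearer auth as the audit queries in this runbook (the IDs are placeholders filled in from the investigation):

```bash
# Revoke a compromised credential; <agent-id> and <credential-id> are placeholders.
curl -s -X POST -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/agents/<agent-id>/credentials/<credential-id>/revoke"
```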
## 2. Anomalous Token Issuance

### Detection

Prometheus alert: `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```
Triggers when token issuance rate exceeds 10 per second for 5 continuous minutes.
### Immediate Actions
- Acknowledge the alert
- Determine if a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
- Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested (see the query sketch after this list)
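One way to pull that per-scope breakdown, sketched against the Prometheus HTTP API (the `prometheus:9090` address is a placeholder):

```bash
# Issuance rate per scope over the last 5 minutes.
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum by (scope) (rate(agentidp_tokens_issued_total[5m]))' \
  | jq '.data.result[] | {scope: .metric.scope, rate: .value[1]}'
```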
### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query audit log for recent token issuances
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check monthly token budget:** Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse (a hedged budget query follows this list).

3. **Check for abnormal scope combinations:** If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.

4. **Check for a valid business reason:** Contact the organization owner for the top-issuing agents.
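The budget query referenced in step 2, as a sketch: it counts `token.issued` audit events per agent for the current month. The `audit_events` table and column names are assumptions; verify the actual schema before running.

```bash
# Hedged sketch: tokens issued per agent this month, from the audit trail.
# Table and column names (audit_events, action, agent_id, timestamp) are assumptions.
psql "$DATABASE_URL" -c "
  SELECT agent_id, COUNT(*) AS tokens_this_month
  FROM audit_events
  WHERE action = 'token.issued'
    AND timestamp >= date_trunc('month', NOW())
  GROUP BY agent_id
  ORDER BY tokens_this_month DESC LIMIT 20;"
```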
### Escalation Path
- Warning: Engineering on-call investigates within 4 hours
- If compromise suspected: Revoke affected agent tokens via Redis revocation list, rotate credentials
- If systematic abuse confirmed: suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended` (example call below)
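A sketch of the suspension call; the JSON body shape is inferred from the `status: suspended` convention above and should be checked against the API reference:

```bash
# Suspend an abusive agent; <agent-id> is a placeholder.
curl -s -X PATCH \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}' \
  "https://api.sentryagent.ai/v1/agents/<agent-id>"
```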
## 3. Audit Chain Integrity Failure

### Detection

Prometheus alert: `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```
Fires immediately when AuditChainVerificationJob detects a break in the audit event hash chain.
This is a CRITICAL security event — possible evidence of log tampering.
### Immediate Actions
- Do NOT attempt to repair the broken chain — preserve all evidence
- Notify CISO and security team immediately
- Page the on-call security engineer with P0 priority
- Capture the current state:

  ```bash
  curl -s -H "Authorization: Bearer <audit-token>" \
    "https://api.sentryagent.ai/v1/audit/verify" \
    | tee /secure/incident-$(date +%Y%m%d-%H%M).json
  ```
### Investigation Steps

1. **Determine the broken event:** The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event (a manual recompute sketch follows this list).

2. **Forensic analysis:** Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".

3. **Check database access logs:** Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.

4. **Check application logs:** Look for any errors from the immutability trigger (`audit_events_immutable`).

5. **Check Vault audit logs:** Review whether any encryption key access was abnormal.
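The recompute sketch referenced in step 1 re-derives the chain hash for a single suspect row so it can be compared to the stored value. The `audit_events` column names are assumptions, and the concatenation must match AuditRepository's exact serialization (field order, timestamp rendering, NULL handling); confirm against the source before drawing conclusions.

```bash
# Hedged sketch: recompute the chain hash for one event.
# Column names are assumptions; field order follows the chain formula
# (eventId + timestamp + action + outcome + agentId + orgId + prevHash).
# The timestamp must render exactly as the application serializes it.
psql "$DATABASE_URL" -t -A -c "
  SELECT event_id || timestamp || action || outcome
         || agent_id || org_id || COALESCE(prev_hash, '')
  FROM audit_events
  WHERE event_id = '<suspect-event-id>';" \
  | tr -d '\n' | sha256sum
```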
### Escalation Path
- Immediate: CISO + Legal + Security Engineering
- Within 1 hour: Begin forensic preservation per incident response plan
- Within 24 hours: Determine scope of compromise and notification obligations
- Customer notification: Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
## 4. Webhook Dead-Letter Accumulation

### Detection

Prometheus alert: `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```
Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
### Immediate Actions

1. Acknowledge the alert

2. Check which `organization_id` labels are accumulating dead-letters:

   ```promql
   # Prometheus query: top organizations by dead-letter rate
   topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))
   ```

3. Check if the destination endpoints are reachable:

   ```bash
   curl -I https://<webhook-destination-url>/
   ```
### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Query delivery records for dead-letter status
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id
     ORDER BY dead_letters DESC LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |-------------|--------------|------------|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:** Contact the organization owner for high-volume dead-letter subscriptions (an endpoint probe sketch follows this list).
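The probe sketch referenced in step 4 HEAD-requests each distinct dead-lettered destination, reusing the join from step 1; it assumes `webhook_subscriptions.url` holds the delivery endpoint, as in the queries above.

```bash
# Hedged sketch: HEAD-probe each dead-lettered destination (10 s timeout).
psql "$DATABASE_URL" -t -A -c "
  SELECT DISTINCT s.url
  FROM webhook_subscriptions s
  JOIN webhook_deliveries d ON d.subscription_id = s.id
  WHERE d.status = 'dead_letter'
    AND d.updated_at > NOW() - INTERVAL '2 hours'
  LIMIT 10;" \
| while read -r url; do
    printf '%s -> ' "$url"
    curl -sS -o /dev/null -I -m 10 -w '%{http_code}\n' "$url" || echo "unreachable"
  done
```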
### Escalation Path
- Warning (10-50/hr): Engineering notifies affected customers, investigates endpoint health
- Critical (> 50/hr): Engineering on-call + Platform reliability team engaged
- If systemic delivery infrastructure failure: Activate incident bridge, escalate to VP Engineering