# Incident Response Runbook — SentryAgent.ai AgentIdP

- **Owner:** Security Engineering
- **Last updated:** 2026-03-31
- **Applies to:** Production AgentIdP deployments
This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.
## 1. Auth Failure Spike

### Detection

Prometheus alert: `AuthFailureSpike`

```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```
Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
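To see the live rate the alert evaluates, you can run the same expression against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at `prometheus:9090` (a placeholder; adjust to your deployment):

```bash
# Current 401 rate, using the same expression as the alert.
# The Prometheus address is a placeholder for your environment.
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(agentidp_http_requests_total{status_code="401"}[5m])'
```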
### Immediate Actions
- Acknowledge the alert in PagerDuty / alerting system
- Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
- Check Prometheus dashboard for the geographic distribution of the failing requests
### Investigation Steps

1. **Identify source agents:**

   ```bash
   # Query audit log for recent auth failures
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
   ```

2. **Check for brute-force patterns:** Look for repeated failures from the same `client_id` or IP address (a tally sketch follows this list).

3. **Check if an agent's credentials expired:**

   ```bash
   # Look for expired credentials
   psql "$DATABASE_URL" -c "
     SELECT credential_id, client_id, expires_at
     FROM credentials
     WHERE status = 'active' AND expires_at < NOW()
     ORDER BY expires_at DESC LIMIT 20;"
   ```

4. **Check for key compromise signals:**
   - Multiple agents failing simultaneously → possible key store issue
   - Single agent with high failure rate → possible credential stuffing or misconfiguration
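The tally sketch referenced in step 2 counts recent `auth.failed` events per client. The response shape is an assumption (an `events` array whose items carry a `clientId` field); adjust the `jq` path to the audit API's actual schema.

```bash
# Hedged sketch: count recent auth failures per client_id.
# The JSON shape (.events[].clientId) is an assumption; verify against
# the audit API's actual response schema.
curl -s -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100" \
  | jq -r '.events[].clientId' | sort | uniq -c | sort -rn | head
```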
### Escalation Path
- Warning (< 2 req/s): Engineering on-call investigates within 1 hour
- Critical (> 2 req/s sustained): CISO notified, potential account compromise investigation
- If credential compromise confirmed: revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke` (example call below)
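A sketch of the revocation call, assuming the admin API accepts the same bearer auth as the audit queries in this runbook (the IDs are placeholders filled in from the investigation):

```bash
# Revoke a compromised credential; <agent-id> and <credential-id> are placeholders.
curl -s -X POST -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/agents/<agent-id>/credentials/<credential-id>/revoke"
```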
## 2. Anomalous Token Issuance

### Detection

Prometheus alert: `AnomalousTokenIssuance`

```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```
Triggers when token issuance rate exceeds 10 per second for 5 continuous minutes.
### Immediate Actions
- Acknowledge the alert
- Determine if a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
- Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify which scopes are being requested (see the query sketch after this list)
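One way to pull that per-scope breakdown, sketched against the Prometheus HTTP API (the `prometheus:9090` address is a placeholder):

```bash
# Issuance rate per scope over the last 5 minutes.
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum by (scope) (rate(agentidp_tokens_issued_total[5m]))' \
  | jq '.data.result[] | {scope: .metric.scope, rate: .value[1]}'
```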
### Investigation Steps

1. **Identify top issuing agents:**

   ```bash
   # Query audit log for recent token issuances
   curl -s -H "Authorization: Bearer <admin-token>" \
     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
   ```

2. **Check monthly token budget:** Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse (a hedged budget query follows this list).

3. **Check for abnormal scope combinations:** If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.

4. **Check for a valid business reason:** Contact the organization owner for the top-issuing agents.
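The budget query referenced in step 2, as a sketch: it counts `token.issued` audit events per agent for the current month. The `audit_events` table and column names are assumptions; verify the actual schema before running.

```bash
# Hedged sketch: tokens issued per agent this month, from the audit trail.
# Table and column names (audit_events, action, agent_id, timestamp) are assumptions.
psql "$DATABASE_URL" -c "
  SELECT agent_id, COUNT(*) AS tokens_this_month
  FROM audit_events
  WHERE action = 'token.issued'
    AND timestamp >= date_trunc('month', NOW())
  GROUP BY agent_id
  ORDER BY tokens_this_month DESC LIMIT 20;"
```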
### Escalation Path
- Warning: Engineering on-call investigates within 4 hours
- If compromise suspected: Revoke affected agent tokens via Redis revocation list, rotate credentials
- If systematic abuse confirmed: suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended` (example call below)
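A sketch of the suspension call; the JSON body shape is inferred from the `status: suspended` convention above and should be checked against the API reference:

```bash
# Suspend an abusive agent; <agent-id> is a placeholder.
curl -s -X PATCH \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}' \
  "https://api.sentryagent.ai/v1/agents/<agent-id>"
```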
## 3. Audit Chain Integrity Failure

### Detection

Prometheus alert: `AuditChainIntegrityFailed`

```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```
Fires immediately when AuditChainVerificationJob detects a break in the audit event hash chain.
This is a CRITICAL security event — possible evidence of log tampering.
### Immediate Actions
- Do NOT attempt to repair the broken chain — preserve all evidence
- Notify CISO and security team immediately
- Page the on-call security engineer with P0 priority
- Capture the current state:

  ```bash
  curl -s -H "Authorization: Bearer <audit-token>" \
    "https://api.sentryagent.ai/v1/audit/verify" \
    | tee /secure/incident-$(date +%Y%m%d-%H%M).json
  ```
### Investigation Steps

1. **Determine the broken event:** The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event (a manual recompute sketch follows this list).

2. **Forensic analysis:** Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".

3. **Check database access logs:** Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.

4. **Check application logs:** Look for any errors from the immutability trigger (`audit_events_immutable`).

5. **Check Vault audit logs:** Review whether any encryption key access was abnormal.
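The recompute sketch referenced in step 1 re-derives the chain hash for a single suspect row so it can be compared to the stored value. The `audit_events` column names are assumptions, and the concatenation must match AuditRepository's exact serialization (field order, timestamp rendering, NULL handling); confirm against the source before drawing conclusions.

```bash
# Hedged sketch: recompute the chain hash for one event.
# Column names are assumptions; field order follows the chain formula
# (eventId + timestamp + action + outcome + agentId + orgId + prevHash).
# The timestamp must render exactly as the application serializes it.
psql "$DATABASE_URL" -t -A -c "
  SELECT event_id || timestamp || action || outcome
         || agent_id || org_id || COALESCE(prev_hash, '')
  FROM audit_events
  WHERE event_id = '<suspect-event-id>';" \
  | tr -d '\n' | sha256sum
```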
### Escalation Path
- Immediate: CISO + Legal + Security Engineering
- Within 1 hour: Begin forensic preservation per incident response plan
- Within 24 hours: Determine scope of compromise and notification obligations
- Customer notification: Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
## 4. Webhook Dead-Letter Accumulation

### Detection

Prometheus alert: `WebhookDeadLetterAccumulating`

```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```
Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
### Immediate Actions

1. Acknowledge the alert

2. Check which `organization_id` labels are accumulating dead-letters:

   ```promql
   # Prometheus query: top organizations by dead-letter rate
   topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))
   ```

3. Check if the destination endpoints are reachable:

   ```bash
   curl -I https://<webhook-destination-url>/
   ```
### Investigation Steps

1. **List affected webhook subscriptions:**

   ```bash
   # Query delivery records for dead-letter status
   psql "$DATABASE_URL" -c "
     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
     FROM webhook_subscriptions s
     JOIN webhook_deliveries d ON d.subscription_id = s.id
     WHERE d.status = 'dead_letter'
       AND d.updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY s.id
     ORDER BY dead_letters DESC LIMIT 20;"
   ```

2. **Check delivery failure reasons:**

   ```bash
   psql "$DATABASE_URL" -c "
     SELECT http_status_code, COUNT(*) AS count
     FROM webhook_deliveries
     WHERE status = 'dead_letter'
       AND updated_at > NOW() - INTERVAL '2 hours'
     GROUP BY http_status_code;"
   ```

3. **Common causes and resolutions:**

   | HTTP Status | Likely Cause | Resolution |
   |-------------|--------------|------------|
   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
   | 404 | Endpoint URL changed | Customer to update webhook URL |
   | 5xx | Recipient server error | Customer to investigate their endpoint |
   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |

4. **Notify affected customers:** Contact the organization owner for high-volume dead-letter subscriptions (an endpoint probe sketch follows this list).
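The probe sketch referenced in step 4 HEAD-requests each distinct dead-lettered destination, reusing the join from step 1; it assumes `webhook_subscriptions.url` holds the delivery endpoint, as in the queries above.

```bash
# Hedged sketch: HEAD-probe each dead-lettered destination (10 s timeout).
psql "$DATABASE_URL" -t -A -c "
  SELECT DISTINCT s.url
  FROM webhook_subscriptions s
  JOIN webhook_deliveries d ON d.subscription_id = s.id
  WHERE d.status = 'dead_letter'
    AND d.updated_at > NOW() - INTERVAL '2 hours'
  LIMIT 10;" \
| while read -r url; do
    printf '%s -> ' "$url"
    curl -sS -o /dev/null -I -m 10 -w '%{http_code}\n' "$url" || echo "unreachable"
  done
```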
### Escalation Path
- Warning (10-50/hr): Engineering notifies affected customers, investigates endpoint health
- Critical (> 50/hr): Engineering on-call + Platform reliability team engaged
- If systemic delivery infrastructure failure: Activate incident bridge, escalate to VP Engineering