sentryagent-idp/docs/compliance/incident-response.md
Commit fd90b2acd1 (SentryAgent.ai Developer): feat(phase-3): workstream 6 — SOC 2 Type II Preparation
Implements all 22 WS6 tasks, completing Phase 3 (Enterprise).

Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService
applied to credentials.secret_hash, credentials.vault_path,
webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path.
Backward-compatible: an isEncrypted() guard skips decryption for existing
plaintext rows until the next read-write cycle.

Audit chain integrity (CC7.2): AuditRepository computes a SHA-256 hash chain
on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)).
AuditVerificationService walks the full chain verifying hash continuity.
AuditChainVerificationJob runs hourly; sets agentidp_audit_chain_integrity
Prometheus gauge to 1 (pass) or 0 (fail).

TLS enforcement (CC6.7): TLSEnforcementMiddleware registered as first
middleware in Express stack; 301 redirect on non-https X-Forwarded-Proto
in production.

SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7
days; increments agentidp_credentials_expiring_soon_total.

ComplianceController + routes: GET /audit/verify (auth+audit:read scope,
30/min rate-limit); GET /compliance/controls (public, Cache-Control 60s).
ComplianceStatusStore: module-level map updated by jobs, consumed by controller.

Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total,
agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.

Compliance docs: soc2-controls-matrix.md, encryption-runbook.md,
audit-log-runbook.md, incident-response.md, secrets-rotation.md.

Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService,
AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 00:41:53 +00:00


Incident Response Runbook — SentryAgent.ai AgentIdP

Owner: Security Engineering
Last updated: 2026-03-31
Applies to: Production AgentIdP deployments

This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.


1. Auth Failure Spike

Detection

Prometheus alert: AuthFailureSpike

expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
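
In alerts.yml this corresponds to a standard Prometheus alerting rule. A sketch of the shape (the group name and annotation text are illustrative, not copied from the real file):

groups:
  - name: agentidp-compliance        # illustrative group name
    rules:
      - alert: AuthFailureSpike
        expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 401 rate above 0.5/s sustained for 2 minutes"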

Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.

Immediate Actions

  1. Acknowledge the alert in PagerDuty (or the equivalent alerting system)
  2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
  3. Check Prometheus dashboard for the geographic distribution of the failing requests
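
A quick way to see where the 401s concentrate (a sketch; the route label name is an assumption about how agentidp_http_requests_total is labeled, so substitute whatever dimensions your dashboards actually expose):

    # 401 rate broken down by route over the alert window (label name assumed)
    sum by (route) (rate(agentidp_http_requests_total{status_code="401"}[5m]))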

Investigation Steps

  1. Identify source agents:

    # Query audit log for recent auth failures
    curl -s -H "Authorization: Bearer <admin-token>" \
      "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
    
  2. Check for brute-force patterns: Look for repeated failures from the same client_id or IP address.
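
    One way to surface repeat offenders from the audit API (a sketch; it assumes the response is a JSON object with an events array whose entries carry a clientId field, so adjust the jq path to the real response shape):

    # Count recent auth failures per client (response field names assumed)
    curl -s -H "Authorization: Bearer <admin-token>" \
      "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100" \
      | jq -r '.events[].clientId' | sort | uniq -c | sort -rn | head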

  3. Check if an agent's credentials expired:

    # Look for expired credentials
    psql "$DATABASE_URL" -c "
      SELECT credential_id, client_id, expires_at
      FROM credentials
      WHERE status = 'active' AND expires_at < NOW()
      ORDER BY expires_at DESC LIMIT 20;"
    
  4. Check for key compromise signals:

    • Multiple agents failing simultaneously → possible key store issue
    • Single agent with high failure rate → possible credential stuffing or misconfiguration

Escalation Path

  • Warning (< 2 req/s): Engineering on-call investigates within 1 hour
  • Critical (> 2 req/s sustained): CISO notified, potential account compromise investigation
  • If credential compromise confirmed: Revoke affected credentials immediately via POST /agents/:id/credentials/:credId/revoke
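
A sketch of the revocation call (it assumes the same bearer-token auth and base URL as the audit queries above; the IDs are placeholders):

    # Revoke a compromised credential immediately
    curl -s -X POST -H "Authorization: Bearer <admin-token>" \
      "https://api.sentryagent.ai/v1/agents/<agent-id>/credentials/<credential-id>/revoke"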

2. Anomalous Token Issuance

Detection

Prometheus alert: AnomalousTokenIssuance

expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning

Triggers when the token issuance rate exceeds 10 per second, sustained for 5 minutes.

Immediate Actions

  1. Acknowledge the alert
  2. Determine if a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
  3. Check the scope label breakdown on agentidp_tokens_issued_total to identify what scopes are being requested
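
The scope breakdown is a one-line PromQL query (this relies on the scope label mentioned above; the topk depth is arbitrary):

    # Token issuance rate by requested scope
    topk(10, sum by (scope) (rate(agentidp_tokens_issued_total[5m])))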

Investigation Steps

  1. Identify top issuing agents:

    # Query audit log for recent token issuances
    curl -s -H "Authorization: Bearer <admin-token>" \
      "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
    
  2. Check monthly token budget: Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.

  3. Check for abnormal scope combinations: If tokens are being issued with admin:orgs or audit:read at high volume, this warrants immediate investigation.
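
    That check as PromQL (a sketch; it assumes the scope label carries the raw scope string, and the threshold is illustrative):

    # Issuance rate for high-privilege scopes (threshold illustrative)
    sum(rate(agentidp_tokens_issued_total{scope=~"admin:orgs|audit:read"}[5m])) > 0.1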

  4. Check for valid business reason: Contact the organization owner for the top-issuing agents.

Escalation Path

  • Warning: Engineering on-call investigates within 4 hours
  • If compromise suspected: Revoke affected agent tokens via Redis revocation list, rotate credentials
  • If systematic abuse confirmed: Suspend the issuing agent(s) via PATCH /agents/:id with status: suspended
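
A sketch of the suspension call (same assumptions about auth and base URL as the earlier examples):

    # Suspend an abusive agent
    curl -s -X PATCH -H "Authorization: Bearer <admin-token>" \
      -H "Content-Type: application/json" \
      -d '{"status": "suspended"}' \
      "https://api.sentryagent.ai/v1/agents/<agent-id>"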

3. Audit Chain Integrity Failure

Detection

Prometheus alert: AuditChainIntegrityFailed

expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical

Fires immediately when AuditChainVerificationJob detects a break in the audit event hash chain. This is a CRITICAL security event — possible evidence of log tampering.

Immediate Actions

  1. Do NOT attempt to repair the broken chain — preserve all evidence
  2. Notify CISO and security team immediately
  3. Page the on-call security engineer with P0 priority
  4. Capture the current state:
    curl -s -H "Authorization: Bearer <audit-token>" \
      "https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
    

Investigation Steps

  1. Determine the broken event: The brokenAtEventId field in the /audit/verify response identifies the first broken event.
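
    To recompute an event's chain hash by hand, concatenate the fields in the order AuditRepository hashes them (eventId+timestamp+action+outcome+agentId+orgId+prevHash) and SHA-256 the result. A sketch follows; the eventId query parameter and the response field names are assumptions about the audit API, so adjust them to the real shape:

    # Fetch the broken event and recompute its chain hash (field names assumed)
    event=$(curl -s -H "Authorization: Bearer <audit-token>" \
      "https://api.sentryagent.ai/v1/audit?eventId=<brokenAtEventId>" | jq '.events[0]')
    concat=$(jq -r '.eventId + .timestamp + .action + .outcome + .agentId + .orgId + .prevHash' <<< "$event")
    printf '%s' "$concat" | sha256sum | awk '{print $1}'   # computed hash
    jq -r '.hash' <<< "$event"                             # stored hash; must match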

  2. Forensic analysis: Follow the steps in docs/compliance/audit-log-runbook.md — "What to Do When brokenAtEventId is Returned".

  3. Check database access logs: Review PostgreSQL pg_stat_activity and connection logs for unauthorized direct DB access.

  4. Check application logs: Look for any errors from the immutability trigger (audit_events_immutable).

  5. Check Vault audit logs: Review whether any encryption key access was abnormal.

Escalation Path

  • Immediate: CISO + Legal + Security Engineering
  • Within 1 hour: Begin forensic preservation per incident response plan
  • Within 24 hours: Determine scope of compromise and notification obligations
  • Customer notification: Per contractual and regulatory obligations (GDPR, SOC 2 requirements)

4. Webhook Dead-Letter Accumulation

Detection

Prometheus alert: WebhookDeadLetterAccumulating

expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical

Fires when more than 10 webhook deliveries reach dead-letter status within an hour.

Immediate Actions

  1. Acknowledge the alert

  2. Check which organization_id labels are accumulating dead-letters:

    # Prometheus query: top organizations by dead-letter growth over the last hour
    topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))
    
  3. Check if the destination endpoints are reachable:

    curl -I https://<webhook-destination-url>/
    

Investigation Steps

  1. List affected webhook subscriptions:

    # Query delivery records for dead-letter status
    psql "$DATABASE_URL" -c "
      SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
      FROM webhook_subscriptions s
      JOIN webhook_deliveries d ON d.subscription_id = s.id
      WHERE d.status = 'dead_letter'
        AND d.updated_at > NOW() - INTERVAL '2 hours'
      GROUP BY s.id
      ORDER BY dead_letters DESC
      LIMIT 20;"
    
  2. Check delivery failure reasons:

    psql "$DATABASE_URL" -c "
      SELECT http_status_code, COUNT(*) as count
      FROM webhook_deliveries
      WHERE status = 'dead_letter'
        AND updated_at > NOW() - INTERVAL '2 hours'
      GROUP BY http_status_code;"
    
  3. Common causes and resolutions:

    HTTP Status   Likely Cause                         Resolution
    0 / null      Network unreachable / DNS failure    Check recipient endpoint availability
    401 / 403     HMAC signature validation failing    Customer to verify HMAC secret
    404           Endpoint URL changed                 Customer to update webhook URL
    5xx           Recipient server error               Customer to investigate their endpoint
    Timeout       Slow recipient endpoint              Customer to optimize endpoint response time
  4. Notify affected customers: Contact the organization owner for high-volume dead-letter subscriptions.

Escalation Path

  • Warning (10-50/hr): Engineering notifies affected customers, investigates endpoint health
  • Critical (> 50/hr): Engineering on-call + Platform reliability team engaged
  • If systemic delivery infrastructure failure: Activate incident bridge, escalate to VP Engineering