feat(phase-3): workstream 6 — SOC 2 Type II Preparation

Implements all 22 WS6 tasks completing Phase 3 Enterprise.

Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService applied to credentials.secret_hash, credentials.vault_path, webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path. Backward-compatible: isEncrypted() guard skips decryption for existing plaintext rows until next read-write cycle.

Audit chain integrity (CC7.2): AuditRepository computes a chained SHA-256 hash on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)). AuditVerificationService walks the full chain verifying hash continuity. AuditChainVerificationJob runs hourly; sets agentidp_audit_chain_integrity Prometheus gauge to 1 (pass) or 0 (fail).

TLS enforcement (CC6.7): TLSEnforcementMiddleware registered as first middleware in Express stack; 301 redirect on non-https X-Forwarded-Proto in production.

SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7 days; increments agentidp_credentials_expiring_soon_total.

ComplianceController + routes: GET /audit/verify (auth+audit:read scope, 30/min rate-limit); GET /compliance/controls (public, Cache-Control 60s). ComplianceStatusStore: module-level map updated by jobs, consumed by controller.

Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total, agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.

Compliance docs: soc2-controls-matrix.md, encryption-runbook.md, audit-log-runbook.md, incident-response.md, secrets-rotation.md.

Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService, AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 00:41:53 +00:00
parent 272b69f18d
commit fd90b2acd1
35 changed files with 3715 additions and 26 deletions
--- a/docs/compliance/audit-log-runbook.md
+++ b/docs/compliance/audit-log-runbook.md
@@ -0,0 +1,172 @@
+# Audit Log Chain Verification Runbook — SentryAgent.ai AgentIdP
+
+**Control:** SOC 2 CC7.2 — Audit Log Integrity
+**Service:** `src/services/AuditVerificationService.ts`
+**Job:** `src/jobs/AuditChainVerificationJob.ts`
+**Endpoint:** `GET /api/v1/audit/verify`
+
+---
+
+## Overview
+
+Every audit event in the `audit_events` PostgreSQL table is linked to the previous one
+via a SHA-256 hash chain. Each event stores:
+
+- `hash` — SHA-256 of `(eventId + timestamp.toISOString() + action + outcome + agentId + organizationId + previousHash)`
+- `previous_hash` — the `hash` of the immediately preceding event (ordered by `timestamp ASC, event_id ASC`)
+
+The first event in the chain uses `previous_hash = ''` (empty string sentinel).
+
+A PostgreSQL trigger (`trg_audit_events_immutable`) prevents UPDATE and DELETE operations
+on `audit_events`, making the log tamper-evident at the database level.
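The hash-chain rule above can be sketched in TypeScript as follows. This is a minimal illustration, not the code in `AuditVerificationService`: the `AuditEvent` shape, the function names, and the choice to count the broken event in `checkedCount` are all assumptions. It covers only a full-chain walk that starts at the empty-string sentinel.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory shape of an audit_events row (field names are assumptions).
interface AuditEvent {
  eventId: string;
  timestamp: Date;
  action: string;
  outcome: string;
  agentId: string;
  organizationId: string;
  hash: string;
  previousHash: string;
}

type HashInput = Omit<AuditEvent, "hash">;

// SHA-256 over the concatenated fields, in the order listed in the Overview.
function computeEventHash(e: HashInput): string {
  return createHash("sha256")
    .update(
      e.eventId + e.timestamp.toISOString() + e.action + e.outcome +
        e.agentId + e.organizationId + e.previousHash,
    )
    .digest("hex");
}

interface VerifyResult {
  verified: boolean;
  checkedCount: number;
  brokenAtEventId: string | null;
}

// Walks events already sorted by (timestamp ASC, event_id ASC); stops at the first break.
function verifyChain(events: AuditEvent[]): VerifyResult {
  let expectedPrev = ""; // empty-string sentinel for the first event
  let checked = 0;
  for (const e of events) {
    checked += 1; // assumption: the broken event itself is counted
    if (e.previousHash !== expectedPrev || e.hash !== computeEventHash(e)) {
      return { verified: false, checkedCount: checked, brokenAtEventId: e.eventId };
    }
    expectedPrev = e.hash;
  }
  return { verified: true, checkedCount: checked, brokenAtEventId: null };
}
```

Because each `hash` covers `previousHash`, changing any field of one event invalidates that event's own hash, and inserting or deleting an event breaks the `previous_hash` linkage of its successor.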
+
+---
+
+## Running GET /audit/verify
+
+### Full chain verification (no date range)
+
+```bash
+# Requires Bearer token with audit:read scope
+curl -s -H "Authorization: Bearer <token>" \
+  "https://api.sentryagent.ai/v1/audit/verify"
+```
+
+**Response (chain intact):**
+```json
+{
+  "verified": true,
+  "checkedCount": 18504,
+  "brokenAtEventId": null
+}
+```
+
+**Response (chain break detected):**
+```json
+{
+  "verified": false,
+  "checkedCount": 1203,
+  "brokenAtEventId": "c4d5e6f7-a8b9-0123-cdef-456789012345"
+}
+```
+
+### Date-ranged verification
+
+```bash
+curl -s -H "Authorization: Bearer <token>" \
+  "https://api.sentryagent.ai/v1/audit/verify?fromDate=2026-03-01T00:00:00.000Z&toDate=2026-03-31T23:59:59.999Z"
+```
+
+### Interpreting the response
+
+| Field | Meaning |
+|---|---|
+| `verified: true` | All events in the checked range maintain valid hash chain linkage |
+| `verified: false` | At least one chain break detected — see `brokenAtEventId` |
+| `checkedCount` | Number of events examined (0 = no events in range) |
+| `brokenAtEventId` | UUID of the first event where the chain fails (`null` if verified) |
+| `fromDate` / `toDate` | Echo of the date range parameters (only present if supplied) |
+
+---
+
+## AuditChainVerificationJob
+
+The `AuditChainVerificationJob` runs automatically in the background every hour (default).
+Configure the interval via `AUDIT_CHAIN_VERIFICATION_INTERVAL_MS` (milliseconds).
+
+On each tick it calls `verifyChain()`. If verification passes:
+- Sets Prometheus gauge `agentidp_audit_chain_integrity` to **1** (passing)
+- Updates `ComplianceStatusStore` with `CC7.2 = passing`
+
+If verification fails:
+- Sets gauge to **0**
+- Updates `ComplianceStatusStore` with `CC7.2 = failing`
+- Prometheus alert `AuditChainIntegrityFailed` fires immediately (severity: critical)
+- Application logs: `[AuditChainVerificationJob] Chain BROKEN at event <uuid>`
+
+---
+
+## What to Do When `brokenAtEventId` is Returned
+
+### Step 1: Preserve Evidence
+
+Immediately capture the full state of the audit log for forensic analysis:
+
+```sql
+-- Export all events from one hour before the break point onward
+SELECT event_id, timestamp, action, outcome, agent_id, organization_id, hash, previous_hash
+FROM audit_events
+WHERE timestamp >= (
+  SELECT timestamp - INTERVAL '1 hour'
+  FROM audit_events WHERE event_id = '<brokenAtEventId>'
+)
+ORDER BY timestamp ASC, event_id ASC;
+```
+
+Save the output to a secure, immutable location (e.g. S3 with object locking).
+
+### Step 2: Identify the Break Type
+
+Compare the recomputed hash for the broken event with its stored hash:
+
+```bash
+# Using Node.js
+node -e "
+const crypto = require('crypto');
+const eventId = '<event_id>';
+const timestamp = '<timestamp_from_db>';
+const action = '<action>';
+const outcome = '<outcome>';
+const agentId = '<agent_id>';
+const orgId = '<organization_id>';
+const prevHash = '<previous_hash_from_db>';
+const expected = crypto.createHash('sha256')
+  .update(eventId + new Date(timestamp).toISOString() + action + outcome + agentId + orgId + prevHash)
+  .digest('hex');
+console.log('Expected hash:', expected);
+console.log('Stored hash: <hash_from_db>');
+console.log('Match:', expected === '<hash_from_db>');
+"
+```
+
+Possible break types:
+- **Hash mismatch only** — event data was modified after insertion
+- **previous_hash mismatch** — an event was inserted/deleted before this event in the chain
+- **Both mismatched** — multiple modifications or an injection attack
+
+### Step 3: Escalate
+
+A chain break is a **critical security incident**. Immediately:
+
+1. Notify the security team and CISO
+2. Engage incident response procedure (`docs/compliance/incident-response.md` — Audit Chain Integrity Failure section)
+3. Do NOT attempt to "fix" the hash — preserve the broken state as evidence
+4. Consider temporarily suspending API access pending investigation
+5. Notify affected customers per data breach notification obligations
+
+### Step 4: Forensic Investigation
+
+Using PostgreSQL audit logs, Vault audit logs, and application logs:
+- Identify which application process or database connection modified the row
+- Correlate with access logs and authentication events
+- Determine the extent of the compromise (single row vs. systematic)
+
+---
+
+## Verification Rate Limiting
+
+`GET /audit/verify` is rate-limited to **30 requests/minute** per `client_id`.
+For continuous monitoring, use `AuditChainVerificationJob` (background job, no rate limit)
+and poll `GET /compliance/controls` instead.
+
+---
+
+## SOC 2 Evidence Package
+
+For auditors, provide:
+
+1. `GET /audit/verify` response (full chain, no date filter) — save as JSON
+2. Prometheus metric export: `agentidp_audit_chain_integrity` time series (30/60/90 days)
+3. PostgreSQL trigger definition: `\d+ audit_events` in psql
+4. `src/db/migrations/020_add_audit_chain_columns.sql` — shows immutability trigger DDL
+5. `docs/openapi/compliance.yaml` — endpoint specification
--- a/docs/compliance/encryption-runbook.md
+++ b/docs/compliance/encryption-runbook.md
@@ -0,0 +1,159 @@
+# Encryption Key Rotation Runbook — SentryAgent.ai AgentIdP
+
+**Control:** SOC 2 CC6.1 — Encryption at Rest
+**Service:** `src/services/EncryptionService.ts`
+**Vault path:** Configured via `ENCRYPTION_KEY_VAULT_PATH` env var (default: `secret/data/agentidp/encryption-key`)
+
+---
+
+## Overview
+
+AgentIdP uses AES-256-CBC column-level encryption for sensitive PostgreSQL columns.
+The encryption key is a 64-character hex string (32 bytes) stored in HashiCorp Vault.
+The `EncryptionService` fetches the key once and caches it in process memory.
+
+Encrypted format: `base64(IV):base64(ciphertext)` where IV is 16 random bytes per encryption call.
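The stored format can be illustrated with Node's built-in `crypto` module. This is a hedged sketch, not the actual `EncryptionService`: the function names `parseKey`, `splitStored`, `encrypt`, and `decrypt` are hypothetical, and the `isEncrypted()` heuristic shown here (two base64 segments, a 16-byte IV, ciphertext length a multiple of the AES block size) is an assumption about how such a guard could work.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Parses the 64-character hex key into 32 raw bytes.
function parseKey(hexKey: string): Buffer {
  if (!/^[0-9a-fA-F]{64}$/.test(hexKey)) {
    throw new Error("Invalid encryption key: expected a 64-character hex string");
  }
  return Buffer.from(hexKey, "hex");
}

// Splits "base64(IV):base64(ciphertext)"; returns null if the value does not fit the format.
function splitStored(value: string): { iv: Buffer; data: Buffer } | null {
  const m = /^([A-Za-z0-9+/]+={0,2}):([A-Za-z0-9+/]+={0,2})$/.exec(value);
  if (m === null) return null;
  const iv = Buffer.from(m[1] ?? "", "base64");
  const data = Buffer.from(m[2] ?? "", "base64");
  if (iv.length !== 16 || data.length === 0 || data.length % 16 !== 0) return null;
  return { iv, data };
}

// Assumed heuristic: a value counts as encrypted if it parses as IV:ciphertext.
function isEncrypted(value: string): boolean {
  return splitStored(value) !== null;
}

// Encrypts with a fresh 16-byte random IV on every call.
function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return `${iv.toString("base64")}:${data.toString("base64")}`;
}

// Returns plaintext rows unchanged, mirroring the backward-compatible guard.
function decrypt(stored: string, key: Buffer): string {
  const parsed = splitStored(stored);
  if (parsed === null) return stored;
  const decipher = createDecipheriv("aes-256-cbc", key, parsed.iv);
  return Buffer.concat([decipher.update(parsed.data), decipher.final()]).toString("utf8");
}
```

Since the IV is random per call, encrypting the same value twice yields different stored strings, so equality comparisons on encrypted columns are not meaningful.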
+
+---
+
+## Key Rotation Procedure
+
+### Prerequisites
+
+- Access to HashiCorp Vault with write permissions to the encryption key path
+- Access to the production application environment (to trigger restart)
+- At least one backup of the current key stored securely offline
+
+### Step 1: Generate a New Key
+
+Generate a cryptographically strong 32-byte (64-character hex) key:
+
+```bash
+openssl rand -hex 32
+# Example output: a1b2c3d4e5f6...  (64 hex chars)
+```
+
+Record the new key securely.
+
+### Step 2: Backup the Current Key
+
+Before overwriting, read and securely store the current key:
+
+```bash
+vault kv get -field=encryptionKey secret/agentidp/encryption-key > /secure/backup/encryption-key-$(date +%Y%m%d).txt
+```
+
+Store in a hardware security module (HSM) or offline key store.
+
+### Step 3: Write the New Key to Vault
+
+```bash
+vault kv put secret/agentidp/encryption-key encryptionKey="<new-64-char-hex-key>"
+```
+
+Verify the write:
+
+```bash
+vault kv get secret/agentidp/encryption-key
+```
+
+Confirm the `encryptionKey` field contains exactly 64 hex characters.
+
+### Step 4: Restart the Application
+
+The `EncryptionService` caches the key in process memory. A restart forces a re-fetch from Vault:
+
+```bash
+# Kubernetes rolling restart
+kubectl rollout restart deployment/agentidp
+
+# Docker Compose
+docker-compose restart agentidp
+
+# PM2
+pm2 restart agentidp
+```
+
+### Step 5: Verify Key Pick-Up
+
+Check the application logs for:
+
+```
+[AgentIdP] EncryptionService enabled — sensitive columns encrypted at rest (SOC 2 CC6.1)
+```
+
+Call the compliance controls endpoint to confirm the control is passing:
+
+```bash
+curl -s https://api.sentryagent.ai/v1/compliance/controls | jq '.controls[] | select(.id == "CC6.1")'
+```
+
+Expected output:
+```json
+{ "id": "CC6.1", "name": "Encryption at Rest", "status": "passing", "lastChecked": "..." }
+```
+
+### Step 6: Re-encryption of Existing Rows
+
+Rows encrypted with the old key fail to decrypt once the application runs with the new key,
+because `EncryptionService` caches a single key fetched from Vault. The lazy read-write migration
+applies only to legacy plaintext rows, which the `isEncrypted()` guard passes through unchanged;
+rows already encrypted under the old key must be re-encrypted explicitly.
+
+For immediate full re-encryption, use the re-encryption script:
+
+```bash
+# Run the re-encryption migration script (reads old key from backup, encrypts with new key)
+# Note: This script requires both old and new keys to be available
+ts-node scripts/reencrypt-columns.ts --old-key-file /secure/backup/encryption-key-<date>.txt
+```
+
+---
+
+## Emergency Rollback
+
+If the new key causes issues (e.g. test failures, decryption errors), roll back:
+
+### Step 1: Restore Old Key to Vault
+
+```bash
+vault kv put secret/agentidp/encryption-key encryptionKey="<old-64-char-hex-key-from-backup>"
+```
+
+### Step 2: Restart the Application
+
+```bash
+kubectl rollout restart deployment/agentidp
+```
+
+### Step 3: Verify Recovery
+
+```bash
+curl -s https://api.sentryagent.ai/v1/compliance/controls | jq '.controls[] | select(.id == "CC6.1")'
+```
+
+### Step 4: Investigate Root Cause
+
+Review application logs for `AES-256-CBC decryption failed` errors and audit the cause before
+reattempting rotation.
+
+---
+
+## Troubleshooting
+
+| Symptom | Likely Cause | Resolution |
+|---|---|---|
+| `Invalid encryption key ... expected a 64-character hex string` | Key in Vault is wrong length or encoding | Re-write correct key to Vault, restart |
+| `AES-256-CBC decryption failed — possible key mismatch` | Key rotated but rows still encrypted with old key | Rollback to old key, then migrate properly |
+| `CC6.1` status shows `unknown` | Vault unreachable, key fetch failed | Check Vault connectivity, `VAULT_ADDR`, `VAULT_TOKEN` |
+
+---
+
+## Audit Evidence
+
+After rotation, record the following for SOC 2 evidence:
+
+- Date of rotation
+- Who performed the rotation (approver + executor)
+- Vault audit log entry confirming the key write
+- Application log confirming EncryptionService initialised with new key
+- `GET /compliance/controls` response showing CC6.1 = passing
--- a/docs/compliance/incident-response.md
+++ b/docs/compliance/incident-response.md
@@ -0,0 +1,229 @@
+# Incident Response Runbook — SentryAgent.ai AgentIdP
+
+**Owner:** Security Engineering
+**Last updated:** 2026-03-31
+**Applies to:** Production AgentIdP deployments
+
+This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.
+
+---
+
+## 1. Auth Failure Spike
+
+### Detection
+
+**Prometheus alert:** `AuthFailureSpike`
+```yaml
+expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
+for: 2m
+severity: warning
+```
+
+Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
+
+### Immediate Actions
+
+1. Acknowledge the alert in PagerDuty / alerting system
+2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
+3. Check Prometheus dashboard for the geographic distribution of the failing requests
+
+### Investigation Steps
+
+1. **Identify source agents:**
+   ```bash
+   # Query audit log for recent auth failures
+   curl -s -H "Authorization: Bearer <admin-token>" \
+     "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
+   ```
+
+2. **Check for brute-force patterns:**
+   Look for repeated failures from the same `client_id` or IP address.
+
+3. **Check if an agent's credentials expired:**
+   ```bash
+   # Look for expired credentials
+   psql "$DATABASE_URL" -c "
+     SELECT credential_id, client_id, expires_at
+     FROM credentials
+     WHERE status = 'active' AND expires_at < NOW()
+     ORDER BY expires_at DESC LIMIT 20;"
+   ```
+
+4. **Check for key compromise signals:**
+   - Multiple agents failing simultaneously → possible key store issue
+   - Single agent with high failure rate → possible credential stuffing or misconfiguration
+
+### Escalation Path
+
+- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
+- **Critical (> 2 req/s sustained):** CISO notified, potential account compromise investigation
+- **If credential compromise confirmed:** Revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`
+
+---
+
+## 2. Anomalous Token Issuance
+
+### Detection
+
+**Prometheus alert:** `AnomalousTokenIssuance`
+```yaml
+expr: rate(agentidp_tokens_issued_total[5m]) > 10
+for: 5m
+severity: warning
+```
+
+Triggers when token issuance rate exceeds 10 per second for 5 continuous minutes.
+
+### Immediate Actions
+
+1. Acknowledge the alert
+2. Determine if a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
+3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify what scopes are being requested
+
+### Investigation Steps
+
+1. **Identify top issuing agents:**
+   ```bash
+   # Query audit log for recent token issuances
+   curl -s -H "Authorization: Bearer <admin-token>" \
+     "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
+   ```
+
+2. **Check monthly token budget:**
+   Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.
+
+3. **Check for abnormal scope combinations:**
+   If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.
+
+4. **Check for valid business reason:**
+   Contact the organization owner for the top-issuing agents.
+
+### Escalation Path
+
+- **Warning:** Engineering on-call investigates within 4 hours
+- **If compromise suspected:** Revoke affected agent tokens via Redis revocation list, rotate credentials
+- **If systematic abuse confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`
+
+---
+
+## 3. Audit Chain Integrity Failure
+
+### Detection
+
+**Prometheus alert:** `AuditChainIntegrityFailed`
+```yaml
+expr: agentidp_audit_chain_integrity == 0
+for: 0m
+severity: critical
+```
+
+Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain.
+This is a **CRITICAL** security event — possible evidence of log tampering.
+
+### Immediate Actions
+
+1. **Do NOT attempt to repair the broken chain** — preserve all evidence
+2. Notify CISO and security team immediately
+3. Page the on-call security engineer with P0 priority
+4. Capture the current state:
+   ```bash
+   curl -s -H "Authorization: Bearer <audit-token>" \
+     "https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
+   ```
+
+### Investigation Steps
+
+1. **Determine the broken event:**
+   The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
+
+2. **Forensic analysis:**
+   Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
+
+3. **Check database access logs:**
+   Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.
+
+4. **Check application logs:**
+   Look for any errors from the immutability trigger (`audit_events_immutable`).
+
+5. **Check Vault audit logs:**
+   Review whether any encryption key access was abnormal.
+
+### Escalation Path
+
+- **Immediate:** CISO + Legal + Security Engineering
+- **Within 1 hour:** Begin forensic preservation per incident response plan
+- **Within 24 hours:** Determine scope of compromise and notification obligations
+- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
+
+---
+
+## 4. Webhook Dead-Letter Accumulation
+
+### Detection
+
+**Prometheus alert:** `WebhookDeadLetterAccumulating`
+```yaml
+expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
+for: 0m
+severity: critical
+```
+
+Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
+
+### Immediate Actions
+
+1. Acknowledge the alert
+2. Check which `organization_id` labels are accumulating dead-letters:
+   ```bash
+   # Prometheus query: top organizations by dead-letter rate
+   # agentidp_webhook_dead_letters_total (by organization_id)
+   ```
+
+3. Check if the destination endpoints are reachable:
+   ```bash
+   curl -I https://<webhook-destination-url>/
+   ```
+
+### Investigation Steps
+
+1. **List affected webhook subscriptions:**
+   ```bash
+   # Query delivery records for dead-letter status
+   psql "$DATABASE_URL" -c "
+     SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
+     FROM webhook_subscriptions s
+     JOIN webhook_deliveries d ON d.subscription_id = s.id
+     WHERE d.status = 'dead_letter'
+       AND d.updated_at > NOW() - INTERVAL '2 hours'
+     GROUP BY s.id
+     ORDER BY dead_letters DESC
+     LIMIT 20;"
+   ```
+
+2. **Check delivery failure reasons:**
+   ```bash
+   psql "$DATABASE_URL" -c "
+     SELECT http_status_code, COUNT(*) as count
+     FROM webhook_deliveries
+     WHERE status = 'dead_letter'
+       AND updated_at > NOW() - INTERVAL '2 hours'
+     GROUP BY http_status_code;"
+   ```
+
+3. **Common causes and resolutions:**
+   | HTTP Status | Likely Cause | Resolution |
+   |---|---|---|
+   | 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
+   | 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
+   | 404 | Endpoint URL changed | Customer to update webhook URL |
+   | 5xx | Recipient server error | Customer to investigate their endpoint |
+   | Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |
+
+4. **Notify affected customers:**
+   Contact the organization owner for high-volume dead-letter subscriptions.
+
+### Escalation Path
+
+- **Warning (10-50/hr):** Engineering notifies affected customers, investigates endpoint health
+- **Critical (> 50/hr):** Engineering on-call + Platform reliability team engaged
+- **If systemic delivery infrastructure failure:** Activate incident bridge, escalate to VP Engineering
--- a/docs/compliance/secrets-rotation.md
+++ b/docs/compliance/secrets-rotation.md
@@ -0,0 +1,142 @@
+# Secrets Rotation Runbook — SentryAgent.ai AgentIdP
+
+**Control:** SOC 2 CC9.2 — Secrets Rotation
+**Last updated:** 2026-03-31
+
+---
+
+## Overview
+
+AgentIdP manages three categories of secrets that require periodic rotation:
+
+1. **Agent client secrets** — Per-credential client secrets used for OAuth 2.0 token issuance
+2. **OIDC signing keys** — RSA/EC keys used to sign ID tokens
+3. **AES-256-CBC encryption key** — Column-level database encryption key (see `encryption-runbook.md`)
+
+---
+
+## 1. Agent Credential (Client Secret) Rotation
+
+### API endpoint
+
+```
+POST /api/v1/agents/:agentId/credentials/:credentialId/rotate
+```
+
+Requires Bearer token with `agents:write` scope.
+
+### Procedure
+
+```bash
+# 1. List active credentials for the agent
+curl -s -H "Authorization: Bearer <token>" \
+  "https://api.sentryagent.ai/v1/agents/<agentId>/credentials?status=active"
+
+# 2. Rotate the credential (generate new secret)
+curl -s -X POST \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d '{"expiresAt": "2027-03-31T00:00:00.000Z"}' \
+  "https://api.sentryagent.ai/v1/agents/<agentId>/credentials/<credentialId>/rotate"
+
+# Response includes the new clientSecret — store it immediately; it is never shown again
+```
+
+### Key points
+
+- The new `clientSecret` is returned **once only** — store it securely before the response is discarded
+- The agent's previous secret is immediately invalidated (Vault KV v2 version overwritten)
+- An audit event `credential.rotated` is logged to the immutable audit chain
+- A `credential.rotated` webhook event is dispatched to all active subscriptions
+
+### Recommended rotation schedule
+
+| Credential type | Recommended rotation interval |
+|---|---|
+| Production agent credentials | 90 days |
+| Staging / development credentials | 180 days |
+| Service account credentials | 365 days (annual) |
+| Credentials involved in a security incident | Immediately |
+
+### Automated expiry detection
+
+`SecretsRotationJob` runs hourly and queries credentials expiring within 7 days.
+Prometheus alert `CredentialExpiryApproaching` fires immediately when any are detected.
+Respond to this alert by rotating the flagged credential(s) before the expiry date.
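The expiry scan can be pictured as a filter over credential rows. The sketch below is illustrative only: the `Credential` shape, the status values, and the helper names are assumptions, and it assumes already-expired credentials are excluded from the "expiring soon" set, which the runbook does not state.

```typescript
// Hypothetical row shape; column and status names are assumptions.
interface Credential {
  credentialId: string;
  agentId: string;
  status: "active" | "revoked" | "expired";
  expiresAt: Date | null;
}

const EXPIRY_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Active credentials whose expiry falls within the next 7 days.
function findExpiringSoon(creds: Credential[], now: Date): Credential[] {
  const start = now.getTime();
  const cutoff = start + EXPIRY_WINDOW_MS;
  return creds.filter(
    (c) =>
      c.status === "active" &&
      c.expiresAt !== null &&
      c.expiresAt.getTime() > start &&
      c.expiresAt.getTime() <= cutoff,
  );
}

// Per-agent counts, i.e. the amounts a per-agent counter would be incremented by.
function countByAgent(creds: Credential[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of creds) {
    counts.set(c.agentId, (counts.get(c.agentId) ?? 0) + 1);
  }
  return counts;
}
```

Because the job runs hourly and a counter only increases, a credential that stays inside the window would be counted again on each run under this sketch; the alert therefore keys on any increase rather than on the absolute value.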
+
+---
+
+## 2. OIDC Signing Key Rotation
+
+### Overview
+
+OIDC signing keys are managed by `OIDCKeyService` (`src/services/OIDCKeyService.ts`).
+Keys are stored in the `oidc_keys` PostgreSQL table. The current active key is used to
+sign all new ID tokens; public keys are exposed via `GET /.well-known/jwks.json`.
+
+### When to rotate
+
+- Key compromise or suspected exposure
+- Scheduled rotation (recommended every 90 days for production)
+- Algorithm upgrade (e.g. RS256 → ES256)
+
+### Rotation procedure
+
+OIDC key rotation is handled automatically by `OIDCKeyService.ensureCurrentKey()`:
+
+```bash
+# Force generation of a new signing key by calling the internal rotate endpoint
+# (or trigger by redeploying with OIDC_FORCE_KEY_ROTATION=true)
+
+# 1. Mark current key as inactive (if manual rotation is required)
+psql "$DATABASE_URL" -c "
+  UPDATE oidc_keys
+  SET active = false
+  WHERE active = true;"
+
+# 2. Restart the application — ensureCurrentKey() will generate a new key on startup
+kubectl rollout restart deployment/agentidp
+```
+
+### JWKS update behavior
+
+- Old public keys remain in `GET /.well-known/jwks.json` for **24 hours** after rotation
+  (grace period for in-flight tokens)
+- After the grace period, old keys are removed from the JWKS endpoint
+- Redis JWKS cache TTL is configured by `JWKS_CACHE_TTL_SECONDS` (default: 3600)
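The grace-period rule can be expressed as a small predicate. This is a sketch under assumptions: the `SigningKey` shape and the `rotatedAt` field are hypothetical and may not match the real `oidc_keys` columns.

```typescript
// Hypothetical key record; field names are assumptions.
interface SigningKey {
  kid: string;
  active: boolean;
  rotatedAt: Date | null; // when the key was retired, null while still active
}

const JWKS_GRACE_MS = 24 * 60 * 60 * 1000; // 24-hour grace period

// Keys that should appear in the JWKS document at time `now`.
function publishedKeys(keys: SigningKey[], now: Date): SigningKey[] {
  return keys.filter(
    (k) =>
      k.active ||
      (k.rotatedAt !== null && now.getTime() - k.rotatedAt.getTime() < JWKS_GRACE_MS),
  );
}
```

A key retired 23 hours ago is still published under this rule, while one retired 25 hours ago is dropped, at which point tokens signed with it can no longer be verified against JWKS.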
+
+### Impact on existing tokens
+
+Existing valid tokens signed with the old key **continue to work** until they expire,
+as long as the old public key remains in JWKS. After the grace period, old tokens
+will fail verification.
+
+---
+
+## 3. Encryption Key Rotation
+
+See `docs/compliance/encryption-runbook.md` for the full AES-256-CBC encryption key rotation procedure.
+
+**Summary:** Generate new 32-byte hex key → write to Vault at `ENCRYPTION_KEY_VAULT_PATH` → restart app → re-encrypt rows written under the old key (see runbook Step 6).
+
+---
+
+## Schedule Recommendations
+
+| Secret Type | Production Interval | Staging Interval | Trigger for Immediate Rotation |
+|---|---|---|---|
+| Agent client secrets | 90 days | 180 days | Credential suspected compromised |
+| OIDC signing keys | 90 days | 180 days | Key file exposed, algorithm upgrade |
+| AES-256-CBC encryption key | 365 days (annual) | On demand | Key exposed, Vault breach, compliance audit requirement |
+| Webhook HMAC secrets | Per customer policy | N/A | Webhook endpoint compromised |
+
+---
+
+## Compliance Evidence
+
+For SOC 2 CC9.2 evidence collection:
+
+- Prometheus metric history: `agentidp_credentials_expiring_soon_total`
+- Audit log entries with `action: credential.rotated` — query via `GET /audit?action=credential.rotated`
+- Key rotation records from Vault audit log
+- This runbook + sign-off from Security Engineering
--- a/docs/compliance/soc2-controls-matrix.md
+++ b/docs/compliance/soc2-controls-matrix.md
@@ -0,0 +1,42 @@
+# SOC 2 Type II Controls Matrix — SentryAgent.ai AgentIdP
+
+This document maps the five in-scope SOC 2 Trust Services Criteria (TSC) controls to their
+corresponding implementation artefacts, mechanisms, and automated verification methods.
+
+---
+
+## Controls Matrix
+
+| Control ID | TSC Criterion Name | Implementation File | Mechanism | Automated Check |
+|---|---|---|---|---|
+| **CC6.1** | Encryption at Rest | `src/services/EncryptionService.ts` | AES-256-CBC column-level encryption on `credentials.secret_hash`, `credentials.vault_path`, `webhook_subscriptions.vault_secret_path`, `agent_did_keys.vault_key_path`. Key is stored in HashiCorp Vault KV v2 at path configured by `ENCRYPTION_KEY_VAULT_PATH`. IV is randomised per encryption call. Backward-compat: `isEncrypted()` gate allows plaintext rows to coexist during migration. | `GET /api/v1/compliance/controls` returns `CC6.1` status. Status is set to `passing` on service startup when `EncryptionService` initialises. |
+| **CC6.7** | TLS Enforcement | `src/middleware/TLSEnforcementMiddleware.ts` | Express middleware registered as the **first** middleware in the app stack (before all routes and body parsers). In `NODE_ENV=production`, checks `X-Forwarded-Proto` header set by the upstream load balancer/reverse proxy. Any non-HTTPS request receives a `301 Moved Permanently` redirect to `https://`. | `GET /api/v1/compliance/controls` returns `CC6.7` status. TLS enforcement is a static configuration control; status is set to `passing` on application startup. |
+| **CC7.2** | Audit Log Integrity | `src/services/AuditVerificationService.ts`, `src/repositories/AuditRepository.ts`, `src/jobs/AuditChainVerificationJob.ts` | Each audit event (`audit_events` table) stores a `hash` (SHA-256 of `eventId + timestamp + action + outcome + agentId + organizationId + previousHash`) and `previous_hash` linking it to the prior event. An immutability trigger prevents UPDATE/DELETE on `audit_events`. `AuditChainVerificationJob` re-walks the entire chain every hour. | Prometheus gauge `agentidp_audit_chain_integrity` (1 = passing, 0 = failing). Prometheus alert `AuditChainIntegrityFailed` fires when gauge = 0. `GET /api/v1/audit/verify` triggers an on-demand verification. `GET /api/v1/compliance/controls` returns `CC7.2` status. |
+| **CC9.2** | Secrets Rotation | `src/jobs/SecretsRotationJob.ts` | `SecretsRotationJob` runs every hour (configurable via `SECRETS_ROTATION_CHECK_INTERVAL_MS`) and queries `credentials` for `active` credentials expiring within 7 days. For each, it increments the `agentidp_credentials_expiring_soon_total` Prometheus counter with the owning `agent_id`. Operators are expected to act on the alert within the 7-day window. | Prometheus counter `agentidp_credentials_expiring_soon_total` per `agent_id`. Prometheus alert `CredentialExpiryApproaching` fires when any increase is detected. `GET /api/v1/compliance/controls` returns `CC9.2` status. |
+| **CC7.1** | Webhook Dead-Letter Monitoring | `src/workers/WebhookDeliveryWorker.ts` | `WebhookDeliveryWorker` processes webhook deliveries from a Redis queue. After exhausting all retry attempts (configurable `WEBHOOK_MAX_RETRIES`), the delivery is moved to dead-letter status and `agentidp_webhook_dead_letters_total` is incremented. | Prometheus counter `agentidp_webhook_dead_letters_total` per `organization_id`. Prometheus alert `WebhookDeadLetterAccumulating` fires when > 10 dead-letters accumulate in 1 hour. `GET /api/v1/compliance/controls` returns `CC7.1` status. |
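As an illustration of the CC6.7 row, the redirect decision can be isolated as a pure function. This is a sketch, not the contents of `TLSEnforcementMiddleware.ts`: the function name and parameters are hypothetical, and it assumes that a missing `X-Forwarded-Proto` header in production is treated as non-HTTPS and that a comma-separated header lists the client-facing protocol first.

```typescript
// Returns the https:// URL to redirect to (with a 301), or null when the request may proceed.
function httpsRedirectTarget(
  nodeEnv: string | undefined,
  forwardedProto: string | undefined,
  host: string,
  originalUrl: string,
): string | null {
  // Enforcement applies only in production.
  if (nodeEnv !== "production") return null;
  // Take the first protocol if a chain of proxies appended several values.
  const proto = (forwardedProto ?? "").split(",")[0]?.trim().toLowerCase() ?? "";
  if (proto === "https") return null;
  return `https://${host}${originalUrl}`;
}
```

In an Express middleware, a non-null result would translate to `res.redirect(301, target)` and a null result to `next()`. Registering the middleware first means the check runs before body parsing and routing.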
+
+---
+
+## Evidence Collection
+
+For a SOC 2 Type II audit, the following evidence should be collected:
+
+| Evidence Type | Collection Method |
+|---|---|
+| Encryption at rest configuration | Export Vault KV v2 policy + `_encryption_migration_log` table contents |
+| TLS certificate and enforcement logs | Load balancer access logs + `X-Forwarded-Proto` middleware responses |
+| Audit chain integrity report | `GET /api/v1/audit/verify` with no date filter (full chain) |
+| Secrets rotation compliance | Prometheus metric history for `agentidp_credentials_expiring_soon_total` |
+| Webhook dead-letter rate | Prometheus metric history for `agentidp_webhook_dead_letters_total` |
+| Immutable audit log dump | Direct PostgreSQL export of `audit_events` table with hash verification |
+
+---
+
+## References
+
+- SOC 2 Trust Services Criteria: [AICPA TSC 2017](https://www.aicpa.org/resources/article/trust-services-criteria)
+- OpenAPI spec: `docs/openapi/compliance.yaml`
+- Encryption runbook: `docs/compliance/encryption-runbook.md`
+- Audit log runbook: `docs/compliance/audit-log-runbook.md`
+- Incident response: `docs/compliance/incident-response.md`
+- Secrets rotation: `docs/compliance/secrets-rotation.md`