feat(phase-3): workstream 6 — SOC 2 Type II Preparation
Implements all 22 WS6 tasks completing Phase 3 Enterprise.

Column-level encryption (AES-256-CBC, Vault-backed key) via EncryptionService applied to credentials.secret_hash, credentials.vault_path, webhook_subscriptions.vault_secret_path, and agent_did_keys.vault_key_path. Backward-compatible: isEncrypted() guard skips decryption for existing plaintext rows until next read-write cycle.

Audit chain integrity (CC7.2): AuditRepository computes a chained SHA-256 hash on every INSERT (hash = SHA-256(eventId+timestamp+action+outcome+agentId+orgId+prevHash)). AuditVerificationService walks the full chain verifying hash continuity. AuditChainVerificationJob runs hourly; sets agentidp_audit_chain_integrity Prometheus gauge to 1 (pass) or 0 (fail).

TLS enforcement (CC6.7): TLSEnforcementMiddleware registered as first middleware in Express stack; 301 redirect on non-https X-Forwarded-Proto in production.

SecretsRotationJob (CC9.2): hourly scan for credentials expiring within 7 days; increments agentidp_credentials_expiring_soon_total.

ComplianceController + routes: GET /audit/verify (auth+audit:read scope, 30/min rate-limit); GET /compliance/controls (public, Cache-Control 60s). ComplianceStatusStore: module-level map updated by jobs, consumed by controller.

Prometheus: 2 new metrics (agentidp_credentials_expiring_soon_total, agentidp_audit_chain_integrity); 6 alerting rules in alerts.yml.

Compliance docs: soc2-controls-matrix.md, encryption-runbook.md, audit-log-runbook.md, incident-response.md, secrets-rotation.md.

Tests: 557 unit tests passing (35 suites); 26 new tests (EncryptionService, AuditVerificationService); 19 compliance integration tests. TypeScript clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
172
docs/compliance/audit-log-runbook.md
Normal file
@@ -0,0 +1,172 @@
|
||||
# Audit Log Chain Verification Runbook — SentryAgent.ai AgentIdP
|
||||
|
||||
**Control:** SOC 2 CC7.2 — Audit Log Integrity
|
||||
**Service:** `src/services/AuditVerificationService.ts`
|
||||
**Job:** `src/jobs/AuditChainVerificationJob.ts`
|
||||
**Endpoint:** `GET /api/v1/audit/verify`
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Every audit event in the `audit_events` PostgreSQL table is linked to the previous one via a SHA-256 hash chain. Each event stores:
|
||||
|
||||
- `hash` — SHA-256 of `(eventId + timestamp.toISOString() + action + outcome + agentId + organizationId + previousHash)`
|
||||
- `previous_hash` — the `hash` of the immediately preceding event (ordered by `timestamp ASC, event_id ASC`)
|
||||
|
||||
The first event in the chain uses `previous_hash = ''` (empty string sentinel).
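The linkage described above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the code in `AuditRepository` or `AuditVerificationService`: the `AuditEvent` shape and the helper names `computeEventHash`, `appendEvent`, and `verifyChain` are hypothetical, and counting the broken event in `checkedCount` is a guess at the real behaviour.

```typescript
import { createHash } from "crypto";

// Hypothetical row shape; field names mirror the columns described above.
interface AuditEvent {
  eventId: string;
  timestamp: Date;
  action: string;
  outcome: string;
  agentId: string;
  organizationId: string;
  hash: string;
  previousHash: string;
}

interface VerifyResult {
  verified: boolean;
  checkedCount: number;
  brokenAtEventId: string | null;
}

// SHA-256 over the concatenated fields, in the order listed in this runbook.
function computeEventHash(e: Omit<AuditEvent, "hash">): string {
  return createHash("sha256")
    .update(
      e.eventId + e.timestamp.toISOString() + e.action + e.outcome +
        e.agentId + e.organizationId + e.previousHash,
    )
    .digest("hex");
}

// Appends an event, linking it to the last hash ('' for the first event).
function appendEvent(
  chain: AuditEvent[],
  fields: Omit<AuditEvent, "hash" | "previousHash">,
): AuditEvent {
  const previousHash = chain.length > 0 ? chain[chain.length - 1].hash : "";
  const event: AuditEvent = { ...fields, previousHash, hash: "" };
  event.hash = computeEventHash(event);
  chain.push(event);
  return event;
}

// Walks the chain in order and stops at the first event whose link or hash is wrong.
function verifyChain(events: AuditEvent[]): VerifyResult {
  let expectedPrevious = "";
  let checkedCount = 0;
  for (const e of events) {
    checkedCount++;
    if (e.previousHash !== expectedPrevious || e.hash !== computeEventHash(e)) {
      return { verified: false, checkedCount, brokenAtEventId: e.eventId };
    }
    expectedPrevious = e.hash;
  }
  return { verified: true, checkedCount, brokenAtEventId: null };
}
```

This sketch only covers a full-chain walk starting from the empty-string sentinel. A date-ranged check would presumably need to seed the expected previous hash from the event immediately before the range.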
|
||||
|
||||
A PostgreSQL trigger (`trg_audit_events_immutable`) prevents UPDATE and DELETE operations
on `audit_events`, so rows cannot be changed through normal SQL access. The hash chain makes any change that bypasses the trigger detectable.
|
||||
|
||||
---
|
||||
|
||||
## Running GET /audit/verify
|
||||
|
||||
### Full chain verification (no date range)
|
||||
|
||||
```bash
# Requires Bearer token with audit:read scope
curl -s -H "Authorization: Bearer <token>" \
  "https://api.sentryagent.ai/v1/audit/verify"
```
|
||||
|
||||
**Response (chain intact):**
|
||||
```json
{
  "verified": true,
  "checkedCount": 18504,
  "brokenAtEventId": null
}
```
|
||||
|
||||
**Response (chain break detected):**
|
||||
```json
{
  "verified": false,
  "checkedCount": 1203,
  "brokenAtEventId": "c4d5e6f7-a8b9-0123-cdef-456789012345"
}
```
|
||||
|
||||
### Date-ranged verification
|
||||
|
||||
```bash
curl -s -H "Authorization: Bearer <token>" \
  "https://api.sentryagent.ai/v1/audit/verify?fromDate=2026-03-01T00:00:00.000Z&toDate=2026-03-31T23:59:59.999Z"
```
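When scripting date-ranged checks, building the query string programmatically avoids encoding mistakes. A small sketch, assuming the host, path, and parameter names shown in the curl examples above; `buildVerifyUrl` is a hypothetical helper, not part of any published SDK.

```typescript
// Hypothetical helper: builds the /audit/verify URL with optional ISO 8601 bounds.
function buildVerifyUrl(baseUrl: string, fromDate?: Date, toDate?: Date): string {
  const url = new URL("/v1/audit/verify", baseUrl);
  // Omitting both bounds requests a full-chain verification.
  if (fromDate) url.searchParams.set("fromDate", fromDate.toISOString());
  if (toDate) url.searchParams.set("toDate", toDate.toISOString());
  return url.toString();
}

const marchUrl = buildVerifyUrl(
  "https://api.sentryagent.ai",
  new Date("2026-03-01T00:00:00.000Z"),
  new Date("2026-03-31T23:59:59.999Z"),
);
```

The colons in the timestamps are percent-encoded by `URLSearchParams`, which the server should decode back to the same ISO strings.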
|
||||
|
||||
### Interpreting the response
|
||||
|
||||
| Field | Meaning |
|---|---|
| `verified: true` | All events in the checked range maintain valid hash chain linkage |
| `verified: false` | At least one chain break detected; see `brokenAtEventId` |
| `checkedCount` | Number of events examined (0 = no events in range) |
| `brokenAtEventId` | UUID of the first event where the chain fails (`null` if verified) |
| `fromDate` / `toDate` | Echo of the date range parameters (only present if supplied) |
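
For tooling that consumes this endpoint, the fields in the table map onto a small type. The sketch below is an assumption derived from the example responses in this runbook, not a copy of `docs/openapi/compliance.yaml`; `summarizeVerification` is a hypothetical helper.

```typescript
// Response shape inferred from the examples above (assumption).
interface AuditVerifyResponse {
  verified: boolean;
  checkedCount: number;
  brokenAtEventId: string | null;
  fromDate?: string;
  toDate?: string;
}

// Turns a response into a one-line operator message.
function summarizeVerification(r: AuditVerifyResponse): string {
  if (!r.verified) {
    return `BROKEN at event ${r.brokenAtEventId} (${r.checkedCount} events checked)`;
  }
  if (r.checkedCount === 0) {
    // verified is true here, but nothing was actually examined.
    return "No events in range; nothing was verified";
  }
  return `Chain intact across ${r.checkedCount} events`;
}
```

Treating `checkedCount: 0` separately matters for evidence collection: a passing result over an empty range proves nothing about the log.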
|
||||
|
||||
---
|
||||
|
||||
## AuditChainVerificationJob
|
||||
|
||||
The `AuditChainVerificationJob` runs automatically in the background every hour (default).
|
||||
Configure the interval via `AUDIT_CHAIN_VERIFICATION_INTERVAL_MS` (milliseconds).
|
||||
|
||||
On each tick it calls `verifyChain()`. If verification passes:
|
||||
- Sets Prometheus gauge `agentidp_audit_chain_integrity` to **1** (passing)
|
||||
- Updates `ComplianceStatusStore` with `CC7.2 = passing`
|
||||
|
||||
If verification fails:
|
||||
- Sets gauge to **0**
|
||||
- Updates `ComplianceStatusStore` with `CC7.2 = failing`
|
||||
- Prometheus alert `AuditChainIntegrityFailed` fires immediately (severity: critical)
|
||||
- Application logs: `[AuditChainVerificationJob] Chain BROKEN at event <uuid>`
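
The pass and fail branches above amount to a small mapping from the verification result to three side effects. A sketch of that mapping, with assumptions called out: the `Sinks` interface stands in for the Prometheus gauge, `ComplianceStatusStore`, and the logger, and the fallback to one hour for a missing or invalid `AUDIT_CHAIN_VERIFICATION_INTERVAL_MS` is a guess, not documented behaviour.

```typescript
type ControlStatus = "passing" | "failing";

interface TickResult {
  verified: boolean;
  brokenAtEventId: string | null;
}

// Stand-ins for the gauge, the status store, and the logger (hypothetical).
interface Sinks {
  setGauge(value: 0 | 1): void;
  setControlStatus(controlId: string, status: ControlStatus): void;
  log(message: string): void;
}

const DEFAULT_INTERVAL_MS = 60 * 60 * 1000; // hourly default

// Parses the interval env value; falls back to the default when unusable (assumption).
function resolveIntervalMs(raw: string | undefined): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_INTERVAL_MS;
}

// Applies one verification result to the gauge, the status store, and the log.
function applyVerificationResult(result: TickResult, sinks: Sinks): void {
  if (result.verified) {
    sinks.setGauge(1);
    sinks.setControlStatus("CC7.2", "passing");
    return;
  }
  sinks.setGauge(0);
  sinks.setControlStatus("CC7.2", "failing");
  sinks.log(`[AuditChainVerificationJob] Chain BROKEN at event ${result.brokenAtEventId}`);
}
```

Keeping the mapping free of timers and I/O makes it easy to unit-test separately from the scheduling loop.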
|
||||
|
||||
---
|
||||
|
||||
## What to Do When `brokenAtEventId` is Returned
|
||||
|
||||
### Step 1: Preserve Evidence
|
||||
|
||||
Immediately capture the full state of the audit log for forensic analysis:
|
||||
|
||||
```sql
-- Export all events from one hour before the break point onward
SELECT event_id, timestamp, action, outcome, agent_id, organization_id, hash, previous_hash
FROM audit_events
WHERE timestamp >= (
  SELECT timestamp - INTERVAL '1 hour'
  FROM audit_events WHERE event_id = '<brokenAtEventId>'
)
ORDER BY timestamp ASC, event_id ASC;
```
|
||||
|
||||
Save the output to a secure, immutable location (e.g. S3 with object locking).
|
||||
|
||||
### Step 2: Identify the Break Type
|
||||
|
||||
Compare the recomputed hash for the broken event with its stored hash:
|
||||
|
||||
```bash
# Using Node.js
node -e "
const crypto = require('crypto');
const eventId = '<event_id>';
const timestamp = '<timestamp_from_db>';
const action = '<action>';
const outcome = '<outcome>';
const agentId = '<agent_id>';
const orgId = '<organization_id>';
const prevHash = '<previous_hash_from_db>';
const storedHash = '<hash_from_db>';
// Recompute the hash over the same fields, in the same order, as the application
const expected = crypto.createHash('sha256')
  .update(eventId + new Date(timestamp).toISOString() + action + outcome + agentId + orgId + prevHash)
  .digest('hex');
console.log('Expected hash:', expected);
console.log('Stored hash:  ', storedHash);
console.log('Match:', expected === storedHash);
"
```
|
||||
|
||||
Possible break types:
|
||||
- **Hash mismatch only** — event data was modified after insertion
|
||||
- **previous_hash mismatch** — an event was inserted/deleted before this event in the chain
|
||||
- **Both mismatched** — multiple modifications or an injection attack
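
The three break types can be told apart with two independent checks: does the event's stored hash match a recomputation, and does its `previous_hash` match the hash of the event before it. A sketch, assuming a hypothetical `StoredEvent` shape with the timestamp held as a string from the database; `classifyBreak` is an illustrative name, not an existing function.

```typescript
import { createHash } from "crypto";

// Hypothetical shape of a row exported in Step 1.
interface StoredEvent {
  eventId: string;
  timestamp: string; // as returned by the database
  action: string;
  outcome: string;
  agentId: string;
  organizationId: string;
  hash: string;
  previousHash: string;
}

type BreakType = "none" | "hash_mismatch" | "previous_hash_mismatch" | "both";

// Recomputes the hash with the same field order as the runbook's formula.
function recomputeHash(e: StoredEvent): string {
  return createHash("sha256")
    .update(
      e.eventId + new Date(e.timestamp).toISOString() + e.action + e.outcome +
        e.agentId + e.organizationId + e.previousHash,
    )
    .digest("hex");
}

// precedingHash is the stored hash of the event immediately before this one
// ('' when this is the first event in the chain).
function classifyBreak(event: StoredEvent, precedingHash: string): BreakType {
  const hashOk = recomputeHash(event) === event.hash;
  const linkOk = event.previousHash === precedingHash;
  if (hashOk && linkOk) return "none";
  if (!hashOk && !linkOk) return "both";
  return hashOk ? "previous_hash_mismatch" : "hash_mismatch";
}
```

Note that editing an event's own `previous_hash` column also changes the recomputed hash, so that case shows up as `both` rather than as a link mismatch alone.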
|
||||
|
||||
### Step 3: Escalate
|
||||
|
||||
A chain break is a **critical security incident**. Immediately:
|
||||
|
||||
1. Notify the security team and CISO
|
||||
2. Engage incident response procedure (`docs/compliance/incident-response.md` — Audit Chain Integrity Failure section)
|
||||
3. Do NOT attempt to "fix" the hash — preserve the broken state as evidence
|
||||
4. Consider temporarily suspending API access pending investigation
|
||||
5. Notify affected customers per data breach notification obligations
|
||||
|
||||
### Step 4: Forensic Investigation
|
||||
|
||||
Using PostgreSQL audit logs, Vault audit logs, and application logs:
|
||||
- Identify which application process or database connection modified the row
|
||||
- Correlate with access logs and authentication events
|
||||
- Determine the extent of the compromise (single row vs. systematic)
|
||||
|
||||
---
|
||||
|
||||
## Verification Rate Limiting
|
||||
|
||||
`GET /audit/verify` is rate-limited to **30 requests/minute** per `client_id`.
|
||||
For continuous monitoring, use `AuditChainVerificationJob` (background job, no rate limit) and poll `GET /compliance/controls` instead.
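
To illustrate what a 30 requests/minute limit per `client_id` means for callers, here is a fixed-window limiter sketch. The windowing strategy is an assumption; the actual middleware may use a sliding window or a Redis-backed counter, and `FixedWindowRateLimiter` is a hypothetical name.

```typescript
// Fixed-window limiter keyed by client_id (illustrative only).
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit = 30, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it should be rejected.
  allow(clientId: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(clientId);
    if (!current || nowMs - current.start >= this.windowMs) {
      // First request, or the previous window has elapsed: start a new window.
      this.windows.set(clientId, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count++;
    return true;
  }
}
```

Under this model a script that verifies once every two seconds or slower stays within the limit.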
|
||||
|
||||
---
|
||||
|
||||
## SOC 2 Evidence Package
|
||||
|
||||
For auditors, provide:
|
||||
|
||||
1. `GET /audit/verify` response (full chain, no date filter) — save as JSON
|
||||
2. Prometheus metric export: `agentidp_audit_chain_integrity` time series (30/60/90 days)
|
||||
3. PostgreSQL trigger definition: `\d+ audit_events` in psql
|
||||
4. `src/db/migrations/020_add_audit_chain_columns.sql` — shows immutability trigger DDL
|
||||
5. `docs/openapi/compliance.yaml` — endpoint specification
|
||||
159
docs/compliance/encryption-runbook.md
Normal file
@@ -0,0 +1,159 @@
|
||||
# Encryption Key Rotation Runbook — SentryAgent.ai AgentIdP
|
||||
|
||||
**Control:** SOC 2 CC6.1 — Encryption at Rest
|
||||
**Service:** `src/services/EncryptionService.ts`
|
||||
**Vault path:** Configured via `ENCRYPTION_KEY_VAULT_PATH` env var (default: `secret/data/agentidp/encryption-key`)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
AgentIdP uses AES-256-CBC column-level encryption for sensitive PostgreSQL columns.
|
||||
The encryption key is a 64-character hex string (32 bytes) stored in HashiCorp Vault.
|
||||
The `EncryptionService` fetches the key once and caches it in process memory.
|
||||
|
||||
Encrypted format: `base64(IV):base64(ciphertext)` where IV is 16 random bytes per encryption call.
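
The format above can be reproduced with Node's built-in `crypto` module. This is a minimal sketch, not the actual `EncryptionService`: the function names are illustrative, and the `isEncrypted` heuristic (two colon-separated parts, the first decoding to 16 bytes) is an assumption about how the guard might distinguish ciphertext from legacy plaintext.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

// Encrypts with AES-256-CBC and a fresh 16-byte IV; returns base64(IV):base64(ciphertext).
function encrypt(plaintext: string, hexKey: string): string {
  const key = Buffer.from(hexKey, "hex"); // 64 hex chars -> 32 bytes
  const iv = randomBytes(16);
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return `${iv.toString("base64")}:${ciphertext.toString("base64")}`;
}

// Reverses encrypt(); throws if the payload is malformed.
function decrypt(payload: string, hexKey: string): string {
  const parts = payload.split(":");
  if (parts.length !== 2) throw new Error("payload is not in IV:ciphertext format");
  const iv = Buffer.from(parts[0], "base64");
  const data = Buffer.from(parts[1], "base64");
  const decipher = createDecipheriv("aes-256-cbc", Buffer.from(hexKey, "hex"), iv);
  return Buffer.concat([decipher.update(data), decipher.final()]).toString("utf8");
}

// Heuristic guard (assumption): ciphertext has exactly two parts and a 16-byte IV.
function isEncrypted(value: string): boolean {
  const parts = value.split(":");
  if (parts.length !== 2 || parts[1].length === 0) return false;
  return parts[0].length === 24 && Buffer.from(parts[0], "base64").length === 16;
}
```

Because base64 never contains `:`, splitting on the colon is unambiguous. A random IV per call also means the same plaintext encrypts to different values each time.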
|
||||
|
||||
---
|
||||
|
||||
## Key Rotation Procedure
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Access to HashiCorp Vault with write permissions to the encryption key path
|
||||
- Access to the production application environment (to trigger restart)
|
||||
- At least one backup of the current key stored securely offline
|
||||
|
||||
### Step 1: Generate a New Key
|
||||
|
||||
Generate a cryptographically strong 32-byte (64-character hex) key:
|
||||
|
||||
```bash
openssl rand -hex 32
# Example output: a1b2c3d4e5f6... (64 hex chars)
```
|
||||
|
||||
Record the new key securely.
|
||||
|
||||
### Step 2: Backup the Current Key
|
||||
|
||||
Before overwriting, read and securely store the current key:
|
||||
|
||||
```bash
vault kv get -field=encryptionKey secret/agentidp/encryption-key > /secure/backup/encryption-key-$(date +%Y%m%d).txt
```
|
||||
|
||||
Store in a hardware security module (HSM) or offline key store.
|
||||
|
||||
### Step 3: Write the New Key to Vault
|
||||
|
||||
```bash
vault kv put secret/agentidp/encryption-key encryptionKey="<new-64-char-hex-key>"
```
|
||||
|
||||
Verify the write:
|
||||
|
||||
```bash
vault kv get secret/agentidp/encryption-key
```
|
||||
|
||||
Confirm the `encryptionKey` field contains exactly 64 hex characters.
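
The length-and-alphabet check can be automated rather than done by eye. A small sketch; `isValidEncryptionKey` is a hypothetical helper, and accepting uppercase hex digits is an assumption about what the service tolerates.

```typescript
// True only for exactly 64 hexadecimal characters (32 bytes), with no surrounding whitespace.
function isValidEncryptionKey(value: string): boolean {
  return /^[0-9a-fA-F]{64}$/.test(value);
}
```

A trailing newline, which is easy to pick up when a key is copied from a file, makes the check fail, so trim before validating if the key came from a backup file.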
|
||||
|
||||
### Step 4: Restart the Application
|
||||
|
||||
The `EncryptionService` caches the key in process memory. A restart forces a re-fetch from Vault:
|
||||
|
||||
```bash
# Kubernetes rolling restart
kubectl rollout restart deployment/agentidp

# Docker Compose
docker-compose restart agentidp

# PM2
pm2 restart agentidp
```
|
||||
|
||||
### Step 5: Verify Key Pick-Up
|
||||
|
||||
Check the application logs for:
|
||||
|
||||
```
[AgentIdP] EncryptionService enabled — sensitive columns encrypted at rest (SOC 2 CC6.1)
```
|
||||
|
||||
Call the compliance controls endpoint to confirm the control is passing:
|
||||
|
||||
```bash
curl -s https://api.sentryagent.ai/v1/compliance/controls | jq '.controls[] | select(.id == "CC6.1")'
```
|
||||
|
||||
Expected output:
|
||||
```json
{ "id": "CC6.1", "name": "Encryption at Rest", "status": "passing", "lastChecked": "..." }
```
|
||||
|
||||
### Step 6: Re-encryption of Existing Rows
|
||||
|
||||
Rows encrypted with the old key cannot be decrypted with the new key alone.
Lazy re-encryption (the next time a row is read and re-written, e.g. on credential rotation or a webhook update, it is decrypted with the old key and re-encrypted with the new one) therefore depends on the old key still being available to the application.
If it is not, reads of those rows fail with `AES-256-CBC decryption failed` until the rows are re-encrypted.
|
||||
|
||||
For immediate full re-encryption, use the re-encryption script:
|
||||
|
||||
```bash
# Run the re-encryption migration script (reads old key from backup, encrypts with new key)
# Note: This script requires both old and new keys to be available
ts-node scripts/reencrypt-columns.ts --old-key-file /secure/backup/encryption-key-<date>.txt
```
|
||||
|
||||
---
|
||||
|
||||
## Emergency Rollback
|
||||
|
||||
If the new key causes issues (e.g. test failures, decryption errors), roll back:
|
||||
|
||||
### Step 1: Restore Old Key to Vault
|
||||
|
||||
```bash
vault kv put secret/agentidp/encryption-key encryptionKey="<old-64-char-hex-key-from-backup>"
```
|
||||
|
||||
### Step 2: Restart the Application
|
||||
|
||||
```bash
kubectl rollout restart deployment/agentidp
```
|
||||
|
||||
### Step 3: Verify Recovery
|
||||
|
||||
```bash
curl -s https://api.sentryagent.ai/v1/compliance/controls | jq '.controls[] | select(.id == "CC6.1")'
```
|
||||
|
||||
### Step 4: Investigate Root Cause
|
||||
|
||||
Review application logs for `AES-256-CBC decryption failed` errors and audit the cause before reattempting rotation.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Likely Cause | Resolution |
|---|---|---|
| `Invalid encryption key ... expected a 64-character hex string` | Key in Vault is wrong length or encoding | Re-write correct key to Vault, restart |
| `AES-256-CBC decryption failed — possible key mismatch` | Key rotated but rows still encrypted with old key | Rollback to old key, then migrate properly |
| `CC6.1` status shows `unknown` | Vault unreachable, key fetch failed | Check Vault connectivity, `VAULT_ADDR`, `VAULT_TOKEN` |
|
||||
|
||||
---
|
||||
|
||||
## Audit Evidence
|
||||
|
||||
After rotation, record the following for SOC 2 evidence:
|
||||
|
||||
- Date of rotation
|
||||
- Who performed the rotation (approver + executor)
|
||||
- Vault audit log entry confirming the key write
|
||||
- Application log confirming EncryptionService initialised with new key
|
||||
- `GET /compliance/controls` response showing CC6.1 = passing
|
||||
229
docs/compliance/incident-response.md
Normal file
@@ -0,0 +1,229 @@
|
||||
# Incident Response Runbook — SentryAgent.ai AgentIdP
|
||||
|
||||
**Owner:** Security Engineering
|
||||
**Last updated:** 2026-03-31
|
||||
**Applies to:** Production AgentIdP deployments
|
||||
|
||||
This runbook covers the four incident types most relevant to SOC 2 Type II compliance monitoring.
|
||||
|
||||
---
|
||||
|
||||
## 1. Auth Failure Spike
|
||||
|
||||
### Detection
|
||||
|
||||
**Prometheus alert:** `AuthFailureSpike`
|
||||
```yaml
expr: rate(agentidp_http_requests_total{status_code="401"}[5m]) > 0.5
for: 2m
severity: warning
```
|
||||
|
||||
Triggers when the rate of HTTP 401 responses exceeds 0.5 per second sustained over 2 minutes.
|
||||
|
||||
### Immediate Actions
|
||||
|
||||
1. Acknowledge the alert in PagerDuty / alerting system
|
||||
2. Check whether the spike correlates with a scheduled process (e.g. batch agent key rotation, deployment)
|
||||
3. Check Prometheus dashboard for the geographic distribution of the failing requests
|
||||
|
||||
### Investigation Steps
|
||||
|
||||
1. **Identify source agents:**
|
||||
```bash
# Query audit log for recent auth failures
curl -s -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/audit?action=auth.failed&limit=100"
```
|
||||
|
||||
2. **Check for brute-force patterns:**
|
||||
Look for repeated failures from the same `client_id` or IP address.
|
||||
|
||||
3. **Check if an agent's credentials expired:**
|
||||
```bash
# Look for expired credentials
psql "$DATABASE_URL" -c "
SELECT credential_id, client_id, expires_at
FROM credentials
WHERE status = 'active' AND expires_at < NOW()
ORDER BY expires_at DESC LIMIT 20;"
```
|
||||
|
||||
4. **Check for key compromise signals:**
|
||||
- Multiple agents failing simultaneously → possible key store issue
|
||||
- Single agent with high failure rate → possible credential stuffing or misconfiguration
|
||||
|
||||
### Escalation Path
|
||||
|
||||
- **Warning (< 2 req/s):** Engineering on-call investigates within 1 hour
|
||||
- **Critical (> 2 req/s sustained):** CISO notified, potential account compromise investigation
|
||||
- **If credential compromise confirmed:** Revoke affected credentials immediately via `POST /agents/:id/credentials/:credId/revoke`
|
||||
|
||||
---
|
||||
|
||||
## 2. Anomalous Token Issuance
|
||||
|
||||
### Detection
|
||||
|
||||
**Prometheus alert:** `AnomalousTokenIssuance`
|
||||
```yaml
expr: rate(agentidp_tokens_issued_total[5m]) > 10
for: 5m
severity: warning
```
|
||||
|
||||
Triggers when token issuance rate exceeds 10 per second for 5 continuous minutes.
|
||||
|
||||
### Immediate Actions
|
||||
|
||||
1. Acknowledge the alert
|
||||
2. Determine if a legitimate mass-scale operation is underway (e.g. new customer onboarding, load test)
|
||||
3. Check the `scope` label breakdown on `agentidp_tokens_issued_total` to identify what scopes are being requested
|
||||
|
||||
### Investigation Steps
|
||||
|
||||
1. **Identify top issuing agents:**
|
||||
```bash
# Query audit log for recent token issuances
curl -s -H "Authorization: Bearer <admin-token>" \
  "https://api.sentryagent.ai/v1/audit?action=token.issued&limit=100"
```
|
||||
|
||||
2. **Check monthly token budget:**
|
||||
Each agent is limited to 10,000 tokens/month (free tier). A single agent hitting the limit may indicate automation abuse.
|
||||
|
||||
3. **Check for abnormal scope combinations:**
|
||||
If tokens are being issued with `admin:orgs` or `audit:read` at high volume, this warrants immediate investigation.
|
||||
|
||||
4. **Check for valid business reason:**
|
||||
Contact the organization owner for the top-issuing agents.
|
||||
|
||||
### Escalation Path
|
||||
|
||||
- **Warning:** Engineering on-call investigates within 4 hours
|
||||
- **If compromise suspected:** Revoke affected agent tokens via Redis revocation list, rotate credentials
|
||||
- **If systematic abuse confirmed:** Suspend the issuing agent(s) via `PATCH /agents/:id` with `status: suspended`
|
||||
|
||||
---
|
||||
|
||||
## 3. Audit Chain Integrity Failure
|
||||
|
||||
### Detection
|
||||
|
||||
**Prometheus alert:** `AuditChainIntegrityFailed`
|
||||
```yaml
expr: agentidp_audit_chain_integrity == 0
for: 0m
severity: critical
```
|
||||
|
||||
Fires immediately when `AuditChainVerificationJob` detects a break in the audit event hash chain.
|
||||
This is a **CRITICAL** security event — possible evidence of log tampering.
|
||||
|
||||
### Immediate Actions
|
||||
|
||||
1. **Do NOT attempt to repair the broken chain** — preserve all evidence
|
||||
2. Notify CISO and security team immediately
|
||||
3. Page the on-call security engineer with P0 priority
|
||||
4. Capture the current state:
|
||||
```bash
curl -s -H "Authorization: Bearer <audit-token>" \
  "https://api.sentryagent.ai/v1/audit/verify" | tee /secure/incident-$(date +%Y%m%d-%H%M).json
```
|
||||
|
||||
### Investigation Steps
|
||||
|
||||
1. **Determine the broken event:**
|
||||
The `brokenAtEventId` field in the `/audit/verify` response identifies the first broken event.
|
||||
|
||||
2. **Forensic analysis:**
|
||||
Follow the steps in `docs/compliance/audit-log-runbook.md` — "What to Do When brokenAtEventId is Returned".
|
||||
|
||||
3. **Check database access logs:**
|
||||
Review PostgreSQL `pg_stat_activity` and connection logs for unauthorized direct DB access.
|
||||
|
||||
4. **Check application logs:**
|
||||
Look for any errors from the immutability trigger (`trg_audit_events_immutable`).
|
||||
|
||||
5. **Check Vault audit logs:**
|
||||
Review whether any encryption key access was abnormal.
|
||||
|
||||
### Escalation Path
|
||||
|
||||
- **Immediate:** CISO + Legal + Security Engineering
|
||||
- **Within 1 hour:** Begin forensic preservation per incident response plan
|
||||
- **Within 24 hours:** Determine scope of compromise and notification obligations
|
||||
- **Customer notification:** Per contractual and regulatory obligations (GDPR, SOC 2 requirements)
|
||||
|
||||
---
|
||||
|
||||
## 4. Webhook Dead-Letter Accumulation
|
||||
|
||||
### Detection
|
||||
|
||||
**Prometheus alert:** `WebhookDeadLetterAccumulating`
|
||||
```yaml
expr: increase(agentidp_webhook_dead_letters_total[1h]) > 10
for: 0m
severity: critical
```
|
||||
|
||||
Fires when more than 10 webhook deliveries reach dead-letter status within an hour.
|
||||
|
||||
### Immediate Actions
|
||||
|
||||
1. Acknowledge the alert
|
||||
2. Check which `organization_id` labels are accumulating dead-letters:
|
||||
```promql
# Top 10 organizations by dead-letters added in the last hour
topk(10, sum by (organization_id) (increase(agentidp_webhook_dead_letters_total[1h])))
```
|
||||
|
||||
3. Check if the destination endpoints are reachable:
|
||||
```bash
curl -I https://<webhook-destination-url>/
```
|
||||
|
||||
### Investigation Steps
|
||||
|
||||
1. **List affected webhook subscriptions:**
|
||||
```bash
# Query delivery records for dead-letter status
psql "$DATABASE_URL" -c "
SELECT s.id, s.organization_id, s.url, COUNT(d.id) AS dead_letters
FROM webhook_subscriptions s
JOIN webhook_deliveries d ON d.subscription_id = s.id
WHERE d.status = 'dead_letter'
  AND d.updated_at > NOW() - INTERVAL '2 hours'
GROUP BY s.id, s.organization_id, s.url
ORDER BY dead_letters DESC
LIMIT 20;"
```
|
||||
|
||||
2. **Check delivery failure reasons:**
|
||||
```bash
psql "$DATABASE_URL" -c "
SELECT http_status_code, COUNT(*) AS count
FROM webhook_deliveries
WHERE status = 'dead_letter'
  AND updated_at > NOW() - INTERVAL '2 hours'
GROUP BY http_status_code;"
```
|
||||
|
||||
3. **Common causes and resolutions:**
|
||||
| HTTP Status | Likely Cause | Resolution |
|---|---|---|
| 0 / null | Network unreachable / DNS failure | Check recipient endpoint availability |
| 401 / 403 | HMAC signature validation failing | Customer to verify HMAC secret |
| 404 | Endpoint URL changed | Customer to update webhook URL |
| 5xx | Recipient server error | Customer to investigate their endpoint |
| Timeout | Slow recipient endpoint | Customer to optimize endpoint response time |
|
||||
|
||||
4. **Notify affected customers:**
|
||||
Contact the organization owner for high-volume dead-letter subscriptions.
|
||||
|
||||
### Escalation Path
|
||||
|
||||
- **Warning (10-50/hr):** Engineering notifies affected customers, investigates endpoint health
|
||||
- **Critical (> 50/hr):** Engineering on-call + Platform reliability team engaged
|
||||
- **If systemic delivery infrastructure failure:** Activate incident bridge, escalate to VP Engineering
|
||||
142
docs/compliance/secrets-rotation.md
Normal file
@@ -0,0 +1,142 @@
|
||||
# Secrets Rotation Runbook — SentryAgent.ai AgentIdP
|
||||
|
||||
**Control:** SOC 2 CC9.2 — Secrets Rotation
|
||||
**Last updated:** 2026-03-31
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
AgentIdP manages three categories of secrets that require periodic rotation:
|
||||
|
||||
1. **Agent client secrets** — Per-credential client secrets used for OAuth 2.0 token issuance
|
||||
2. **OIDC signing keys** — RSA/EC keys used to sign ID tokens
|
||||
3. **AES-256-CBC encryption key** — Column-level database encryption key (see `encryption-runbook.md`)
|
||||
|
||||
---
|
||||
|
||||
## 1. Agent Credential (Client Secret) Rotation
|
||||
|
||||
### API endpoint
|
||||
|
||||
```
POST /api/v1/agents/:agentId/credentials/:credentialId/rotate
```
|
||||
|
||||
Requires Bearer token with `agents:write` scope.
|
||||
|
||||
### Procedure
|
||||
|
||||
```bash
# 1. List active credentials for the agent
curl -s -H "Authorization: Bearer <token>" \
  "https://api.sentryagent.ai/v1/agents/<agentId>/credentials?status=active"

# 2. Rotate the credential (generate new secret)
curl -s -X POST \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"expiresAt": "2027-03-31T00:00:00.000Z"}' \
  "https://api.sentryagent.ai/v1/agents/<agentId>/credentials/<credentialId>/rotate"

# The response includes the new clientSecret; store it immediately, as it is never shown again
```
|
||||
|
||||
### Key points
|
||||
|
||||
- The new `clientSecret` is returned **once only** — store it securely before the response is discarded
|
||||
- The agent's previous secret is immediately invalidated (Vault KV v2 version overwritten)
|
||||
- An audit event `credential.rotated` is logged to the immutable audit chain
|
||||
- A `credential.rotated` webhook event is dispatched to all active subscriptions
|
||||
|
||||
### Recommended rotation schedule
|
||||
|
||||
| Credential type | Recommended rotation interval |
|---|---|
| Production agent credentials | 90 days |
| Staging / development credentials | 180 days |
| Service account credentials | 365 days (annual) |
| Credentials involved in a security incident | Immediately |
|
||||
|
||||
### Automated expiry detection
|
||||
|
||||
`SecretsRotationJob` runs hourly and queries credentials expiring within 7 days.
|
||||
Prometheus alert `CredentialExpiryApproaching` fires immediately when any are detected.
|
||||
Respond to this alert by rotating the flagged credential(s) before the expiry date.
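
The selection the job performs can be sketched as a pure function. The `Credential` shape and function names below are hypothetical, and excluding credentials that have already expired is an assumption; the real query in `SecretsRotationJob` may differ.

```typescript
// Hypothetical projection of a credentials row.
interface Credential {
  credentialId: string;
  agentId: string;
  status: string;
  expiresAt: Date;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Active credentials whose expiry falls after `now` and within the window.
function findExpiringSoon(
  credentials: Credential[],
  now: Date,
  windowMs: number = SEVEN_DAYS_MS,
): Credential[] {
  const start = now.getTime();
  const cutoff = start + windowMs;
  return credentials.filter((c) => {
    const expiry = c.expiresAt.getTime();
    return c.status === "active" && expiry > start && expiry <= cutoff;
  });
}

// Groups the result by agent, matching the per-agent_id labelling of the counter.
function countByAgent(credentials: Credential[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of credentials) {
    counts.set(c.agentId, (counts.get(c.agentId) ?? 0) + 1);
  }
  return counts;
}
```

Each entry in the resulting map corresponds to one agent whose credentials need rotating before the window closes.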
|
||||
|
||||
---
|
||||
|
||||
## 2. OIDC Signing Key Rotation
|
||||
|
||||
### Overview
|
||||
|
||||
OIDC signing keys are managed by `OIDCKeyService` (`src/services/OIDCKeyService.ts`).
|
||||
Keys are stored in the `oidc_keys` PostgreSQL table. The current active key is used to sign all new ID tokens; public keys are exposed via `GET /.well-known/jwks.json`.
|
||||
|
||||
### When to rotate
|
||||
|
||||
- Key compromise or suspected exposure
|
||||
- Scheduled rotation (recommended every 90 days for production)
|
||||
- Algorithm upgrade (e.g. RS256 → ES256)
|
||||
|
||||
### Rotation procedure
|
||||
|
||||
`OIDCKeyService.ensureCurrentKey()` generates a new signing key on startup when no active key exists. To force a rotation manually:
|
||||
|
||||
```bash
# Force generation of a new signing key by calling the internal rotate endpoint
# (or trigger by redeploying with OIDC_FORCE_KEY_ROTATION=true)

# 1. Mark current key as inactive (if manual rotation is required)
psql "$DATABASE_URL" -c "
UPDATE oidc_keys
SET active = false
WHERE active = true;"

# 2. Restart the application; ensureCurrentKey() will generate a new key on startup
kubectl rollout restart deployment/agentidp
```
|
||||
|
||||
### JWKS update behavior
|
||||
|
||||
- Old public keys remain in `GET /.well-known/jwks.json` for **24 hours** after rotation (grace period for in-flight tokens)
|
||||
- After the grace period, old keys are removed from the JWKS endpoint
|
||||
- Redis JWKS cache TTL is configured by `JWKS_CACHE_TTL_SECONDS` (default: 3600)
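
The grace-period rule can be expressed as a filter over the key set. This is a sketch under assumptions: the source only shows an `active` column on `oidc_keys`, so `deactivatedAt` and the `keysToPublish` helper are hypothetical.

```typescript
// Hypothetical projection of an oidc_keys row; deactivatedAt is an assumed column.
interface SigningKey {
  kid: string;
  active: boolean;
  deactivatedAt: Date | null;
}

const JWKS_GRACE_MS = 24 * 60 * 60 * 1000; // 24-hour grace period

// Keys to expose in the JWKS: the active key plus any key retired within the grace period.
function keysToPublish(keys: SigningKey[], now: Date): SigningKey[] {
  return keys.filter((k) => {
    if (k.active) return true;
    if (k.deactivatedAt === null) return false;
    return now.getTime() - k.deactivatedAt.getTime() < JWKS_GRACE_MS;
  });
}
```

With the default `JWKS_CACHE_TTL_SECONDS` of 3600, a cached JWKS may still list a retired key for up to an hour after the filter stops returning it.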
|
||||
|
||||
### Impact on existing tokens
|
||||
|
||||
Existing valid tokens signed with the old key **continue to work** until they expire, as long as the old public key remains in JWKS. After the grace period, old tokens will fail verification.
|
||||
|
||||
---
|
||||
|
||||
## 3. Encryption Key Rotation
|
||||
|
||||
See `docs/compliance/encryption-runbook.md` for the full AES-256-CBC encryption key rotation procedure.
|
||||
|
||||
**Summary:** Generate new 32-byte (64-character hex) key → write to Vault at `ENCRYPTION_KEY_VAULT_PATH` → restart app → re-encrypt existing rows (lazily on the next read-write cycle, or immediately with the re-encryption script described in that runbook).
|
||||
|
||||
---
|
||||
|
||||
## Schedule Recommendations
|
||||
|
||||
| Secret Type | Production Interval | Staging Interval | Trigger for Immediate Rotation |
|---|---|---|---|
| Agent client secrets | 90 days | 180 days | Credential suspected compromised |
| OIDC signing keys | 90 days | 180 days | Key file exposed, algorithm upgrade |
| AES-256-CBC encryption key | 365 days (annual) | On demand | Key exposed, Vault breach, compliance audit requirement |
| Webhook HMAC secrets | Per customer policy | N/A | Webhook endpoint compromised |
|
||||
|
||||
---
|
||||
|
||||
## Compliance Evidence
|
||||
|
||||
For SOC 2 CC9.2 evidence collection:
|
||||
|
||||
- Prometheus metric history: `agentidp_credentials_expiring_soon_total`
|
||||
- Audit log entries with `action: credential.rotated` — query via `GET /audit?action=credential.rotated`
|
||||
- Key rotation records from Vault audit log
|
||||
- This runbook + sign-off from Security Engineering
|
||||
42
docs/compliance/soc2-controls-matrix.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# SOC 2 Type II Controls Matrix — SentryAgent.ai AgentIdP
|
||||
|
||||
This document maps the five in-scope SOC 2 Trust Services Criteria (TSC) controls to their corresponding implementation artefacts, mechanisms, and automated verification methods.
|
||||
|
||||
---
|
||||
|
||||
## Controls Matrix
|
||||
|
||||
| Control ID | TSC Criterion Name | Implementation File | Mechanism | Automated Check |
|---|---|---|---|---|
| **CC6.1** | Encryption at Rest | `src/services/EncryptionService.ts` | AES-256-CBC column-level encryption on `credentials.secret_hash`, `credentials.vault_path`, `webhook_subscriptions.vault_secret_path`, `agent_did_keys.vault_key_path`. Key is stored in HashiCorp Vault KV v2 at path configured by `ENCRYPTION_KEY_VAULT_PATH`. IV is randomised per encryption call. Backward-compat: `isEncrypted()` gate allows plaintext rows to coexist during migration. | `GET /api/v1/compliance/controls` returns `CC6.1` status. Status is set to `passing` on service startup when `EncryptionService` initialises. |
| **CC6.7** | TLS Enforcement | `src/middleware/TLSEnforcementMiddleware.ts` | Express middleware registered as the **first** middleware in the app stack (before all routes and body parsers). In `NODE_ENV=production`, checks `X-Forwarded-Proto` header set by the upstream load balancer/reverse proxy. Any non-HTTPS request receives a `301 Moved Permanently` redirect to `https://`. | `GET /api/v1/compliance/controls` returns `CC6.7` status. TLS enforcement is a static configuration control; status is set to `passing` on application startup. |
| **CC7.2** | Audit Log Integrity | `src/services/AuditVerificationService.ts`, `src/repositories/AuditRepository.ts`, `src/jobs/AuditChainVerificationJob.ts` | Each audit event (`audit_events` table) stores a `hash` (SHA-256 of `eventId + timestamp + action + outcome + agentId + organizationId + previousHash`) and `previous_hash` linking it to the prior event. An immutability trigger prevents UPDATE/DELETE on `audit_events`. `AuditChainVerificationJob` re-walks the entire chain every hour. | Prometheus gauge `agentidp_audit_chain_integrity` (1 = passing, 0 = failing). Prometheus alert `AuditChainIntegrityFailed` fires when gauge = 0. `GET /api/v1/audit/verify` triggers an on-demand verification. `GET /api/v1/compliance/controls` returns `CC7.2` status. |
| **CC9.2** | Secrets Rotation | `src/jobs/SecretsRotationJob.ts` | `SecretsRotationJob` runs every hour (configurable via `SECRETS_ROTATION_CHECK_INTERVAL_MS`) and queries `credentials` for `active` credentials expiring within 7 days. For each, it increments the `agentidp_credentials_expiring_soon_total` Prometheus counter with the owning `agent_id`. Operators are expected to act on the alert within the 7-day window. | Prometheus counter `agentidp_credentials_expiring_soon_total` per `agent_id`. Prometheus alert `CredentialExpiryApproaching` fires when any increase is detected. `GET /api/v1/compliance/controls` returns `CC9.2` status. |
| **CC7.1** | Webhook Dead-Letter Monitoring | `src/workers/WebhookDeliveryWorker.ts` | `WebhookDeliveryWorker` processes webhook deliveries from a Redis queue. After exhausting all retry attempts (configurable `WEBHOOK_MAX_RETRIES`), the delivery is moved to dead-letter status and `agentidp_webhook_dead_letters_total` is incremented. | Prometheus counter `agentidp_webhook_dead_letters_total` per `organization_id`. Prometheus alert `WebhookDeadLetterAccumulating` fires when > 10 dead-letters accumulate in 1 hour. `GET /api/v1/compliance/controls` returns `CC7.1` status. |
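
As an illustration of the CC6.7 mechanism, the redirect decision can be isolated from Express into a pure function. This is a sketch, not the contents of `TLSEnforcementMiddleware.ts`: `RequestLike` and `httpsRedirectTarget` are hypothetical, and treating a missing header as non-HTTPS and reading only the first value of a comma-separated header are both assumptions.

```typescript
// Minimal stand-in for the parts of an Express request the check needs (hypothetical).
interface RequestLike {
  headers: Record<string, string | undefined>;
  host: string;
  originalUrl: string;
}

// Returns the https:// URL to redirect to with a 301, or null to let the request through.
function httpsRedirectTarget(req: RequestLike, nodeEnv: string | undefined): string | null {
  // Enforcement applies only in production.
  if (nodeEnv !== "production") return null;
  // Proxies may append values ("https,http"); the first one reflects the client connection.
  const proto = (req.headers["x-forwarded-proto"] ?? "").split(",")[0].trim().toLowerCase();
  if (proto === "https") return null;
  return `https://${req.host}${req.originalUrl}`;
}
```

In an Express middleware, a non-null result would be passed to `res.redirect(301, target)` and a null result to `next()`.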
|
||||
|
||||
---
|
||||
|
||||
## Evidence Collection
|
||||
|
||||
For a SOC 2 Type II audit, the following evidence should be collected:
|
||||
|
||||
| Evidence Type | Collection Method |
|---|---|
| Encryption at rest configuration | Export Vault KV v2 policy + `_encryption_migration_log` table contents |
| TLS certificate and enforcement logs | Load balancer access logs + `X-Forwarded-Proto` middleware responses |
| Audit chain integrity report | `GET /api/v1/audit/verify` with full date range |
| Secrets rotation compliance | Prometheus metric history for `agentidp_credentials_expiring_soon_total` |
| Webhook dead-letter rate | Prometheus metric history for `agentidp_webhook_dead_letters_total` |
| Immutable audit log dump | Direct PostgreSQL export of `audit_events` table with hash verification |
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- SOC 2 Trust Services Criteria: [AICPA TSC 2017](https://www.aicpa.org/resources/article/trust-services-criteria)
|
||||
- OpenAPI spec: `docs/openapi/compliance.yaml`
|
||||
- Encryption runbook: `docs/compliance/encryption-runbook.md`
|
||||
- Audit log runbook: `docs/compliance/audit-log-runbook.md`
|
||||
- Incident response: `docs/compliance/incident-response.md`
|
||||
- Secrets rotation: `docs/compliance/secrets-rotation.md`
|
||||