# Phase 3: Enterprise — Technical Design
**Date**: 2026-03-29
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal
---
## Architecture Overview
Phase 3 transforms AgentIdP from a single-tenant OAuth 2.0 server into a multi-tenant, W3C DID-issuing, OIDC-compliant, federated enterprise identity platform. The architecture remains a monolithic Express application (no microservices split) to avoid operational complexity, with clear service boundaries enforced internally.
```
┌──────────────────────────────────────────────────────┐
│ AgentIdP Server (Express) │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Middleware Stack (ordered) │ │
│ │ TLS Enforcement → Auth → Org Context → OPA │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐ │
│ │ OrgSvc │ │ DIDSvc │ │OIDCSvc │ │FedSvc │ │
│ └──────────┘ └──────────┘ └──────────┘ └───────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ WebhookQ │ │ SOC2Ctrl │ │
│ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────────────┘
│ │ │
┌────────▼──┐ ┌─────▼───┐ ┌──▼──────────┐
│PostgreSQL │ │ Redis │ │ Vault │
│(org rows) │ │(webhook │ │ (secrets) │
└───────────┘ │ queue) │ └─────────────┘
└─────────┘
```
---
## Architectural Decision Records
---
### D1: Multi-Tenancy Model
**Status**: Accepted
**Decision**: Row-level tenancy — add `organization_id` (UUID, NOT NULL) to every domain table. No schema-per-tenant, no database-per-tenant.
**Rationale**: Row-level tenancy is operationally the simplest approach: a single database, a single schema, a single connection pool. All queries are augmented with an `organization_id` filter extracted from the authenticated JWT. PostgreSQL Row-Level Security (RLS) is enabled on all tenant-scoped tables as a defense-in-depth measure — even if the application filter is accidentally omitted, the database enforces isolation.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| Schema-per-tenant | Strong isolation, independent migrations | Complex migration tooling, connection pool explosion at scale | Operational overhead exceeds threat model requirement |
| Database-per-tenant | Maximum isolation | Separate connection pool, backup, monitoring per tenant | Prohibitive at 100+ orgs; overkill for our threat model |
| Row-level (chosen) | Simple, fast, single migration path | RLS must be enforced consistently | Chosen — enforce via both application and RLS |
**Consequences**:
- Every domain table gets an `organization_id` column and a corresponding index
- All service methods accept `organizationId: string` as a required parameter
- JWT payload extended to include `organization_id` claim
- Existing single-tenant data migrated to a default `system` organization
- PostgreSQL RLS policies written for all tenant tables
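The RLS half of the defense-in-depth story can be sketched as a migration helper. The policy name and the `app.organization_id` session variable (set by the connection layer from the JWT claim) are assumptions of this sketch, not settled names:

```typescript
// Hypothetical migration helper that emits the RLS statements for one
// tenant-scoped table. The policy reads the tenant id from a session
// variable that the connection layer sets per authenticated request.
function rlsStatements(table: string): string[] {
  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`,
    // FORCE makes the policy apply even to the table owner.
    `ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;`,
    `CREATE POLICY ${table}_tenant_isolation ON ${table}
       USING (organization_id = current_setting('app.organization_id')::uuid);`,
  ];
}

// Every tenant-scoped table gets the same treatment.
const tables = ["agents", "credentials", "audit_logs", "webhook_subscriptions"];
const migrationSql = tables.flatMap(rlsStatements).join("\n");
```

Even if a service method forgets its `organization_id` filter, queries on these tables return only the rows matching the session's tenant.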
---
### D2: DID Method Selection
**Status**: Accepted
**Decision**: `did:web` — DID Documents served over HTTPS at well-known and per-agent URLs.
**Rationale**: `did:web` requires no blockchain, no ledger, and no external infrastructure beyond the HTTPS server already running. It is W3C DID Core 1.0 compliant, supported by all major DID resolvers, and is the preferred method for enterprise deployments where an organization controls its own domain. It aligns directly with the `did:web` identifier scheme used in AGNTCY agent card specifications.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| `did:web` (chosen) | No blockchain, HTTPS-based, enterprise-friendly | DID tied to domain; moving the domain invalidates DIDs | Accepted tradeoff — enterprise deployments have stable domains |
| `did:key` | Self-contained, no infrastructure | Not anchored — anyone can generate any `did:key`; no discovery | No trust anchor; not suitable for enterprise identity |
| `did:ethr` | Ethereum-anchored, decentralized | Blockchain dependency, gas costs, not enterprise-standard | Blockchain dependency is a non-starter for regulated enterprises |
**Consequences**:
- DID for the AgentIdP instance: `did:web:<hostname>`
- DID for an agent: `did:web:<hostname>:agents:<agentId>`
- DID Documents served at `/.well-known/did.json` and `/agents/:id/did`
- Domain change requires DID migration — document this in ops runbook
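The identifier scheme above is mechanical enough to sketch. Key material is omitted; a real DID Document would also carry `verificationMethod` entries:

```typescript
// did:web identifier helpers matching the scheme in this ADR.
// Per the did:web method, any port colon in the hostname must be
// percent-encoded; encodeURIComponent covers that case.
function instanceDid(hostname: string): string {
  return `did:web:${encodeURIComponent(hostname)}`;
}

function agentDid(hostname: string, agentId: string): string {
  return `${instanceDid(hostname)}:agents:${agentId}`;
}

// Skeleton of the DID Document served at /agents/:id/did.
function agentDidDocument(hostname: string, agentId: string) {
  return {
    "@context": ["https://www.w3.org/ns/did/v1"],
    id: agentDid(hostname, agentId),
  };
}
```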
---
### D3: OIDC Library Selection
**Status**: Accepted
**Decision**: `oidc-provider` npm package — a certified, RFC-compliant OIDC server library.
**Rationale**: `oidc-provider` is the most widely deployed Node.js OIDC library, passing the OpenID Foundation's official conformance test suite. Building OIDC from scratch on top of our existing JWT infrastructure would require implementing Discovery, JWKS rotation, ID token construction, and claim aggregation correctly against multiple RFCs. The certified library eliminates that risk and reduces implementation surface area. It integrates cleanly with Express as a mounted middleware.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| `oidc-provider` (chosen) | Certified, RFC-complete, actively maintained | Adds a significant dependency | Risk of non-compliance from custom implementation outweighs dependency cost |
| Custom JWT extension | Full control, no new dependency | High risk of spec deviation; ID token, Discovery, JWKS are complex | RFC compliance cannot be self-certified |
| `keycloak` sidecar | Battle-tested, full-featured | Heavyweight Java service; architectural mismatch | Not Node.js; adds operational complexity |
**Consequences**:
- `oidc-provider` is mounted at `/oidc` in Express
- OIDC Discovery served at `/.well-known/openid-configuration` (proxied from oidc-provider)
- JWKS served at `/.well-known/jwks.json`
- Adapter written to store OIDC sessions in Redis (oidc-provider's adapter interface)
- Existing `POST /oauth2/token` route extended, not replaced — maintains backward compatibility
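For orientation, the Discovery document that ties these endpoints together looks roughly like the sketch below. In the server, `oidc-provider` generates this document itself; the `/oidc/auth` path and the supported-values lists here are illustrative assumptions:

```typescript
// Approximate shape of the document published at
// /.well-known/openid-configuration. Hand-assembled here only to show how
// the endpoints named in this ADR relate; oidc-provider produces the real one.
function discoveryDocument(issuer: string) {
  return {
    issuer,
    authorization_endpoint: `${issuer}/oidc/auth`,
    token_endpoint: `${issuer}/oauth2/token`, // existing route, extended not replaced
    jwks_uri: `${issuer}/.well-known/jwks.json`,
    response_types_supported: ["code"],
    id_token_signing_alg_values_supported: ["RS256"],
  };
}
```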
---
### D4: Federation Protocol
**Status**: Accepted
**Decision**: Signed JWT assertions — remote AgentIdP instances present a signed JWT; the receiving instance verifies the signature against the registered JWKS of the issuing instance.
**Rationale**: JWT assertion federation reuses the existing JWT infrastructure (`jsonwebtoken`, JWKS endpoint from OIDC workstream). No new protocol is introduced. The trust model is explicit: operators register partner instances with their JWKS URL. This aligns with RFC 7523 (JWT Profile for OAuth 2.0 Client Authentication) and the AGNTCY inter-agent trust model.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| Signed JWT assertions (chosen) | Uses existing JWT infra, explicit trust registry, RFC-aligned | JWKS URL must be reachable at verification time | Acceptable operational constraint; JWKS can be cached |
| mTLS | Strong cryptographic identity | Certificate management overhead, PKI required per partner | Cert management complexity not justified when JWT assertions suffice |
| AGNTCY-specific protocol | Native alignment | Spec still evolving; risk of churn | Build on stable JWT base; adapt to AGNTCY extensions as spec matures |
**Consequences**:
- New `federation_partners` table: `id`, `name`, `jwks_url`, `issuer`, `trusted_since`, `organization_id`
- JWKS of partner instances cached in Redis with TTL
- `POST /federation/verify` accepts a bearer token from a remote instance and returns verification result
- Federation tokens are not accepted for agent management endpoints — only for identity assertion
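The verification flow behind `POST /federation/verify` can be sketched as follows. Production verifies RS256 signatures against the partner's cached JWKS; this sketch substitutes HS256 with a shared secret purely to stay self-contained, and the shape of the result object is an assumption:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (buf: Buffer) => buf.toString("base64url");

// Build a compact JWT assertion (HS256 stand-in for the partner's RS256 key).
function signAssertion(payload: object, secret: string): string {
  const header = b64url(Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })));
  const body = b64url(Buffer.from(JSON.stringify(payload)));
  const sig = b64url(createHmac("sha256", secret).update(`${header}.${body}`).digest());
  return `${header}.${body}.${sig}`;
}

// Check the signature, then require the issuer to match a registered
// federation_partners row (modeled here as a Set of trusted issuers).
function verifyAssertion(token: string, secret: string, trustedIssuers: Set<string>) {
  const [header, body, sig] = token.split(".");
  const expected = createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const given = Buffer.from(sig, "base64url");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return { verified: false as const };
  }
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  return { verified: trustedIssuers.has(claims.iss), claims };
}
```

Signature verification comes first so an untrusted issuer cannot distinguish "bad signature" from "unknown issuer" timing-wise.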
---
### D5: Webhook Delivery Architecture
**Status**: Accepted
**Decision**: Async delivery via Redis-backed `bull` queue with exponential backoff retry (max 10 attempts over 24 hours).
**Rationale**: Synchronous webhook delivery from within a request handler would add latency and create tight coupling between event generation and delivery outcome. The Redis queue (`bull`) decouples delivery: events are enqueued immediately, a background worker delivers them. `bull` provides built-in retry, delay, and failure tracking without introducing a new infrastructure component (Redis is already present). HMAC-SHA256 signing on every delivery allows recipients to verify authenticity.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| Redis queue via `bull` (chosen) | Reuses existing Redis, retry built-in, low operational overhead | Delivery tied to Redis availability | Acceptable — Redis is already a required dependency |
| Synchronous in-request delivery | Simplest implementation | Adds latency to event-generating requests; failure blocks response | Unacceptable latency and coupling |
| Dedicated message broker (RabbitMQ) | Robust, durable | New infrastructure dependency | Operational overhead; Redis already present |
| Kafka (primary) | High-throughput, durable | Overkill for webhook delivery; complex operations | Optional adapter only; not primary delivery mechanism |
**Consequences**:
- New `webhook_subscriptions` and `webhook_deliveries` tables
- `bull` worker runs in the same Node.js process (jobs are processed on the main event loop; a sandboxed processor can offload heavy handlers to a child process if needed)
- Retry schedule: 1m, 5m, 15m, 1h, 4h, 12h, 24h (exponential backoff)
- Failed delivery after 10 attempts moves to dead-letter; operator alerted
- Optional Kafka adapter: if `KAFKA_BROKERS` env var is set, events are also produced to Kafka
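The HMAC-SHA256 signing mentioned in the rationale is small enough to show inline. The header name and `sha256=` prefix are assumptions of this sketch; recipients recompute the digest over the raw request body with their shared secret and compare:

```typescript
import { createHmac } from "node:crypto";

// Sign an outbound webhook delivery. The worker attaches the returned
// header to the HTTP POST; the digest covers the exact raw body bytes.
function signWebhook(rawBody: string, secret: string): Record<string, string> {
  const digest = createHmac("sha256", secret).update(rawBody).digest("hex");
  return { "X-AgentIdP-Signature": `sha256=${digest}` };
}
```

A recipient that recomputes the digest and gets a different value knows the payload was forged or altered in transit.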
---
### D6: SOC 2 Scope
**Status**: Accepted
**Decision**: Target SOC 2 Type II (operational, not just design). All controls implemented in code. Audit period: 6 months post-Phase 3 launch.
**Rationale**: SOC 2 Type I certifies that controls are designed correctly. SOC 2 Type II certifies that they operate continuously over a period of time. Enterprise customers in regulated industries (finance, healthcare, government) require Type II. Implementing the controls now, with the 6-month operational window beginning at Phase 3 launch, puts us on the fastest possible path to Type II certification.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| Type II from launch (chosen) | Satisfies enterprise requirements | Requires 6-month operation window | Accepted — the controls are implemented in Phase 3; audit window starts after launch |
| Type I only | Faster to certify | Not accepted by most enterprise procurement | Insufficient for target customers |
| ISO 27001 instead | International standard | Larger scope, longer implementation | SOC 2 is standard for US market; add ISO 27001 in Phase 4 |
**Consequences**:
- Encryption at rest: `pgcrypto` extension for column-level encryption on `credentials.secret_hash` and `credentials.vault_path`
- TLS enforcement: Express middleware rejects plain-HTTP requests in production
- Secrets rotation: cron-based job that triggers credential rotation reminders and Vault lease renewals
- Security alerting: Prometheus alerting rules for auth failure spikes, rate limit exhaustion, anomalous token issuance
- Audit log immutability: Merkle hash chain (each row's hash includes the previous row's hash)
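The TLS-enforcement control can be sketched as Express-shaped middleware, typed structurally so the sketch stands alone. Honoring `X-Forwarded-Proto` assumes production sits behind a TLS-terminating proxy; the 403 body shape is also an assumption:

```typescript
// Minimal structural types standing in for Express's Request/Response.
type Req = { secure?: boolean; headers: Record<string, string | undefined> };
type Res = { status(code: number): { json(body: unknown): void } };

// Reject any request that did not arrive over HTTPS when running in
// production; pass everything through otherwise.
function requireTls(env: string) {
  return (req: Req, res: Res, next: () => void): void => {
    const forwarded = req.headers["x-forwarded-proto"];
    const isHttps = req.secure === true || forwarded === "https";
    if (env === "production" && !isHttps) {
      res.status(403).json({ error: "https_required" });
      return;
    }
    next();
  };
}
```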
---
### D7: Audit Log Immutability — Merkle Hash Chain
**Status**: Accepted
**Decision**: Each `audit_logs` row carries a `hash` field: `SHA-256(eventId + timestamp + action + outcome + agentId + previousHash)`. The chain starts with a genesis hash. Verification is a sequential pass over all rows in insertion order.
**Rationale**: Append-only logs in PostgreSQL can be altered by a DBA with sufficient access. A Merkle-style hash chain makes tampering detectable without requiring a blockchain. Any modification to a historical row breaks the chain from that point forward. Verification is a simple sequential computation that can be run on demand or as a scheduled integrity check.
**Alternatives Considered**:
| Option | Pros | Cons | Rejected because |
|--------|------|------|-----------------|
| Merkle hash chain in PostgreSQL (chosen) | No new infra, tamper-evident, verifiable | DBA can re-compute hashes after tampering if they control the algorithm | Acceptable — threat model is accidental/low-sophistication modification; cryptographic chain deters opportunistic tampering |
| Blockchain anchor | Cryptographically immutable | Blockchain dependency, cost, latency | Excessive for current threat model |
| Write-once S3/GCS export | External immutability | Delayed; operational complexity | Added complexity; hash chain provides continuous coverage |
**Consequences**:
- New `hash` (VARCHAR 64) and `previous_hash` (VARCHAR 64) columns on `audit_logs`
- `AuditService.create()` computes hash before insert — adds ~1ms latency per audit event
- New `GET /audit/verify` endpoint: returns chain integrity status (admin only)
- A PostgreSQL trigger on `audit_logs` rejects `UPDATE` and `DELETE`, making the table append-only at the database level
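The hash formula and the sequential verification pass can be sketched directly from the decision above; the all-zeros genesis value and the exact field concatenation order are assumptions of this sketch:

```typescript
import { createHash } from "node:crypto";

const GENESIS_HASH = "0".repeat(64); // genesis value is an assumption

interface AuditRow {
  eventId: string; timestamp: string; action: string;
  outcome: string; agentId: string; previousHash: string; hash: string;
}

// SHA-256 over the concatenated fields plus the previous row's hash,
// per the formula in this ADR.
function rowHash(r: Omit<AuditRow, "hash">): string {
  return createHash("sha256")
    .update(r.eventId + r.timestamp + r.action + r.outcome + r.agentId + r.previousHash)
    .digest("hex");
}

// Sequential integrity check, as run by GET /audit/verify: every row must
// link to its predecessor and carry a hash consistent with its own fields.
function verifyChain(rows: AuditRow[]): boolean {
  let prev = GENESIS_HASH;
  for (const row of rows) {
    if (row.previousHash !== prev || row.hash !== rowHash(row)) return false;
    prev = row.hash;
  }
  return true;
}
```

Altering any historical field changes that row's recomputed hash, which breaks both its own `hash` column and every successor's `previousHash` link.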
---
### D8: Organization Context in JWT
**Status**: Accepted
**Decision**: Add `organization_id` claim to JWT access tokens issued by `POST /oauth2/token`. All downstream middleware extracts `organization_id` from the token — no separate lookup required.
**Rationale**: Including `organization_id` in the JWT keeps the middleware stack stateless. The alternative — looking up the organization from the database on every request — adds latency and a database round-trip to every authenticated call. The JWT is already signed; adding a claim costs nothing cryptographically.
**Consequences**:
- `ITokenPayload` interface extended: `organization_id: string`
- All service methods receive `organizationId` from `req.user.organization_id`
- Token introspection response includes `organization_id`
- Agents registered before multi-tenancy belong to the default `system` organization
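The stateless extraction path can be sketched as below. Only `organization_id` comes from this ADR; the other payload fields and the helper name are illustrative, and signature verification is assumed to have already happened in the auth middleware:

```typescript
// Extended token payload carrying the tenant context.
interface ITokenPayload {
  sub: string;
  organization_id: string;
  exp: number;
}

// Decode the payload segment of an already-verified JWT to pull the org
// context -- no database round-trip needed.
function organizationFromToken(jwt: string): string {
  const payloadB64 = jwt.split(".")[1];
  const payload: ITokenPayload = JSON.parse(
    Buffer.from(payloadB64, "base64url").toString(),
  );
  return payload.organization_id;
}
```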
---
## Component Interaction Map (Phase 3)
```
┌──────────────────────┐
│ Web Dashboard │
│ (+ Org Mgmt pages) │
└──────────┬───────────┘
│ HTTPS
┌───────────────────────▼─────────────────────────────┐
│ AgentIdP Server │
│ │
│ TLS MW → Auth MW → OrgContext MW → OPA MW │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────────────┐ │
│ │ OrgService│ │DIDService │ │ OIDCProvider │ │
│ └───────────┘ └───────────┘ │ (oidc-provider) │ │
│ ┌───────────┐ ┌───────────┐ └───────────────────┘ │
│ │ FedService│ │WebhookSvc │ │
│ └───────────┘ └───────────┘ │
│ ┌─────────────────────────┐ │
│ │ SOC2Controls (cross-cut)│ │
│ └─────────────────────────┘ │
└──────────┬──────────────┬──────────────┬─────────────┘
│ │ │
┌────────▼──┐ ┌───────▼──┐ ┌──────▼──────┐
│PostgreSQL │ │ Redis │ │ Vault │
│ + RLS │ │ +bull Q │ │ (secrets) │
└───────────┘ └──────────┘ └─────────────┘
┌────────▼──────┐
│ Prometheus │
│ + Alerting │
└────────┬──────┘
┌────────▼──────┐
│ Grafana │
└───────────────┘
```