# Phase 3: Enterprise — Technical Design

Date: 2026-03-29
Author: Virtual Architect
Status: Draft — pending CEO approval of proposal
## Architecture Overview
Phase 3 transforms AgentIdP from a single-tenant OAuth 2.0 server into a multi-tenant, W3C DID-issuing, OIDC-compliant, federated enterprise identity platform. The architecture remains monolithic Express (no microservices split) to avoid operational complexity, but clear service boundaries are enforced internally.
```
┌────────────────────────────────────────────────────┐
│             AgentIdP Server (Express)              │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │          Middleware Stack (ordered)          │  │
│  │  TLS Enforcement → Auth → Org Context → OPA  │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐  │
│  │ OrgSvc   │ │ DIDSvc   │ │ OIDCSvc  │ │FedSvc │  │
│  └──────────┘ └──────────┘ └──────────┘ └───────┘  │
│  ┌──────────┐ ┌──────────┐                         │
│  │ WebhookQ │ │ SOC2Ctrl │                         │
│  └──────────┘ └──────────┘                         │
└────────────────────────────────────────────────────┘
         │              │             │
   ┌─────▼─────┐  ┌─────▼───┐  ┌─────▼─────┐
   │PostgreSQL │  │  Redis  │  │   Vault   │
   │(org rows) │  │(webhook │  │ (secrets) │
   └───────────┘  │ queue)  │  └───────────┘
                  └─────────┘
```
## Architectural Decision Records
### D1: Multi-Tenancy Model
Status: Accepted
Decision: Row-level tenancy — add organization_id (UUID, NOT NULL) to every domain table. No schema-per-tenant, no database-per-tenant.
Rationale: Row-level tenancy is operationally the simplest approach: a single database, a single schema, a single connection pool. All queries are augmented with an organization_id filter extracted from the authenticated JWT. PostgreSQL Row-Level Security (RLS) is enabled on all tenant-scoped tables as a defense-in-depth measure — even if the application filter is accidentally omitted, the database enforces isolation.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| Schema-per-tenant | Strong isolation, independent migrations | Complex migration tooling, connection pool explosion at scale | Operational overhead exceeds threat model requirement |
| Database-per-tenant | Maximum isolation | Separate connection pool, backup, monitoring per tenant | Prohibitive at 100+ orgs; overkill for our threat model |
| Row-level (chosen) | Simple, fast, single migration path | RLS must be enforced consistently | Chosen — enforce via both application and RLS |
Consequences:
- Every domain table gets an `organization_id` column and a corresponding index
- All service methods accept `organizationId: string` as a required parameter
- JWT payload extended to include an `organization_id` claim
- Existing single-tenant data migrated to a default `system` organization
- PostgreSQL RLS policies written for all tenant tables
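The defense-in-depth RLS enforcement described in this decision can be sketched in PostgreSQL DDL. This is a minimal sketch for a hypothetical `agents` table; the session-variable name `app.organization_id` and the policy name are assumptions, not taken from the actual migrations.

```sql
-- Sketch only: table, policy, and session-variable names are assumptions.
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;
ALTER TABLE agents FORCE ROW LEVEL SECURITY;  -- enforce even for the table owner

CREATE POLICY agents_tenant_isolation ON agents
  USING (organization_id = current_setting('app.organization_id')::uuid);
```

The application would set the variable per transaction before running tenant-scoped queries, e.g. `SELECT set_config('app.organization_id', $1, true);`, so that even a query whose explicit `WHERE organization_id = ...` filter was accidentally omitted cannot cross tenant boundaries.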
### D2: DID Method Selection
Status: Accepted
Decision: did:web — DID Documents served over HTTPS at well-known and per-agent URLs.
Rationale: did:web requires no blockchain, no ledger, and no external infrastructure beyond the HTTPS server already running. It is W3C DID Core 1.0 compliant, supported by all major DID resolvers, and is the preferred method for enterprise deployments where an organization controls its own domain. It aligns directly with the did:web identifier scheme used in AGNTCY agent card specifications.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| `did:web` (chosen) | No blockchain, HTTPS-based, enterprise-friendly | DID tied to domain; moving the domain invalidates DIDs | Accepted tradeoff — enterprise deployments have stable domains |
| `did:key` | Self-contained, no infrastructure | Not anchored — anyone can generate any `did:key`; no discovery | No trust anchor; not suitable for enterprise identity |
| `did:ethr` | Ethereum-anchored, decentralized | Blockchain dependency, gas costs, not enterprise-standard | Blockchain dependency is a non-starter for regulated enterprises |
Consequences:
- DID for the AgentIdP instance: `did:web:<hostname>`
- DID for an agent: `did:web:<hostname>:agents:<agentId>`
- DID Documents served at `/.well-known/did.json` and `/agents/:id/did`
- Domain change requires DID migration — document this in the ops runbook
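The identifier scheme above can be sketched with a few helpers. The function names are hypothetical (not from the codebase); the URL mapping follows the standard did:web resolution rules, where the bare-domain document lives at `/.well-known/did.json` and each colon-separated segment after the hostname becomes a URL path segment with `did.json` appended.

```typescript
// Hypothetical helpers illustrating the did:web identifiers chosen in D2.

// DID for the AgentIdP instance itself.
function instanceDid(hostname: string): string {
  return `did:web:${hostname}`;
}

// DID for an individual agent hosted by this instance.
function agentDid(hostname: string, agentId: string): string {
  return `did:web:${hostname}:agents:${agentId}`;
}

// Canonical did:web resolution: colon-separated path segments become URL path
// segments; a bare-domain DID resolves to /.well-known/did.json.
function didDocumentUrl(did: string): string {
  const [, , host, ...path] = did.split(":");
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join("/")}/did.json`;
}
```

Note that the design's `/agents/:id/did` route differs slightly from the canonical `.../did.json` resolution path, so the server would likely need to serve the document at both paths, or redirect one to the other.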
### D3: OIDC Library Selection
Status: Accepted
Decision: oidc-provider npm package — a certified, RFC-compliant OIDC server library.
Rationale: oidc-provider is the most widely deployed Node.js OIDC library, passing the OpenID Foundation's official conformance test suite. Building OIDC from scratch on top of our existing JWT infrastructure would require implementing Discovery, JWKS rotation, ID token construction, and claim aggregation correctly against multiple RFCs. The certified library eliminates that risk and reduces implementation surface area. It integrates cleanly with Express as a mounted middleware.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| `oidc-provider` (chosen) | Certified, RFC-complete, actively maintained | Adds a significant dependency | Risk of non-compliance from custom implementation outweighs dependency cost |
| Custom JWT extension | Full control, no new dependency | High risk of spec deviation; ID token, Discovery, JWKS are complex | RFC compliance cannot be self-certified |
| Keycloak sidecar | Battle-tested, full-featured | Heavyweight Java service; architectural mismatch | Not Node.js; adds operational complexity |
Consequences:
- `oidc-provider` is mounted at `/oidc` in Express
- OIDC Discovery served at `/.well-known/openid-configuration` (proxied from `oidc-provider`)
- JWKS served at `/.well-known/jwks.json`
- Adapter written to store OIDC sessions in Redis (`oidc-provider`'s adapter interface)
- Existing `POST /oauth2/token` route extended, not replaced — maintains backward compatibility
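The JWKS document served at `/.well-known/jwks.json` can be derived from the signing key pair with Node's standard library. This is only a sketch of the JWK shape clients will see — in practice `oidc-provider` manages its own keystore, and the `kid`, `alg`, and `use` values here are illustrative assumptions.

```typescript
import { generateKeyPairSync } from "node:crypto";

// Generate an RSA signing key pair (stand-in for the server's real key).
const { publicKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });

// KeyObject.export({ format: "jwk" }) yields the public JWK (kty, n, e) —
// only public material, safe to publish.
const jwk = publicKey.export({ format: "jwk" });

// Shape of the document served at /.well-known/jwks.json.
const jwks = {
  keys: [{ ...jwk, use: "sig", alg: "RS256", kid: "primary" }],
};
```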
### D4: Federation Protocol
Status: Accepted
Decision: Signed JWT assertions — remote AgentIdP instances present a signed JWT; the receiving instance verifies the signature against the registered JWKS of the issuing instance.
Rationale: JWT assertion federation reuses the existing JWT infrastructure (jsonwebtoken, JWKS endpoint from OIDC workstream). No new protocol is introduced. The trust model is explicit: operators register partner instances with their JWKS URL. This aligns with RFC 7523 (JWT Profile for OAuth 2.0 Client Authentication) and the AGNTCY inter-agent trust model.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| Signed JWT assertions (chosen) | Uses existing JWT infra, explicit trust registry, RFC-aligned | JWKS URL must be reachable at verification time | Acceptable operational constraint; JWKS can be cached |
| mTLS | Strong cryptographic identity | Certificate management overhead, PKI required per partner | Cert management complexity not justified when JWT assertions suffice |
| AGNTCY-specific protocol | Native alignment | Spec still evolving; risk of churn | Build on stable JWT base; adapt to AGNTCY extensions as spec matures |
Consequences:
- New `federation_partners` table: `id`, `name`, `jwks_url`, `issuer`, `trusted_since`, `organization_id`
- JWKS of partner instances cached in Redis with a TTL
- `POST /federation/verify` accepts a bearer token from a remote instance and returns the verification result
- Federation tokens are not accepted for agent management endpoints — only for identity assertion
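The trust-registry half of federation verification can be sketched as below. This assumes the assertion's signature has already been checked against the partner's cached JWKS (e.g. with a JOSE library); this step only enforces the explicit trust registry and token freshness. All names are illustrative, loosely mirroring the `federation_partners` columns.

```typescript
// Illustrative types mirroring the federation_partners table.
interface FederationPartner {
  id: string;
  name: string;
  issuer: string;
  jwksUrl: string;
  organizationId: string;
}

// Minimal subset of RFC 7519 claims needed for the registry check.
interface AssertionClaims {
  iss?: string;
  exp?: number; // seconds since epoch
}

// Returns the registered partner for a verified assertion, or null if the
// issuer is unknown or the token is expired.
function findTrustedPartner(
  registry: FederationPartner[],
  claims: AssertionClaims,
  nowSec: number,
): FederationPartner | null {
  if (!claims.iss) return null;                          // issuer claim required
  if (!claims.exp || claims.exp <= nowSec) return null;  // missing exp or expired
  return registry.find((p) => p.issuer === claims.iss) ?? null;
}
```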
### D5: Webhook Delivery Architecture
Status: Accepted
Decision: Async delivery via Redis-backed bull queue with exponential backoff retry (max 10 attempts over 24 hours).
Rationale: Synchronous webhook delivery from within a request handler would add latency and create tight coupling between event generation and delivery outcome. The Redis queue (bull) decouples delivery: events are enqueued immediately, a background worker delivers them. bull provides built-in retry, delay, and failure tracking without introducing a new infrastructure component (Redis is already present). HMAC-SHA256 signing on every delivery allows recipients to verify authenticity.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| Redis queue via `bull` (chosen) | Reuses existing Redis, retry built-in, low operational overhead | Delivery tied to Redis availability | Acceptable — Redis is already a required dependency |
| Synchronous in-request delivery | Simplest implementation | Adds latency to event-generating requests; failure blocks response | Unacceptable latency and coupling |
| Dedicated message broker (RabbitMQ) | Robust, durable | New infrastructure dependency | Operational overhead; Redis already present |
| Kafka (primary) | High-throughput, durable | Overkill for webhook delivery; complex operations | Optional adapter only; not primary delivery mechanism |
Consequences:
- New `webhook_subscriptions` and `webhook_deliveries` tables
- `bull` worker process runs in the same Node.js instance (separate worker thread via `bull`)
- Retry schedule: 1m, 5m, 15m, 1h, 4h, 12h, 24h (exponential backoff)
- Failed delivery after 10 attempts moves to a dead-letter queue; the operator is alerted
- Optional Kafka adapter: if the `KAFKA_BROKERS` env var is set, events are also produced to Kafka
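The HMAC-SHA256 signing mentioned in the rationale can be sketched with Node's standard library. The design only specifies that every delivery is signed; the exact signed payload and the hex encoding here are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: sign the serialized webhook body with the subscription secret.
function signWebhook(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Recipient side: recompute and compare; timingSafeEqual avoids leaking the
// correct signature through timing differences.
function verifyWebhook(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signWebhook(body, secret), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```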
### D6: SOC 2 Scope
Status: Accepted
Decision: Target SOC 2 Type II (operational, not just design). All controls implemented in code. Audit period: 6 months post-Phase 3 launch.
Rationale: SOC 2 Type I certifies that controls are designed correctly. SOC 2 Type II certifies that they operate continuously over a period of time. Enterprise customers in regulated industries (finance, healthcare, government) require Type II. Implementing the controls now, with the 6-month operational window beginning at Phase 3 launch, puts us on the fastest possible path to Type II certification.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| Type II from launch (chosen) | Satisfies enterprise requirements | Requires 6-month operation window | Accepted — the controls are implemented in Phase 3; audit window starts after launch |
| Type I only | Faster to certify | Not accepted by most enterprise procurement | Insufficient for target customers |
| ISO 27001 instead | International standard | Larger scope, longer implementation | SOC 2 is standard for US market; add ISO 27001 in Phase 4 |
Consequences:
- Encryption at rest: `pgcrypto` extension for column-level encryption on `credentials.secret_hash` and `credentials.vault_path`
- TLS enforcement: Express middleware rejects plain-HTTP requests in production
- Secrets rotation: cron-based job that triggers credential rotation reminders and Vault lease renewals
- Security alerting: Prometheus alerting rules for auth failure spikes, rate limit exhaustion, and anomalous token issuance
- Audit log immutability: Merkle hash chain (each row's hash includes the previous row's hash)
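The TLS-enforcement control can be sketched as Express-style middleware. Minimal structural types stand in for Express's `Request`/`Response` so the sketch runs standalone; the `x-forwarded-proto` check assumes the server may sit behind a TLS-terminating proxy, which the design does not specify.

```typescript
// Minimal stand-ins for Express's Request/Response types.
type Req = { secure?: boolean; headers: Record<string, string | undefined> };
type Res = { statusCode?: number; status(code: number): Res; json(body: unknown): void };

// Reject any request that did not arrive over TLS, directly or via a proxy.
function requireTls(req: Req, res: Res, next: () => void): void {
  const proto = req.headers["x-forwarded-proto"];
  if (req.secure || proto === "https") {
    next(); // request arrived over TLS
    return;
  }
  res.status(403).json({ error: "https_required" });
}
```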
### D7: Audit Log Immutability — Merkle Hash Chain
Status: Accepted
Decision: Each audit_logs row carries a hash field: SHA-256(eventId + timestamp + action + outcome + agentId + previousHash). The chain starts with a genesis hash. Verification is a sequential pass over all rows in insertion order.
Rationale: Append-only logs in PostgreSQL can be altered by a DBA with sufficient access. A Merkle-style hash chain makes tampering detectable without requiring a blockchain. Any modification to a historical row breaks the chain from that point forward. Verification is a simple sequential computation that can be run on demand or as a scheduled integrity check.
Alternatives Considered:
| Option | Pros | Cons | Rejected because |
|---|---|---|---|
| Merkle hash chain in PostgreSQL (chosen) | No new infra, tamper-evident, verifiable | DBA can re-compute hashes after tampering if they control the algorithm | Acceptable — threat model is accidental/low-sophistication modification; cryptographic chain deters opportunistic tampering |
| Blockchain anchor | Cryptographically immutable | Blockchain dependency, cost, latency | Excessive for current threat model |
| Write-once S3/GCS export | External immutability | Delayed; operational complexity | Added complexity; hash chain provides continuous coverage |
Consequences:
- New `hash` (VARCHAR 64) and `previous_hash` (VARCHAR 64) columns on `audit_logs`
- `AuditService.create()` computes the hash before insert — adds ~1ms latency per audit event
- New `GET /audit/verify` endpoint: returns chain integrity status (admin only)
- A PostgreSQL trigger makes `audit_logs` insert-only, preventing `UPDATE` and `DELETE`
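The hash chain and its sequential verification pass can be sketched as follows. The field order matches the decision — SHA-256 over `eventId + timestamp + action + outcome + agentId + previousHash` — but the genesis value of 64 zero characters is an assumption; the design only says the chain starts with a genesis hash.

```typescript
import { createHash } from "node:crypto";

// Assumed genesis value; the design only specifies that one exists.
const GENESIS_HASH = "0".repeat(64);

interface AuditEvent {
  eventId: string;
  timestamp: string;
  action: string;
  outcome: string;
  agentId: string;
}

// Hash field order follows the D7 decision exactly.
function chainHash(e: AuditEvent, previousHash: string): string {
  return createHash("sha256")
    .update(e.eventId + e.timestamp + e.action + e.outcome + e.agentId + previousHash)
    .digest("hex");
}

// Sequential integrity pass over rows in insertion order, as GET /audit/verify
// would run it. Any modified historical row breaks the chain from that point.
function verifyChain(rows: Array<AuditEvent & { hash: string }>): boolean {
  let prev = GENESIS_HASH;
  for (const row of rows) {
    if (chainHash(row, prev) !== row.hash) return false; // chain broken here
    prev = row.hash;
  }
  return true;
}
```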
### D8: Organization Context in JWT
Status: Accepted
Decision: Add organization_id claim to JWT access tokens issued by POST /oauth2/token. All downstream middleware extracts organization_id from the token — no separate lookup required.
Rationale: Including organization_id in the JWT keeps the middleware stack stateless. The alternative — looking up the organization from the database on every request — adds latency and a database round-trip to every authenticated call. The JWT is already signed; adding a claim costs nothing cryptographically.
Consequences:
- `ITokenPayload` interface extended with `organization_id: string`
- All service methods receive `organizationId` from `req.user.organization_id`
- Token introspection response includes `organization_id`
- Agents registered before multi-tenancy belong to the default `system` organization
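The OrgContext middleware implied by this decision can be sketched as below, again with minimal structural types in place of Express's. It assumes the auth middleware has already verified the JWT and attached its payload as `req.user`; the 403 response shape is an illustrative assumption.

```typescript
// Subset of the token payload relevant to this decision.
interface TokenPayload {
  sub: string;
  organization_id?: string;
}

// Minimal stand-ins for Express's Request/Response types.
type OrgReq = { user?: TokenPayload; organizationId?: string };
type OrgRes = { statusCode?: number; status(code: number): OrgRes; json(body: unknown): void };

// Copy the organization_id claim onto the request so downstream services
// never need a database lookup; reject tokens that lack the claim.
function orgContext(req: OrgReq, res: OrgRes, next: () => void): void {
  const orgId = req.user?.organization_id;
  if (!orgId) {
    res.status(403).json({ error: "missing_organization_context" });
    return;
  }
  req.organizationId = orgId; // downstream services read organizationId here
  next();
}
```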
## Component Interaction Map (Phase 3)
```
              ┌──────────────────────┐
              │    Web Dashboard     │
              │  (+ Org Mgmt pages)  │
              └──────────┬───────────┘
                         │ HTTPS
┌────────────────────────▼───────────────────────────┐
│                  AgentIdP Server                    │
│                                                     │
│   TLS MW → Auth MW → OrgContext MW → OPA MW         │
│                                                     │
│  ┌───────────┐ ┌───────────┐ ┌───────────────────┐  │
│  │ OrgService│ │DIDService │ │   OIDCProvider    │  │
│  └───────────┘ └───────────┘ │  (oidc-provider)  │  │
│  ┌───────────┐ ┌───────────┐ └───────────────────┘  │
│  │ FedService│ │WebhookSvc │                        │
│  └───────────┘ └───────────┘                        │
│        ┌─────────────────────────┐                  │
│        │ SOC2Controls (cross-cut)│                  │
│        └─────────────────────────┘                  │
└──────────┬──────────────┬──────────────┬────────────┘
           │              │              │
     ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
     │PostgreSQL │  │   Redis   │  │   Vault   │
     │  + RLS    │  │ + bull Q  │  │ (secrets) │
     └───────────┘  └───────────┘  └───────────┘
           │
     ┌─────▼─────┐
     │Prometheus │
     │+ Alerting │
     └───────────┘
           │
     ┌─────▼─────┐
     │  Grafana  │
     └───────────┘
```