
Phase 3: Enterprise — Technical Design

Date: 2026-03-29
Author: Virtual Architect
Status: Draft — pending CEO approval of proposal


Architecture Overview

Phase 3 transforms AgentIdP from a single-tenant OAuth 2.0 server into a multi-tenant, W3C DID-issuing, OIDC-compliant, federated enterprise identity platform. The architecture remains a single monolithic Express application (no microservices split) to avoid operational complexity, but clear service boundaries are enforced internally.

                         ┌──────────────────────────────────────────────────────┐
                         │                AgentIdP Server (Express)             │
                         │                                                      │
                         │  ┌────────────────────────────────────────────────┐  │
                         │  │  Middleware Stack (ordered)                    │  │
                         │  │  TLS Enforcement → Auth → Org Context → OPA   │  │
                         │  └────────────────────────────────────────────────┘  │
                         │                                                      │
                         │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐  │
                         │  │ OrgSvc   │ │ DIDSvc   │ │OIDCSvc   │ │FedSvc │  │
                         │  └──────────┘ └──────────┘ └──────────┘ └───────┘  │
                         │  ┌──────────┐ ┌──────────┐                          │
                         │  │ WebhookQ │ │ SOC2Ctrl │                          │
                         │  └──────────┘ └──────────┘                          │
                         └──────────────────────────────────────────────────────┘
                                    │            │          │
                           ┌────────▼──┐  ┌─────▼───┐  ┌──▼──────────┐
                           │PostgreSQL │  │  Redis  │  │    Vault    │
                           │(org rows) │  │(webhook │  │  (secrets)  │
                           └───────────┘  │ queue)  │  └─────────────┘
                                          └─────────┘

Architectural Decision Records


D1: Multi-Tenancy Model

Status: Accepted

Decision: Row-level tenancy — add organization_id (UUID, NOT NULL) to every domain table. No schema-per-tenant, no database-per-tenant.

Rationale: Row-level tenancy is operationally the simplest approach: a single database, a single schema, a single connection pool. All queries are augmented with an organization_id filter extracted from the authenticated JWT. PostgreSQL Row-Level Security (RLS) is enabled on all tenant-scoped tables as a defense-in-depth measure — even if the application filter is accidentally omitted, the database enforces isolation.

Alternatives Considered:

  • Schema-per-tenant. Pros: strong isolation, independent migrations. Cons: complex migration tooling, connection-pool explosion at scale. Rejected: operational overhead exceeds the threat-model requirement.
  • Database-per-tenant. Pros: maximum isolation. Cons: separate connection pool, backup, and monitoring per tenant. Rejected: prohibitive at 100+ orgs; overkill for our threat model.
  • Row-level (chosen). Pros: simple, fast, single migration path. Cons: RLS must be enforced consistently. Chosen: isolation enforced via both the application layer and RLS.

Consequences:

  • Every domain table gets an organization_id column and a corresponding index
  • All service methods accept organizationId: string as a required parameter
  • JWT payload extended to include organization_id claim
  • Existing single-tenant data migrated to a default system organization
  • PostgreSQL RLS policies written for all tenant tables
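
The application-side half of this defense can be sketched as a small query helper. This is illustrative only: the design does not name such a helper, and `scopedQuery` and its signature are assumptions; it merely demonstrates the pattern of appending the organization_id predicate to every query so it cannot be forgotten.

```typescript
// Hypothetical helper for the application-side tenancy filter.
// All service queries are built through it, so the organization_id
// predicate is always present (RLS remains the backstop).
interface TenantQuery {
  text: string;
  values: unknown[];
}

function scopedQuery(
  baseSql: string,
  values: unknown[],
  organizationId: string,
): TenantQuery {
  // Append with AND if the query already has a WHERE clause.
  const clause = /\bwhere\b/i.test(baseSql) ? " AND" : " WHERE";
  const idx = values.length + 1; // next positional parameter ($1-based)
  return {
    text: `${baseSql}${clause} organization_id = $${idx}`,
    values: [...values, organizationId],
  };
}

const q = scopedQuery("SELECT * FROM agents WHERE status = $1", ["active"], "org-123");
// q.text   → "SELECT * FROM agents WHERE status = $1 AND organization_id = $2"
// q.values → ["active", "org-123"]
```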

D2: DID Method Selection

Status: Accepted

Decision: did:web — DID Documents served over HTTPS at well-known and per-agent URLs.

Rationale: did:web requires no blockchain, no ledger, and no external infrastructure beyond the HTTPS server already running. It is W3C DID Core 1.0 compliant, supported by all major DID resolvers, and is the preferred method for enterprise deployments where an organization controls its own domain. It aligns directly with the did:web identifier scheme used in AGNTCY agent card specifications.

Alternatives Considered:

  • did:web (chosen). Pros: no blockchain, HTTPS-based, enterprise-friendly. Cons: DID tied to domain; moving the domain invalidates DIDs. Accepted tradeoff: enterprise deployments have stable domains.
  • did:key. Pros: self-contained, no infrastructure. Cons: not anchored (anyone can generate any did:key); no discovery. Rejected: no trust anchor; not suitable for enterprise identity.
  • did:ethr. Pros: Ethereum-anchored, decentralized. Cons: blockchain dependency, gas costs, not enterprise-standard. Rejected: blockchain dependency is a non-starter for regulated enterprises.

Consequences:

  • DID for the AgentIdP instance: did:web:<hostname>
  • DID for an agent: did:web:<hostname>:agents:<agentId>
  • DID Documents served at /.well-known/did.json and /agents/:id/did
  • Domain change requires DID migration — document this in ops runbook
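
The identifier scheme above is pure string construction, and the W3C did:web method defines a deterministic DID-to-URL mapping (colons after the host become path separators, percent-encoded segments are decoded, and a did.json filename is appended). A minimal sketch, with hostnames chosen for illustration:

```typescript
// DID for the AgentIdP instance and for an individual agent,
// following the scheme in D2.
function instanceDid(hostname: string): string {
  return `did:web:${hostname}`;
}

function agentDid(hostname: string, agentId: string): string {
  return `did:web:${hostname}:agents:${agentId}`;
}

// Standard did:web resolution per the W3C did:web method spec:
// a bare-domain DID resolves to /.well-known/did.json, a DID with
// path segments resolves to <path>/did.json.
function didWebToUrl(did: string): string {
  const parts = did.replace(/^did:web:/, "").split(":").map(decodeURIComponent);
  const [host, ...path] = parts;
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join("/")}/did.json`;
}
```

Note that standard resolvers derive a `.../did.json` URL for agent DIDs, so the `/agents/:id/did` route listed above would also need to serve (or redirect from) `/agents/:id/did.json` for off-the-shelf resolvers to work; this is an observation, not part of the design text.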

D3: OIDC Library Selection

Status: Accepted

Decision: oidc-provider npm package — a certified, RFC-compliant OIDC server library.

Rationale: oidc-provider is the most widely deployed Node.js OIDC library, passing the OpenID Foundation's official conformance test suite. Building OIDC from scratch on top of our existing JWT infrastructure would require implementing Discovery, JWKS rotation, ID token construction, and claim aggregation correctly against multiple RFCs. The certified library eliminates that risk and reduces implementation surface area. It integrates cleanly with Express as a mounted middleware.

Alternatives Considered:

  • oidc-provider (chosen). Pros: certified, RFC-complete, actively maintained. Cons: adds a significant dependency. Chosen: the risk of non-compliance in a custom implementation outweighs the dependency cost.
  • Custom JWT extension. Pros: full control, no new dependency. Cons: high risk of spec deviation; ID token, Discovery, and JWKS are complex. Rejected: RFC compliance cannot be self-certified.
  • Keycloak sidecar. Pros: battle-tested, full-featured. Cons: heavyweight Java service; architectural mismatch. Rejected: not Node.js; adds operational complexity.

Consequences:

  • oidc-provider is mounted at /oidc in Express
  • OIDC Discovery served at /.well-known/openid-configuration (proxied from oidc-provider)
  • JWKS served at /.well-known/jwks.json
  • Adapter written to store OIDC sessions in Redis (oidc-provider's adapter interface)
  • Existing POST /oauth2/token route extended, not replaced — maintains backward compatibility

D4: Federation Protocol

Status: Accepted

Decision: Signed JWT assertions — remote AgentIdP instances present a signed JWT; the receiving instance verifies the signature against the registered JWKS of the issuing instance.

Rationale: JWT assertion federation reuses the existing JWT infrastructure (jsonwebtoken, JWKS endpoint from OIDC workstream). No new protocol is introduced. The trust model is explicit: operators register partner instances with their JWKS URL. This aligns with RFC 7523 (JWT Profile for OAuth 2.0 Client Authentication) and the AGNTCY inter-agent trust model.

Alternatives Considered:

  • Signed JWT assertions (chosen). Pros: uses existing JWT infra, explicit trust registry, RFC-aligned. Cons: the partner's JWKS URL must be reachable at verification time. Accepted: an acceptable operational constraint; JWKS can be cached.
  • mTLS. Pros: strong cryptographic identity. Cons: certificate management overhead; PKI required per partner. Rejected: cert-management complexity is not justified when JWT assertions suffice.
  • AGNTCY-specific protocol. Pros: native alignment. Cons: spec still evolving; risk of churn. Rejected: build on the stable JWT base; adapt to AGNTCY extensions as the spec matures.

Consequences:

  • New federation_partners table: id, name, jwks_url, issuer, trusted_since, organization_id
  • JWKS of partner instances cached in Redis with TTL
  • POST /federation/verify accepts a bearer token from a remote instance and returns verification result
  • Federation tokens are not accepted for agent management endpoints — only for identity assertion
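
The verification step behind POST /federation/verify can be sketched with Node's built-in crypto. In production the design would use the existing jsonwebtoken library and a key fetched from the partner's cached JWKS; this self-contained sketch signs and verifies an RS256 assertion directly, and the claim names beyond the standard iss/sub/exp are illustrative.

```typescript
import { createSign, createVerify, generateKeyPairSync, KeyObject } from "node:crypto";

const b64url = (s: string | Buffer): string => Buffer.from(s).toString("base64url");

// What a partner instance would send: a compact JWS over its claims.
function signAssertion(payload: object, privateKey: KeyObject): string {
  const header = b64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signer = createSign("RSA-SHA256");
  signer.update(`${header}.${body}`);
  return `${header}.${body}.${signer.sign(privateKey).toString("base64url")}`;
}

// Receiving side: check the signature against the partner's registered
// key, then check issuer (must match the federation_partners registry)
// and expiry. Returns the claims on success, null on any failure.
function verifyAssertion(token: string, publicKey: KeyObject, expectedIssuer: string) {
  const [header, body, sig] = token.split(".");
  const verifier = createVerify("RSA-SHA256");
  verifier.update(`${header}.${body}`);
  if (!verifier.verify(publicKey, Buffer.from(sig, "base64url"))) return null;
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  if (claims.iss !== expectedIssuer) return null; // issuer not in trust registry
  if (claims.exp * 1000 < Date.now()) return null; // expired assertion
  return claims;
}

const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });
const token = signAssertion(
  {
    iss: "https://partner-idp.example.com",
    sub: "did:web:partner-idp.example.com:agents:a1",
    exp: Math.floor(Date.now() / 1000) + 60,
  },
  privateKey,
);
const claims = verifyAssertion(token, publicKey, "https://partner-idp.example.com");
```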

D5: Webhook Delivery Architecture

Status: Accepted

Decision: Async delivery via Redis-backed bull queue with exponential backoff retry (max 10 attempts over 24 hours).

Rationale: Synchronous webhook delivery from within a request handler would add latency and create tight coupling between event generation and delivery outcome. The Redis queue (bull) decouples delivery: events are enqueued immediately, a background worker delivers them. bull provides built-in retry, delay, and failure tracking without introducing a new infrastructure component (Redis is already present). HMAC-SHA256 signing on every delivery allows recipients to verify authenticity.

Alternatives Considered:

  • Redis queue via bull (chosen). Pros: reuses existing Redis, retry built in, low operational overhead. Cons: delivery tied to Redis availability. Accepted: Redis is already a required dependency.
  • Synchronous in-request delivery. Pros: simplest implementation. Cons: adds latency to event-generating requests; a failure blocks the response. Rejected: unacceptable latency and coupling.
  • Dedicated message broker (RabbitMQ). Pros: robust, durable. Cons: new infrastructure dependency. Rejected: operational overhead; Redis is already present.
  • Kafka as primary. Pros: high throughput, durable. Cons: overkill for webhook delivery; complex operations. Rejected as primary: offered as an optional adapter only.

Consequences:

  • New webhook_subscriptions and webhook_deliveries tables
  • The bull worker runs inside the same Node.js instance as the API (an in-process bull processor, not a separate service)
  • Retry schedule: 1m, 5m, 15m, 1h, 4h, 12h, 24h (exponential backoff)
  • Failed delivery after 10 attempts moves to dead-letter; operator alerted
  • Optional Kafka adapter: if KAFKA_BROKERS env var is set, events are also produced to Kafka
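
Two of these pieces are small enough to sketch directly: the HMAC-SHA256 delivery signature and the retry-delay lookup. The header name and function names are illustrative, not from the design; the delay values are the schedule listed above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Signature the worker attaches to each delivery, e.g. in an
// X-Webhook-Signature header (header name is an assumption).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Recipient-side check: constant-time comparison avoids timing leaks.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}

// Retry schedule from D5: 1m, 5m, 15m, 1h, 4h, 12h, 24h.
const RETRY_DELAYS_MS = [1, 5, 15, 60, 240, 720, 1440].map((m) => m * 60_000);

// Delay before retry N (1-based); later attempts reuse the last delay.
function delayForAttempt(attempt: number): number {
  return RETRY_DELAYS_MS[Math.min(attempt - 1, RETRY_DELAYS_MS.length - 1)];
}
```

With bull, a custom schedule like this is typically supplied via a backoff function in the queue settings rather than the built-in exponential strategy, since the listed delays are not a strict geometric series.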

D6: SOC 2 Scope

Status: Accepted

Decision: Target SOC 2 Type II (operational, not just design). All controls implemented in code. Audit period: 6 months post-Phase 3 launch.

Rationale: SOC 2 Type I certifies that controls are designed correctly. SOC 2 Type II certifies that they operate continuously over a period of time. Enterprise customers in regulated industries (finance, healthcare, government) require Type II. Implementing the controls now, with the 6-month operational window beginning at Phase 3 launch, puts us on the fastest possible path to Type II certification.

Alternatives Considered:

  • Type II from launch (chosen). Pros: satisfies enterprise requirements. Cons: requires a 6-month operating window. Accepted: the controls are implemented in Phase 3; the audit window starts after launch.
  • Type I only. Pros: faster to certify. Cons: not accepted by most enterprise procurement. Rejected: insufficient for target customers.
  • ISO 27001 instead. Pros: international standard. Cons: larger scope, longer implementation. Rejected: SOC 2 is the standard for the US market; add ISO 27001 in Phase 4.

Consequences:

  • Encryption at rest: pgcrypto extension for column-level encryption on credentials.secret_hash and credentials.vault_path
  • TLS enforcement: Express middleware rejects HTTP requests (not HTTPS) in production
  • Secrets rotation: cron-based job that triggers credential rotation reminders and Vault lease renewals
  • Security alerting: Prometheus alerting rules for auth failure spikes, rate limit exhaustion, anomalous token issuance
  • Audit log immutability: Merkle hash chain (each row's hash includes the previous row's hash)
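
The TLS-enforcement control lends itself to a short sketch. The request shape and the 403 status are assumptions for illustration; Express sets req.secure on TLS connections, and the x-forwarded-proto header covers deployments behind a reverse proxy.

```typescript
// Minimal TLS-enforcement check, framework-agnostic but shaped to
// drop into an Express middleware.
interface RequestLike {
  secure?: boolean; // set by Express when the connection itself is TLS
  headers: Record<string, string | undefined>;
}

function isTls(req: RequestLike): boolean {
  return req.secure === true || req.headers["x-forwarded-proto"] === "https";
}

// Plain HTTP is rejected only in production; dev and test may use HTTP.
function enforceTls(
  req: RequestLike,
  env = process.env.NODE_ENV,
): { allow: boolean; status?: number } {
  if (env !== "production") return { allow: true };
  return isTls(req) ? { allow: true } : { allow: false, status: 403 };
}
```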

D7: Audit Log Immutability — Merkle Hash Chain

Status: Accepted

Decision: Each audit_logs row carries a hash field: SHA-256(eventId + timestamp + action + outcome + agentId + previousHash). The chain starts with a genesis hash. Verification is a sequential pass over all rows in insertion order.

Rationale: Append-only logs in PostgreSQL can be altered by a DBA with sufficient access. A Merkle-style hash chain makes tampering detectable without requiring a blockchain. Any modification to a historical row breaks the chain from that point forward. Verification is a simple sequential computation that can be run on demand or as a scheduled integrity check.

Alternatives Considered:

  • Merkle hash chain in PostgreSQL (chosen). Pros: no new infra, tamper-evident, verifiable. Cons: a DBA could re-compute hashes after tampering if they control the algorithm. Accepted: the threat model is accidental or low-sophistication modification; the cryptographic chain deters opportunistic tampering.
  • Blockchain anchor. Pros: cryptographically immutable. Cons: blockchain dependency, cost, latency. Rejected: excessive for the current threat model.
  • Write-once S3/GCS export. Pros: external immutability. Cons: delayed; operational complexity. Rejected: added complexity; the hash chain provides continuous coverage.

Consequences:

  • New hash (VARCHAR 64) and previous_hash (VARCHAR 64) columns on audit_logs
  • AuditService.create() computes hash before insert — adds ~1ms latency per audit event
  • New GET /audit/verify endpoint: returns chain integrity status (admin only)
  • A PostgreSQL trigger on audit_logs rejects UPDATE and DELETE, making the table effectively INSERT-only
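
The hash computation and the sequential verification pass follow directly from the formula in the decision. A sketch under stated assumptions (the genesis value and in-memory chain representation are illustrative; in the real system rows live in audit_logs):

```typescript
import { createHash } from "node:crypto";

const GENESIS_HASH = "0".repeat(64); // illustrative genesis value

interface AuditRow {
  eventId: string;
  timestamp: string;
  action: string;
  outcome: string;
  agentId: string;
  hash?: string;
  previousHash?: string;
}

// SHA-256(eventId + timestamp + action + outcome + agentId + previousHash),
// per the D7 formula.
function rowHash(row: AuditRow, previousHash: string): string {
  return createHash("sha256")
    .update(row.eventId + row.timestamp + row.action + row.outcome + row.agentId + previousHash)
    .digest("hex");
}

// What AuditService.create() does before insert: link to the tail hash.
function appendRow(chain: AuditRow[], row: AuditRow): AuditRow[] {
  const previousHash = chain.length ? chain[chain.length - 1].hash! : GENESIS_HASH;
  return [...chain, { ...row, previousHash, hash: rowHash(row, previousHash) }];
}

// The GET /audit/verify pass: recompute every hash in insertion order.
// Any edited historical row breaks the chain from that point forward.
function verifyChain(chain: AuditRow[]): boolean {
  let prev = GENESIS_HASH;
  for (const row of chain) {
    if (row.previousHash !== prev || row.hash !== rowHash(row, prev)) return false;
    prev = row.hash;
  }
  return true;
}
```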

D8: Organization Context in JWT

Status: Accepted

Decision: Add organization_id claim to JWT access tokens issued by POST /oauth2/token. All downstream middleware extracts organization_id from the token — no separate lookup required.

Rationale: Including organization_id in the JWT keeps the middleware stack stateless. The alternative — looking up the organization from the database on every request — adds latency and a database round-trip to every authenticated call. The JWT is already signed; adding a claim costs nothing cryptographically.

Consequences:

  • ITokenPayload interface extended: organization_id: string
  • All service methods receive organizationId from req.user.organization_id
  • Token introspection response includes organization_id
  • Agents registered before multi-tenancy belong to the default system organization
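
The stateless extraction step can be sketched as follows. ITokenPayload is named in the consequences above; the other claim fields, the function name, and the 403 status are assumptions for illustration.

```typescript
// Org-context step: the organization is read from the already-verified
// JWT payload; no database lookup is needed (D8 rationale).
interface ITokenPayload {
  sub: string;
  organization_id: string; // claim added in Phase 3
}

interface AuthedRequest {
  user?: ITokenPayload;
  organizationId?: string;
}

function orgContext(req: AuthedRequest): { ok: boolean; status?: number } {
  const orgId = req.user?.organization_id;
  if (!orgId) return { ok: false, status: 403 }; // missing or malformed claim
  req.organizationId = orgId; // downstream services read this value
  return { ok: true };
}
```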

Component Interaction Map (Phase 3)

                    ┌──────────────────────┐
                    │     Web Dashboard    │
                    │ (+ Org Mgmt pages)   │
                    └──────────┬───────────┘
                               │ HTTPS
       ┌───────────────────────▼─────────────────────────────┐
       │                AgentIdP Server                       │
       │                                                      │
       │  TLS MW → Auth MW → OrgContext MW → OPA MW           │
       │                                                      │
       │  ┌───────────┐ ┌───────────┐ ┌───────────────────┐  │
       │  │ OrgService│ │DIDService │ │ OIDCProvider      │  │
       │  └───────────┘ └───────────┘ │ (oidc-provider)   │  │
       │  ┌───────────┐ ┌───────────┐ └───────────────────┘  │
       │  │ FedService│ │WebhookSvc │                         │
       │  └───────────┘ └───────────┘                         │
       │  ┌─────────────────────────┐                         │
       │  │ SOC2Controls (cross-cut)│                         │
       │  └─────────────────────────┘                         │
       └──────────┬──────────────┬──────────────┬─────────────┘
                  │              │              │
         ┌────────▼──┐   ┌───────▼──┐   ┌──────▼──────┐
         │PostgreSQL │   │  Redis   │   │    Vault    │
         │ + RLS     │   │ +bull Q  │   │  (secrets)  │
         └───────────┘   └──────────┘   └─────────────┘
                  │
         ┌────────▼──────┐
         │  Prometheus   │
         │  + Alerting   │
         └────────┬──────┘
                  │
         ┌────────▼──────┐
         │    Grafana    │
         └───────────────┘