
Phase 3: Enterprise — Technical Design

Date: 2026-03-29
Author: Virtual Architect
Status: Draft — pending CEO approval of proposal


Architecture Overview

Phase 3 transforms AgentIdP from a single-tenant OAuth 2.0 server into a multi-tenant, W3C DID-issuing, OIDC-compliant, federated enterprise identity platform. The architecture remains a single monolithic Express application (no microservices split) to avoid operational complexity, but clear service boundaries are enforced internally.

                         ┌──────────────────────────────────────────────────────┐
                         │                AgentIdP Server (Express)             │
                         │                                                      │
                         │  ┌────────────────────────────────────────────────┐  │
                         │  │  Middleware Stack (ordered)                    │  │
                         │  │  TLS Enforcement → Auth → Org Context → OPA   │  │
                         │  └────────────────────────────────────────────────┘  │
                         │                                                      │
                         │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐  │
                         │  │ OrgSvc   │ │ DIDSvc   │ │OIDCSvc   │ │FedSvc │  │
                         │  └──────────┘ └──────────┘ └──────────┘ └───────┘  │
                         │  ┌──────────┐ ┌──────────┐                          │
                         │  │ WebhookQ │ │ SOC2Ctrl │                          │
                         │  └──────────┘ └──────────┘                          │
                         └──────────────────────────────────────────────────────┘
                                    │            │          │
                           ┌────────▼──┐  ┌─────▼───┐  ┌──▼──────────┐
                           │PostgreSQL │  │  Redis  │  │    Vault    │
                           │(org rows) │  │(webhook │  │  (secrets)  │
                           └───────────┘  │ queue)  │  └─────────────┘
                                          └─────────┘

Architectural Decision Records


D1: Multi-Tenancy Model

Status: Accepted

Decision: Row-level tenancy — add organization_id (UUID, NOT NULL) to every domain table. No schema-per-tenant, no database-per-tenant.

Rationale: Row-level tenancy is operationally the simplest approach: a single database, a single schema, a single connection pool. All queries are augmented with an organization_id filter extracted from the authenticated JWT. PostgreSQL Row-Level Security (RLS) is enabled on all tenant-scoped tables as a defense-in-depth measure — even if the application filter is accidentally omitted, the database enforces isolation.

Alternatives Considered:

  • Schema-per-tenant. Pros: strong isolation, independent migrations. Cons: complex migration tooling, connection-pool explosion at scale. Rejected: operational overhead exceeds the threat-model requirement.
  • Database-per-tenant. Pros: maximum isolation. Cons: separate connection pool, backup, and monitoring per tenant. Rejected: prohibitive at 100+ orgs; overkill for our threat model.
  • Row-level (chosen). Pros: simple, fast, single migration path. Cons: RLS must be enforced consistently. Chosen: isolation enforced via both the application layer and RLS.

Consequences:

  • Every domain table gets an organization_id column and a corresponding index
  • All service methods accept organizationId: string as a required parameter
  • JWT payload extended to include organization_id claim
  • Existing single-tenant data migrated to a default system organization
  • PostgreSQL RLS policies written for all tenant tables
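
The application-side half of this defense can be sketched as a small query helper. This is illustrative only: the design does not name such a helper, and `scopedQuery` and its signature are assumptions; it merely demonstrates the pattern of appending the organization_id predicate to every query so it cannot be forgotten.

```typescript
// Hypothetical helper for the application-side tenancy filter.
// All service queries are built through it, so the organization_id
// predicate is always present (RLS remains the backstop).
interface TenantQuery {
  text: string;
  values: unknown[];
}

function scopedQuery(
  baseSql: string,
  values: unknown[],
  organizationId: string,
): TenantQuery {
  // Append with AND if the query already has a WHERE clause.
  const clause = /\bwhere\b/i.test(baseSql) ? " AND" : " WHERE";
  const idx = values.length + 1; // next positional parameter ($1-based)
  return {
    text: `${baseSql}${clause} organization_id = $${idx}`,
    values: [...values, organizationId],
  };
}

const q = scopedQuery("SELECT * FROM agents WHERE status = $1", ["active"], "org-123");
// q.text   → "SELECT * FROM agents WHERE status = $1 AND organization_id = $2"
// q.values → ["active", "org-123"]
```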

D2: DID Method Selection

Status: Accepted

Decision: did:web — DID Documents served over HTTPS at well-known and per-agent URLs.

Rationale: did:web requires no blockchain, no ledger, and no external infrastructure beyond the HTTPS server already running. It is W3C DID Core 1.0 compliant, supported by all major DID resolvers, and is the preferred method for enterprise deployments where an organization controls its own domain. It aligns directly with the did:web identifier scheme used in AGNTCY agent card specifications.

Alternatives Considered:

  • did:web (chosen). Pros: no blockchain, HTTPS-based, enterprise-friendly. Cons: DID tied to domain; moving the domain invalidates DIDs. Accepted tradeoff: enterprise deployments have stable domains.
  • did:key. Pros: self-contained, no infrastructure. Cons: not anchored (anyone can generate any did:key); no discovery. Rejected: no trust anchor; not suitable for enterprise identity.
  • did:ethr. Pros: Ethereum-anchored, decentralized. Cons: blockchain dependency, gas costs, not enterprise-standard. Rejected: blockchain dependency is a non-starter for regulated enterprises.

Consequences:

  • DID for the AgentIdP instance: did:web:<hostname>
  • DID for an agent: did:web:<hostname>:agents:<agentId>
  • DID Documents served at /.well-known/did.json and /agents/:id/did
  • Domain change requires DID migration — document this in ops runbook
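
The identifier scheme above is pure string construction, and the W3C did:web method defines a deterministic DID-to-URL mapping (colons after the host become path separators, percent-encoded segments are decoded, and a did.json filename is appended). A minimal sketch, with hostnames chosen for illustration:

```typescript
// DID for the AgentIdP instance and for an individual agent,
// following the scheme in D2.
function instanceDid(hostname: string): string {
  return `did:web:${hostname}`;
}

function agentDid(hostname: string, agentId: string): string {
  return `did:web:${hostname}:agents:${agentId}`;
}

// Standard did:web resolution per the W3C did:web method spec:
// a bare-domain DID resolves to /.well-known/did.json, a DID with
// path segments resolves to <path>/did.json.
function didWebToUrl(did: string): string {
  const parts = did.replace(/^did:web:/, "").split(":").map(decodeURIComponent);
  const [host, ...path] = parts;
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join("/")}/did.json`;
}
```

Note that standard resolvers derive a `.../did.json` URL for agent DIDs, so the `/agents/:id/did` route listed above would also need to serve (or redirect from) `/agents/:id/did.json` for off-the-shelf resolvers to work; this is an observation, not part of the design text.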

D3: OIDC Library Selection

Status: Accepted

Decision: oidc-provider npm package — a certified, RFC-compliant OIDC server library.

Rationale: oidc-provider is the most widely deployed Node.js OIDC library, passing the OpenID Foundation's official conformance test suite. Building OIDC from scratch on top of our existing JWT infrastructure would require implementing Discovery, JWKS rotation, ID token construction, and claim aggregation correctly against multiple RFCs. The certified library eliminates that risk and reduces implementation surface area. It integrates cleanly with Express as a mounted middleware.

Alternatives Considered:

  • oidc-provider (chosen). Pros: certified, RFC-complete, actively maintained. Cons: adds a significant dependency. Chosen: the risk of non-compliance in a custom implementation outweighs the dependency cost.
  • Custom JWT extension. Pros: full control, no new dependency. Cons: high risk of spec deviation; ID token, Discovery, and JWKS are complex. Rejected: RFC compliance cannot be self-certified.
  • Keycloak sidecar. Pros: battle-tested, full-featured. Cons: heavyweight Java service; architectural mismatch. Rejected: not Node.js; adds operational complexity.

Consequences:

  • oidc-provider is mounted at /oidc in Express
  • OIDC Discovery served at /.well-known/openid-configuration (proxied from oidc-provider)
  • JWKS served at /.well-known/jwks.json
  • Adapter written to store OIDC sessions in Redis (oidc-provider's adapter interface)
  • Existing POST /oauth2/token route extended, not replaced — maintains backward compatibility

D4: Federation Protocol

Status: Accepted

Decision: Signed JWT assertions — remote AgentIdP instances present a signed JWT; the receiving instance verifies the signature against the registered JWKS of the issuing instance.

Rationale: JWT assertion federation reuses the existing JWT infrastructure (jsonwebtoken, JWKS endpoint from OIDC workstream). No new protocol is introduced. The trust model is explicit: operators register partner instances with their JWKS URL. This aligns with RFC 7523 (JWT Profile for OAuth 2.0 Client Authentication) and the AGNTCY inter-agent trust model.

Alternatives Considered:

  • Signed JWT assertions (chosen). Pros: uses existing JWT infra, explicit trust registry, RFC-aligned. Cons: the partner's JWKS URL must be reachable at verification time. Accepted: an acceptable operational constraint; JWKS can be cached.
  • mTLS. Pros: strong cryptographic identity. Cons: certificate management overhead; PKI required per partner. Rejected: cert-management complexity is not justified when JWT assertions suffice.
  • AGNTCY-specific protocol. Pros: native alignment. Cons: spec still evolving; risk of churn. Rejected: build on the stable JWT base; adapt to AGNTCY extensions as the spec matures.

Consequences:

  • New federation_partners table: id, name, jwks_url, issuer, trusted_since, organization_id
  • JWKS of partner instances cached in Redis with TTL
  • POST /federation/verify accepts a bearer token from a remote instance and returns verification result
  • Federation tokens are not accepted for agent management endpoints — only for identity assertion
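
The verification step behind POST /federation/verify can be sketched with Node's built-in crypto. In production the design would use the existing jsonwebtoken library and a key fetched from the partner's cached JWKS; this self-contained sketch signs and verifies an RS256 assertion directly, and the claim names beyond the standard iss/sub/exp are illustrative.

```typescript
import { createSign, createVerify, generateKeyPairSync, KeyObject } from "node:crypto";

const b64url = (s: string | Buffer): string => Buffer.from(s).toString("base64url");

// What a partner instance would send: a compact JWS over its claims.
function signAssertion(payload: object, privateKey: KeyObject): string {
  const header = b64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signer = createSign("RSA-SHA256");
  signer.update(`${header}.${body}`);
  return `${header}.${body}.${signer.sign(privateKey).toString("base64url")}`;
}

// Receiving side: check the signature against the partner's registered
// key, then check issuer (must match the federation_partners registry)
// and expiry. Returns the claims on success, null on any failure.
function verifyAssertion(token: string, publicKey: KeyObject, expectedIssuer: string) {
  const [header, body, sig] = token.split(".");
  const verifier = createVerify("RSA-SHA256");
  verifier.update(`${header}.${body}`);
  if (!verifier.verify(publicKey, Buffer.from(sig, "base64url"))) return null;
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  if (claims.iss !== expectedIssuer) return null; // issuer not in trust registry
  if (claims.exp * 1000 < Date.now()) return null; // expired assertion
  return claims;
}

const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });
const token = signAssertion(
  {
    iss: "https://partner-idp.example.com",
    sub: "did:web:partner-idp.example.com:agents:a1",
    exp: Math.floor(Date.now() / 1000) + 60,
  },
  privateKey,
);
const claims = verifyAssertion(token, publicKey, "https://partner-idp.example.com");
```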

D5: Webhook Delivery Architecture

Status: Accepted

Decision: Async delivery via Redis-backed bull queue with exponential backoff retry (max 10 attempts over 24 hours).

Rationale: Synchronous webhook delivery from within a request handler would add latency and create tight coupling between event generation and delivery outcome. The Redis queue (bull) decouples delivery: events are enqueued immediately, a background worker delivers them. bull provides built-in retry, delay, and failure tracking without introducing a new infrastructure component (Redis is already present). HMAC-SHA256 signing on every delivery allows recipients to verify authenticity.

Alternatives Considered:

  • Redis queue via bull (chosen). Pros: reuses existing Redis, retry built in, low operational overhead. Cons: delivery tied to Redis availability. Accepted: Redis is already a required dependency.
  • Synchronous in-request delivery. Pros: simplest implementation. Cons: adds latency to event-generating requests; a failure blocks the response. Rejected: unacceptable latency and coupling.
  • Dedicated message broker (RabbitMQ). Pros: robust, durable. Cons: new infrastructure dependency. Rejected: operational overhead; Redis is already present.
  • Kafka as primary. Pros: high throughput, durable. Cons: overkill for webhook delivery; complex operations. Rejected as primary: offered as an optional adapter only.

Consequences:

  • New webhook_subscriptions and webhook_deliveries tables
  • The bull worker runs inside the same Node.js instance as the API (an in-process bull processor, not a separate service)
  • Retry schedule: 1m, 5m, 15m, 1h, 4h, 12h, 24h (exponential backoff)
  • Failed delivery after 10 attempts moves to dead-letter; operator alerted
  • Optional Kafka adapter: if KAFKA_BROKERS env var is set, events are also produced to Kafka
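
Two of these pieces are small enough to sketch directly: the HMAC-SHA256 delivery signature and the retry-delay lookup. The header name and function names are illustrative, not from the design; the delay values are the schedule listed above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Signature the worker attaches to each delivery, e.g. in an
// X-Webhook-Signature header (header name is an assumption).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Recipient-side check: constant-time comparison avoids timing leaks.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}

// Retry schedule from D5: 1m, 5m, 15m, 1h, 4h, 12h, 24h.
const RETRY_DELAYS_MS = [1, 5, 15, 60, 240, 720, 1440].map((m) => m * 60_000);

// Delay before retry N (1-based); later attempts reuse the last delay.
function delayForAttempt(attempt: number): number {
  return RETRY_DELAYS_MS[Math.min(attempt - 1, RETRY_DELAYS_MS.length - 1)];
}
```

With bull, a custom schedule like this is typically supplied via a backoff function in the queue settings rather than the built-in exponential strategy, since the listed delays are not a strict geometric series.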

D6: SOC 2 Scope

Status: Accepted

Decision: Target SOC 2 Type II (operational, not just design). All controls implemented in code. Audit period: 6 months post-Phase 3 launch.

Rationale: SOC 2 Type I certifies that controls are designed correctly. SOC 2 Type II certifies that they operate continuously over a period of time. Enterprise customers in regulated industries (finance, healthcare, government) require Type II. Implementing the controls now, with the 6-month operational window beginning at Phase 3 launch, puts us on the fastest possible path to Type II certification.

Alternatives Considered:

  • Type II from launch (chosen). Pros: satisfies enterprise requirements. Cons: requires a 6-month operating window. Accepted: the controls are implemented in Phase 3; the audit window starts after launch.
  • Type I only. Pros: faster to certify. Cons: not accepted by most enterprise procurement. Rejected: insufficient for target customers.
  • ISO 27001 instead. Pros: international standard. Cons: larger scope, longer implementation. Rejected: SOC 2 is the standard for the US market; add ISO 27001 in Phase 4.

Consequences:

  • Encryption at rest: pgcrypto extension for column-level encryption on credentials.secret_hash and credentials.vault_path
  • TLS enforcement: Express middleware rejects HTTP requests (not HTTPS) in production
  • Secrets rotation: cron-based job that triggers credential rotation reminders and Vault lease renewals
  • Security alerting: Prometheus alerting rules for auth failure spikes, rate limit exhaustion, anomalous token issuance
  • Audit log immutability: Merkle hash chain (each row's hash includes the previous row's hash)
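
The TLS-enforcement control lends itself to a short sketch. The request shape and the 403 status are assumptions for illustration; Express sets req.secure on TLS connections, and the x-forwarded-proto header covers deployments behind a reverse proxy.

```typescript
// Minimal TLS-enforcement check, framework-agnostic but shaped to
// drop into an Express middleware.
interface RequestLike {
  secure?: boolean; // set by Express when the connection itself is TLS
  headers: Record<string, string | undefined>;
}

function isTls(req: RequestLike): boolean {
  return req.secure === true || req.headers["x-forwarded-proto"] === "https";
}

// Plain HTTP is rejected only in production; dev and test may use HTTP.
function enforceTls(
  req: RequestLike,
  env = process.env.NODE_ENV,
): { allow: boolean; status?: number } {
  if (env !== "production") return { allow: true };
  return isTls(req) ? { allow: true } : { allow: false, status: 403 };
}
```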

D7: Audit Log Immutability — Merkle Hash Chain

Status: Accepted

Decision: Each audit_logs row carries a hash field: SHA-256(eventId + timestamp + action + outcome + agentId + previousHash). The chain starts with a genesis hash. Verification is a sequential pass over all rows in insertion order.

Rationale: Append-only logs in PostgreSQL can be altered by a DBA with sufficient access. A Merkle-style hash chain makes tampering detectable without requiring a blockchain. Any modification to a historical row breaks the chain from that point forward. Verification is a simple sequential computation that can be run on demand or as a scheduled integrity check.

Alternatives Considered:

  • Merkle hash chain in PostgreSQL (chosen). Pros: no new infra, tamper-evident, verifiable. Cons: a DBA could re-compute hashes after tampering if they control the algorithm. Accepted: the threat model is accidental or low-sophistication modification; the cryptographic chain deters opportunistic tampering.
  • Blockchain anchor. Pros: cryptographically immutable. Cons: blockchain dependency, cost, latency. Rejected: excessive for the current threat model.
  • Write-once S3/GCS export. Pros: external immutability. Cons: delayed; operational complexity. Rejected: added complexity; the hash chain provides continuous coverage.

Consequences:

  • New hash (VARCHAR 64) and previous_hash (VARCHAR 64) columns on audit_logs
  • AuditService.create() computes hash before insert — adds ~1ms latency per audit event
  • New GET /audit/verify endpoint: returns chain integrity status (admin only)
  • A PostgreSQL trigger on audit_logs rejects UPDATE and DELETE, making the table effectively INSERT-only
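
The hash computation and the sequential verification pass follow directly from the formula in the decision. A sketch under stated assumptions (the genesis value and in-memory chain representation are illustrative; in the real system rows live in audit_logs):

```typescript
import { createHash } from "node:crypto";

const GENESIS_HASH = "0".repeat(64); // illustrative genesis value

interface AuditRow {
  eventId: string;
  timestamp: string;
  action: string;
  outcome: string;
  agentId: string;
  hash?: string;
  previousHash?: string;
}

// SHA-256(eventId + timestamp + action + outcome + agentId + previousHash),
// per the D7 formula.
function rowHash(row: AuditRow, previousHash: string): string {
  return createHash("sha256")
    .update(row.eventId + row.timestamp + row.action + row.outcome + row.agentId + previousHash)
    .digest("hex");
}

// What AuditService.create() does before insert: link to the tail hash.
function appendRow(chain: AuditRow[], row: AuditRow): AuditRow[] {
  const previousHash = chain.length ? chain[chain.length - 1].hash! : GENESIS_HASH;
  return [...chain, { ...row, previousHash, hash: rowHash(row, previousHash) }];
}

// The GET /audit/verify pass: recompute every hash in insertion order.
// Any edited historical row breaks the chain from that point forward.
function verifyChain(chain: AuditRow[]): boolean {
  let prev = GENESIS_HASH;
  for (const row of chain) {
    if (row.previousHash !== prev || row.hash !== rowHash(row, prev)) return false;
    prev = row.hash;
  }
  return true;
}
```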

D8: Organization Context in JWT

Status: Accepted

Decision: Add organization_id claim to JWT access tokens issued by POST /oauth2/token. All downstream middleware extracts organization_id from the token — no separate lookup required.

Rationale: Including organization_id in the JWT keeps the middleware stack stateless. The alternative — looking up the organization from the database on every request — adds latency and a database round-trip to every authenticated call. The JWT is already signed; adding a claim costs nothing cryptographically.

Consequences:

  • ITokenPayload interface extended: organization_id: string
  • All service methods receive organizationId from req.user.organization_id
  • Token introspection response includes organization_id
  • Agents registered before multi-tenancy belong to the default system organization
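
The stateless extraction step can be sketched as follows. ITokenPayload is named in the consequences above; the other claim fields, the function name, and the 403 status are assumptions for illustration.

```typescript
// Org-context step: the organization is read from the already-verified
// JWT payload; no database lookup is needed (D8 rationale).
interface ITokenPayload {
  sub: string;
  organization_id: string; // claim added in Phase 3
}

interface AuthedRequest {
  user?: ITokenPayload;
  organizationId?: string;
}

function orgContext(req: AuthedRequest): { ok: boolean; status?: number } {
  const orgId = req.user?.organization_id;
  if (!orgId) return { ok: false, status: 403 }; // missing or malformed claim
  req.organizationId = orgId; // downstream services read this value
  return { ok: true };
}
```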

Component Interaction Map (Phase 3)

                    ┌──────────────────────┐
                    │     Web Dashboard    │
                    │ (+ Org Mgmt pages)   │
                    └──────────┬───────────┘
                               │ HTTPS
       ┌───────────────────────▼─────────────────────────────┐
       │                AgentIdP Server                       │
       │                                                      │
       │  TLS MW → Auth MW → OrgContext MW → OPA MW           │
       │                                                      │
       │  ┌───────────┐ ┌───────────┐ ┌───────────────────┐  │
       │  │ OrgService│ │DIDService │ │ OIDCProvider      │  │
       │  └───────────┘ └───────────┘ │ (oidc-provider)   │  │
       │  ┌───────────┐ ┌───────────┐ └───────────────────┘  │
       │  │ FedService│ │WebhookSvc │                         │
       │  └───────────┘ └───────────┘                         │
       │  ┌─────────────────────────┐                         │
       │  │ SOC2Controls (cross-cut)│                         │
       │  └─────────────────────────┘                         │
       └──────────┬──────────────┬──────────────┬─────────────┘
                  │              │              │
         ┌────────▼──┐   ┌───────▼──┐   ┌──────▼──────┐
         │PostgreSQL │   │  Redis   │   │    Vault    │
         │ + RLS     │   │ +bull Q  │   │  (secrets)  │
         └───────────┘   └──────────┘   └─────────────┘
                  │
         ┌────────▼──────┐
         │  Prometheus   │
         │  + Alerting   │
         └────────┬──────┘
                  │
         ┌────────▼──────┐
         │    Grafana    │
         └───────────────┘