# Technology Stack and Architecture Decision Records

Every technology choice in AgentIdP was made deliberately. This document records the decision, rationale, and alternatives considered for each major technology. New engineers should read this before making any technology additions or changes — the pattern here is the template for future ADRs.

---

### ADR-1: Node.js 18 LTS

**Status**: Adopted
**Component**: AgentIdP server runtime and Node.js SDK runtime

**Decision**: Use Node.js 18 LTS as the server runtime.

**Rationale**: Node.js 18 LTS provides native `fetch`, native ESM support, and a stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL (`pg`), Redis (`redis`), JWT (`jsonwebtoken`), and bcrypt (`bcryptjs`) is mature and well-maintained on this version. The non-blocking I/O model is well-suited for an IdP that handles many concurrent short-lived authentication requests. The `engines.node` field in `package.json` enforces `>=18.0.0`.

**Alternatives considered**:
- Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (`pg`, `bcryptjs`), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.

**Consequences**: All Dockerfiles and Terraform ECS/Cloud Run task definitions must target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires CTO approval and a QA sign-off on the full test suite.

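
The version gate described above is a one-line manifest entry. A minimal sketch of the relevant field (the repo's actual `package.json` carries many more fields):

```json
{
  "engines": {
    "node": ">=18.0.0"
  }
}
```

Note that `engines` is only enforced as a hard error when the installer is configured to do so (e.g. `engine-strict=true` in `.npmrc`); otherwise npm emits a warning.
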
---

### ADR-2: TypeScript 5.3 Strict Mode

**Status**: Adopted
**Component**: All source files — server, all SDKs, dashboard

**Decision**: TypeScript 5.3 with `strict: true` and every additional strictness flag enabled in `tsconfig.json`.

**Rationale**: AgentIdP handles authentication tokens and cryptographic secrets. Type errors in this domain can cause security vulnerabilities — a value that should be `string | null` treated as `string` can produce silent authentication bypasses. Strict TypeScript with `noImplicitAny`, `strictNullChecks`, `noUnusedLocals`, `noUnusedParameters`, and `noImplicitReturns` makes these classes of bug a compile-time error rather than a runtime failure in production.

**Alternatives considered**:
- Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.

**Consequences**: All new code must compile cleanly under `tsc --strict`. Zero `any` types — ever. No exceptions granted without CTO approval. The `tsconfig.json` enables `noImplicitAny`, `strictNullChecks`, `strictFunctionTypes`, `strictBindCallApply`, `strictPropertyInitialization`, `noImplicitThis`, `alwaysStrict`, `noUnusedLocals`, `noUnusedParameters`, `noImplicitReturns`, and `noFallthroughCasesInSwitch`.

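
Collected into a `tsconfig.json`, the flags listed above look roughly like this (a sketch; the repo's real config also sets module, target, and output options not shown here):

```json
{
  "compilerOptions": {
    "strict": true,
    "noImplicitAny": true,
    "strictNullChecks": true,
    "strictFunctionTypes": true,
    "strictBindCallApply": true,
    "strictPropertyInitialization": true,
    "noImplicitThis": true,
    "alwaysStrict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noImplicitReturns": true,
    "noFallthroughCasesInSwitch": true
  }
}
```

`strict: true` already implies the first seven flags; listing them explicitly keeps the intent visible even if the umbrella flag is ever loosened.
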
---

### ADR-3: Express 4.18

**Status**: Adopted
**Component**: HTTP server framework

**Decision**: Use Express 4.18 as the HTTP framework.

**Rationale**: Express is the most widely understood Node.js HTTP framework. Its middleware model (`(req, res, next)`) maps directly to the IdP's layered architecture: `helmet` → `cors` → `metricsMiddleware` → `authMiddleware` → `opaMiddleware` → controller → service → repository → `errorHandler`. The ecosystem for Express middleware (`helmet`, `cors`, `morgan`) is mature. For a spec-first project, Express's lack of convention about code structure is a feature — the architecture is explicit and fully visible in `src/app.ts`.

**Alternatives considered**:
- Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.

**Consequences**: All HTTP concerns (routing, middleware, error handling) use the Express 4 API. The `errorHandler` middleware must remain the last `app.use()` call in `src/app.ts`.

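
The layered chain and the "error handler runs last" invariant can be sketched without Express at all. The following is a toy pipeline in the same `(req, res, next)` shape (all names are illustrative; this is not the repo's `src/app.ts`):

```typescript
// Toy sketch of the layered middleware chain. Each layer either calls
// next() to pass control on, or short-circuits; passing an Error to
// next() skips the rest of the chain and lands in errorHandler, which
// mirrors the rule that errorHandler is the final app.use() call.
type Req = { path: string; authorized?: boolean };
type Res = { status?: number; body?: string };
type Next = (err?: Error) => void;
type Middleware = (req: Req, res: Res, next: Next) => void;

const authMiddleware: Middleware = (req, _res, next) => {
  req.authorized = req.path !== "/admin"; // stand-in for real token checks
  next();
};

const controller: Middleware = (req, res, next) => {
  if (!req.authorized) return next(new Error("forbidden"));
  res.status = 200;
  res.body = "ok";
};

// The terminal error handler; nothing runs after it.
const errorHandler = (err: Error, _req: Req, res: Res): void => {
  res.status = 403;
  res.body = err.message;
};

function run(chain: Middleware[], req: Req): Res {
  const res: Res = {};
  let i = 0;
  const next: Next = (err) => {
    if (err) return errorHandler(err, req, res);
    const mw = chain[i++];
    if (mw) mw(req, res, next);
  };
  next();
  return res;
}
```

For example, `run([authMiddleware, controller], { path: "/agents" })` produces a 200 response, while `/admin` is routed through `errorHandler` and comes back 403.
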
---

### ADR-4: PostgreSQL 14

**Status**: Adopted
**Component**: Primary data store for agents, credentials, and audit events

**Decision**: Use PostgreSQL 14 as the primary relational database.

**Rationale**: The audit log requires ACID guarantees — partial writes or uncommitted reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's `JSONB` column type is used for the `metadata` field in `audit_events`, allowing structured context data without schema changes for each new event type. PostgreSQL's row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.

**Alternatives considered**:
- MySQL — rejected because its JSON support is weaker than PostgreSQL's `JSONB` with GIN indexing, and its default transaction isolation level has historically produced surprises.
- MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.

**Consequences**: All schema changes go through numbered SQL migration files in `src/db/migrations/`. Migration files are append-only — never modify an existing migration. New tables require a new numbered file (e.g. `005_create_agent_groups.sql`).

---

### ADR-5: Redis 7

**Status**: Adopted
**Component**: Token revocation list, monthly usage counters, rate-limit sliding window

**Decision**: Use Redis 7 as the in-memory data store.

**Rationale**: Token revocation requires O(1) key lookup with TTL-based automatic expiry. `SET revoked:{jti} 1 EX {seconds_until_expiry}` stores a revocation entry that expires precisely when the token itself would have expired — zero manual cleanup required. The monthly token counter uses Redis `INCR`, which is atomic and O(1). The rate-limiter uses a Redis sorted set for the sliding-window algorithm.

**Alternatives considered**:
- Memcached — rejected because Memcached has no sorted-set data structure at all, and the sliding-window rate-limiter requires one.
- PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis `GET` at sub-millisecond latency.

**Consequences**: Redis is a required infrastructure dependency. A Redis instance must be running and reachable via `REDIS_URL` before the server starts. `compose.yaml` provides a Redis 7.2 Alpine container for local development on port 6379.

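
The revocation entry's lifetime can be derived directly from the token's `exp` claim. A sketch of the key/TTL computation (an illustrative helper, not the repo's actual code):

```typescript
// Sketch: derive the Redis key and TTL for a revocation entry so the
// entry expires exactly when the token itself would have expired.
// exp and now are UNIX timestamps in seconds, as in a JWT `exp` claim.
function revocationEntry(
  jti: string,
  exp: number,
  now: number
): { key: string; ttlSeconds: number } | null {
  const ttlSeconds = exp - now;
  if (ttlSeconds <= 0) return null; // token already expired: nothing to store
  return { key: `revoked:${jti}`, ttlSeconds };
}
```

The result maps straight onto `SET revoked:{jti} 1 EX {seconds_until_expiry}`; verification is then a single `EXISTS` on the same key, and expired entries vanish without any cleanup job.
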
---

### ADR-6: HashiCorp Vault

**Status**: Adopted (opt-in)
**Component**: Credential secret storage — alternative to bcrypt in PostgreSQL

**Decision**: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.

**Rationale**: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt hashes cannot be reversed, some enterprises require that secrets never touch a relational database — even in hashed form. Vault provides a dedicated secrets management plane with HSM backing and an independent audit trail at the secrets level. The `verifySecret` method in `VaultClient` uses `crypto.timingSafeEqual` to prevent timing-based side-channel attacks when comparing stored and candidate secrets.

**Alternatives considered**:
- AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only — retained as the fallback path. When `VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the server operates identically to Phase 1.

**Consequences**: Vault is controlled by `VAULT_ADDR` (required), `VAULT_TOKEN` (required), and `VAULT_MOUNT` (optional, defaults to `secret`). When these are not set, bcrypt is used unchanged. Credential rows carry a nullable `vault_path` column: `null` means bcrypt; a non-null path means Vault verification is used.

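
The timing-safe comparison can be sketched as follows. Hashing both sides first normalises lengths, since `crypto.timingSafeEqual` throws on unequal-length buffers (illustrative code, not the repo's `VaultClient`):

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Sketch of a constant-time secret comparison. Hashing both inputs to a
// fixed 32-byte SHA-256 digest makes the buffers equal-length (which
// timingSafeEqual requires) and keeps comparison time independent of
// where the first differing byte sits.
function secretsMatch(stored: string, candidate: string): boolean {
  const a = createHash("sha256").update(stored).digest();
  const b = createHash("sha256").update(candidate).digest();
  return timingSafeEqual(a, b);
}
```

A naive `stored === candidate` comparison can leak the position of the first mismatching character through response timing; `timingSafeEqual` compares every byte regardless.
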
---

### ADR-7: OPA (Open Policy Agent)

**Status**: Adopted
**Component**: Request authorisation — scope enforcement on all protected endpoints

**Decision**: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.

**Rationale**: Hard-coded scope checks in middleware would require a code deployment for every policy change. OPA decouples the policy (`policies/authz.rego`) from the server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via `SIGHUP` without restarting the server. The `@open-policy-agent/opa-wasm` package evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm bundle is present (development, CI), the middleware falls back to a TypeScript implementation that reads `policies/data/scopes.json`.

**Alternatives considered**:
- Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound `method + path + scope-intersection` pattern AgentIdP requires.

**Consequences**: All authorisation rules live in `policies/authz.rego` and `policies/data/scopes.json`. Adding a new endpoint requires adding its scope requirement to `scopes.json`. A policy change is deployed by updating `scopes.json` (or `authz.wasm`) and sending `SIGHUP` to the running process — no redeployment needed.

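
The TypeScript fallback's scope-intersection check can be sketched like this. The data shape shown is hypothetical and stands in for whatever `policies/data/scopes.json` actually contains:

```typescript
// Sketch of the scope-intersection fallback: a request is allowed only
// when the token's granted scopes include every scope the endpoint
// requires. The "METHOD /path" keyed map is an assumed stand-in for
// policies/data/scopes.json.
type ScopesData = Record<string, string[]>;

function isAllowed(
  scopes: ScopesData,
  method: string,
  path: string,
  tokenScopes: string[]
): boolean {
  const required = scopes[`${method} ${path}`];
  if (!required) return false; // deny by default for unlisted endpoints
  const granted = new Set(tokenScopes);
  return required.every((scope) => granted.has(scope));
}
```

Denying by default when an endpoint has no entry keeps the failure mode safe: forgetting to register a new endpoint in `scopes.json` blocks it rather than exposing it.
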
---

### ADR-8: React 18 + Vite 5

**Status**: Adopted
**Component**: Web dashboard SPA (`dashboard/`)

**Decision**: Use React 18 with Vite 5 as the web dashboard framework and build tool.

**Rationale**: React 18's concurrent rendering model handles the dashboard's async data fetching patterns cleanly. The `@sentryagent/idp-sdk` Node.js package is reused directly in the dashboard via `TokenManager` for authentication, avoiding duplicated API client code. Vite 5 provides sub-second HMR in development and a fast production build with tree-shaking. The dashboard is built to `dashboard/dist/` and served as static files from Express at `/dashboard`, keeping the deployment footprint to a single container.

**Alternatives considered**:
- Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.

**Consequences**: The dashboard must be built (`cd dashboard && npm run build`) before Express can serve it. In local development, run `cd dashboard && npm run dev` to use Vite's dev server with HMR; the Vite proxy forwards `/api/` calls to Express at `localhost:3000`.

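
The dev-server proxy amounts to a few lines of `vite.config.ts`. A sketch (the repo's real config will also register the React plugin and build options):

```ts
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    // Forward API calls from Vite's dev server to Express on port 3000,
    // so the SPA can use relative /api/... URLs in both dev and prod.
    proxy: {
      "/api": "http://localhost:3000",
    },
  },
});
```

With this in place the browser only ever talks to the Vite origin during development, which sidesteps CORS configuration for local work.
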
---

### ADR-9: Prometheus + Grafana

**Status**: Adopted
**Component**: Operational metrics collection and visualisation

**Decision**: Use Prometheus for metrics collection and Grafana for dashboards.

**Rationale**: Prometheus is the industry standard for metrics in container environments. The `prom-client` npm package integrates natively with Express and provides `Counter` and `Histogram` metric types that cover all observability needs for AgentIdP. Grafana's YAML provisioning in `monitoring/grafana/provisioning/` makes dashboards reproducible and version-controlled. The monitoring stack runs as a Docker Compose overlay (`compose.monitoring.yaml`) without interfering with the base dev environment.

**Alternatives considered**:
- Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by `method`, `route`, and `status_code`.

**Consequences**: All metric definitions live exclusively in `src/metrics/registry.ts`. No other file may instantiate a `Counter` or `Histogram` — all other files import specific metrics from that registry. Grafana is available at port 3001 when the monitoring overlay is running.

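
Launching the overlay composes the two files with Docker's multi-file merge. A sketch of the commands, assuming both files sit in the repo root as described above:

```sh
# Start the base dev stack plus the monitoring overlay; later -f files
# override and extend earlier ones.
docker compose -f compose.yaml -f compose.monitoring.yaml up -d

# Tear down with the same file list so overlay-only services are included.
docker compose -f compose.yaml -f compose.monitoring.yaml down
```

Running `docker compose up -d` without the second `-f` starts only the base stack, which is why the overlay never interferes with everyday development.
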
---

### ADR-10: Terraform

**Status**: Adopted
**Component**: Infrastructure as code — multi-region AWS + GCP deployment

**Decision**: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.

**Rationale**: Terraform's HCL syntax is readable and its provider ecosystem covers both AWS and GCP with the same toolchain. Reusable modules in `terraform/modules/` (agentidp, lb, rds, redis) are composed in environment-specific configurations under `terraform/environments/aws/` and `terraform/environments/gcp/`. All infrastructure changes go through `terraform plan` review before `terraform apply`, providing a diff-based approval workflow.

**Alternatives considered**:
- Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.

**Consequences**: All infrastructure changes must go through Terraform. No manual edits via the AWS console or GCP console are permitted — they will be overwritten on the next `terraform apply`. Terraform state is stored in a remote backend and must not be edited manually.

---

### ADR-11: Stripe

**Status**: Adopted
**Component**: Billing — subscription management and payment processing

**Decision**: Use Stripe as the payment processing and subscription management platform. The `stripe` npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.

**Rationale**: Stripe's hosted Checkout flow eliminates the need to handle PCI-DSS scope for card data. The `stripe.webhooks.constructEvent()` method uses HMAC-SHA256 to verify incoming webhook payloads, preventing replay attacks. The `checkout.session.completed` event carries `metadata: { orgId, targetTier }`, allowing `BillingService` to delegate tier upgrades to `TierService.applyUpgrade()` without coupling billing logic to tier logic.

**Alternatives considered**:
- Paddle — rejected because its global merchant-of-record model introduced complexities with the open-source free tier.
- Braintree — rejected because Stripe's webhook reliability and developer experience are superior.

**Consequences**: Stripe requires `STRIPE_SECRET_KEY` (for API calls) and `STRIPE_WEBHOOK_SECRET` (`whsec_...`, for webhook verification). Per-tier Stripe price IDs are configured via `STRIPE_PRICE_ID_PRO` and `STRIPE_PRICE_ID_ENTERPRISE`. All billing webhook handlers must pass the raw `Buffer` body (not parsed JSON) to `stripe.webhooks.constructEvent()` — use `express.raw()` middleware on the webhook route.

---

### ADR-12: oidc-provider (A2A Delegation)

**Status**: Adopted
**Component**: A2A delegation — OIDC provider for agent-to-agent trust tokens

**Decision**: Use the `oidc-provider` npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at `/oidc` within the Express app.

**Rationale**: `oidc-provider` is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, token endpoint, and discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at `/oidc/jwks`.

**Alternatives considered**:
- Custom JWT signing — rejected because hand-rolled token formats cannot benefit from OIDC tooling and interoperability.

**Consequences**: The `A2A_ENABLED` env var gates the OIDC provider — when set to `'false'`, delegation endpoints return 404. The `OIDC_ISSUER` env var must be set to the full base URL of the OIDC provider (e.g. `https://api.sentryagent.ai`).

---

### ADR-13: Next.js 14 (Developer Portal)

**Status**: Adopted
**Component**: Developer Portal (`portal/`) — public-facing documentation and onboarding

**Decision**: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).

**Rationale**: The developer portal has different performance and SEO requirements than the internal operator dashboard (`dashboard/`). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.

**Alternatives considered**:
- Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
- Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.

**Consequences**: The portal (`portal/`) has its own `package.json`, `tsconfig.json`, `tailwind.config.ts`, and `next.config.js`. It is built and run independently: `cd portal && npm install && npm run dev`. The portal calls the AgentIdP REST API using the same `@sentryagent/idp-sdk` as the dashboard.

---

### ADR-14: bull (Job Queue) + kafkajs (Event Streaming)

**Status**: Adopted (opt-in)
**Component**: Async job processing and event streaming

**Decision**: Use `bull` (Redis-backed job queue) for async webhook delivery retries and `kafkajs` for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.

**Rationale**: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. `bull` provides this out of the box using the existing Redis dependency. `kafkajs` enables high-throughput event streaming for analytics and audit events to external data pipelines without blocking the primary request path.

**Alternatives considered**:
- BullMQ — considered as a more modern alternative to `bull` but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.

**Consequences**: Kafka is entirely optional. When `KAFKA_BROKERS` is not set, `kafkajs` is not initialised and no events are published. The `bull` queue for webhook delivery requires only the existing Redis instance.

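
With an exponential policy like `bull`'s `{ attempts, backoff: { type: "exponential", delay } }` option, retry N waits roughly `delay × 2^(N−1)` milliseconds. A sketch of the resulting schedule (illustrative helper; `bull` computes this internally):

```typescript
// Sketch: the wait before each retry under an exponential backoff
// policy. `attempts` counts total tries (first attempt included), so
// there are attempts - 1 retries.
function backoffSchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < attempts; attempt++) {
    delays.push(baseDelayMs * 2 ** (attempt - 1));
  }
  return delays;
}
```

For example, four attempts with a 1 s base delay retry after 1 s, 2 s, and 4 s; jobs that exhaust all attempts move to the failed set for dead-letter handling.
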
---

### ADR-15: did-resolver + web-did-resolver (W3C DIDs)

**Status**: Adopted
**Component**: W3C DID Core 1.0 document resolution

**Decision**: Use `did-resolver` (v4.1.x) as the DID resolution framework and `web-did-resolver` (v2.0.x) for the `did:web` method implementation.

**Rationale**: `did-resolver` provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The `did:web` method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. `DIDService` generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.

**Consequences**: `DID_WEB_DOMAIN` env var is required for DID generation. DID documents are cached in Redis (`did:doc:<agentId>`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a `dev:no-vault` marker is stored and keys are ephemeral.

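
The `did:web` DID-to-URL mapping is simple enough to sketch. Per the `did:web` method spec, colons after the host become path segments, and a bare domain resolves to `/.well-known/did.json` (illustrative helper, not `DIDService` itself; port encoding via `%3A` is omitted here):

```typescript
// Sketch of the did:web resolution mapping:
//   did:web:example.com            -> https://example.com/.well-known/did.json
//   did:web:example.com:agents:a1  -> https://example.com/agents/a1/did.json
function didWebToUrl(did: string): string {
  const prefix = "did:web:";
  if (!did.startsWith(prefix)) throw new Error(`not a did:web DID: ${did}`);
  const [host, ...path] = did.slice(prefix.length).split(":");
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join("/")}/did.json`;
}
```

Because resolution is just an HTTPS fetch of a static JSON document, any party can verify an AgentIdP-issued DID without special infrastructure.
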
|