
# Technology Stack and Architecture Decision Records
Every technology choice in AgentIdP was made deliberately. This document records the
decision, rationale, and alternatives considered for each major technology. New engineers
should read this before making any technology additions or changes — the pattern here is
the template for future ADRs.
---
### ADR-1: Node.js 18 LTS
**Status**: Adopted
**Component**: AgentIdP server runtime and Node.js SDK runtime
**Decision**: Use Node.js 18 LTS as the server runtime.
**Rationale**: Node.js 18 LTS provides native `fetch`, native ESM support, and a
stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL
(`pg`), Redis (`redis`), JWT (`jsonwebtoken`), and bcrypt (`bcryptjs`) is mature and
well-maintained on this version. The non-blocking I/O model is well-suited for an IdP
that handles many concurrent short-lived authentication requests. The `engines.node`
field in `package.json` enforces `>=18.0.0`.
**Alternatives considered**:
- Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (`pg`, `bcryptjs`), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.
**Consequences**: All Dockerfiles and Terraform ECS/Cloud Run task definitions must
target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires
CTO approval and a QA sign-off on the full test suite.
---
### ADR-2: TypeScript 5.3 Strict Mode
**Status**: Adopted
**Component**: All source files — server, all SDKs, dashboard
**Decision**: TypeScript 5.3 with `strict: true` and every additional strictness flag enabled in `tsconfig.json`.
**Rationale**: AgentIdP handles authentication tokens and cryptographic secrets. Type
errors in this domain can cause security vulnerabilities — a value that should be
`string | null` treated as `string` can produce silent authentication bypasses. Strict
TypeScript with `noImplicitAny`, `strictNullChecks`, `noUnusedLocals`, `noUnusedParameters`,
and `noImplicitReturns` makes these classes of bug a compile-time error rather than a
runtime failure in production.
**Alternatives considered**:
- Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.
**Consequences**: All new code must compile cleanly under `tsc --strict`. Zero `any`
types — ever. No exceptions granted without CTO approval. The `tsconfig.json` enables
`noImplicitAny`, `strictNullChecks`, `strictFunctionTypes`, `strictBindCallApply`,
`strictPropertyInitialization`, `noImplicitThis`, `alwaysStrict`, `noUnusedLocals`,
`noUnusedParameters`, `noImplicitReturns`, and `noFallthroughCasesInSwitch`.
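The class of bug described above can be made concrete with a short, self-contained sketch (the function names are illustrative, not from the AgentIdP codebase): under `strictNullChecks`, a value typed `string | null` cannot be used as a `string` until the null case is handled.

```typescript
// Illustrative only, not from the AgentIdP codebase. Under strictNullChecks,
// a lookup that can miss must be typed `string | null`, and the compiler
// rejects any use of the value before the null case is handled.
function findTokenSubject(tokens: Map<string, string>, jti: string): string | null {
  return tokens.get(jti) ?? null;
}

function requireSubject(tokens: Map<string, string>, jti: string): string {
  const subject = findTokenSubject(tokens, jti);
  // Without this check, `return subject;` fails to compile:
  // "Type 'string | null' is not assignable to type 'string'."
  if (subject === null) {
    throw new Error(`unknown token: ${jti}`);
  }
  return subject;
}
```

With plain JavaScript the missing check would surface only at runtime, in production, as an authentication path operating on `null`.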
---
### ADR-3: Express 4.18
**Status**: Adopted
**Component**: HTTP server framework
**Decision**: Use Express 4.18 as the HTTP framework.
**Rationale**: Express is the most widely understood Node.js HTTP framework. Its
middleware model (`(req, res, next)`) maps directly to the IdP's layered architecture:
`helmet` → `cors` → `metricsMiddleware` → `authMiddleware` → `opaMiddleware` →
controller → service → repository → `errorHandler`. The ecosystem for Express
middleware (`helmet`, `cors`, `morgan`) is mature. For a spec-first project, Express's
lack of convention about code structure is a feature — the architecture is explicit and
fully visible in `src/app.ts`.
**Alternatives considered**:
- Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.
**Consequences**: All HTTP concerns (routing, middleware, error handling) use the
Express 4 API. The `errorHandler` middleware must remain the last `app.use()` call in
`src/app.ts`.
---
### ADR-4: PostgreSQL 14
**Status**: Adopted
**Component**: Primary data store for agents, credentials, and audit events
**Decision**: Use PostgreSQL 14 as the primary relational database.
**Rationale**: The audit log requires ACID guarantees — partial writes or uncommitted
reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's
`JSONB` column type is used for the `metadata` field in `audit_events`, allowing
structured context data without schema changes for each new event type. PostgreSQL's
row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.
**Alternatives considered**:
- MySQL — rejected because its JSON support is weaker than PostgreSQL's `JSONB` with GIN indexing, and its default transaction isolation level has historically produced surprises.
- MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.
**Consequences**: All schema changes go through numbered SQL migration files in
`src/db/migrations/`. Migration files are append-only — never modify an existing
migration. New tables require a new numbered file (e.g. `005_create_agent_groups.sql`).
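As a concrete illustration of the convention, a hypothetical new migration might look like this (the table and columns are invented for the example; only the numbered-file pattern and the `JSONB` + GIN idiom are taken from this document):

```sql
-- Hypothetical migration sketch (illustrative, not the real schema).
-- File: src/db/migrations/005_create_agent_groups.sql
CREATE TABLE agent_groups (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        TEXT NOT NULL UNIQUE,
    metadata    JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- A GIN index supports containment queries on the JSONB metadata,
-- mirroring the audit_events metadata pattern described above.
CREATE INDEX agent_groups_metadata_gin ON agent_groups USING GIN (metadata);
```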
---
### ADR-5: Redis 7
**Status**: Adopted
**Component**: Token revocation list, monthly usage counters, rate-limit sliding window
**Decision**: Use Redis 7 as the in-memory data store.
**Rationale**: Token revocation requires O(1) key lookup with TTL-based automatic
expiry. `SET revoked:{jti} 1 EX {seconds_until_expiry}` stores a revocation entry
that expires precisely when the token itself would have expired — zero manual cleanup
required. The monthly token counter uses Redis `INCR`, which is atomic and O(1). The
rate-limiter uses a Redis sorted set for the sliding-window algorithm.
**Alternatives considered**:
- Memcached — rejected because Memcached offers only flat key-value storage and has no sorted-set data structure, which the sliding-window rate-limiter requires.
- PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis `GET` at sub-millisecond latency.
**Consequences**: Redis is a required infrastructure dependency. A Redis instance must
be running and reachable via `REDIS_URL` before the server starts. `compose.yaml`
provides a Redis 7.2 Alpine container for local development on port 6379.
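The revocation write described above can be sketched as a small pure function that derives the key and TTL from a decoded token payload (the names are illustrative; the real implementation may differ):

```typescript
// Illustrative sketch, not the actual AgentIdP revocation code.
// Derives the Redis key and TTL for `SET revoked:{jti} 1 EX {ttl}`
// from a decoded JWT payload.
interface JwtPayload {
  jti: string; // unique token id
  exp: number; // expiry, seconds since epoch
}

function buildRevocationEntry(payload: JwtPayload, nowMs: number = Date.now()) {
  // TTL matches the token's remaining lifetime, so the revocation
  // entry expires exactly when the token would have expired anyway.
  const ttlSeconds = Math.max(1, payload.exp - Math.floor(nowMs / 1000));
  return { key: `revoked:${payload.jti}`, value: "1", ttlSeconds };
}
```

The actual write would then be a single `client.set(key, value, { EX: ttlSeconds })` call against the shared Redis client.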
---
### ADR-6: HashiCorp Vault
**Status**: Adopted (opt-in)
**Component**: Credential secret storage — alternative to bcrypt in PostgreSQL
**Decision**: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.
**Rationale**: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt
hashes cannot be reversed, some enterprises require that secrets never touch a relational
database — even in hashed form. Vault provides a dedicated secrets management plane with
HSM backing and an independent audit trail at the secrets level. The `verifySecret`
method in `VaultClient` uses `crypto.timingSafeEqual` to prevent timing-based
side-channel attacks when comparing stored and candidate secrets.
**Alternatives considered**:
- AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only — retained as the fallback path. When `VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the server operates identically to Phase 1.
**Consequences**: Vault is controlled by `VAULT_ADDR` (required), `VAULT_TOKEN`
(required), and `VAULT_MOUNT` (optional, defaults to `secret`). When these are not set,
bcrypt is used unchanged. Credential rows carry a nullable `vault_path` column: `null`
means bcrypt; a non-null path means Vault verification is used.
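The constant-time comparison mentioned above can be sketched with Node's `crypto` module. This is an illustrative pattern, not the actual `VaultClient.verifySecret` implementation; hashing both sides to a fixed length is one common way to satisfy `timingSafeEqual`'s equal-length requirement:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Illustrative sketch only. timingSafeEqual throws if buffer lengths
// differ, so both inputs are first hashed to a fixed 32-byte digest;
// the comparison itself then runs in constant time regardless of
// where the secrets diverge.
function secretsMatch(stored: string, candidate: string): boolean {
  const a = createHash("sha256").update(stored).digest();
  const b = createHash("sha256").update(candidate).digest();
  return timingSafeEqual(a, b);
}
```

A naive `stored === candidate` comparison short-circuits at the first differing byte, which is exactly the timing side channel this guards against.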
---
### ADR-7: OPA (Open Policy Agent)
**Status**: Adopted
**Component**: Request authorisation — scope enforcement on all protected endpoints
**Decision**: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.
**Rationale**: Hard-coded scope checks in middleware would require a code deployment
for every policy change. OPA decouples the policy (`policies/authz.rego`) from the
server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via
`SIGHUP` without restarting the server. The `@open-policy-agent/opa-wasm` package
evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm
bundle is present (development, CI), the middleware falls back to a TypeScript
implementation that reads `policies/data/scopes.json`.
**Alternatives considered**:
- Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound `method + path + scope-intersection` pattern AgentIdP requires.
**Consequences**: All authorisation rules live in `policies/authz.rego` and
`policies/data/scopes.json`. Adding a new endpoint requires adding its scope
requirement to `scopes.json`. A policy change is deployed by updating `scopes.json`
(or `authz.wasm`) and sending `SIGHUP` to the running process — no redeployment needed.
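A minimal sketch of the TypeScript fallback path, assuming a `scopes.json` shape that maps `"METHOD /path"` keys to lists of required scopes (the shape is an assumption for illustration, not copied from the repository):

```typescript
// Hypothetical sketch of the development/CI fallback described above.
// Assumed scopes.json shape: { "GET /agents": ["agents:read"], ... }
type ScopeRules = Record<string, string[]>;

function isAllowed(
  rules: ScopeRules,
  method: string,
  path: string,
  tokenScopes: string[],
): boolean {
  const required = rules[`${method} ${path}`];
  if (required === undefined) return false; // deny endpoints with no rule
  // Grant when the token's scopes intersect the required scopes.
  return required.some((scope) => tokenScopes.includes(scope));
}
```

The Wasm path evaluates the same decision in Rego; keeping the fallback's semantics identical to `authz.rego` is what makes the bundle optional in development.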
---
### ADR-8: React 18 + Vite 5
**Status**: Adopted
**Component**: Web dashboard SPA (`dashboard/`)
**Decision**: Use React 18 with Vite 5 as the web dashboard framework and build tool.
**Rationale**: React 18's concurrent rendering model handles the dashboard's async data
fetching patterns cleanly. The `@sentryagent/idp-sdk` Node.js package is reused
directly in the dashboard via `TokenManager` for authentication, avoiding duplicated
API client code. Vite 5 provides sub-second HMR in development and a fast production
build with tree-shaking. The dashboard is built to `dashboard/dist/` and served as
static files from Express at `/dashboard`, keeping the deployment footprint to a
single container.
**Alternatives considered**:
- Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.
**Consequences**: The dashboard must be built (`cd dashboard && npm run build`) before
Express can serve it. In local development, run `cd dashboard && npm run dev` to use
Vite's dev server with HMR; the Vite proxy forwards `/api/` calls to Express at
`localhost:3000`.
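For reference, a hypothetical `dashboard/vite.config.ts` implementing the proxy described above might look like this (the actual file may differ):

```typescript
// Hypothetical dashboard/vite.config.ts sketch; the real config may differ.
// Forwards /api/ calls from Vite's dev server to Express on localhost:3000.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      "/api": { target: "http://localhost:3000", changeOrigin: true },
    },
  },
});
```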
---
### ADR-9: Prometheus + Grafana
**Status**: Adopted
**Component**: Operational metrics collection and visualisation
**Decision**: Use Prometheus for metrics collection and Grafana for dashboards.
**Rationale**: Prometheus is the industry standard for metrics in container
environments. The `prom-client` npm package integrates natively with Express and
provides `Counter` and `Histogram` metric types that cover all observability needs for
AgentIdP. Grafana's YAML provisioning in `monitoring/grafana/provisioning/` makes
dashboards reproducible and version-controlled. The monitoring stack runs as a Docker
Compose overlay (`compose.monitoring.yaml`) without interfering with the base dev
environment.
**Alternatives considered**:
- Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by `method`, `route`, and `status_code`.
**Consequences**: All metric definitions live exclusively in `src/metrics/registry.ts`.
No other file may instantiate a `Counter` or `Histogram` — all other files import
specific metrics from that registry. Grafana is available at port 3001 when the
monitoring overlay is running.
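One practical consequence of labelling by `route` is cardinality control: Prometheus creates one time series per unique label value, so raw IDs must never appear in the `route` label. A hypothetical helper (not from `src/metrics/registry.ts`) shows the idea:

```typescript
// Hypothetical helper, not from the actual registry: collapse dynamic
// path segments (UUIDs, numeric IDs) so the `route` label stays
// low-cardinality. Without this, every agent ID would create a new
// Prometheus time series.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizeRoute(path: string): string {
  return path
    .split("/")
    .map((seg) => (UUID_RE.test(seg) || /^\d+$/.test(seg) ? ":id" : seg))
    .join("/");
}
```

In practice Express route patterns (`req.route.path`) already give normalised values; a helper like this matters when labelling from the raw URL.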
---
### ADR-10: Terraform
**Status**: Adopted
**Component**: Infrastructure as code — multi-region AWS + GCP deployment
**Decision**: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.
**Rationale**: Terraform's HCL syntax is readable and its provider ecosystem covers
both AWS and GCP with the same toolchain. Reusable modules in `terraform/modules/`
(agentidp, lb, rds, redis) are composed in environment-specific configurations under
`terraform/environments/aws/` and `terraform/environments/gcp/`. All infrastructure
changes go through `terraform plan` review before `terraform apply`, providing a
diff-based approval workflow.
**Alternatives considered**:
- Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.
**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.
---
### ADR-11: Stripe
**Status**: Adopted
**Component**: Billing — subscription management and payment processing
**Decision**: Use Stripe as the payment processing and subscription management platform. The `stripe` npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.
**Rationale**: Stripe's hosted Checkout flow eliminates the need to handle PCI-DSS scope for card data. The `stripe.webhooks.constructEvent()` method uses HMAC-SHA256 to verify incoming webhook payloads, preventing replay attacks. The `checkout.session.completed` event carries `metadata: { orgId, targetTier }`, allowing `BillingService` to delegate tier upgrades to `TierService.applyUpgrade()` without coupling billing logic to tier logic.
**Alternatives considered**:
- Paddle — rejected because its global merchant-of-record model introduced complexities with the open-source free tier.
- Braintree — rejected because Stripe's webhook reliability and developer experience are superior.
**Consequences**: Stripe requires `STRIPE_SECRET_KEY` (for API calls) and `STRIPE_WEBHOOK_SECRET` (`whsec_...`, for webhook verification). Per-tier Stripe price IDs are configured via `STRIPE_PRICE_ID_PRO` and `STRIPE_PRICE_ID_ENTERPRISE`. All billing webhook handlers must pass the raw `Buffer` body (not parsed JSON) to `stripe.webhooks.constructEvent()` — use `express.raw()` middleware on the webhook route.
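The raw-body requirement can be made concrete by sketching, in plain Node `crypto`, the check that `stripe.webhooks.constructEvent()` performs internally (simplified: no timestamp-tolerance window, single `v1` signature; use the SDK, not this sketch, in real code):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Simplified sketch of Stripe's webhook signature scheme. The
// Stripe-Signature header has the form "t=<timestamp>,v1=<hex hmac>",
// and the HMAC-SHA256 is computed over "<timestamp>.<raw body bytes>".
// This is why the handler must receive the raw Buffer: re-serialising
// a JSON.parse'd body would not reproduce the signed bytes.
function verifyStripeSignature(rawBody: Buffer, sigHeader: string, secret: string): boolean {
  const parts = Object.fromEntries(
    sigHeader.split(",").map((kv) => kv.split("=") as [string, string]),
  );
  const expected = createHmac("sha256", secret)
    .update(`${parts.t}.${rawBody.toString("utf8")}`)
    .digest();
  const given = Buffer.from(parts.v1 ?? "", "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```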
---
### ADR-12: oidc-provider (A2A Delegation)
**Status**: Adopted
**Component**: A2A delegation — OIDC provider for agent-to-agent trust tokens
**Decision**: Use the `oidc-provider` npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at `/oidc` within the Express app.
**Rationale**: `oidc-provider` is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, token endpoint, and discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at `/oidc/jwks`.
**Alternatives considered**:
- Custom JWT signing — rejected because hand-rolled token formats cannot benefit from OIDC tooling and interoperability.
**Consequences**: `A2A_ENABLED` env var gates the OIDC provider — when set to `'false'`, delegation endpoints return 404. The `OIDC_ISSUER` env var must be set to the full base URL of the OIDC provider (e.g. `https://api.sentryagent.ai`).
---
### ADR-13: Next.js 14 (Developer Portal)
**Status**: Adopted
**Component**: Developer Portal (`portal/`) — public-facing documentation and onboarding
**Decision**: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).
**Rationale**: The developer portal has different performance and SEO requirements than the internal operator dashboard (`dashboard/`). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.
**Alternatives considered**:
- Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
- Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.
**Consequences**: The portal (`portal/`) has its own `package.json`, `tsconfig.json`, `tailwind.config.ts`, and `next.config.js`. It is built and run independently: `cd portal && npm install && npm run dev`. The portal calls the AgentIdP REST API using the same `@sentryagent/idp-sdk` as the dashboard.
---
### ADR-14: bull (Job Queue) + kafkajs (Event Streaming)
**Status**: Adopted (opt-in)
**Component**: Async job processing and event streaming
**Decision**: Use `bull` (Redis-backed job queue) for async webhook delivery retries and `kafkajs` for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.
**Rationale**: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. `bull` provides this out of the box using the existing Redis dependency. `kafkajs` enables high-throughput event streaming for analytics and audit events to external data pipelines without blocking the primary request path.
**Alternatives considered**:
- BullMQ — considered as a more modern alternative to `bull` but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.
**Consequences**: Kafka is entirely optional. When `KAFKA_BROKERS` is not set, `kafkajs` is not initialised and no events are published. The `bull` queue for webhook delivery requires only the existing Redis instance.
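The retry behaviour can be sketched as the delay formula commonly used for `backoff: { type: "exponential", delay: baseMs }`; the exact semantics should be checked against the `bull` documentation for the version in use:

```typescript
// Assumed formula for bull's exponential backoff: baseMs * 2^(attempt - 1),
// with a cap added here for illustration. Verify against the bull docs
// for the pinned version before relying on exact delays.
function retryDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

Jobs that exhaust their attempts land in the queue's failed set, which serves as the dead-letter destination for webhook deliveries.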
---
### ADR-15: did-resolver + web-did-resolver (W3C DIDs)
**Status**: Adopted
**Component**: W3C DID Core 1.0 document resolution
**Decision**: Use `did-resolver` (v4.1.x) as the DID resolution framework and `web-did-resolver` (v2.0.x) for the `did:web` method implementation.
**Rationale**: `did-resolver` provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The `did:web` method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. `DIDService` generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.
**Consequences**: `DID_WEB_DOMAIN` env var is required for DID generation. DID documents are cached in Redis (`did:doc:<agentId>`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a `dev:no-vault` marker is stored and keys are ephemeral.
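The `did:web` mapping described above follows the method specification: colon-separated segments of the method-specific identifier become URL path segments, and a bare domain resolves to `/.well-known/did.json`. A minimal sketch:

```typescript
// Sketch of the did:web -> HTTPS mapping from the did:web method spec.
// Percent-encoded segments (e.g. a port in "example.com%3A8443") are
// decoded per the spec.
function didWebToUrl(did: string): string {
  if (!did.startsWith("did:web:")) throw new Error(`not a did:web DID: ${did}`);
  const segments = did.slice("did:web:".length).split(":").map(decodeURIComponent);
  const [host, ...path] = segments;
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join("/")}/did.json`;
}
```

This is what makes the method blockchain-free: any third party with an HTTPS client can fetch and verify an AgentIdP-issued DID document.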