
Technology Stack and Architecture Decision Records

Every technology choice in AgentIdP was made deliberately. This document records the decision, rationale, and alternatives considered for each major technology. New engineers should read this before making any technology additions or changes — the pattern here is the template for future ADRs.


ADR-1: Node.js 18 LTS

Status: Adopted
Component: AgentIdP server runtime and Node.js SDK runtime

Decision: Use Node.js 18 LTS as the server runtime.

Rationale: Node.js 18 LTS provides native fetch, native ESM support, and a stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL (pg), Redis (redis), JWT (jsonwebtoken), and bcrypt (bcryptjs) is mature and well-maintained on this version. The non-blocking I/O model is well-suited for an IdP that handles many concurrent short-lived authentication requests. The engines.node field in package.json enforces >=18.0.0.

Alternatives considered:

  • Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (pg, bcryptjs), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
  • Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.

Consequences: All Dockerfiles and Terraform ECS/Cloud Run task definitions must target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires CTO approval and a QA sign-off on the full test suite.


ADR-2: TypeScript 5.3 Strict Mode

Status: Adopted
Component: All source files — server, all SDKs, dashboard

Decision: TypeScript 5.3 with strict: true and every additional strictness flag enabled in tsconfig.json.

Rationale: AgentIdP handles authentication tokens and cryptographic secrets. Type errors in this domain can cause security vulnerabilities — a value that should be string | null treated as string can produce silent authentication bypasses. Strict TypeScript with noImplicitAny, strictNullChecks, noUnusedLocals, noUnusedParameters, and noImplicitReturns makes these classes of bug a compile-time error rather than a runtime failure in production.
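A minimal illustration of the class of bug described above (the findToken helper here is hypothetical, not project code):

```typescript
// Hypothetical helper, not project code: extract a bearer token that may
// be absent. Under strictNullChecks the nullable return type must be
// narrowed before use, so the silent-bypass path cannot compile.
function findToken(header: string | undefined): string | null {
  return header !== undefined && header.startsWith("Bearer ")
    ? header.slice(7)
    : null;
}

const token = findToken(undefined);
// token.length;              // compile error: 'token' is possibly 'null'
if (token !== null) {
  console.log(token.length);  // OK: narrowed to string
}
```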

Alternatives considered:

  • Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.

Consequences: All new code must compile cleanly under tsc --strict. Zero any types — ever. No exceptions granted without CTO approval. The tsconfig.json enables noImplicitAny, strictNullChecks, strictFunctionTypes, strictBindCallApply, strictPropertyInitialization, noImplicitThis, alwaysStrict, noUnusedLocals, noUnusedParameters, noImplicitReturns, and noFallthroughCasesInSwitch.


ADR-3: Express 4.18

Status: Adopted
Component: HTTP server framework

Decision: Use Express 4.18 as the HTTP framework.

Rationale: Express is the most widely understood Node.js HTTP framework. Its middleware model ((req, res, next)) maps directly to the IdP's layered architecture: helmet → cors → metricsMiddleware → authMiddleware → opaMiddleware → controller → service → repository → errorHandler. The ecosystem for Express middleware (helmet, cors, morgan) is mature. For a spec-first project, Express's lack of convention about code structure is a feature — the architecture is explicit and fully visible in src/app.ts.

Alternatives considered:

  • Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
  • NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
  • Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.

Consequences: All HTTP concerns (routing, middleware, error handling) use the Express 4 API. The errorHandler middleware must remain the last app.use() call in src/app.ts.


ADR-4: PostgreSQL 14

Status: Adopted
Component: Primary data store for agents, credentials, and audit events

Decision: Use PostgreSQL 14 as the primary relational database.

Rationale: The audit log requires ACID guarantees — partial writes or uncommitted reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's JSONB column type is used for the metadata field in audit_events, allowing structured context data without schema changes for each new event type. PostgreSQL's row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.

Alternatives considered:

  • MySQL — rejected because its JSON support is weaker than PostgreSQL's JSONB with GIN indexing, and its default REPEATABLE READ transaction isolation level has historically produced surprising locking behaviour.
  • MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.

Consequences: All schema changes go through numbered SQL migration files in src/db/migrations/. Migration files are append-only — never modify an existing migration. New tables require a new numbered file (e.g. 005_create_agent_groups.sql).


ADR-5: Redis 7

Status: Adopted
Component: Token revocation list, monthly usage counters, rate-limit sliding window

Decision: Use Redis 7 as the in-memory data store.

Rationale: Token revocation requires O(1) key lookup with TTL-based automatic expiry. SET revoked:{jti} 1 EX {seconds_until_expiry} stores a revocation entry that expires precisely when the token itself would have expired — zero manual cleanup required. The monthly token counter uses Redis INCR, which is atomic and O(1). The rate-limiter uses a Redis sorted set for the sliding-window algorithm.

Alternatives considered:

  • Memcached — rejected because Memcached offers only flat key–value strings and has no sorted-set structure, which the sliding-window rate-limiter requires.
  • PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis GET at sub-millisecond latency.

Consequences: Redis is a required infrastructure dependency. A Redis instance must be running and reachable via REDIS_URL before the server starts. compose.yaml provides a Redis 7.2 Alpine container for local development on port 6379.
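The sliding-window algorithm the Redis sorted set implements can be sketched in-memory (an illustration only — the class name and parameters here are invented; in production the same bookkeeping would be done with sorted-set commands against Redis):

```typescript
// In-memory sketch of a sliding-window rate limiter: keep per-key hit
// timestamps, drop those older than the window, and reject once the
// count reaches the limit.
class SlidingWindow {
  private hits = new Map<string, number[]>();
  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit within the window
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```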


ADR-6: HashiCorp Vault

Status: Adopted (opt-in)
Component: Credential secret storage — alternative to bcrypt in PostgreSQL

Decision: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.

Rationale: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt hashes cannot be reversed, some enterprises require that secrets never touch a relational database — even in hashed form. Vault provides a dedicated secrets management plane with HSM backing and an independent audit trail at the secrets level. The verifySecret method in VaultClient uses crypto.timingSafeEqual to prevent timing-based side-channel attacks when comparing stored and candidate secrets.
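The constant-time comparison can be illustrated with Node's crypto module (a simplified sketch, not the actual VaultClient.verifySecret; hashing both inputs first guarantees the equal-length buffers that timingSafeEqual requires):

```typescript
import { createHash, timingSafeEqual } from "crypto";

// Illustrative sketch: compare a stored secret against a candidate in
// constant time. Hashing normalises both sides to 32-byte buffers, so
// timingSafeEqual never throws on length mismatch and the comparison
// time does not leak how many leading bytes matched.
function secretsMatch(stored: string, candidate: string): boolean {
  const a = createHash("sha256").update(stored).digest();
  const b = createHash("sha256").update(candidate).digest();
  return timingSafeEqual(a, b);
}
```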

Alternatives considered:

  • AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
  • Plain bcrypt only — retained as the fallback path. When VAULT_ADDR is not set, createVaultClientFromEnv() returns null and the server operates identically to Phase 1.

Consequences: Vault is controlled by VAULT_ADDR (required), VAULT_TOKEN (required), and VAULT_MOUNT (optional, defaults to secret). When these are not set, bcrypt is used unchanged. Credential rows carry a nullable vault_path column: null means bcrypt; a non-null path means Vault verification is used.


ADR-7: OPA (Open Policy Agent)

Status: Adopted
Component: Request authorisation — scope enforcement on all protected endpoints

Decision: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.

Rationale: Hard-coded scope checks in middleware would require a code deployment for every policy change. OPA decouples the policy (policies/authz.rego) from the server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via SIGHUP without restarting the server. The @open-policy-agent/opa-wasm package evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm bundle is present (development, CI), the middleware falls back to a TypeScript implementation that reads policies/data/scopes.json.
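The TypeScript fallback path might look roughly like this (a sketch only — the scopes.json shape shown here, mapping "METHOD /path" to required scopes, is an assumption, not the project's actual schema):

```typescript
// Assumed scopes.json shape: { "GET /agents": ["agents:read"], ... }
type ScopeMap = Record<string, string[]>;

// Deny-by-default scope check: an endpoint absent from the map is
// rejected, and every required scope must be present on the token.
function isAuthorized(
  scopes: ScopeMap,
  method: string,
  path: string,
  tokenScopes: string[]
): boolean {
  const required = scopes[`${method} ${path}`];
  if (!required) return false; // unknown endpoint: deny by default
  return required.every((s) => tokenScopes.includes(s));
}
```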

Alternatives considered:

  • Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
  • Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound method + path + scope-intersection pattern AgentIdP requires.

Consequences: All authorisation rules live in policies/authz.rego and policies/data/scopes.json. Adding a new endpoint requires adding its scope requirement to scopes.json. A policy change is deployed by updating scopes.json (or authz.wasm) and sending SIGHUP to the running process — no redeployment needed.


ADR-8: React 18 + Vite 5

Status: Adopted
Component: Web dashboard SPA (dashboard/)

Decision: Use React 18 with Vite 5 as the web dashboard framework and build tool.

Rationale: React 18's concurrent rendering model handles the dashboard's async data fetching patterns cleanly. The @sentryagent/idp-sdk Node.js package is reused directly in the dashboard via TokenManager for authentication, avoiding duplicated API client code. Vite 5 provides sub-second HMR in development and a fast production build with tree-shaking. The dashboard is built to dashboard/dist/ and served as static files from Express at /dashboard, keeping the deployment footprint to a single container.

Alternatives considered:

  • Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
  • Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.

Consequences: The dashboard must be built (cd dashboard && npm run build) before Express can serve it. In local development, run cd dashboard && npm run dev to use Vite's dev server with HMR; the Vite proxy forwards /api/ calls to Express at localhost:3000.
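The development proxy described above might be configured like this in dashboard/vite.config.ts (a sketch under assumptions — the actual file may set additional options):

```typescript
import { defineConfig } from "vite";

// Forward /api calls from Vite's dev server to Express on port 3000,
// so the dashboard can use the SDK unchanged in development.
export default defineConfig({
  server: {
    proxy: {
      "/api": { target: "http://localhost:3000", changeOrigin: true },
    },
  },
});
```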


ADR-9: Prometheus + Grafana

Status: Adopted
Component: Operational metrics collection and visualisation

Decision: Use Prometheus for metrics collection and Grafana for dashboards.

Rationale: Prometheus is the industry standard for metrics in container environments. The prom-client npm package integrates natively with Express and provides Counter and Histogram metric types that cover all observability needs for AgentIdP. Grafana's YAML provisioning in monitoring/grafana/provisioning/ makes dashboards reproducible and version-controlled. The monitoring stack runs as a Docker Compose overlay (compose.monitoring.yaml) without interfering with the base dev environment.

Alternatives considered:

  • Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
  • StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by method, route, and status_code.

Consequences: All metric definitions live exclusively in src/metrics/registry.ts. No other file may instantiate a Counter or Histogram — all other files import specific metrics from that registry. Grafana is available at port 3001 when the monitoring overlay is running.


ADR-10: Terraform

Status: Adopted
Component: Infrastructure as code — multi-region AWS + GCP deployment

Decision: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.

Rationale: Terraform's HCL syntax is readable and its provider ecosystem covers both AWS and GCP with the same toolchain. Reusable modules in terraform/modules/ (agentidp, lb, rds, redis) are composed in environment-specific configurations under terraform/environments/aws/ and terraform/environments/gcp/. All infrastructure changes go through terraform plan review before terraform apply, providing a diff-based approval workflow.

Alternatives considered:

  • Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
  • AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.

Consequences: All infrastructure changes must go through Terraform. No manual edits via the AWS console or GCP console are permitted — they will be overwritten on the next terraform apply. Terraform state is stored in a remote backend and must not be edited manually.


ADR-11: Stripe

Status: Adopted
Component: Billing — subscription management and payment processing

Decision: Use Stripe as the payment processing and subscription management platform. The stripe npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.

Rationale: Stripe's hosted Checkout flow eliminates the need to handle PCI-DSS scope for card data. The stripe.webhooks.constructEvent() method uses HMAC-SHA256 to verify incoming webhook payloads, preventing replay attacks. The checkout.session.completed event carries metadata: { orgId, targetTier }, allowing BillingService to delegate tier upgrades to TierService.applyUpgrade() without coupling billing logic to tier logic.

Alternatives considered:

  • Paddle — rejected because its global merchant-of-record model introduced complexities with the open-source free tier.
  • Braintree — rejected because Stripe's webhook reliability and developer experience are superior.

Consequences: Stripe requires STRIPE_SECRET_KEY (for API calls) and STRIPE_WEBHOOK_SECRET (whsec_..., for webhook verification). Per-tier Stripe price IDs are configured via STRIPE_PRICE_ID_PRO and STRIPE_PRICE_ID_ENTERPRISE. All billing webhook handlers must pass the raw Buffer body (not parsed JSON) to stripe.webhooks.constructEvent() — use express.raw() middleware on the webhook route.
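The signature scheme behind webhook verification can be illustrated with a simplified HMAC sketch (illustrative only — in the app, always pass the raw body to stripe.webhooks.constructEvent() rather than reimplementing this):

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Simplified illustration: sign "{timestamp}.{payload}" with HMAC-SHA256
// and compare in constant time. Stripe's library additionally enforces a
// timestamp tolerance to reject replayed events.
function sign(payload: string, secret: string, timestamp: number): string {
  return createHmac("sha256", secret)
    .update(`${timestamp}.${payload}`)
    .digest("hex");
}

function verify(
  payload: string,
  sig: string,
  secret: string,
  timestamp: number
): boolean {
  const expected = sign(payload, secret, timestamp);
  return timingSafeEqual(Buffer.from(expected), Buffer.from(sig));
}
```

Note why the raw Buffer matters: re-serialising parsed JSON can reorder keys or change whitespace, producing a different byte sequence than the one Stripe signed.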


ADR-12: oidc-provider (A2A Delegation)

Status: Adopted
Component: A2A delegation — OIDC provider for agent-to-agent trust tokens

Decision: Use the oidc-provider npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at /oidc within the Express app.

Rationale: oidc-provider is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, token endpoint, and discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at /oidc/jwks.

Alternatives considered:

  • Custom JWT signing — rejected because hand-rolled token formats cannot benefit from OIDC tooling and interoperability.

Consequences: A2A_ENABLED env var gates the OIDC provider — when set to 'false', delegation endpoints return 404. The OIDC_ISSUER env var must be set to the full base URL of the OIDC provider (e.g. https://api.sentryagent.ai).


ADR-13: Next.js 14 (Developer Portal)

Status: Adopted
Component: Developer Portal (portal/) — public-facing documentation and onboarding

Decision: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).

Rationale: The developer portal has different performance and SEO requirements than the internal operator dashboard (dashboard/). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.

Alternatives considered:

  • Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
  • Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.

Consequences: The portal (portal/) has its own package.json, tsconfig.json, tailwind.config.ts, and next.config.js. It is built and run independently: cd portal && npm install && npm run dev. The portal calls the AgentIdP REST API using the same @sentryagent/idp-sdk as the dashboard.


ADR-14: bull (Job Queue) + kafkajs (Event Streaming)

Status: Adopted (opt-in)
Component: Async job processing and event streaming

Decision: Use bull (Redis-backed job queue) for async webhook delivery retries and kafkajs for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.

Rationale: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. bull provides this out of the box using the existing Redis dependency. kafkajs enables high-throughput event streaming for analytics and audit events to external data pipelines without blocking the primary request path.
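Exponential backoff with a cap reduces to a small formula (the base delay and cap below are illustrative, not the project's configured values):

```typescript
// Delay before the Nth retry attempt: double each time, capped so a
// long-failing webhook endpoint is not retried with unbounded waits.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```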

Alternatives considered:

  • BullMQ — considered as a more modern alternative to bull but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.

Consequences: Kafka is entirely optional. When KAFKA_BROKERS is not set, kafkajs is not initialised and no events are published. The bull queue for webhook delivery requires only the existing Redis instance.


ADR-15: did-resolver + web-did-resolver (W3C DIDs)

Status: Adopted
Component: W3C DID Core 1.0 document resolution

Decision: Use did-resolver (v4.1.x) as the DID resolution framework and web-did-resolver (v2.0.x) for the did:web method implementation.

Rationale: did-resolver provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The did:web method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. DIDService generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.
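The did:web identifier-to-URL mapping works as follows (a sketch of the transformation defined by the W3C did:web specification — in practice web-did-resolver performs this resolution):

```typescript
// did:web mapping: colons after the domain become URL path segments; a
// bare domain resolves to /.well-known/did.json. The domain segment is
// percent-decoded so encoded ports (e.g. %3A) survive.
function didWebToUrl(did: string): string {
  const parts = did.replace(/^did:web:/, "").split(":");
  const domain = decodeURIComponent(parts[0]);
  if (parts.length === 1) {
    return `https://${domain}/.well-known/did.json`;
  }
  return `https://${domain}/${parts.slice(1).join("/")}/did.json`;
}
```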

Consequences: DID_WEB_DOMAIN env var is required for DID generation. DID documents are cached in Redis (did:doc:<agentId>, TTL from DID_DOCUMENT_CACHE_TTL_SECONDS, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a dev:no-vault marker is stored and keys are ephemeral.