Technology Stack and Architecture Decision Records
Every technology choice in AgentIdP was made deliberately. This document records the decision, rationale, and alternatives considered for each major technology. New engineers should read this before making any technology additions or changes — the pattern here is the template for future ADRs.
ADR-1: Node.js 18 LTS
Status: Adopted
Component: AgentIdP server runtime and Node.js SDK runtime
Decision: Use Node.js 18 LTS as the server runtime.
Rationale: Node.js 18 LTS provides native fetch, native ESM support, and a
stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL
(pg), Redis (redis), JWT (jsonwebtoken), and bcrypt (bcryptjs) is mature and
well-maintained on this version. The non-blocking I/O model is well-suited for an IdP
that handles many concurrent short-lived authentication requests. The engines.node
field in package.json enforces >=18.0.0.
Alternatives considered:
- Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (pg, bcryptjs), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.
Consequences: All Dockerfiles and Terraform ECS/Cloud Run task definitions must target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires CTO approval and a QA sign-off on the full test suite.
ADR-2: TypeScript 5.3 Strict Mode
Status: Adopted
Component: All source files — server, all SDKs, dashboard
Decision: TypeScript 5.3 with strict: true and every additional strictness flag enabled in tsconfig.json.
Rationale: AgentIdP handles authentication tokens and cryptographic secrets. Type
errors in this domain can cause security vulnerabilities — a value that should be
string | null treated as string can produce silent authentication bypasses. Strict
TypeScript with noImplicitAny, strictNullChecks, noUnusedLocals, noUnusedParameters,
and noImplicitReturns makes these classes of bug a compile-time error rather than a
runtime failure in production.
Alternatives considered:
- Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.
Consequences: All new code must compile cleanly under tsc --strict. Zero any
types — ever. No exceptions granted without CTO approval. The tsconfig.json enables
noImplicitAny, strictNullChecks, strictFunctionTypes, strictBindCallApply,
strictPropertyInitialization, noImplicitThis, alwaysStrict, noUnusedLocals,
noUnusedParameters, noImplicitReturns, and noFallthroughCasesInSwitch.
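The class of bug the rationale describes can be made concrete. The sketch below is illustrative (the function names are hypothetical, not from the codebase): under strictNullChecks, a `string | null` value cannot flow into a `string` parameter without explicit narrowing, so the "silent authentication bypass" becomes a compile error.

```typescript
// Hypothetical sketch: strictNullChecks forces explicit handling of null.
function issueToken(subject: string): string {
  return `token-for-${subject}`;
}

function lookupSubject(id: string): string | null {
  // Returns null when the agent is unknown (illustrative stub).
  return id === "agent-1" ? "agent-1" : null;
}

const subject = lookupSubject("agent-1");
// issueToken(subject);            // compile error: 'string | null' is not 'string'
if (subject !== null) {
  console.log(issueToken(subject)); // OK: narrowed to string by the null check
}
```

Without strictNullChecks, the commented-out call would compile and fail only at runtime — exactly the failure mode this ADR is designed to prevent.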
ADR-3: Express 4.18
Status: Adopted
Component: HTTP server framework
Decision: Use Express 4.18 as the HTTP framework.
Rationale: Express is the most widely understood Node.js HTTP framework. Its
middleware model ((req, res, next)) maps directly to the IdP's layered architecture:
helmet → cors → metricsMiddleware → authMiddleware → opaMiddleware →
controller → service → repository → errorHandler. The ecosystem for Express
middleware (helmet, cors, morgan) is mature. For a spec-first project, Express's
lack of convention about code structure is a feature — the architecture is explicit and
fully visible in src/app.ts.
Alternatives considered:
- Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.
Consequences: All HTTP concerns (routing, middleware, error handling) use the
Express 4 API. The errorHandler middleware must remain the last app.use() call in
src/app.ts.
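The layered chain in the rationale can be sketched without Express itself. The types below are simplified stand-ins for the real Express API, and the middleware names are trimmed to three for brevity; what the sketch shows is the `(req, res, next)` contract and why the error handler must sit last — `next(err)` short-circuits the rest of the chain into it.

```typescript
// Minimal stand-in types for the Express (req, res, next) contract.
type Req = { path: string; log: string[] };
type Res = { status?: number };
type Next = (err?: Error) => void;
type Middleware = (req: Req, res: Res, next: Next) => void;
type ErrorHandler = (err: Error, req: Req, res: Res) => void;

// Walk the chain in order; any next(err) jumps straight to the error handler.
function run(chain: Middleware[], onError: ErrorHandler, req: Req, res: Res): void {
  let i = 0;
  const next: Next = (err?: Error) => {
    if (err) return onError(err, req, res);
    const mw = chain[i++];
    if (mw) mw(req, res, next);
  };
  next();
}

const metrics: Middleware = (req, _res, next) => { req.log.push("metrics"); next(); };
const auth: Middleware = (req, _res, next) => { req.log.push("auth"); next(); };
const controller: Middleware = (req, res, next) => {
  req.log.push("controller");
  if (req.path === "/boom") return next(new Error("boom"));
  res.status = 200;
};
const errorHandler: ErrorHandler = (_err, _req, res) => { res.status = 500; };

const req: Req = { path: "/agents", log: [] };
const res: Res = {};
run([metrics, auth, controller], errorHandler, req, res);
// req.log is ["metrics", "auth", "controller"]; res.status is 200
```

This is also why the errorHandler must remain the last `app.use()` call: in real Express, a four-argument middleware registered earlier would intercept errors before later layers had a chance to run.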
ADR-4: PostgreSQL 14
Status: Adopted
Component: Primary data store for agents, credentials, and audit events
Decision: Use PostgreSQL 14 as the primary relational database.
Rationale: The audit log requires ACID guarantees — partial writes or uncommitted
reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's
JSONB column type is used for the metadata field in audit_events, allowing
structured context data without schema changes for each new event type. PostgreSQL's
row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.
Alternatives considered:
- MySQL — rejected because its JSON support is weaker than PostgreSQL's JSONB with GIN indexing, and its default transaction isolation level has historically produced surprises.
- MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.
Consequences: All schema changes go through numbered SQL migration files in
src/db/migrations/. Migration files are append-only — never modify an existing
migration. New tables require a new numbered file (e.g. 005_create_agent_groups.sql).
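The append-only numbered-file convention implies a simple runner rule: sort by leading number, skip what has already been applied. A sketch (the filenames other than `005_create_agent_groups.sql` are invented for illustration; the real migration runner may work differently):

```typescript
// Hypothetical sketch of the numbered-migration convention: pending migrations
// are the numbered files not yet applied, run strictly in numeric order.
function pendingMigrations(allFiles: string[], applied: string[]): string[] {
  const num = (f: string) => parseInt(f.split("_")[0], 10);
  const done = new Set(applied);
  return [...allFiles]
    .sort((a, b) => num(a) - num(b))
    .filter((f) => !done.has(f));
}

const files = [
  "003_add_scopes.sql",          // invented example name
  "001_create_agents.sql",       // invented example name
  "005_create_agent_groups.sql", // name from this ADR
  "002_create_credentials.sql",  // invented example name
];
console.log(pendingMigrations(files, ["001_create_agents.sql", "002_create_credentials.sql"]));
// ["003_add_scopes.sql", "005_create_agent_groups.sql"]
```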
ADR-5: Redis 7
Status: Adopted
Component: Token revocation list, monthly usage counters, rate-limit sliding window
Decision: Use Redis 7 as the in-memory data store.
Rationale: Token revocation requires O(1) key lookup with TTL-based automatic
expiry. SET revoked:{jti} 1 EX {seconds_until_expiry} stores a revocation entry
that expires precisely when the token itself would have expired — zero manual cleanup
required. The monthly token counter uses Redis INCR, which is atomic and O(1). The
rate-limiter uses a Redis sorted set for the sliding-window algorithm.
Alternatives considered:
- Memcached — rejected because Memcached offers only simple key-value entries with TTL and has no sorted-set structure, which the sliding-window rate-limiter requires.
- PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis GET at sub-millisecond latency.
Consequences: Redis is a required infrastructure dependency. A Redis instance must
be running and reachable via REDIS_URL before the server starts. docker-compose.yml
provides a Redis 7 Alpine container for local development on port 6379.
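The sliding-window algorithm the rate-limiter runs against a Redis sorted set can be sketched in memory. This is an illustrative stand-in, not the production code: the Redis commands it mirrors are noted in comments, and the limit and window values are invented for the example.

```typescript
// In-memory sketch of the sorted-set sliding-window rate-limit algorithm.
class SlidingWindow {
  private hits: number[] = []; // timestamps in ms (stands in for a Redis ZSET)

  constructor(private limit: number, private windowMs: number) {}

  allow(nowMs: number): boolean {
    // Drop entries that have slid out of the window (Redis: ZREMRANGEBYSCORE).
    this.hits = this.hits.filter((t) => t > nowMs - this.windowMs);
    // Count what remains (Redis: ZCARD) and deny if the window is full.
    if (this.hits.length >= this.limit) return false;
    // Record this request (Redis: ZADD with the timestamp as score).
    this.hits.push(nowMs);
    return true;
  }
}

const rl = new SlidingWindow(2, 1000); // 2 requests per second (assumed values)
rl.allow(0);    // true
rl.allow(100);  // true
rl.allow(200);  // false — limit reached inside the window
rl.allow(1200); // true — the earlier hits have expired out of the window
```

The same three-step shape (trim expired scores, count, add) is why a sorted set is required and a plain key-value store is not enough.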
ADR-6: HashiCorp Vault
Status: Adopted (opt-in)
Component: Credential secret storage — alternative to bcrypt in PostgreSQL
Decision: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.
Rationale: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt
hashes cannot be reversed, some enterprises require that secrets never touch a relational
database — even in hashed form. Vault provides a dedicated secrets management plane with
HSM backing and an independent audit trail at the secrets level. The verifySecret
method in VaultClient uses crypto.timingSafeEqual to prevent timing-based
side-channel attacks when comparing stored and candidate secrets.
Alternatives considered:
- AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only — retained as the fallback path. When VAULT_ADDR is not set, createVaultClientFromEnv() returns null and the server operates identically to Phase 1.
Consequences: Vault is controlled by VAULT_ADDR (required), VAULT_TOKEN
(required), and VAULT_MOUNT (optional, defaults to secret). When these are not set,
bcrypt is used unchanged. Credential rows carry a nullable vault_path column: null
means bcrypt; a non-null path means Vault verification is used.
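The timing-safe comparison mentioned in the rationale can be sketched as follows. This is a hedged illustration of the technique, not the actual VaultClient.verifySecret body: hashing both sides to a fixed-length digest first is one common way to use crypto.timingSafeEqual (which throws on unequal-length buffers) when the candidate's length may differ.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Illustrative constant-time secret comparison in the spirit of
// VaultClient.verifySecret; the real implementation may differ.
function secretsMatch(stored: string, candidate: string): boolean {
  // SHA-256 digests are always 32 bytes, so timingSafeEqual's
  // equal-length requirement is satisfied regardless of input lengths.
  const a = createHash("sha256").update(stored).digest();
  const b = createHash("sha256").update(candidate).digest();
  return timingSafeEqual(a, b); // constant-time byte comparison
}

secretsMatch("s3cret", "s3cret"); // true
secretsMatch("s3cret", "wrong");  // false
```

A naive `stored === candidate` comparison can leak how many leading characters match through response timing; timingSafeEqual compares every byte regardless.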
ADR-7: OPA (Open Policy Agent)
Status: Adopted
Component: Request authorisation — scope enforcement on all protected endpoints
Decision: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.
Rationale: Hard-coded scope checks in middleware would require a code deployment
for every policy change. OPA decouples the policy (policies/authz.rego) from the
server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via
SIGHUP without restarting the server. The @open-policy-agent/opa-wasm package
evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm
bundle is present (development, CI), the middleware falls back to a TypeScript
implementation that reads policies/data/scopes.json.
Alternatives considered:
- Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound method + path + scope-intersection pattern AgentIdP requires.
Consequences: All authorisation rules live in policies/authz.rego and
policies/data/scopes.json. Adding a new endpoint requires adding its scope
requirement to scopes.json. A policy change is deployed by updating scopes.json
(or authz.wasm) and sending SIGHUP to the running process — no redeployment needed.
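The TypeScript fallback path can be sketched as a lookup over a scopes.json-style map. The key format ("METHOD path") and the scope names below are assumptions made for the example, not the repository's actual schema:

```typescript
// Hypothetical sketch of the scopes.json fallback: an endpoint is authorised
// when the token's scopes cover every scope the endpoint requires.
type ScopeMap = Record<string, string[]>;

const scopeMap: ScopeMap = {
  "GET /api/agents": ["agents:read"],   // assumed entries, for illustration
  "POST /api/agents": ["agents:write"],
};

function isAuthorised(method: string, path: string, tokenScopes: string[], map: ScopeMap): boolean {
  const required = map[`${method} ${path}`];
  if (!required) return false; // unknown endpoint: deny by default
  return required.every((scope) => tokenScopes.includes(scope));
}

isAuthorised("GET", "/api/agents", ["agents:read"], scopeMap);  // true
isAuthorised("POST", "/api/agents", ["agents:read"], scopeMap); // false
```

This also makes the consequence above concrete: adding a new endpoint without a scopes.json entry fails closed rather than open.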
ADR-8: React 18 + Vite 5
Status: Adopted
Component: Web dashboard SPA (dashboard/)
Decision: Use React 18 with Vite 5 as the web dashboard framework and build tool.
Rationale: React 18's concurrent rendering model handles the dashboard's async data
fetching patterns cleanly. The @sentryagent/idp-sdk Node.js package is reused
directly in the dashboard via TokenManager for authentication, avoiding duplicated
API client code. Vite 5 provides sub-second HMR in development and a fast production
build with tree-shaking. The dashboard is built to dashboard/dist/ and served as
static files from Express at /dashboard, keeping the deployment footprint to a
single container.
Alternatives considered:
- Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.
Consequences: The dashboard must be built (cd dashboard && npm run build) before
Express can serve it. In local development, run cd dashboard && npm run dev to use
Vite's dev server with HMR; the Vite proxy forwards /api/ calls to Express at
localhost:3000.
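The dev-server proxy described above typically lives in vite.config.ts. A minimal sketch under the assumption that the dashboard uses @vitejs/plugin-react — the repository's actual config may differ:

```typescript
// Illustrative vite.config.ts: proxy /api calls to Express during `npm run dev`.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      // Forward /api requests to the Express server on port 3000.
      "/api": { target: "http://localhost:3000", changeOrigin: true },
    },
  },
  build: { outDir: "dist" }, // produces dashboard/dist/ for Express to serve
});
```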
ADR-9: Prometheus + Grafana
Status: Adopted
Component: Operational metrics collection and visualisation
Decision: Use Prometheus for metrics collection and Grafana for dashboards.
Rationale: Prometheus is the industry standard for metrics in container
environments. The prom-client npm package integrates natively with Express and
provides Counter and Histogram metric types that cover all observability needs for
AgentIdP. Grafana's YAML provisioning in monitoring/grafana/provisioning/ makes
dashboards reproducible and version-controlled. The monitoring stack runs as a Docker
Compose overlay (docker-compose.monitoring.yml) without interfering with the base dev
environment.
Alternatives considered:
- Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by method, route, and status_code.
Consequences: All metric definitions live exclusively in src/metrics/registry.ts.
No other file may instantiate a Counter or Histogram — all other files import
specific metrics from that registry. Grafana is available at port 3001 when the
monitoring overlay is running.
ADR-10: Terraform
Status: Adopted
Component: Infrastructure as code — multi-region AWS + GCP deployment
Decision: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.
Rationale: Terraform's HCL syntax is readable and its provider ecosystem covers
both AWS and GCP with the same toolchain. Reusable modules in terraform/modules/
(agentidp, lb, rds, redis) are composed in environment-specific configurations under
terraform/environments/aws/ and terraform/environments/gcp/. All infrastructure
changes go through terraform plan review before terraform apply, providing a
diff-based approval workflow.
Alternatives considered:
- Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.
Consequences: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
terraform apply. Terraform state is stored in a remote backend and must not be edited
manually.