# Technology Stack and Architecture Decision Records

Every technology choice in AgentIdP was made deliberately. This document records the
decision, rationale, and alternatives considered for each major technology. New engineers
should read this before making any technology additions or changes; the pattern here is
the template for future ADRs.

---

### ADR-1: Node.js 18 LTS

**Status**: Adopted
**Component**: AgentIdP server runtime and Node.js SDK runtime

**Decision**: Use Node.js 18 LTS as the server runtime.

**Rationale**: Node.js 18 LTS provides native `fetch`, native ESM support, and a
stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL
(`pg`), Redis (`redis`), JWT (`jsonwebtoken`), and bcrypt (`bcryptjs`) is mature and
well-maintained on this version. The non-blocking I/O model is well-suited for an IdP
that handles many concurrent short-lived authentication requests. The `engines.node`
field in `package.json` declares `>=18.0.0` as the minimum supported version.

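
By default npm only warns when the running Node.js version falls outside the `engines` range (unless `engine-strict` is enabled), so a runtime check is one way to make the constraint hard. The sketch below is illustrative only: `meetsMinimumNode` and `MIN_NODE_MAJOR` are hypothetical names, not part of the AgentIdP codebase.

```typescript
// Hypothetical startup guard; names are illustrative, not taken from the codebase.
const MIN_NODE_MAJOR = 18; // mirrors the ">=18.0.0" range in package.json

/** Returns true when a version such as "18.19.0" or "v20.1.0" meets the minimum major. */
function meetsMinimumNode(version: string, minMajor: number = MIN_NODE_MAJOR): boolean {
  const match = /^v?(\d+)\./.exec(version);
  if (match === null) {
    return false; // unparseable versions are treated as unsupported
  }
  return Number(match[1]) >= minMajor;
}
```

At startup the server could call `meetsMinimumNode(process.versions.node)` and exit with a clear message when it returns `false`.
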

**Alternatives considered**:
- Deno: rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (`pg`, `bcryptjs`), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun: rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.

**Consequences**: All Dockerfiles and Terraform ECS/Cloud Run task definitions must
target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires
CTO approval and a QA sign-off on the full test suite.

---

### ADR-2: TypeScript 5.3 Strict Mode

**Status**: Adopted
**Component**: All source files (server, all SDKs, dashboard)

**Decision**: TypeScript 5.3 with `strict: true` and every additional strictness flag enabled in `tsconfig.json`.

**Rationale**: AgentIdP handles authentication tokens and cryptographic secrets. Type
errors in this domain can cause security vulnerabilities: a value that should be
`string | null` treated as `string` can produce silent authentication bypasses. Strict
TypeScript with `noImplicitAny`, `strictNullChecks`, `noUnusedLocals`, `noUnusedParameters`,
and `noImplicitReturns` makes these classes of bug a compile-time error rather than a
runtime failure in production.

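
To make the `string | null` point concrete, here is a minimal sketch of how `strictNullChecks` forces the "not found" case to be handled. The `StoredCredential` type and both functions are hypothetical, and the plain string comparison merely stands in for real secret verification.

```typescript
// Hypothetical types and names, used only to illustrate strictNullChecks.
interface StoredCredential {
  agentId: string;
  secretHash: string;
}

// A lookup that can legitimately find nothing must say so in its return type.
function findCredential(
  store: ReadonlyMap<string, StoredCredential>,
  agentId: string,
): StoredCredential | null {
  return store.get(agentId) ?? null;
}

function authenticate(
  store: ReadonlyMap<string, StoredCredential>,
  agentId: string,
  candidateHash: string,
): boolean {
  const credential = findCredential(store, agentId);
  // Without this check, `credential.secretHash` is a compile error under
  // strictNullChecks, so the missing-credential case cannot be skipped.
  if (credential === null) {
    return false;
  }
  // Simplified comparison for illustration; not how secrets should be verified.
  return credential.secretHash === candidateHash;
}
```

Removing the `null` check makes the file fail to compile rather than fail at runtime, which is the property the rationale above relies on.
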

**Alternatives considered**:
- Plain JavaScript: rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.

**Consequences**: All new code must compile cleanly under `tsc --strict`. `any` types
are not permitted; exceptions require CTO approval. The `tsconfig.json` enables
`noImplicitAny`, `strictNullChecks`, `strictFunctionTypes`, `strictBindCallApply`,
`strictPropertyInitialization`, `noImplicitThis`, `alwaysStrict`, `noUnusedLocals`,
`noUnusedParameters`, `noImplicitReturns`, and `noFallthroughCasesInSwitch`.

---

### ADR-3: Express 4.18

**Status**: Adopted
**Component**: HTTP server framework

**Decision**: Use Express 4.18 as the HTTP framework.

**Rationale**: Express is the most widely understood Node.js HTTP framework. Its
middleware model (`(req, res, next)`) maps directly to the IdP's layered architecture:
`helmet` → `cors` → `metricsMiddleware` → `authMiddleware` → `opaMiddleware` →
controller → service → repository → `errorHandler`. The ecosystem for Express
middleware (`helmet`, `cors`, `morgan`) is mature. For a spec-first project, Express's
lack of convention about code structure is a feature: the architecture is explicit and
fully visible in `src/app.ts`.

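
The ordering described above can be sketched with a small, dependency-free model of the `(req, res, next)` chain. This is not Express and not the contents of `src/app.ts`; the `Req`/`Res` shapes, the `run` helper, and the blanket 401 status are simplifying assumptions made for illustration.

```typescript
// A dependency-free model of the (req, res, next) chain; it is not Express itself.
interface Req { path: string; token?: string; trace: string[] }
interface Res { status: number; body: string }
type Next = (err?: Error) => void;
type Middleware = (req: Req, res: Res, next: Next) => void;
type ErrorHandler = (err: Error, req: Req, res: Res) => void;

function run(chain: Middleware[], onError: ErrorHandler, req: Req): Res {
  const res: Res = { status: 200, body: "" };
  const step = (index: number, err?: Error): void => {
    if (err !== undefined) {
      onError(err, req, res); // errors skip all remaining middleware
      return;
    }
    const current = chain[index];
    if (current === undefined) {
      return; // end of chain
    }
    try {
      current(req, res, (nextErr) => step(index + 1, nextErr));
    } catch (thrown) {
      onError(thrown instanceof Error ? thrown : new Error(String(thrown)), req, res);
    }
  };
  step(0);
  return res;
}

// Stand-ins that only record their position in the chain.
const named = (name: string): Middleware => (req, _res, next) => {
  req.trace.push(name);
  next();
};

const authMiddleware: Middleware = (req, _res, next) => {
  req.trace.push("authMiddleware");
  next(req.token === undefined ? new Error("missing token") : undefined);
};

const controller: Middleware = (req, res) => {
  req.trace.push("controller");
  res.body = "ok";
};

// Registered last, so it sees errors raised anywhere earlier in the chain.
const errorHandler: ErrorHandler = (err, _req, res) => {
  res.status = 401; // simplification: every error maps to 401 in this sketch
  res.body = err.message;
};

const chain: Middleware[] = [
  named("helmet"),
  named("cors"),
  named("metricsMiddleware"),
  authMiddleware,
  named("opaMiddleware"),
  controller,
];
```

A request without a token stops at `authMiddleware` and is answered by `errorHandler`, which is why the handler has to be registered after everything else.
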

**Alternatives considered**:
- Fastify: rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS: rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa: rejected because its ecosystem is smaller and fewer engineers are familiar with it.

**Consequences**: All HTTP concerns (routing, middleware, error handling) use the
Express 4 API. The `errorHandler` middleware must remain the last `app.use()` call in
`src/app.ts`.

---

### ADR-4: PostgreSQL 14

**Status**: Adopted
**Component**: Primary data store for agents, credentials, and audit events

**Decision**: Use PostgreSQL 14 as the primary relational database.

**Rationale**: The audit log requires ACID guarantees: partial writes or uncommitted
reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's
`JSONB` column type is used for the `metadata` field in `audit_events`, allowing
structured context data without schema changes for each new event type. PostgreSQL's
row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.

**Alternatives considered**:
- MySQL: rejected because its JSON support is weaker than PostgreSQL's `JSONB` with GIN indexing, and its default transaction isolation level (`REPEATABLE READ` under InnoDB) has historically produced surprises.
- MongoDB: rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.

**Consequences**: All schema changes go through numbered SQL migration files in
`src/db/migrations/`. Migration files are append-only: never modify an existing
migration. New tables require a new numbered file (e.g. `005_create_agent_groups.sql`).

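
As an illustration of the append-only numbering rule, the hypothetical helper below derives the next file name from the names already present. The three-digit prefix is inferred from the `005_create_agent_groups.sql` example; the helper itself is an assumption and does not exist in the repository.

```typescript
// Hypothetical helper; the three-digit prefix follows the 005_create_agent_groups.sql example.
function nextMigrationName(existing: readonly string[], description: string): string {
  const highest = existing.reduce((max, file) => {
    const match = /^(\d{3})_[a-z0-9_]+\.sql$/.exec(file);
    // Files that do not follow the naming pattern are ignored.
    return match === null ? max : Math.max(max, Number(match[1]));
  }, 0);
  const prefix = ("000" + String(highest + 1)).slice(-3);
  return prefix + "_" + description + ".sql";
}
```

The helper always takes the highest existing number plus one, so an existing file is never renumbered or overwritten.
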

---

### ADR-5: Redis 7

**Status**: Adopted
**Component**: Token revocation list, monthly usage counters, rate-limit sliding window

**Decision**: Use Redis 7 as the in-memory data store.

**Rationale**: Token revocation requires O(1) key lookup with TTL-based automatic
expiry. `SET revoked:{jti} 1 EX {seconds_until_expiry}` stores a revocation entry
that expires precisely when the token itself would have expired, so no manual cleanup
is required. The monthly token counter uses Redis `INCR`, which is atomic and O(1). The
rate-limiter uses a Redis sorted set for the sliding-window algorithm.

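
The revocation entry's TTL can be derived from the token's `exp` claim. The following is a minimal sketch under stated assumptions: `buildRevocation` and `RevocationCommand` are hypothetical, and only the `revoked:{jti}` key shape and the `EX` semantics come from the description above.

```typescript
// Hypothetical helper; only the `revoked:{jti}` key shape and EX semantics come from the ADR.
interface RevocationCommand {
  key: string;
  value: "1";
  exSeconds: number;
}

/**
 * Builds the arguments for `SET revoked:{jti} 1 EX {seconds}`.
 * Returns null when the token has already expired, because Redis rejects
 * a non-positive EX value and an expired token needs no revocation entry.
 */
function buildRevocation(
  jti: string,
  expUnixSeconds: number,
  nowUnixSeconds: number,
): RevocationCommand | null {
  const secondsUntilExpiry = expUnixSeconds - nowUnixSeconds;
  if (secondsUntilExpiry <= 0) {
    return null;
  }
  return { key: "revoked:" + jti, value: "1", exSeconds: Math.ceil(secondsUntilExpiry) };
}
```

Assuming the v4 API of the `redis` package, the result would map onto a call of the form `client.set(cmd.key, cmd.value, { EX: cmd.exSeconds })`.
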

**Alternatives considered**:
- Memcached: rejected because Memcached offers no sorted-set data structure, which the sliding-window rate-limiter requires.
- PostgreSQL for revocation: rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis `GET` at sub-millisecond latency.

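
For readers unfamiliar with the sliding-window technique, the sketch below models it in memory, with timestamps playing the role of sorted-set scores. The class and the Redis commands named in the comments are assumptions about how such a limiter is typically built, not a description of AgentIdP's actual rate-limiter.

```typescript
// In-memory model of a sorted-set sliding window; timestamps act as scores.
class SlidingWindowLimiter {
  private readonly hits: { [key: string]: number[] } = {};
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Returns true if the request is allowed, recording it when it is. */
  allow(key: string, nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // Comparable to ZREMRANGEBYSCORE key -inf windowStart: drop entries outside the window.
    const recent = (this.hits[key] ?? []).filter((t) => t > windowStart);
    // Comparable to ZCARD key: count what remains in the window.
    const allowed = recent.length < this.limit;
    if (allowed) {
      recent.push(nowMs); // comparable to ZADD key nowMs member
    }
    this.hits[key] = recent;
    return allowed;
  }
}
```

Because old entries are trimmed on every call, the limit applies to any trailing window of `windowMs` rather than to fixed calendar buckets.
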

**Consequences**: Redis is a required infrastructure dependency. A Redis instance must
be running and reachable via `REDIS_URL` before the server starts. `docker-compose.yml`
provides a Redis 7 Alpine container for local development on port 6379.

---

### ADR-6: HashiCorp Vault

**Status**: Adopted (opt-in)
**Component**: Credential secret storage (alternative to bcrypt in PostgreSQL)

**Decision**: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.

**Rationale**: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt
hashes cannot be reversed, some enterprises require that secrets never touch a relational
database, even in hashed form. Vault provides a dedicated secrets management plane with
HSM backing and an independent audit trail at the secrets level. The `verifySecret`
method in `VaultClient` uses `crypto.timingSafeEqual` to prevent timing-based
side-channel attacks when comparing stored and candidate secrets.

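
A constant-time comparison with `crypto.timingSafeEqual` might look like the sketch below. `secretsMatch` is a hypothetical helper, and hashing both inputs first is one common way to satisfy the function's equal-length requirement; the real `VaultClient.verifySecret` may be written differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper; the real VaultClient.verifySecret may differ.
function secretsMatch(stored: string, candidate: string): boolean {
  // timingSafeEqual throws if the two buffers differ in length, so both values
  // are reduced to fixed-length SHA-256 digests before they are compared.
  const storedDigest = createHash("sha256").update(stored, "utf8").digest();
  const candidateDigest = createHash("sha256").update(candidate, "utf8").digest();
  return timingSafeEqual(storedDigest, candidateDigest);
}
```

Unlike `===`, the comparison time does not depend on how many leading characters of the candidate happen to be correct.
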

**Alternatives considered**:
- AWS Secrets Manager: rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only: retained as the fallback path. When `VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the server operates identically to Phase 1.

**Consequences**: Vault is controlled by `VAULT_ADDR` (required), `VAULT_TOKEN`
(required), and `VAULT_MOUNT` (optional, defaults to `secret`). When these are not set,
bcrypt is used unchanged. Credential rows carry a nullable `vault_path` column: `null`
means bcrypt; a non-null path means Vault verification is used.

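
The selection logic described above can be sketched as follows. Every name here (`vaultConfigFromEnv`, `backendFor`, the interfaces) is hypothetical, and two behaviours are assumptions rather than documented facts: that `VAULT_ADDR` without `VAULT_TOKEN` is an error, and that a Vault-backed row with Vault unconfigured is also an error.

```typescript
// Hypothetical shapes; only vault_path and the environment variable names come from the ADR.
interface VaultConfig { addr: string; token: string; mount: string }
interface CredentialRow { vault_path: string | null }
type Env = { [name: string]: string | undefined };
type Backend = "bcrypt" | "vault";

/** Mirrors the described contract of createVaultClientFromEnv(): null when VAULT_ADDR is unset. */
function vaultConfigFromEnv(env: Env): VaultConfig | null {
  const addr = env["VAULT_ADDR"];
  if (addr === undefined || addr === "") {
    return null; // Vault not configured: fall back to bcrypt
  }
  const token = env["VAULT_TOKEN"];
  if (token === undefined || token === "") {
    // Assumption: an address without a token is treated as a misconfiguration.
    throw new Error("VAULT_TOKEN is required when VAULT_ADDR is set");
  }
  return { addr, token, mount: env["VAULT_MOUNT"] ?? "secret" };
}

/** Chooses the verification backend for a single credential row. */
function backendFor(row: CredentialRow, vault: VaultConfig | null): Backend {
  if (row.vault_path === null) {
    return "bcrypt";
  }
  if (vault === null) {
    // Assumption: a Vault-backed row without Vault configured is an operator error.
    throw new Error("credential is stored in Vault but Vault is not configured");
  }
  return "vault";
}
```

Keeping the decision per row means bcrypt-backed and Vault-backed credentials can coexist in the same table.
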

---

### ADR-7: OPA (Open Policy Agent)

**Status**: Adopted
**Component**: Request authorisation (scope enforcement on all protected endpoints)

**Decision**: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.

**Rationale**: Hard-coded scope checks in middleware would require a code deployment
for every policy change. OPA decouples the policy (`policies/authz.rego`) from the
server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via
`SIGHUP` without restarting the server. The `@open-policy-agent/opa-wasm` package
evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm
bundle is present (development, CI), the middleware falls back to a TypeScript
implementation that reads `policies/data/scopes.json`.

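
A fallback evaluator of this kind might look like the sketch below. The `ScopeRule` shape, the `:param` path syntax, the scope names, default-deny for unlisted endpoints, and the "any shared scope grants access" rule are all assumptions made for illustration; the real `scopes.json` format and fallback code may differ.

```typescript
// Hypothetical data shape; the real policies/data/scopes.json may be organised differently.
interface ScopeRule {
  method: string;   // HTTP method in upper case
  path: string;     // exact path, or a pattern with ":param" segments
  scopes: string[]; // scopes that grant access to this endpoint
}

/** Segment-by-segment match where ":param" segments accept any value. */
function pathMatches(pattern: string, actual: string): boolean {
  const patternParts = pattern.split("/");
  const actualParts = actual.split("/");
  if (patternParts.length !== actualParts.length) {
    return false;
  }
  return patternParts.every(
    (segment, i) => segment.charAt(0) === ":" || segment === actualParts[i],
  );
}

/**
 * Allows a request when a rule matches its method and path and the token's
 * scopes intersect the rule's scopes. Unlisted endpoints are denied (assumption).
 */
function isAllowed(
  rules: readonly ScopeRule[],
  method: string,
  path: string,
  tokenScopes: readonly string[],
): boolean {
  const matching = rules.filter(
    (rule) => rule.method === method.toUpperCase() && pathMatches(rule.path, path),
  );
  return matching.some((rule) =>
    rule.scopes.some((scope) => tokenScopes.indexOf(scope) !== -1),
  );
}
```

With rules shaped like this, adding an endpoint is a data change rather than a code change, which is the decoupling the rationale describes.
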

**Alternatives considered**:
- Custom middleware with hard-coded scope checks: rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin: rejected because its RBAC/ABAC model is less expressive than Rego for the compound `method + path + scope-intersection` pattern AgentIdP requires.

**Consequences**: All authorisation rules live in `policies/authz.rego` and
`policies/data/scopes.json`. Adding a new endpoint requires adding its scope
requirement to `scopes.json`. A policy change is deployed by updating `scopes.json`
(or `authz.wasm`) and sending `SIGHUP` to the running process; no redeployment is needed.

---

### ADR-8: React 18 + Vite 5

**Status**: Adopted
**Component**: Web dashboard SPA (`dashboard/`)

**Decision**: Use React 18 with Vite 5 as the web dashboard framework and build tool.

**Rationale**: React 18's concurrent rendering model handles the dashboard's async data
fetching patterns cleanly. The `@sentryagent/idp-sdk` Node.js package is reused
directly in the dashboard via `TokenManager` for authentication, avoiding duplicated
API client code. Vite 5 provides sub-second HMR in development and a fast production
build with tree-shaking. The dashboard is built to `dashboard/dist/` and served as
static files from Express at `/dashboard`, keeping the deployment footprint to a
single container.

**Alternatives considered**:
- Next.js: rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue: rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.

**Consequences**: The dashboard must be built (`cd dashboard && npm run build`) before
Express can serve it. In local development, run `cd dashboard && npm run dev` to use
Vite's dev server with HMR; the Vite proxy forwards `/api/` calls to Express at
`localhost:3000`.

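
A dev-server proxy of the kind described could be configured roughly as follows. This is a sketch, not the project's actual `vite.config.ts`: it uses a plain object (which Vite accepts as a default export) to stay free of imports, and it omits the React plugin and any other options the real file presumably contains.

```typescript
// Sketch of a Vite config with an API proxy; not the project's actual file.
const config = {
  server: {
    proxy: {
      // Requests whose path starts with /api go to the local Express server.
      "/api": {
        target: "http://localhost:3000",
      },
    },
  },
};

export default config;
```

With this in place, dashboard code can call relative `/api/...` URLs both in development and when served by Express.
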

---

### ADR-9: Prometheus + Grafana

**Status**: Adopted
**Component**: Operational metrics collection and visualisation

**Decision**: Use Prometheus for metrics collection and Grafana for dashboards.

**Rationale**: Prometheus is the industry standard for metrics in container
environments. The `prom-client` npm package integrates natively with Express and
provides `Counter` and `Histogram` metric types that cover all observability needs for
AgentIdP. Grafana's YAML provisioning in `monitoring/grafana/provisioning/` makes
dashboards reproducible and version-controlled. The monitoring stack runs as a Docker
Compose overlay (`docker-compose.monitoring.yml`) without interfering with the base dev
environment.

**Alternatives considered**:
- Datadog: rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD: rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by `method`, `route`, and `status_code`.

**Consequences**: All metric definitions live exclusively in `src/metrics/registry.ts`.
No other file may instantiate a `Counter` or `Histogram`; all other files import
specific metrics from that registry. Grafana is available at port 3001 when the
monitoring overlay is running.

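
To show why label support matters, the sketch below models a labelled counter without any dependencies. `LabelledCounter` is a stand-in for the role `prom-client`'s `Counter` plays in the real registry, and the `httpRequestsTotal` name is hypothetical.

```typescript
// Dependency-free stand-in for a labelled counter; the real registry uses prom-client.
interface RequestLabels {
  method: string;
  route: string;
  status_code: string;
}

class LabelledCounter {
  private readonly series: { [key: string]: number } = {};

  private static keyFor(labels: RequestLabels): string {
    return labels.method + " " + labels.route + " " + labels.status_code;
  }

  /** Increments the time series identified by this exact label combination. */
  inc(labels: RequestLabels): void {
    const key = LabelledCounter.keyFor(labels);
    this.series[key] = (this.series[key] ?? 0) + 1;
  }

  /** Reads the current value of one series; unseen combinations read as 0. */
  get(labels: RequestLabels): number {
    return this.series[LabelledCounter.keyFor(labels)] ?? 0;
  }
}

// Defined once in the registry module; every other module imports this instance.
const httpRequestsTotal = new LabelledCounter();
```

Each distinct combination of `method`, `route`, and `status_code` becomes its own series, which a flat metric name cannot express without encoding the dimensions into the name.
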

---

### ADR-10: Terraform

**Status**: Adopted
**Component**: Infrastructure as code (multi-region AWS + GCP deployment)

**Decision**: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.

**Rationale**: Terraform's HCL syntax is readable and its provider ecosystem covers
both AWS and GCP with the same toolchain. Reusable modules in `terraform/modules/`
(agentidp, lb, rds, redis) are composed in environment-specific configurations under
`terraform/environments/aws/` and `terraform/environments/gcp/`. All infrastructure
changes go through `terraform plan` review before `terraform apply`, providing a
diff-based approval workflow.

**Alternatives considered**:
- Pulumi: rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK: rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.

**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted; they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.