# Technology Stack and Architecture Decision Records

Every technology choice in AgentIdP was made deliberately. This document records the
decision, rationale, and alternatives considered for each major technology. New engineers
should read this before making any technology additions or changes — the pattern here is
the template for future ADRs.

---

### ADR-1: Node.js 18 LTS

**Status**: Adopted
**Component**: AgentIdP server runtime and Node.js SDK runtime

**Decision**: Use Node.js 18 LTS as the server runtime.

**Rationale**: Node.js 18 LTS provides native `fetch`, native ESM support, and a
stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL
(`pg`), Redis (`redis`), JWT (`jsonwebtoken`), and bcrypt (`bcryptjs`) is mature and
well-maintained on this version. The non-blocking I/O model is well-suited for an IdP
that handles many concurrent short-lived authentication requests. The `engines.node`
field in `package.json` declares `>=18.0.0`; npm treats this field as advisory unless
`engine-strict` is enabled.
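
Because the `engines` field is evaluated at install time, a startup guard is one way to fail fast when the server is launched on an older runtime. The following is a minimal sketch; `MIN_NODE_MAJOR`, `parseNodeMajor`, and `assertSupportedNode` are hypothetical names, not part of the AgentIdP codebase.

```typescript
// Hypothetical startup guard; names are illustrative only.
const MIN_NODE_MAJOR = 18;

/** Extracts the major version from a string such as "18.19.0" or "v20.11.1". */
function parseNodeMajor(version: string): number {
  const match = /^v?(\d+)\./.exec(version);
  if (match === null) {
    throw new Error(`Unrecognised Node.js version string: ${version}`);
  }
  return Number(match[1]);
}

/** Throws if the given runtime version is older than the supported minimum. */
function assertSupportedNode(version: string): void {
  const major = parseNodeMajor(version);
  if (major < MIN_NODE_MAJOR) {
    throw new Error(`Node.js >= ${MIN_NODE_MAJOR} is required, found ${version}`);
  }
}

// Typical call site, early in the server entry point:
//   assertSupportedNode(process.versions.node);
```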

**Alternatives considered**:
- Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (`pg`, `bcryptjs`), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.

**Consequences**: All Dockerfiles and Terraform ECS/Cloud Run task definitions must
target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires
CTO approval and a QA sign-off on the full test suite.

---

### ADR-2: TypeScript 5.3 Strict Mode

**Status**: Adopted
**Component**: All source files — server, all SDKs, dashboard

**Decision**: TypeScript 5.3 with `strict: true` and every additional strictness flag enabled in `tsconfig.json`.

**Rationale**: AgentIdP handles authentication tokens and cryptographic secrets. Type
errors in this domain can cause security vulnerabilities — a value that should be
`string | null` treated as `string` can produce silent authentication bypasses. Strict
TypeScript with `noImplicitAny`, `strictNullChecks`, `noUnusedLocals`, `noUnusedParameters`,
and `noImplicitReturns` makes these classes of bug a compile-time error rather than a
runtime failure in production.
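
The `string | null` case can be illustrated with a short sketch. The functions below are hypothetical and exist only to show how `strictNullChecks` forces the null branch to be handled before the value reaches code that expects a `string`.

```typescript
// Hypothetical lookup: returns the subject bound to a token, or null if the token is unknown.
function findSubject(tokens: Map<string, string>, token: string): string | null {
  return tokens.get(token) ?? null;
}

// Accepts only real strings; null is not a valid argument under strictNullChecks.
function sameSubject(actual: string, expected: string): boolean {
  return actual === expected;
}

function isAuthorised(tokens: Map<string, string>, token: string, expected: string): boolean {
  const subject = findSubject(tokens, token);
  // Calling sameSubject(subject, expected) at this point does not compile:
  // error TS2345: Argument of type 'string | null' is not assignable to
  // parameter of type 'string'.
  if (subject === null) {
    return false; // unknown token: deny explicitly
  }
  // After the check, the compiler has narrowed `subject` to `string`.
  return sameSubject(subject, expected);
}
```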

**Alternatives considered**:
- Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.

**Consequences**: All new code must compile cleanly under `tsc --strict`. `any` types
are not permitted; exceptions require CTO approval. The `tsconfig.json` enables
`noImplicitAny`, `strictNullChecks`, `strictFunctionTypes`, `strictBindCallApply`,
`strictPropertyInitialization`, `noImplicitThis`, `alwaysStrict`, `noUnusedLocals`,
`noUnusedParameters`, `noImplicitReturns`, and `noFallthroughCasesInSwitch`.

---

### ADR-3: Express 4.18

**Status**: Adopted
**Component**: HTTP server framework

**Decision**: Use Express 4.18 as the HTTP framework.

**Rationale**: Express is the most widely understood Node.js HTTP framework. Its
middleware model (`(req, res, next)`) maps directly to the IdP's layered architecture:
`helmet` → `cors` → `metricsMiddleware` → `authMiddleware` → `opaMiddleware` →
controller → service → repository → `errorHandler`. The ecosystem for Express
middleware (`helmet`, `cors`, `morgan`) is mature. For a spec-first project, Express's
lack of convention about code structure is a feature — the architecture is explicit and
fully visible in `src/app.ts`.
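
Why the order of the chain matters can be shown with a dependency-free model of the `(req, res, next)` pattern. This is not the Express API and not the contents of `src/app.ts`; `runChain` and the simplified `Req` and `Res` types are assumptions made for illustration.

```typescript
// Dependency-free model of an Express-style chain; not the real Express API.
interface Req { headers: Partial<Record<string, string>>; }
interface Res { status: number; body: string; }
type Next = (err?: Error) => void;
type Middleware = (req: Req, res: Res, next: Next) => void;
type ErrorMiddleware = (err: Error, req: Req, res: Res) => void;

/** Runs each layer in order; any error skips the remaining layers and goes to onError. */
function runChain(stack: Middleware[], onError: ErrorMiddleware, req: Req): Res {
  const res: Res = { status: 200, body: "" };
  const dispatch = (index: number, err?: Error): void => {
    if (err !== undefined) {
      onError(err, req, res);
      return;
    }
    const layer = stack[index];
    if (layer === undefined) {
      return; // end of chain
    }
    try {
      layer(req, res, (nextErr?: Error) => dispatch(index + 1, nextErr));
    } catch (thrown) {
      onError(thrown instanceof Error ? thrown : new Error(String(thrown)), req, res);
    }
  };
  dispatch(0);
  return res;
}

// Simplified stand-ins for authMiddleware, a controller, and errorHandler.
const authLayer: Middleware = (req, _res, next) => {
  if (req.headers["authorization"] === undefined) {
    next(new Error("unauthenticated"));
    return;
  }
  next();
};
const controllerLayer: Middleware = (_req, res) => {
  res.body = "ok";
};
const errorLayer: ErrorMiddleware = (err, _req, res) => {
  res.status = err.message === "unauthenticated" ? 401 : 500;
  res.body = err.message;
};
```

In this model an unauthenticated request never reaches the controller, and the error handler sees every failure because it sits after all other layers.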

**Alternatives considered**:
- Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.

**Consequences**: All HTTP concerns (routing, middleware, error handling) use the
Express 4 API. The `errorHandler` middleware must remain the last `app.use()` call in
`src/app.ts`.

---

### ADR-4: PostgreSQL 14

**Status**: Adopted
**Component**: Primary data store for agents, credentials, and audit events

**Decision**: Use PostgreSQL 14 as the primary relational database.

**Rationale**: The audit log requires ACID guarantees — partial writes or uncommitted
reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's
`JSONB` column type is used for the `metadata` field in `audit_events`, allowing
structured context data without schema changes for each new event type. PostgreSQL's
row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.

**Alternatives considered**:
- MySQL — rejected because its JSON support is weaker than PostgreSQL's `JSONB` with GIN indexing, and its default transaction isolation level (`REPEATABLE READ` under InnoDB) has historically produced surprises.
- MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.

**Consequences**: All schema changes go through numbered SQL migration files in
`src/db/migrations/`. Migration files are append-only — never modify an existing
migration. New tables require a new numbered file (e.g. `005_create_agent_groups.sql`).
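
The numbering rule can be sketched as a small helper that derives the next filename from the existing ones. The three-digit prefix is inferred from the `005_create_agent_groups.sql` example; `nextMigrationName` and the exact filename pattern are assumptions, not the project's migration tooling.

```typescript
// Assumed pattern: three-digit prefix, underscore, lowercase description, ".sql".
const MIGRATION_PATTERN = /^(\d{3})_[a-z0-9_]+\.sql$/;

/** Returns the filename for a new migration, numbered one above the current highest. */
function nextMigrationName(existing: readonly string[], description: string): string {
  let highest = 0;
  for (const file of existing) {
    const match = MIGRATION_PATTERN.exec(file);
    if (match === null) {
      throw new Error(`Unexpected migration filename: ${file}`);
    }
    highest = Math.max(highest, Number(match[1]));
  }
  const prefix = String(highest + 1).padStart(3, "0");
  return `${prefix}_${description}.sql`;
}
```

Existing files are only read, never renamed or rewritten, which matches the append-only rule.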

---

### ADR-5: Redis 7

**Status**: Adopted
**Component**: Token revocation list, monthly usage counters, rate-limit sliding window

**Decision**: Use Redis 7 as the in-memory data store.

**Rationale**: Token revocation requires O(1) key lookup with TTL-based automatic
expiry. `SET revoked:{jti} 1 EX {seconds_until_expiry}` stores a revocation entry
that expires precisely when the token itself would have expired — zero manual cleanup
required. The monthly token counter uses Redis `INCR`, which is atomic and O(1). The
rate-limiter uses a Redis sorted set for the sliding-window algorithm.
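
Both mechanisms can be sketched without a Redis connection. `revocationCommand` builds the `SET` arguments described above from a token's `exp` claim, and `SlidingWindowLimiter` is an in-memory model of a sorted-set sliding window. The names are hypothetical, and the Redis commands named in the comments are a common way to implement this algorithm, not a confirmed description of the AgentIdP rate-limiter.

```typescript
/** Builds the SET arguments for a revocation entry, or null if the token has already expired. */
function revocationCommand(jti: string, expSeconds: number, nowSeconds: number): string[] | null {
  const ttl = expSeconds - nowSeconds;
  if (ttl <= 0) {
    return null; // an expired token needs no revocation entry
  }
  return ["SET", `revoked:${jti}`, "1", "EX", String(ttl)];
}

/** In-memory model of a sliding window; timestamps stand in for sorted-set scores. */
class SlidingWindowLimiter {
  private readonly hits: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Assumes nowMs never decreases between calls. */
  allow(nowMs: number): boolean {
    // Comparable to ZREMRANGEBYSCORE: drop entries that fell out of the window.
    const cutoff = nowMs - this.windowMs;
    while (this.hits.length > 0 && (this.hits[0] as number) <= cutoff) {
      this.hits.shift();
    }
    // Comparable to ZCARD: count what remains.
    if (this.hits.length >= this.limit) {
      return false;
    }
    // Comparable to ZADD: record this request.
    this.hits.push(nowMs);
    return true;
  }
}
```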

**Alternatives considered**:
- Memcached — rejected because it is a plain key-value cache with no sorted-set data type, which the sliding-window rate-limiter requires.
- PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis `GET` at sub-millisecond latency.

**Consequences**: Redis is a required infrastructure dependency. A Redis instance must
be running and reachable via `REDIS_URL` before the server starts. `docker-compose.yml`
provides a Redis 7 Alpine container for local development on port 6379.

---

### ADR-6: HashiCorp Vault

**Status**: Adopted (opt-in)
**Component**: Credential secret storage — alternative to bcrypt in PostgreSQL

**Decision**: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.

**Rationale**: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt
hashes cannot be reversed, some enterprises require that secrets never touch a relational
database — even in hashed form. Vault provides a dedicated secrets management plane with
optional HSM integration and an independent audit trail at the secrets level. The `verifySecret`
method in `VaultClient` uses `crypto.timingSafeEqual` to prevent timing-based
side-channel attacks when comparing stored and candidate secrets.
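
A constant-time comparison along these lines might look like the sketch below. It is not the actual `verifySecret` implementation; `secretsMatch` is a hypothetical helper, and hashing both inputs first is one way to satisfy the requirement that `crypto.timingSafeEqual` receives buffers of equal length.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

/** Compares two secrets without leaking, through timing, where they first differ. */
function secretsMatch(stored: string, candidate: string): boolean {
  // timingSafeEqual throws when the buffers differ in length, so both inputs
  // are reduced to fixed-length SHA-256 digests before comparison.
  const storedDigest = createHash("sha256").update(stored, "utf8").digest();
  const candidateDigest = createHash("sha256").update(candidate, "utf8").digest();
  return timingSafeEqual(storedDigest, candidateDigest);
}
```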

**Alternatives considered**:
- AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only — retained as the fallback path. When `VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the server operates identically to Phase 1.

**Consequences**: Vault is enabled by setting `VAULT_ADDR` and `VAULT_TOKEN` (both
required to enable it); `VAULT_MOUNT` is optional and defaults to `secret`. When these
are not set, bcrypt is used unchanged. Credential rows carry a nullable `vault_path`
column: `null` means bcrypt; a non-null path means Vault verification is used.
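
The selection logic can be sketched as two small functions. `vaultConfigFromEnv` and `backendFor` are hypothetical names, and returning `null` when only `VAULT_TOKEN` is missing is an assumption; the source only specifies the behaviour when `VAULT_ADDR` is unset.

```typescript
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

/** Reads Vault settings from the environment, or returns null to signal the bcrypt fallback. */
function vaultConfigFromEnv(env: Record<string, string | undefined>): VaultConfig | null {
  const addr = env["VAULT_ADDR"];
  const token = env["VAULT_TOKEN"];
  // Assumption: a missing token disables Vault in the same way as a missing address.
  if (addr === undefined || addr === "" || token === undefined || token === "") {
    return null;
  }
  return { addr, token, mount: env["VAULT_MOUNT"] ?? "secret" };
}

type SecretBackend = "bcrypt" | "vault";

/** Chooses the verification path for one credential row from its vault_path column. */
function backendFor(row: { vault_path: string | null }): SecretBackend {
  return row.vault_path === null ? "bcrypt" : "vault";
}
```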

---

### ADR-7: OPA (Open Policy Agent)

**Status**: Adopted
**Component**: Request authorisation — scope enforcement on all protected endpoints

**Decision**: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.

**Rationale**: Hard-coded scope checks in middleware would require a code deployment
for every policy change. OPA decouples the policy (`policies/authz.rego`) from the
server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via
`SIGHUP` without restarting the server. The `@open-policy-agent/opa-wasm` package
evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm
bundle is present (development, CI), the middleware falls back to a TypeScript
implementation that reads `policies/data/scopes.json`.
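
A fallback check of this kind might look like the following sketch. The shape of `scopes.json` is not documented here, so the `"METHOD /path"` keys, the deny-by-default behaviour, and the rule that one matching scope is sufficient are all assumptions; `isAllowed` is a hypothetical name.

```typescript
// Assumed shape of scopes.json: "METHOD /path" mapped to the scopes that grant access.
type ScopeMap = Partial<Record<string, string[]>>;

/** Allows the request when the token's scopes intersect the scopes required for the route. */
function isAllowed(
  scopes: ScopeMap,
  method: string,
  path: string,
  granted: readonly string[],
): boolean {
  const required = scopes[`${method.toUpperCase()} ${path}`];
  if (required === undefined) {
    return false; // assumption: routes without a rule are denied
  }
  return required.some((scope) => granted.includes(scope));
}
```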

**Alternatives considered**:
- Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound `method + path + scope-intersection` pattern AgentIdP requires.

**Consequences**: All authorisation rules live in `policies/authz.rego` and
`policies/data/scopes.json`. Adding a new endpoint requires adding its scope
requirement to `scopes.json`. A policy change is deployed by updating `scopes.json`
(or `authz.wasm`) and sending `SIGHUP` to the running process — no redeployment needed.
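
The `SIGHUP` reload of `scopes.json` can be sketched as follows. `PolicyStore` is a hypothetical class, the Wasm bundle reload is not covered, and keeping the previous policy when the new file fails to parse is a design assumption rather than documented behaviour.

```typescript
import { readFileSync } from "node:fs";

type LoadedScopes = Record<string, string[]>;

/** Holds the active scope map and replaces it only when a reload succeeds. */
class PolicyStore {
  private scopes: LoadedScopes = {};
  private readonly read: () => string;

  constructor(read: () => string) {
    this.read = read;
  }

  /** Returns true if the policy was replaced, false if the previous one was kept. */
  reload(): boolean {
    try {
      const parsed: unknown = JSON.parse(this.read());
      if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
        throw new Error("scopes file must contain a JSON object");
      }
      this.scopes = parsed as LoadedScopes;
      return true;
    } catch {
      return false; // keep serving the last good policy
    }
  }

  current(): LoadedScopes {
    return this.scopes;
  }
}

// Wiring: re-read the file whenever the process receives SIGHUP.
const policyStore = new PolicyStore(() => readFileSync("policies/data/scopes.json", "utf8"));
process.on("SIGHUP", () => {
  policyStore.reload();
});
```

Passing the reader as a function keeps the reload logic testable without touching the filesystem.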

---

### ADR-8: React 18 + Vite 5

**Status**: Adopted
**Component**: Web dashboard SPA (`dashboard/`)

**Decision**: Use React 18 with Vite 5 as the web dashboard framework and build tool.

**Rationale**: React 18's concurrent rendering model handles the dashboard's async data
fetching patterns cleanly. The `@sentryagent/idp-sdk` Node.js package is reused
directly in the dashboard via `TokenManager` for authentication, avoiding duplicated
API client code. Vite 5 provides sub-second HMR in development and a fast production
build with tree-shaking. The dashboard is built to `dashboard/dist/` and served as
static files from Express at `/dashboard`, keeping the deployment footprint to a
single container.

**Alternatives considered**:
- Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.

**Consequences**: The dashboard must be built (`cd dashboard && npm run build`) before
Express can serve it. In local development, run `cd dashboard && npm run dev` to use
Vite's dev server with HMR; the Vite proxy forwards `/api/` calls to Express at
`localhost:3000`.
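
A Vite configuration matching this setup might look like the sketch below. It is not the project's actual `dashboard/vite.config.ts`: the `base` value and `changeOrigin` setting are assumptions, and a real configuration would also register the React plugin.

```typescript
// Hypothetical dashboard/vite.config.ts; Vite accepts a plain object as its default export.
const config = {
  // Assumption: assets are referenced relative to the path Express serves them from.
  base: "/dashboard/",
  server: {
    proxy: {
      // Forward API calls from the Vite dev server to the local Express server.
      "/api": {
        target: "http://localhost:3000",
        changeOrigin: true,
      },
    },
  },
  build: {
    // Relative to the dashboard/ project root, so output lands in dashboard/dist/.
    outDir: "dist",
  },
};

export default config;
```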

---

### ADR-9: Prometheus + Grafana

**Status**: Adopted
**Component**: Operational metrics collection and visualisation

**Decision**: Use Prometheus for metrics collection and Grafana for dashboards.

**Rationale**: Prometheus is the industry standard for metrics in container
environments. The `prom-client` npm package integrates natively with Express and
provides `Counter` and `Histogram` metric types that cover all observability needs for
AgentIdP. Grafana's YAML provisioning in `monitoring/grafana/provisioning/` makes
dashboards reproducible and version-controlled. The monitoring stack runs as a Docker
Compose overlay (`docker-compose.monitoring.yml`) without interfering with the base dev
environment.

**Alternatives considered**:
- Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by `method`, `route`, and `status_code`.

**Consequences**: All metric definitions live exclusively in `src/metrics/registry.ts`.
No other file may instantiate a `Counter` or `Histogram` — all other files import
specific metrics from that registry. Grafana is available at port 3001 when the
monitoring overlay is running.

---

### ADR-10: Terraform

**Status**: Adopted
**Component**: Infrastructure as code — multi-region AWS + GCP deployment

**Decision**: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.

**Rationale**: Terraform's HCL syntax is readable and its provider ecosystem covers
both AWS and GCP with the same toolchain. Reusable modules in `terraform/modules/`
(agentidp, lb, rds, redis) are composed in environment-specific configurations under
`terraform/environments/aws/` and `terraform/environments/gcp/`. All infrastructure
changes go through `terraform plan` review before `terraform apply`, providing a
diff-based approval workflow.

**Alternatives considered**:
- Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.

**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.