Technology Stack and Architecture Decision Records
Every technology choice in AgentIdP was made deliberately. This document records the decision, rationale, and alternatives considered for each major technology. New engineers should read this before making any technology additions or changes — the pattern here is the template for future ADRs.
ADR-1: Node.js 18 LTS
Status: Adopted
Component: AgentIdP server runtime and Node.js SDK runtime
Decision: Use Node.js 18 LTS as the server runtime.
Rationale: Node.js 18 LTS provides native fetch, native ESM support, and a
stable V8 engine with long-term security updates. The ecosystem for Express, PostgreSQL
(pg), Redis (redis), JWT (jsonwebtoken), and bcrypt (bcryptjs) is mature and
well-maintained on this version. The non-blocking I/O model is well-suited for an IdP
that handles many concurrent short-lived authentication requests. The engines.node
field in package.json enforces >=18.0.0.
Alternatives considered:
- Deno — rejected because the npm ecosystem compatibility layer introduced friction with key dependencies (pg, bcryptjs), and the production deployment story on ECS and Cloud Run was less mature at the time of the decision.
- Bun — rejected because it lacked LTS stability guarantees at the time of the decision, which is not acceptable for a security-critical authentication service.
Consequences: All Dockerfiles and Terraform ECS/Cloud Run task definitions must target Node.js 18 or a compatible LTS release. Upgrading the Node.js version requires CTO approval and a QA sign-off on the full test suite.
ADR-2: TypeScript 5.3 Strict Mode
Status: Adopted
Component: All source files — server, all SDKs, dashboard
Decision: TypeScript 5.3 with strict: true and every additional strictness flag enabled in tsconfig.json.
Rationale: AgentIdP handles authentication tokens and cryptographic secrets. Type
errors in this domain can cause security vulnerabilities — a value that should be
string | null treated as string can produce silent authentication bypasses. Strict
TypeScript with noImplicitAny, strictNullChecks, noUnusedLocals, noUnusedParameters,
and noImplicitReturns makes these classes of bug a compile-time error rather than a
runtime failure in production.
Alternatives considered:
- Plain JavaScript — rejected because a security-critical IdP with no type safety is not a system this team is willing to ship. Every public method, every error boundary, and every data transformation must be typed.
Consequences: All new code must compile cleanly under tsc --strict. Zero any
types — ever. No exceptions granted without CTO approval. The tsconfig.json enables
noImplicitAny, strictNullChecks, strictFunctionTypes, strictBindCallApply,
strictPropertyInitialization, noImplicitThis, alwaysStrict, noUnusedLocals,
noUnusedParameters, noImplicitReturns, and noFallthroughCasesInSwitch.
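The class of bug the rationale describes can be made concrete. The sketch below is illustrative (the function names are hypothetical, not from the codebase): under strictNullChecks, a `string | null` value cannot flow into a `string` parameter without explicit narrowing, so the "silent authentication bypass" becomes a compile error.

```typescript
// Hypothetical sketch: strictNullChecks forces explicit handling of null.
function issueToken(subject: string): string {
  return `token-for-${subject}`;
}

function lookupSubject(id: string): string | null {
  // Returns null when the agent is unknown (illustrative stub).
  return id === "agent-1" ? "agent-1" : null;
}

const subject = lookupSubject("agent-1");
// issueToken(subject);            // compile error: 'string | null' is not 'string'
if (subject !== null) {
  console.log(issueToken(subject)); // OK: narrowed to string by the null check
}
```

Without strictNullChecks, the commented-out call would compile and fail only at runtime — exactly the failure mode this ADR is designed to prevent.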
ADR-3: Express 4.18
Status: Adopted
Component: HTTP server framework
Decision: Use Express 4.18 as the HTTP framework.
Rationale: Express is the most widely understood Node.js HTTP framework. Its
middleware model ((req, res, next)) maps directly to the IdP's layered architecture:
helmet → cors → metricsMiddleware → authMiddleware → opaMiddleware →
controller → service → repository → errorHandler. The ecosystem for Express
middleware (helmet, cors, morgan) is mature. For a spec-first project, Express's
lack of convention about code structure is a feature — the architecture is explicit and
fully visible in src/app.ts.
Alternatives considered:
- Fastify — rejected because the team's familiarity was lower and the performance gains would be negligible for a token service whose latency is dominated by PostgreSQL queries and bcrypt comparisons.
- NestJS — rejected because its decorator-heavy convention-over-configuration style adds complexity not appropriate for the current team size and project scope.
- Koa — rejected because its ecosystem is smaller and fewer engineers are familiar with it.
Consequences: All HTTP concerns (routing, middleware, error handling) use the
Express 4 API. The errorHandler middleware must remain the last app.use() call in
src/app.ts.
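The layered chain in the rationale can be sketched without Express itself. The types below are simplified stand-ins for the real Express API, and the middleware names are trimmed to three for brevity; what the sketch shows is the `(req, res, next)` contract and why the error handler must sit last — `next(err)` short-circuits the rest of the chain into it.

```typescript
// Minimal stand-in types for the Express (req, res, next) contract.
type Req = { path: string; log: string[] };
type Res = { status?: number };
type Next = (err?: Error) => void;
type Middleware = (req: Req, res: Res, next: Next) => void;
type ErrorHandler = (err: Error, req: Req, res: Res) => void;

// Walk the chain in order; any next(err) jumps straight to the error handler.
function run(chain: Middleware[], onError: ErrorHandler, req: Req, res: Res): void {
  let i = 0;
  const next: Next = (err?: Error) => {
    if (err) return onError(err, req, res);
    const mw = chain[i++];
    if (mw) mw(req, res, next);
  };
  next();
}

const metrics: Middleware = (req, _res, next) => { req.log.push("metrics"); next(); };
const auth: Middleware = (req, _res, next) => { req.log.push("auth"); next(); };
const controller: Middleware = (req, res, next) => {
  req.log.push("controller");
  if (req.path === "/boom") return next(new Error("boom"));
  res.status = 200;
};
const errorHandler: ErrorHandler = (_err, _req, res) => { res.status = 500; };

const req: Req = { path: "/agents", log: [] };
const res: Res = {};
run([metrics, auth, controller], errorHandler, req, res);
// req.log is ["metrics", "auth", "controller"]; res.status is 200
```

This is also why the errorHandler must remain the last `app.use()` call: in real Express, a four-argument middleware registered earlier would intercept errors before later layers had a chance to run.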
ADR-4: PostgreSQL 14
Status: Adopted
Component: Primary data store for agents, credentials, and audit events
Decision: Use PostgreSQL 14 as the primary relational database.
Rationale: The audit log requires ACID guarantees — partial writes or uncommitted
reads are not acceptable for a compliance-grade append-only event store. PostgreSQL's
JSONB column type is used for the metadata field in audit_events, allowing
structured context data without schema changes for each new event type. PostgreSQL's
row-level security is available for multi-tenancy if that becomes a Phase 3 requirement.
Alternatives considered:
- MySQL — rejected because its JSON support is weaker than PostgreSQL's JSONB with GIN indexing, and its default transaction isolation level has historically produced surprises.
- MongoDB — rejected because the audit log must be append-only and ACID-safe. MongoDB's document model requires explicit multi-document transactions for ACID behaviour, and the schema flexibility is not needed here.
Consequences: All schema changes go through numbered SQL migration files in
src/db/migrations/. Migration files are append-only — never modify an existing
migration. New tables require a new numbered file (e.g. 005_create_agent_groups.sql).
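The append-only numbered-file convention implies a simple runner rule: sort by leading number, skip what has already been applied. A sketch (the filenames other than `005_create_agent_groups.sql` are invented for illustration; the real migration runner may work differently):

```typescript
// Hypothetical sketch of the numbered-migration convention: pending migrations
// are the numbered files not yet applied, run strictly in numeric order.
function pendingMigrations(allFiles: string[], applied: string[]): string[] {
  const num = (f: string) => parseInt(f.split("_")[0], 10);
  const done = new Set(applied);
  return [...allFiles]
    .sort((a, b) => num(a) - num(b))
    .filter((f) => !done.has(f));
}

const files = [
  "003_add_scopes.sql",          // invented example name
  "001_create_agents.sql",       // invented example name
  "005_create_agent_groups.sql", // name from this ADR
  "002_create_credentials.sql",  // invented example name
];
console.log(pendingMigrations(files, ["001_create_agents.sql", "002_create_credentials.sql"]));
// ["003_add_scopes.sql", "005_create_agent_groups.sql"]
```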
ADR-5: Redis 7
Status: Adopted
Component: Token revocation list, monthly usage counters, rate-limit sliding window
Decision: Use Redis 7 as the in-memory data store.
Rationale: Token revocation requires O(1) key lookup with TTL-based automatic
expiry. SET revoked:{jti} 1 EX {seconds_until_expiry} stores a revocation entry
that expires precisely when the token itself would have expired — zero manual cleanup
required. The monthly token counter uses Redis INCR, which is atomic and O(1). The
rate-limiter uses a Redis sorted set for the sliding-window algorithm.
Alternatives considered:
- Memcached — rejected because Memcached offers only simple key-value entries with TTL and has no sorted-set structure, which the sliding-window rate-limiter requires.
- PostgreSQL for revocation — rejected because the token verification path is the hot path in every authenticated request. A PostgreSQL round-trip adds 5–15 ms compared to a Redis GET at sub-millisecond latency.
Consequences: Redis is a required infrastructure dependency. A Redis instance must
be running and reachable via REDIS_URL before the server starts. docker-compose.yml
provides a Redis 7 Alpine container for local development on port 6379.
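The sliding-window algorithm the rate-limiter runs against a Redis sorted set can be sketched in memory. This is an illustrative stand-in, not the production code: the Redis commands it mirrors are noted in comments, and the limit and window values are invented for the example.

```typescript
// In-memory sketch of the sorted-set sliding-window rate-limit algorithm.
class SlidingWindow {
  private hits: number[] = []; // timestamps in ms (stands in for a Redis ZSET)

  constructor(private limit: number, private windowMs: number) {}

  allow(nowMs: number): boolean {
    // Drop entries that have slid out of the window (Redis: ZREMRANGEBYSCORE).
    this.hits = this.hits.filter((t) => t > nowMs - this.windowMs);
    // Count what remains (Redis: ZCARD) and deny if the window is full.
    if (this.hits.length >= this.limit) return false;
    // Record this request (Redis: ZADD with the timestamp as score).
    this.hits.push(nowMs);
    return true;
  }
}

const rl = new SlidingWindow(2, 1000); // 2 requests per second (assumed values)
rl.allow(0);    // true
rl.allow(100);  // true
rl.allow(200);  // false — limit reached inside the window
rl.allow(1200); // true — the earlier hits have expired out of the window
```

The same three-step shape (trim expired scores, count, add) is why a sorted set is required and a plain key-value store is not enough.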
ADR-6: HashiCorp Vault
Status: Adopted (opt-in)
Component: Credential secret storage — alternative to bcrypt in PostgreSQL
Decision: Integrate HashiCorp Vault KV v2 as an opt-in secret storage backend for agent credentials.
Rationale: The Phase 1 bcrypt approach stores hashes in PostgreSQL. While bcrypt
hashes cannot be reversed, some enterprises require that secrets never touch a relational
database — even in hashed form. Vault provides a dedicated secrets management plane with
HSM backing and an independent audit trail at the secrets level. The verifySecret
method in VaultClient uses crypto.timingSafeEqual to prevent timing-based
side-channel attacks when comparing stored and candidate secrets.
Alternatives considered:
- AWS Secrets Manager — rejected because it introduces cloud-vendor lock-in. AgentIdP must run identically on AWS, GCP, and on-premises; a Vault-based approach works in all environments.
- Plain bcrypt only — retained as the fallback path. When VAULT_ADDR is not set, createVaultClientFromEnv() returns null and the server operates identically to Phase 1.
Consequences: Vault is controlled by VAULT_ADDR (required), VAULT_TOKEN
(required), and VAULT_MOUNT (optional, defaults to secret). When these are not set,
bcrypt is used unchanged. Credential rows carry a nullable vault_path column: null
means bcrypt; a non-null path means Vault verification is used.
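The timing-safe comparison mentioned in the rationale can be sketched as follows. This is a hedged illustration of the technique, not the actual VaultClient.verifySecret body: hashing both sides to a fixed-length digest first is one common way to use crypto.timingSafeEqual (which throws on unequal-length buffers) when the candidate's length may differ.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Illustrative constant-time secret comparison in the spirit of
// VaultClient.verifySecret; the real implementation may differ.
function secretsMatch(stored: string, candidate: string): boolean {
  // SHA-256 digests are always 32 bytes, so timingSafeEqual's
  // equal-length requirement is satisfied regardless of input lengths.
  const a = createHash("sha256").update(stored).digest();
  const b = createHash("sha256").update(candidate).digest();
  return timingSafeEqual(a, b); // constant-time byte comparison
}

secretsMatch("s3cret", "s3cret"); // true
secretsMatch("s3cret", "wrong");  // false
```

A naive `stored === candidate` comparison can leak how many leading characters match through response timing; timingSafeEqual compares every byte regardless.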
ADR-7: OPA (Open Policy Agent)
Status: Adopted
Component: Request authorisation — scope enforcement on all protected endpoints
Decision: Use Open Policy Agent with a Rego policy compiled to a Wasm bundle for runtime authorisation.
Rationale: Hard-coded scope checks in middleware would require a code deployment
for every policy change. OPA decouples the policy (policies/authz.rego) from the
server code. The policy can be updated, re-compiled to Wasm, and hot-reloaded via
SIGHUP without restarting the server. The @open-policy-agent/opa-wasm package
evaluates the compiled Wasm bundle in-process with microsecond latency. When no Wasm
bundle is present (development, CI), the middleware falls back to a TypeScript
implementation that reads policies/data/scopes.json.
Alternatives considered:
- Custom middleware with hard-coded scope checks — rejected because policy changes require code changes and a full deployment cycle. As the endpoint surface grows this becomes unmanageable.
- Casbin — rejected because its RBAC/ABAC model is less expressive than Rego for the compound method + path + scope-intersection pattern AgentIdP requires.
Consequences: All authorisation rules live in policies/authz.rego and
policies/data/scopes.json. Adding a new endpoint requires adding its scope
requirement to scopes.json. A policy change is deployed by updating scopes.json
(or authz.wasm) and sending SIGHUP to the running process — no redeployment needed.
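The TypeScript fallback path can be sketched as a lookup over a scopes.json-style map. The key format ("METHOD path") and the scope names below are assumptions made for the example, not the repository's actual schema:

```typescript
// Hypothetical sketch of the scopes.json fallback: an endpoint is authorised
// when the token's scopes cover every scope the endpoint requires.
type ScopeMap = Record<string, string[]>;

const scopeMap: ScopeMap = {
  "GET /api/agents": ["agents:read"],   // assumed entries, for illustration
  "POST /api/agents": ["agents:write"],
};

function isAuthorised(method: string, path: string, tokenScopes: string[], map: ScopeMap): boolean {
  const required = map[`${method} ${path}`];
  if (!required) return false; // unknown endpoint: deny by default
  return required.every((scope) => tokenScopes.includes(scope));
}

isAuthorised("GET", "/api/agents", ["agents:read"], scopeMap);  // true
isAuthorised("POST", "/api/agents", ["agents:read"], scopeMap); // false
```

This also makes the consequence above concrete: adding a new endpoint without a scopes.json entry fails closed rather than open.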
ADR-8: React 18 + Vite 5
Status: Adopted
Component: Web dashboard SPA (dashboard/)
Decision: Use React 18 with Vite 5 as the web dashboard framework and build tool.
Rationale: React 18's concurrent rendering model handles the dashboard's async data
fetching patterns cleanly. The @sentryagent/idp-sdk Node.js package is reused
directly in the dashboard via TokenManager for authentication, avoiding duplicated
API client code. Vite 5 provides sub-second HMR in development and a fast production
build with tree-shaking. The dashboard is built to dashboard/dist/ and served as
static files from Express at /dashboard, keeping the deployment footprint to a
single container.
Alternatives considered:
- Next.js — rejected because server-side rendering is not needed for an internal operator dashboard, and the added complexity of a Next.js server is not justified.
- Vue — rejected because the broader SentryAgent.ai ecosystem is React-first; consistency reduces context-switching overhead.
Consequences: The dashboard must be built (cd dashboard && npm run build) before
Express can serve it. In local development, run cd dashboard && npm run dev to use
Vite's dev server with HMR; the Vite proxy forwards /api/ calls to Express at
localhost:3000.
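The dev-server proxy described above typically lives in vite.config.ts. A minimal sketch under the assumption that the dashboard uses @vitejs/plugin-react — the repository's actual config may differ:

```typescript
// Illustrative vite.config.ts: proxy /api calls to Express during `npm run dev`.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      // Forward /api requests to the Express server on port 3000.
      "/api": { target: "http://localhost:3000", changeOrigin: true },
    },
  },
  build: { outDir: "dist" }, // produces dashboard/dist/ for Express to serve
});
```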
ADR-9: Prometheus + Grafana
Status: Adopted
Component: Operational metrics collection and visualisation
Decision: Use Prometheus for metrics collection and Grafana for dashboards.
Rationale: Prometheus is the industry standard for metrics in container
environments. The prom-client npm package integrates natively with Express and
provides Counter and Histogram metric types that cover all observability needs for
AgentIdP. Grafana's YAML provisioning in monitoring/grafana/provisioning/ makes
dashboards reproducible and version-controlled. The monitoring stack runs as a Docker
Compose overlay (docker-compose.monitoring.yml) without interfering with the base dev
environment.
Alternatives considered:
- Datadog — rejected because SaaS cost and vendor lock-in are not acceptable for a free, open-source product. Operators who self-host AgentIdP should not be required to pay for monitoring.
- StatsD — rejected because StatsD's flat metric model lacks label/dimension support, which is essential for distinguishing metrics by method, route, and status_code.
Consequences: All metric definitions live exclusively in src/metrics/registry.ts.
No other file may instantiate a Counter or Histogram — all other files import
specific metrics from that registry. Grafana is available at port 3001 when the
monitoring overlay is running.
ADR-10: Terraform
Status: Adopted
Component: Infrastructure as code — multi-region AWS + GCP deployment
Decision: Use Terraform with HCL for all infrastructure provisioning across AWS and GCP.
Rationale: Terraform's HCL syntax is readable and its provider ecosystem covers
both AWS and GCP with the same toolchain. Reusable modules in terraform/modules/
(agentidp, lb, rds, redis) are composed in environment-specific configurations under
terraform/environments/aws/ and terraform/environments/gcp/. All infrastructure
changes go through terraform plan review before terraform apply, providing a
diff-based approval workflow.
Alternatives considered:
- Pulumi — rejected because the Pulumi provider ecosystem for AWS and GCP was less mature than Terraform's at the time of the Phase 2 decision, and HCL is more readable for non-engineers reviewing infrastructure changes.
- AWS CDK — rejected because it is AWS-only. AgentIdP must deploy identically to both AWS and GCP.
Consequences: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
terraform apply. Terraform state is stored in a remote backend and must not be edited
manually.