docs: engineering knowledge base for new hires

Complete docs/engineering/ suite: 12 documents covering company overview, system architecture, tech stack ADRs, codebase structure, service deep dives, annotated code walkthroughs, dev setup, engineering workflow, testing strategy, deployment/ops, SDK guide, and README index. All content verified against source files. All 82 tasks in openspec/changes/engineering-docs/tasks.md marked complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 12:38:42 +00:00
parent 1f95cfe89d
commit eced5f8699
13 changed files with 3820 additions and 0 deletions
--- a/docs/engineering/05-services.md
+++ b/docs/engineering/05-services.md
@@ -0,0 +1,342 @@
+# Service Deep Dives
+
+---
+
+### AgentService
+
+**Purpose**: Manages the full lifecycle of AI agent identities — registration, retrieval, updates, and decommissioning.
+
+**Responsibility boundary**: AgentService does not handle HTTP, credential secrets,
+token issuance, or audit log queries. It delegates all data access to
+`AgentRepository` and `CredentialRepository`, and all audit logging to `AuditService`.
+It enforces free-tier limits and domain rules before any data is written.
+
+**Public interface** (key methods):
+
+| Method | Parameters | Returns | Description |
+|--------|-----------|---------|-------------|
+| `registerAgent` | `data: ICreateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Checks the free-tier 100-agent limit, enforces email uniqueness, creates the agent record, writes an `agent.created` audit event, and increments the `agentidp_agents_registered_total` Prometheus counter |
+| `getAgentById` | `agentId: string` | `Promise<IAgent>` | Retrieves a single agent by UUID; throws `AgentNotFoundError` if not found |
+| `listAgents` | `filters: IAgentListFilters` | `Promise<IPaginatedAgentsResponse>` | Returns a paginated, optionally filtered list; filters include `owner`, `agentType`, `status`, `page`, `limit` |
+| `updateAgent` | `agentId: string, data: IUpdateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Partially updates agent metadata; rejects updates to decommissioned agents; determines the correct audit action (`agent.updated`, `agent.suspended`, `agent.reactivated`, `agent.decommissioned`) based on status transition |
+| `decommissionAgent` | `agentId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Soft-deletes the agent (sets `status = 'decommissioned'`); revokes all active credentials by calling `credentialRepository.revokeAllForAgent(agentId)` before decommissioning |
+
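A minimal sketch of how `updateAgent` might pick its audit action from a status transition, following the table above. The names `AgentStatus`, `AgentAuditAction`, and `resolveAuditAction` are hypothetical and not taken from the codebase, and the plain `Error` stands in for whichever error the service actually throws.

```typescript
// Hypothetical types for illustration; the real definitions live in the project's type modules.
type AgentStatus = "active" | "suspended" | "decommissioned";
type AgentAuditAction =
  | "agent.updated"
  | "agent.suspended"
  | "agent.reactivated"
  | "agent.decommissioned";

// Picks the audit action for an update from the status transition.
// `next` is undefined when the update does not touch the status field.
function resolveAuditAction(
  previous: AgentStatus,
  next: AgentStatus | undefined,
): AgentAuditAction {
  if (previous === "decommissioned") {
    // Updates to decommissioned agents are rejected before any audit action is chosen.
    throw new Error("decommissioned agents cannot be updated");
  }
  if (next === undefined || next === previous) return "agent.updated";
  if (next === "suspended") return "agent.suspended";
  if (next === "decommissioned") return "agent.decommissioned";
  // Only remaining case: suspended -> active.
  return "agent.reactivated";
}
```

A metadata-only change (no status field) maps to `agent.updated`, while a `suspended` to `active` transition maps to `agent.reactivated`.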
+**Database / storage schema**:
+- Table `agents`: `agent_id` (UUID PK), `email` (UNIQUE), `agent_type`, `version`, `capabilities` (text array), `owner`, `deployment_env`, `status`, `created_at`, `updated_at`.
+- No Redis usage — AgentService is PostgreSQL-only.
+
+**Error types**:
+- `FreeTierLimitError` (403) — 100-agent limit reached
+- `AgentAlreadyExistsError` (409) — email already registered
+- `AgentNotFoundError` (404) — agent UUID not found
+- `AgentAlreadyDecommissionedError` (409) — agent is already decommissioned
+
+**Configuration**: None — AgentService reads no environment variables. The free-tier limit (`FREE_TIER_MAX_AGENTS = 100`) is a module-level constant.
+
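The pre-write checks in `registerAgent` can be sketched as follows. This is a simplified, synchronous illustration: `InMemoryAgentStore` is a hypothetical stand-in for `AgentRepository` (which is asynchronous and backed by PostgreSQL), and the audit and metrics steps are omitted. The error class shapes are assumptions; only their names and status codes come from the lists above.

```typescript
const FREE_TIER_MAX_AGENTS = 100;

// Assumed error shapes: only the names and HTTP status codes are documented.
class FreeTierLimitError extends Error {
  readonly statusCode = 403;
}
class AgentAlreadyExistsError extends Error {
  readonly statusCode = 409;
}

// Hypothetical in-memory stand-in for AgentRepository.
class InMemoryAgentStore {
  private readonly emails: Record<string, true> = {};
  private size = 0;

  count(): number {
    return this.size;
  }
  hasEmail(email: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.emails, email);
  }
  insert(email: string): void {
    this.emails[email] = true;
    this.size += 1;
  }
}

// Runs the checks in the documented order: free-tier limit first, then email uniqueness.
// Nothing is written unless both checks pass.
function registerAgentSketch(store: InMemoryAgentStore, email: string): void {
  if (store.count() >= FREE_TIER_MAX_AGENTS) {
    throw new FreeTierLimitError("free-tier limit of 100 agents reached");
  }
  if (store.hasEmail(email)) {
    throw new AgentAlreadyExistsError("email already registered: " + email);
  }
  store.insert(email);
}
```

Because the limit check runs first, a duplicate email submitted once the store already holds 100 agents surfaces as a 403 rather than a 409 in this sketch.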
+---
+
+### OAuth2Service
+
+**Purpose**: Issues, introspects, and revokes RS256 JWT access tokens via the OAuth 2.0 Client Credentials grant.
+
+**Responsibility boundary**: OAuth2Service does not know about HTTP or routing. It
+receives already-extracted values (`clientId`, `clientSecret`, `scope`) from the
+controller, verifies the credential (via Vault or bcrypt), enforces the 10,000
+tokens/month free-tier limit, and returns a typed `ITokenResponse`. All audit writes
+on high-throughput paths (issue, introspect, revoke) are fire-and-forget (`void`) to
+keep token endpoint latency low.
+
+**Public interface** (key methods):
+
+| Method | Parameters | Returns | Description |
+|--------|-----------|---------|-------------|
+| `issueToken` | `clientId: string, clientSecret: string, scope: string, ipAddress: string, userAgent: string` | `Promise<ITokenResponse>` | Verifies credentials (Vault or bcrypt), checks agent status, enforces the 10k/month limit, and signs an RS256 JWT; the monthly counter increment and the audit event write are fire-and-forget |
+| `introspectToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<IIntrospectResponse>` | Verifies JWT signature and checks Redis revocation list; always returns 200 with `active: true/false` per RFC 7662 |
+| `revokeToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<void>` | Decodes token without verification; enforces that caller can only revoke their own tokens (`decoded.sub === callerPayload.sub`); adds JTI to Redis revocation list with TTL matching token expiry |
+
+**Database / storage schema**:
+- Redis key `revoked:{jti}` — value `1`, TTL = seconds until token expiry. Written on revocation; read on every authenticated request via `authMiddleware`.
+- Redis key `monthly:tokens:{agentId}:{yyyy-mm}` — integer counter, incremented on every successful token issuance. Read to enforce the 10k/month free-tier limit.
+
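The two Redis key patterns above can be built with small pure helpers. The sketch below is illustrative only: the function names and the `FREE_TIER_MAX_MONTHLY_TOKENS` constant name are hypothetical, and using UTC for the month boundary is an assumption.

```typescript
// Hypothetical constant name; the 10,000 value is the documented free-tier limit.
const FREE_TIER_MAX_MONTHLY_TOKENS = 10000;

// Key for a revoked token, matching the `revoked:{jti}` pattern.
function revocationKey(jti: string): string {
  return "revoked:" + jti;
}

// Seconds until the token expires. `exp` is the JWT expiry claim in epoch seconds.
// Returns 0 for a token that has already expired.
function revocationTtlSeconds(exp: number, now: Date): number {
  const nowSeconds = Math.floor(now.getTime() / 1000);
  return Math.max(0, exp - nowSeconds);
}

// Key for the per-agent monthly counter, `monthly:tokens:{agentId}:{yyyy-mm}`.
// Assumption: the month is taken in UTC.
function monthlyTokenKey(agentId: string, now: Date): string {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth() + 1;
  const paddedMonth = month < 10 ? "0" + month : String(month);
  return "monthly:tokens:" + agentId + ":" + year + "-" + paddedMonth;
}

// True when the counter value read from Redis means no further tokens may be issued.
function isOverMonthlyLimit(currentCount: number): boolean {
  return currentCount >= FREE_TIER_MAX_MONTHLY_TOKENS;
}
```

Setting the revocation TTL to the token's remaining lifetime means the entry disappears at the moment the token would have expired anyway, so the revocation list does not grow without bound.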
+**Error types**:
+- `AuthenticationError` (401) — agent not found, or no active credential matches the provided secret
+- `AuthorizationError` (403) — agent is suspended or decommissioned; or caller attempts to revoke another agent's token
+- `FreeTierLimitError` (403) — 10,000 tokens/month limit reached
+
+**Configuration**:
+- `JWT_PRIVATE_KEY` — PEM-encoded RSA private key, required, read at app startup in `src/app.ts`
+- `JWT_PUBLIC_KEY` — PEM-encoded RSA public key, required, read at app startup and in `authMiddleware`
+- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, Vault is used for credential verification instead of bcrypt
+
+---
+
+### CredentialService
+
+**Purpose**: Manages the full lifecycle of agent credentials — generation, listing, rotation, and revocation.
+
+**Responsibility boundary**: CredentialService does not know about HTTP or token
+issuance. It enforces that credentials can only be generated for `active` agents. It
+delegates secret storage to either `VaultClient` (Phase 2) or bcrypt (Phase 1 fallback).
+The plain-text `clientSecret` is generated here, returned once in the response, and
+never stored or logged — only the bcrypt hash or Vault path is persisted.
+
+**Public interface** (key methods):
+
+| Method | Parameters | Returns | Description |
+|--------|-----------|---------|-------------|
+| `generateCredential` | `agentId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Verifies agent exists and is `active`; generates a cryptographically random secret via `generateClientSecret()`; writes to Vault (when configured) or hashes with bcrypt; returns `ICredentialWithSecret` — the only time the plain-text secret is returned |
+| `listCredentials` | `agentId: string, filters: ICredentialListFilters` | `Promise<IPaginatedCredentialsResponse>` | Returns paginated credentials for an agent; `clientSecret` is never included in list responses |
+| `rotateCredential` | `agentId: string, credentialId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Generates a new secret for the same `credentialId`; overwrites Vault entry (new KV v2 version) or updates bcrypt hash; old secret is immediately invalidated; returns new `ICredentialWithSecret` once |
+| `revokeCredential` | `agentId: string, credentialId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Sets credential `status = 'revoked'`; permanently deletes the Vault secret via `vaultClient.deleteSecret()` when Vault is configured; rejects an already-revoked credential with `CredentialAlreadyRevokedError` |
+
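A rough sketch of the "return the secret once, persist only a derived value" rule from the table above. The body of `generateClientSecret()` is not shown in this document, so the 32-byte base64url implementation here is an assumption, as are `issueCredentialSketch` and both record shapes. `sha256Hex` is only a stand-in so the sketch runs without third-party packages; it is not a substitute for bcrypt.

```typescript
import { createHash, randomBytes } from "crypto";

// Assumed implementation: 32 random bytes, base64url-encoded (43 characters).
function generateClientSecret(): string {
  return randomBytes(32).toString("base64url");
}

// Stand-in for bcrypt, used here for illustration only.
function sha256Hex(plain: string): string {
  return createHash("sha256").update(plain).digest("hex");
}

// Hypothetical shapes: what is persisted versus what is returned to the caller once.
interface StoredCredentialSketch {
  credentialId: string;
  secretHash: string;
  vaultPath: string | null;
}
interface CredentialWithSecretSketch {
  credentialId: string;
  clientSecret: string;
}

// Generates a secret, derives the value to persist, and keeps the plain text
// only in the one-time response object.
function issueCredentialSketch(
  credentialId: string,
  hashSecret: (plain: string) => string,
): { stored: StoredCredentialSketch; response: CredentialWithSecretSketch } {
  const clientSecret = generateClientSecret();
  return {
    stored: { credentialId, secretHash: hashSecret(clientSecret), vaultPath: null },
    response: { credentialId, clientSecret },
  };
}
```

The `stored` record never contains the plain-text secret, which is why list responses cannot include it even by accident.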
+**Database / storage schema**:
+- Table `credentials`: `credential_id` (UUID PK), `client_id` (= `agentId`, FK to `agents`), `secret_hash` (bcrypt hash; empty string when Vault path is set), `vault_path` (nullable — KV v2 data path), `status`, `created_at`, `expires_at` (nullable), `revoked_at` (nullable).
+
+**Error types**:
+- `AgentNotFoundError` (404) — agent UUID not found
+- `CredentialError` (400) — agent is not in `active` status (code: `AGENT_NOT_ACTIVE`)
+- `CredentialNotFoundError` (404) — credential not found or belongs to a different agent
+- `CredentialAlreadyRevokedError` (409) — credential is already revoked
+
+**Configuration**:
+- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, new credentials are stored in Vault KV v2 instead of bcrypt. Existing bcrypt-based credentials continue to work unchanged.
+
+---
+
+### AuditService
+
+**Purpose**: Creates and queries immutable audit events for compliance and observability.
+
+**Responsibility boundary**: AuditService does not know about HTTP, tokens, or agents.
+It receives already-assembled event data from other services and delegates all
+persistence to `AuditRepository`. It enforces the 90-day free-tier retention window
+on all query and retrieval operations — events older than 90 days are treated as
+non-existent.
+
+**Public interface** (key methods):
+
+| Method | Parameters | Returns | Description |
+|--------|-----------|---------|-------------|
+| `logEvent` | `agentId: string, action: AuditAction, outcome: AuditOutcome, ipAddress: string, userAgent: string, metadata: Record<string, unknown>` | `Promise<IAuditEvent>` | Writes an immutable audit row to PostgreSQL. For token endpoints, callers use `void` (fire-and-forget). For CRUD operations, callers `await` this method. |
+| `queryEvents` | `filters: IAuditListFilters` | `Promise<IPaginatedAuditEventsResponse>` | Returns paginated, filtered audit events; enforces the 90-day retention window by computing the cutoff date and rejecting queries with `fromDate` before the cutoff; validates that `fromDate <= toDate` |
+| `getEventById` | `eventId: string` | `Promise<IAuditEvent>` | Retrieves a single event by UUID; throws `AuditEventNotFoundError` for both genuinely missing events and events outside the 90-day retention window (indistinguishable by design) |
+
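The retention rules in the table above reduce to a cutoff computation plus two comparisons. The sketch below is illustrative: `retentionCutoff`, `checkQueryRange`, and `isWithinRetention` are hypothetical names, the result is modelled as a string rather than a thrown error, and the order of the two range checks is an assumption.

```typescript
const FREE_TIER_RETENTION_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Oldest timestamp that is still visible at time `now`.
function retentionCutoff(now: Date): Date {
  return new Date(now.getTime() - FREE_TIER_RETENTION_DAYS * MS_PER_DAY);
}

type QueryRangeCheck = "ok" | "RetentionWindowError" | "ValidationError";

// Mirrors the queryEvents validation: reject a fromDate before the cutoff,
// then reject a range whose start is after its end.
function checkQueryRange(now: Date, fromDate?: Date, toDate?: Date): QueryRangeCheck {
  if (fromDate && fromDate.getTime() < retentionCutoff(now).getTime()) {
    return "RetentionWindowError";
  }
  if (fromDate && toDate && fromDate.getTime() > toDate.getTime()) {
    return "ValidationError";
  }
  return "ok";
}

// Mirrors getEventById: an event older than the cutoff is treated like a missing one.
function isWithinRetention(eventTimestamp: Date, now: Date): boolean {
  return eventTimestamp.getTime() >= retentionCutoff(now).getTime();
}
```

With `now` at 2026-03-29T00:00:00Z the cutoff falls on 2025-12-29T00:00:00Z, so an event written on 2025-12-28 is reported as not found.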
+**Database / storage schema**:
+- Table `audit_events`: `event_id` (UUID PK), `agent_id` (text FK to agents), `action` (text — one of the `AuditAction` union type values), `outcome` (`success` or `failure`), `ip_address` (text), `user_agent` (text), `metadata` (JSONB), `timestamp` (timestamptz, NOT NULL, indexed).
+- No Redis usage — AuditService is PostgreSQL-only.
+
+**Error types**:
+- `AuditEventNotFoundError` (404) — event not found or outside retention window
+- `RetentionWindowError` (400) — query `fromDate` is before the 90-day retention cutoff
+- `ValidationError` (400) — `fromDate` is after `toDate`
+
+**Configuration**: None — the retention window (`FREE_TIER_RETENTION_DAYS = 90`) is a module-level constant.
+
+---
+
+### VaultClient
+
+**Purpose**: Wraps HashiCorp Vault KV v2 operations for credential secret storage and verification.
+
+**Responsibility boundary**: VaultClient is a client adapter — it knows only about
+Vault API calls. It has no knowledge of business rules, HTTP, or PostgreSQL. It is
+injected into `CredentialService` and `OAuth2Service` via constructor injection. When
+`VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the bcrypt
+code path is used unchanged.
+
+**Public methods**:
+
+| Method | Parameters | Returns | Description |
+|--------|-----------|---------|-------------|
+| `writeSecret` | `agentId: string, credentialId: string, plainSecret: string` | `Promise<string>` | Writes the plain-text secret to the KV v2 data path; returns the path; creates a new KV v2 version on subsequent calls (used for rotation) |
+| `readSecret` | `agentId: string, credentialId: string` | `Promise<string>` | Reads and returns the plain-text secret from Vault; throws `CredentialError` if the path is not found or the read fails |
+| `verifySecret` | `agentId: string, credentialId: string, candidateSecret: string` | `Promise<boolean>` | Reads the stored secret via `readSecret`, then compares using `crypto.timingSafeEqual` to prevent timing-based side-channel attacks; returns `false` on any Vault error rather than throwing |
+| `deleteSecret` | `agentId: string, credentialId: string` | `Promise<void>` | Permanently deletes all versions of a credential secret by sending a `DELETE` request to the KV v2 metadata path (`{mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`) |
+
+**KV v2 path structure**:
+- Data path: `{mount}/data/agentidp/agents/{agentId}/credentials/{credentialId}`
+- Metadata path (for permanent deletion): `{mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`
+- Default mount: `secret` (overridable via `VAULT_MOUNT`)
+
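The path structure above can be expressed as two small builders. This is a sketch: the function names and the `DEFAULT_VAULT_MOUNT` constant are hypothetical, and only the path templates and the `secret` default come from the list above.

```typescript
// Documented default mount, overridable via VAULT_MOUNT.
const DEFAULT_VAULT_MOUNT = "secret";

// Shared suffix of both paths.
function credentialSubPath(agentId: string, credentialId: string): string {
  return "agentidp/agents/" + agentId + "/credentials/" + credentialId;
}

// Path used for reads and writes; each write creates a new KV v2 version.
function kvDataPath(
  agentId: string,
  credentialId: string,
  mount: string = DEFAULT_VAULT_MOUNT,
): string {
  return mount + "/data/" + credentialSubPath(agentId, credentialId);
}

// Path used for permanent deletion of every version.
function kvMetadataPath(
  agentId: string,
  credentialId: string,
  mount: string = DEFAULT_VAULT_MOUNT,
): string {
  return mount + "/metadata/" + credentialSubPath(agentId, credentialId);
}
```

For agent `a1` and credential `c1` with the default mount, the data path is `secret/data/agentidp/agents/a1/credentials/c1`.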
+**Opt-in configuration**:
+- `VAULT_ADDR` — Vault server address (e.g. `http://127.0.0.1:8200`) — required to enable Vault mode
+- `VAULT_TOKEN` — Vault authentication token — required to enable Vault mode
+- `VAULT_MOUNT` — KV v2 mount path — optional, defaults to `secret`
+
+**Constant-time comparison rationale**: The `verifySecret` method uses Node.js
+`crypto.timingSafeEqual` instead of `===` so that attackers cannot infer the length
+or content of stored secrets by measuring how long the comparison takes. Because
+`timingSafeEqual` throws when its two inputs differ in length, that case is handled
+separately: a dummy `timingSafeEqual` call is still performed before reporting a
+mismatch, which masks the timing signal from the early-exit path.
+
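A minimal sketch of the comparison described above, assuming Node.js. The function name `secretsMatch` is hypothetical, and which buffers the real dummy call compares is not stated in this document; comparing the stored buffer with itself is an assumption.

```typescript
import { timingSafeEqual } from "crypto";

// Compares a stored secret with a candidate without short-circuiting on the
// first differing byte.
function secretsMatch(stored: string, candidate: string): boolean {
  const storedBuf = Buffer.from(stored, "utf8");
  const candidateBuf = Buffer.from(candidate, "utf8");

  if (storedBuf.length !== candidateBuf.length) {
    // timingSafeEqual throws on unequal lengths, so run a dummy comparison
    // of the stored buffer against itself and then report a mismatch.
    timingSafeEqual(storedBuf, storedBuf);
    return false;
  }
  return timingSafeEqual(storedBuf, candidateBuf);
}
```

Lengths are compared in bytes after UTF-8 encoding rather than in characters, since `timingSafeEqual` operates on byte buffers.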
+---
+
+### OPA Policy Engine
+
+**Purpose**: Enforces scope-based authorisation on every protected HTTP request without requiring a code deployment to change access rules.
+
+**Responsibility boundary**: The OPA policy engine (`src/middleware/opa.ts`) is a
+middleware layer — it does not know about business rules, credentials, or audit events.
+It receives the HTTP method, full request path, and caller scopes from `req.user`, and
+returns allow or deny. All policy logic lives in `policies/authz.rego` and
+`policies/data/scopes.json`.
+
+**Policy file locations**:
+- `policies/authz.rego` — Rego policy defining `normalise_path`, `lookup_key`, and the `allow` rule. Evaluated by the Wasm bundle when compiled; replicated in TypeScript for the fallback path.
+- `policies/data/scopes.json` — JSON map of `"METHOD:/path/pattern"` → `[required_scopes]`. Loaded as data into the Wasm policy and used directly by the TypeScript fallback.
+- `policies/authz.wasm` — compiled Wasm bundle (not committed to source control; built from `authz.rego` using the OPA CLI). When present, the Wasm path is used; when absent, the TypeScript fallback reads `scopes.json`.
+
+**How `opaMiddleware` evaluates input**:
+
+1. `createOpaMiddleware()` is called once at app startup in `src/app.ts`.
+2. It attempts to load `policies/authz.wasm`. If found, `loadPolicy(wasmBuffer)` is called and `scopes.json` data is injected via `loaded.setData(parsed)`.
+3. If no Wasm bundle is found, `scopes.json` is loaded into `scopesMap` as the TypeScript fallback.
+4. On every request, the middleware builds an `OpaInput` object: `{ method: req.method, path: req.baseUrl + req.path, scopes: req.user.scope.split(' ') }`.
+5. `evaluate(input)` checks the Wasm policy (if loaded) or applies `normalisePath` + scope-intersection logic against `scopesMap`. Returns `false` if neither is loaded (fail-closed).
+6. If `evaluate` returns `false`, the middleware calls `next(new AuthorizationError())`.
+
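The TypeScript fallback path in steps 4 to 6 can be sketched as below. The shape of `OpaInput` follows step 4, but the normalisation rule (UUID segments replaced with `:id`), the deny-on-unknown-route behaviour, and the "any one required scope is enough" interpretation of scope intersection are assumptions, not confirmed details of `src/middleware/opa.ts`.

```typescript
interface OpaInput {
  method: string;
  path: string;
  scopes: string[];
}
type ScopesMap = Record<string, string[]>;

const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Assumed normalisation: each UUID path segment becomes ":id".
function normalisePath(path: string): string {
  return path
    .split("/")
    .map((segment) => (UUID_SEGMENT.test(segment) ? ":id" : segment))
    .join("/");
}

// Fallback evaluation against scopes.json data. A null map means nothing was loaded.
function evaluate(input: OpaInput, scopesMap: ScopesMap | null): boolean {
  if (scopesMap === null) return false; // fail-closed when no policy data is loaded
  const key = input.method + ":" + normalisePath(input.path);
  if (!Object.prototype.hasOwnProperty.call(scopesMap, key)) return false; // unknown route: deny (assumption)
  const required = scopesMap[key];
  // Allow when the caller holds at least one of the required scopes (assumption).
  return required.some((scope) => input.scopes.indexOf(scope) !== -1);
}
```

A request for `GET /api/v1/agents/<uuid>` is looked up under the key `GET:/api/v1/agents/:id`, so one entry in the map covers every agent.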
+**How to write a new policy rule**:
+
+1. Add the new endpoint's scope requirement to `policies/data/scopes.json`:
+   ```json
+   "GET:/api/v1/reports": ["reports:read"]
+   ```
+2. Add `"reports:read"` to the `OAuthScope` union type in `src/types/index.ts`.
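+3. If Wasm mode is in use, recompile `authz.rego` to Wasm using the OPA CLI: `opa build -t wasm -e authz/allow -o bundle.tar.gz policies/authz.rego` (the entrypoint `authz/allow` corresponds to `data.authz.allow`), then extract `/policy.wasm` from the resulting bundle and save it as `policies/authz.wasm`.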
+4. Send `SIGHUP` to the running process to hot-reload: `kill -HUP <pid>`.
+
+**How to test a policy rule**:
+```bash
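+# Using the OPA CLI directly. The input document is piped via stdin because
+# --input expects a file path; --format raw prints the bare value.
+echo '{"method":"GET","path":"/api/v1/agents","scopes":["agents:read"]}' | \
+  opa eval --stdin-input \
+           --format raw \
+           --data policies/data/scopes.json \
+           --bundle policies/ \
+           'data.authz.allow'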
+```
+Expected output: `true`. Replace method/path/scopes to test deny cases.
+
+**Hot-reload via SIGHUP**: When `SIGHUP` is received by the Node.js process,
+`server.ts` calls `reloadOpaPolicy()`. This re-executes the same startup loading logic:
+tries to load the Wasm bundle, falls back to `scopes.json`. The in-memory `wasmPolicy`
+and `scopesMap` module-level variables are replaced atomically. No requests are dropped.
+
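One way to picture the atomic swap is to build the new policy state completely and then replace a single reference. The sketch below departs from the description above in two labelled ways: it holds both values in one hypothetical `PolicyState` object instead of two module-level variables, and it uses a synchronous loader, whereas the real loading logic may be asynchronous. `installReloadHandler` and `currentPolicy` are hypothetical names.

```typescript
// Hypothetical container for the two values described above.
interface PolicyState {
  wasmPolicy: object | null;
  scopesMap: Record<string, string[]> | null;
}

let activePolicy: PolicyState = { wasmPolicy: null, scopesMap: null };

function currentPolicy(): PolicyState {
  return activePolicy;
}

// Builds the replacement state first, then swaps one reference.
// If `load` throws, the previous state stays active.
function reloadOpaPolicy(load: () => PolicyState): void {
  const next = load();
  activePolicy = next;
}

// Wires the reload to SIGHUP. Errors are logged so a bad policy file
// does not crash the process.
function installReloadHandler(load: () => PolicyState): void {
  process.on("SIGHUP", () => {
    try {
      reloadOpaPolicy(load);
    } catch (err) {
      console.error("policy reload failed; keeping previous policy", err);
    }
  });
}
```

Because JavaScript runs each handler to completion on a single thread, a request in this sketch sees either the old state or the new one, never a mix of the two.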
+---
+
+### Web Dashboard
+
+**Purpose**: Provides a browser-based UI for human operators to manage agents, credentials, and audit logs without writing API calls directly.
+
+**Responsibility boundary**: The dashboard is a pure client-side React SPA. It has no
+server-side logic. It calls the AgentIdP REST API using the `@sentryagent/idp-sdk`
+`TokenManager` for authentication and a typed `ApiClient` from `dashboard/src/lib/client.ts`
+for all API calls. It never stores the `access_token` in localStorage — only
+`client_id`, `client_secret`, and `baseUrl` are stored in `sessionStorage` (cleared
+on tab close).
+
+**React component structure**:
+
+```
+dashboard/src/
+├── main.tsx               # React root — mounts App into #root, wraps with BrowserRouter
+├── App.tsx                # Route definitions — AuthProvider, RequireAuth, AppShell
+├── lib/
+│   ├── auth.tsx           # AuthContext, AuthProvider, useAuth hook, sessionStorage helpers
+│   └── client.ts          # Typed ApiClient class — wraps fetch with TokenManager token injection
+├── components/
+│   ├── RequireAuth.tsx    # Route guard — redirects to /dashboard/login if not authenticated
+│   └── layout/AppShell.tsx # Persistent sidebar navigation + Outlet for page content
+└── pages/
+    ├── Login.tsx          # Login form — calls auth.login(), redirects to /dashboard/agents
+    ├── Agents.tsx         # Paginated agents list with status filter and search
+    ├── AgentDetail.tsx    # Single agent view — status, metadata, update, decommission actions
+    ├── Credentials.tsx    # Credential list for an agent — generate, rotate, revoke actions
+    ├── AuditLog.tsx       # Paginated audit log with date range and action filters
+    └── Health.tsx         # /health endpoint response — PostgreSQL and Redis status display
+```
+
+**Authentication flow with sessionStorage**:
+1. On `Login.tsx` form submit, `auth.login(creds)` is called.
+2. `validateCredentials(creds)` creates a `TokenManager` and calls `getToken()` — if this succeeds, the credentials are valid.
+3. `saveCredentials(creds)` stores `{ clientId, clientSecret, baseUrl }` in `sessionStorage` under key `agentidp_credentials`.
+4. On every subsequent API call, `getClient()` in `lib/client.ts` reads credentials from `sessionStorage`, creates a `TokenManager`, and injects the current `access_token` into the `Authorization: Bearer` header. The `TokenManager` handles automatic token refresh when the token is expired.
+5. `auth.logout()` calls `clearCredentials()` (removes the `sessionStorage` key) and navigates to `/dashboard/login`.
+
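The storage helpers named in steps 3 and 5 might look roughly like this. The sketch takes the storage object as a parameter so it can run outside a browser; the real helpers in `lib/auth.tsx` presumably use `sessionStorage` directly. `KeyValueStore`, `MemoryStore`, and `loadCredentials` are hypothetical names.

```typescript
interface Credentials {
  clientId: string;
  clientSecret: string;
  baseUrl: string;
}

// Minimal subset of the browser Storage interface used by the helpers.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const STORAGE_KEY = "agentidp_credentials";

function saveCredentials(store: KeyValueStore, creds: Credentials): void {
  store.setItem(STORAGE_KEY, JSON.stringify(creds));
}

// Returns null when nothing is stored or the stored value is not valid JSON.
function loadCredentials(store: KeyValueStore): Credentials | null {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as Credentials;
  } catch (err) {
    return null;
  }
}

function clearCredentials(store: KeyValueStore): void {
  store.removeItem(STORAGE_KEY);
}

// In-memory stand-in for sessionStorage, so the sketch runs without a browser.
class MemoryStore implements KeyValueStore {
  private items: Record<string, string> = {};
  getItem(key: string): string | null {
    return Object.prototype.hasOwnProperty.call(this.items, key) ? this.items[key] : null;
  }
  setItem(key: string, value: string): void {
    this.items[key] = value;
  }
  removeItem(key: string): void {
    delete this.items[key];
  }
}
```

In the browser, passing `window.sessionStorage` as the store would give the tab-scoped lifetime described above.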
+**Main views and their API calls**:
+- **Agents** — `GET /api/v1/agents?page=N&limit=20` — paginated list with `status` filter
+- **AgentDetail** — `GET /api/v1/agents/:id`, `PATCH /api/v1/agents/:id`, `DELETE /api/v1/agents/:id`
+- **Credentials** — `GET /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials/:credId/rotate`, `DELETE /api/v1/agents/:id/credentials/:credId`
+- **AuditLog** — `GET /api/v1/audit?page=N&limit=20&fromDate=...&toDate=...`
+- **Health** — `GET /health`
+
+**Local development**:
+```bash
+cd dashboard
+npm install
+npm run dev    # Vite dev server with HMR — dashboard available at http://localhost:5173/dashboard
+```
+The Vite dev server proxies `/api/` calls to the Express server at `http://localhost:3000`.
+The Express server must be running separately for API calls to work.
+
+---
+
+### Prometheus/Grafana Monitoring
+
+**Purpose**: Provides operational visibility into AgentIdP's HTTP traffic, token issuance rates, agent registration rates, database latency, and Redis command latency.
+
+**Responsibility boundary**: The metrics middleware (`src/middleware/metrics.ts`) and
+the metrics registry (`src/metrics/registry.ts`) are observability concerns only — they
+do not affect business logic. Metrics are exposed at `GET /metrics` via
+`createMetricsRouter()` using `metricsRegistry.metrics()` from `prom-client`. The
+`/metrics` endpoint is unauthenticated; it is intended only for scraping by Prometheus
+and is not meant to be exposed to the public internet.
+
+**Key metrics with labels**:
+
+| Metric Name | Type | Labels | Description |
+|-------------|------|--------|-------------|
+| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received; route is normalised (UUIDs replaced with `:id`) |
+| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration; buckets from 5ms to 2.5s |
+| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens successfully issued |
+| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents successfully registered |
+| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration; buckets from 1ms to 1s |
+| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration; buckets from 0.5ms to 250ms |
+
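The route normalisation mentioned for `agentidp_http_requests_total` keeps label cardinality bounded: without it, every agent UUID would create a new time series. A sketch of such a normaliser follows; the function name `normaliseRoute` and the exact pattern are assumptions, not the code in `src/middleware/metrics.ts`.

```typescript
// Matches a UUID anywhere in the path (case-insensitive, all occurrences).
const UUID_PATTERN = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

// Replaces every UUID in the path with ":id" so the `route` label takes a
// small, fixed set of values.
function normaliseRoute(path: string): string {
  return path.replace(UUID_PATTERN, ":id");
}
```

For example, a credential rotation request would be recorded under the route `/api/v1/agents/:id/credentials/:id/rotate` regardless of which agent or credential it targets.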
+**How to add a new Counter**:
+1. Open `src/metrics/registry.ts`.
+2. Add a new `Counter` export:
+   ```typescript
+   export const myNewCounter = new Counter({
+     name: 'agentidp_my_new_counter_total',
+     help: 'Description of what this counts.',
+     labelNames: ['label_one'] as const,
+     registers: [metricsRegistry],
+   });
+   ```
+3. Import and call `myNewCounter.inc({ label_one: value })` in the service or middleware where the event occurs.
+
+**How to add a new Histogram**:
+1. Open `src/metrics/registry.ts`.
+2. Add a new `Histogram` export with appropriate buckets:
+   ```typescript
+   export const myDurationHistogram = new Histogram({
+     name: 'agentidp_my_operation_duration_seconds',
+     help: 'Duration of my operation in seconds.',
+     labelNames: ['operation'] as const,
+     buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
+     registers: [metricsRegistry],
+   });
+   ```
+3. Use `const end = myDurationHistogram.startTimer({ operation: 'name' }); ... end();` around the operation being measured.
+
+**Grafana access in local Docker**:
+
+Start the monitoring overlay:
+```bash
+docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up
+```
+- Prometheus: `http://localhost:9090`
+- Grafana: `http://localhost:3001` — default credentials: `admin` / `agentidp`
+
+Grafana is pre-provisioned with a Prometheus data source pointing to `http://prometheus:9090`
+and dashboard JSON files from `monitoring/grafana/dashboards/`. No manual configuration
+is needed after startup.