# Service Deep Dives

## AgentService
Purpose: Manages the full lifecycle of AI agent identities — registration, retrieval, updates, and decommissioning.
Responsibility boundary: AgentService does not handle HTTP, credential secrets,
token issuance, or audit log queries. It delegates all data access to
AgentRepository and CredentialRepository, and all audit logging to AuditService.
It enforces free-tier limits and domain rules before any data is written.
Public interface (key methods):

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `registerAgent` | `data: ICreateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Checks the free-tier 100-agent limit, enforces email uniqueness, creates the agent record, writes an `agent.created` audit event, and increments the `agentidp_agents_registered_total` Prometheus counter |
| `getAgentById` | `agentId: string` | `Promise<IAgent>` | Retrieves a single agent by UUID; throws `AgentNotFoundError` if not found |
| `listAgents` | `filters: IAgentListFilters` | `Promise<IPaginatedAgentsResponse>` | Returns a paginated, optionally filtered list; filters include `owner`, `agentType`, `status`, `page`, `limit` |
| `updateAgent` | `agentId: string, data: IUpdateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Partially updates agent metadata; rejects updates to decommissioned agents; determines the correct audit action (`agent.updated`, `agent.suspended`, `agent.reactivated`, `agent.decommissioned`) based on the status transition |
| `decommissionAgent` | `agentId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Soft-deletes the agent (sets `status = 'decommissioned'`); revokes all active credentials via `credentialRepository.revokeAllForAgent(agentId)` before decommissioning |
Database / storage schema:
- Table `agents`: `agent_id` (UUID PK), `email` (UNIQUE), `agent_type`, `version`, `capabilities` (text array), `owner`, `deployment_env`, `status`, `created_at`, `updated_at`.
- No Redis usage — AgentService is PostgreSQL-only.
Error types:
- `FreeTierLimitError` (403) — 100-agent limit reached
- `AgentAlreadyExistsError` (409) — email already registered
- `AgentNotFoundError` (404) — agent UUID not found
- `AgentAlreadyDecommissionedError` (409) — agent is already decommissioned
Configuration: None — AgentService reads no environment variables. The free-tier limit (`FREE_TIER_MAX_AGENTS = 100`) is a module-level constant.
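The limit check described above can be sketched as a small guard. This is a hypothetical rendering: the constant and the error name come from the doc, but the helper `assertUnderAgentLimit` and its signature are illustrative, not taken from the source.

```typescript
const FREE_TIER_MAX_AGENTS = 100;

class FreeTierLimitError extends Error {
  readonly statusCode = 403;
  constructor() {
    super(`Free tier allows at most ${FREE_TIER_MAX_AGENTS} agents`);
  }
}

// Illustrative guard: called before the agent row is written.
function assertUnderAgentLimit(currentCount: number): void {
  if (currentCount >= FREE_TIER_MAX_AGENTS) {
    throw new FreeTierLimitError();
  }
}
```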
## OAuth2Service
Purpose: Issues, introspects, and revokes RS256 JWT access tokens via the OAuth 2.0 Client Credentials grant.
Responsibility boundary: OAuth2Service does not know about HTTP or routing. It
receives already-extracted values (clientId, clientSecret, scope) from the
controller, resolves credential verification (Vault or bcrypt), enforces the 10,000
tokens/month free-tier limit, and returns a typed ITokenResponse. All audit writes
on high-throughput paths (issue, introspect, revoke) are fire-and-forget (void) to
keep token endpoint latency low.
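The fire-and-forget pattern described above can be sketched as follows. The `AuditLike` interface is a minimal stand-in (an assumption, not the real `AuditService` signature), and the action name `token.issue` is illustrative; only the `void` usage mirrors the doc.

```typescript
// Minimal stand-in for the real audit interface (assumption).
interface AuditLike {
  logEvent(agentId: string, action: string): Promise<void>;
}

// Illustrative helper: record a token issuance without awaiting the write.
function recordTokenIssued(audit: AuditLike, agentId: string): void {
  // `void` discards the promise so the token response is not delayed by
  // the audit write; .catch keeps a failed write from becoming an
  // unhandled rejection. Audit failures must not fail token issuance.
  void audit.logEvent(agentId, "token.issue").catch(() => {});
}
```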
Public interface (key methods):

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `issueToken` | `clientId: string, clientSecret: string, scope: string, ipAddress: string, userAgent: string` | `Promise<ITokenResponse>` | Verifies credentials (Vault or bcrypt), checks agent status, enforces the 10k/month limit, signs an RS256 JWT, and increments the monthly counter and audit event as fire-and-forget |
| `introspectToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<IIntrospectResponse>` | Verifies the JWT signature and checks the Redis revocation list; always returns 200 with `active: true/false` per RFC 7662 |
| `revokeToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<void>` | Decodes the token without verification; enforces that a caller can only revoke their own tokens (`decoded.sub === callerPayload.sub`); adds the JTI to the Redis revocation list with a TTL matching the token expiry |
Database / storage schema:
- Redis key `revoked:{jti}` — value `1`, TTL = seconds until token expiry. Written on revocation; read on every authenticated request via `authMiddleware`.
- Redis key `monthly:tokens:{agentId}:{yyyy-mm}` — integer counter, incremented on every successful token issuance. Read to enforce the 10k/month free-tier limit.
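A sketch of how the TTL for the `revoked:{jti}` key can be derived from the token's `exp` claim (seconds since epoch). The key format follows the doc; the helper names and the clamp-to-1 behaviour are assumptions.

```typescript
// Illustrative key builder for the revocation list.
function revocationKey(jti: string): string {
  return `revoked:${jti}`;
}

// TTL in whole seconds, clamped to at least 1 so an already-expired
// token still gets a short-lived marker rather than an invalid TTL.
function revocationTtlSeconds(expSeconds: number, nowMs: number = Date.now()): number {
  const remaining = Math.ceil(expSeconds - nowMs / 1000);
  return Math.max(remaining, 1);
}
```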
Error types:
- `AuthenticationError` (401) — agent not found, or no active credential matches the provided secret
- `AuthorizationError` (403) — agent is suspended or decommissioned, or the caller attempts to revoke another agent's token
- `FreeTierLimitError` (403) — 10,000 tokens/month limit reached
Configuration:
- `JWT_PRIVATE_KEY` — PEM-encoded RSA private key, required, read at app startup in `src/app.ts`
- `JWT_PUBLIC_KEY` — PEM-encoded RSA public key, required, read at app startup and in `authMiddleware`
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, Vault is used for credential verification instead of bcrypt
## CredentialService
Purpose: Manages the full lifecycle of agent credentials — generation, listing, rotation, and revocation.
Responsibility boundary: CredentialService does not know about HTTP or token
issuance. It enforces that credentials can only be generated for active agents. It
delegates secret storage to either VaultClient (Phase 2) or bcrypt (Phase 1 fallback).
The plain-text clientSecret is generated here, returned once in the response, and
never stored or logged — only the bcrypt hash or Vault path is persisted.
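The doc names a `generateClientSecret()` helper; a minimal sketch might look like the following. The byte length and `base64url` encoding are assumptions, not taken from the source.

```typescript
import { randomBytes } from "node:crypto";

// Illustrative sketch: the real generateClientSecret() may use a
// different length or alphabet. base64url avoids characters that would
// need escaping in HTTP Basic auth headers.
function generateClientSecret(bytes: number = 32): string {
  return randomBytes(bytes).toString("base64url");
}
```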
Public interface (key methods):

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `generateCredential` | `agentId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Verifies the agent exists and is active; generates a cryptographically random secret via `generateClientSecret()`; writes to Vault (when configured) or hashes with bcrypt; returns `ICredentialWithSecret` — the only time the plain-text secret is returned |
| `listCredentials` | `agentId: string, filters: ICredentialListFilters` | `Promise<IPaginatedCredentialsResponse>` | Returns paginated credentials for an agent; `clientSecret` is never included in list responses |
| `rotateCredential` | `agentId: string, credentialId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Generates a new secret for the same `credentialId`; overwrites the Vault entry (new KV v2 version) or updates the bcrypt hash; the old secret is immediately invalidated; returns the new `ICredentialWithSecret` once |
| `revokeCredential` | `agentId: string, credentialId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Sets credential `status = 'revoked'`; permanently deletes the Vault secret via `vaultClient.deleteSecret()` when Vault is configured; rejects already-revoked credentials with `CredentialAlreadyRevokedError` |
Database / storage schema:
- Table `credentials`: `credential_id` (UUID PK), `client_id` (= `agentId`, FK to `agents`), `secret_hash` (bcrypt hash; empty string when a Vault path is set), `vault_path` (nullable — KV v2 data path), `status`, `created_at`, `expires_at` (nullable), `revoked_at` (nullable).
Error types:
- `AgentNotFoundError` (404) — agent UUID not found
- `CredentialError` (400) — agent is not in `active` status (code: `AGENT_NOT_ACTIVE`)
- `CredentialNotFoundError` (404) — credential not found or belongs to a different agent
- `CredentialAlreadyRevokedError` (409) — credential is already revoked
Configuration:
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, new credentials are stored in Vault KV v2 instead of bcrypt. Existing bcrypt-based credentials continue to work unchanged.
## AuditService
Purpose: Creates and queries immutable audit events for compliance and observability.
Responsibility boundary: AuditService does not know about HTTP, tokens, or agents.
It receives already-assembled event data from other services and delegates all
persistence to AuditRepository. It enforces the 90-day free-tier retention window
on all query and retrieval operations — events older than 90 days are treated as
non-existent.
Public interface (key methods):

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `logEvent` | `agentId: string, action: AuditAction, outcome: AuditOutcome, ipAddress: string, userAgent: string, metadata: Record<string, unknown>` | `Promise<IAuditEvent>` | Writes an immutable audit row to PostgreSQL. For token endpoints, callers use `void` (fire-and-forget); for CRUD operations, callers `await` this method. |
| `queryEvents` | `filters: IAuditListFilters` | `Promise<IPaginatedAuditEventsResponse>` | Returns paginated, filtered audit events; enforces the 90-day retention window by computing the cutoff date and rejecting queries with `fromDate` before the cutoff; validates that `fromDate <= toDate` |
| `getEventById` | `eventId: string` | `Promise<IAuditEvent>` | Retrieves a single event by UUID; throws `AuditEventNotFoundError` for both genuinely missing events and events outside the 90-day retention window (indistinguishable by design) |
Database / storage schema:
- Table `audit_events`: `event_id` (UUID PK), `agent_id` (text FK to `agents`), `action` (text — one of the `AuditAction` union type values), `outcome` (`success` or `failure`), `ip_address` (text), `user_agent` (text), `metadata` (JSONB), `timestamp` (timestamptz, NOT NULL, indexed).
- No Redis usage — AuditService is PostgreSQL-only.
Error types:
- `AuditEventNotFoundError` (404) — event not found or outside retention window
- `RetentionWindowError` (400) — query `fromDate` is before the 90-day retention cutoff
- `ValidationError` (400) — `fromDate` is after `toDate`
Configuration: None — the retention window (`FREE_TIER_RETENTION_DAYS = 90`) is a module-level constant.
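The retention-window check described above can be sketched as two pure functions. The constant comes from the doc; the helper names and the exact cutoff arithmetic are assumptions.

```typescript
const FREE_TIER_RETENTION_DAYS = 90;

// Illustrative cutoff: "now minus 90 days" in UTC.
function retentionCutoff(now: Date): Date {
  const cutoff = new Date(now);
  cutoff.setUTCDate(cutoff.getUTCDate() - FREE_TIER_RETENTION_DAYS);
  return cutoff;
}

// Events at or after the cutoff are visible; older events are treated
// as non-existent by query and retrieval operations.
function isWithinRetention(eventTimestamp: Date, now: Date): boolean {
  return eventTimestamp.getTime() >= retentionCutoff(now).getTime();
}
```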
## VaultClient
Purpose: Wraps HashiCorp Vault KV v2 operations for credential secret storage and verification.
Responsibility boundary: VaultClient is a client adapter — it knows only about
Vault API calls. It has no knowledge of business rules, HTTP, or PostgreSQL. It is
injected into CredentialService and OAuth2Service via constructor injection. When
VAULT_ADDR is not set, createVaultClientFromEnv() returns null and the bcrypt
code path is used unchanged.
Public methods:

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `writeSecret` | `agentId: string, credentialId: string, plainSecret: string` | `Promise<string>` | Writes the plain-text secret to the KV v2 data path and returns the path; creates a new KV v2 version on subsequent calls (used for rotation) |
| `readSecret` | `agentId: string, credentialId: string` | `Promise<string>` | Reads and returns the plain-text secret from Vault; throws `CredentialError` if the path is not found or the read fails |
| `verifySecret` | `agentId: string, credentialId: string, candidateSecret: string` | `Promise<boolean>` | Reads the stored secret via `readSecret`, then compares using `crypto.timingSafeEqual` to prevent timing-based side-channel attacks; returns `false` on any Vault error rather than throwing |
| `deleteSecret` | `agentId: string, credentialId: string` | `Promise<void>` | Permanently deletes all versions of a credential secret by calling the KV v2 metadata path (`DELETE {mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`) |
KV v2 path structure:
- Data path: `{mount}/data/agentidp/agents/{agentId}/credentials/{credentialId}`
- Metadata path (for permanent deletion): `{mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`
- Default mount: `secret` (overridable via `VAULT_MOUNT`)
Opt-in configuration:
- `VAULT_ADDR` — Vault server address (e.g. `http://127.0.0.1:8200`) — required to enable Vault mode
- `VAULT_TOKEN` — Vault authentication token — required to enable Vault mode
- `VAULT_MOUNT` — KV v2 mount path — optional, defaults to `secret`
Constant-time comparison rationale: The verifySecret method uses Node.js
crypto.timingSafeEqual instead of === to prevent attackers from inferring the
length or content of stored secrets by measuring how long the comparison takes. When
the stored and candidate secrets differ in length, a dummy timingSafeEqual call is
still performed to eliminate the timing signal from the early-exit path.
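A minimal sketch of this comparison, including the dummy compare on length mismatch. The helper name `secretsMatch` is illustrative; the `timingSafeEqual` usage follows the rationale above.

```typescript
import { timingSafeEqual } from "node:crypto";

// Illustrative constant-time string comparison. timingSafeEqual requires
// equal-length buffers, so a length mismatch compares the candidate
// against itself to keep the timing profile flat, then returns false.
function secretsMatch(stored: string, candidate: string): boolean {
  const a = Buffer.from(stored, "utf8");
  const b = Buffer.from(candidate, "utf8");
  if (a.length !== b.length) {
    timingSafeEqual(b, b); // dummy compare: no early-exit timing signal
    return false;
  }
  return timingSafeEqual(a, b);
}
```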
## OPA Policy Engine
Purpose: Enforces scope-based authorisation on every protected HTTP request without requiring a code deployment to change access rules.
Responsibility boundary: The OPA policy engine (src/middleware/opa.ts) is a
middleware layer — it does not know about business rules, credentials, or audit events.
It receives the HTTP method, full request path, and caller scopes from req.user, and
returns allow or deny. All policy logic lives in policies/authz.rego and
policies/data/scopes.json.
Policy file locations:
- `policies/authz.rego` — Rego policy defining `normalise_path`, `lookup_key`, and the `allow` rule. Evaluated by the Wasm bundle when compiled; replicated in TypeScript for the fallback path.
- `policies/data/scopes.json` — JSON map of `"METHOD:/path/pattern"` → `[required_scopes]`. Loaded as data into the Wasm policy and used directly by the TypeScript fallback.
- `policies/authz.wasm` — compiled Wasm bundle (not committed to source control; built from `authz.rego` using the OPA CLI). When present, the Wasm path is used; when absent, the TypeScript fallback reads `scopes.json`.
How opaMiddleware evaluates input:
1. `createOpaMiddleware()` is called once at app startup in `src/app.ts`.
2. It attempts to load `policies/authz.wasm`. If found, `loadPolicy(wasmBuffer)` is called and `scopes.json` data is injected via `loaded.setData(parsed)`.
3. If no Wasm bundle is found, `scopes.json` is loaded into `scopesMap` as the TypeScript fallback.
4. On every request, the middleware builds an `OpaInput` object: `{ method: req.method, path: req.baseUrl + req.path, scopes: req.user.scope.split(' ') }`.
5. `evaluate(input)` checks the Wasm policy (if loaded) or applies `normalisePath` + scope-intersection logic against `scopesMap`. Returns `false` if neither is loaded (fail-closed).
6. If `evaluate` returns `false`, the middleware calls `next(new AuthorizationError())`.
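The TypeScript fallback's scope-intersection check described above can be sketched as a pure function. This is a hypothetical rendering: `evaluateFallback` and its signature are not from the source, and it assumes the request path has already been normalised (UUIDs replaced with `:id`).

```typescript
type ScopesMap = Record<string, string[]>;

// Illustrative fallback: build a "METHOD:path" key against scopesMap and
// allow only when the caller holds every required scope.
function evaluateFallback(
  scopesMap: ScopesMap,
  method: string,
  normalisedPath: string,
  callerScopes: string[],
): boolean {
  const required = scopesMap[`${method}:${normalisedPath}`];
  if (required === undefined) return false; // unknown route: fail closed
  return required.every((scope) => callerScopes.includes(scope));
}
```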
How to write a new policy rule:
1. Add the new endpoint's scope requirement to `policies/data/scopes.json`: `"GET:/api/v1/reports": ["reports:read"]`
2. Add `"reports:read"` to the `OAuthScope` union type in `src/types/index.ts`.
3. If Wasm mode is in use, recompile `authz.rego` to `authz.wasm` using the OPA CLI: `opa build policies/authz.rego -d policies/data/ -o policies/authz.wasm`
4. Send `SIGHUP` to the running process to hot-reload: `kill -HUP <pid>`
How to test a policy rule:

```bash
# Using the OPA CLI directly
opa eval --data policies/data/scopes.json \
  --input '{"method":"GET","path":"/api/v1/agents","scopes":["agents:read"]}' \
  --bundle policies/ \
  'data.authz.allow'
```

Expected output: `true`. Replace method/path/scopes to test deny cases.
Hot-reload via SIGHUP: When SIGHUP is received by the Node.js process,
server.ts calls reloadOpaPolicy(). This re-executes the same startup loading logic:
tries to load the Wasm bundle, falls back to scopes.json. The in-memory wasmPolicy
and scopesMap module-level variables are replaced atomically. No requests are dropped.
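The signal wiring might look like this. Only `reloadOpaPolicy` is named in the source; it is stubbed here, and the handler registration shown is a sketch of the described behaviour, not the repo's exact code.

```typescript
// Stub: the real implementation tries the Wasm bundle, falls back to
// scopes.json, then atomically replaces the module-level policy state.
async function reloadOpaPolicy(): Promise<void> {
  // ...load authz.wasm or scopes.json here...
}

// `kill -HUP <pid>` triggers this handler; in-flight requests keep using
// the previous in-memory policy until the swap completes.
process.on("SIGHUP", () => {
  reloadOpaPolicy().catch((err) => {
    // A failed reload leaves the previous policy active.
    console.error("OPA policy reload failed", err);
  });
});
```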
## Web Dashboard
Purpose: Provides a browser-based UI for human operators to manage agents, credentials, and audit logs without writing API calls directly.
Responsibility boundary: The dashboard is a pure client-side React SPA. It has no
server-side logic. It calls the AgentIdP REST API using the @sentryagent/idp-sdk
TokenManager for authentication and a typed ApiClient from dashboard/src/lib/client.ts
for all API calls. It never stores the access_token in localStorage — only
client_id, client_secret, and baseUrl are stored in sessionStorage (cleared
on tab close).
React component structure:

```
dashboard/src/
├── main.tsx                 # React root — mounts App into #root, wraps with BrowserRouter
├── App.tsx                  # Route definitions — AuthProvider, RequireAuth, AppShell
├── lib/
│   ├── auth.tsx             # AuthContext, AuthProvider, useAuth hook, sessionStorage helpers
│   └── client.ts            # Typed ApiClient class — wraps fetch with TokenManager token injection
├── components/
│   ├── RequireAuth.tsx      # Route guard — redirects to /dashboard/login if not authenticated
│   └── layout/AppShell.tsx  # Persistent sidebar navigation + Outlet for page content
└── pages/
    ├── Login.tsx            # Login form — calls auth.login(), redirects to /dashboard/agents
    ├── Agents.tsx           # Paginated agents list with status filter and search
    ├── AgentDetail.tsx      # Single agent view — status, metadata, update, decommission actions
    ├── Credentials.tsx      # Credential list for an agent — generate, rotate, revoke actions
    ├── AuditLog.tsx         # Paginated audit log with date range and action filters
    └── Health.tsx           # /health endpoint response — PostgreSQL and Redis status display
```
Authentication flow with sessionStorage:
1. On `Login.tsx` form submit, `auth.login(creds)` is called.
2. `validateCredentials(creds)` creates a `TokenManager` and calls `getToken()` — if this succeeds, the credentials are valid.
3. `saveCredentials(creds)` stores `{ clientId, clientSecret, baseUrl }` in `sessionStorage` under the key `agentidp_credentials`.
4. On every subsequent API call, `getClient()` in `lib/client.ts` reads credentials from `sessionStorage`, creates a `TokenManager`, and injects the current `access_token` into the `Authorization: Bearer` header. The `TokenManager` handles automatic token refresh when the token is expired.
5. `auth.logout()` calls `clearCredentials()` (removes the `sessionStorage` key) and navigates to `/dashboard/login`.
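The sessionStorage helpers described above might be sketched as follows. The function names and storage key follow the doc; the signatures are assumptions, and the minimal `KeyValueStore` interface (which the browser's `sessionStorage` satisfies) is introduced here so the helpers stay testable outside a browser.

```typescript
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface StoredCredentials {
  clientId: string;
  clientSecret: string;
  baseUrl: string;
}

const STORAGE_KEY = "agentidp_credentials";

// Note: only client_id / client_secret / baseUrl are stored, never the
// access_token itself, matching the boundary described above.
function saveCredentials(store: KeyValueStore, creds: StoredCredentials): void {
  store.setItem(STORAGE_KEY, JSON.stringify(creds));
}

function loadCredentials(store: KeyValueStore): StoredCredentials | null {
  const raw = store.getItem(STORAGE_KEY);
  return raw === null ? null : (JSON.parse(raw) as StoredCredentials);
}

function clearCredentials(store: KeyValueStore): void {
  store.removeItem(STORAGE_KEY);
}
```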
Main views and their API calls:
- Agents — `GET /api/v1/agents?page=N&limit=20` — paginated list with `status` filter
- AgentDetail — `GET /api/v1/agents/:id`, `PATCH /api/v1/agents/:id`, `DELETE /api/v1/agents/:id`
- Credentials — `GET /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials/:credId/rotate`, `DELETE /api/v1/agents/:id/credentials/:credId`
- AuditLog — `GET /api/v1/audit?page=N&limit=20&fromDate=...&toDate=...`
- Health — `GET /health`
Local development:

```bash
cd dashboard
npm install
npm run dev   # Vite dev server with HMR — dashboard available at http://localhost:5173/dashboard
```
The Vite dev server proxies /api/ calls to the Express server at http://localhost:3000.
The Express server must be running separately for API calls to work.
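A proxy setup with this behaviour might look like the following `vite.config.ts` fragment. This is an illustrative sketch, not the repo's actual config; the `base` value and any plugin options are assumptions.

```typescript
import { defineConfig } from "vite";

export default defineConfig({
  base: "/dashboard",
  server: {
    proxy: {
      // Forward API calls from the Vite dev server to the Express app.
      "/api": "http://localhost:3000",
    },
  },
});
```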
## Prometheus/Grafana Monitoring
Purpose: Provides operational visibility into AgentIdP's HTTP traffic, token issuance rates, agent registration rates, database latency, and Redis command latency.
Responsibility boundary: The metrics middleware (src/middleware/metrics.ts) and
the metrics registry (src/metrics/registry.ts) are observability concerns only — they
do not affect business logic. Metrics are exposed at GET /metrics via
createMetricsRouter() using metricsRegistry.metrics() from prom-client. The
/metrics endpoint is unauthenticated, intended for scraping by Prometheus only and
not exposed to the public internet.
Key metrics with labels:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received; `route` is normalised (UUIDs replaced with `:id`) |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration; buckets from 5ms to 2.5s |
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens successfully issued |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents successfully registered |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration; buckets from 1ms to 1s |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration; buckets from 0.5ms to 250ms |
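The UUID-to-`:id` normalisation mentioned for the `route` label keeps label cardinality bounded (one series per route pattern, not one per agent). A sketch, where the helper name and regex are assumptions:

```typescript
// Matches canonical 8-4-4-4-12 hex UUIDs anywhere in the path.
const UUID_RE =
  /[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/g;

// Illustrative route normaliser for metric labels.
function normaliseRoute(path: string): string {
  return path.replace(UUID_RE, ":id");
}
```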
How to add a new Counter:
1. Open `src/metrics/registry.ts`.
2. Add a new `Counter` export:

   ```typescript
   export const myNewCounter = new Counter({
     name: 'agentidp_my_new_counter_total',
     help: 'Description of what this counts.',
     labelNames: ['label_one'] as const,
     registers: [metricsRegistry],
   });
   ```

3. Import and call `myNewCounter.inc({ label_one: value })` in the service or middleware where the event occurs.
How to add a new Histogram:
1. Open `src/metrics/registry.ts`.
2. Add a new `Histogram` export with appropriate buckets:

   ```typescript
   export const myDurationHistogram = new Histogram({
     name: 'agentidp_my_operation_duration_seconds',
     help: 'Duration of my operation in seconds.',
     labelNames: ['operation'] as const,
     buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
     registers: [metricsRegistry],
   });
   ```

3. Use `const end = myDurationHistogram.startTimer({ operation: 'name' }); ... end();` around the operation being measured.
Grafana access in local Docker:

Start the monitoring overlay:

```bash
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up
```

- Prometheus: `http://localhost:9090`
- Grafana: `http://localhost:3001` — default credentials: `admin` / `agentidp`
Grafana is pre-provisioned with a Prometheus data source pointing to http://prometheus:9090
and dashboard JSON files from monitoring/grafana/dashboards/. No manual configuration
is needed after startup.