# Service Deep Dives

---

### AgentService

**Purpose**: Manages the full lifecycle of AI agent identities — registration, retrieval, updates, and decommissioning.

**Responsibility boundary**: AgentService does not handle HTTP, credential secrets, token issuance, or audit log queries. It delegates all data access to `AgentRepository` and `CredentialRepository`, and all audit logging to `AuditService`. It enforces free-tier limits and domain rules before any data is written.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `registerAgent` | `data: ICreateAgentRequest, ipAddress: string, userAgent: string` | `Promise` | Checks the free-tier 100-agent limit, enforces email uniqueness, creates the agent record, writes an `agent.created` audit event, increments `agentidp_agents_registered_total` Prometheus counter |
| `getAgentById` | `agentId: string` | `Promise` | Retrieves a single agent by UUID; throws `AgentNotFoundError` if not found |
| `listAgents` | `filters: IAgentListFilters` | `Promise` | Returns a paginated, optionally filtered list; filters include `owner`, `agentType`, `status`, `page`, `limit` |
| `updateAgent` | `agentId: string, data: IUpdateAgentRequest, ipAddress: string, userAgent: string` | `Promise` | Partially updates agent metadata; rejects updates to decommissioned agents; determines the correct audit action (`agent.updated`, `agent.suspended`, `agent.reactivated`, `agent.decommissioned`) based on status transition |
| `decommissionAgent` | `agentId: string, ipAddress: string, userAgent: string` | `Promise` | Soft-deletes the agent (sets `status = 'decommissioned'`); revokes all active credentials by calling `credentialRepository.revokeAllForAgent(agentId)` before decommissioning |

**Database / storage schema**:

- Table `agents`: `agent_id` (UUID PK), `email` (UNIQUE), `agent_type`, `version`, `capabilities` (text array), `owner`, `deployment_env`, `status`,
  `created_at`, `updated_at`.
- No Redis usage — AgentService is PostgreSQL-only.

**Error types**:

- `FreeTierLimitError` (403) — 100-agent limit reached
- `AgentAlreadyExistsError` (409) — email already registered
- `AgentNotFoundError` (404) — agent UUID not found
- `AgentAlreadyDecommissionedError` (409) — agent is already decommissioned

**Configuration**: None — AgentService reads no environment variables. The free-tier limit (`FREE_TIER_MAX_AGENTS = 100`) is a module-level constant.

---

### OAuth2Service

**Purpose**: Issues, introspects, and revokes RS256 JWT access tokens via the OAuth 2.0 Client Credentials grant.

**Responsibility boundary**: OAuth2Service does not know about HTTP or routing. It receives already-extracted values (`clientId`, `clientSecret`, `scope`) from the controller, resolves credential verification (Vault or bcrypt), enforces the 10,000 tokens/month free-tier limit, and returns a typed `ITokenResponse`. All audit writes on high-throughput paths (issue, introspect, revoke) are fire-and-forget (`void`) to keep token endpoint latency low.

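
The fire-and-forget audit write described above can be sketched roughly as follows. This is an illustration only, not the service's code: `AuditWriter`, `auditInBackground`, `issueTokenSketch`, the `'token.issued'` action name, and the `failures` array are all hypothetical names introduced for the example.

```typescript
// Hypothetical stand-in for an audit write; a real one would persist a row.
type AuditWriter = (action: string) => Promise<void>;

// Hypothetical sink for failed writes; a real service would log them instead.
const failures: string[] = [];

// Fire-and-forget: start the write, attach an error handler, and do not await it.
export function auditInBackground(logEvent: AuditWriter, action: string): void {
  void logEvent(action).catch((err: unknown) => {
    // Swallow the error so a failed audit write never rejects the token request.
    failures.push(`${action}: ${String(err)}`);
  });
}

export function issueTokenSketch(logEvent: AuditWriter): string {
  const token = 'signed.jwt.placeholder'; // assumption: JWT signing would happen here
  auditInBackground(logEvent, 'token.issued'); // hypothetical action name
  return token; // returned without waiting for the audit write to finish
}
```

The trade-off is that an audit row can be lost if the write fails after the response has been sent, which is why the sketch attaches a `.catch` handler rather than leaving the rejection unhandled.
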
**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `issueToken` | `clientId: string, clientSecret: string, scope: string, ipAddress: string, userAgent: string` | `Promise` | Verifies credentials (Vault or bcrypt), checks agent status, enforces the 10k/month limit, signs the RS256 JWT, then increments the monthly counter and writes the audit event as fire-and-forget operations |
| `introspectToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise` | Verifies JWT signature and checks Redis revocation list; always returns 200 with `active: true/false` per RFC 7662 |
| `revokeToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise` | Decodes token without verification; enforces that caller can only revoke their own tokens (`decoded.sub === callerPayload.sub`); adds JTI to Redis revocation list with TTL matching token expiry |

**Database / storage schema**:

- Redis key `revoked:{jti}` — value `1`, TTL = seconds until token expiry. Written on revocation; read on every authenticated request via `authMiddleware`.
- Redis key `monthly:tokens:{agentId}:{yyyy-mm}` — integer counter, incremented on every successful token issuance. Read to enforce the 10k/month free-tier limit.

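
The two key formats above can be expressed as small pure helpers. This is a sketch under assumptions: the function names, the constant name, and the choice of UTC for the month boundary are not taken from the codebase.

```typescript
// Key under which a revoked token's JTI is recorded.
export function revokedKey(jti: string): string {
  return `revoked:${jti}`;
}

// TTL for the revocation entry: seconds between now and the JWT `exp` claim.
// `exp` is in seconds since the epoch, `nowMs` in milliseconds.
export function secondsUntilExpiry(exp: number, nowMs: number): number {
  return Math.max(0, exp - Math.floor(nowMs / 1000));
}

// Monthly counter key. Using UTC for the month boundary is an assumption.
export function monthlyTokensKey(agentId: string, now: Date): string {
  const yyyy = now.getUTCFullYear();
  const mm = String(now.getUTCMonth() + 1).padStart(2, '0');
  return `monthly:tokens:${agentId}:${yyyy}-${mm}`;
}

// Hypothetical constant name for the 10,000 tokens/month free-tier limit.
export const FREE_TIER_MAX_MONTHLY_TOKENS = 10000;

// True once the counter has reached the limit, so the next issuance is refused.
export function isOverMonthlyLimit(currentCount: number): boolean {
  return currentCount >= FREE_TIER_MAX_MONTHLY_TOKENS;
}
```

A TTL of `0` means the token has already expired, so a caller could reasonably skip the Redis write in that case.
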
**Error types**:

- `AuthenticationError` (401) — agent not found, or no active credential matches the provided secret
- `AuthorizationError` (403) — agent is suspended or decommissioned; or caller attempts to revoke another agent's token
- `FreeTierLimitError` (403) — 10,000 tokens/month limit reached

**Configuration**:

- `JWT_PRIVATE_KEY` — PEM-encoded RSA private key, required, read at app startup in `src/app.ts`
- `JWT_PUBLIC_KEY` — PEM-encoded RSA public key, required, read at app startup and in `authMiddleware`
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, Vault is used for credential verification instead of bcrypt

---

### CredentialService

**Purpose**: Manages the full lifecycle of agent credentials — generation, listing, rotation, and revocation.

**Responsibility boundary**: CredentialService does not know about HTTP or token issuance. It enforces that credentials can only be generated for `active` agents. It delegates secret storage to either `VaultClient` (Phase 2) or bcrypt (Phase 1 fallback). The plain-text `clientSecret` is generated here, returned once in the response, and never stored or logged — only the bcrypt hash or Vault path is persisted.

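
To make the "returned once, never stored" rule concrete, here is a minimal sketch. The real `generateClientSecret()` format and length are not specified in this document, so the 32-byte hex encoding, the function names, and the injected `hash` callback (standing in for bcrypt) are assumptions.

```typescript
import { randomBytes } from 'node:crypto';

// Hypothetical shape of a secret generator: 32 random bytes, hex-encoded.
export function generateClientSecretSketch(byteLength = 32): string {
  return randomBytes(byteLength).toString('hex');
}

// `hash` stands in for bcrypt so the sketch has no third-party dependency.
export function issueCredentialSketch(hash: (secret: string) => string) {
  const clientSecret = generateClientSecretSketch();
  // Only the hash is part of the persisted record; the plain text is not.
  const persisted = { secretHash: hash(clientSecret) };
  // The plain-text secret appears only in the one-time response.
  const response = { clientSecret };
  return { persisted, response };
}
```

Keeping the persisted record and the response as separate objects makes it harder to write the plain-text secret to storage or logs by accident.
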
**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `generateCredential` | `agentId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise` | Verifies agent exists and is `active`; generates a cryptographically random secret via `generateClientSecret()`; writes to Vault (when configured) or hashes with bcrypt; returns `ICredentialWithSecret` — the only time the plain-text secret is returned |
| `listCredentials` | `agentId: string, filters: ICredentialListFilters` | `Promise` | Returns paginated credentials for an agent; `clientSecret` is never included in list responses |
| `rotateCredential` | `agentId: string, credentialId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise` | Generates a new secret for the same `credentialId`; overwrites Vault entry (new KV v2 version) or updates bcrypt hash; old secret is immediately invalidated; returns new `ICredentialWithSecret` once |
| `revokeCredential` | `agentId: string, credentialId: string, ipAddress: string, userAgent: string` | `Promise` | Sets credential `status = 'revoked'`; permanently deletes the Vault secret via `vaultClient.deleteSecret()` when Vault is configured; rejects already-revoked credentials with `CredentialAlreadyRevokedError` |

**Database / storage schema**:

- Table `credentials`: `credential_id` (UUID PK), `client_id` (= `agentId`, FK to `agents`), `secret_hash` (bcrypt hash; empty string when Vault path is set), `vault_path` (nullable — KV v2 data path), `status`, `created_at`, `expires_at` (nullable), `revoked_at` (nullable).

**Error types**:

- `AgentNotFoundError` (404) — agent UUID not found
- `CredentialError` (400) — agent is not in `active` status (code: `AGENT_NOT_ACTIVE`)
- `CredentialNotFoundError` (404) — credential not found or belongs to a different agent
- `CredentialAlreadyRevokedError` (409) — credential is already revoked

**Configuration**:

- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, new credentials are stored in Vault KV v2 instead of bcrypt. Existing bcrypt-based credentials continue to work unchanged.

---

### AuditService

**Purpose**: Creates and queries immutable audit events for compliance and observability.

**Responsibility boundary**: AuditService does not know about HTTP, tokens, or agents. It receives already-assembled event data from other services and delegates all persistence to `AuditRepository`. It enforces the 90-day free-tier retention window on all query and retrieval operations — events older than 90 days are treated as non-existent.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `logEvent` | `agentId: string, action: AuditAction, outcome: AuditOutcome, ipAddress: string, userAgent: string, metadata: Record` | `Promise` | Writes an immutable audit row to PostgreSQL. For token endpoints, callers use `void` (fire-and-forget). For CRUD operations, callers `await` this method. |
| `queryEvents` | `filters: IAuditListFilters` | `Promise` | Returns paginated, filtered audit events; enforces the 90-day retention window by computing the cutoff date and rejecting queries with `fromDate` before the cutoff; validates that `fromDate <= toDate` |
| `getEventById` | `eventId: string` | `Promise` | Retrieves a single event by UUID; throws `AuditEventNotFoundError` both for genuinely missing events and for events outside the 90-day retention window (indistinguishable by design) |

**Database / storage schema**:

- Table `audit_events`: `event_id` (UUID PK), `agent_id` (text FK to agents), `action` (text — one of the `AuditAction` union type values), `outcome` (`success` or `failure`), `ip_address` (text), `user_agent` (text), `metadata` (JSONB), `timestamp` (timestamptz, NOT NULL, indexed).
- No Redis usage — AuditService is PostgreSQL-only.

**Error types**:

- `AuditEventNotFoundError` (404) — event not found or outside retention window
- `RetentionWindowError` (400) — query `fromDate` is before the 90-day retention cutoff
- `ValidationError` (400) — `fromDate` is after `toDate`

**Configuration**: None — the retention window (`FREE_TIER_RETENTION_DAYS = 90`) is a module-level constant.

---

### VaultClient

**Purpose**: Wraps HashiCorp Vault KV v2 operations for credential secret storage and verification.

**Responsibility boundary**: VaultClient is a client adapter — it knows only about Vault API calls. It has no knowledge of business rules, HTTP, or PostgreSQL. It is injected into `CredentialService` and `OAuth2Service` via constructor injection. When `VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the bcrypt code path is used unchanged.

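
The opt-in behaviour above might look roughly like this. The sketch returns a plain config object rather than a real client, takes the environment as a parameter so it can be exercised without touching `process.env`, and treats a missing `VAULT_TOKEN` the same as a missing `VAULT_ADDR`; `VaultConfig`, `vaultConfigFromEnv`, and `verificationBackend` are hypothetical names.

```typescript
export interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

// Returns null when Vault mode is not enabled, so callers fall back to bcrypt.
export function vaultConfigFromEnv(
  env: Record<string, string | undefined>,
): VaultConfig | null {
  const addr = env['VAULT_ADDR'];
  const token = env['VAULT_TOKEN'];
  if (!addr || !token) {
    return null; // Vault not configured: the bcrypt code path is used unchanged
  }
  // VAULT_MOUNT is optional and defaults to `secret`.
  return { addr, token, mount: env['VAULT_MOUNT'] || 'secret' };
}

// Callers branch on null instead of re-reading the environment.
export function verificationBackend(config: VaultConfig | null): 'vault' | 'bcrypt' {
  return config === null ? 'bcrypt' : 'vault';
}
```
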
**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `writeSecret` | `agentId: string, credentialId: string, plainSecret: string` | `Promise` | Writes the plain-text secret to the KV v2 data path; returns the path; creates a new KV v2 version on subsequent calls (used for rotation) |
| `readSecret` | `agentId: string, credentialId: string` | `Promise` | Reads and returns the plain-text secret from Vault; throws `CredentialError` if the path is not found or the read fails |
| `verifySecret` | `agentId: string, credentialId: string, candidateSecret: string` | `Promise` | Reads the stored secret via `readSecret`, then compares using `crypto.timingSafeEqual` to prevent timing-based side-channel attacks; returns `false` on any Vault error rather than throwing |
| `deleteSecret` | `agentId: string, credentialId: string` | `Promise` | Permanently deletes all versions of a credential secret by calling the KV v2 metadata path (`DELETE {mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`) |

**KV v2 path structure**:

- Data path: `{mount}/data/agentidp/agents/{agentId}/credentials/{credentialId}`
- Metadata path (for permanent deletion): `{mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`
- Default mount: `secret` (overridable via `VAULT_MOUNT`)

**Opt-in configuration**:

- `VAULT_ADDR` — Vault server address (e.g. `http://127.0.0.1:8200`) — required to enable Vault mode
- `VAULT_TOKEN` — Vault authentication token — required to enable Vault mode
- `VAULT_MOUNT` — KV v2 mount path — optional, defaults to `secret`

**Constant-time comparison rationale**: The `verifySecret` method uses Node.js `crypto.timingSafeEqual` instead of `===` to prevent attackers from inferring the length or content of stored secrets by measuring how long the comparison takes.
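
A minimal sketch of such a comparison follows, assuming UTF-8 string secrets. `constantTimeEquals` is a hypothetical helper name, not the method's actual implementation. Note that `timingSafeEqual` throws when its two inputs differ in byte length, so the length case has to be handled explicitly.

```typescript
import { timingSafeEqual } from 'node:crypto';

// Compares two secrets without short-circuiting on the first differing byte.
export function constantTimeEquals(stored: string, candidate: string): boolean {
  const a = Buffer.from(stored, 'utf8');
  const b = Buffer.from(candidate, 'utf8');
  if (a.length !== b.length) {
    // timingSafeEqual requires equal lengths, so run a dummy comparison of the
    // stored buffer against itself to do comparable work before returning.
    timingSafeEqual(a, a);
    return false;
  }
  return timingSafeEqual(a, b);
}
```
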
When the stored and candidate secrets differ in length, a dummy `timingSafeEqual` call is still performed to eliminate the timing signal from the early-exit path.

---

### OPA Policy Engine

**Purpose**: Enforces scope-based authorisation on every protected HTTP request without requiring a code deployment to change access rules.

**Responsibility boundary**: The OPA policy engine (`src/middleware/opa.ts`) is a middleware layer — it does not know about business rules, credentials, or audit events. It receives the HTTP method, full request path, and caller scopes from `req.user`, and returns allow or deny. All policy logic lives in `policies/authz.rego` and `policies/data/scopes.json`.

**Policy file locations**:

- `policies/authz.rego` — Rego policy defining `normalise_path`, `lookup_key`, and the `allow` rule. Evaluated by the Wasm bundle when compiled; replicated in TypeScript for the fallback path.
- `policies/data/scopes.json` — JSON map of `"METHOD:/path/pattern"` → `[required_scopes]`. Loaded as data into the Wasm policy and used directly by the TypeScript fallback.
- `policies/authz.wasm` — compiled Wasm bundle (not committed to source control; built from `authz.rego` using the OPA CLI). When present, the Wasm path is used; when absent, the TypeScript fallback reads `scopes.json`.

**How `opaMiddleware` evaluates input**:

1. `createOpaMiddleware()` is called once at app startup in `src/app.ts`.
2. It attempts to load `policies/authz.wasm`. If found, `loadPolicy(wasmBuffer)` is called and `scopes.json` data is injected via `loaded.setData(parsed)`.
3. If no Wasm bundle is found, `scopes.json` is loaded into `scopesMap` as the TypeScript fallback.
4. On every request, the middleware builds an `OpaInput` object: `{ method: req.method, path: req.baseUrl + req.path, scopes: req.user.scope.split(' ') }`.
5. `evaluate(input)` checks the Wasm policy (if loaded) or applies `normalisePath` + scope-intersection logic against `scopesMap`.
   Returns `false` if neither is loaded (fail-closed).
6. If `evaluate` returns `false`, the middleware calls `next(new AuthorizationError())`.

**How to write a new policy rule**:

1. Add the new endpoint's scope requirement to `policies/data/scopes.json`:
   ```json
   "GET:/api/v1/reports": ["reports:read"]
   ```
2. Add `"reports:read"` to the `OAuthScope` union type in `src/types/index.ts`.
3. If Wasm mode is in use, recompile `authz.rego` to `authz.wasm` using the OPA CLI: `opa build policies/authz.rego -d policies/data/ -o policies/authz.wasm`.
4. Send `SIGHUP` to the running process to hot-reload: `kill -HUP `.

**How to test a policy rule**:

```bash
# Using the OPA CLI directly.
# `opa eval --input` expects a file path, so the input document is piped in via --stdin-input.
echo '{"method":"GET","path":"/api/v1/agents","scopes":["agents:read"]}' | \
  opa eval --stdin-input \
    --data policies/data/scopes.json \
    --bundle policies/ \
    'data.authz.allow'
```

Expected result: the expression evaluates to `true`. Replace method/path/scopes to test deny cases.

**Hot-reload via SIGHUP**: When `SIGHUP` is received by the Node.js process, `server.ts` calls `reloadOpaPolicy()`. This re-executes the same startup loading logic: tries to load the Wasm bundle, falls back to `scopes.json`. The in-memory `wasmPolicy` and `scopesMap` module-level variables are replaced atomically. No requests are dropped.

---

### Web Dashboard

**Purpose**: Provides a browser-based UI for human operators to manage agents, credentials, and audit logs without writing API calls directly.

**Responsibility boundary**: The dashboard is a pure client-side React SPA. It has no server-side logic. It calls the AgentIdP REST API using the `@sentryagent/idp-sdk` `TokenManager` for authentication and a typed `ApiClient` from `dashboard/src/lib/client.ts` for all API calls. It never stores the `access_token` in localStorage — only `client_id`, `client_secret`, and `baseUrl` are stored in `sessionStorage` (cleared on tab close).

**React component structure**:

```
dashboard/src/
├── main.tsx                  # React root — mounts App into #root, wraps with BrowserRouter
├── App.tsx                   # Route definitions — AuthProvider, RequireAuth, AppShell
├── lib/
│   ├── auth.tsx              # AuthContext, AuthProvider, useAuth hook, sessionStorage helpers
│   └── client.ts             # Typed ApiClient class — wraps fetch with TokenManager token injection
├── components/
│   ├── RequireAuth.tsx       # Route guard — redirects to /dashboard/login if not authenticated
│   └── layout/AppShell.tsx   # Persistent sidebar navigation + Outlet for page content
└── pages/
    ├── Login.tsx             # Login form — calls auth.login(), redirects to /dashboard/agents
    ├── Agents.tsx            # Paginated agents list with status filter and search
    ├── AgentDetail.tsx       # Single agent view — status, metadata, update, decommission actions
    ├── Credentials.tsx       # Credential list for an agent — generate, rotate, revoke actions
    ├── AuditLog.tsx          # Paginated audit log with date range and action filters
    └── Health.tsx            # /health endpoint response — PostgreSQL and Redis status display
```

**Authentication flow with sessionStorage**:

1. On `Login.tsx` form submit, `auth.login(creds)` is called.
2. `validateCredentials(creds)` creates a `TokenManager` and calls `getToken()` — if this succeeds, the credentials are valid.
3. `saveCredentials(creds)` stores `{ clientId, clientSecret, baseUrl }` in `sessionStorage` under key `agentidp_credentials`.
4. On every subsequent API call, `getClient()` in `lib/client.ts` reads credentials from `sessionStorage`, creates a `TokenManager`, and injects the current `access_token` into the `Authorization: Bearer` header. The `TokenManager` handles automatic token refresh when the token is expired.
5. `auth.logout()` calls `clearCredentials()` (removes the `sessionStorage` key) and navigates to `/dashboard/login`.

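
The storage helpers in steps 3 to 5 could be sketched as below. Unlike the helpers named above, these take the store as an explicit parameter so the example also runs outside a browser; that parameter, the `KeyValueStore` interface, and `MemoryStore` are assumptions made for illustration. In the dashboard, `window.sessionStorage` would satisfy the same interface.

```typescript
// Minimal subset of the Web Storage API used by the helpers.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface Credentials {
  clientId: string;
  clientSecret: string;
  baseUrl: string;
}

const STORAGE_KEY = 'agentidp_credentials';

export function saveCredentials(store: KeyValueStore, creds: Credentials): void {
  store.setItem(STORAGE_KEY, JSON.stringify(creds));
}

// Returns null when nothing is stored or the stored value is malformed.
export function loadCredentials(store: KeyValueStore): Credentials | null {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<Credentials>;
    if (
      typeof parsed.clientId === 'string' &&
      typeof parsed.clientSecret === 'string' &&
      typeof parsed.baseUrl === 'string'
    ) {
      return { clientId: parsed.clientId, clientSecret: parsed.clientSecret, baseUrl: parsed.baseUrl };
    }
    return null;
  } catch {
    return null;
  }
}

export function clearCredentials(store: KeyValueStore): void {
  store.removeItem(STORAGE_KEY);
}

// In-memory store standing in for sessionStorage outside a browser.
export class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.has(key) ? (this.data.get(key) as string) : null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
  removeItem(key: string): void {
    this.data.delete(key);
  }
}
```

Only the long-lived credentials are written to the store; the `access_token` stays in memory inside the `TokenManager`.
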
**Main views and their API calls**:

- **Agents** — `GET /api/v1/agents?page=N&limit=20` — paginated list with `status` filter
- **AgentDetail** — `GET /api/v1/agents/:id`, `PATCH /api/v1/agents/:id`, `DELETE /api/v1/agents/:id`
- **Credentials** — `GET /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials/:credId/rotate`, `DELETE /api/v1/agents/:id/credentials/:credId`
- **AuditLog** — `GET /api/v1/audit?page=N&limit=20&fromDate=...&toDate=...`
- **Health** — `GET /health`

**Local development**:

```bash
cd dashboard
npm install
npm run dev
# Vite dev server with HMR — dashboard available at http://localhost:5173/dashboard
```

The Vite dev server proxies `/api/` calls to the Express server at `http://localhost:3000`. The Express server must be running separately for API calls to work.

---

### Prometheus/Grafana Monitoring

**Purpose**: Provides operational visibility into AgentIdP's HTTP traffic, token issuance rates, agent registration rates, database latency, and Redis command latency.

**Responsibility boundary**: The metrics middleware (`src/middleware/metrics.ts`) and the metrics registry (`src/metrics/registry.ts`) are observability concerns only — they do not affect business logic. Metrics are exposed at `GET /metrics` via `createMetricsRouter()` using `metricsRegistry.metrics()` from `prom-client`. The `/metrics` endpoint is unauthenticated: it is intended for scraping by Prometheus only and is not meant to be exposed to the public internet.

**Key metrics with labels**:

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received; route is normalised (UUIDs replaced with `:id`) |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration; buckets from 5ms to 2.5s |
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens successfully issued |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents successfully registered |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration; buckets from 1ms to 1s |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration; buckets from 0.5ms to 250ms |

**How to add a new Counter**:

1. Open `src/metrics/registry.ts`.
2. Add a new `Counter` export:
   ```typescript
   export const myNewCounter = new Counter({
     name: 'agentidp_my_new_counter_total',
     help: 'Description of what this counts.',
     labelNames: ['label_one'] as const,
     registers: [metricsRegistry],
   });
   ```
3. Import and call `myNewCounter.inc({ label_one: value })` in the service or middleware where the event occurs.

**How to add a new Histogram**:

1. Open `src/metrics/registry.ts`.
2. Add a new `Histogram` export with appropriate buckets:
   ```typescript
   export const myDurationHistogram = new Histogram({
     name: 'agentidp_my_operation_duration_seconds',
     help: 'Duration of my operation in seconds.',
     labelNames: ['operation'] as const,
     buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
     registers: [metricsRegistry],
   });
   ```
3. Use `const end = myDurationHistogram.startTimer({ operation: 'name' }); ... end();` around the operation being measured.

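
When the measured operation can throw, the `startTimer` / `end()` pair from step 3 is easier to get right inside a small wrapper. The helper below is a sketch, not part of the codebase: `timed` and `TimerSource` are hypothetical names, and `TimerSource` is a structural stand-in for the part of a `prom-client` `Histogram` the helper relies on, so the example has no package dependency.

```typescript
type Labels = Record<string, string | number>;

// Structural stand-in for the startTimer() method of a Histogram.
interface TimerSource {
  startTimer(labels?: Labels): (labels?: Labels) => number;
}

// Runs `fn` and records its duration whether it resolves or rejects.
export async function timed<T>(
  histogram: TimerSource,
  labels: Labels,
  fn: () => Promise<T>,
): Promise<T> {
  const end = histogram.startTimer(labels);
  try {
    return await fn();
  } finally {
    end(); // observed even when fn throws, so failed operations are not missing from the histogram
  }
}
```

A call site might then read `await timed(myDurationHistogram, { operation: 'name' }, () => doWork())`, where `doWork` is whatever async operation is being measured.
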
**Grafana access in local Docker**:

Start the monitoring overlay:

```bash
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up
```

- Prometheus: `http://localhost:9090`
- Grafana: `http://localhost:3001` — default credentials: `admin` / `agentidp`

Grafana is pre-provisioned with a Prometheus data source pointing to `http://prometheus:9090` and dashboard JSON files from `monitoring/grafana/dashboards/`. No manual configuration is needed after startup.