sentryagent-idp/docs/engineering/05-services.md

# Service Deep Dives

---

### AgentService

**Purpose**: Manages the full lifecycle of AI agent identities — registration, retrieval, updates, and decommissioning.

**Responsibility boundary**: AgentService does not handle HTTP, credential secrets,
token issuance, or audit log queries. It delegates all data access to
`AgentRepository` and `CredentialRepository`, and all audit logging to `AuditService`.
It enforces free-tier limits and domain rules before any data is written.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `registerAgent` | `data: ICreateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Checks the free-tier 100-agent limit, enforces email uniqueness, creates the agent record, writes an `agent.created` audit event, increments `agentidp_agents_registered_total` Prometheus counter |
| `getAgentById` | `agentId: string` | `Promise<IAgent>` | Retrieves a single agent by UUID; throws `AgentNotFoundError` if not found |
| `listAgents` | `filters: IAgentListFilters` | `Promise<IPaginatedAgentsResponse>` | Returns a paginated, optionally filtered list; filters include `owner`, `agentType`, `status`, `page`, `limit` |
| `updateAgent` | `agentId: string, data: IUpdateAgentRequest, ipAddress: string, userAgent: string` | `Promise<IAgent>` | Partially updates agent metadata; rejects updates to decommissioned agents; determines the correct audit action (`agent.updated`, `agent.suspended`, `agent.reactivated`, `agent.decommissioned`) based on status transition |
| `decommissionAgent` | `agentId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Soft-deletes the agent (sets `status = 'decommissioned'`); revokes all active credentials by calling `credentialRepository.revokeAllForAgent(agentId)` before decommissioning |

**Database / storage schema**:
- Table `agents`: `agent_id` (UUID PK), `email` (UNIQUE), `agent_type`, `version`, `capabilities` (text array), `owner`, `deployment_env`, `status`, `created_at`, `updated_at`.
- No Redis usage — AgentService is PostgreSQL-only.

**Error types**:
- `FreeTierLimitError` (403) — 100-agent limit reached
- `AgentAlreadyExistsError` (409) — email already registered
- `AgentNotFoundError` (404) — agent UUID not found
- `AgentAlreadyDecommissionedError` (409) — agent is already decommissioned

**Configuration**: None — AgentService reads no environment variables. The free-tier limit (`FREE_TIER_MAX_AGENTS = 100`) is a module-level constant.

---

### OAuth2Service

**Purpose**: Issues, introspects, and revokes RS256 JWT access tokens via the OAuth 2.0 Client Credentials grant.

**Responsibility boundary**: OAuth2Service does not know about HTTP or routing. It
receives already-extracted values (`clientId`, `clientSecret`, `scope`) from the
controller, resolves credential verification (Vault or bcrypt), enforces the 10,000
tokens/month free-tier limit, and returns a typed `ITokenResponse`. All audit writes
on high-throughput paths (issue, introspect, revoke) are fire-and-forget (`void`) to
keep token endpoint latency low.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `issueToken` | `clientId: string, clientSecret: string, scope: string, ipAddress: string, userAgent: string` | `Promise<ITokenResponse>` | Verifies credentials (Vault or bcrypt), checks agent status, enforces 10k/month limit, signs RS256 JWT, increments monthly counter and audit event as fire-and-forget |
| `introspectToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<IIntrospectResponse>` | Verifies JWT signature and checks Redis revocation list; always returns 200 with `active: true/false` per RFC 7662 |
| `revokeToken` | `token: string, callerPayload: ITokenPayload, ipAddress: string, userAgent: string` | `Promise<void>` | Decodes token without verification; enforces that caller can only revoke their own tokens (`decoded.sub === callerPayload.sub`); adds JTI to Redis revocation list with TTL matching token expiry |

**Database / storage schema**:
- Redis key `revoked:{jti}` — value `1`, TTL = seconds until token expiry. Written on revocation; read on every authenticated request via `authMiddleware`.
- Redis key `monthly:tokens:{agentId}:{yyyy-mm}` — integer counter, incremented on every successful token issuance. Read to enforce the 10k/month free-tier limit.

**Error types**:
- `AuthenticationError` (401) — agent not found, or no active credential matches the provided secret
- `AuthorizationError` (403) — agent is suspended or decommissioned; or caller attempts to revoke another agent's token
- `FreeTierLimitError` (403) — 10,000 tokens/month limit reached

**Configuration**:
- `JWT_PRIVATE_KEY` — PEM-encoded RSA private key, required, read at app startup in `src/app.ts`
- `JWT_PUBLIC_KEY` — PEM-encoded RSA public key, required, read at app startup and in `authMiddleware`
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, Vault is used for credential verification instead of bcrypt

---

### CredentialService

**Purpose**: Manages the full lifecycle of agent credentials — generation, listing, rotation, and revocation.

**Responsibility boundary**: CredentialService does not know about HTTP or token
issuance. It enforces that credentials can only be generated for `active` agents. It
delegates secret storage to either `VaultClient` (Phase 2) or bcrypt (Phase 1 fallback).
The plain-text `clientSecret` is generated here, returned once in the response, and
never stored or logged — only the bcrypt hash or Vault path is persisted.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `generateCredential` | `agentId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Verifies agent exists and is `active`; generates a cryptographically random secret via `generateClientSecret()`; writes to Vault (when configured) or hashes with bcrypt; returns `ICredentialWithSecret` — the only time the plain-text secret is returned |
| `listCredentials` | `agentId: string, filters: ICredentialListFilters` | `Promise<IPaginatedCredentialsResponse>` | Returns paginated credentials for an agent; `clientSecret` is never included in list responses |
| `rotateCredential` | `agentId: string, credentialId: string, data: IGenerateCredentialRequest, ipAddress: string, userAgent: string` | `Promise<ICredentialWithSecret>` | Generates a new secret for the same `credentialId`; overwrites Vault entry (new KV v2 version) or updates bcrypt hash; old secret is immediately invalidated; returns new `ICredentialWithSecret` once |
| `revokeCredential` | `agentId: string, credentialId: string, ipAddress: string, userAgent: string` | `Promise<void>` | Sets credential `status = 'revoked'`; permanently deletes the Vault secret via `vaultClient.deleteSecret()` when Vault is configured; idempotent rejection of already-revoked credentials with `CredentialAlreadyRevokedError` |

**Database / storage schema**:
- Table `credentials`: `credential_id` (UUID PK), `client_id` (= `agentId`, FK to `agents`), `secret_hash` (bcrypt hash; empty string when Vault path is set), `vault_path` (nullable — KV v2 data path), `status`, `created_at`, `expires_at` (nullable), `revoked_at` (nullable).

**Error types**:
- `AgentNotFoundError` (404) — agent UUID not found
- `CredentialError` (400) — agent is not in `active` status (code: `AGENT_NOT_ACTIVE`)
- `CredentialNotFoundError` (404) — credential not found or belongs to a different agent
- `CredentialAlreadyRevokedError` (409) — credential is already revoked

**Configuration**:
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — optional; when set, new credentials are stored in Vault KV v2 instead of bcrypt. Existing bcrypt-based credentials continue to work unchanged.

---

### AuditService

**Purpose**: Creates and queries immutable audit events for compliance and observability.

**Responsibility boundary**: AuditService does not know about HTTP, tokens, or agents.
It receives already-assembled event data from other services and delegates all
persistence to `AuditRepository`. It enforces the 90-day free-tier retention window
on all query and retrieval operations — events older than 90 days are treated as
non-existent.

**Public interface** (key methods):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `logEvent` | `agentId: string, action: AuditAction, outcome: AuditOutcome, ipAddress: string, userAgent: string, metadata: Record<string, unknown>` | `Promise<IAuditEvent>` | Writes an immutable audit row to PostgreSQL. For token endpoints, callers use `void` (fire-and-forget). For CRUD operations, callers `await` this method. |
| `queryEvents` | `filters: IAuditListFilters` | `Promise<IPaginatedAuditEventsResponse>` | Returns paginated, filtered audit events; enforces the 90-day retention window by computing the cutoff date and rejecting queries with `fromDate` before the cutoff; validates that `fromDate <= toDate` |
| `getEventById` | `eventId: string` | `Promise<IAuditEvent>` | Retrieves a single event by UUID; returns `AuditEventNotFoundError` for both genuinely missing events and events outside the 90-day retention window (indistinguishable by design) |

**Database / storage schema**:
- Table `audit_events`: `event_id` (UUID PK), `agent_id` (text FK to agents), `action` (text — one of the `AuditAction` union type values), `outcome` (`success` or `failure`), `ip_address` (text), `user_agent` (text), `metadata` (JSONB), `timestamp` (timestamptz, NOT NULL, indexed).
- No Redis usage — AuditService is PostgreSQL-only.

**Error types**:
- `AuditEventNotFoundError` (404) — event not found or outside retention window
- `RetentionWindowError` (400) — query `fromDate` is before the 90-day retention cutoff
- `ValidationError` (400) — `fromDate` is after `toDate`

**Configuration**: None — the retention window (`FREE_TIER_RETENTION_DAYS = 90`) is a module-level constant.

---

### VaultClient

**Purpose**: Wraps HashiCorp Vault KV v2 operations for credential secret storage and verification.

**Responsibility boundary**: VaultClient is a client adapter — it knows only about
Vault API calls. It has no knowledge of business rules, HTTP, or PostgreSQL. It is
injected into `CredentialService` and `OAuth2Service` via constructor injection. When
`VAULT_ADDR` is not set, `createVaultClientFromEnv()` returns `null` and the bcrypt
code path is used unchanged.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `writeSecret` | `agentId: string, credentialId: string, plainSecret: string` | `Promise<string>` | Writes the plain-text secret to the KV v2 data path; returns the path; creates a new KV v2 version on subsequent calls (used for rotation) |
| `readSecret` | `agentId: string, credentialId: string` | `Promise<string>` | Reads and returns the plain-text secret from Vault; throws `CredentialError` if the path is not found or the read fails |
| `verifySecret` | `agentId: string, credentialId: string, candidateSecret: string` | `Promise<boolean>` | Reads the stored secret via `readSecret`, then compares using `crypto.timingSafeEqual` to prevent timing-based side-channel attacks; returns `false` on any Vault error rather than throwing |
| `deleteSecret` | `agentId: string, credentialId: string` | `Promise<void>` | Permanently deletes all versions of a credential secret by calling the KV v2 metadata path (`DELETE {mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`) |

**KV v2 path structure**:
- Data path: `{mount}/data/agentidp/agents/{agentId}/credentials/{credentialId}`
- Metadata path (for permanent deletion): `{mount}/metadata/agentidp/agents/{agentId}/credentials/{credentialId}`
- Default mount: `secret` (overridable via `VAULT_MOUNT`)

**Opt-in configuration**:
- `VAULT_ADDR` — Vault server address (e.g. `http://127.0.0.1:8200`) — required to enable Vault mode
- `VAULT_TOKEN` — Vault authentication token — required to enable Vault mode
- `VAULT_MOUNT` — KV v2 mount path — optional, defaults to `secret`

**Constant-time comparison rationale**: The `verifySecret` method uses Node.js
`crypto.timingSafeEqual` instead of `===` to prevent attackers from inferring the
length or content of stored secrets by measuring how long the comparison takes. When
the stored and candidate secrets differ in length, a dummy `timingSafeEqual` call is
still performed to eliminate the timing signal from the early-exit path.

---

### OPA Policy Engine

**Purpose**: Enforces scope-based authorisation on every protected HTTP request without requiring a code deployment to change access rules.

**Responsibility boundary**: The OPA policy engine (`src/middleware/opa.ts`) is a
middleware layer — it does not know about business rules, credentials, or audit events.
It receives the HTTP method, full request path, and caller scopes from `req.user`, and
returns allow or deny. All policy logic lives in `policies/authz.rego` and
`policies/data/scopes.json`.

**Policy file locations**:
- `policies/authz.rego` — Rego policy defining `normalise_path`, `lookup_key`, and the `allow` rule. Evaluated by the Wasm bundle when compiled; replicated in TypeScript for the fallback path.
- `policies/data/scopes.json` — JSON map of `"METHOD:/path/pattern"` → `[required_scopes]`. Loaded as data into the Wasm policy and used directly by the TypeScript fallback.
- `policies/authz.wasm` — compiled Wasm bundle (not committed to source control; built from `authz.rego` using the OPA CLI). When present, the Wasm path is used; when absent, the TypeScript fallback reads `scopes.json`.

**How `opaMiddleware` evaluates input**:

1. `createOpaMiddleware()` is called once at app startup in `src/app.ts`.
2. It attempts to load `policies/authz.wasm`. If found, `loadPolicy(wasmBuffer)` is called and `scopes.json` data is injected via `loaded.setData(parsed)`.
3. If no Wasm bundle is found, `scopes.json` is loaded into `scopesMap` as the TypeScript fallback.
4. On every request, the middleware builds an `OpaInput` object: `{ method: req.method, path: req.baseUrl + req.path, scopes: req.user.scope.split(' ') }`.
5. `evaluate(input)` checks the Wasm policy (if loaded) or applies `normalisePath` + scope-intersection logic against `scopesMap`. Returns `false` if neither is loaded (fail-closed).
6. If `evaluate` returns `false`, the middleware calls `next(new AuthorizationError())`.

**How to write a new policy rule**:

1. Add the new endpoint's scope requirement to `policies/data/scopes.json`:
   ```json
   "GET:/api/v1/reports": ["reports:read"]
   ```
2. Add `"reports:read"` to the `OAuthScope` union type in `src/types/index.ts`.
3. If Wasm mode is in use, recompile `authz.rego` to `authz.wasm` using the OPA CLI: `opa build policies/authz.rego -d policies/data/ -o policies/authz.wasm`.
4. Send `SIGHUP` to the running process to hot-reload: `kill -HUP <pid>`.

**How to test a policy rule**:
```bash
# Using the OPA CLI directly
opa eval --data policies/data/scopes.json \
         --input '{"method":"GET","path":"/api/v1/agents","scopes":["agents:read"]}' \
         --bundle policies/ \
         'data.authz.allow'
```
Expected output: `true`. Replace method/path/scopes to test deny cases.

**Hot-reload via SIGHUP**: When `SIGHUP` is received by the Node.js process,
`server.ts` calls `reloadOpaPolicy()`. This re-executes the same startup loading logic:
tries to load the Wasm bundle, falls back to `scopes.json`. The in-memory `wasmPolicy`
and `scopesMap` module-level variables are replaced atomically. No requests are dropped.

---

### Web Dashboard

**Purpose**: Provides a browser-based UI for human operators to manage agents, credentials, and audit logs without writing API calls directly.

**Responsibility boundary**: The dashboard is a pure client-side React SPA. It has no
server-side logic. It calls the AgentIdP REST API using the `@sentryagent/idp-sdk`
`TokenManager` for authentication and a typed `ApiClient` from `dashboard/src/lib/client.ts`
for all API calls. It never stores the `access_token` in localStorage — only
`client_id`, `client_secret`, and `baseUrl` are stored in `sessionStorage` (cleared
on tab close).

**React component structure**:

```
dashboard/src/
├── main.tsx               # React root — mounts App into #root, wraps with BrowserRouter
├── App.tsx                # Route definitions — AuthProvider, RequireAuth, AppShell
├── lib/
│   ├── auth.tsx           # AuthContext, AuthProvider, useAuth hook, sessionStorage helpers
│   └── client.ts          # Typed ApiClient class — wraps fetch with TokenManager token injection
├── components/
│   ├── RequireAuth.tsx    # Route guard — redirects to /dashboard/login if not authenticated
│   └── layout/AppShell.tsx # Persistent sidebar navigation + Outlet for page content
└── pages/
    ├── Login.tsx          # Login form — calls auth.login(), redirects to /dashboard/agents
    ├── Agents.tsx         # Paginated agents list with status filter and search
    ├── AgentDetail.tsx    # Single agent view — status, metadata, update, decommission actions
    ├── Credentials.tsx    # Credential list for an agent — generate, rotate, revoke actions
    ├── AuditLog.tsx       # Paginated audit log with date range and action filters
    └── Health.tsx         # /health endpoint response — PostgreSQL and Redis status display
```

**Authentication flow with sessionStorage**:
1. On `Login.tsx` form submit, `auth.login(creds)` is called.
2. `validateCredentials(creds)` creates a `TokenManager` and calls `getToken()` — if this succeeds, the credentials are valid.
3. `saveCredentials(creds)` stores `{ clientId, clientSecret, baseUrl }` in `sessionStorage` under key `agentidp_credentials`.
4. On every subsequent API call, `getClient()` in `lib/client.ts` reads credentials from `sessionStorage`, creates a `TokenManager`, and injects the current `access_token` into the `Authorization: Bearer` header. The `TokenManager` handles automatic token refresh when the token is expired.
5. `auth.logout()` calls `clearCredentials()` (removes the `sessionStorage` key) and navigates to `/dashboard/login`.

**Main views and their API calls**:
- **Agents** — `GET /api/v1/agents?page=N&limit=20` — paginated list with `status` filter
- **AgentDetail** — `GET /api/v1/agents/:id`, `PATCH /api/v1/agents/:id`, `DELETE /api/v1/agents/:id`
- **Credentials** — `GET /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials`, `POST /api/v1/agents/:id/credentials/:credId/rotate`, `DELETE /api/v1/agents/:id/credentials/:credId`
- **AuditLog** — `GET /api/v1/audit?page=N&limit=20&fromDate=...&toDate=...`
- **Health** — `GET /health`

**Local development**:
```bash
cd dashboard
npm install
npm run dev    # Vite dev server with HMR — dashboard available at http://localhost:5173/dashboard
```
The Vite dev server proxies `/api/` calls to the Express server at `http://localhost:3000`.
The Express server must be running separately for API calls to work.

---

### Prometheus/Grafana Monitoring

**Purpose**: Provides operational visibility into AgentIdP's HTTP traffic, token issuance rates, agent registration rates, database latency, and Redis command latency.

**Responsibility boundary**: The metrics middleware (`src/middleware/metrics.ts`) and
the metrics registry (`src/metrics/registry.ts`) are observability concerns only — they
do not affect business logic. Metrics are exposed at `GET /metrics` via
`createMetricsRouter()` using `metricsRegistry.metrics()` from `prom-client`. The
`/metrics` endpoint is unauthenticated, intended for scraping by Prometheus only and
not exposed to the public internet.

**Key metrics with labels**:

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received; route is normalised (UUIDs replaced with `:id`) |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration; buckets from 5ms to 2.5s |
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens successfully issued |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents successfully registered |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration; buckets from 1ms to 1s |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration; buckets from 0.5ms to 250ms |

**How to add a new Counter**:
1. Open `src/metrics/registry.ts`.
2. Add a new `Counter` export:
   ```typescript
   export const myNewCounter = new Counter({
     name: 'agentidp_my_new_counter_total',
     help: 'Description of what this counts.',
     labelNames: ['label_one'] as const,
     registers: [metricsRegistry],
   });
   ```
3. Import and call `myNewCounter.inc({ label_one: value })` in the service or middleware where the event occurs.

**How to add a new Histogram**:
1. Open `src/metrics/registry.ts`.
2. Add a new `Histogram` export with appropriate buckets:
   ```typescript
   export const myDurationHistogram = new Histogram({
     name: 'agentidp_my_operation_duration_seconds',
     help: 'Duration of my operation in seconds.',
     labelNames: ['operation'] as const,
     buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
     registers: [metricsRegistry],
   });
   ```
3. Use `const end = myDurationHistogram.startTimer({ operation: 'name' }); ... end();` around the operation being measured.

**Grafana access in local Docker**:

Start the monitoring overlay:
```bash
docker compose -f compose.yaml -f compose.monitoring.yaml up
```
- Prometheus: `http://localhost:9090`
- Grafana: `http://localhost:3001` — credentials: `admin` / `<GF_ADMIN_PASSWORD from .env>`

Grafana is pre-provisioned with a Prometheus data source pointing to `http://prometheus:9090`
and dashboard JSON files from `monitoring/grafana/dashboards/`. No manual configuration
is needed after startup.

---

### AnalyticsService

**Purpose**: Records daily aggregated analytics events (token issuances, agent activity) and exposes query methods for token trends, agent activity heatmaps, and per-agent usage summaries. All query methods scope results strictly to the supplied `tenantId`. The `recordEvent` method is fire-and-forget — it catches all errors internally and never propagates them to the caller, so analytics writes never block primary request paths.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `recordEvent` | `tenantId: string, metricType: string` | `Promise<void>` | Upserts a daily counter row in `analytics_events` via `INSERT ... ON CONFLICT DO UPDATE SET count = count + 1`. Catches and swallows all errors; safe to call with `void` on hot paths. |
| `getTokenTrend` | `tenantId: string, days: number` | `Promise<ITokenTrendEntry[]>` | Returns daily token issuance counts for the last N days (clamped to 90). Uses `generate_series` + `LEFT JOIN` so that days with no events appear as `count: 0`. Results sorted ascending by date. |
| `getAgentActivity` | `tenantId: string` | `Promise<IAgentActivityEntry[]>` | Returns agent activity bucketed by day-of-week (0=Sun…6=Sat) and hour-of-day for the last 30 days. Reads only rows whose `metric_type` matches the pattern `agent:<agentId>:<metricType>`. |
| `getAgentUsageSummary` | `tenantId: string` | `Promise<IAgentUsageSummaryEntry[]>` | Returns per-agent token issuance totals for the current calendar month, joined with the agent name (`owner` field). Sorted descending by `token_count`. Excludes decommissioned agents. |

**Dependencies**: PostgreSQL connection pool (`Pool` from `pg`). No Redis usage.

**Configuration**: None. `MAX_TREND_DAYS = 90` is a module-level constant.

**DB tables**:
- `analytics_events`: `organization_id` (UUID FK to `organizations`), `date` (DATE), `metric_type` (text — e.g. `'token_issued'`, `'agent:<agentId>:token_issued'`), `count` (integer). Unique constraint on `(organization_id, date, metric_type)`.
- `agents`: read in `getAgentUsageSummary` to join `owner` and filter by `organization_id`.

---

### TierService

**Purpose**: Single authority for all subscription tier business logic — fetches current tier and live usage, initiates Stripe Checkout sessions for upgrades, applies confirmed upgrades to the `organizations` table, and enforces per-tier agent count limits. Controllers and middleware delegate all tier decisions to this service; no tier logic lives elsewhere.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `getStatus` | `orgId: string` | `Promise<ITierStatus>` | Returns current `tier`, per-tier `limits` (from `TIER_CONFIG`), live `usage` (Redis counters + DB agent count), and `resetAt` (ISO 8601 next UTC midnight). Falls back to `0` for Redis counters when Redis is unavailable. |
| `initiateUpgrade` | `orgId: string, targetTier: TierName` | `Promise<IUpgradeInitiation>` | Validates that `targetTier` is strictly higher rank than current tier. Creates a Stripe Checkout Session with `mode: 'subscription'`, `metadata: { orgId, targetTier }`, and the price ID from `STRIPE_PRICE_ID_<TIER>` env var. Returns `{ checkoutUrl }`. |
| `applyUpgrade` | `orgId: string, tier: TierName` | `Promise<void>` | Sets `organizations.tier` and `organizations.tier_updated_at = NOW()`. Called by the Stripe webhook handler after `checkout.session.completed`. |
| `fetchTier` | `orgId: string` | `Promise<TierName>` | Queries `organizations.tier` for the given org. Returns `'free'` as a safe default when no row is found or the stored value is not a valid `TierName`. |
| `enforceAgentLimit` | `orgId: string, tier: TierName` | `Promise<void>` | Counts non-decommissioned agents for the org and throws `TierLimitError` when count is at or over `TIER_CONFIG[tier].maxAgents`. No-op for Enterprise (infinite limit). Called by `AgentService` before creating a new agent. |

**Dependencies**: PostgreSQL (`Pool`), Redis (`RedisClientType`), Stripe client (`Stripe`). Imports `TIER_CONFIG` and `TIER_RANK` from `src/config/tiers.ts`.

**Configuration**:
- `STRIPE_PRICE_ID_PRO` — Stripe price ID for the Pro tier
- `STRIPE_PRICE_ID_ENTERPRISE` — Stripe price ID for the Enterprise tier
- `STRIPE_PRICE_ID` — Fallback Stripe price ID when tier-specific vars are not set
- `STRIPE_SUCCESS_URL` — Redirect URL on successful checkout (default: `APP_BASE_URL/dashboard?billing=success`)
- `STRIPE_CANCEL_URL` — Redirect URL when checkout is cancelled (default: `APP_BASE_URL/dashboard?billing=cancel`)
- `APP_BASE_URL` — Base URL for redirect URL construction (default: `http://localhost:3000`)

**Redis keys**:
- `rate:tier:calls:<orgId>` — integer, daily API call counter; TTL set at next UTC midnight. Read in `getStatus`.
- `rate:tier:tokens:<orgId>` — integer, daily token issuance counter; same TTL. Read in `getStatus`.

**DB tables**:
- `organizations`: `organization_id` (UUID PK), `tier` (text — `'free'|'pro'|'enterprise'`), `tier_updated_at` (timestamptz). Read in `fetchTier`; written in `applyUpgrade`.
- `agents`: read in `enforceAgentLimit` and `getStatus` to count non-decommissioned agents per org.

**Error types**:
- `ValidationError` (400) — target tier is not higher than current tier
- `TierLimitError` (429) — agent count limit reached for the current tier

---

### ComplianceService

**Purpose**: Generates AGNTCY-standard compliance reports and exports agent cards for a tenant. Reports cover two sections: `agent-identity` (DID presence and credential expiry checks) and `audit-trail` (cryptographic hash chain verification). Reports are cached in Redis for 5 minutes to avoid repeated expensive DB queries. Agent card export returns all active agents in AGNTCY-standard JSON format.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `generateReport` | `tenantId: string` | `Promise<IComplianceReport>` | Attempts to read `compliance:report:<tenantId>` from Redis; if found, returns it with `from_cache: true`. Otherwise builds the report by running `buildAgentIdentitySection` and `buildAuditTrailSection` in parallel, rolls up the overall status (fail > warn > pass), caches the result for 300 seconds, and returns it. |
| `exportAgentCards` | `tenantId: string` | `Promise<IAgentCard[]>` | Queries all non-decommissioned agents for the tenant and maps each to an AGNTCY agent card with `id` (DID or agent UUID), `name`, `capabilities`, `endpoint`, `created_at`, and `agntcy_schema_version: '1.0'`. |

**Dependencies**: PostgreSQL (`Pool`), Redis (`RedisClientType`). Internally instantiates `AuditVerificationService` for hash chain verification.

**Configuration**: None. `CACHE_TTL_SECONDS = 300` and `AGNTCY_SCHEMA_VERSION = '1.0'` are module-level constants.

**Redis keys**:
- `compliance:report:<tenantId>` — JSON-serialised `IComplianceReport`, TTL 300 seconds. Written by `generateReport`; read on every call within the cache window.

**DB tables**:
- `agents`: queried in both `buildAgentIdentitySection` (checks DID presence) and `exportAgentCards`.
- `credentials`: queried in `buildAgentIdentitySection` to check active credential expiry per agent.
- `audit_events`: read via `AuditVerificationService` in `buildAuditTrailSection` to verify hash chain integrity.

**Error types**: None thrown directly. Internal errors in section builders produce `status: 'fail'` sections rather than exceptions.

**Report structure**:
- `agent-identity` section: `fail` when any active agent is missing a DID or has expired credentials; `warn` when any credential expires within 7 days; `pass` otherwise.
- `audit-trail` section: `fail` when `AuditVerificationService.verifyChain()` returns `verified: false`; `pass` otherwise.

---

### FederationService

**Purpose**: Manages trusted federation partners and cross-IdP JWT token verification. At partner registration, the partner's JWKS endpoint is validated and the keys are cached in Redis. At token verification, the service fetches (or reuses cached) partner JWKS, verifies the JWT signature and standard claims, enforces the partner's `allowed_organizations` filter, and rejects tokens from suspended or expired partners.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `registerPartner` | `req: ICreatePartnerRequest` | `Promise<IFederationPartner>` | Validates the `jwks_uri` is reachable (5-second timeout) and returns valid JWKS. Inserts the partner row into `federation_partners`. Caches the JWKS in Redis under `federation:jwks:<issuer>`. |
| `listPartners` | _(none)_ | `Promise<IFederationPartner[]>` | Updates any partners past `expires_at` to `status = 'expired'` before returning all rows ordered by `created_at DESC`. |
| `getPartner` | `id: string` | `Promise<IFederationPartner>` | Applies the same expiry update, then returns the partner row. Throws `FederationPartnerNotFoundError` (404) when not found. |
| `updatePartner` | `id: string, req: IUpdatePartnerRequest` | `Promise<IFederationPartner>` | Applies a partial update. When `jwks_uri` changes, invalidates the old issuer's JWKS cache entry (`DEL federation:jwks:<oldIssuer>`). |
| `deletePartner` | `id: string` | `Promise<void>` | Deletes the partner row and invalidates the JWKS cache. |
| `verifyFederatedToken` | `req: IFederationVerifyRequest` | `Promise<IFederationVerifyResult>` | Decodes token header/payload without verification, rejects `alg:none`, looks up partner by `iss`, checks partner status and expiry, fetches JWKS (cache-first), finds matching key by `kid`, converts JWK to PEM, verifies signature via `jsonwebtoken.verify` (RS256 or ES256), enforces `allowed_organizations` filter. Returns `{ valid, issuer, subject, organization_id, claims }`. |

**Dependencies**: PostgreSQL (`Pool`), Redis (`RedisClientType`). Uses `jsonwebtoken` for JWT decoding/verification and Node.js `crypto.createPublicKey` for JWK-to-PEM conversion.

**Configuration**:
- `FEDERATION_JWKS_CACHE_TTL_SECONDS` — TTL for cached partner JWKS in Redis (default: `3600`)

**Redis keys**:
- `federation:jwks:<issuer>` — JSON-serialised `IJWKSKey[]`, TTL from `FEDERATION_JWKS_CACHE_TTL_SECONDS`. Written on partner registration and on cache miss during token verification; deleted when a partner is updated (JWKS URI changed) or deleted.

**DB tables**:
- `federation_partners`: `id` (UUID PK), `name` (text), `issuer` (text — the IdP's issuer URL), `jwks_uri` (text), `allowed_organizations` (text[] — empty means all orgs allowed), `status` (`active|suspended|expired`), `created_at`, `updated_at`, `expires_at` (nullable timestamptz).

**Error types**:
- `FederationPartnerError` (400) — JWKS endpoint unreachable or returns invalid JWKS
- `FederationPartnerNotFoundError` (404) — partner UUID not found
- `FederationVerificationError` (401) — token malformed, alg:none, unknown issuer, partner suspended/expired, signature invalid, org not in allow list

---

### DIDService

**Purpose**: Manages W3C DID Core 1.0 document generation, EC P-256 key pair creation, and AGNTCY agent card export. Generates per-agent `did:web` identifiers, stores private keys in HashiCorp Vault (or records a `dev:no-vault` marker in dev mode), and caches DID documents in Redis. Builds both an instance-level DID document (for AgentIdP itself) and per-agent DID documents with AGNTCY extension properties.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `generateDIDForAgent` | `agentId: string, organizationId: string` | `Promise<{ did: string; publicKeyJwk: IPublicKeyJwk }>` | Generates an EC P-256 key pair. Stores the private key PEM in Vault KV v2 at `{mount}/data/agentidp/agents/{agentId}/did-key`. Encrypts the vault path via `EncryptionService` (when configured). Inserts a row into `agent_did_keys`. Updates `agents.did` and `agents.did_created_at`. Returns the `did:web` identifier and public key JWK. |
| `buildInstanceDIDDocument` | _(none)_ | `Promise<IDIDDocument>` | Builds the root instance DID document for AgentIdP (format: `did:web:{DID_WEB_DOMAIN}`). Cached in Redis under `did:doc:instance`. |
| `buildAgentDIDDocument` | `agentId: string` | `Promise<IAgentDIDDocumentResult>` | Builds a per-agent DID document (format: `did:web:{DID_WEB_DOMAIN}:agents:{agentId}`). Decommissioned agents get a deactivated document with an `AgentStatus: decommissioned` service entry. Cached in Redis under `did:doc:{agentId}` for active agents only. Throws `AgentNotFoundError` if the agent does not exist. |
| `buildResolutionResult` | `agentId: string` | `Promise<IDIDResolutionResult>` | Wraps `buildAgentDIDDocument` with W3C DID Resolution metadata (`didDocumentMetadata`, `didResolutionMetadata`). |
| `buildAgentCard` | `agentId: string` | `Promise<IAgentCard>` | Returns an AGNTCY-format agent card with `did`, `name` (agent email), `agentType`, `capabilities`, `owner`, `version`, `deploymentEnv`, `identityProvider`, and `issuedAt`. |

**Dependencies**: PostgreSQL (`Pool`), Redis (`RedisClientType`), optional `VaultClient`, optional `EncryptionService`. Uses `node-vault` directly for DID private key storage.

**Configuration**:
- `DID_WEB_DOMAIN` — required; the domain for `did:web` DID construction (e.g. `idp.sentryagent.ai`)
- `DID_DOCUMENT_CACHE_TTL_SECONDS` — Redis cache TTL for DID documents (default: `300`)
- `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT` — when set, private keys are stored in Vault; otherwise `dev:no-vault` marker is used

**Redis keys**:
- `did:doc:instance` — JSON-serialised instance `IDIDDocument`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`
- `did:doc:<agentId>` — JSON-serialised per-agent `IDIDDocument`, same TTL. Not cached for decommissioned agents.

**DB tables**:
- `agents`: `did` (text — `did:web:...`), `did_created_at` (timestamptz). Written by `generateDIDForAgent`; read in all document-building methods.
- `agent_did_keys`: `key_id` (UUID PK), `agent_id` (UUID FK), `organization_id` (UUID FK), `public_key_jwk` (JSONB), `vault_key_path` (text — Vault KV v2 path or `dev:no-vault`), `key_type` (`'EC'`), `curve` (`'P-256'`), `created_at`. Written by `generateDIDForAgent`.

**Error types**:
- `AgentNotFoundError` (404) — agent UUID not found in `buildAgentDIDDocument`, `buildResolutionResult`, `buildAgentCard`

---

### WebhookService

**Purpose**: Manages webhook subscriptions and their delivery history for a tenant organisation. HMAC signing secrets are stored in HashiCorp Vault KV v2 (when configured) or bcrypt-hashed in PostgreSQL in local mode. The raw secret is only returned once at subscription creation time. `vault_secret_path` is encrypted at rest via `EncryptionService` (AES-256-CBC) before being written to PostgreSQL (SOC 2 CC6.1 compliance).

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `createSubscription` | `orgId: string, req: ICreateWebhookRequest` | `Promise<IWebhookSubscription & { secret: string }>` | Generates a 32-byte random hex HMAC secret. Stores in Vault at `secret/data/agentidp/webhooks/{orgId}/{id}/secret` (Vault mode) or bcrypt-hashes and stores in `secret_hash` (local mode). Encrypts `vault_secret_path` via `EncryptionService`. Returns the subscription including the one-time `secret`. Validates URL must use `https://` and events array must be non-empty. |
| `listSubscriptions` | `orgId: string` | `Promise<IWebhookSubscription[]>` | Returns all subscriptions for the org, ordered by `created_at DESC`. No secret fields are included. |
| `getSubscription` | `id: string, orgId: string` | `Promise<IWebhookSubscription>` | Returns a single subscription. Verifies org ownership. |
| `updateSubscription` | `id: string, orgId: string, req: IUpdateWebhookRequest` | `Promise<IWebhookSubscription>` | Partially updates `name`, `url`, `events`, or `active` fields. Validates `https://` if URL is changing. |
| `deleteSubscription` | `id: string, orgId: string` | `Promise<void>` | Permanently deletes the subscription and all deliveries (via PostgreSQL CASCADE). |
| `getSubscriptionSecret` | `subscriptionId: string, orgId: string` | `Promise<string>` | Retrieves the raw HMAC secret from Vault (Vault mode only). Throws `WebhookValidationError` in local mode since the secret cannot be recovered after creation. |
| `listDeliveries` | `subscriptionId: string, orgId: string, limit: number, offset: number` | `Promise<IPaginatedDeliveriesResponse>` | Returns paginated delivery records for a subscription. Verifies org ownership before querying. |

**Dependencies**: PostgreSQL (`Pool`), optional `VaultClient`, Redis (`RedisClientType` — reserved for future caching), optional `EncryptionService`.

**Configuration**: Inherits Vault configuration from `VaultClient` (`VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_MOUNT`). `EncryptionService` requires `ENCRYPTION_KEY` env var (see `EncryptionService` docs).

**DB tables**:
- `webhook_subscriptions`: `id` (UUID PK), `organization_id` (UUID FK), `name` (text), `url` (text — https only), `events` (JSONB — `WebhookEventType[]`), `secret_hash` (text — bcrypt hash in local mode, `'vault'` in Vault mode), `vault_secret_path` (text — encrypted Vault path or `'local'`), `active` (boolean), `failure_count` (integer), `created_at`, `updated_at`.
- `webhook_deliveries`: `id` (UUID PK), `subscription_id` (UUID FK), `event_type` (text), `payload` (JSONB), `status` (`pending|delivered|failed|dead_letter`), `http_status_code` (integer nullable), `attempt_count` (integer), `next_retry_at` (timestamptz nullable), `delivered_at` (timestamptz nullable), `created_at`, `updated_at`. Cascades on subscription delete.

**Error types**:
- `WebhookNotFoundError` (404) — subscription not found or belongs to another org
- `WebhookValidationError` (400) — invalid URL scheme, empty events array, or secret not recoverable in local mode

---

### BillingService

**Purpose**: Manages Stripe billing integration — creates Checkout Sessions for tenant subscriptions, processes incoming Stripe webhook events (subscription lifecycle and checkout completion), and retrieves current subscription status. When a `checkout.session.completed` event carries `{ orgId, targetTier }` in its metadata, delegates to `TierService.applyUpgrade` to update the organisation's tier.

**Public methods**:

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `createCheckoutSession` | `tenantId: string, successUrl: string, cancelUrl: string` | `Promise<string>` | Creates a Stripe Checkout Session with `mode: 'subscription'`, `client_reference_id: tenantId`, and the price from `STRIPE_PRICE_ID`. Returns the checkout URL. Throws if Stripe does not return a URL. |
| `handleWebhookEvent` | `rawBody: Buffer, sig: string, webhookSecret: string` | `Promise<void>` | Verifies the Stripe webhook signature via `stripe.webhooks.constructEvent`. Handles `customer.subscription.created/updated/deleted` (upserts `tenant_subscriptions`) and `checkout.session.completed` (applies tier upgrade via `TierService` when metadata contains `orgId` and `targetTier`). |
| `getSubscriptionStatus` | `tenantId: string` | `Promise<ISubscriptionStatus>` | Queries `tenant_subscriptions` for the given tenant. Returns `{ tenantId, status: 'free', currentPeriodEnd: null, stripeSubscriptionId: null }` when no row exists. |

**Dependencies**: PostgreSQL (`Pool`), Stripe client (`Stripe`), optional `TierService`.

**Configuration**:
- `STRIPE_PRICE_ID` — Stripe price ID for subscription checkout sessions
- `STRIPE_WEBHOOK_SECRET` — Stripe webhook endpoint secret (`whsec_...`); passed by the webhook controller, not read directly by the service

**DB tables**:
- `tenant_subscriptions`: `tenant_id` (UUID PK or unique), `status` (text — `'free'|'active'|'past_due'|'canceled'`), `stripe_customer_id` (text), `stripe_subscription_id` (text), `current_period_end` (timestamptz nullable), `updated_at`. Upserted on subscription lifecycle events.

**Error types**: None defined in the service. Stripe signature failures raise `Error` from `stripe.webhooks.constructEvent`; these propagate to the error handler as 400 responses.

---

### OIDCService (A2A / OIDC Provider)

**Note**: `src/services/OIDCService.ts` does not exist as a standalone file — OIDC provider functionality is handled by the `oidc-provider` npm package, configured in `src/app.ts` and related route files. The service boundary for OIDC-related business logic is the `DelegationService`. Document the OIDC integration as follows.

**Purpose**: The OIDC/A2A subsystem provides agent-to-agent (A2A) delegation using the `oidc-provider` library (v9.7.x). The provider is mounted as a sub-application at `/oidc` and issues short-lived delegation tokens scoped to a specific `delegatee_id`. The `DelegationService` (`src/services/DelegationService.ts`) manages the `delegation_chains` table for auditing.

**Key endpoints exposed by the OIDC provider**:
- `POST /oidc/token` — issues delegation tokens via `client_credentials` or custom grant
- `GET /oidc/.well-known/openid-configuration` — OIDC discovery document
- `GET /oidc/jwks` — public JWK Set for verifying delegation tokens

**DelegationService public methods** (from `src/services/DelegationService.ts`):

| Method | Parameters | Returns | Description |
|--------|-----------|---------|-------------|
| `createDelegation` | `delegatorId: string, delegateeId: string, scope: string, expiresAt?: Date` | `Promise<IDelegationChain>` | Inserts a delegation chain record into `delegation_chains`. Validates both agents exist and are active. |
| `verifyDelegation` | `token: string, delegateeId: string` | `Promise<IDelegationVerifyResult>` | Verifies the delegation token signature and checks the chain record is active and not expired. |
| `revokeDelegation` | `chainId: string, delegatorId: string` | `Promise<void>` | Sets `delegation_chains.status = 'revoked'` and `revoked_at = NOW()`. Validates the delegator owns the chain. |

**DB tables**:
- `delegation_chains`: `chain_id` (UUID PK), `delegator_id` (UUID), `delegatee_id` (UUID), `scope` (text), `status` (`active|revoked|expired`), `created_at`, `expires_at` (nullable), `revoked_at` (nullable), `token` (text — the delegation JWT).

**Configuration**:
- `A2A_ENABLED` — when set to `'false'`, A2A/delegation endpoints return 404
- `OIDC_ISSUER` — issuer URL for the OIDC provider