Files
sentryagent-idp/docs/engineering/02-architecture.md
SentryAgent.ai Developer 8cabc0191c docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00

236 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# System Architecture
---
## 1. Component Diagram
```mermaid
graph TD
Client["Client (AI Agent / Browser / CI)"]
Client -->|HTTPS| ExpressApp["Express App (AgentIdP)"]
subgraph ExpressApp["Express App — src/app.ts"]
Router["Router (src/routes/)"]
AuthMW["authMiddleware (src/middleware/auth.ts)"]
TierMW["tierMiddleware (src/middleware/tier.ts)"]
OpaMW["opaMiddleware (src/middleware/opa.ts)"]
Controller["Controller (src/controllers/)"]
Service["Service (src/services/)"]
Repository["Repository (src/repositories/)"]
Router --> AuthMW --> TierMW --> OpaMW --> Controller --> Service --> Repository
end
Repository -->|parameterized SQL| PG["PostgreSQL 14\n(agents, credentials, audit_events,\nanalytics_events, organizations,\nfederation_partners, webhook_subscriptions,\nagent_did_keys, delegation_chains)"]
Service -->|Redis commands| Redis["Redis 7\n(token revocation list, daily tier counters,\nJWKS cache, compliance report cache,\nDID document cache)"]
Service -->|KV v2 read/write| Vault["HashiCorp Vault\n(opt-in — credentials, DID private keys,\nwebhook secrets — when VAULT_ADDR is set)"]
ExpressApp -->|evaluate input| OPA["OPA Policy Engine\n(policies/authz.rego + data/scopes.json)"]
ExpressApp -->|expose| Metrics["/metrics (prom-client)"]
ExpressApp -->|checkout session / webhooks| Stripe["Stripe\n(billing — when STRIPE_SECRET_KEY is set)"]
Dashboard["Dashboard SPA (React 18 + Vite 5)\ndashboard/dist/ served from /dashboard"]
Portal["Developer Portal (Next.js 14)\nportal/ — served separately on port 3002"]
Client -->|browser| Dashboard
Client -->|browser| Portal
Dashboard -->|REST API calls| ExpressApp
Portal -->|REST API calls| ExpressApp
Grafana["Grafana (port 3001)"] -->|scrapes| Metrics
OIDCProvider["OIDC Provider (oidc-provider v9)\nmounted at /oidc — A2A delegation tokens"]
ExpressApp --- OIDCProvider
```
---
## 2. HTTP Request Lifecycle
Every authenticated API request travels through the following sequence. Understanding
this sequence end-to-end is essential for debugging and for writing new endpoints
correctly.
1. HTTP request arrives at the Node.js HTTP listener — configured in `src/server.ts`, which calls `app.listen(PORT)` after `createApp()` resolves.
2. App-level middleware runs in registration order: `helmet()` sets security headers, `cors()` applies CORS policy from `CORS_ORIGIN`, `morgan('combined')` logs the request line (skipped in `NODE_ENV=test`), `express.json()` and `express.urlencoded()` parse the body, `metricsMiddleware` (`src/middleware/metrics.ts`) starts the request timer and records `agentidp_http_requests_total` and `agentidp_http_request_duration_seconds` on response finish.
3. The Express router matches the path to a route definition in `src/routes/*.ts` and hands off to the appropriate middleware chain.
4. `authMiddleware` (`src/middleware/auth.ts`) validates the Bearer JWT: extracts the token from the `Authorization` header, calls `verifyToken()` for RS256 signature and expiry, then calls `redis.get('revoked:{jti}')` to check the revocation list. On success, attaches the decoded `ITokenPayload` to `req.user`.
5. `tierMiddleware` (`src/middleware/tier.ts`) enforces per-tier daily API call limits. It reads the organisation's current tier from `TierService.fetchTier(orgId)`, checks the daily call counter from Redis key `rate:tier:calls:<orgId>` against `TIER_CONFIG[tier].maxCallsPerDay`, increments the counter on each passing request (fire-and-forget `INCR` with TTL set to next UTC midnight), and throws `TierLimitError` (429) when the limit is reached. This middleware is applied only to API routes, not to `/health`, `/metrics`, or `/dashboard`.
6. `opaMiddleware` (`src/middleware/opa.ts`) evaluates the OPA policy: builds an `OpaInput` object from `req.method`, `req.baseUrl + req.path`, and `req.user.scope.split(' ')`, then calls `evaluate(input)`. Uses the Wasm bundle (`policies/authz.wasm`) when present, or the TypeScript fallback reading `policies/data/scopes.json`. Calls `next(new AuthorizationError())` if the policy denies.
7. The controller (`src/controllers/*.ts`) receives the validated request, extracts and validates path params and body using Joi schemas, then delegates to the service layer.
8. The service (`src/services/*.ts`) executes all business logic — enforces tier limits, resolves domain rules, and calls repositories. Phase 36 introduces specialised services: `AnalyticsService` (fire-and-forget event recording), `TierService` (enforces per-tier agent and call limits), `ComplianceService` (AGNTCY compliance reports, cached 5 min in Redis), `FederationService` (cross-IdP JWT verification with cached JWKS), `DIDService` (W3C DID document generation and caching), `WebhookService` (subscription management with Vault-backed HMAC secrets), and `BillingService` (Stripe Checkout and webhook processing). The service has no knowledge of HTTP.
9. The repository (`src/repositories/*.ts`) executes parameterized SQL against PostgreSQL via `node-postgres`, or issues Redis commands via the `redis` client. No business logic lives here. Phase 36 added the following tables: `analytics_events` (daily metric counters), `organizations` (org tier and billing), `federation_partners` (cross-IdP trust registry), `webhook_subscriptions` and `webhook_deliveries` (outbound event delivery), `agent_did_keys` (public EC keys for DID documents), `delegation_chains` (A2A delegation records), `tenant_subscriptions` (Stripe subscription status).
10. The controller serialises the service result and calls `res.status(xxx).json(payload)`.
11. `AuditService.logEvent()` is called — for high-throughput paths (token issuance, introspection, revocation) this is fire-and-forget (`void` — not awaited); for CRUD operations it is awaited. The audit event is written as an immutable row to the `audit_events` table in PostgreSQL.
---
## 3. OAuth 2.0 Client Credentials Flow
```mermaid
sequenceDiagram
actor Agent
participant AgentIdP
participant PostgreSQL
participant Redis
participant Vault as Vault (optional)
Agent->>AgentIdP: POST /api/v1/token<br/>grant_type=client_credentials<br/>client_id=&lt;agentId&gt;<br/>client_secret=sk_live_...&<br/>scope=agents:read agents:write
AgentIdP->>PostgreSQL: SELECT * FROM agents WHERE agent_id = $1
PostgreSQL-->>AgentIdP: agent row (status, etc.)
AgentIdP->>PostgreSQL: SELECT * FROM credentials WHERE agent_id = $1 AND status = 'active'
PostgreSQL-->>AgentIdP: active credential rows
alt Vault path (vaultPath IS NOT NULL and VAULT_ADDR is set)
AgentIdP->>Vault: readSecret(agentId, credentialId)
Vault-->>AgentIdP: plain-text secret
AgentIdP->>AgentIdP: crypto.timingSafeEqual(stored, candidate)
else bcrypt path (fallback)
AgentIdP->>AgentIdP: bcrypt.compare(clientSecret, secretHash)
end
AgentIdP->>Redis: GET monthly:tokens:{agentId}:{yyyy-mm}
Redis-->>AgentIdP: current monthly count
AgentIdP->>AgentIdP: signToken(payload, privateKey) — RS256 JWT
AgentIdP->>Redis: INCR monthly:tokens:{agentId}:{yyyy-mm} (fire-and-forget)
AgentIdP-->>Agent: 200 OK<br/>{ access_token, token_type: "Bearer", expires_in: 3600, scope }
Note over Agent,AgentIdP: Subsequent protected API call
Agent->>AgentIdP: GET /api/v1/agents<br/>Authorization: Bearer &lt;access_token&gt;
AgentIdP->>AgentIdP: verifyToken(token, publicKey) — RS256 verify + expiry
AgentIdP->>Redis: GET revoked:{jti}
Redis-->>AgentIdP: null (not revoked)
AgentIdP->>AgentIdP: OPA evaluate({method, path, scopes})
AgentIdP-->>Agent: 200 OK — agents list
```
---
## 3b. Analytics Event Capture Flow
Every successful token issuance writes a fire-and-forget analytics event:
```mermaid
sequenceDiagram
participant Controller as TokenController
participant OAuth2Svc as OAuth2Service
participant AnalyticsSvc as AnalyticsService
participant PG as PostgreSQL
Controller->>OAuth2Svc: issueToken(clientId, clientSecret, scope, ...)
OAuth2Svc->>OAuth2Svc: signToken() — RS256 JWT
OAuth2Svc-->>Controller: ITokenResponse
Note over OAuth2Svc,AnalyticsSvc: fire-and-forget (void)
OAuth2Svc-)AnalyticsSvc: recordEvent(tenantId, 'token_issued')
AnalyticsSvc-)PG: INSERT INTO analytics_events ... ON CONFLICT DO UPDATE count + 1
```
`recordEvent` uses PostgreSQL `UPSERT` — one row per `(organization_id, date, metric_type)`. If the INSERT conflicts (same date, same org, same metric), the `count` column is incremented atomically. This keeps the table compact (one row per day per metric type per org) and fast to query.
---
## 3c. Tier Enforcement Middleware Chain
```mermaid
sequenceDiagram
actor Agent
participant TierMW as tierMiddleware
participant TierSvc as TierService
participant Redis
participant PG as PostgreSQL
Agent->>TierMW: API request (with valid Bearer token)
TierMW->>TierSvc: fetchTier(orgId)
TierSvc->>PG: SELECT tier FROM organizations WHERE organization_id = $1
PG-->>TierSvc: 'pro'
TierSvc-->>TierMW: 'pro'
TierMW->>Redis: GET rate:tier:calls:<orgId>
Redis-->>TierMW: "4999" (current daily count)
Note over TierMW: TIER_CONFIG['pro'].maxCallsPerDay = 50000 — limit not reached
TierMW-)Redis: INCR rate:tier:calls:<orgId> (fire-and-forget, TTL = next UTC midnight)
TierMW->>Agent: next() — request proceeds to opaMiddleware
```
When the counter equals or exceeds the tier limit, `tierMiddleware` throws `TierLimitError` (429) before `opaMiddleware` runs. The daily counter resets at UTC midnight via Redis TTL.
---
## 3d. A2A Delegation End-to-End Flow
```mermaid
sequenceDiagram
actor Delegator as Delegator Agent
actor Delegatee as Delegatee Agent
participant AgentIdP
participant DelegationSvc as DelegationService
participant OIDCProvider as OIDC Provider
participant PG as PostgreSQL
Delegator->>AgentIdP: POST /api/v1/oauth2/token/delegate<br/>{ delegatee_id, scope }
AgentIdP->>DelegationSvc: createDelegation(delegatorId, delegateeId, scope)
DelegationSvc->>PG: INSERT INTO delegation_chains ...
PG-->>DelegationSvc: chain_id
DelegationSvc->>OIDCProvider: issue delegation JWT (delegator claims + delegatee sub)
OIDCProvider-->>DelegationSvc: signed delegation token
DelegationSvc-->>AgentIdP: IDelegationChain (with token)
AgentIdP-->>Delegator: 201 { token, chain_id }
Note over Delegatee,AgentIdP: Delegatee uses the delegation token
Delegatee->>AgentIdP: POST /api/v1/oauth2/token/verify-delegation<br/>{ token }
AgentIdP->>DelegationSvc: verifyDelegation(token, delegateeId)
DelegationSvc->>PG: SELECT * FROM delegation_chains WHERE chain_id = $1 AND status = 'active'
PG-->>DelegationSvc: chain row (not expired, not revoked)
DelegationSvc->>OIDCProvider: verify token signature
OIDCProvider-->>DelegationSvc: verified claims
DelegationSvc-->>AgentIdP: IDelegationVerifyResult { valid: true, ... }
AgentIdP-->>Delegatee: 200 { valid: true, delegatorId, scope }
```
---
## 4. Multi-Region Deployment Topology
```mermaid
graph LR
TFRoot["Terraform Root Module\nterraform/"]
TFRoot --> AWSMod["AWS Module\nterraform/environments/aws/"]
TFRoot --> GCPMod["GCP Module\nterraform/environments/gcp/"]
subgraph AWS["AWS (us-east-1 default)"]
AWSVPC["VPC"] --> ECSCluster["ECS Cluster (Fargate)"]
ECSCluster --> ECSTask["ECS Task — AgentIdP container"]
ECSTask --> RDS["RDS PostgreSQL 14 (Multi-AZ)"]
ECSTask --> Elasticache["ElastiCache Redis 7"]
ALB["Application Load Balancer"] --> ECSCluster
end
subgraph GCP["GCP (us-central1 default)"]
GCPVPC["VPC"] --> CloudRun["Cloud Run service — AgentIdP"]
CloudRun --> CloudSQL["Cloud SQL PostgreSQL 14"]
CloudRun --> Memorystore["Memorystore Redis 7"]
GCPLB["Cloud Load Balancer"] --> CloudRun
end
AWSMod --> AWS
GCPMod --> GCP
ECR["ECR / Artifact Registry\n(container image)"] --> ECSTask
ECR --> CloudRun
```
Each region is an independent deployment with its own PostgreSQL and Redis instances.
The Terraform root module sets `aws_region` (default `us-east-1`) and `gcp_region`
(default `us-central1`) as input variables. Infrastructure modules live under
`terraform/modules/` (agentidp, lb, rds, redis) with environment-specific configuration
under `terraform/environments/aws/` and `terraform/environments/gcp/`. Cross-region
data replication and federation are Phase 3 goals.