sentryagent-idp/docs/engineering/02-architecture.md
SentryAgent.ai Developer 8cabc0191c docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00


System Architecture


1. Component Diagram

graph TD
    Client["Client (AI Agent / Browser / CI)"]

    Client -->|HTTPS| ExpressApp["Express App (AgentIdP)"]

    subgraph ExpressApp["Express App — src/app.ts"]
        Router["Router (src/routes/)"]
        AuthMW["authMiddleware (src/middleware/auth.ts)"]
        TierMW["tierMiddleware (src/middleware/tier.ts)"]
        OpaMW["opaMiddleware (src/middleware/opa.ts)"]
        Controller["Controller (src/controllers/)"]
        Service["Service (src/services/)"]
        Repository["Repository (src/repositories/)"]
        Router --> AuthMW --> TierMW --> OpaMW --> Controller --> Service --> Repository
    end

    Repository -->|parameterized SQL| PG["PostgreSQL 14\n(agents, credentials, audit_events,\nanalytics_events, organizations,\nfederation_partners, webhook_subscriptions,\nagent_did_keys, delegation_chains)"]
    Service -->|Redis commands| Redis["Redis 7\n(token revocation list, daily tier counters,\nJWKS cache, compliance report cache,\nDID document cache)"]
    Service -->|KV v2 read/write| Vault["HashiCorp Vault\n(opt-in — credentials, DID private keys,\nwebhook secrets — when VAULT_ADDR is set)"]

    ExpressApp -->|evaluate input| OPA["OPA Policy Engine\n(policies/authz.rego + data/scopes.json)"]
    ExpressApp -->|expose| Metrics["/metrics (prom-client)"]
    ExpressApp -->|checkout session / webhooks| Stripe["Stripe\n(billing — when STRIPE_SECRET_KEY is set)"]

    Dashboard["Dashboard SPA (React 18 + Vite 5)\ndashboard/dist/ served from /dashboard"]
    Portal["Developer Portal (Next.js 14)\nportal/ — served separately on port 3002"]
    Client -->|browser| Dashboard
    Client -->|browser| Portal
    Dashboard -->|REST API calls| ExpressApp
    Portal -->|REST API calls| ExpressApp

    Grafana["Grafana (port 3001)"] -->|scrapes| Metrics

    OIDCProvider["OIDC Provider (oidc-provider v9)\nmounted at /oidc — A2A delegation tokens"]
    ExpressApp --- OIDCProvider

2. HTTP Request Lifecycle

Every authenticated API request travels through the following sequence. Understanding this sequence end-to-end is essential for debugging and for writing new endpoints correctly.

  1. HTTP request arrives at the Node.js HTTP listener — configured in src/server.ts, which calls app.listen(PORT) after createApp() resolves.
  2. App-level middleware runs in registration order: helmet() sets security headers; cors() applies the CORS policy from CORS_ORIGIN; morgan('combined') logs the request line (skipped when NODE_ENV=test); express.json() and express.urlencoded() parse the body; and metricsMiddleware (src/middleware/metrics.ts) starts the request timer, then records agentidp_http_requests_total and agentidp_http_request_duration_seconds when the response finishes.
  3. The Express router matches the path to a route definition in src/routes/*.ts and hands off to the appropriate middleware chain.
  4. authMiddleware (src/middleware/auth.ts) validates the Bearer JWT: extracts the token from the Authorization header, calls verifyToken() for RS256 signature and expiry, then calls redis.get('revoked:{jti}') to check the revocation list. On success, attaches the decoded ITokenPayload to req.user.
  5. tierMiddleware (src/middleware/tier.ts) enforces per-tier daily API call limits. It reads the organisation's current tier from TierService.fetchTier(orgId), checks the daily call counter from Redis key rate:tier:calls:<orgId> against TIER_CONFIG[tier].maxCallsPerDay, increments the counter on each passing request (fire-and-forget INCR with TTL set to next UTC midnight), and throws TierLimitError (429) when the limit is reached. This middleware is applied only to API routes, not to /health, /metrics, or /dashboard.
  6. opaMiddleware (src/middleware/opa.ts) evaluates the OPA policy: builds an OpaInput object from req.method, req.baseUrl + req.path, and req.user.scope.split(' '), then calls evaluate(input). Uses the Wasm bundle (policies/authz.wasm) when present, or the TypeScript fallback reading policies/data/scopes.json. Calls next(new AuthorizationError()) if the policy denies.
  7. The controller (src/controllers/*.ts) receives the validated request, extracts and validates path params and body using Joi schemas, then delegates to the service layer.
  8. The service (src/services/*.ts) executes all business logic: it enforces tier limits, resolves domain rules, and calls repositories. The service has no knowledge of HTTP. Phases 3–6 introduced specialised services: AnalyticsService (fire-and-forget event recording), TierService (enforces per-tier agent and call limits), ComplianceService (AGNTCY compliance reports, cached 5 min in Redis), FederationService (cross-IdP JWT verification with cached JWKS), DIDService (W3C DID document generation and caching), WebhookService (subscription management with Vault-backed HMAC secrets), and BillingService (Stripe Checkout and webhook processing).
  9. The repository (src/repositories/*.ts) executes parameterized SQL against PostgreSQL via node-postgres, or issues Redis commands via the redis client. No business logic lives here. Phases 3–6 added the following tables: analytics_events (daily metric counters), organizations (org tier and billing), federation_partners (cross-IdP trust registry), webhook_subscriptions and webhook_deliveries (outbound event delivery), agent_did_keys (public EC keys for DID documents), delegation_chains (A2A delegation records), and tenant_subscriptions (Stripe subscription status).
  10. The controller serialises the service result and calls res.status(xxx).json(payload).
  11. AuditService.logEvent() is called — for high-throughput paths (token issuance, introspection, revocation) this is fire-and-forget (void — not awaited); for CRUD operations it is awaited. The audit event is written as an immutable row to the audit_events table in PostgreSQL.
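
The ordering in steps 4 to 6 (authenticate, then count against the tier, then authorise) can be sketched without Express. This is a minimal, framework-free illustration under stated assumptions, not the project's implementation: the Req type, the fakeTokens table, MAX_CALLS_PER_DAY, the requiredScope map, and the runChain helper are all hypothetical stand-ins for JWT verification, Redis, and OPA.

```typescript
// Hypothetical stand-ins for the Express request and middleware types.
interface Req {
  method: string;
  path: string;
  headers: Record<string, string | undefined>;
  user?: { sub: string; orgId: string; scope: string };
}
type Next = (err?: Error) => void;
type Middleware = (req: Req, next: Next) => void;

class HttpError extends Error {
  status: number;
  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

// Step 4: accept only "Bearer <token>" and attach the decoded payload.
// Decoding is faked with a lookup table (assumption for illustration).
const fakeTokens: Record<string, Req["user"]> = {
  "good-token": { sub: "agent-1", orgId: "org-1", scope: "agents:read" },
};
const authMiddleware: Middleware = (req, next) => {
  const header = req.headers["authorization"] ?? "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : "";
  const user = fakeTokens[token];
  if (!user) return next(new HttpError(401, "invalid token"));
  req.user = user;
  next();
};

// Step 5: daily call limit per organisation (in-memory map instead of Redis).
const calls = new Map<string, number>();
const MAX_CALLS_PER_DAY = 2; // illustrative, not a real TIER_CONFIG value
const tierMiddleware: Middleware = (req, next) => {
  const orgId = req.user!.orgId;
  const used = calls.get(orgId) ?? 0;
  if (used >= MAX_CALLS_PER_DAY) return next(new HttpError(429, "tier limit"));
  calls.set(orgId, used + 1);
  next();
};

// Step 6: a scope lookup standing in for the OPA policy decision.
const requiredScope: Record<string, string> = {
  "GET /api/v1/agents": "agents:read",
};
const opaMiddleware: Middleware = (req, next) => {
  const needed = requiredScope[`${req.method} ${req.path}`];
  const scopes = req.user!.scope.split(" ");
  if (!needed || !scopes.includes(needed)) {
    return next(new HttpError(403, "denied"));
  }
  next();
};

// Runs the chain in order and stops at the first error.
function runChain(req: Req, chain: Middleware[]): number {
  let status = 200;
  let index = 0;
  const next: Next = (err) => {
    if (err) {
      status = err instanceof HttpError ? err.status : 500;
      return;
    }
    const mw = chain[index++];
    if (mw) mw(req, next);
  };
  next();
  return status;
}

const chain: Middleware[] = [authMiddleware, tierMiddleware, opaMiddleware];
```

Because the tier step runs before the policy step, a request that is later denied with 403 still consumes one call from the daily allowance in this sketch, while a request rejected with 401 does not.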

3. OAuth 2.0 Client Credentials Flow

sequenceDiagram
    actor Agent
    participant AgentIdP
    participant PostgreSQL
    participant Redis
    participant Vault as Vault (optional)

    Agent->>AgentIdP: POST /api/v1/token<br/>grant_type=client_credentials<br/>client_id=&lt;agentId&gt;<br/>client_secret=sk_live_...<br/>scope=agents:read agents:write

    AgentIdP->>PostgreSQL: SELECT * FROM agents WHERE agent_id = $1
    PostgreSQL-->>AgentIdP: agent row (status, etc.)

    AgentIdP->>PostgreSQL: SELECT * FROM credentials WHERE agent_id = $1 AND status = 'active'
    PostgreSQL-->>AgentIdP: active credential rows

    alt Vault path (vaultPath IS NOT NULL and VAULT_ADDR is set)
        AgentIdP->>Vault: readSecret(agentId, credentialId)
        Vault-->>AgentIdP: plain-text secret
        AgentIdP->>AgentIdP: crypto.timingSafeEqual(stored, candidate)
    else bcrypt path (fallback)
        AgentIdP->>AgentIdP: bcrypt.compare(clientSecret, secretHash)
    end

    AgentIdP->>Redis: GET monthly:tokens:{agentId}:{yyyy-mm}
    Redis-->>AgentIdP: current monthly count

    AgentIdP->>AgentIdP: signToken(payload, privateKey) — RS256 JWT

    AgentIdP->>Redis: INCR monthly:tokens:{agentId}:{yyyy-mm} (fire-and-forget)

    AgentIdP-->>Agent: 200 OK<br/>{ access_token, token_type: "Bearer", expires_in: 3600, scope }

    Note over Agent,AgentIdP: Subsequent protected API call

    Agent->>AgentIdP: GET /api/v1/agents<br/>Authorization: Bearer &lt;access_token&gt;
    AgentIdP->>AgentIdP: verifyToken(token, publicKey) — RS256 verify + expiry
    AgentIdP->>Redis: GET revoked:{jti}
    Redis-->>AgentIdP: null (not revoked)
    AgentIdP->>AgentIdP: OPA evaluate({method, path, scopes})
    AgentIdP-->>Agent: 200 OK — agents list
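
Two details of the flow above are easy to get wrong: the constant-time comparison on the Vault path and the shape of the monthly counter key. The sketch below illustrates both, assuming Node's built-in crypto module. The secretsMatch and monthlyTokenKey helper names are hypothetical, and deriving the month from UTC is an assumption. crypto.timingSafeEqual throws when its two inputs differ in length, so this sketch hashes both sides to a fixed 32 bytes before comparing.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compares two secrets in constant time. Both values are hashed with
// SHA-256 first so the buffers always have equal length.
function secretsMatch(stored: string, candidate: string): boolean {
  const a = createHash("sha256").update(stored, "utf8").digest();
  const b = createHash("sha256").update(candidate, "utf8").digest();
  return timingSafeEqual(a, b);
}

// Builds the counter key from the diagram: monthly:tokens:{agentId}:{yyyy-mm}.
// The month is taken in UTC (assumption).
function monthlyTokenKey(agentId: string, now: Date): string {
  const yyyy = now.getUTCFullYear();
  const mm = String(now.getUTCMonth() + 1).padStart(2, "0");
  return `monthly:tokens:${agentId}:${yyyy}-${mm}`;
}
```

The bcrypt path is not shown; bcrypt.compare performs its own comparison against the stored hash.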

3b. Analytics Event Capture Flow

Every successful token issuance writes a fire-and-forget analytics event:

sequenceDiagram
    participant Controller as TokenController
    participant OAuth2Svc as OAuth2Service
    participant AnalyticsSvc as AnalyticsService
    participant PG as PostgreSQL

    Controller->>OAuth2Svc: issueToken(clientId, clientSecret, scope, ...)
    OAuth2Svc->>OAuth2Svc: signToken() — RS256 JWT
    OAuth2Svc-->>Controller: ITokenResponse

    Note over OAuth2Svc,AnalyticsSvc: fire-and-forget (void)
    OAuth2Svc-)AnalyticsSvc: recordEvent(tenantId, 'token_issued')
    AnalyticsSvc-)PG: INSERT INTO analytics_events ... ON CONFLICT DO UPDATE count + 1

recordEvent uses PostgreSQL UPSERT — one row per (organization_id, date, metric_type). If the INSERT conflicts (same date, same org, same metric), the count column is incremented atomically. This keeps the table compact (one row per day per metric type per org) and fast to query.
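
The upsert described above might look like the statement below. It is a sketch under assumptions: the column names follow the paragraph above, a unique constraint on (organization_id, date, metric_type) is assumed to exist, and the in-memory recordEvent function is a hypothetical model of the same semantics rather than the real AnalyticsService method.

```typescript
// Assumed SQL shape for the daily counter upsert (PostgreSQL syntax).
const UPSERT_SQL = `
  INSERT INTO analytics_events (organization_id, date, metric_type, count)
  VALUES ($1, CURRENT_DATE, $2, 1)
  ON CONFLICT (organization_id, date, metric_type)
  DO UPDATE SET count = analytics_events.count + 1
`;

// In-memory model: one counter per (organization, date, metric type).
const counters = new Map<string, number>();

function recordEvent(orgId: string, metricType: string, date: string): number {
  const key = `${orgId}|${date}|${metricType}`;
  // First event of the day inserts 1; later events increment the same row.
  const count = (counters.get(key) ?? 0) + 1;
  counters.set(key, count);
  return count;
}
```

Two events for the same organization, date, and metric share one entry, while a new date or metric type creates a new one, which is what keeps the table at one row per day per metric per org.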


3c. Tier Enforcement Middleware Chain

sequenceDiagram
    actor Agent
    participant TierMW as tierMiddleware
    participant TierSvc as TierService
    participant Redis
    participant PG as PostgreSQL

    Agent->>TierMW: API request (with valid Bearer token)
    TierMW->>TierSvc: fetchTier(orgId)
    TierSvc->>PG: SELECT tier FROM organizations WHERE organization_id = $1
    PG-->>TierSvc: 'pro'
    TierSvc-->>TierMW: 'pro'

    TierMW->>Redis: GET rate:tier:calls:<orgId>
    Redis-->>TierMW: "4999" (current daily count)

    Note over TierMW: TIER_CONFIG['pro'].maxCallsPerDay = 50000 — limit not reached

    TierMW-)Redis: INCR rate:tier:calls:<orgId> (fire-and-forget, TTL = next UTC midnight)
    TierMW->>Agent: next() — request proceeds to opaMiddleware

When the counter equals or exceeds the tier limit, tierMiddleware throws TierLimitError (429) before opaMiddleware runs. The daily counter resets at UTC midnight via Redis TTL.
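
The limit check and the midnight reset can be sketched as pure functions. This is an illustration under assumptions: only the pro limit of 50000 comes from the diagram above, the free value is hypothetical, and the helper names tierCounterKey, secondsUntilUtcMidnight, and isOverLimit are not taken from the codebase.

```typescript
// Only the 'pro' value appears in this document; 'free' is hypothetical.
const TIER_CONFIG: Record<string, { maxCallsPerDay: number }> = {
  free: { maxCallsPerDay: 1000 },
  pro: { maxCallsPerDay: 50000 },
};

// Redis key for the daily counter, as shown in the diagram.
function tierCounterKey(orgId: string): string {
  return `rate:tier:calls:${orgId}`;
}

// Seconds from `now` until the next UTC midnight, usable as the counter TTL.
// Date.UTC normalises day overflow, so the last day of a month is handled.
function secondsUntilUtcMidnight(now: Date): number {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1,
  );
  return Math.ceil((nextMidnight - now.getTime()) / 1000);
}

// Redis GET returns a string or null; a missing key counts as zero calls.
function isOverLimit(tier: string, currentCount: string | null): boolean {
  const config = TIER_CONFIG[tier];
  if (!config) throw new Error(`unknown tier: ${tier}`);
  const used = currentCount === null ? 0 : Number.parseInt(currentCount, 10);
  return used >= config.maxCallsPerDay;
}
```

Using "greater than or equal" matches the rule stated above: the request that would exceed the allowance is rejected, not the one after it.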


3d. A2A Delegation End-to-End Flow

sequenceDiagram
    actor Delegator as Delegator Agent
    actor Delegatee as Delegatee Agent
    participant AgentIdP
    participant DelegationSvc as DelegationService
    participant OIDCProvider as OIDC Provider
    participant PG as PostgreSQL

    Delegator->>AgentIdP: POST /api/v1/oauth2/token/delegate<br/>{ delegatee_id, scope }
    AgentIdP->>DelegationSvc: createDelegation(delegatorId, delegateeId, scope)
    DelegationSvc->>PG: INSERT INTO delegation_chains ...
    PG-->>DelegationSvc: chain_id
    DelegationSvc->>OIDCProvider: issue delegation JWT (delegator claims + delegatee sub)
    OIDCProvider-->>DelegationSvc: signed delegation token
    DelegationSvc-->>AgentIdP: IDelegationChain (with token)
    AgentIdP-->>Delegator: 201 { token, chain_id }

    Note over Delegatee,AgentIdP: Delegatee uses the delegation token
    Delegatee->>AgentIdP: POST /api/v1/oauth2/token/verify-delegation<br/>{ token }
    AgentIdP->>DelegationSvc: verifyDelegation(token, delegateeId)
    DelegationSvc->>PG: SELECT * FROM delegation_chains WHERE chain_id = $1 AND status = 'active'
    PG-->>DelegationSvc: chain row (not expired, not revoked)
    DelegationSvc->>OIDCProvider: verify token signature
    OIDCProvider-->>DelegationSvc: verified claims
    DelegationSvc-->>AgentIdP: IDelegationVerifyResult { valid: true, ... }
    AgentIdP-->>Delegatee: 200 { valid: true, delegatorId, scope }
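
The database-side checks in the verification half of this flow can be sketched as a pure function. Everything here is a hypothetical illustration: the DelegationChain and VerifyResult field names, the expiresAt column, and the reason strings are assumptions, and token signature verification by the OIDC provider is deliberately left out.

```typescript
// Hypothetical row shape for delegation_chains.
interface DelegationChain {
  chainId: string;
  delegatorId: string;
  delegateeId: string;
  scope: string; // space-separated scopes
  status: "active" | "revoked";
  expiresAt: Date;
}

interface VerifyResult {
  valid: boolean;
  reason?: string;
  delegatorId?: string;
  scope?: string;
}

// Checks that the chain exists, is active, is not expired, and was issued
// to the caller. Signature verification is assumed to happen separately.
function verifyDelegation(
  chain: DelegationChain | undefined,
  delegateeId: string,
  now: Date,
): VerifyResult {
  if (!chain) return { valid: false, reason: "unknown chain" };
  if (chain.status !== "active") return { valid: false, reason: "revoked" };
  if (chain.expiresAt.getTime() <= now.getTime()) {
    return { valid: false, reason: "expired" };
  }
  if (chain.delegateeId !== delegateeId) {
    return { valid: false, reason: "wrong delegatee" };
  }
  return { valid: true, delegatorId: chain.delegatorId, scope: chain.scope };
}
```

Binding the result to the caller's delegateeId means a leaked delegation token is not accepted when presented by a different agent in this sketch.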

4. Multi-Region Deployment Topology

graph LR
    TFRoot["Terraform Root Module\nterraform/"]
    TFRoot --> AWSMod["AWS Module\nterraform/environments/aws/"]
    TFRoot --> GCPMod["GCP Module\nterraform/environments/gcp/"]

    subgraph AWS["AWS (us-east-1 default)"]
        AWSVPC["VPC"] --> ECSCluster["ECS Cluster (Fargate)"]
        ECSCluster --> ECSTask["ECS Task — AgentIdP container"]
        ECSTask --> RDS["RDS PostgreSQL 14 (Multi-AZ)"]
        ECSTask --> Elasticache["ElastiCache Redis 7"]
        ALB["Application Load Balancer"] --> ECSCluster
    end

    subgraph GCP["GCP (us-central1 default)"]
        GCPVPC["VPC"] --> CloudRun["Cloud Run service — AgentIdP"]
        CloudRun --> CloudSQL["Cloud SQL PostgreSQL 14"]
        CloudRun --> Memorystore["Memorystore Redis 7"]
        GCPLB["Cloud Load Balancer"] --> CloudRun
    end

    AWSMod --> AWS
    GCPMod --> GCP

    ECR["ECR / Artifact Registry\n(container image)"] --> ECSTask
    ECR --> CloudRun

Each region is an independent deployment with its own PostgreSQL and Redis instances. The Terraform root module sets aws_region (default us-east-1) and gcp_region (default us-central1) as input variables. Infrastructure modules live under terraform/modules/ (agentidp, lb, rds, redis) with environment-specific configuration under terraform/environments/aws/ and terraform/environments/gcp/. Cross-region data replication and federation are Phase 3 goals.