docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# WS5 — Remaining Documentation Updates

**Targets:** 5 separate files with surgical edits.

---

## File 1: `docs/engineering/01-overview.md`

**Operation:** Replace the Phase Roadmap table (Section 4) to reflect Phase 3–6 completion status and add Phase 6 capabilities to the Product Features table.

---

### Change 1a — Update Phase Roadmap Table

**Find (Section 4, the Phase 3 row):**
```
| Phase 3 — Enterprise | PLANNED | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, advanced compliance reporting, SOC 2 Type II certification, enterprise tier (custom retention, SLAs, advanced RBAC) |
```

**Replace with (4 rows — Phase 3 is marked complete and Phases 4–6 are added):**
```
| Phase 3 — Enterprise | COMPLETE | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, OIDC provider (A2A delegation), Rust SDK, developer portal (Next.js 14) |
| Phase 4 — Compliance & Security | COMPLETE | AGNTCY compliance reports (agent-identity + audit-trail sections), audit hash chain verification, SOC 2 CC6.1 AES-256-CBC column encryption (`EncryptionService`), DID document caching, federation partner JWKS caching |
| Phase 5 — Scale & Ecosystem | COMPLETE | Multi-tier subscription model (free/pro/enterprise), Stripe billing integration (`BillingService`, `TierService`), tier enforcement middleware (daily call and token limits), webhook subscriptions + delivery history (`WebhookService`), analytics service (daily event aggregation + trend queries) |
| Phase 6 — Market Expansion | COMPLETE | AGNTCY conformance test suite (4 conformance scenarios), API tiers enforced end-to-end, analytics dashboard in developer portal, full Phase 6 engineering documentation update |
```

---

### Change 1b — Add Phase 3–6 Capabilities to Product Features Table

**Find (Section 3, the last row of the features table):**
```
| Health Check | `GET /health` | Checks PostgreSQL and Redis connectivity; unauthenticated; used by load balancers |
```

**Insert the following rows after that line (they extend the existing table):**
```
| W3C Decentralised Identifiers | `GET /api/v1/agents/:id/did`, `GET /api/v1/.well-known/did.json` | DID Core 1.0 documents; `did:web` method; EC P-256 keys; AGNTCY extension fields |
| AGNTCY Agent Cards | `GET /api/v1/agents/:id/card` | Machine-readable agent identity summary; AGNTCY schema v1.0 |
| AGNTCY Compliance Reports | `GET /api/v1/compliance/report`, `GET /api/v1/compliance/agent-cards` | Compliance sections: agent-identity + audit-trail; cached 5 min; AGNTCY schema v1.0 |
| Federation (Cross-IdP) | `POST /api/v1/federation/partners`, `GET /api/v1/federation/partners`, `POST /api/v1/federation/verify` | Register partner IdPs; verify cross-IdP JWTs using cached partner JWKS |
| A2A Delegation | `POST /api/v1/oauth2/token/delegate`, `POST /api/v1/oauth2/token/verify-delegation` | Agent-to-agent delegation tokens; OIDC provider (oidc-provider v9) mounted at `/oidc` |
| Webhook Subscriptions | `POST /api/v1/webhooks`, `GET /api/v1/webhooks`, `GET /api/v1/webhooks/:id/deliveries` | Outbound event delivery with HMAC signing; Vault-backed secrets; delivery history |
| Tier Management | `GET /api/v1/tiers/status`, `POST /api/v1/tiers/upgrade` | Free / Pro / Enterprise tiers; daily call and token limits; Stripe Checkout upgrade flow |
| Billing | `POST /api/v1/billing/checkout`, `POST /api/v1/billing/webhook`, `GET /api/v1/billing/status` | Stripe subscription management; webhook event processing |
| Analytics | Internal (via `AnalyticsService`) | Daily aggregated event counts per org; token trend queries (up to 90 days); agent activity heatmap; usage summary |
| Developer Portal | `/portal` (Next.js 14, separate process) | Get-started wizard, SDK explorer, API reference, analytics dashboard, pricing page |
```

---

### Change 1c — Update Free Tier Limits Table

**Find (Section 6, entire table):**
```
| Limit | Value |
|-------|-------|
| Max agents | 100 |
| Max credentials per agent | No hard cap enforced in code (5 is the documented recommendation) |
| Max tokens in flight | 10,000 per agent per calendar month |
| Token TTL | 3,600 seconds (1 hour) |
| Audit log retention | 90 days |
| API rate limit | 100 requests per minute per IP address |
```

**Replace with:**
```
| Limit | Free Tier | Pro Tier | Enterprise Tier |
|-------|-----------|----------|-----------------|
| Max agents | 100 | 1,000 | Unlimited |
| Max API calls per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Max tokens per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Token TTL | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) |
| Audit log retention | 90 days | 1 year | Custom |
| API rate limit (per IP) | 100 req/min | 100 req/min | 100 req/min |
| Webhook subscriptions | 0 | 10 | Unlimited |
| Analytics retention | 90 days | 1 year | Custom |

Tier limits are configured in `src/config/tiers.ts` (`TIER_CONFIG`). Enforcement is handled by `TierService.enforceAgentLimit()` (agent cap) and `src/middleware/tier.ts` (daily call/token caps). Tier upgrades are initiated via `POST /api/v1/tiers/upgrade` and confirmed via the Stripe webhook.
```
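
For orientation while making these edits (this sketch is illustrative background, not part of the replacement text): the `TIER_CONFIG` referenced above can be pictured roughly as below. The field names and numeric limits are assumptions — only the free-tier daily call limit of 1,000 is corroborated elsewhere in this document (the 429 example in Walkthrough 5); check `src/config/tiers.ts` for the real values.

```typescript
// Hypothetical sketch of src/config/tiers.ts — names and numbers are illustrative.
export type TierName = 'free' | 'pro' | 'enterprise';

export interface TierLimits {
  maxAgents: number;       // Infinity = unlimited (enterprise)
  maxCallsPerDay: number;
  maxTokensPerDay: number;
}

export const TIER_CONFIG: Record<TierName, TierLimits> = {
  free:       { maxAgents: 100,      maxCallsPerDay: 1_000,    maxTokensPerDay: 10_000 },
  pro:        { maxAgents: 1_000,    maxCallsPerDay: 50_000,   maxTokensPerDay: 500_000 },
  enterprise: { maxAgents: Infinity, maxCallsPerDay: Infinity, maxTokensPerDay: Infinity },
};

// Rank used for upgrade/downgrade comparisons.
export const TIER_RANK: Record<TierName, number> = { free: 0, pro: 1, enterprise: 2 };

// Type guard used when reading the tier column from the database.
export function isTierName(value: unknown): value is TierName {
  return value === 'free' || value === 'pro' || value === 'enterprise';
}
```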

---

## File 2: `docs/engineering/03-tech-stack.md`

**Operation:** Append new ADR entries after the existing `### ADR-10: Terraform` section.

**Find (the final paragraph of the file):**
```
**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.
```

**Append the following after that paragraph:**

```markdown
---

### ADR-11: Stripe

**Status**: Adopted
**Component**: Billing — subscription management and payment processing

**Decision**: Use Stripe as the payment processing and subscription management platform. The `stripe` npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.

**Rationale**: Stripe's hosted Checkout flow keeps card data out of PCI-DSS scope. The `stripe.webhooks.constructEvent()` method uses HMAC-SHA256 to verify incoming webhook payloads, preventing forgery and replay attacks. The `checkout.session.completed` event carries `metadata: { orgId, targetTier }`, allowing `BillingService` to delegate tier upgrades to `TierService.applyUpgrade()` without coupling billing logic to tier logic.

**Alternatives considered**:
- Paddle — rejected because its global merchant-of-record model introduced complexities with the open-source free tier.
- Braintree — rejected because Stripe's webhook reliability and developer experience are superior.

**Consequences**: Stripe requires `STRIPE_SECRET_KEY` (for API calls) and `STRIPE_WEBHOOK_SECRET` (`whsec_...`, for webhook verification). Per-tier Stripe price IDs are configured via `STRIPE_PRICE_ID_PRO` and `STRIPE_PRICE_ID_ENTERPRISE`. All billing webhook handlers must pass the raw `Buffer` body (not parsed JSON) to `stripe.webhooks.constructEvent()` — use `express.raw()` middleware on the webhook route.

---

### ADR-12: oidc-provider (A2A Delegation)

**Status**: Adopted
**Component**: A2A delegation — OIDC provider for agent-to-agent trust tokens

**Decision**: Use the `oidc-provider` npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at `/oidc` within the Express app.

**Rationale**: `oidc-provider` is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, the token endpoint, and the discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at `/oidc/jwks`.

**Alternatives considered**:
- Custom JWT signing — rejected because a hand-rolled token format forfeits OIDC tooling and interoperability.

**Consequences**: The `A2A_ENABLED` env var gates the OIDC provider — when set to `'false'`, delegation endpoints return 404. The `OIDC_ISSUER` env var must be set to the full base URL of the OIDC provider (e.g. `https://api.sentryagent.ai`).

---

### ADR-13: Next.js 14 (Developer Portal)

**Status**: Adopted
**Component**: Developer Portal (`portal/`) — public-facing documentation and onboarding

**Decision**: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).

**Rationale**: The developer portal has different performance and SEO requirements than the internal operator dashboard (`dashboard/`). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.

**Alternatives considered**:
- Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
- Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.

**Consequences**: The portal (`portal/`) has its own `package.json`, `tsconfig.json`, `tailwind.config.ts`, and `next.config.js`. It is built and run independently: `cd portal && npm install && npm run dev`. The portal calls the AgentIdP REST API using the same `@sentryagent/idp-sdk` as the dashboard.

---

### ADR-14: bull (Job Queue) + kafkajs (Event Streaming)

**Status**: Adopted (opt-in)
**Component**: Async job processing and event streaming

**Decision**: Use `bull` (Redis-backed job queue) for async webhook delivery retries and `kafkajs` for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.

**Rationale**: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. `bull` provides this out of the box using the existing Redis dependency. `kafkajs` enables high-throughput streaming of analytics and audit events to external data pipelines without blocking the primary request path.

**Alternatives considered**:
- BullMQ — considered as a more modern alternative to `bull` but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.

**Consequences**: Kafka is entirely optional. When `KAFKA_BROKERS` is not set, `kafkajs` is not initialised and no events are published. The `bull` queue for webhook delivery requires only the existing Redis instance.

---

### ADR-15: did-resolver + web-did-resolver (W3C DIDs)

**Status**: Adopted
**Component**: W3C DID Core 1.0 document resolution

**Decision**: Use `did-resolver` (v4.1.x) as the DID resolution framework and `web-did-resolver` (v2.0.x) for the `did:web` method implementation.

**Rationale**: `did-resolver` provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The `did:web` method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. `DIDService` generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.

**Consequences**: The `DID_WEB_DOMAIN` env var is required for DID generation. DID documents are cached in Redis (`did:doc:<agentId>`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a `dev:no-vault` marker is stored and keys are ephemeral.
```
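
As background for ADR-11 (illustrative only — not part of the text to append): Stripe's published webhook scheme signs the string `"{timestamp}.{rawBody}"` with HMAC-SHA256, which is exactly why the raw `Buffer` must reach `constructEvent()` — a re-serialised JSON body would change the bytes and fail verification. A minimal sketch of the same check using only `node:crypto`:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Minimal re-implementation of the check stripe.webhooks.constructEvent performs.
// `header` is the Stripe-Signature header, e.g. "t=1700000000,v1=abc123...".
// (Sketch only: the real library also enforces a timestamp tolerance window.)
function verifyStripeSignature(rawBody: string, header: string, secret: string): boolean {
  const parts = new Map(header.split(',').map((kv) => kv.split('=') as [string, string]));
  const t = parts.get('t');
  const v1 = parts.get('v1');
  if (!t || !v1) return false;
  // The signed payload is "<timestamp>.<raw body>" — hence the raw-Buffer requirement.
  const expected = createHmac('sha256', secret).update(`${t}.${rawBody}`).digest('hex');
  if (expected.length !== v1.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(v1));
}
```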

---

## File 3: `docs/engineering/04-codebase-structure.md`

**Operation:** Three surgical edits — update the directory tree, update the `src/` subdirectory table, and extend the "Where to Add New Code" table.

---

### Change 3a — Update the Annotated Directory Tree

**Find (inside the code block in Section 1, after the `sdk-java/` line):**
```
├── policies/  # OPA policy files
```

**Replace the entire block from `├── policies/` down through `└── jest.config.ts  # Jest configuration — ts-jest, test timeouts, coverage thresholds` with the following updated version:**

```
├── sdk-rust/  # Rust SDK (sentryagent-idp crate) — async, tokio, reqwest, typed errors
├── policies/  # OPA policy files
│   ├── authz.rego  # Rego policy — normalise_path + scope-intersection allow rule
│   └── data/scopes.json  # Endpoint permission map — used by Rego and TypeScript fallback
├── portal/  # Developer Portal — Next.js 14 App Router, Tailwind CSS
│   ├── app/  # Next.js App Router pages (get-started, pricing, sdks, analytics, settings, login)
│   ├── components/  # Shared UI components (Nav.tsx, SwaggerExplorer.tsx, GetStartedWizard.tsx)
│   ├── hooks/  # React hooks (useAuth.ts)
│   └── types/  # TypeScript type definitions for portal-only types
├── terraform/  # Terraform infrastructure as code
│   ├── modules/  # Reusable modules: agentidp, lb, rds, redis
│   └── environments/  # Environment configs: aws/ (ECS+RDS+ElastiCache), gcp/ (Cloud Run+SQL+Memorystore)
├── monitoring/  # Prometheus and Grafana configuration
│   ├── prometheus/  # prometheus.yml scrape configuration
│   └── grafana/  # Grafana provisioning YAML and dashboard JSON files
├── docs/  # All project documentation
│   ├── engineering/  # Internal engineering knowledge base (this directory)
│   ├── developers/  # End-user API reference and developer guides
│   ├── devops/  # Operator runbooks and environment variable reference
│   ├── agntcy/  # AGNTCY alignment documentation
│   └── openapi/  # OpenAPI 3.0 specification files
├── openspec/  # OpenSpec change management — proposals, designs, specs, tasks, archives
├── tests/  # Jest test suite — mirrors src/ structure
│   ├── unit/  # Unit tests (mocked dependencies) — mirrors src/
│   ├── integration/  # Integration tests (real DB + Redis)
│   ├── agntcy-conformance/  # AGNTCY conformance test suite (separate Jest config)
│   └── load/  # k6 load test scripts
├── Dockerfile  # Multi-stage production build (build + runtime stages)
├── docker-compose.yml  # Local development: PostgreSQL 14 (port 5432) + Redis 7 (port 6379)
├── docker-compose.monitoring.yml  # Monitoring overlay: Prometheus (port 9090) + Grafana (port 3001)
├── package.json  # Node.js dependencies and npm scripts
├── tsconfig.json  # TypeScript strict configuration — compiled to dist/
└── jest.config.ts  # Jest configuration — ts-jest, test timeouts, coverage thresholds
```

---

### Change 3b — Add New src/ Subdirectories to Section 2

**Find (Section 2 table, the last row):**
```
| `src/cache/` | Redis client factory — creates and caches a single `redis` client instance | Client is a singleton created once in `src/app.ts` and passed to repositories |
```

**Insert these rows after that line:**
```
| `src/config/` | Configuration constants — `tiers.ts` exports `TIER_CONFIG`, `TIER_RANK`, `TierName`, and the `isTierName()` type guard | Imported by `TierService` and `tierMiddleware`; never imports from services |
| `src/middleware/tier.ts` | Tier enforcement middleware — reads the org tier from `TierService`, checks the daily call counter in Redis, throws `TierLimitError` (429) when the limit is exceeded, increments the counter on pass | Applied only to API routes; skips `/health`, `/metrics`, and static file routes |
```

---

### Change 3c — Add New Entries to Section 3 (Where to Add New Code)

**Find (Section 3 table, after the `A new Prometheus metric` row):**
```
| A new TypeScript type used in 2+ files | `src/types/index.ts` | A new `AgentGroupMembership` interface |
```

**Insert these rows after that line:**
```
| A new tier-gated feature | `src/config/tiers.ts` (add limit field) + `src/middleware/tier.ts` (add check) + service (enforce) | Adding a `maxWebhooksPerOrg` tier limit |
| A webhook event handler | `src/services/WebhookService.ts` (add the event type to `WebhookEventType`) + the producer that calls `void webhookService.dispatch(orgId, eventType, payload)` | Emitting `agent.decommissioned` events to subscriber URLs |
| A new analytics metric type | `src/services/AnalyticsService.ts` (call `recordEvent(tenantId, 'new_metric')` in the relevant service using `void`) | Recording `credential_rotated` events for analytics |
| A new DID endpoint | `src/controllers/DIDController.ts` + `src/routes/did.ts` + `src/services/DIDService.ts` (if a new method is needed) + `policies/data/scopes.json` | Adding `GET /api/v1/agents/:id/did/rotate-key` |
```

---

## File 4: `docs/engineering/README.md`

**Operation:** Replace rows in the reading order table and quick reference table to reflect all Phase 6 additions.

---

### Change 4a — Update Reading Order Table

**Find (Section "Reading Order (New Engineers Start Here)", the last row):**
```
| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 4 SDKs — installation, examples, contribution guide | 20 min |
```

**Replace with (adds the Rust SDK to the description and updates the estimated time):**
```
| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 5 SDKs (Node.js, Python, Go, Java, Rust) — installation, examples, contribution guide | 25 min |
```

**Find (the line after the table):**
```
**Total estimated reading time for new engineers: ~3.5 hours**
```

**Replace with:**
```
**Total estimated reading time for new engineers: ~4 hours**
```

---

### Change 4b — Update "Service Deep Dives" Entry

**Find:**
```
| 5 | [Service Deep Dives](05-services.md) | All 8 services/components — purpose, interface, schema, error types | 30 min |
```

**Replace with:**
```
| 5 | [Service Deep Dives](05-services.md) | All 17 services/components (incl. Phase 3–6: AnalyticsService, TierService, ComplianceService, FederationService, DIDService, WebhookService, BillingService, DelegationService, OIDCService) — purpose, interface, schema, error types | 45 min |
```

---

### Change 4c — Update Quick Reference Table

**Find (in the Quick Reference section):**
```
| Integrate with the SDK | [11-sdk-guide.md](11-sdk-guide.md) |
```

**Replace with:**
```
| Integrate with the SDK (Node.js, Python, Go, Java, Rust) | [11-sdk-guide.md](11-sdk-guide.md) |
```

**Find (after the "Integrate with the SDK" row):**
```
| Understand why a technology was chosen | [03-tech-stack.md](03-tech-stack.md) |
```

**Insert after that row:**
```
| Understand tier limits and billing | [01-overview.md](01-overview.md) (Section 6) + [03-tech-stack.md](03-tech-stack.md) (ADR-11) |
| Understand AGNTCY compliance reports | [05-services.md](05-services.md) (ComplianceService) |
| Understand the A2A delegation flow | [06-walkthroughs.md](06-walkthroughs.md) (Walkthrough 4) |
| Run the AGNTCY conformance suite | [09-testing.md](09-testing.md) (Section 10.8) |
| Add a new Rust SDK endpoint | [11-sdk-guide.md](11-sdk-guide.md) (Section 6 contribution guide) |
```

---

## File 5: `docs/engineering/06-walkthroughs.md`

**Operation:** Append three new walkthrough sections at the end of the file.

**Find (the final paragraph of the file):**
```
Returns `ICredentialWithSecret` — the updated credential including the new
`clientSecret`. This is the only time the new secret is ever returned. The caller
must store it securely.
```

**Append the following after that paragraph:**

```markdown
---

## Walkthrough 4 — A2A Delegation End-to-End

**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another

This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.

---

### Step 1 — Route dispatch

**File:** `src/routes/delegation.ts`

```typescript
router.post(
  '/token/delegate',
  asyncHandler(authMiddleware),
  opaMiddleware,
  asyncHandler(delegationController.createDelegation.bind(delegationController))
);
```

Both `authMiddleware` and `opaMiddleware` run. The OPA policy requires the `agents:write` scope for delegation creation.

---

### Step 2 — Controller: extract delegator and validate

**File:** `src/controllers/DelegationController.ts`

```typescript
const delegatorId = req.user.sub; // From the Bearer token's sub claim
const { delegatee_id, scope, expires_at } = req.body;
```

The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.

---

### Step 3 — Service: verify both agents exist

**File:** `src/services/DelegationService.ts`

```typescript
const delegator = await this.agentRepository.findById(delegatorId);
if (!delegator || delegator.status !== 'active') { throw new AgentNotFoundError(delegatorId); }

const delegatee = await this.agentRepository.findById(delegateeId);
if (!delegatee || delegatee.status !== 'active') { throw new AgentNotFoundError(delegateeId); }
```

Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.

---

### Step 4 — Service: insert delegation chain record

**File:** `src/services/DelegationService.ts`

```typescript
await this.pool.query(
  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
   VALUES ($1, $2, $3, $4, 'active', $5)`,
  [chainId, delegatorId, delegateeId, scope, expiresAt]
);
```

The `chain_id` is a UUID generated by the service. The `delegation_chains` table is the authoritative source of truth for which delegations are active, independent of any token.

---

### Step 5 — Response

```json
{
  "chain_id": "f1e2d3c4-...",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "delegator_id": "a1b2c3d4-...",
  "delegatee_id": "b2c3d4e5-...",
  "scope": "agents:read",
  "status": "active",
  "expires_at": "2026-04-05T00:00:00Z"
}
```

The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.

**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent `verify-delegation` calls fail even if the JWT itself has not yet expired.
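
The revocation behaviour can be sketched as a pure check over the `delegation_chains` row. This is an illustrative sketch only — the row shape is assumed, and the real `verify-delegation` handler additionally verifies the JWT signature against the OIDC provider's JWKS:

```typescript
// Hypothetical shape of a delegation_chains row, for illustration.
interface DelegationChainRow {
  chain_id: string;
  status: 'active' | 'revoked';
  expires_at: Date | null;
}

// A delegation is honoured only if the DB record is still active AND unexpired.
// This is why verify-delegation fails after revocation even for a valid JWT.
function isDelegationValid(row: DelegationChainRow | undefined, now: Date): boolean {
  if (!row) return false;                                    // chain never existed
  if (row.status !== 'active') return false;                 // soft-deleted / revoked
  if (row.expires_at && row.expires_at <= now) return false; // past expiry
  return true;
}
```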

---

## Walkthrough 5 — Tier Enforcement Request Lifecycle

**Request:** Any authenticated API request when the organisation's daily call limit is reached

This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.

---

### Step 1 — Auth middleware passes

Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.

---

### Step 2 — Tier middleware: fetch org tier

**File:** `src/middleware/tier.ts`

```typescript
const orgId = req.user.organization_id;
const tier = await tierService.fetchTier(orgId);
const config = TIER_CONFIG[tier];
```

`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1` and returns `'free'` if no row is found (a safe default).

---

### Step 3 — Tier middleware: read daily counter

**File:** `src/middleware/tier.ts`

```typescript
const callsKey = `rate:tier:calls:${orgId}`;
const callsToday = await redis.get(callsKey);
const count = callsToday !== null ? parseInt(callsToday, 10) : 0;

if (count >= config.maxCallsPerDay) {
  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
}
```

The Redis key `rate:tier:calls:<orgId>` is read. If it is null (first call of the day), the count is 0. When the count reaches the tier limit, `TierLimitError` (HTTP 429) is thrown immediately — no further middleware runs.

---

### Step 4 — Tier middleware: increment counter (fire-and-forget)

**File:** `src/middleware/tier.ts`

```typescript
// Atomically increment and (re)set expiry to the next UTC midnight
void redis.multi()
  .incr(callsKey)
  .expireAt(callsKey, nextUtcMidnightUnix())
  .exec();
next();
```

The counter is incremented atomically inside a Redis MULTI block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget — the request proceeds immediately to `opaMiddleware`.

**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users read as resetting at midnight. A rolling window would make the reset moment depend on each organisation's own call history, which is hard to predict and hard to support. A fixed UTC midnight reset is predictable and easy to reason about, at the acceptable cost of permitting a burst of up to twice the quota straddling the midnight boundary.
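
The `nextUtcMidnightUnix()` helper is not shown in this walkthrough; a plausible implementation (an assumption — not copied from `src/middleware/tier.ts`) is a one-liner, since `Date.UTC` handles month and year rollover when given an out-of-range day:

```typescript
// Returns the Unix timestamp (seconds) of the next 00:00:00 UTC strictly after
// `now`, suitable for Redis EXPIREAT. Illustrative sketch, not the real source.
function nextUtcMidnightUnix(now: Date = new Date()): number {
  // Date.UTC normalises day+1 past month/year boundaries automatically.
  const nextMidnightMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.floor(nextMidnightMs / 1000);
}
```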

---

### Step 5 — Error handler serialises TierLimitError

**File:** `src/middleware/errorHandler.ts`

The handler maps `TierLimitError` to an HTTP 429 response:

```json
{
  "code": "TIER_LIMIT_EXCEEDED",
  "message": "Daily API call limit reached for your tier.",
  "details": {
    "tier": "free",
    "limit": 1000,
    "current": 1000
  }
}
```

The `Retry-After` header is set to the number of seconds until the next UTC midnight so clients can implement automatic backoff.

---

## Walkthrough 6 — Analytics Event Capture Flow

**Trigger:** Any successful token issuance (`POST /api/v1/token`)

This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.

---

### Step 1 — Token issuance completes

**File:** `src/services/OAuth2Service.ts`

```typescript
const accessToken = signToken(payload, this.privateKey);
// Primary response is ready — analytics is now fire-and-forget
void this.analyticsService.recordEvent(tenantId, 'token_issued');
tokensIssuedTotal.inc({ scope });
```

The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O), so the controller can send the response immediately. `analyticsService.recordEvent()` is called with `void` — the `await` is deliberately omitted.

**Why `void` instead of `await`?** Token issuance latency must remain below 100 ms (per the QA performance gate), and a PostgreSQL write adds 5–15 ms. Since analytics data is aggregated rather than transactional, losing an occasional event to an error is acceptable. The response is never delayed for analytics.

---

### Step 2 — AnalyticsService: UPSERT daily counter

**File:** `src/services/AnalyticsService.ts`

```typescript
async recordEvent(tenantId: string, metricType: string): Promise<void> {
  try {
    await this.pool.query(
      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
       VALUES ($1, CURRENT_DATE, $2, 1)
       ON CONFLICT (organization_id, date, metric_type)
       DO UPDATE SET count = analytics_events.count + 1`,
      [tenantId, metricType],
    );
  } catch (err) {
    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
  }
}
```

The `ON CONFLICT DO UPDATE` upsert is atomic: whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and swallowed — the token has already been returned to the caller.

**Why one row per day per metric, not one row per event?** A row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data — the granularity analytics dashboards need. Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.

---

### Step 3 — Dashboard query (deferred)

When a developer visits the analytics page in the developer portal, the portal calls:

```
GET /api/v1/analytics/token-trend?days=30
```

**File:** `src/services/AnalyticsService.ts` — `getTokenTrend(tenantId, 30)`

```sql
SELECT
  gs.date::DATE::TEXT AS date,
  COALESCE(ae.count, 0)::INTEGER AS count
FROM generate_series(
  CURRENT_DATE - ($1 - 1) * INTERVAL '1 day',
  CURRENT_DATE,
  INTERVAL '1 day'
) AS gs(date)
LEFT JOIN analytics_events ae
  ON ae.date = gs.date::DATE
  AND ae.organization_id = $2
  AND ae.metric_type = 'token_issued'
ORDER BY gs.date ASC
```

Here `$1` is the number of days (30 in this request) and `$2` is the organization ID. The `generate_series` + `LEFT JOIN` pattern ensures all 30 days appear in the result, with `count: 0` for days with no events, so the client never has to fill in gaps.
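
To see concretely what the SQL buys us, here is the equivalent gap-filling done client-side (illustrative only — the server-side SQL is the actual implementation; `fillDailyGaps` is a hypothetical helper, not part of the codebase):

```typescript
// Fill missing days with count 0, mirroring generate_series + LEFT JOIN.
// `rows` holds only the days that had events; `today` anchors the window end.
function fillDailyGaps(
  rows: { date: string; count: number }[],
  days: number,
  today: Date,
): { date: string; count: number }[] {
  const byDate = new Map(rows.map((r) => [r.date, r.count]));
  const out: { date: string; count: number }[] = [];
  for (let i = days - 1; i >= 0; i--) {
    const d = new Date(Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate() - i));
    const key = d.toISOString().slice(0, 10); // YYYY-MM-DD, matching the SQL's ::TEXT cast
    out.push({ date: key, count: byDate.get(key) ?? 0 });
  }
  return out;
}
```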
```