docs: commit all Phase 6 documentation updates and OpenSpec archives

- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00
parent 0fb00256b4
commit 8cabc0191c
56 changed files with 12780 additions and 446 deletions
--- a/docs/engineering/06-walkthroughs.md
+++ b/docs/engineering/06-walkthroughs.md
@@ -715,3 +715,260 @@ must store it securely.
  "revokedAt": null
 }
 ```
+
+---
+
+## Walkthrough 4 — A2A Delegation End-to-End
+
+**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another
+
+This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.
+
+---
+
+### Step 1 — Route dispatch
+
+**File:** `src/routes/delegation.ts`
+
+```typescript
+router.post(
+  '/token/delegate',
+  asyncHandler(authMiddleware),
+  opaMiddleware,
+  asyncHandler(delegationController.createDelegation.bind(delegationController))
+);
+```
+
+Both `authMiddleware` and `opaMiddleware` run, in that order, before the controller is invoked. The OPA policy requires the `agents:write` scope to create a delegation.
+
+---
+
+### Step 2 — Controller: extract delegator and validate
+
+**File:** `src/controllers/DelegationController.ts`
+
+```typescript
+const delegatorId = req.user.sub;               // From the Bearer token's sub claim
+const { delegatee_id, scope, expires_at } = req.body;
+```
+
+The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.
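
The three validation rules above can be sketched as a small pure function. This is a minimal illustration, not the controller's actual code: `DelegationRequestBody`, `UUID_PATTERN`, and `validateDelegationBody` are hypothetical names, and `Date.parse` is more lenient than strict ISO 8601, so a production check would likely be tighter.

```typescript
// Hypothetical request shape, mirroring the three body fields the controller reads.
interface DelegationRequestBody {
  delegatee_id?: unknown;
  scope?: unknown;
  expires_at?: unknown;
}

// Assumption: any 8-4-4-4-12 hex UUID is accepted, case-insensitively.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns a list of problems; an empty list means the body passed every rule.
function validateDelegationBody(body: DelegationRequestBody, now: Date = new Date()): string[] {
  const errors: string[] = [];

  if (typeof body.delegatee_id !== 'string' || !UUID_PATTERN.test(body.delegatee_id)) {
    errors.push('delegatee_id must be a UUID');
  }
  if (typeof body.scope !== 'string' || body.scope.trim() === '') {
    errors.push('scope must be a non-empty string');
  }
  // expires_at is optional; when present it must parse and lie in the future.
  if (body.expires_at !== undefined) {
    const parsed = typeof body.expires_at === 'string' ? Date.parse(body.expires_at) : NaN;
    if (Number.isNaN(parsed)) {
      errors.push('expires_at must be an ISO 8601 datetime');
    } else if (parsed <= now.getTime()) {
      errors.push('expires_at must be in the future');
    }
  }
  return errors;
}
```

Passing `now` as a parameter keeps the "in the future" rule deterministic in unit tests.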
+
+---
+
+### Step 3 — Service: verify both agents exist
+
+**File:** `src/services/DelegationService.ts`
+
+```typescript
+const delegator = await this.agentRepository.findById(delegatorId);
+if (!delegator || delegator.status !== 'active') {
+  throw new AgentNotFoundError(delegatorId);
+}
+
+const delegatee = await this.agentRepository.findById(delegateeId);
+if (!delegatee || delegatee.status !== 'active') {
+  throw new AgentNotFoundError(delegateeId);
+}
+```
+
+Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.
+
+---
+
+### Step 4 — Service: insert delegation chain record
+
+**File:** `src/services/DelegationService.ts`
+
+```typescript
+await this.pool.query(
+  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
+   VALUES ($1, $2, $3, $4, 'active', $5)`,
+  [chainId, delegatorId, delegateeId, scope, expiresAt]
+);
+```
+
+The `chain_id` is a UUID generated by the service. The `delegation_chains` table provides the authoritative source of truth for which delegations are active, independent of any token.
+
+---
+
+### Step 5 — Response
+
+```json
+{
+  "chain_id": "f1e2d3c4-...",
+  "token": "eyJhbGciOiJSUzI1NiJ9...",
+  "delegator_id": "a1b2c3d4-...",
+  "delegatee_id": "b2c3d4e5-...",
+  "scope": "agents:read",
+  "status": "active",
+  "expires_at": "2026-04-05T00:00:00Z"
+}
+```
+
+The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.
+
+**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent `verify-delegation` calls will fail even if the JWT itself has not yet expired.
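
To make the revocation argument concrete, here is a minimal sketch of the decision a `verify-delegation` handler would have to make. It assumes the JWT signature has already been checked and that the claims carry the chain ID. The type names, the `'revoked'` status value, and `checkDelegation` itself are hypothetical; only the `active` status and the soft-delete behaviour come from the walkthrough.

```typescript
// Hypothetical view of a delegation_chains row.
interface DelegationRecord {
  chainId: string;
  status: 'active' | 'revoked'; // 'revoked' stands in for the soft-deleted state
  expiresAt: Date | null;
}

// Hypothetical claims taken from an already signature-verified delegation JWT.
interface DelegationClaims {
  chainId: string;
  exp: number; // Unix seconds
}

type Verdict = 'valid' | 'token_expired' | 'unknown_chain' | 'revoked' | 'chain_expired';

function checkDelegation(
  claims: DelegationClaims,
  records: Map<string, DelegationRecord>,
  now: Date,
): Verdict {
  const nowSeconds = Math.floor(now.getTime() / 1000);
  if (claims.exp <= nowSeconds) return 'token_expired';

  // The database record, not the token, decides whether the delegation still stands.
  const record = records.get(claims.chainId);
  if (record === undefined) return 'unknown_chain';
  if (record.status !== 'active') return 'revoked';
  if (record.expiresAt !== null && record.expiresAt.getTime() <= now.getTime()) {
    return 'chain_expired';
  }
  return 'valid';
}
```

A token that is still within its `exp` returns `'revoked'` as soon as the record leaves the `active` state, which is the behaviour the DB record exists to provide.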
+
+---
+
+## Walkthrough 5 — Tier Enforcement Request Lifecycle
+
+**Request:** Any authenticated API request when the organisation's daily call limit is reached
+
+This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.
+
+---
+
+### Step 1 — Auth middleware passes
+
+Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.
+
+---
+
+### Step 2 — Tier middleware: fetch org tier
+
+**File:** `src/middleware/tier.ts`
+
+```typescript
+const orgId = req.user.organization_id;
+const tier = await tierService.fetchTier(orgId);
+const config = TIER_CONFIG[tier];
+```
+
+`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1` and returns `'free'` if no row is found (a safe default).
+
+---
+
+### Step 3 — Tier middleware: read daily counter
+
+**File:** `src/middleware/tier.ts`
+
+```typescript
+const callsKey = `rate:tier:calls:${orgId}`;
+const callsToday = await redis.get(callsKey);
+const count = callsToday !== null ? parseInt(callsToday, 10) : 0;
+
+if (count >= config.maxCallsPerDay) {
+  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
+}
+```
+
+The Redis key `rate:tier:calls:<orgId>` is read. If null (first call of the day), count is 0. When count equals or exceeds the tier limit, `TierLimitError` (HTTP 429) is thrown immediately — no further middleware runs.
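
The tier lookup and limit check from Steps 2 and 3 reduce to two pure functions, sketched below without Redis or PostgreSQL. Only the free-tier limit of 1000 is taken from this walkthrough; the `'pro'` tier, its limit, and the helper names `resolveTier` and `isOverDailyLimit` are assumptions made for illustration.

```typescript
// Hypothetical tier table: 'pro' and its limit are invented for this sketch.
type Tier = 'free' | 'pro';

const TIER_CONFIG: Record<Tier, { maxCallsPerDay: number }> = {
  free: { maxCallsPerDay: 1000 },
  pro: { maxCallsPerDay: 100000 },
};

// Maps a possibly missing organizations row to a tier, defaulting to 'free'.
function resolveTier(row: { tier: string } | undefined): Tier {
  return row !== undefined && row.tier === 'pro' ? 'pro' : 'free';
}

// Mirrors the Step 3 check: a missing Redis key counts as zero calls so far.
function isOverDailyLimit(rawCounter: string | null, tier: Tier): boolean {
  const count = rawCounter !== null ? parseInt(rawCounter, 10) : 0;
  return count >= TIER_CONFIG[tier].maxCallsPerDay;
}
```

With these definitions the 1000th call on the free tier is allowed (the counter reads `999` beforehand) and the 1001st is rejected.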
+
+---
+
+### Step 4 — Tier middleware: increment counter (fire-and-forget)
+
+**File:** `src/middleware/tier.ts`
+
+```typescript
+// EXPIREAT runs on every call; the target (next UTC midnight) is the same all day
+void redis.multi()
+  .incr(callsKey)
+  .expireAt(callsKey, nextUtcMidnightUnix())
+  .exec();
+next();
+```
+
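+The counter is incremented atomically using a Redis MULTI block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget: the request proceeds immediately to `opaMiddleware`. Because the read in Step 3 and this increment are separate commands, concurrent requests arriving near the limit can overshoot it by a small margin.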
+
+**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users interpret as resetting at midnight. A rolling window would reset at a different moment for each organisation, depending on when its calls were made, which is harder to predict and to explain. The cost of the fixed window is that an organisation can spend one day's quota just before midnight and the next day's quota just after it. UTC midnight is predictable and easy to reason about.
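
The snippet in this step calls `nextUtcMidnightUnix()` without showing its body. One plausible implementation is sketched below; the optional `now` parameter is an assumption added to make the function testable and may not exist in the real helper.

```typescript
// Returns the Unix timestamp (in seconds) of the next 00:00:00 UTC after `now`.
function nextUtcMidnightUnix(now: Date = new Date()): number {
  // Date.UTC normalises day overflow, so "day + 1" also works on the last day of a month or year.
  const midnightMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.floor(midnightMs / 1000);
}
```

Every call made on the same UTC day yields the same timestamp, which is why reissuing `EXPIREAT` on each request does not push the reset time back.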
+
+---
+
+### Step 5 — Error handler serialises TierLimitError
+
+**File:** `src/middleware/errorHandler.ts`
+
+```json
+HTTP 429
+{
+  "code": "TIER_LIMIT_EXCEEDED",
+  "message": "Daily API call limit reached for your tier.",
+  "details": {
+    "tier": "free",
+    "limit": 1000,
+    "current": 1000
+  }
+}
+```
+
+The `Retry-After` header is set to the number of seconds until next UTC midnight so clients can implement automatic backoff.
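
On the client side, the header can be turned into a wait time with a few lines. The sketch below is illustrative and not part of any SDK described here; `retryDelayMs` is a hypothetical name. It accepts both forms that HTTP permits for `Retry-After` (a number of seconds or an HTTP date), although this service is described as sending seconds.

```typescript
// Converts a Retry-After header value into a delay in milliseconds.
// Returns null when the header is absent or cannot be parsed.
function retryDelayMs(retryAfter: string | null, now: Date = new Date()): number | null {
  if (retryAfter === null) return null;
  const value = retryAfter.trim();

  // Form 1: delay in seconds, e.g. "3600".
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: HTTP date, e.g. "Wed, 08 Apr 2026 00:00:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - now.getTime());
}
```

A caller that receives HTTP 429 can sleep for the returned delay before retrying, and fall back to its own backoff policy when the result is `null`.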
+
+---
+
+## Walkthrough 6 — Analytics Event Capture Flow
+
+**Trigger:** Any successful token issuance (`POST /api/v1/token`)
+
+This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.
+
+---
+
+### Step 1 — Token issuance completes
+
+**File:** `src/services/OAuth2Service.ts`
+
+```typescript
+const accessToken = signToken(payload, this.privateKey);
+// Primary response is ready — analytics is now fire-and-forget
+void this.analyticsService.recordEvent(tenantId, 'token_issued');
+tokensIssuedTotal.inc({ scope });
+```
+
+The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O), so the controller can send the response as soon as it returns. `analyticsService.recordEvent()` is called with `void`; the `await` is deliberately omitted.
+
+**Why `void` instead of `await`?** Token issuance latency must remain below 100 ms (per the QA performance gate), and a PostgreSQL write adds 5–15 ms. Since analytics data is aggregated (not transactional), losing an occasional event due to an error is acceptable. The response is never delayed for analytics.
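
The ordering that `void` produces can be seen in isolation. In this sketch `fireAndForget` and `recordEventStub` are hypothetical stand-ins, and `await Promise.resolve()` takes the place of the PostgreSQL round trip.

```typescript
// Starts an async task without waiting for it; failures go to onError instead of the caller.
function fireAndForget(task: () => Promise<void>, onError: (err: unknown) => void): void {
  // .catch() keeps a rejected task from surfacing as an unhandled rejection.
  void task().catch(onError);
}

const timeline: string[] = [];

// Stand-in for analyticsService.recordEvent().
async function recordEventStub(): Promise<void> {
  timeline.push('analytics started');
  await Promise.resolve(); // stands in for the database write
  timeline.push('analytics finished');
}

fireAndForget(recordEventStub, () => timeline.push('analytics failed'));
timeline.push('response sent');
```

The task runs synchronously up to its first `await`, so `'analytics started'` is recorded first, `'response sent'` second, and `'analytics finished'` only after the current call stack unwinds.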
+
+---
+
+### Step 2 — AnalyticsService: UPSERT daily counter
+
+**File:** `src/services/AnalyticsService.ts`
+
+```typescript
+async recordEvent(tenantId: string, metricType: string): Promise<void> {
+  try {
+    await this.pool.query(
+      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
+       VALUES ($1, CURRENT_DATE, $2, 1)
+       ON CONFLICT (organization_id, date, metric_type)
+       DO UPDATE SET count = analytics_events.count + 1`,
+      [tenantId, metricType],
+    );
+  } catch (err) {
+    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
+  }
+}
+```
+
+The `ON CONFLICT DO UPDATE` upsert is atomic. Whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and logged rather than rethrown, so a failed analytics write never affects the token response.
+
+**Why one row per day per metric, not one row per event?** Storing a row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data (the granularity that analytics dashboards need). Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.
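
The effect of the upsert on table size can be modelled in memory. This is only an illustration of the aggregation semantics: `recordEventInMemory` is a hypothetical helper, and it buckets by UTC calendar date, whereas `CURRENT_DATE` in PostgreSQL follows the session time zone.

```typescript
// In-memory model of the daily upsert: one entry per (organization, date, metric).
function recordEventInMemory(
  counts: Map<string, number>,
  organizationId: string,
  metricType: string,
  now: Date = new Date(),
): number {
  const day = now.toISOString().slice(0, 10); // UTC calendar date, e.g. "2026-04-07"
  const key = `${organizationId}|${day}|${metricType}`; // mirrors the conflict target
  const next = (counts.get(key) ?? 0) + 1; // insert with 1, or bump the existing count
  counts.set(key, next);
  return next;
}
```

However many events arrive on a given day, the map gains at most one entry per organisation and metric, which is the property that keeps the real table compact.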
+
+---
+
+### Step 3 — Dashboard query (deferred)
+
+When a developer visits the analytics page in the developer portal, the portal calls:
+
+```
+GET /api/v1/analytics/token-trend?days=30
+```
+
+**File:** `src/services/AnalyticsService.ts` — `getTokenTrend(tenantId, 30)`
+
+```sql
+SELECT
+  gs.date::DATE::TEXT AS date,
+  COALESCE(ae.count, 0)::INTEGER AS count
+FROM generate_series(
+  CURRENT_DATE - ($1::INTEGER - 1) * INTERVAL '1 day',  -- assumed binding: $1 = days (30 here), $2 = organization_id
+  CURRENT_DATE,
+  INTERVAL '1 day'
+) AS gs(date)
+LEFT JOIN analytics_events ae
+  ON ae.date = gs.date::DATE
+  AND ae.organization_id = $2
+  AND ae.metric_type = 'token_issued'
+ORDER BY gs.date ASC
+```
+
+The `generate_series` + `LEFT JOIN` pattern ensures all 30 days appear in the result, with `count: 0` for days with no events. This avoids the need for the client to fill in gaps.
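
For comparison, this is roughly the gap-filling a client would have to do if the query returned only days that have events. The sketch is hypothetical (`TrendPoint` and `fillTrend` are not part of the codebase) and works in UTC dates.

```typescript
interface TrendPoint {
  date: string; // "YYYY-MM-DD"
  count: number;
}

// Expands sparse rows into one point per day for the `days` days ending at `today` (UTC).
function fillTrend(rows: TrendPoint[], days: number, today: Date): TrendPoint[] {
  const byDate = new Map<string, number>();
  for (const row of rows) byDate.set(row.date, row.count);

  // Date.UTC normalises day underflow, so subtracting past the 1st rolls into the previous month.
  const startMs = Date.UTC(
    today.getUTCFullYear(),
    today.getUTCMonth(),
    today.getUTCDate() - (days - 1),
  );
  const dayMs = 24 * 60 * 60 * 1000; // UTC has no DST, so every day is exactly this long

  const result: TrendPoint[] = [];
  for (let i = 0; i < days; i++) {
    const date = new Date(startMs + i * dayMs).toISOString().slice(0, 10);
    result.push({ date, count: byDate.get(date) ?? 0 });
  }
  return result;
}
```

Doing this once in SQL means every consumer of the endpoint receives the same dense series and none of them needs to repeat this date arithmetic.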