docs: commit all Phase 6 documentation updates and OpenSpec archives

- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-04-07 02:24:24 +00:00
parent 0fb00256b4
commit 8cabc0191c
56 changed files with 12780 additions and 446 deletions


@@ -715,3 +715,260 @@ must store it securely.
"revokedAt": null
}
```
---
## Walkthrough 4 — A2A Delegation End-to-End
**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another
This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.
---
### Step 1 — Route dispatch
**File:** `src/routes/delegation.ts`
```typescript
router.post(
  '/token/delegate',
  asyncHandler(authMiddleware),
  opaMiddleware,
  asyncHandler(delegationController.createDelegation.bind(delegationController))
);
```
Both `authMiddleware` and `opaMiddleware` run, in that order, before the controller is invoked. The OPA policy requires scope `agents:write` for delegation creation.
---
### Step 2 — Controller: extract delegator and validate
**File:** `src/controllers/DelegationController.ts`
```typescript
const delegatorId = req.user.sub; // From the Bearer token's sub claim
const { delegatee_id, scope, expires_at } = req.body;
```
The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.
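The three checks described above can be sketched as a pure function. This is a minimal illustration, not the controller's actual code: `DelegationRequestBody`, `validateDelegationBody`, and `UUID_RE` are hypothetical names, and `Date.parse` is more permissive than strict ISO 8601, so a production validator would likely be stricter.

```typescript
// Hypothetical request shape; field names follow the destructuring shown above.
interface DelegationRequestBody {
  delegatee_id?: unknown;
  scope?: unknown;
  expires_at?: unknown;
}

// Generic 8-4-4-4-12 hexadecimal UUID layout (version bits are not checked).
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns a list of validation problems; an empty list means the body is acceptable.
function validateDelegationBody(body: DelegationRequestBody, now: Date = new Date()): string[] {
  const errors: string[] = [];
  if (typeof body.delegatee_id !== 'string' || !UUID_RE.test(body.delegatee_id)) {
    errors.push('delegatee_id must be a UUID');
  }
  if (typeof body.scope !== 'string' || body.scope.trim() === '') {
    errors.push('scope must be a non-empty string');
  }
  // expires_at is optional; when present it must parse and lie in the future.
  if (body.expires_at !== undefined) {
    const ts = typeof body.expires_at === 'string' ? Date.parse(body.expires_at) : NaN;
    if (Number.isNaN(ts) || ts <= now.getTime()) {
      errors.push('expires_at must be a valid datetime in the future');
    }
  }
  return errors;
}
```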
---
### Step 3 — Service: verify both agents exist
**File:** `src/services/DelegationService.ts`
```typescript
const delegator = await this.agentRepository.findById(delegatorId);
if (!delegator || delegator.status !== 'active') {
  throw new AgentNotFoundError(delegatorId);
}
const delegatee = await this.agentRepository.findById(delegateeId);
if (!delegatee || delegatee.status !== 'active') {
  throw new AgentNotFoundError(delegateeId);
}
```
Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.
---
### Step 4 — Service: insert delegation chain record
**File:** `src/services/DelegationService.ts`
```typescript
await this.pool.query(
  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
   VALUES ($1, $2, $3, $4, 'active', $5)`,
  [chainId, delegatorId, delegateeId, scope, expiresAt]
);
```
The `chain_id` is a UUID generated by the service. The `delegation_chains` table provides the authoritative source of truth for which delegations are active, independent of any token.
---
### Step 5 — Response
```json
{
  "chain_id": "f1e2d3c4-...",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "delegator_id": "a1b2c3d4-...",
  "delegatee_id": "b2c3d4e5-...",
  "scope": "agents:read",
  "status": "active",
  "expires_at": "2026-04-05T00:00:00Z"
}
```
The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.
**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent `verify-delegation` calls will fail even if the JWT itself has not yet expired.
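To make the revocation argument concrete, the sketch below models a verification decision that consults both the token expiry and the stored chain record. It is an assumption-laden illustration: `ChainRecord`, `isDelegationValid`, the `'revoked'` status value, and the in-memory `Map` standing in for the `delegation_chains` table are all hypothetical, and JWT signature checking is left out entirely.

```typescript
// Hypothetical in-memory stand-in for one row of the delegation_chains table.
interface ChainRecord {
  status: 'active' | 'revoked'; // 'revoked' is an assumed name for the soft-deleted state
  expiresAtMs: number | null;   // null means no expiry was stored
}

// A delegation is honoured only if the token is unexpired AND the stored record is still active.
function isDelegationValid(
  chains: Map<string, ChainRecord>,
  chainId: string,
  tokenExpMs: number,
  nowMs: number,
): boolean {
  if (tokenExpMs <= nowMs) return false; // the JWT itself has expired
  const record = chains.get(chainId);
  if (record === undefined || record.status !== 'active') return false; // revoked or unknown chain
  return record.expiresAtMs === null || record.expiresAtMs > nowMs;
}
```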
---
## Walkthrough 5 — Tier Enforcement Request Lifecycle
**Request:** Any authenticated API request when the organisation's daily call limit is reached
This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.
---
### Step 1 — Auth middleware passes
Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.
---
### Step 2 — Tier middleware: fetch org tier
**File:** `src/middleware/tier.ts`
```typescript
const orgId = req.user.organization_id;
const tier = await tierService.fetchTier(orgId);
const config = TIER_CONFIG[tier];
```
`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1` and returns `'free'` if no row is found, so an unknown organisation falls back to the free tier's limits (a safe default).
---
### Step 3 — Tier middleware: read daily counter
**File:** `src/middleware/tier.ts`
```typescript
const callsKey = `rate:tier:calls:${orgId}`;
const callsToday = await redis.get(callsKey);
const count = callsToday !== null ? parseInt(callsToday, 10) : 0;
if (count >= config.maxCallsPerDay) {
  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
}
```
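The Redis key `rate:tier:calls:<orgId>` is read. If it is null (the first call of the day), the count is 0. When the count equals or exceeds the tier limit, `TierLimitError` (HTTP 429) is thrown immediately: the counter is not incremented, `opaMiddleware` and the controller never run, and the error passes straight to the error handler.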
---
### Step 4 — Tier middleware: increment counter (fire-and-forget)
**File:** `src/middleware/tier.ts`
```typescript
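// Increment the counter and (re)set its expiry to the next UTC midnight.
// Repeating EXPIREAT with the same timestamp on every call is harmless.
void redis.multi()
  .incr(callsKey)
  .expireAt(callsKey, nextUtcMidnightUnix())
  .exec()
  .catch((err) => console.error('[tierMiddleware] counter increment failed', err)); // avoid an unhandled rejection
next();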
```
The counter is incremented atomically using a Redis MULTI block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget — the request proceeds immediately to `opaMiddleware`.
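**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users interpret as resetting at midnight. With a rolling window, capacity would return gradually as individual calls age past 24 hours, so the moment a blocked client may retry would depend on its own call history, which is harder to reason about and to report in `Retry-After`. UTC midnight is predictable and easy to reason about. The trade-off is that a fixed window lets an organisation spend up to two days' quota in a short burst straddling midnight.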
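The middleware code above calls `nextUtcMidnightUnix()` without showing it. One plausible implementation is sketched below; the optional `now` parameter is an assumption added for testability, and the real helper may differ.

```typescript
// Returns the Unix timestamp (in seconds) of the next 00:00:00 UTC strictly after `now`.
function nextUtcMidnightUnix(now: Date = new Date()): number {
  // Date.UTC normalises day overflow, so "day + 1" rolls over month and year boundaries.
  const midnightMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.floor(midnightMs / 1000);
}
```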
---
### Step 5 — Error handler serialises TierLimitError
**File:** `src/middleware/errorHandler.ts`
Response status: HTTP 429
```json
{
  "code": "TIER_LIMIT_EXCEEDED",
  "message": "Daily API call limit reached for your tier.",
  "details": {
    "tier": "free",
    "limit": 1000,
    "current": 1000
  }
}
```
The `Retry-After` header is set to the number of seconds until next UTC midnight so clients can implement automatic backoff.
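On the client side, the header can be turned into a wait time roughly as follows. This is a sketch only: `retryDelayMs` and its fallback value are hypothetical, and it handles just the delta-seconds form described above, not the HTTP-date form that `Retry-After` also permits in general.

```typescript
// Converts a Retry-After header value (delta-seconds form) into a wait in milliseconds.
// Falls back to `fallbackMs` when the header is missing or not a plain non-negative integer.
function retryDelayMs(retryAfter: string | null, fallbackMs: number = 60000): number {
  if (retryAfter === null) return fallbackMs;
  const trimmed = retryAfter.trim();
  if (!/^\d+$/.test(trimmed)) return fallbackMs;
  return Number(trimmed) * 1000;
}
```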
---
## Walkthrough 6 — Analytics Event Capture Flow
**Trigger:** Any successful token issuance (`POST /api/v1/token`)
This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.
---
### Step 1 — Token issuance completes
**File:** `src/services/OAuth2Service.ts`
```typescript
const accessToken = signToken(payload, this.privateKey);
// Primary response is ready — analytics is now fire-and-forget
void this.analyticsService.recordEvent(tenantId, 'token_issued');
tokensIssuedTotal.inc({ scope });
```
The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O). The controller can now send the response. `analyticsService.recordEvent()` is called with `void` — the `await` is deliberately omitted.
**Why `void` instead of `await`?** Token issuance latency must remain below 100 ms (per the QA performance gate). A PostgreSQL write adds 5 to 15 ms. Since analytics data is aggregated (not transactional), losing an occasional event due to an error is acceptable. The response is never delayed for analytics.
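The fire-and-forget pattern can be shown in isolation. In the sketch below, `issueToken` and `RecordEvent` are hypothetical stand-ins for the service code, and the token string is a placeholder rather than a real signature. The point is that the caller gets its return value even when the recorder rejects.

```typescript
// Hypothetical recorder signature, mirroring recordEvent(tenantId, metricType).
type RecordEvent = (tenantId: string, metricType: string) => Promise<void>;

function issueToken(tenantId: string, recordEvent: RecordEvent): string {
  const token = `signed-token-for-${tenantId}`; // placeholder for real signing
  // Start the write but do not await it; the catch keeps a failed write
  // from surfacing as an unhandled promise rejection.
  void recordEvent(tenantId, 'token_issued').catch(() => undefined);
  return token;
}
```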
---
### Step 2 — AnalyticsService: UPSERT daily counter
**File:** `src/services/AnalyticsService.ts`
```typescript
async recordEvent(tenantId: string, metricType: string): Promise<void> {
  try {
    await this.pool.query(
      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
       VALUES ($1, CURRENT_DATE, $2, 1)
       ON CONFLICT (organization_id, date, metric_type)
       DO UPDATE SET count = analytics_events.count + 1`,
      [tenantId, metricType],
    );
  } catch (err) {
    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
  }
}
```
The `ON CONFLICT DO UPDATE` upsert is atomic. Whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and swallowed — the token has already been returned to the caller.
**Why one row per day per metric, not one row per event?** Storing a row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data (the granularity that analytics dashboards need). Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.
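The daily aggregate model can be mimicked in memory to show why the row count stays small. `DailyCounters` is a hypothetical illustration, not part of the codebase, and it buckets by UTC date, whereas `CURRENT_DATE` in PostgreSQL follows the session time zone.

```typescript
// In-memory model of a table keyed by (organization_id, date, metric_type).
class DailyCounters {
  private readonly rows = new Map<string, number>();

  // Mirrors the upsert: create the row with count 1, or add 1 to the existing row.
  record(orgId: string, metricType: string, at: Date = new Date()): number {
    const day = at.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
    const key = `${orgId}|${day}|${metricType}`;
    const next = (this.rows.get(key) ?? 0) + 1;
    this.rows.set(key, next);
    return next;
  }

  // Number of stored rows, which grows per day and metric, not per event.
  get rowCount(): number {
    return this.rows.size;
  }
}
```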
---
### Step 3 — Dashboard query (deferred)
When a developer visits the analytics page in the developer portal, the portal calls:
```
GET /api/v1/analytics/token-trend?days=30
```
**File:** `src/services/AnalyticsService.ts`, method `getTokenTrend(tenantId, 30)`
```sql
SELECT
  gs.date::DATE::TEXT AS date,
  COALESCE(ae.count, 0)::INTEGER AS count
FROM generate_series(
  CURRENT_DATE - 29 * INTERVAL '1 day',
  CURRENT_DATE,
  INTERVAL '1 day'
) AS gs(date)
LEFT JOIN analytics_events ae
  ON ae.date = gs.date::DATE
  AND ae.organization_id = $2
  AND ae.metric_type = 'token_issued'
ORDER BY gs.date ASC
```
The `generate_series` + `LEFT JOIN` pattern ensures all 30 days appear in the result, with `count: 0` for days with no events. This avoids the need for the client to fill in gaps.
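For readers less familiar with `generate_series`, the same gap-filling idea can be expressed in TypeScript. This is only an illustration of what the query computes: `fillTrend` and `TrendPoint` are hypothetical names, and the sketch uses UTC dates rather than the database session's time zone.

```typescript
interface TrendPoint {
  date: string; // YYYY-MM-DD
  count: number;
}

// Produces one point per day for the last `days` days (today included),
// using 0 where `counts` has no entry: the equivalent of LEFT JOIN plus COALESCE.
function fillTrend(counts: Map<string, number>, days: number, today: Date = new Date()): TrendPoint[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const todayUtcMs = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  const points: TrendPoint[] = [];
  for (let offset = days - 1; offset >= 0; offset--) {
    const date = new Date(todayUtcMs - offset * msPerDay).toISOString().slice(0, 10);
    points.push({ date, count: counts.get(date) ?? 0 });
  }
  return points;
}
```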