# WS5 — Remaining Documentation Updates

**Targets:** 5 separate files with surgical edits.
## File 1: `docs/engineering/01-overview.md`

**Operation:** Replace the Phase Roadmap table (Section 4) to reflect Phase 3–6 completion status and add the Phase 3–6 capabilities to the Product Features table.
### Change 1a — Update Phase Roadmap Table
Find (Section 4, the Phase 3 row):
| Phase 3 — Enterprise | PLANNED | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, advanced compliance reporting, SOC 2 Type II certification, enterprise tier (custom retention, SLAs, advanced RBAC) |
Replace with (3 rows — Phase 3 was completed and Phases 4–6 have been added):
| Phase 3 — Enterprise | COMPLETE | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, OIDC provider (A2A delegation), Rust SDK, developer portal (Next.js 14) |
| Phase 4 — Compliance & Security | COMPLETE | AGNTCY compliance reports (agent-identity + audit-trail sections), audit hash chain verification, SOC 2 CC6.1 AES-256-CBC column encryption (`EncryptionService`), DID document caching, federation partner JWKS caching |
| Phase 5 — Scale & Ecosystem | COMPLETE | Multi-tier subscription model (free/pro/enterprise), Stripe billing integration (`BillingService`, `TierService`), tier enforcement middleware (daily call and token limits), webhook subscriptions + delivery history (`WebhookService`), analytics service (daily event aggregation + trend queries) |
| Phase 6 — Market Expansion | COMPLETE | AGNTCY conformance test suite (4 conformance scenarios), API tiers enforced end-to-end, analytics dashboard in developer portal, full Phase 6 engineering documentation update |
### Change 1b — Add Phase 3–6 Capabilities to Product Features Table
Find (Section 3, the last row of the features table):
| Health Check | `GET /health` | Checks PostgreSQL and Redis connectivity; unauthenticated; used by load balancers |
Insert the following rows after that line (i.e. at the end of the table):
| W3C Decentralised Identifiers | `GET /api/v1/agents/:id/did`, `GET /api/v1/.well-known/did.json` | DID Core 1.0 documents; `did:web` method; EC P-256 keys; AGNTCY extension fields |
| AGNTCY Agent Cards | `GET /api/v1/agents/:id/card` | Machine-readable agent identity summary; AGNTCY schema v1.0 |
| AGNTCY Compliance Reports | `GET /api/v1/compliance/report`, `GET /api/v1/compliance/agent-cards` | Compliance sections: agent-identity + audit-trail; cached 5 min; AGNTCY schema v1.0 |
| Federation (Cross-IdP) | `POST /api/v1/federation/partners`, `GET /api/v1/federation/partners`, `POST /api/v1/federation/verify` | Register partner IdPs; verify cross-IdP JWTs using cached partner JWKS |
| A2A Delegation | `POST /api/v1/oauth2/token/delegate`, `POST /api/v1/oauth2/token/verify-delegation` | Agent-to-agent delegation tokens; OIDC provider (oidc-provider v9) mounted at `/oidc` |
| Webhook Subscriptions | `POST /api/v1/webhooks`, `GET /api/v1/webhooks`, `GET /api/v1/webhooks/:id/deliveries` | Outbound event delivery with HMAC signing; Vault-backed secrets; delivery history |
| Tier Management | `GET /api/v1/tiers/status`, `POST /api/v1/tiers/upgrade` | Free / Pro / Enterprise tiers; daily call and token limits; Stripe Checkout upgrade flow |
| Billing | `POST /api/v1/billing/checkout`, `POST /api/v1/billing/webhook`, `GET /api/v1/billing/status` | Stripe subscription management; webhook event processing |
| Analytics | Internal (via `AnalyticsService`) | Daily aggregated event counts per org; token trend queries (up to 90 days); agent activity heatmap; usage summary |
| Developer Portal | `/portal` (Next.js 14, separate process) | Get-started wizard, SDK explorer, API reference, analytics dashboard, pricing page |
### Change 1c — Update Free Tier Limits Table
Find (Section 6, entire table):
| Limit | Value |
|-------|-------|
| Max agents | 100 |
| Max credentials per agent | No hard cap enforced in code (5 is the documented recommendation) |
| Max tokens in flight | 10,000 per agent per calendar month |
| Token TTL | 3,600 seconds (1 hour) |
| Audit log retention | 90 days |
| API rate limit | 100 requests per minute per IP address |
Replace with:
| Limit | Free Tier | Pro Tier | Enterprise Tier |
|-------|-----------|----------|-----------------|
| Max agents | 100 | 1,000 | Unlimited |
| Max API calls per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Max tokens per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Token TTL | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) |
| Audit log retention | 90 days | 1 year | Custom |
| API rate limit (per IP) | 100 req/min | 100 req/min | 100 req/min |
| Webhook subscriptions | 0 | 10 | Unlimited |
| Analytics retention | 90 days | 1 year | Custom |
Tier limits are configured in `src/config/tiers.ts` (`TIER_CONFIG`). Enforcement is handled by `TierService.enforceAgentLimit()` (agent cap) and `src/middleware/tier.ts` (daily call/token caps). Tier upgrades are initiated via `POST /api/v1/tiers/upgrade` and confirmed via the Stripe webhook.
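For orientation, a minimal sketch of what the `TIER_CONFIG` shape could look like, consistent with the tier table above. The field name `maxCallsPerDay` appears in the tier middleware and `isTierName()` is documented in `04-codebase-structure.md`; all numeric values and the remaining field names are illustrative assumptions, since the real values live in `src/config/tiers.ts`:

```typescript
// Sketch only: the real values live in src/config/tiers.ts (TIER_CONFIG).
export type TierName = 'free' | 'pro' | 'enterprise';

export interface TierLimits {
  maxAgents: number;       // Infinity models "Unlimited" (enterprise)
  maxCallsPerDay: number;  // checked by src/middleware/tier.ts
  maxTokensPerDay: number;
}

export const TIER_CONFIG: Record<TierName, TierLimits> = {
  free:       { maxAgents: 100,      maxCallsPerDay: 1_000,    maxTokensPerDay: 10_000 },
  pro:        { maxAgents: 1_000,    maxCallsPerDay: 50_000,   maxTokensPerDay: 500_000 },
  enterprise: { maxAgents: Infinity, maxCallsPerDay: Infinity, maxTokensPerDay: Infinity },
};

// Type guard exported alongside TIER_CONFIG (per 04-codebase-structure.md).
export function isTierName(value: unknown): value is TierName {
  return value === 'free' || value === 'pro' || value === 'enterprise';
}
```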
## File 2: `docs/engineering/03-tech-stack.md`

**Operation:** Append new ADR entries after the existing `### ADR-10: Terraform` section.
Find (the final paragraph of the file):
**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.
Append the following after that paragraph:
---
### ADR-11: Stripe
**Status**: Adopted
**Component**: Billing — subscription management and payment processing
**Decision**: Use Stripe as the payment processing and subscription management platform. The `stripe` npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.
**Rationale**: Stripe's hosted Checkout flow eliminates the need to handle PCI-DSS scope for card data. The `stripe.webhooks.constructEvent()` method uses HMAC-SHA256 to verify incoming webhook payloads, preventing replay attacks. The `checkout.session.completed` event carries `metadata: { orgId, targetTier }`, allowing `BillingService` to delegate tier upgrades to `TierService.applyUpgrade()` without coupling billing logic to tier logic.
**Alternatives considered**:
- Paddle — rejected because its global merchant-of-record model added complexity for the open-source free tier.
- Braintree — rejected because Stripe's webhook reliability and developer experience are superior.
**Consequences**: Stripe requires `STRIPE_SECRET_KEY` (for API calls) and `STRIPE_WEBHOOK_SECRET` (`whsec_...`, for webhook verification). Per-tier Stripe price IDs are configured via `STRIPE_PRICE_ID_PRO` and `STRIPE_PRICE_ID_ENTERPRISE`. All billing webhook handlers must pass the raw `Buffer` body (not parsed JSON) to `stripe.webhooks.constructEvent()` — use `express.raw()` middleware on the webhook route.
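A minimal sketch of the webhook route wiring described above; the route path and handler details are illustrative rather than the actual implementation:

```typescript
import express from 'express';
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const router = express.Router();

// express.raw() preserves the exact Buffer that constructEvent() needs;
// a JSON-parsed body would fail HMAC-SHA256 signature verification.
router.post('/api/v1/billing/webhook', express.raw({ type: 'application/json' }), (req, res) => {
  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body, // raw Buffer, untouched by express.json()
      req.headers['stripe-signature'] as string,
      process.env.STRIPE_WEBHOOK_SECRET!, // whsec_...
    );
  } catch {
    res.status(400).send('invalid signature');
    return;
  }

  if (event.type === 'checkout.session.completed') {
    const session = event.data.object as Stripe.Checkout.Session;
    const { orgId, targetTier } = session.metadata ?? {};
    // BillingService would delegate to TierService.applyUpgrade(orgId, targetTier) here.
    console.log('checkout complete', orgId, targetTier);
  }
  res.json({ received: true });
});
```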
---
### ADR-12: oidc-provider (A2A Delegation)
**Status**: Adopted
**Component**: A2A delegation — OIDC provider for agent-to-agent trust tokens
**Decision**: Use the `oidc-provider` npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at `/oidc` within the Express app.
**Rationale**: `oidc-provider` is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, token endpoint, and discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at `/oidc/jwks`.
**Alternatives considered**:
- Custom JWT signing — rejected because hand-rolled token formats cannot benefit from OIDC tooling and interoperability.
**Consequences**: `A2A_ENABLED` env var gates the OIDC provider — when set to `'false'`, delegation endpoints return 404. The `OIDC_ISSUER` env var must be set to the full base URL of the OIDC provider (e.g. `https://api.sentryagent.ai`).
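A sketch of the gating and mounting described above, assuming the provider is wired up in the Express app; the configuration object (clients, JWKS, claims) is elided:

```typescript
import express from 'express';
import Provider from 'oidc-provider';

const app = express();

// A2A_ENABLED gates the whole provider: when set to 'false', nothing is
// mounted at /oidc, so the delegation endpoints naturally return 404.
if (process.env.A2A_ENABLED !== 'false') {
  const provider = new Provider(process.env.OIDC_ISSUER!, {
    // ...clients, jwks, and claims configuration elided...
  });
  app.use('/oidc', provider.callback());
}
```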
---
### ADR-13: Next.js 14 (Developer Portal)
**Status**: Adopted
**Component**: Developer Portal (`portal/`) — public-facing documentation and onboarding
**Decision**: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).
**Rationale**: The developer portal has different performance and SEO requirements than the internal operator dashboard (`dashboard/`). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.
**Alternatives considered**:
- Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
- Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.
**Consequences**: The portal (`portal/`) has its own `package.json`, `tsconfig.json`, `tailwind.config.ts`, and `next.config.js`. It is built and run independently: `cd portal && npm install && npm run dev`. The portal calls the AgentIdP REST API using the same `@sentryagent/idp-sdk` as the dashboard.
---
### ADR-14: bull (Job Queue) + kafkajs (Event Streaming)
**Status**: Adopted (opt-in)
**Component**: Async job processing and event streaming
**Decision**: Use `bull` (Redis-backed job queue) for async webhook delivery retries and `kafkajs` for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.
**Rationale**: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. `bull` provides this out of the box using the existing Redis dependency. `kafkajs` enables high-throughput event streaming for analytics and audit events to external data pipelines without blocking the primary request path.
**Alternatives considered**:
- BullMQ — considered as a more modern alternative to `bull` but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.
**Consequences**: Kafka is entirely optional. When `KAFKA_BROKERS` is not set, `kafkajs` is not initialised and no events are published. The `bull` queue for webhook delivery requires only the existing Redis instance.
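A minimal sketch of the retry-with-backoff pattern this ADR describes, assuming a `REDIS_URL` env var; `deliverWebhook` and the job payload fields are hypothetical names:

```typescript
import Queue from 'bull';

// Hypothetical delivery helper: performs the signed HTTP POST to the subscriber.
declare function deliverWebhook(data: unknown): Promise<void>;

// Webhook delivery queue, backed by the existing Redis instance.
const webhookQueue = new Queue('webhook-delivery', process.env.REDIS_URL!);

// Worker: throwing inside the processor triggers bull's retry logic.
webhookQueue.process(async (job) => {
  await deliverWebhook(job.data);
});

// Producer: retry up to 5 times with exponential backoff (1s, 2s, 4s, ...);
// exhausted jobs remain in the failed set for dead-letter inspection.
async function enqueueDelivery(subscriptionId: string, eventType: string, payload: unknown) {
  await webhookQueue.add(
    { subscriptionId, eventType, payload },
    { attempts: 5, backoff: { type: 'exponential', delay: 1000 }, removeOnComplete: true },
  );
}
```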
---
### ADR-15: did-resolver + web-did-resolver (W3C DIDs)
**Status**: Adopted
**Component**: W3C DID Core 1.0 document resolution
**Decision**: Use `did-resolver` (v4.1.x) as the DID resolution framework and `web-did-resolver` (v2.0.x) for the `did:web` method implementation.
**Rationale**: `did-resolver` provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The `did:web` method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. `DIDService` generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.
**Consequences**: `DID_WEB_DOMAIN` env var is required for DID generation. DID documents are cached in Redis (`did:doc:<agentId>`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a `dev:no-vault` marker is stored and keys are ephemeral.
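A sketch of how a third party could resolve an AgentIdP-issued `did:web` identifier with the two packages named above; the example DID is hypothetical:

```typescript
import { Resolver } from 'did-resolver';
import { getResolver } from 'web-did-resolver';

// did:web maps the identifier to an HTTPS URL and fetches the JSON document,
// e.g. did:web:example.com:agents:a1b2c3d4 resolves to
// https://example.com/agents/a1b2c3d4/did.json. No blockchain required.
const resolver = new Resolver(getResolver());

async function verifyAgentDid(did: string) {
  const { didDocument, didResolutionMetadata } = await resolver.resolve(did);
  if (didResolutionMetadata.error) throw new Error(didResolutionMetadata.error);
  return didDocument?.verificationMethod; // EC P-256 keys per the overview table
}
```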
## File 3: `docs/engineering/04-codebase-structure.md`

**Operation:** Three surgical edits — update the directory tree, the `src/` subdirectory table, and the "where to add new code" table.
### Change 3a — Update the Annotated Directory Tree
Find (inside the code block in Section 1, after the `sdk-java/` line):
├── policies/ # OPA policy files
Replace the entire block from `├── policies/` down through `└── jest.config.ts # Jest configuration — ts-jest, test timeouts, coverage thresholds` with the following updated version:
├── sdk-rust/ # Rust SDK (sentryagent-idp crate) — async, tokio, reqwest, typed errors
├── policies/ # OPA policy files
│ ├── authz.rego # Rego policy — normalise_path + scope-intersection allow rule
│ └── data/scopes.json # Endpoint permission map — used by Rego and TypeScript fallback
├── portal/ # Developer Portal — Next.js 14 App Router, Tailwind CSS
│ ├── app/ # Next.js App Router pages (get-started, pricing, sdks, analytics, settings, login)
│ ├── components/ # Shared UI components (Nav.tsx, SwaggerExplorer.tsx, GetStartedWizard.tsx)
│ ├── hooks/ # React hooks (useAuth.ts)
│ └── types/ # TypeScript type definitions for portal-only types
├── terraform/ # Terraform infrastructure as code
│ ├── modules/ # Reusable modules: agentidp, lb, rds, redis
│ └── environments/ # Environment configs: aws/ (ECS+RDS+ElastiCache), gcp/ (Cloud Run+SQL+Memorystore)
├── monitoring/ # Prometheus and Grafana configuration
│ ├── prometheus/ # prometheus.yml scrape configuration
│ └── grafana/ # Grafana provisioning YAML and dashboard JSON files
├── docs/ # All project documentation
│ ├── engineering/ # Internal engineering knowledge base (this directory)
│ ├── developers/ # End-user API reference and developer guides
│ ├── devops/ # Operator runbooks and environment variable reference
│ ├── agntcy/ # AGNTCY alignment documentation
│ └── openapi/ # OpenAPI 3.0 specification files
├── openspec/ # OpenSpec change management — proposals, designs, specs, tasks, archives
├── tests/ # Jest test suite — mirrors src/ structure
│ ├── unit/ # Unit tests (mocked dependencies) — mirrors src/
│ ├── integration/ # Integration tests (real DB + Redis)
│ ├── agntcy-conformance/ # AGNTCY conformance test suite (separate Jest config)
│ └── load/ # k6 load test scripts
├── Dockerfile # Multi-stage production build (build + runtime stages)
├── docker-compose.yml # Local development: PostgreSQL 14 (port 5432) + Redis 7 (port 6379)
├── docker-compose.monitoring.yml # Monitoring overlay: Prometheus (port 9090) + Grafana (port 3001)
├── package.json # Node.js dependencies and npm scripts
├── tsconfig.json # TypeScript strict configuration — compiled to dist/
└── jest.config.ts # Jest configuration — ts-jest, test timeouts, coverage thresholds
### Change 3b — Add New `src/` Subdirectories to Section 2
Find (Section 2 table, the last row):
| `src/cache/` | Redis client factory — creates and caches a single `redis` client instance | Client is a singleton created once in `src/app.ts` and passed to repositories |
Insert these rows after that line:
| `src/config/` | Configuration constants — `tiers.ts` exports `TIER_CONFIG`, `TIER_RANK`, `TierName`, and `isTierName()` type guard | Imported by `TierService` and `tierMiddleware`; never imports from services |
| `src/middleware/tier.ts` | Tier enforcement middleware — reads org tier from `TierService`, checks daily call counter in Redis, throws `TierLimitError` (429) when limit is exceeded, increments counter on pass | Applied only to API routes; skips `/health`, `/metrics`, and static file routes |
### Change 3c — Add New Entries to Section 3 (Where to Add New Code)
Find (Section 3 table, the row after the 'A new Prometheus metric' row):
| A new TypeScript type used in 2+ files | `src/types/index.ts` | A new `AgentGroupMembership` interface |
Insert these rows after that line:
| A new tier-gated feature | `src/config/tiers.ts` (add limit field) + `src/middleware/tier.ts` (add check) + service (enforce) | Adding a `maxWebhooksPerOrg` tier limit |
| A webhook event handler | `src/services/WebhookService.ts` (add event type to `WebhookEventType`) + the producer that calls `void webhookService.dispatch(orgId, eventType, payload)` | Emitting `agent.decommissioned` events to subscriber URLs |
| A new analytics metric type | `src/services/AnalyticsService.ts` (call `recordEvent(tenantId, 'new_metric')` in the relevant service using `void`) | Recording `credential_rotated` events for analytics |
| A new DID endpoint | `src/controllers/DIDController.ts` + `src/routes/did.ts` + `src/services/DIDService.ts` (if new method needed) + `policies/data/scopes.json` | Adding `GET /api/v1/agents/:id/did/rotate-key` |
## File 4: `docs/engineering/README.md`

**Operation:** Update the reading order table and quick reference table to reflect all Phase 6 additions.
### Change 4a — Update Reading Order Table
Find (Section "Reading Order (New Engineers Start Here)", the last row):
| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 4 SDKs — installation, examples, contribution guide | 20 min |
Replace with (adds the Rust SDK to the description and updates the estimated time):
| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 5 SDKs (Node.js, Python, Go, Java, Rust) — installation, examples, contribution guide | 25 min |
Find (the line after the table):
**Total estimated reading time for new engineers: ~3.5 hours**
Replace with:
**Total estimated reading time for new engineers: ~4 hours**
### Change 4b — Update "Service Deep Dives" Entry
Find:
| 5 | [Service Deep Dives](05-services.md) | All 8 services/components — purpose, interface, schema, error types | 30 min |
Replace with:
| 5 | [Service Deep Dives](05-services.md) | All 17 services/components (incl. Phase 3–6: AnalyticsService, TierService, ComplianceService, FederationService, DIDService, WebhookService, BillingService, DelegationService, OIDCService) — purpose, interface, schema, error types | 45 min |
### Change 4c — Update Quick Reference Table
Find (in the Quick Reference section):
| Integrate with the SDK | [11-sdk-guide.md](11-sdk-guide.md) |
Replace with:
| Integrate with the SDK (Node.js, Python, Go, Java, Rust) | [11-sdk-guide.md](11-sdk-guide.md) |
Find (after the "Integrate with the SDK" row):
| Understand why a technology was chosen | [03-tech-stack.md](03-tech-stack.md) |
Insert after that row:
| Understand tier limits and billing | [01-overview.md](01-overview.md) (Section 6) + [03-tech-stack.md](03-tech-stack.md) (ADR-11) |
| Understand AGNTCY compliance reports | [05-services.md](05-services.md) (ComplianceService) |
| Understand the A2A delegation flow | [06-walkthroughs.md](06-walkthroughs.md) (Walkthrough 4) |
| Run the AGNTCY conformance suite | [09-testing.md](09-testing.md) (Section 10.8) |
| Add a new Rust SDK endpoint | [11-sdk-guide.md](11-sdk-guide.md) (Section 6 contribution guide) |
## File 5: `docs/engineering/06-walkthroughs.md`

**Operation:** Append three new walkthrough sections at the end of the file.
Find (the final paragraph of the file):
Returns `ICredentialWithSecret` — the updated credential including the new
`clientSecret`. This is the only time the new secret is ever returned. The caller
must store it securely.
Append the following after that final paragraph:
---
## Walkthrough 4 — A2A Delegation End-to-End
**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another
This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.
---
### Step 1 — Route dispatch
**File:** `src/routes/delegation.ts`
```typescript
router.post(
'/token/delegate',
asyncHandler(authMiddleware),
opaMiddleware,
asyncHandler(delegationController.createDelegation.bind(delegationController))
);
```

Both `authMiddleware` and `opaMiddleware` run. The OPA policy requires the `agents:write` scope for delegation creation.
### Step 2 — Controller: extract delegator and validate

**File:** `src/controllers/DelegationController.ts`

```typescript
const delegatorId = req.user.sub; // From the Bearer token's sub claim
const { delegatee_id, scope, expires_at } = req.body;
```

The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.
### Step 3 — Service: verify both agents exist

**File:** `src/services/DelegationService.ts`

```typescript
const delegator = await this.agentRepository.findById(delegatorId);
if (!delegator || delegator.status !== 'active') {
  throw new AgentNotFoundError(delegatorId);
}
const delegatee = await this.agentRepository.findById(delegateeId);
if (!delegatee || delegatee.status !== 'active') {
  throw new AgentNotFoundError(delegateeId);
}
```

Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.
### Step 4 — Service: insert delegation chain record

**File:** `src/services/DelegationService.ts`

```typescript
await this.pool.query(
  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
   VALUES ($1, $2, $3, $4, 'active', $5)`,
  [chainId, delegatorId, delegateeId, scope, expiresAt],
);
```

The `chain_id` is a UUID generated by the service. The `delegation_chains` table provides the authoritative source of truth for which delegations are active, independent of any token.
### Step 5 — Response

```json
{
  "chain_id": "f1e2d3c4-...",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "delegator_id": "a1b2c3d4-...",
  "delegatee_id": "b2c3d4e5-...",
  "scope": "agents:read",
  "status": "active",
  "expires_at": "2026-04-05T00:00:00Z"
}
```

The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.

**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent verify-delegation calls will fail even if the JWT itself has not yet expired.
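A sketch of what that revocation-aware check inside the verify-delegation path could look like, as a `DelegationService` method. The `chain_id` claim, the `verifyJwt` helper, the `DelegationClaims` type, and the error type are assumptions; only the "valid JWT but revoked chain still fails" behaviour comes from the text above:

```typescript
// Sketch: verification is a signature check plus a liveness check against
// delegation_chains, so revoked chains fail even with a valid JWT.
async verifyDelegation(token: string): Promise<DelegationClaims> {
  const claims = verifyJwt<DelegationClaims>(token, this.publicKey); // hypothetical JWT helper

  const { rows } = await this.pool.query(
    `SELECT 1 FROM delegation_chains
     WHERE chain_id = $1 AND status = 'active' AND expires_at > NOW()`,
    [claims.chain_id],
  );
  if (rows.length === 0) {
    // Soft-deleted (revoked) or expired chain: reject regardless of JWT validity.
    throw new DelegationRevokedError(claims.chain_id); // hypothetical error type
  }
  return claims;
}
```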
---

## Walkthrough 5 — Tier Enforcement Request Lifecycle

**Request:** Any authenticated API request when the organisation's daily call limit is reached

This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.

### Step 1 — Auth middleware passes

Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.
### Step 2 — Tier middleware: fetch org tier

**File:** `src/middleware/tier.ts`

```typescript
const orgId = req.user.organization_id;
const tier = await tierService.fetchTier(orgId);
const config = TIER_CONFIG[tier];
```

`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1` and returns `'free'` if no row is found (a safe default).
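A sketch of `fetchTier()` matching that description (whether the real method also caches the result is not specified here):

```typescript
// Sketch of TierService.fetchTier() per the description above:
// missing organisation rows fall back to the most restrictive tier.
async fetchTier(orgId: string): Promise<TierName> {
  const { rows } = await this.pool.query(
    'SELECT tier FROM organizations WHERE organization_id = $1',
    [orgId],
  );
  const tier: unknown = rows[0]?.tier;
  return isTierName(tier) ? tier : 'free'; // isTierName() type guard from src/config/tiers.ts
}
```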
### Step 3 — Tier middleware: read daily counter

**File:** `src/middleware/tier.ts`

```typescript
const callsKey = `rate:tier:calls:${orgId}`;
const callsToday = await redis.get(callsKey);
const count = callsToday !== null ? parseInt(callsToday, 10) : 0;
if (count >= config.maxCallsPerDay) {
  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
}
```

The Redis key `rate:tier:calls:<orgId>` is read. If it is `null` (first call of the day), `count` is 0. When `count` equals or exceeds the tier limit, `TierLimitError` (HTTP 429) is thrown immediately — no further middleware runs.
### Step 4 — Tier middleware: increment counter (fire-and-forget)

**File:** `src/middleware/tier.ts`

```typescript
// Set TTL to next UTC midnight if key is new
void redis.multi()
  .incr(callsKey)
  .expireAt(callsKey, nextUtcMidnightUnix())
  .exec();
next();
```

The counter is incremented atomically using a Redis `MULTI` block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget — the request proceeds immediately to `opaMiddleware`.

**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users interpret as resetting at midnight. A rolling window would allow a user to consume their full daily quota twice within a 48-hour period straddling midnight, which is counterintuitive. UTC midnight is predictable and easy to reason about.
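`nextUtcMidnightUnix()` is referenced above but not shown; a plausible implementation, assuming it returns seconds (which is what Redis `EXPIREAT` expects):

```typescript
// Unix timestamp (in seconds) of the next UTC midnight, for EXPIREAT.
// Sketch only; the real helper in src/middleware/tier.ts may differ.
function nextUtcMidnightUnix(): number {
  const now = new Date();
  // Date.UTC normalises day/month/year rollover automatically.
  const nextMidnightMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.floor(nextMidnightMs / 1000);
}
```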
### Step 5 — Error handler serialises TierLimitError

**File:** `src/middleware/errorHandler.ts`

HTTP 429:

```json
{
  "code": "TIER_LIMIT_EXCEEDED",
  "message": "Daily API call limit reached for your tier.",
  "details": {
    "tier": "free",
    "limit": 1000,
    "current": 1000
  }
}
```

The `Retry-After` header is set to the number of seconds until the next UTC midnight so clients can implement automatic backoff.
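A sketch of the error-handler branch that could produce this response. The shape of `TierLimitError` follows the constructor call in Step 3, but the exact property names on the error object are assumptions:

```typescript
// Sketch: map TierLimitError to the 429 payload shown above.
if (err instanceof TierLimitError) {
  const secondsUntilReset = nextUtcMidnightUnix() - Math.floor(Date.now() / 1000);
  res
    .status(429)
    .set('Retry-After', String(secondsUntilReset))
    .json({
      code: 'TIER_LIMIT_EXCEEDED',
      message: 'Daily API call limit reached for your tier.',
      details: { tier: err.details.tier, limit: err.limit, current: err.details.current },
    });
  return;
}
next(err);
```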
---

## Walkthrough 6 — Analytics Event Capture Flow

**Trigger:** Any successful token issuance (`POST /api/v1/token`)

This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.
### Step 1 — Token issuance completes

**File:** `src/services/OAuth2Service.ts`

```typescript
const accessToken = signToken(payload, this.privateKey);
// Primary response is ready — analytics is now fire-and-forget
void this.analyticsService.recordEvent(tenantId, 'token_issued');
tokensIssuedTotal.inc({ scope });
```

The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O). The controller can now send the response. `analyticsService.recordEvent()` is called with `void` — the `await` is deliberately omitted.

**Why `void` instead of `await`?** Token issuance latency must remain below 100 ms (per the QA performance gate). A PostgreSQL write adds 5–15 ms. Since analytics data is aggregated (not transactional), losing an occasional event due to an error is acceptable. The response is never delayed for analytics.
### Step 2 — AnalyticsService: UPSERT daily counter

**File:** `src/services/AnalyticsService.ts`

```typescript
async recordEvent(tenantId: string, metricType: string): Promise<void> {
  try {
    await this.pool.query(
      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
       VALUES ($1, CURRENT_DATE, $2, 1)
       ON CONFLICT (organization_id, date, metric_type)
       DO UPDATE SET count = analytics_events.count + 1`,
      [tenantId, metricType],
    );
  } catch (err) {
    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
  }
}
```

The `ON CONFLICT DO UPDATE` upsert is atomic. Whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and swallowed — the token has already been returned to the caller.

**Why one row per day per metric, not one row per event?** Storing a row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data (the granularity that analytics dashboards need). Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.
### Step 3 — Dashboard query (deferred)

When a developer visits the analytics page in the developer portal, the portal calls:

`GET /api/v1/analytics/token-trend?days=30`

**File:** `src/services/AnalyticsService.ts` — `getTokenTrend(tenantId, 30)`

```sql
SELECT
  gs.date::DATE::TEXT AS date,
  COALESCE(ae.count, 0)::INTEGER AS count
FROM generate_series(
  CURRENT_DATE - ($1::INTEGER - 1) * INTERVAL '1 day',
  CURRENT_DATE,
  INTERVAL '1 day'
) AS gs(date)
LEFT JOIN analytics_events ae
  ON ae.date = gs.date::DATE
  AND ae.organization_id = $2
  AND ae.metric_type = 'token_issued'
ORDER BY gs.date ASC
```

The `generate_series` + `LEFT JOIN` pattern ensures all requested days appear in the result (here, all 30), with `count: 0` for days that had no events; `$1` is bound to the day count and `$2` to the organisation ID. This avoids the need for the client to fill in gaps.
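To close the loop, a sketch of how the portal's analytics page might consume this endpoint. The `TrendPoint` shape follows the `date`/`count` columns above; the fetch wrapper and bearer-token handling are assumptions:

```typescript
// Hypothetical portal-side fetch for the token trend chart.
interface TrendPoint {
  date: string;  // 'YYYY-MM-DD' from gs.date::DATE::TEXT
  count: number; // 0 for gap days, thanks to COALESCE
}

async function fetchTokenTrend(accessToken: string, days = 30): Promise<TrendPoint[]> {
  const res = await fetch(`/api/v1/analytics/token-trend?days=${days}`, {
    headers: { Authorization: `Bearer ${accessToken}` }, // assumed auth scheme
  });
  if (!res.ok) throw new Error(`token-trend request failed: ${res.status}`);
  return (await res.json()) as TrendPoint[];
}
```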