sentryagent-idp/openspec/changes/archive/engineering-docs-phase6-update/specs/ws5-remaining/spec.md
SentryAgent.ai Developer 8cabc0191c docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00


WS5 — Remaining Documentation Updates

Targets: 5 separate files with surgical edits.


File 1: docs/engineering/01-overview.md

Operation: Replace the Phase Roadmap table (Section 4) to reflect Phases 3–6 completion status and add Phase 6 capabilities to the Product Features table.


Change 1a — Update Phase Roadmap Table

Find (Section 4, the Phase 3 row):

| Phase 3 — Enterprise | PLANNED | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, advanced compliance reporting, SOC 2 Type II certification, enterprise tier (custom retention, SLAs, advanced RBAC) |

Replace with (4 rows — Phase 3 is now complete and Phases 4–6 have been added):

| Phase 3 — Enterprise | COMPLETE | AGNTCY federation (cross-IdP agent identity), W3C Decentralised Identifiers (DIDs), agent marketplace, OIDC provider (A2A delegation), Rust SDK, developer portal (Next.js 14) |
| Phase 4 — Compliance & Security | COMPLETE | AGNTCY compliance reports (agent-identity + audit-trail sections), audit hash chain verification, SOC 2 CC6.1 AES-256-CBC column encryption (`EncryptionService`), DID document caching, federation partner JWKS caching |
| Phase 5 — Scale & Ecosystem | COMPLETE | Multi-tier subscription model (free/pro/enterprise), Stripe billing integration (`BillingService`, `TierService`), tier enforcement middleware (daily call and token limits), webhook subscriptions + delivery history (`WebhookService`), analytics service (daily event aggregation + trend queries) |
| Phase 6 — Market Expansion | COMPLETE | AGNTCY conformance test suite (4 conformance scenarios), API tiers enforced end-to-end, analytics dashboard in developer portal, full Phase 6 engineering documentation update |

Change 1b — Add Phase 3–6 Capabilities to Product Features Table

Find (Section 3, the last row of the features table):

| Health Check | `GET /health` | Checks PostgreSQL and Redis connectivity; unauthenticated; used by load balancers |

Insert the following rows after that line (before the closing of the table):

| W3C Decentralised Identifiers | `GET /api/v1/agents/:id/did`, `GET /api/v1/.well-known/did.json` | DID Core 1.0 documents; `did:web` method; EC P-256 keys; AGNTCY extension fields |
| AGNTCY Agent Cards | `GET /api/v1/agents/:id/card` | Machine-readable agent identity summary; AGNTCY schema v1.0 |
| AGNTCY Compliance Reports | `GET /api/v1/compliance/report`, `GET /api/v1/compliance/agent-cards` | Compliance sections: agent-identity + audit-trail; cached 5 min; AGNTCY schema v1.0 |
| Federation (Cross-IdP) | `POST /api/v1/federation/partners`, `GET /api/v1/federation/partners`, `POST /api/v1/federation/verify` | Register partner IdPs; verify cross-IdP JWTs using cached partner JWKS |
| A2A Delegation | `POST /api/v1/oauth2/token/delegate`, `POST /api/v1/oauth2/token/verify-delegation` | Agent-to-agent delegation tokens; OIDC provider (oidc-provider v9) mounted at `/oidc` |
| Webhook Subscriptions | `POST /api/v1/webhooks`, `GET /api/v1/webhooks`, `GET /api/v1/webhooks/:id/deliveries` | Outbound event delivery with HMAC signing; Vault-backed secrets; delivery history |
| Tier Management | `GET /api/v1/tiers/status`, `POST /api/v1/tiers/upgrade` | Free / Pro / Enterprise tiers; daily call and token limits; Stripe Checkout upgrade flow |
| Billing | `POST /api/v1/billing/checkout`, `POST /api/v1/billing/webhook`, `GET /api/v1/billing/status` | Stripe subscription management; webhook event processing |
| Analytics | Internal (via `AnalyticsService`) | Daily aggregated event counts per org; token trend queries (up to 90 days); agent activity heatmap; usage summary |
| Developer Portal | `/portal` (Next.js 14, separate process) | Get-started wizard, SDK explorer, API reference, analytics dashboard, pricing page |

Change 1c — Update Free Tier Limits Table

Find (Section 6, entire table):

| Limit | Value |
|-------|-------|
| Max agents | 100 |
| Max credentials per agent | No hard cap enforced in code (5 is the documented recommendation) |
| Max tokens in flight | 10,000 per agent per calendar month |
| Token TTL | 3,600 seconds (1 hour) |
| Audit log retention | 90 days |
| API rate limit | 100 requests per minute per IP address |

Replace with:

| Limit | Free Tier | Pro Tier | Enterprise Tier |
|-------|-----------|----------|-----------------|
| Max agents | 100 | 1,000 | Unlimited |
| Max API calls per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Max tokens per day | Configured in `TIER_CONFIG` | Configured in `TIER_CONFIG` | Unlimited |
| Token TTL | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) | 3,600 seconds (1 hour) |
| Audit log retention | 90 days | 1 year | Custom |
| API rate limit (per IP) | 100 req/min | 100 req/min | 100 req/min |
| Webhook subscriptions | 0 | 10 | Unlimited |
| Analytics retention | 90 days | 1 year | Custom |

Tier limits are configured in `src/config/tiers.ts` (`TIER_CONFIG`). Enforcement is handled by `TierService.enforceAgentLimit()` (agent cap) and `src/middleware/tier.ts` (daily call/token caps). Tier upgrades are initiated via `POST /api/v1/tiers/upgrade` and confirmed via the Stripe webhook.
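For orientation, the shape of `TIER_CONFIG` might look roughly like the sketch below. This is a hedged illustration only: `TierName`, `isTierName()`, and `maxCallsPerDay` appear in this spec, but the remaining field names and all pro-tier numbers are placeholders, not the real `src/config/tiers.ts` values.

```typescript
// Hypothetical sketch of src/config/tiers.ts. Only TierName, isTierName()
// and maxCallsPerDay are attested in this spec; other fields and the
// pro-tier numbers are placeholders.
export type TierName = 'free' | 'pro' | 'enterprise';

export interface TierLimits {
  maxAgents: number;        // Infinity = unlimited (enterprise)
  maxCallsPerDay: number;   // checked by src/middleware/tier.ts
  maxTokensPerDay: number;
}

export const TIER_CONFIG: Record<TierName, TierLimits> = {
  free:       { maxAgents: 100,      maxCallsPerDay: 1_000,    maxTokensPerDay: 10_000 },
  pro:        { maxAgents: 1_000,    maxCallsPerDay: 50_000,   maxTokensPerDay: 500_000 },
  enterprise: { maxAgents: Infinity, maxCallsPerDay: Infinity, maxTokensPerDay: Infinity },
};

export function isTierName(value: string): value is TierName {
  // hasOwnProperty, not `in`, so prototype keys like 'toString' never match
  return Object.prototype.hasOwnProperty.call(TIER_CONFIG, value);
}
```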

File 2: docs/engineering/03-tech-stack.md

Operation: Append new ADR entries after the existing ### ADR-10: Terraform section.

Find (last line of the file):

**Consequences**: All infrastructure changes must go through Terraform. No manual edits
via the AWS console or GCP console are permitted — they will be overwritten on the next
`terraform apply`. Terraform state is stored in a remote backend and must not be edited
manually.

Append the following after that line:

---

### ADR-11: Stripe

**Status**: Adopted
**Component**: Billing — subscription management and payment processing

**Decision**: Use Stripe as the payment processing and subscription management platform. The `stripe` npm package (v21+) handles Checkout Session creation, webhook event verification, and subscription lifecycle events.

**Rationale**: Stripe's hosted Checkout flow eliminates the need to handle PCI-DSS scope for card data. The `stripe.webhooks.constructEvent()` method uses HMAC-SHA256 to verify incoming webhook payloads, preventing replay attacks. The `checkout.session.completed` event carries `metadata: { orgId, targetTier }`, allowing `BillingService` to delegate tier upgrades to `TierService.applyUpgrade()` without coupling billing logic to tier logic.

**Alternatives considered**:
- Paddle — rejected because its global merchant-of-record model introduced complexities with the open-source free tier.
- Braintree — rejected because Stripe's webhook reliability and developer experience are superior.

**Consequences**: Stripe requires `STRIPE_SECRET_KEY` (for API calls) and `STRIPE_WEBHOOK_SECRET` (`whsec_...`, for webhook verification). Per-tier Stripe price IDs are configured via `STRIPE_PRICE_ID_PRO` and `STRIPE_PRICE_ID_ENTERPRISE`. All billing webhook handlers must pass the raw `Buffer` body (not parsed JSON) to `stripe.webhooks.constructEvent()` — use `express.raw()` middleware on the webhook route.
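To make the raw-body requirement concrete, the sketch below illustrates, in simplified form, the kind of check `stripe.webhooks.constructEvent()` performs: an HMAC-SHA256 over `"<timestamp>.<raw body>"` compared against the `v1` value from the `Stripe-Signature` header. It is an illustration only (no timestamp-tolerance window, single signature); production code should always call the Stripe SDK helper.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Simplified illustration of Stripe webhook signature verification.
// Not the SDK implementation: no tolerance window, single v1 signature.
export function verifyStripeSignature(
  rawBody: Buffer,           // must be the raw bytes, not re-serialised JSON
  signatureHeader: string,   // e.g. "t=1712448000,v1=5257a8..."
  secret: string,            // the whsec_... webhook secret
): boolean {
  const parts = Object.fromEntries(
    signatureHeader.split(',').map((kv) => kv.split('=') as [string, string]),
  );
  const signedPayload = `${parts.t}.${rawBody.toString('utf8')}`;
  const expected = createHmac('sha256', secret).update(signedPayload).digest('hex');
  const given = parts.v1 ?? '';
  return (
    given.length === expected.length &&
    timingSafeEqual(Buffer.from(given), Buffer.from(expected))
  );
}
```

Because the HMAC is computed over the exact bytes received, any JSON re-parse/re-serialise step changes the payload and breaks verification. That is why the route must use `express.raw()`.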

---

### ADR-12: oidc-provider (A2A Delegation)

**Status**: Adopted
**Component**: A2A delegation — OIDC provider for agent-to-agent trust tokens

**Decision**: Use the `oidc-provider` npm package (v9.7.x) as the OIDC provider for issuing A2A delegation tokens. The provider is mounted as a sub-application at `/oidc` within the Express app.

**Rationale**: `oidc-provider` is a certified OpenID Connect implementation that handles the full OIDC protocol, including JWKS serving, token endpoint, and discovery document. Rather than implementing a custom delegation token format, using a standards-compliant OIDC provider means delegation tokens can be verified by any OIDC-aware party using the published JWKS at `/oidc/jwks`.

**Alternatives considered**:
- Custom JWT signing — rejected because hand-rolled token formats cannot benefit from OIDC tooling and interoperability.

**Consequences**: `A2A_ENABLED` env var gates the OIDC provider — when set to `'false'`, delegation endpoints return 404. The `OIDC_ISSUER` env var must be set to the full base URL of the OIDC provider (e.g. `https://api.sentryagent.ai`).

---

### ADR-13: Next.js 14 (Developer Portal)

**Status**: Adopted
**Component**: Developer Portal (`portal/`) — public-facing documentation and onboarding

**Decision**: Use Next.js 14 (App Router) with Tailwind CSS for the developer portal. The portal is a separate process served on its own port (independent of the Express API server).

**Rationale**: The developer portal has different performance and SEO requirements than the internal operator dashboard (`dashboard/`). Next.js 14's App Router supports React Server Components, which allows the marketing and documentation pages to be statically generated while the analytics dashboard and API Explorer are client-rendered. Tailwind CSS enables rapid UI development consistent with the design system.

**Alternatives considered**:
- Extending the Vite dashboard — rejected because the developer portal requires server-side rendering for SEO on marketing pages, which Vite does not provide.
- Docusaurus — rejected because the portal includes interactive components (Swagger Explorer, analytics charts) that are not well-suited to a documentation-only tool.

**Consequences**: The portal (`portal/`) has its own `package.json`, `tsconfig.json`, `tailwind.config.ts`, and `next.config.js`. It is built and run independently: `cd portal && npm install && npm run dev`. The portal calls the AgentIdP REST API using the same `@sentryagent/idp-sdk` as the dashboard.

---

### ADR-14: bull (Job Queue) + kafkajs (Event Streaming)

**Status**: Adopted (opt-in)
**Component**: Async job processing and event streaming

**Decision**: Use `bull` (Redis-backed job queue) for async webhook delivery retries and `kafkajs` for event streaming to external consumers. Both are opt-in — the system operates correctly without Kafka configured.

**Rationale**: Webhook delivery requires retry logic with exponential backoff and dead-letter handling. `bull` provides this out of the box using the existing Redis dependency. `kafkajs` enables high-throughput event streaming for analytics and audit events to external data pipelines without blocking the primary request path.

**Alternatives considered**:
- BullMQ — considered as a more modern alternative to `bull` but rejected to avoid adding a new package family during Phase 6. Migration is a future backlog item.

**Consequences**: Kafka is entirely optional. When `KAFKA_BROKERS` is not set, `kafkajs` is not initialised and no events are published. The `bull` queue for webhook delivery requires only the existing Redis instance.
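The spec does not pin down the webhook retry schedule. As an illustration of the capped exponential-backoff pattern such a queue typically uses (this is not `bull`'s internal formula), a standalone delay helper could look like:

```typescript
// Illustrative capped exponential backoff, not bull's internal formula:
// delayMs(1) = base, delayMs(2) = 2*base, delayMs(3) = 4*base, ... capped at maxMs.
export function delayMs(attempt: number, base = 1_000, maxMs = 60_000): number {
  if (attempt < 1) throw new RangeError('attempt starts at 1');
  return Math.min(base * 2 ** (attempt - 1), maxMs);
}
```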

---

### ADR-15: did-resolver + web-did-resolver (W3C DIDs)

**Status**: Adopted
**Component**: W3C DID Core 1.0 document resolution

**Decision**: Use `did-resolver` (v4.1.x) as the DID resolution framework and `web-did-resolver` (v2.0.x) for the `did:web` method implementation.

**Rationale**: `did-resolver` provides a pluggable resolver interface used by both the server (for internal resolution) and by third parties who want to verify AgentIdP-issued DIDs. The `did:web` method maps DID identifiers to HTTPS URLs hosting the DID document JSON, requiring no blockchain. `DIDService` generates documents that conform to the W3C DID Core 1.0 specification and include AGNTCY-specific extension fields.

**Consequences**: `DID_WEB_DOMAIN` env var is required for DID generation. DID documents are cached in Redis (`did:doc:<agentId>`, TTL from `DID_DOCUMENT_CACHE_TTL_SECONDS`, default 300s). Private keys are stored in HashiCorp Vault KV v2 when Vault is configured; in dev mode, a `dev:no-vault` marker is stored and keys are ephemeral.
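The `did:web` identifier-to-URL mapping that `web-did-resolver` performs is defined by the W3C did:web method specification and can be sketched as a pure function (the example domain and agent path below are hypothetical):

```typescript
// did:web identifier-to-URL mapping per the W3C did:web method spec:
//   did:web:example.com               -> https://example.com/.well-known/did.json
//   did:web:example.com:agents:abc123 -> https://example.com/agents/abc123/did.json
export function didWebToUrl(did: string): string {
  const prefix = 'did:web:';
  if (!did.startsWith(prefix)) throw new Error(`not a did:web identifier: ${did}`);
  const segments = did.slice(prefix.length).split(':').map(decodeURIComponent);
  const host = segments[0];   // a percent-encoded %3A in the host decodes to ':' for ports
  const path = segments.slice(1);
  return path.length === 0
    ? `https://${host}/.well-known/did.json`
    : `https://${host}/${path.join('/')}/did.json`;
}
```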

File 3: docs/engineering/04-codebase-structure.md

Operation: Two surgical edits — update the directory tree and update the src/ subdirectory table.


Change 3a — Update the Annotated Directory Tree

Find (inside the code block in Section 1, after the sdk-java/ line):

├── policies/                  # OPA policy files

Replace the entire block from the `├── policies/` line down through the `└── jest.config.ts # Jest configuration — ts-jest, test timeouts, coverage thresholds` line with the following updated version:

├── sdk-rust/                  # Rust SDK (sentryagent-idp crate) — async, tokio, reqwest, typed errors
├── policies/                  # OPA policy files
│   ├── authz.rego             # Rego policy — normalise_path + scope-intersection allow rule
│   └── data/scopes.json       # Endpoint permission map — used by Rego and TypeScript fallback
├── portal/                    # Developer Portal — Next.js 14 App Router, Tailwind CSS
│   ├── app/                   # Next.js App Router pages (get-started, pricing, sdks, analytics, settings, login)
│   ├── components/            # Shared UI components (Nav.tsx, SwaggerExplorer.tsx, GetStartedWizard.tsx)
│   ├── hooks/                 # React hooks (useAuth.ts)
│   └── types/                 # TypeScript type definitions for portal-only types
├── terraform/                 # Terraform infrastructure as code
│   ├── modules/               # Reusable modules: agentidp, lb, rds, redis
│   └── environments/          # Environment configs: aws/ (ECS+RDS+ElastiCache), gcp/ (Cloud Run+SQL+Memorystore)
├── monitoring/                # Prometheus and Grafana configuration
│   ├── prometheus/            # prometheus.yml scrape configuration
│   └── grafana/               # Grafana provisioning YAML and dashboard JSON files
├── docs/                      # All project documentation
│   ├── engineering/           # Internal engineering knowledge base (this directory)
│   ├── developers/            # End-user API reference and developer guides
│   ├── devops/                # Operator runbooks and environment variable reference
│   ├── agntcy/                # AGNTCY alignment documentation
│   └── openapi/               # OpenAPI 3.0 specification files
├── openspec/                  # OpenSpec change management — proposals, designs, specs, tasks, archives
├── tests/                     # Jest test suite — mirrors src/ structure
│   ├── unit/                  # Unit tests (mocked dependencies) — mirrors src/
│   ├── integration/           # Integration tests (real DB + Redis)
│   ├── agntcy-conformance/    # AGNTCY conformance test suite (separate Jest config)
│   └── load/                  # k6 load test scripts
├── Dockerfile                 # Multi-stage production build (build + runtime stages)
├── docker-compose.yml         # Local development: PostgreSQL 14 (port 5432) + Redis 7 (port 6379)
├── docker-compose.monitoring.yml  # Monitoring overlay: Prometheus (port 9090) + Grafana (port 3001)
├── package.json               # Node.js dependencies and npm scripts
├── tsconfig.json              # TypeScript strict configuration — compiled to dist/
└── jest.config.ts             # Jest configuration — ts-jest, test timeouts, coverage thresholds

Change 3b — Add New src/ Subdirectories to Section 2

Find (Section 2 table, the last row):

| `src/cache/` | Redis client factory — creates and caches a single `redis` client instance | Client is a singleton created once in `src/app.ts` and passed to repositories |

Insert these rows after that line:

| `src/config/` | Configuration constants — `tiers.ts` exports `TIER_CONFIG`, `TIER_RANK`, `TierName`, and `isTierName()` type guard | Imported by `TierService` and `tierMiddleware`; never imports from services |
| `src/middleware/tier.ts` | Tier enforcement middleware — reads org tier from `TierService`, checks daily call counter in Redis, throws `TierLimitError` (429) when limit is exceeded, increments counter on pass | Applied only to API routes; skips `/health`, `/metrics`, and static file routes |

Change 3c — Add New Entries to Section 3 (Where to Add New Code)

Find (Section 3 table, after the A new Prometheus metric row):

| A new TypeScript type used in 2+ files | `src/types/index.ts` | A new `AgentGroupMembership` interface |

Insert these rows after that line:

| A new tier-gated feature | `src/config/tiers.ts` (add limit field) + `src/middleware/tier.ts` (add check) + service (enforce) | Adding a `maxWebhooksPerOrg` tier limit |
| A webhook event handler | `src/services/WebhookService.ts` (add event type to `WebhookEventType`) + the producer that calls `void webhookService.dispatch(orgId, eventType, payload)` | Emitting `agent.decommissioned` events to subscriber URLs |
| A new analytics metric type | `src/services/AnalyticsService.ts` (call `recordEvent(tenantId, 'new_metric')` in the relevant service using `void`) | Recording `credential_rotated` events for analytics |
| A new DID endpoint | `src/controllers/DIDController.ts` + `src/routes/did.ts` + `src/services/DIDService.ts` (if new method needed) + `policies/data/scopes.json` | Adding `GET /api/v1/agents/:id/did/rotate-key` |

File 4: docs/engineering/README.md

Operation: Replace the reading order table and quick reference table to reflect all Phase 6 additions.


Change 4a — Update Reading Order Table

Find (Section "Reading Order (New Engineers Start Here)", the last row):

| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 4 SDKs — installation, examples, contribution guide | 20 min |

Replace with (adds the Rust SDK to the description and updates the estimated time):

| 11 | [SDK Integration Guide](11-sdk-guide.md) | All 5 SDKs (Node.js, Python, Go, Java, Rust) — installation, examples, contribution guide | 25 min |

Find (the line after the table):

**Total estimated reading time for new engineers: ~3.5 hours**

Replace with:

**Total estimated reading time for new engineers: ~4 hours**

Change 4b — Update "Service Deep Dives" Entry

Find:

| 5 | [Service Deep Dives](05-services.md) | All 8 services/components — purpose, interface, schema, error types | 30 min |

Replace with:

| 5 | [Service Deep Dives](05-services.md) | All 17 services/components (incl. Phases 3–6: AnalyticsService, TierService, ComplianceService, FederationService, DIDService, WebhookService, BillingService, DelegationService, OIDCService) — purpose, interface, schema, error types | 45 min |

Change 4c — Update Quick Reference Table

Find (in the Quick Reference section):

| Integrate with the SDK | [11-sdk-guide.md](11-sdk-guide.md) |

Replace with:

| Integrate with the SDK (Node.js, Python, Go, Java, Rust) | [11-sdk-guide.md](11-sdk-guide.md) |

Find (after the "Integrate with the SDK" row):

| Understand why a technology was chosen | [03-tech-stack.md](03-tech-stack.md) |

Insert after that row:

| Understand tier limits and billing | [01-overview.md](01-overview.md) (Section 6) + [03-tech-stack.md](03-tech-stack.md) (ADR-11) |
| Understand AGNTCY compliance reports | [05-services.md](05-services.md) (ComplianceService) |
| Understand the A2A delegation flow | [06-walkthroughs.md](06-walkthroughs.md) (Walkthrough 4) |
| Run the AGNTCY conformance suite | [09-testing.md](09-testing.md) (Section 10.8) |
| Add a new Rust SDK endpoint | [11-sdk-guide.md](11-sdk-guide.md) (Section 6 contribution guide) |

File 5: docs/engineering/06-walkthroughs.md

Operation: Append three new walkthrough sections at the end of the file.

Find (the last line of the file):

Returns `ICredentialWithSecret` — the updated credential including the new
`clientSecret`. This is the only time the new secret is ever returned. The caller
must store it securely.

Append the following after that final paragraph:

---

## Walkthrough 4 — A2A Delegation End-to-End

**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another

This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.

---

### Step 1 — Route dispatch

**File:** `src/routes/delegation.ts`

```typescript
router.post(
  '/token/delegate',
  asyncHandler(authMiddleware),
  opaMiddleware,
  asyncHandler(delegationController.createDelegation.bind(delegationController))
);
```

Both `authMiddleware` and `opaMiddleware` run. The OPA policy requires scope `agents:write` for delegation creation.


---

### Step 2 — Controller: extract delegator and validate

**File:** `src/controllers/DelegationController.ts`

```typescript
const delegatorId = req.user.sub;               // From the Bearer token's sub claim
const { delegatee_id, scope, expires_at } = req.body;
```

The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.


---

### Step 3 — Service: verify both agents exist

**File:** `src/services/DelegationService.ts`

```typescript
const delegator = await this.agentRepository.findById(delegatorId);
if (!delegator || delegator.status !== 'active') { throw new AgentNotFoundError(delegatorId); }

const delegatee = await this.agentRepository.findById(delegateeId);
if (!delegatee || delegatee.status !== 'active') { throw new AgentNotFoundError(delegateeId); }
```

Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.


---

### Step 4 — Service: insert delegation chain record

**File:** `src/services/DelegationService.ts`

```typescript
await this.pool.query(
  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
   VALUES ($1, $2, $3, $4, 'active', $5)`,
  [chainId, delegatorId, delegateeId, scope, expiresAt]
);
```

The `chain_id` is a UUID generated by the service. The `delegation_chains` table is the authoritative source of truth for which delegations are active, independent of any token.


---

### Step 5 — Response

```json
{
  "chain_id": "f1e2d3c4-...",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "delegator_id": "a1b2c3d4-...",
  "delegatee_id": "b2c3d4e5-...",
  "scope": "agents:read",
  "status": "active",
  "expires_at": "2026-04-05T00:00:00Z"
}
```

The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.

**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent verify-delegation calls fail even if the JWT itself has not yet expired.
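The revocation check just described can be sketched as a pure predicate over the stored chain record. This is a hedged illustration: the field names (`status`, `expiresAt`) are assumptions, not the real `delegation_chains` schema or the real `DelegationService` query.

```typescript
// Illustrative verify-delegation predicate: a delegation is honoured only
// while the DB chain record is active and unexpired, regardless of the
// JWT's own exp claim. Field names are assumptions, not the real schema.
interface DelegationChainRecord {
  status: 'active' | 'revoked';
  expiresAt: Date | null;   // null = no expiry set
}

export function isDelegationValid(
  chain: DelegationChainRecord | undefined,
  now: Date,
): boolean {
  if (!chain) return false;                     // unknown or soft-deleted chain
  if (chain.status !== 'active') return false;  // revoked via DELETE .../:chainId
  if (chain.expiresAt !== null && chain.expiresAt <= now) return false;
  return true;
}
```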


---

## Walkthrough 5 — Tier Enforcement Request Lifecycle

**Request:** Any authenticated API request when the organisation's daily call limit is reached

This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.


---

### Step 1 — Auth middleware passes

Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.


---

### Step 2 — Tier middleware: fetch org tier

**File:** `src/middleware/tier.ts`

```typescript
const orgId = req.user.organization_id;
const tier = await tierService.fetchTier(orgId);
const config = TIER_CONFIG[tier];
```

`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1` and returns `'free'` if no row is found (a safe default).


---

### Step 3 — Tier middleware: read daily counter

**File:** `src/middleware/tier.ts`

```typescript
const callsKey = `rate:tier:calls:${orgId}`;
const callsToday = await redis.get(callsKey);
const count = callsToday !== null ? parseInt(callsToday, 10) : 0;

if (count >= config.maxCallsPerDay) {
  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
}
```

The Redis key `rate:tier:calls:<orgId>` is read. If it is null (first call of the day), the count is 0. When the count equals or exceeds the tier limit, `TierLimitError` (HTTP 429) is thrown immediately — no further middleware runs.


---

### Step 4 — Tier middleware: increment counter (fire-and-forget)

**File:** `src/middleware/tier.ts`

```typescript
// Set TTL to next UTC midnight if key is new
void redis.multi()
  .incr(callsKey)
  .expireAt(callsKey, nextUtcMidnightUnix())
  .exec();
next();
```

The counter is incremented atomically using a Redis `MULTI` block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget — the request proceeds immediately to `opaMiddleware`.

**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users interpret as resetting at midnight. A fixed midnight reset does permit a caller to spend two days' quota back-to-back across the boundary, but it matches the documented semantics, is predictable, and needs only a single counter key; a rolling window requires per-request timestamps and resets at a different moment for every organisation, which is harder for users to reason about.
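The `nextUtcMidnightUnix()` helper referenced above is not shown in this spec; a minimal sketch (the real `src/middleware/tier.ts` helper may differ) could be:

```typescript
// Sketch of nextUtcMidnightUnix(): the Unix timestamp (seconds) of the next
// 00:00:00 UTC strictly after `now`. The real helper may be implemented
// differently.
export function nextUtcMidnightUnix(now: Date = new Date()): number {
  const next = new Date(Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1,   // Date.UTC normalises day/month/year rollover
  ));
  return Math.floor(next.getTime() / 1000);
}
```

The same helper also yields a `Retry-After` value for quota-exceeded responses: `nextUtcMidnightUnix() - Math.floor(Date.now() / 1000)`.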


---

### Step 5 — Error handler serialises `TierLimitError`

**File:** `src/middleware/errorHandler.ts`

**Response:** HTTP 429

```json
{
  "code": "TIER_LIMIT_EXCEEDED",
  "message": "Daily API call limit reached for your tier.",
  "details": {
    "tier": "free",
    "limit": 1000,
    "current": 1000
  }
}
```

The `Retry-After` header is set to the number of seconds until the next UTC midnight so clients can implement automatic backoff.


---

## Walkthrough 6 — Analytics Event Capture Flow

**Trigger:** Any successful token issuance (`POST /api/v1/token`)

This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.


---

### Step 1 — Token issuance completes

**File:** `src/services/OAuth2Service.ts`

```typescript
const accessToken = signToken(payload, this.privateKey);
// Primary response is ready — analytics is now fire-and-forget
void this.analyticsService.recordEvent(tenantId, 'token_issued');
tokensIssuedTotal.inc({ scope });
```

The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O). The controller can now send the response. `analyticsService.recordEvent()` is called with `void` — the `await` is deliberately omitted.

**Why `void` instead of `await`?** Token issuance latency must remain below 100 ms (per the QA performance gate). A PostgreSQL write adds 5–15 ms. Since analytics data is aggregated (not transactional), losing an occasional event to an error is acceptable. The response is never delayed for analytics.


---

### Step 2 — AnalyticsService: UPSERT daily counter

**File:** `src/services/AnalyticsService.ts`

```typescript
async recordEvent(tenantId: string, metricType: string): Promise<void> {
  try {
    await this.pool.query(
      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
       VALUES ($1, CURRENT_DATE, $2, 1)
       ON CONFLICT (organization_id, date, metric_type)
       DO UPDATE SET count = analytics_events.count + 1`,
      [tenantId, metricType],
    );
  } catch (err) {
    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
  }
}
```

The `ON CONFLICT DO UPDATE` upsert is atomic. Whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and swallowed — the token has already been returned to the caller.

**Why one row per day per metric, not one row per event?** Storing a row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data (the granularity that analytics dashboards need). Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.


---

### Step 3 — Dashboard query (deferred)

When a developer visits the analytics page in the developer portal, the portal calls:

```
GET /api/v1/analytics/token-trend?days=30
```

**File:** `src/services/AnalyticsService.ts` — `getTokenTrend(tenantId, 30)`

```sql
SELECT
  gs.date::DATE::TEXT AS date,
  COALESCE(ae.count, 0)::INTEGER AS count
FROM generate_series(
  CURRENT_DATE - 29 * INTERVAL '1 day',
  CURRENT_DATE,
  INTERVAL '1 day'
) AS gs(date)
LEFT JOIN analytics_events ae
  ON ae.date = gs.date::DATE
  AND ae.organization_id = $2
  AND ae.metric_type = 'token_issued'
ORDER BY gs.date ASC
```

The `generate_series` + `LEFT JOIN` pattern ensures all 30 days appear in the result, with `count: 0` for days with no events. This avoids the client having to fill in gaps.