chore(openspec): archive phase-5-scale-ecosystem — 68/68 tasks complete

WS1 (Rust SDK), WS2 (A2A Authorization), WS5 (Developer Experience)
all delivered, QA gates passed, committed to main.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-04-03 02:54:45 +00:00
parent eaabaebf52
commit 8fd6823581
7 changed files with 0 additions and 58 deletions


@@ -0,0 +1,279 @@
## Context
SentryAgent.ai has completed four phases: Phase 1 (MVP — core agent registry, OAuth 2.0, audit log), Phase 2 (Production-Ready — Vault, 4 language SDKs, OPA, React dashboard, Prometheus, Terraform multi-region), Phase 3 (Enterprise — multi-tenancy, W3C DIDs, OIDC, AGNTCY federation, webhooks, SOC 2 controls), and Phase 4 (Developer Growth — production hardening, developer portal, CLI, agent marketplace, GitHub Actions, Stripe billing). The product is technically complete, commercially launched, and has an active developer community.
Phase 5 operates on a stable, proven foundation. Every new workstream is additive — no existing service is refactored, only extended. The architecture constraint that governs Phase 5 is: **any new service MUST follow the existing DI pattern (constructor injection of typed interfaces), MUST emit Prometheus metrics, and MUST be covered by the existing QA gate (>80% coverage, OpenAPI spec-first).**
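The constraint above can be illustrated with a minimal sketch. The names `MetricCounter`, `DelegationRepository`, and `DelegationStatsService` are hypothetical and not taken from the codebase; the counter interface stands in for whatever Prometheus client the services actually use.

```typescript
// Metric abstraction (hypothetical). In the real service this would be backed
// by a Prometheus client; a plain interface keeps the sketch dependency-free.
interface MetricCounter {
  inc(labels: Record<string, string>): void;
}

// Typed dependency interface (hypothetical name and method).
interface DelegationRepository {
  countActive(tenantId: string): Promise<number>;
}

class DelegationStatsService {
  private readonly repo: DelegationRepository;
  private readonly requests: MetricCounter;

  // Dependencies arrive through the constructor as interfaces, so tests can
  // substitute in-memory fakes without touching the service.
  constructor(repo: DelegationRepository, requests: MetricCounter) {
    this.repo = repo;
    this.requests = requests;
  }

  async activeDelegations(tenantId: string): Promise<number> {
    // Every public operation records a metric before doing its work.
    this.requests.inc({ operation: "active_delegations" });
    return this.repo.countActive(tenantId);
  }
}

// In-memory fakes, as a unit test might supply them.
class InMemoryCounter implements MetricCounter {
  total = 0;
  inc(_labels: Record<string, string>): void {
    this.total += 1;
  }
}

const fakeRepo: DelegationRepository = {
  countActive: async (tenantId: string) => (tenantId === "tenant-1" ? 3 : 0),
};
const requestCounter = new InMemoryCounter();
const statsService = new DelegationStatsService(fakeRepo, requestCounter);
```

Because the service only sees interfaces, the same class can be wired to PostgreSQL and a real metrics registry in production and to fakes under the coverage gate.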
## Goals / Non-Goals
**Goals:**
- Complete language SDK parity — Rust is the only major language missing a first-party SDK
- Introduce A2A delegation as a first-class authorization primitive aligned with AGNTCY multi-agent workflows
- Give paying tenants visibility into their own usage patterns through analytics
- Expose multi-tier rate limits as a self-service commercial lever
- Eliminate DX friction for new developers — scaffold generation reduces time-to-first-request to under 5 minutes
- Certify AGNTCY compliance formally — this is a competitive moat
**Non-Goals:**
- Real-time WebSocket-based analytics streaming (batch/polling is acceptable for MVP analytics)
- Full marketplace monetization (agent listings with pricing); the marketplace stays discovery-only, with no transactions
- Native mobile SDK (iOS/Android)
- GraphQL API surface
- Webhook delivery for analytics events (Phase 6 if needed)
## Decisions
### ADR-1: Rust SDK is a standalone Cargo crate in `sdk-rust/` with no code generation
**Decision:** The Rust SDK is a hand-authored Cargo crate at `sdk-rust/`, not generated from the OpenAPI spec using `openapi-generator`.
**Rationale:** Code generation produces idiomatically poor Rust — `openapi-generator`'s Rust output does not use `async/await` idiomatically, does not produce proper `thiserror`-based error types, and generates `unwrap()` calls in critical paths. Hand-authored code ensures idiomatic Rust: `async/await` throughout, `Arc<Mutex<TokenCache>>` for thread-safe token caching, `Result<T, AgentIdPError>` for every fallible operation, and zero `unwrap()` in library code. The SDK API surface mirrors the Go SDK pattern (the most recently authored, cleanest SDK) to minimize cognitive load for polyglot teams.
**Alternatives considered:** `openapi-generator --generator rust` — produces non-idiomatic output, requires post-processing, hard to maintain. `progenitor` (Oxide) — excellent output but requires forking Oxide's toolchain and adds a complex build dependency.
---
### ADR-2: A2A delegation chains are stored in PostgreSQL, verified cryptographically at request time
**Decision:** Delegation chains are stored as rows in a `delegation_chains` table. Each row captures: delegator agent ID, delegatee agent ID, granted scopes, expiry, and a cryptographic signature over the delegation payload using the delegator's credential secret. Verification at `POST /oauth2/token/verify-delegation` reconstructs and verifies the chain signature.
**Rationale:** Storing the full delegation chain in the database enables: (1) audit log entries with full chain context, (2) revocation of any link in a chain (invalidating all downstream delegations), and (3) analytics over delegation depth and patterns. Cryptographic signing at issuance means the database is the source of truth but is not trusted blindly — the chain is independently verifiable.
**Alternatives considered:** JWT-encoded delegation claims only (no DB storage) — enables verification without a DB hit but prevents revocation and audit. Blockchain-anchored delegation — extreme overkill for MVP scale, operational complexity exceeds benefit.
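As a rough sketch of the issuance and verification checks in ADR-2: the ADR does not name a signature scheme, so HMAC-SHA256 keyed with the delegator's credential secret is an assumption here, as are the payload field names and the `canonicalize` helper.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape; field names are illustrative.
interface DelegationPayload {
  delegatorId: string;
  delegateeId: string;
  scopes: string[];
  expiresAt: number; // Unix epoch seconds
}

// Serialize with a fixed field order and sorted scopes so that signing and
// verification always hash identical bytes.
function canonicalize(p: DelegationPayload): string {
  return JSON.stringify([p.delegatorId, p.delegateeId, [...p.scopes].sort(), p.expiresAt]);
}

// Assumption: HMAC-SHA256 keyed with the delegator's credential secret.
function signDelegation(p: DelegationPayload, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(p)).digest("hex");
}

// Issuance-time check: delegated scopes must be a subset of the delegator's.
function isSubset(requested: string[], held: string[]): boolean {
  const heldSet = new Set(held);
  return requested.every((s) => heldSet.has(s));
}

function verifyDelegation(
  p: DelegationPayload,
  signature: string,
  secret: string,
  nowSeconds: number,
): boolean {
  const expected = Buffer.from(signDelegation(p, secret), "hex");
  const actual = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== actual.length) return false;
  return timingSafeEqual(expected, actual) && p.expiresAt > nowSeconds;
}

const delegatorSecret = "example-secret"; // placeholder value
const grant: DelegationPayload = {
  delegatorId: "agent-a",
  delegateeId: "agent-b",
  scopes: ["agents:read"],
  expiresAt: 2000,
};
const grantSignature = signDelegation(grant, delegatorSecret);
```

Revocation is not covered by the signature check; in the design above it is a separate lookup against the stored `delegation_chains` row.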
---
### ADR-3: Analytics are computed from `usage_events` table using pre-aggregated daily summaries
**Decision:** The analytics endpoints (`GET /analytics/usage-summary`, `GET /analytics/agent-activity`, `GET /analytics/token-trends`) query a new `analytics_daily_aggregates` table that is populated by a nightly aggregation job (pg_cron or a Node.js cron via `node-cron`). Raw `usage_events` rows are not queried at API request time.
**Rationale:** The `usage_events` table is append-only and grows without bound. Scanning it for date-range analytics would produce full-table scans at production scale. Pre-aggregated daily summaries (`tenant_id`, `agent_id`, `date`, `metric_type`, `count`) enable O(days) queries regardless of event volume. The aggregation job runs at 00:05 UTC daily to aggregate the previous day's events.
**Alternatives considered:** Real-time aggregation using PostgreSQL window functions: acceptable at small scale, degrades catastrophically at 10M+ events. TimescaleDB hypertables: an excellent solution, but TimescaleDB is a PostgreSQL extension that must be installed and operated on every database instance, an infrastructure dependency disproportionate to Phase 5 scope.
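The aggregation step can be sketched in memory as follows. The row shapes are assumptions: the real `usage_events` schema is not shown here, and the `count` field on each event is inferred from the 60-second counter flush rather than confirmed.

```typescript
// Hypothetical row shapes for usage_events and analytics_daily_aggregates.
interface UsageEvent {
  tenantId: string;
  agentId: string;
  metricType: string;
  occurredAt: string; // ISO 8601 timestamp in UTC
  count: number; // assumption: each flushed row carries a counter value
}

interface DailyAggregate {
  tenantId: string;
  agentId: string;
  date: string; // YYYY-MM-DD (UTC)
  metricType: string;
  count: number;
}

function aggregateKey(r: DailyAggregate): string {
  return [r.tenantId, r.agentId, r.date, r.metricType].join("|");
}

// Sum all events for one UTC day, grouped by the aggregate's unique key.
function aggregateDay(events: UsageEvent[], date: string): DailyAggregate[] {
  const byKey = new Map<string, DailyAggregate>();
  for (const e of events) {
    if (e.occurredAt.slice(0, 10) !== date) continue;
    const row: DailyAggregate = {
      tenantId: e.tenantId,
      agentId: e.agentId,
      date,
      metricType: e.metricType,
      count: e.count,
    };
    const existing = byKey.get(aggregateKey(row));
    if (existing) existing.count += e.count;
    else byKey.set(aggregateKey(row), row);
  }
  return Array.from(byKey.values());
}

// Upsert semantics: a re-run replaces the row instead of adding to it,
// which is what makes the job safe to repeat for any date.
function upsertAggregates(table: Map<string, DailyAggregate>, rows: DailyAggregate[]): void {
  for (const r of rows) table.set(aggregateKey(r), r);
}
```

Replacing rather than incrementing on conflict is the design choice that keeps a re-run from double-counting.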
---
### ADR-4: Multi-tier rate limits are enforced in a new `TierRateLimiter` middleware that reads tier from `tenant_subscriptions`
**Decision:** A new `TierRateLimiter` middleware replaces the flat rate limiter for authenticated routes. It reads the tenant's current tier (`free` | `pro` | `enterprise`) from a Redis-cached lookup of `tenant_subscriptions` and applies the tier-appropriate rate limit from a static tier definition map. The tier definition map is the single source of truth — also returned verbatim by `GET /tiers`.
**Rationale:** The existing `RateLimiterRedis` middleware applies a single flat limit across all tenants. Multi-tier enforcement requires per-tenant limit keys (already supported by `rate-limiter-flexible` via the `keyPrefix` option) and per-tier limit configurations. Centralizing tier definitions in a static config (not a database table) avoids the complexity of dynamic tier management and keeps tier changes as code changes (reviewed, versioned, deployed).
**Alternatives considered:** API gateway (Kong, AWS API Gateway) for rate limiting — correct long-term architecture but adds operational complexity and cost beyond Phase 5 scope. Per-tenant custom limits stored in DB — too flexible, hard to reason about, no self-service model.
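A simplified, in-memory sketch of the tier lookup and per-tenant limit follows. The limit values are placeholders (the actual tier numbers are not given here), the class name is hypothetical, and a fixed one-minute window stands in for the Redis-backed limiter.

```typescript
type Tier = "free" | "pro" | "enterprise";

// Static tier definition map. The numbers are placeholders, not the
// product's real limits.
const TIER_LIMITS: Record<Tier, { requestsPerMinute: number }> = {
  free: { requestsPerMinute: 60 },
  pro: { requestsPerMinute: 600 },
  enterprise: { requestsPerMinute: 6000 },
};

const WINDOW_MS = 60_000;

class InMemoryTierRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; used: number }>();
  private readonly tierOf: (tenantId: string) => Tier;

  // tierOf stands in for the Redis-cached tenant_subscriptions lookup.
  constructor(tierOf: (tenantId: string) => Tier) {
    this.tierOf = tierOf;
  }

  // Returns true when the request fits in the tenant's current window.
  allow(tenantId: string, nowMs: number): boolean {
    const tier = this.tierOf(tenantId);
    const limit = TIER_LIMITS[tier].requestsPerMinute;
    const windowStart = Math.floor(nowMs / WINDOW_MS) * WINDOW_MS;
    const key = `${tier}:${tenantId}`; // per-tenant key, namespaced by tier
    const current = this.windows.get(key);
    if (!current || current.windowStart !== windowStart) {
      this.windows.set(key, { windowStart, used: 1 });
      return true;
    }
    if (current.used >= limit) return false;
    current.used += 1;
    return true;
  }
}
```

Since `TIER_LIMITS` is plain data, the same object could be serialized as the `GET /tiers` response, which keeps the published limits and the enforced limits from drifting apart.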
---
### ADR-5: Scaffold generator produces a ZIP archive served from `GET /sdk/scaffold/:agentId`
**Decision:** `ScaffoldService` generates an in-memory ZIP archive (using `archiver`) containing language-specific starter files pre-populated with the agent's `clientId` and the API URL. The endpoint streams the ZIP directly from memory — no disk I/O, no S3.
**Rationale:** Scaffold generation is a low-frequency operation that is not latency-sensitive (developers use it once per new project). In-memory generation avoids disk I/O, eliminates cleanup complexity, and produces no persistent artifacts on the server. The `archiver` library supports in-memory streaming to an HTTP response via Node.js streams. Each scaffold is generated on demand and is not cached, because the agent's credentials could rotate between requests.
**Alternatives considered:** Pre-built scaffold templates on S3 with client ID injected at runtime — adds AWS dependency, complicates credential injection. GitHub template repositories — developer must authenticate with GitHub, adds friction. Static downloadable templates — not pre-wired with agent credentials, defeats the purpose.
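The file-generation half of `ScaffoldService` might look roughly like this. The function name, the `SENTRYAGENT_API_URL` variable, and the file contents are illustrative assumptions; ZIP packaging with `archiver` is left out so the sketch stays dependency-free.

```typescript
interface ScaffoldInput {
  agentId: string;
  clientId: string;
  apiUrl: string;
}

// Builds the TypeScript starter files in memory, keyed by file name.
// A real implementation would append each entry to the ZIP stream.
function buildTypeScriptScaffold(input: ScaffoldInput): Map<string, string> {
  const files = new Map<string, string>();

  const pkg = {
    name: `agent-${input.agentId}`,
    version: "0.1.0",
    private: true,
    scripts: { start: "ts-node index.ts" }, // hypothetical start script
  };
  files.set("package.json", JSON.stringify(pkg, null, 2) + "\n");

  // The client ID is pre-filled; the secret is only ever a placeholder.
  files.set(
    ".env.example",
    [
      `SENTRYAGENT_API_URL=${input.apiUrl}`,
      `AGENT_CLIENT_ID=${input.clientId}`,
      "AGENT_CLIENT_SECRET=<your-client-secret>",
      "",
    ].join("\n"),
  );

  files.set(
    "index.ts",
    [
      "const apiUrl = process.env.SENTRYAGENT_API_URL;",
      "const clientId = process.env.AGENT_CLIENT_ID;",
      'console.log("Agent", clientId, "targeting", apiUrl);',
      "",
    ].join("\n"),
  );

  files.set("README.md", `# Starter project for agent ${input.agentId}\n`);
  return files;
}
```

Keeping generation as a pure function of its inputs means nothing needs cleaning up after the response is sent, which matches the no-disk, no-cache rationale.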
---
### ADR-6: AGNTCY compliance report is generated on demand from live system state, not cached
**Decision:** `GET /agntcy/compliance-report` queries live system state — registered agents, DID documents, OIDC configuration, federation policies, audit log retention settings — and generates a structured compliance report in real time. No pre-computed report cache.
**Rationale:** Compliance reports must reflect current system state. A cached report could misrepresent configuration that has changed since the last cache population. The compliance report endpoint is not on the critical path (it is used by compliance officers, not application code), so latency of 500 to 2000 ms is acceptable. The report format is machine-readable JSON (with an optional PDF export hint for human-readable presentation).
**Alternatives considered:** Pre-generated nightly compliance reports stored in S3 — stale by definition, adds S3 dependency. Compliance report built into the monitoring stack (Grafana) — mixing compliance and observability concerns violates single responsibility.
## Component Architecture — How Phase 5 Extends Phase 4
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SentryAgent.ai Platform — Phase 5 │
│ │
│ ┌──────────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Developer Portal (Next.js) │ │ Web Dashboard (React 18) │ │
│ │ ┌────────────────────────┐ │ │ ┌──────────────────────────────────┐│ │
│ │ │ API Explorer │ │ │ │ Analytics Tab (NEW - WS3) ││ │
│ │ │ (Elements v5 — WS5) │ │ │ │ - Agent Activity Heatmap ││ │
│ │ ├────────────────────────┤ │ │ │ - Token Issuance Trends ││ │
│ │ │ Scaffold Download (WS5)│ │ │ │ - Rotation Frequency ││ │
│ │ └────────────────────────┘ │ │ └──────────────────────────────────┘│ │
│ └──────────────────────────────┘ └──────────────────────────────────────┘ │
│ │ │ │
│ └──────────────┬─────────────────────────┘ │
│ │ HTTPS │
│ ┌─────────────────────────────▼────────────────────────────────────────────┐ │
│ │ Express API (Node.js / TypeScript) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │ │
│ │ │ Delegation │ │ Analytics │ │ Tiers & │ │ Scaffold │ │ │
│ │ │ Router (WS2) │ │ Router (WS3)│ │ Upgrade(WS4)│ │ Router (WS5) │ │ │
│ │ └──────┬───────┘ └──────┬──────┘ └──────┬──────┘ └──────┬────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌──────▼───────┐ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼────────┐ │ │
│ │ │Delegation │ │Analytics │ │BillingService│ │Scaffold │ │ │
│ │ │Service (WS2) │ │Service (WS3)│ │(extended WS4)│ │Service (WS5) │ │ │
│ │ └──────┬───────┘ └──────┬──────┘ └─────────────┘ └──────┬────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────▼───────────────────▼───────────────────────────────────▼────────┐ │ │
│ │ │ TierRateLimiter Middleware (WS4) │ │ │
│ │ │ Reads tenant tier from Redis → applies tier-specific limits │ │ │
│ │ └───────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌────────────────────────────────────────────────┐ │ │
│ │ │ AGNTCY Routes │ │ Existing Phase 1-4 Routes (unchanged) │ │ │
│ │ │ (WS6) │ │ /agents, /oauth2, /credentials, /audit, │ │ │
│ │ │ /agntcy/ │ │ /marketplace, /billing, /health, /oidc, etc. │ │ │
│ │ │ compliance-report│ └────────────────────────────────────────────────┘ │ │
│ │ │ /agents/:id/ │ │ │
│ │ │ agent-card │ │ │
│ │ └──────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┬┴──────────────────────┐ │
│ │ │ │ │
│ ┌────────────▼────────┐ ┌───────────▼──────────┐ ┌────────▼──────────────┐ │
│ │ PostgreSQL 14+ │ │ Redis 7+ │ │ External Services │ │
│ │ │ │ │ │ │ │
│ │ delegation_chains │ │ tier_cache:{tenantId} │ │ Stripe (billing) │ │
│ │ (WS2 - new) │ │ delegation_cache:{id} │ │ HashiCorp Vault │ │
│ │ analytics_daily_ │ │ analytics_cache:{k} │ │ OPA Policy Engine │ │
│ │ aggregates (WS3) │ │ │ │ │ │
│ │ tenant_subscriptions │ │ │ │ │ │
│ │ usage_events │ │ │ │ │ │
│ │ (Phase 4 — existing) │ │ │ │ │ │
│ └─────────────────────┘ └────────────────────────┘ └───────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ External SDKs & Tooling │ │
│ │ sdk-rust/ (WS1 — new) │ cli/ (extended WS5) │ AGNTCY Test Suite │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
```
## System-Level Data Flows
### WS2: A2A Delegation Flow
```
Agent A (Delegator) SentryAgent.ai API Agent B (Delegatee)
│ │ │
│ POST /oauth2/token/delegate │ │
│ { agentId: A, delegateeId: B, │ │
│ scopes: [...], ttl: 3600 } │ │
│──────────────────────────────────>│ │
│ │ 1. Authenticate Agent A │
│ │ 2. Validate B exists │
│ │ 3. Verify scopes ⊆ A's scopes │
│ │ 4. Sign delegation payload │
│ │ with A's credential │
│ │ 5. INSERT delegation_chains │
│ │ 6. Return delegation token │
│ { delegationToken, chainId } │ │
│<──────────────────────────────────│ │
│ │ │
│ (out of band: share delegationToken with Agent B) │
│────────────────────────────────────────────────────────────────> │
│ │ │
│ │ POST /oauth2/token/verify-delegation
│ │ { delegationToken } │
│ │<─────────────────────────────│
│ │ 1. Decode token │
│ │ 2. Fetch chain from DB │
│ │ 3. Verify signature │
│ │ 4. Check expiry & revocation │
│ │ 5. Return chain + scopes │
│ │ { valid, scopes, chainId } │
│ │─────────────────────────────>│
```
### WS3: Analytics Aggregation Flow
```
API Request (any route) Middleware PostgreSQL
│ │ │
│ (every authenticated req) │ │
│─────────────────────────────>│ │
│ │ increment in-memory counter │
│ │ {tenantId, agentId, metric} │
│ │ │
│ │ (every 60s flush — Phase 4) │
│ │──────────────────────────────>│
│ │ INSERT usage_events │
│ │ │
│ │
(00:05 UTC daily) │ │
node-cron job │ │
──────────────────────>│ │
│ aggregate usage_events │
│ for previous day │
│──────────────────────────────>│
│ INSERT analytics_daily_ │
│ aggregates (upsert) │
GET /analytics/agent-activity AnalyticsController AnalyticsService
│ │ │
│──────────────────────────────>│ │
│ │ checkCache(Redis) │
│ │ (miss) → queryAggregates() │
│ │─────────────────────────────>│
│ │ SELECT from │
│ │ analytics_daily_aggregates │
│ │<─────────────────────────────│
│ │ writeCache(Redis, 5min TTL) │
│ { agents: [...heatmap] } │ │
│<──────────────────────────────│ │
```
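The read path in the last part of the flow above is a cache-aside lookup. A minimal sketch, with an in-memory map standing in for Redis and hypothetical names throughout (`TtlCache`, `getAgentActivity`, the cache key layout):

```typescript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for Redis with per-key TTL.
class TtlCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();

  get(key: string, nowMs: number): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (nowMs >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number, nowMs: number): void {
    this.entries.set(key, { value, expiresAt: nowMs + ttlMs });
  }
}

type AgentActivity = { agentId: string; requests: number }[];

// Cache-aside: return the cached value on a hit; on a miss, query the
// aggregates and store the result for five minutes.
async function getAgentActivity(
  tenantId: string,
  cache: TtlCache<AgentActivity>,
  queryAggregates: (tenantId: string) => Promise<AgentActivity>,
  nowMs: number,
): Promise<AgentActivity> {
  const key = `analytics_cache:agent-activity:${tenantId}`;
  const cached = cache.get(key, nowMs);
  if (cached !== undefined) return cached;
  const fresh = await queryAggregates(tenantId);
  cache.set(key, fresh, FIVE_MINUTES_MS, nowMs);
  return fresh;
}
```

Since the underlying aggregates only change once per day, a five-minute TTL mostly bounds Redis memory rather than staleness.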
### WS5: Scaffold Generation Flow
```
Developer (CLI) SentryAgent.ai API ScaffoldService
│ │ │
│ sentryagent scaffold │ │
│ --agent-id abc123 │ │
│ --language typescript │ │
│ │ │
│ GET /sdk/scaffold/abc123 │ │
│ ?language=typescript │ │
│─────────────────────────────────>│ │
│ │ authenticate request │
│ │ fetch agent credentials │
│ │──────────────────────────>│
│ │ generateScaffold( │
│ │ agentId, clientId, │
│ │ language, apiUrl) │
│ │ build ZIP in-memory: │
│ │ - package.json │
│ │ - index.ts │
│ │ - .env.example │
│ │ - README.md │
│ │<──────────────────────────│
│ (ZIP stream, Content- │ │
│ Disposition: attachment) │ │
│<─────────────────────────────────│ │
│ │ │
│ unzip → ready-to-run project │ │
```
## Risks / Trade-offs
- **[Risk] Rust SDK compile times in CI** — Mitigation: Use `sccache` in CI to cache compiled Rust dependencies. The SDK has minimal dependencies — compile time is bounded.
- **[Risk] A2A delegation scope creep** — Mitigation: Delegated scopes are strictly a subset of the delegator's own scopes (enforced at issuance, not just verification). A delegatee cannot escalate privileges beyond what the delegator holds.
- **[Risk] Analytics aggregation job failure leaves stale data** — Mitigation: Aggregation job is idempotent (upsert on `(tenant_id, agent_id, date, metric_type)`). A failed job can be re-run for any date without producing duplicate data.
- **[Risk] Scaffold ZIP includes clientId but not clientSecret** — Mitigation: The scaffold `.env.example` includes `AGENT_CLIENT_ID=<your-client-id>` with a placeholder for `AGENT_CLIENT_SECRET=<your-client-secret>`. The secret is never returned by the scaffold endpoint — developers copy it from the credentials page once.
- **[Risk] Elements (Swagger UI v5) breaking change in portal** — Mitigation: Elements is a drop-in React component. The existing `swagger-ui-react` dependency is replaced, not wrapped. The `/api-explorer` page is isolated — no other portal pages are affected.
- **[Risk] AGNTCY compliance report reflects live state but AGNTCY spec may update** — Mitigation: The report includes the AGNTCY spec version it was evaluated against (`agntcy_spec_version` field). Report consumers can detect when the evaluation is stale relative to a newer AGNTCY spec.
## Migration Plan
1. **WS1 first** (independent, no API changes): Build and publish the Rust SDK. No server-side migrations required.
2. **WS2 second** (requires migration `008_add_delegation_chains.sql`): Apply migration first, then deploy delegation endpoints. No breaking changes to existing endpoints.
3. **WS3 + WS4 in parallel** (WS3 requires migration `009_add_analytics_aggregates.sql`; WS4 requires no migration): Apply WS3 migration, deploy analytics endpoints, schedule nightly aggregation job. WS4 tier rate limiter deploys behind `TIER_RATE_LIMITING_ENABLED` feature flag.
4. **WS5** (extends portal and CLI — independent deployments): Deploy portal with Elements upgrade. Publish updated CLI to npm with `scaffold` command.
5. **WS6 last** (reads live system state — no migrations): Deploy AGNTCY compliance endpoints. Run interoperability test suite in CI on every commit going forward.
**Rollback strategy per workstream:**
- WS1 (Rust SDK): Publishing to crates.io is permanent; a release is yanked if a critical bug is found. No server-side rollback is needed.
- WS2 (A2A): Disable delegation routes via `A2A_ENABLED=false` feature flag. `delegation_chains` table is additive — leaving it in place causes no harm.
- WS3 (Analytics): Disable analytics routes via `ANALYTICS_ENABLED=false`. Aggregation job is a cron — disable in deployment config.
- WS4 (Tiers): Revert `TierRateLimiter` middleware to flat `RateLimiterRedis` middleware via `TIER_RATE_LIMITING_ENABLED=false`.
- WS5 (DX): Revert portal deploy to previous version. Publish CLI patch release removing scaffold command.
- WS6 (AGNTCY): Disable AGNTCY routes via `AGNTCY_ENABLED=false` feature flag. No state changes — read-only endpoints.
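The flag-based rollbacks above could be read at startup along these lines. The flag names come from the list; the parsing rule (enabled unless the value is exactly `false`) and the route prefixes returned by `mountedPrefixes` are assumptions, since the defaults are not specified.

```typescript
const FLAG_NAMES = [
  "A2A_ENABLED",
  "ANALYTICS_ENABLED",
  "TIER_RATE_LIMITING_ENABLED",
  "AGNTCY_ENABLED",
] as const;

type FlagName = (typeof FLAG_NAMES)[number];

// Assumption: a flag is on unless its value is "false" (case-insensitive).
function readFlags(env: Record<string, string | undefined>): Record<FlagName, boolean> {
  const parse = (name: FlagName): boolean =>
    (env[name] ?? "true").trim().toLowerCase() !== "false";
  return {
    A2A_ENABLED: parse("A2A_ENABLED"),
    ANALYTICS_ENABLED: parse("ANALYTICS_ENABLED"),
    TIER_RATE_LIMITING_ENABLED: parse("TIER_RATE_LIMITING_ENABLED"),
    AGNTCY_ENABLED: parse("AGNTCY_ENABLED"),
  };
}

// Route prefixes that would be mounted for a given flag set (illustrative).
function mountedPrefixes(flags: Record<FlagName, boolean>): string[] {
  const prefixes: string[] = [];
  if (flags.A2A_ENABLED) prefixes.push("/oauth2/token/delegate", "/oauth2/token/verify-delegation");
  if (flags.ANALYTICS_ENABLED) prefixes.push("/analytics");
  if (flags.AGNTCY_ENABLED) prefixes.push("/agntcy");
  return prefixes;
}

// Which limiter guards authenticated routes.
function limiterFor(flags: Record<FlagName, boolean>): "TierRateLimiter" | "RateLimiterRedis" {
  return flags.TIER_RATE_LIMITING_ENABLED ? "TierRateLimiter" : "RateLimiterRedis";
}
```

In a Node.js process the call would be `readFlags(process.env)`; passing the environment in as a parameter keeps the function testable.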