sentryagent-idp/openspec/changes/phase-4-developer-growth/design.md at 26a56f84e187dff16f9d2f2aa3d10bea83a59b3d

SentryAgent.ai Developer b0f70b7ac4 feat(openspec): Phase 4 Developer Growth & Go-to-Market Readiness

OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks, delivery sequence:
WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, /health/detailed, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — sentryagent npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free tier enforcement

New capabilities (8 specs): production-hardening, developer-portal, cli-tool,
agent-marketplace, github-actions, billing-metering (+delta: web-dashboard, monitoring)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-02 04:00:34 +00:00

Context

SentryAgent.ai has completed three phases of development: Phase 1 (MVP — core agent registry, OAuth 2.0, audit log), Phase 2 (Production-Ready — Vault, 4 SDKs, OPA, React dashboard, Prometheus, Terraform), and Phase 3 (Enterprise — multi-tenancy, W3C DIDs, OIDC, AGNTCY federation, webhooks, SOC 2). The product is technically complete and enterprise-grade.

Phase 4's constraint is that the codebase is a single Express + TypeScript monorepo (src/) with a co-located React dashboard (dashboard/). The new developer portal and CLI are independent packages that must not couple into the existing API codebase beyond HTTP calls to the public API.

Known technical debt to resolve before launch: the GET /audit/verify rate limiter is process-local (express-rate-limit in-memory store), which breaks under horizontal scaling. This must be fixed before public launch.

Goals / Non-Goals

Goals:

Eliminate the in-memory rate limiter gap — all rate limiting is Redis-backed and horizontally safe
Give developers a public portal to discover, learn, and onboard onto SentryAgent.ai
Ship a CLI that lets developers manage agents from their terminal without writing code
Create a public agent marketplace powered by existing agent registry + DID infrastructure
Enable CI/CD-native agent identity via GitHub Actions OIDC federation
Lay the monetization foundation — usage metering, Stripe billing, free/paid tier enforcement

Non-Goals:

Multi-cloud or self-hosted billing (Stripe only)
Full SaaS admin panel (beyond existing React dashboard additions)
Mobile apps
WebSocket-based real-time CLI tail (polling is acceptable for MVP)
Marketplace payments or agent listings with pricing (discovery only, no transactions)

Decisions

ADR-1: ioredis replaces express-rate-limit in-memory store

Decision: Switch from express-rate-limit (default memory store) to a Redis-backed sliding window using ioredis + rate-limiter-flexible.
Rationale: The in-memory store is process-local, so horizontal scaling (multiple Express instances behind a load balancer) gives each process its own rate limit window; with N instances a client can make up to N times the configured limit. ioredis is already the preferred Redis client (faster, promises-native, cluster-aware). rate-limiter-flexible is battle-tested and supports sliding window, fixed window, and token bucket algorithms in Redis.
Alternatives considered: redis (official client): less ergonomic, no cluster support out of the box. express-rate-limit with the rate-limit-redis store: an additional dependency on top of ioredis, with less control.
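
The scaling problem described in the rationale can be illustrated with a small simulation. This is a sketch only: it uses a fixed-window counter rather than the sliding window the ADR selects, a `Map` stands in for Redis, and the names `FixedWindowLimiter` and `countAllowed` are hypothetical. It compares three instances with private stores against three instances sharing one store.

```typescript
// A Map stands in for the counter store. In the real change the shared
// store would be Redis; here it only needs get/set semantics.
type CounterStore = Map<string, number>;

class FixedWindowLimiter {
  private readonly store: CounterStore;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(store: CounterStore, limit: number, windowMs: number) {
    this.store = store;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request fits within the limit for the current window.
  consume(key: string, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    const bucket = `${key}:${windowStart}`;
    const count = (this.store.get(bucket) ?? 0) + 1;
    this.store.set(bucket, count);
    return count <= this.limit;
  }
}

// Send `requests` calls from one client, spread round-robin across instances,
// all inside the same window, and count how many are allowed.
function countAllowed(instances: FixedWindowLimiter[], requests: number): number {
  let allowed = 0;
  for (let i = 0; i < requests; i++) {
    const instance = instances[i % instances.length];
    if (instance.consume("tenant-a", 1_000)) allowed++;
  }
  return allowed;
}

const LIMIT = 10;
const WINDOW_MS = 60_000;

// Three processes, each with its own in-memory store (current behaviour).
const isolated = [0, 1, 2].map(
  () => new FixedWindowLimiter(new Map(), LIMIT, WINDOW_MS),
);

// Three processes sharing one store (the role Redis plays after this change).
const sharedStore: CounterStore = new Map();
const shared = [0, 1, 2].map(
  () => new FixedWindowLimiter(sharedStore, LIMIT, WINDOW_MS),
);

const allowedIsolated = countAllowed(isolated, 60);
const allowedShared = countAllowed(shared, 60);
console.log({ allowedIsolated, allowedShared });
```

With private stores the client gets three times the configured limit through; with a shared store the limit holds regardless of instance count.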

ADR-2: Developer portal is a separate Next.js 14 app in `portal/`

Decision: The developer portal lives at portal/ as a standalone Next.js 14 application, not inside the existing dashboard/ React app.
Rationale: The portal is a public-facing marketing/onboarding site (unauthenticated), not an internal management dashboard (authenticated). Mixing public and authenticated surfaces in one bundle increases attack surface and deployment complexity. portal/ can be deployed independently (Vercel, Cloudflare Pages) while the dashboard remains behind the API.
Alternatives considered: Single React app with public/private routing (increases bundle size and complicates auth guards). Embedding the portal in existing Express static serving (prevents CDN-edge deployment).

ADR-3: CLI is a standalone npm package in `cli/`

Decision: The sentryagent CLI lives at cli/ with its own package.json and is published separately to npm as sentryagent.
Rationale: CLI users install globally (npm i -g sentryagent). Bundling the CLI into the API monorepo would force users to install all API dependencies. A separate package gives a minimal install surface, independent versioning, and a dedicated README on npm.
Alternatives considered: Monorepo workspace (possible, but adds tooling complexity for a single-package CLI).

ADR-4: Agent Marketplace is implemented as new routes in the existing Express API

Decision: Marketplace endpoints (GET /marketplace/agents, GET /marketplace/agents/:id) are added to the existing Express API, not a separate service.
Rationale: Marketplace data is derived from the existing agents table + DID infrastructure. It is a read-only projection of existing data with public access controls, so no new persistence layer is needed. Adding routes to Express is the simplest, lowest-risk approach.
Alternatives considered: Separate microservice (unnecessary complexity for read-only projections of existing data).

ADR-5: GitHub Actions use OIDC token exchange (not stored secrets)

Decision: The sentryagent/register-agent and sentryagent/issue-token Actions use GitHub's OIDC provider to exchange a GitHub-issued JWT for a SentryAgent.ai agent token. No API keys are stored in GitHub Secrets.
Rationale: Storing long-lived API keys in GitHub Secrets creates a credential leak risk: a secret can be printed to logs or exfiltrated by a compromised workflow step, and it stays valid until someone rotates it. OIDC token exchange is keyless; credentials are ephemeral and scoped to the workflow run. The existing OIDC Provider (Phase 3 WS3) already supports external OIDC federation.
Alternatives considered: API key in GitHub Secrets (simpler, but carries the credential leak risk). GitHub App installation tokens (more complex, not needed when OIDC already exists).
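
The claim checks on the receiving side of the exchange might look like the sketch below. It assumes the JWT signature has already been verified against GitHub's published keys and covers only the trust decision. The `TrustPolicy` shape, the `checkTrust` function, and the example audience URL are hypothetical; the issuer URL and the `sub` format are the ones GitHub documents for Actions OIDC tokens.

```typescript
// Subset of the claims GitHub includes in an Actions OIDC token.
interface GitHubOidcClaims {
  iss: string;
  aud: string;
  sub: string; // e.g. "repo:octo-org/octo-repo:ref:refs/heads/main"
  repository: string; // e.g. "octo-org/octo-repo"
  exp: number; // expiry, seconds since the Unix epoch
}

// Hypothetical per-tenant trust policy; field names are illustrative only.
interface TrustPolicy {
  audience: string;
  allowedRepositories: string[];
  allowedSubjects: string[]; // exact-match subjects, no wildcards in this sketch
}

const GITHUB_ISSUER = "https://token.actions.githubusercontent.com";

// Returns null when the claims satisfy the policy, otherwise a rejection reason.
// Assumes signature verification happened before this function is called.
function checkTrust(
  claims: GitHubOidcClaims,
  policy: TrustPolicy,
  nowSec: number,
): string | null {
  if (claims.iss !== GITHUB_ISSUER) return "unexpected issuer";
  if (claims.aud !== policy.audience) return "audience mismatch";
  if (claims.exp <= nowSec) return "token expired";
  if (policy.allowedRepositories.indexOf(claims.repository) === -1) {
    return "repository not trusted";
  }
  if (policy.allowedSubjects.indexOf(claims.sub) === -1) {
    return "subject not trusted";
  }
  return null;
}

const examplePolicy: TrustPolicy = {
  audience: "https://api.sentryagent.example", // placeholder audience
  allowedRepositories: ["octo-org/octo-repo"],
  allowedSubjects: ["repo:octo-org/octo-repo:ref:refs/heads/main"],
};

const exampleClaims: GitHubOidcClaims = {
  iss: GITHUB_ISSUER,
  aud: "https://api.sentryagent.example",
  sub: "repo:octo-org/octo-repo:ref:refs/heads/main",
  repository: "octo-org/octo-repo",
  exp: 2_000,
};

const trusted = checkTrust(exampleClaims, examplePolicy, 1_000);
const fromOtherRepo = checkTrust(
  { ...exampleClaims, repository: "evil-org/octo-repo" },
  examplePolicy,
  1_000,
);
console.log({ trusted, fromOtherRepo });
```

Exact matching on `sub` is the conservative choice here; prefix or wildcard matching would need careful delimiter handling so that one repository name cannot match another that merely starts with the same characters.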

ADR-6: Billing uses Stripe with webhook-driven state synchronization

Decision: Stripe Checkout + Stripe Webhooks drive subscription state. SentryAgent.ai does not poll Stripe; it receives webhook events (customer.subscription.created, invoice.payment_succeeded, customer.subscription.deleted) and uses them to update a tenant_subscriptions table.
Rationale: Polling Stripe for subscription status introduces latency and API rate limit risk. Webhook-driven state is the Stripe-recommended pattern. Tenant subscription state is stored locally to avoid Stripe API calls on every request.
Alternatives considered: Paddle (less developer familiarity, smaller ecosystem). Lemon Squeezy (less mature). Manual invoicing (not scalable).
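
Because subscription state is driven entirely by webhooks, every event has to be authenticated before it touches tenant_subscriptions. The sketch below reproduces Stripe's documented signature scheme (HMAC-SHA256 over `timestamp.payload`, carried in the `Stripe-Signature` header as `t=...,v1=...`) using only Node's crypto module, to show what the check involves. It is illustrative: the function names and the test secret are made up, and a real handler would more likely call the Stripe SDK's `stripe.webhooks.constructEvent()` than hand-roll this.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex signature Stripe would attach for this body and timestamp.
function signPayload(rawBody: string, timestamp: number, secret: string): string {
  return createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest("hex");
}

// Verify a Stripe-Signature header of the form "t=<unix seconds>,v1=<hex>[,v1=<hex>]".
// rawBody must be the unparsed request body exactly as received.
function verifyStripeSignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSec: number,
  toleranceSec = 300, // five minutes, matching the Stripe libraries' default
): boolean {
  let timestamp: number | null = null;
  const candidates: string[] = [];
  for (const item of header.split(",")) {
    const idx = item.indexOf("=");
    if (idx === -1) continue;
    const key = item.slice(0, idx).trim();
    const value = item.slice(idx + 1).trim();
    if (key === "t") timestamp = Number(value);
    else if (key === "v1") candidates.push(value);
  }
  if (timestamp === null || !Number.isFinite(timestamp)) return false;
  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSec - timestamp) > toleranceSec) return false;

  const expected = Buffer.from(signPayload(rawBody, timestamp, secret), "hex");
  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, "hex");
    // timingSafeEqual throws on length mismatch, so check lengths first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}

// Demonstration with a made-up secret and event body.
const demoSecret = "whsec_test_secret";
const demoBody = JSON.stringify({ type: "customer.subscription.created" });
const demoTimestamp = 1_700_000_000;
const demoHeader = `t=${demoTimestamp},v1=${signPayload(demoBody, demoTimestamp, demoSecret)}`;

const validEvent = verifyStripeSignature(demoBody, demoHeader, demoSecret, demoTimestamp + 10);
const tamperedEvent = verifyStripeSignature(demoBody + " ", demoHeader, demoSecret, demoTimestamp + 10);
const staleEvent = verifyStripeSignature(demoBody, demoHeader, demoSecret, demoTimestamp + 301);
console.log({ validEvent, tamperedEvent, staleEvent });
```

The signature covers the exact bytes Stripe sent, so the webhook route has to read the raw body before any JSON parsing middleware runs.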

ADR-7: Usage metering uses in-request counters flushed to PostgreSQL

Decision: Per-request middleware increments in-memory counters per tenant per metric type (api_calls, token_issuances). A 60-second flush interval writes aggregated counts to a usage_events table in PostgreSQL. Free tier limits are checked at request time against a cached summary.
Rationale: Synchronous database writes on every API request would add latency and DB load. Async aggregation + periodic flush gives near-real-time metering with minimal overhead. Redis could buffer these, but PostgreSQL is sufficient for MVP flush intervals.
Alternatives considered: Stripe Metered Billing API (report per-unit usage to Stripe): locked to Stripe, adds latency on usage reporting, complex to roll back. ClickHouse/TimescaleDB: overkill for MVP scale.
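
The counter-and-flush pattern above could be sketched roughly as follows. The class and function names (`UsageMeter`, `startFlushing`) and the row shape are assumptions made for illustration; the database write is injected as a callback so the sketch stays independent of any particular PostgreSQL client.

```typescript
type Metric = "api_calls" | "token_issuances";

// One aggregated row per tenant and metric, destined for usage_events.
interface UsageRow {
  tenantId: string;
  metric: Metric;
  count: number;
}

class UsageMeter {
  private counters = new Map<string, Map<Metric, number>>();

  // Called from request middleware; touches memory only, no I/O.
  increment(tenantId: string, metric: Metric, by = 1): void {
    let perTenant = this.counters.get(tenantId);
    if (perTenant === undefined) {
      perTenant = new Map<Metric, number>();
      this.counters.set(tenantId, perTenant);
    }
    perTenant.set(metric, (perTenant.get(metric) ?? 0) + by);
  }

  // Swap in a fresh map and return everything accumulated so far.
  drain(): UsageRow[] {
    const snapshot = this.counters;
    this.counters = new Map();
    const rows: UsageRow[] = [];
    snapshot.forEach((metrics, tenantId) => {
      metrics.forEach((count, metric) => {
        rows.push({ tenantId, metric, count });
      });
    });
    return rows;
  }
}

// Flush aggregated counts on an interval (60 s by default, per the ADR).
// `write` is a hypothetical function that inserts rows into usage_events.
// Returns a function that stops the timer.
function startFlushing(
  meter: UsageMeter,
  write: (rows: UsageRow[]) => Promise<void>,
  intervalMs = 60_000,
): () => void {
  const timer = setInterval(() => {
    const rows = meter.drain();
    if (rows.length === 0) return;
    write(rows).catch((err) => console.error("usage flush failed", err));
  }, intervalMs);
  return () => clearInterval(timer);
}

// Demonstration of the accumulate-then-drain cycle.
const meter = new UsageMeter();
meter.increment("tenant-1", "api_calls");
meter.increment("tenant-1", "api_calls");
meter.increment("tenant-1", "api_calls");
meter.increment("tenant-1", "token_issuances");
meter.increment("tenant-2", "api_calls", 5);

const drainedRows = meter.drain();
const secondDrain = meter.drain();
console.log(drainedRows, secondDrain.length);
```

One trade-off worth noting: in this sketch a failed write drops the drained counts, and counts still in memory are lost if the process exits before the next flush. Whether that is acceptable for billing purposes is a decision the sketch does not settle.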

Risks / Trade-offs

[Risk] Portal deployment is separate from API → Mitigation: Document CORS configuration clearly. Portal calls the public API via NEXT_PUBLIC_API_URL env var. Deployments are independent.
[Risk] CLI polling for audit tail adds API load → Mitigation: Polling interval defaults to 5s with exponential backoff. Document this limitation. Real-time tail via WebSockets is a Phase 5 enhancement.
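[Risk] Stripe webhook signature verification must be enforced → Mitigation: All webhook handlers verify the stripe-signature header using stripe.webhooks.constructEvent() before processing. Requests that fail verification are rejected and never processed. constructEvent() needs the raw, unparsed request body, so the webhook route must not run behind JSON body parsing.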
[Risk] GitHub Actions OIDC requires trust policy configuration per repo → Mitigation: Document trust policy setup clearly in Action README. Provide a quickstart template for /.github/workflows/sentryagent-setup.yml.
[Risk] Free tier limit checks add latency on every request → Mitigation: Limit summaries are cached in Redis with a 60s TTL. Stale cache means brief over-limit grace window — acceptable for MVP.
[Risk] ioredis migration may break existing Redis usage → Mitigation: Existing Redis usage (Bull queue, session) already uses ioredis under the hood (Bull requires it). Migration is additive: only the rate-limiter middleware changes, and no existing code is removed.
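
For the CLI polling mitigation, the delay schedule might be computed as in the sketch below. The 5s base comes from the text above; the 60s cap, the doubling factor, and the choice to back off on consecutive empty or failed polls (resetting when new events arrive) are assumptions, as is the function name.

```typescript
const BASE_DELAY_MS = 5_000; // default polling interval from the risk entry
const MAX_DELAY_MS = 60_000; // assumed cap, not specified in this document

// Delay before the next poll, given how many consecutive polls returned
// nothing new (or failed). Zero idle polls means "poll at the base interval".
function nextDelayMs(consecutiveIdlePolls: number): number {
  // Clamp the exponent so the intermediate value cannot grow unbounded.
  const exponent = Math.min(Math.max(consecutiveIdlePolls, 0), 10);
  return Math.min(BASE_DELAY_MS * 2 ** exponent, MAX_DELAY_MS);
}

// Delays after 0, 1, 2, 3, and 4 consecutive idle polls.
const schedule = [0, 1, 2, 3, 4].map((idle) => nextDelayMs(idle));
console.log(schedule);
```

Resetting the idle count whenever a poll returns events keeps the tail responsive during bursts while reducing load on a quiet audit log.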

Migration Plan

WS1 first (before any public traffic): deploy ioredis rate limiter, connection pool tuning, and detailed health endpoint. Run k6 load tests. Only proceed to WS2+ after load tests pass.
WS2 + WS3 in parallel: portal and CLI are independent. Portal deployed to CDN/Vercel. CLI published to npm.
WS4: Marketplace routes added to Express API behind feature flag (MARKETPLACE_ENABLED=true). Enable after WS1 hardening is confirmed stable.
WS5: GitHub Actions published to GitHub Actions Marketplace after OIDC trust policy documentation is complete.
WS6 last: Billing affects all tenants. Stripe webhooks registered in Stripe dashboard. tenant_subscriptions table migration applied. Free tier limits initially set generously; tightened after monitoring confirms limit logic is correct.

Rollback strategy per workstream:

WS1: Rate limiter is middleware — revert to in-memory store by toggling env var (REDIS_RATE_LIMIT_ENABLED=false)
WS2: Portal is separate deployment — roll back independently
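WS3: npm package — deprecate the affected version (npm deprecate) and publish a fixed release; npm unpublish is restricted by npm's unpublish policy and should not be relied on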
WS4: Feature flag MARKETPLACE_ENABLED=false
WS5: GitHub Actions are versioned — pin to prior release tag
WS6: Feature flag BILLING_ENABLED=false — disables enforcement, metering continues
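
Three of the rollback paths above depend on environment flags. A minimal sketch of how those flags might be read is shown below. The flag names come from this document; the `FeatureFlags` shape, the accepted spellings, and especially the default values are assumptions that the implementation would need to confirm.

```typescript
interface FeatureFlags {
  redisRateLimit: boolean;
  marketplace: boolean;
  billing: boolean;
}

// Parse a boolean-like environment value, falling back when unset or unrecognised.
function parseFlag(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined) return fallback;
  const normalised = value.trim().toLowerCase();
  if (normalised === "true" || normalised === "1") return true;
  if (normalised === "false" || normalised === "0") return false;
  return fallback;
}

// Defaults here are assumed: Redis limiter on, marketplace and billing off.
function readFlags(env: Record<string, string | undefined>): FeatureFlags {
  return {
    redisRateLimit: parseFlag(env["REDIS_RATE_LIMIT_ENABLED"], true),
    marketplace: parseFlag(env["MARKETPLACE_ENABLED"], false),
    billing: parseFlag(env["BILLING_ENABLED"], false),
  };
}

// Example: a WS1 rollback with the marketplace already enabled.
const rollbackFlags = readFlags({
  REDIS_RATE_LIMIT_ENABLED: "false",
  MARKETPLACE_ENABLED: "true",
});
console.log(rollbackFlags);
```

In the API process this would typically be called once at startup as `readFlags(process.env)`, so a rollback is a configuration change plus a restart rather than a code deploy.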

Open Questions

Portal domain: Will portal/ be served from sentryagent.ai (marketing site) or app.sentryagent.ai (portal subdomain)? Affects CORS and Next.js basePath config. Recommend: sentryagent.ai for portal, app.sentryagent.ai for dashboard.
Free tier limits: Are 10 agents and 1,000 API calls/day the final limits, or placeholders? If placeholder, billing enforcement should be gated behind BILLING_ENABLED flag until limits are confirmed.
Marketplace moderation: Will agent marketplace listings be auto-published on registration, or require manual approval? Recommend: auto-publish for MVP, flag-based moderation later.
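
To make the free tier question concrete, the enforcement check might be shaped like the sketch below, with the limits kept in one place so they are easy to change once confirmed. The 10-agent and 1,000-calls-per-day figures are the placeholder values from the question above; the types, the function name, and the exact boundary semantics (a request is denied once the counter has reached the limit) are assumptions.

```typescript
interface TierLimits {
  maxAgents: number;
  maxApiCallsPerDay: number;
}

// Usage summary as it might be read from the cache at request time.
interface UsageSummary {
  agents: number;
  apiCallsToday: number;
}

type Action = "api_call" | "create_agent";

type Decision = { allowed: true } | { allowed: false; reason: string };

// Placeholder limits from the open question; not confirmed as final.
const FREE_TIER: TierLimits = { maxAgents: 10, maxApiCallsPerDay: 1_000 };

// Decide whether a free-tier tenant may perform `action` given current usage.
// With billing disabled, nothing is enforced (metering can still run elsewhere).
function checkFreeTier(
  usage: UsageSummary,
  action: Action,
  billingEnabled: boolean,
  limits: TierLimits = FREE_TIER,
): Decision {
  if (!billingEnabled) return { allowed: true };
  if (action === "create_agent" && usage.agents >= limits.maxAgents) {
    return { allowed: false, reason: "agent limit reached" };
  }
  if (usage.apiCallsToday >= limits.maxApiCallsPerDay) {
    return { allowed: false, reason: "daily API call limit reached" };
  }
  return { allowed: true };
}

const atAgentLimit = checkFreeTier({ agents: 10, apiCallsToday: 0 }, "create_agent", true);
const enforcementOff = checkFreeTier({ agents: 10, apiCallsToday: 5_000 }, "create_agent", false);
const lastAllowedCall = checkFreeTier({ agents: 3, apiCallsToday: 999 }, "api_call", true);
const overCallLimit = checkFreeTier({ agents: 3, apiCallsToday: 1_000 }, "api_call", true);
console.log({ atAgentLimit, enforcementOff, lastAllowedCall, overCallLimit });
```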
