OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks. Delivery sequence: WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6

Workstreams:

1. Production Hardening: ioredis rate limiting, DB pool tuning, /health/detailed, k6 load tests
2. Developer Portal: Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool: sentryagent npm CLI, 5 commands, shell completion
4. Agent Marketplace: public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions: register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering: Stripe Checkout, webhook-driven state, free tier enforcement

New capabilities (8 specs): production-hardening, developer-portal, cli-tool, agent-marketplace, github-actions, billing-metering (+delta: web-dashboard, monitoring)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

## Context

SentryAgent.ai has completed three phases of development: Phase 1 (MVP: core agent registry, OAuth 2.0, audit log), Phase 2 (Production-Ready: Vault, 4 SDKs, OPA, React dashboard, Prometheus, Terraform), and Phase 3 (Enterprise: multi-tenancy, W3C DIDs, OIDC, AGNTCY federation, webhooks, SOC 2). The product is technically complete and enterprise-grade.

Phase 4's main constraint is the existing layout: the codebase is a single Express + TypeScript monorepo (`src/`) with a co-located React dashboard (`dashboard/`). The new developer portal and CLI are independent packages and must not couple to the existing API codebase beyond HTTP calls to the public API.

Known technical debt: the `GET /audit/verify` rate limiter is process-local (the `express-rate-limit` in-memory store), so each instance counts requests separately and the limit breaks under horizontal scaling. This must be fixed before public launch.

## Goals / Non-Goals

**Goals:**

- Eliminate the in-memory rate limiter gap: all rate limiting is Redis-backed and safe under horizontal scaling
- Give developers a public portal to discover, learn, and onboard onto SentryAgent.ai
- Ship a CLI that lets developers manage agents from their terminal without writing code
- Create a public agent marketplace powered by the existing agent registry + DID infrastructure
- Enable CI/CD-native agent identity via GitHub Actions OIDC federation
- Lay the monetization foundation: usage metering, Stripe billing, free/paid tier enforcement

**Non-Goals:**

- Multi-cloud or self-hosted billing (Stripe only)
- Full SaaS admin panel (beyond additions to the existing React dashboard)
- Mobile apps
- WebSocket-based real-time CLI tail (polling is acceptable for MVP)
- Marketplace payments or agent listings with pricing (discovery only, no transactions)

## Decisions

### ADR-1: ioredis replaces express-rate-limit in-memory store

**Decision:** Switch from `express-rate-limit` (default memory store) to a Redis-backed limiter using `ioredis` + `rate-limiter-flexible`.

**Rationale:** The in-memory store is process-local. With multiple Express instances behind a load balancer, each process keeps its own rate limit window, so the effective limit grows with the number of instances and stops being meaningful. `ioredis` is already in the dependency tree (Bull requires it) and is promise-native and cluster-aware. `rate-limiter-flexible` is battle-tested and keeps its counters in Redis with atomic increments. Its Redis limiter uses a fixed window rather than a sliding one, so a true sliding window would need custom logic on top.

**Alternatives considered:** `redis` (official client): less ergonomic, and it would add a second Redis client next to the `ioredis` instance Bull already requires. `express-rate-limit` with the `rate-limit-redis` store: an additional dependency on top of ioredis, with less control over limiter behavior.

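
The horizontal-scaling problem behind ADR-1 can be shown with a small sketch. It is not the production middleware: `CounterStore`, `MapCounterStore`, `FixedWindowLimiter`, and `countAllowed` are hypothetical names, and the in-memory store only stands in for Redis (where the increment would be an atomic `INCR` with an expiry, issued asynchronously). It compares two limiter instances that share one store with two that each keep a private store.

```typescript
// Minimal counter store. In production this role would be played by Redis;
// the synchronous signature is a simplification to keep the sketch runnable.
interface CounterStore {
  // Increments the counter for `key` and returns the new value.
  increment(key: string): number;
}

// In-memory stand-in for the shared store (hypothetical, for illustration).
class MapCounterStore implements CounterStore {
  private readonly counts = new Map<string, number>();

  increment(key: string): number {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// Fixed-window limiter: at most `limit` requests per `windowMs` per client.
class FixedWindowLimiter {
  private readonly store: CounterStore;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(store: CounterStore, limit: number, windowMs: number) {
    this.store = store;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request from `clientId` at time `nowMs` is allowed.
  allow(clientId: string, nowMs: number): boolean {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    return this.store.increment(`rl:${clientId}:${windowIndex}`) <= this.limit;
  }
}

// Sends `total` requests round-robin across the limiters (one per API
// process) and counts how many are allowed.
function countAllowed(limiters: FixedWindowLimiter[], total: number): number {
  let allowed = 0;
  for (let i = 0; i < total; i++) {
    const limiter = limiters[i % limiters.length];
    if (limiter !== undefined && limiter.allow("tenant-a", 1000)) allowed++;
  }
  return allowed;
}

// Two processes sharing one store enforce a single limit of 5.
const sharedStore = new MapCounterStore();
const sharedAllowed = countAllowed(
  [new FixedWindowLimiter(sharedStore, 5, 60000), new FixedWindowLimiter(sharedStore, 5, 60000)],
  20,
);

// Two processes with private stores each allow 5, doubling the real limit.
const isolatedAllowed = countAllowed(
  [
    new FixedWindowLimiter(new MapCounterStore(), 5, 60000),
    new FixedWindowLimiter(new MapCounterStore(), 5, 60000),
  ],
  20,
);
```

With a shared store, 5 of the 20 requests pass; with private stores, 10 pass, which is the gap WS1 is meant to close.
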
### ADR-2: Developer portal is a separate Next.js 14 app in `portal/`

**Decision:** The developer portal lives at `portal/` as a standalone Next.js 14 application, not inside the existing `dashboard/` React app.

**Rationale:** The portal is a public-facing marketing/onboarding site (unauthenticated), not an internal management dashboard (authenticated). Mixing public and authenticated surfaces in one bundle increases attack surface and deployment complexity. `portal/` can be deployed independently (Vercel, Cloudflare Pages) while the dashboard remains behind the API.

**Alternatives considered:** A single React app with public/private routing: increases bundle size and complicates auth guards. Serving the portal from the existing Express static handler: prevents CDN-edge deployment.

### ADR-3: CLI is a standalone npm package in `cli/`

**Decision:** The `sentryagent` CLI lives at `cli/` with its own `package.json` and is published separately to npm as `sentryagent`.

**Rationale:** CLI users install globally (`npm i -g sentryagent`). Bundling the CLI into the API package would force users to install all API dependencies. A separate package gives a minimal install surface, independent versioning, and a dedicated README on npm.

**Alternatives considered:** Monorepo workspace: possible, but adds tooling complexity for a single-package CLI.

### ADR-4: Agent Marketplace is implemented as new routes in the existing Express API

**Decision:** Marketplace endpoints (`GET /marketplace/agents`, `GET /marketplace/agents/:id`) are added to the existing Express API, not a separate service.

**Rationale:** Marketplace data is derived from the existing `agents` table + DID infrastructure. It is a read-only projection of existing data with public access controls, so no new persistence layer is needed. Adding routes to Express is the simplest, lowest-risk approach.

**Alternatives considered:** Separate microservice: unnecessary complexity for read-only projections of existing data.

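
As an illustration of the "read-only projection" idea in ADR-4, the sketch below maps internal agent rows to public listings and applies a simple text search. The field names (`isPublic`, `tenantId`, `did`, and so on) and both functions are assumptions made for the example; the real `agents` schema and access rules are not specified here.

```typescript
// Hypothetical shape of a row in the existing `agents` table.
interface AgentRecord {
  id: string;
  name: string;
  description: string;
  did: string;
  tenantId: string; // internal field, must never reach the public listing
  isPublic: boolean; // assumed visibility flag
}

// Public view returned by the marketplace endpoints (assumed shape).
interface MarketplaceListing {
  id: string;
  name: string;
  description: string;
  did: string;
}

// Copies only the fields that are safe to expose publicly.
function toListing(agent: AgentRecord): MarketplaceListing {
  return { id: agent.id, name: agent.name, description: agent.description, did: agent.did };
}

// Case-insensitive search over public agents; an empty query lists them all.
function searchMarketplace(agents: AgentRecord[], query: string): MarketplaceListing[] {
  const needle = query.trim().toLowerCase();
  return agents
    .filter((agent) => agent.isPublic)
    .filter(
      (agent) =>
        needle === "" ||
        agent.name.toLowerCase().indexOf(needle) !== -1 ||
        agent.description.toLowerCase().indexOf(needle) !== -1,
    )
    .map(toListing);
}

// Sample data for the sketch.
const sampleAgents: AgentRecord[] = [
  { id: "a1", name: "Invoice Bot", description: "Parses invoices", did: "did:web:example.com:a1", tenantId: "t1", isPublic: true },
  { id: "a2", name: "Internal Bot", description: "Parses invoices too", did: "did:web:example.com:a2", tenantId: "t1", isPublic: false },
  { id: "a3", name: "Translator", description: "Translates text", did: "did:web:example.com:a3", tenantId: "t2", isPublic: true },
];

const invoiceResults = searchMarketplace(sampleAgents, "invoice");
const allPublic = searchMarketplace(sampleAgents, "");
```

Because the projection is a pure function over existing rows, it needs no storage of its own, which matches the rationale for keeping the marketplace inside the Express API.
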
### ADR-5: GitHub Actions use OIDC token exchange (not stored secrets)

**Decision:** The `sentryagent/register-agent` and `sentryagent/issue-token` Actions use GitHub's OIDC provider to exchange a GitHub-issued JWT for a SentryAgent.ai agent token. No API keys are stored in GitHub Secrets.

**Rationale:** Storing long-lived API keys in GitHub Secrets creates a credential leak risk (for example, a secret echoed into logs or exposed by a misconfigured workflow stays valid until it is rotated). OIDC token exchange is keyless: credentials are ephemeral and scoped to the workflow run. The existing OIDC Provider (Phase 3 WS3) already supports external OIDC federation.

**Alternatives considered:** API key in GitHub Secrets: simpler, but carries the credential leak risk. GitHub App installation tokens: more complex, and not needed when OIDC already exists.

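
The claim checks behind the OIDC exchange in ADR-5 might look roughly like the following. This is a sketch under assumptions: signature verification against GitHub's published keys is omitted entirely, and `TrustPolicy`, `evaluateTrust`, and the sample audience URL are hypothetical. The issuer URL and the `repository`, `ref`, `aud`, and `exp` claims are the ones GitHub documents for Actions OIDC tokens.

```typescript
// Subset of the claims carried by a GitHub Actions OIDC token.
interface GitHubOidcClaims {
  iss: string;
  aud: string;
  repository: string; // "owner/repo"
  ref: string; // e.g. "refs/heads/main"
  exp: number; // expiry, seconds since epoch
}

// Hypothetical per-tenant trust policy configured in SentryAgent.ai.
interface TrustPolicy {
  audience: string;
  allowedRepositories: string[];
  allowedRefs: string[];
}

interface TrustDecision {
  ok: boolean;
  reason: string; // empty when ok is true
}

const GITHUB_ISSUER = "https://token.actions.githubusercontent.com";

// Checks already-verified claims against the policy. The JWT signature
// must be verified before this function is called (not shown here).
function evaluateTrust(claims: GitHubOidcClaims, policy: TrustPolicy, nowSeconds: number): TrustDecision {
  if (claims.iss !== GITHUB_ISSUER) return { ok: false, reason: "unexpected issuer" };
  if (claims.aud !== policy.audience) return { ok: false, reason: "unexpected audience" };
  if (claims.exp <= nowSeconds) return { ok: false, reason: "token expired" };
  if (policy.allowedRepositories.indexOf(claims.repository) === -1) {
    return { ok: false, reason: "repository not trusted" };
  }
  if (policy.allowedRefs.indexOf(claims.ref) === -1) return { ok: false, reason: "ref not trusted" };
  return { ok: true, reason: "" };
}

// Sample values for the sketch (the audience URL is a placeholder).
const samplePolicy: TrustPolicy = {
  audience: "https://api.sentryagent.example",
  allowedRepositories: ["acme/agents"],
  allowedRefs: ["refs/heads/main"],
};
const sampleClaims: GitHubOidcClaims = {
  iss: GITHUB_ISSUER,
  aud: "https://api.sentryagent.example",
  repository: "acme/agents",
  ref: "refs/heads/main",
  exp: 2000,
};

const trusted = evaluateTrust(sampleClaims, samplePolicy, 1000);
const untrustedRepo = evaluateTrust({ ...sampleClaims, repository: "mallory/fork" }, samplePolicy, 1000);
const expired = evaluateTrust(sampleClaims, samplePolicy, 3000);
```

Only when every check passes would the API go on to issue a short-lived agent token for the workflow run.
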
### ADR-6: Billing uses Stripe with webhook-driven state synchronization

**Decision:** Stripe Checkout + Stripe Webhooks drive subscription state. SentryAgent.ai does not poll Stripe. It receives webhook events (`customer.subscription.created`, `invoice.payment_succeeded`, `customer.subscription.deleted`) and uses them to update a `tenant_subscriptions` table.

**Rationale:** Polling Stripe for subscription status introduces latency and API rate limit risk. Webhook-driven state is the Stripe-recommended pattern. Tenant subscription state is stored locally to avoid Stripe API calls on every request.

**Alternatives considered:** Paddle: less developer familiarity, smaller ecosystem. Lemon Squeezy: less mature. Manual invoicing: not scalable.

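
A minimal sketch of webhook-driven state for ADR-6 follows. It assumes the event has already passed signature verification and that the tenant has been resolved from the Stripe customer; `StripeEventLike`, `SubscriptionStore`, and the status values are hypothetical, and an in-memory map stands in for the `tenant_subscriptions` table. Since Stripe can deliver the same event more than once, the sketch ignores event IDs it has already seen.

```typescript
type SubscriptionStatus = "active" | "canceled" | "none";

// Simplified, already-verified event (not the full Stripe event object).
interface StripeEventLike {
  id: string;
  type: string;
  tenantId: string; // assumption: resolved from the Stripe customer beforehand
}

// In-memory stand-in for the `tenant_subscriptions` table.
class SubscriptionStore {
  private readonly statusByTenant = new Map<string, SubscriptionStatus>();
  private readonly seenEventIds = new Set<string>();

  // Applies one event. Returns true if subscription state was updated,
  // false for duplicates and for event types this handler ignores.
  apply(event: StripeEventLike): boolean {
    if (this.seenEventIds.has(event.id)) return false;
    this.seenEventIds.add(event.id);

    switch (event.type) {
      case "customer.subscription.created":
      case "invoice.payment_succeeded":
        this.statusByTenant.set(event.tenantId, "active");
        return true;
      case "customer.subscription.deleted":
        this.statusByTenant.set(event.tenantId, "canceled");
        return true;
      default:
        return false;
    }
  }

  // Local lookup used at request time, with no call to Stripe.
  statusOf(tenantId: string): SubscriptionStatus {
    return this.statusByTenant.get(tenantId) ?? "none";
  }
}

const subscriptions = new SubscriptionStore();
const created: StripeEventLike = { id: "evt_1", type: "customer.subscription.created", tenantId: "tenant-a" };

const firstDelivery = subscriptions.apply(created);
const redelivery = subscriptions.apply(created); // same event ID, ignored
const statusAfterCreate = subscriptions.statusOf("tenant-a");

subscriptions.apply({ id: "evt_2", type: "customer.subscription.deleted", tenantId: "tenant-a" });
const statusAfterDelete = subscriptions.statusOf("tenant-a");
```
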
### ADR-7: Usage metering uses in-request counters flushed to PostgreSQL

**Decision:** Per-request middleware increments in-memory counters per tenant per metric type (`api_calls`, `token_issuances`). A 60-second flush interval writes the aggregated counts to a `usage_events` table in PostgreSQL. Free tier limits are checked at request time against a cached summary.

**Rationale:** Synchronous database writes on every API request would add latency and DB load. Async aggregation + periodic flush gives near-real-time metering with minimal overhead. Redis could buffer these counts, but PostgreSQL is sufficient for MVP flush intervals.

**Alternatives considered:** Stripe Metered Billing API (report per-unit usage to Stripe): locked to Stripe, adds latency on usage reporting, complex to roll back. ClickHouse/TimescaleDB: overkill for MVP scale.

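
The count-then-flush pattern from ADR-7 can be sketched as below. `UsageMeter` and `UsageRow` are hypothetical names, and the database write is left out: `flush()` just returns the rows that a 60-second timer would insert into `usage_events`.

```typescript
type Metric = "api_calls" | "token_issuances";

// One aggregated row per tenant and metric, as it would be written on flush.
interface UsageRow {
  tenantId: string;
  metric: Metric;
  count: number;
}

class UsageMeter {
  // Keyed by "tenantId|metric"; holds counts since the last flush.
  private buffer = new Map<string, UsageRow>();

  // Called from per-request middleware; touches memory only.
  record(tenantId: string, metric: Metric): void {
    const key = `${tenantId}|${metric}`;
    const row = this.buffer.get(key);
    if (row !== undefined) {
      row.count += 1;
    } else {
      this.buffer.set(key, { tenantId, metric, count: 1 });
    }
  }

  // Swaps in an empty buffer and returns the aggregated rows, so requests
  // that arrive during the database write land in the next interval.
  flush(): UsageRow[] {
    const snapshot = this.buffer;
    this.buffer = new Map<string, UsageRow>();
    const rows: UsageRow[] = [];
    snapshot.forEach((row) => rows.push(row));
    return rows;
  }
}

const meter = new UsageMeter();
meter.record("tenant-a", "api_calls");
meter.record("tenant-a", "api_calls");
meter.record("tenant-a", "token_issuances");
meter.record("tenant-b", "api_calls");

const firstFlush = meter.flush();
const secondFlush = meter.flush(); // nothing recorded since the first flush
```

In the service, a timer would call `flush()` every 60 seconds and insert the returned rows. Counts recorded after the last flush live only in process memory, so they are lost if the process exits before the next flush.
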
## Risks / Trade-offs

- **[Risk] Portal deployment is separate from API** → Mitigation: Document the CORS configuration clearly. The portal calls the public API via the `NEXT_PUBLIC_API_URL` env var, so the two deployments stay independent.
- **[Risk] CLI polling for audit tail adds API load** → Mitigation: The polling interval defaults to 5s with exponential backoff. Document this limitation. Real-time tail via WebSockets is a Phase 5 enhancement.
- **[Risk] Stripe webhooks can be forged unless signature verification is enforced** → Mitigation: Every webhook handler verifies the `stripe-signature` header with `stripe.webhooks.constructEvent()` before processing and rejects any request that fails verification.
- **[Risk] GitHub Actions OIDC requires trust policy configuration per repo** → Mitigation: Document trust policy setup clearly in the Action README. Provide a quickstart template for `/.github/workflows/sentryagent-setup.yml`.
- **[Risk] Free tier limit checks add latency on every request** → Mitigation: Limit summaries are cached in Redis with a 60s TTL. A stale cache allows a brief over-limit grace window, which is acceptable for MVP.
- **[Risk] ioredis migration may break existing Redis usage** → Mitigation: Existing Redis usage (Bull queue, session) already uses `ioredis` under the hood (Bull requires it). The migration is additive: only the rate-limiter middleware is replaced, and no other existing code is removed.

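
For the CLI polling mitigation above, one way to compute the delay is sketched here. The 5s base comes from this document; the 60s cap, the doubling factor, and the choice to back off on consecutive failed polls are assumptions, as is the function name.

```typescript
const BASE_INTERVAL_MS = 5000; // default polling interval from the design
const MAX_INTERVAL_MS = 60000; // assumed upper bound, not specified above

// Returns the delay before the next poll. `consecutiveFailures` is the
// number of failed polls in a row (0 means the last poll succeeded).
function nextPollDelayMs(consecutiveFailures: number): number {
  const exponent = Math.max(0, Math.floor(consecutiveFailures));
  const delay = BASE_INTERVAL_MS * Math.pow(2, exponent);
  return Math.min(delay, MAX_INTERVAL_MS);
}

// Delays for 0 through 5 consecutive failures.
const pollDelays = [0, 1, 2, 3, 4, 5].map(nextPollDelayMs);
```

The delay doubles from 5s up to the cap and returns to 5s as soon as a poll succeeds, which bounds the extra API load from a struggling client.
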
## Migration Plan

1. **WS1 first** (before any public traffic): deploy the ioredis rate limiter, connection pool tuning, and the detailed health endpoint. Run k6 load tests. Proceed to WS2 and later workstreams only after the load tests pass.
2. **WS2 + WS3 in parallel**: the portal and CLI are independent. The portal is deployed to CDN/Vercel and the CLI is published to npm.
3. **WS4**: Marketplace routes are added to the Express API behind a feature flag (`MARKETPLACE_ENABLED=true`). Enable after WS1 hardening is confirmed stable.
4. **WS5**: GitHub Actions are published to the GitHub Actions Marketplace after the OIDC trust policy documentation is complete.
5. **WS6 last**: Billing affects all tenants. Register the Stripe webhooks in the Stripe dashboard and apply the `tenant_subscriptions` table migration. Free tier limits are initially set generously and tightened once monitoring confirms the limit logic is correct.

**Rollback strategy per workstream:**

- WS1: The rate limiter is middleware. Revert to the in-memory store by toggling an env var (`REDIS_RATE_LIMIT_ENABLED=false`)
- WS2: The portal is a separate deployment. Roll it back independently
- WS3: npm package. Deprecate the bad version (`npm deprecate`) and publish a fixed release; unpublish only where npm's unpublish policy allows it
- WS4: Feature flag `MARKETPLACE_ENABLED=false`
- WS5: GitHub Actions are versioned. Pin to the prior release tag
- WS6: Feature flag `BILLING_ENABLED=false` disables enforcement while metering continues

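
The rollback flags above could be read in one place, roughly as follows. The flag names come from this plan; the defaults, the `"true"` parsing rule, and the function names are assumptions for the sketch.

```typescript
interface FeatureFlags {
  redisRateLimitEnabled: boolean;
  marketplaceEnabled: boolean;
  billingEnabled: boolean;
}

// Environment-like lookup, e.g. `process.env` in the API process.
type Env = { [name: string]: string | undefined };

// Treats only the string "true" (any case, surrounding spaces ignored)
// as enabled; an unset variable falls back to the default.
function parseFlag(value: string | undefined, defaultValue: boolean): boolean {
  if (value === undefined) return defaultValue;
  return value.trim().toLowerCase() === "true";
}

function readFeatureFlags(env: Env): FeatureFlags {
  return {
    // Assumed default: Redis-backed limiting stays on unless switched off.
    redisRateLimitEnabled: parseFlag(env["REDIS_RATE_LIMIT_ENABLED"], true),
    // Assumed default: new surfaces stay off until explicitly enabled.
    marketplaceEnabled: parseFlag(env["MARKETPLACE_ENABLED"], false),
    billingEnabled: parseFlag(env["BILLING_ENABLED"], false),
  };
}

// Example: WS1 rolled back, marketplace enabled, billing left unset.
const exampleFlags = readFeatureFlags({
  REDIS_RATE_LIMIT_ENABLED: "false",
  MARKETPLACE_ENABLED: "true",
});
```

Reading the flags once at startup keeps each rollback a configuration change plus a restart rather than a code change.
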
## Open Questions

- **Portal domain**: Will `portal/` be served from `sentryagent.ai` (marketing site) or `app.sentryagent.ai` (portal subdomain)? This affects CORS and the Next.js `basePath` config. Recommend: `sentryagent.ai` for the portal, `app.sentryagent.ai` for the dashboard.
- **Free tier limits**: Are 10 agents and 1,000 API calls/day the final limits, or placeholders? If they are placeholders, billing enforcement should be gated behind the `BILLING_ENABLED` flag until the limits are confirmed.
- **Marketplace moderation**: Will agent marketplace listings be auto-published on registration, or require manual approval? Recommend: auto-publish for MVP, flag-based moderation later.