feat(openspec): Phase 4 Developer Growth & Go-to-Market Readiness

OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks, delivery sequence:
WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, /health/detailed, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — sentryagent npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free tier enforcement

Specs touched (8): new capabilities — production-hardening, developer-portal, cli-tool,
agent-marketplace, github-actions, billing-metering — plus deltas to web-dashboard and monitoring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: SentryAgent.ai Developer
Date: 2026-04-02 04:00:34 +00:00
Parent: f1fbe0e29a
Commit: b0f70b7ac4
12 changed files with 630 additions and 0 deletions


## ADDED Requirements
### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter
The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis sliding window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.
#### Scenario: Rate limit enforced across multiple instances
- **WHEN** two Express instances are running behind a load balancer and a client sends requests alternating between instances
- **THEN** the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached
#### Scenario: Redis unavailable — graceful fallback
- **WHEN** Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- **THEN** the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests
#### Scenario: Rate limit exceeded
- **WHEN** a client exceeds the configured request limit within the window
- **THEN** the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets
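The fallback path above can be sketched as a plain in-memory sliding-window limiter. This is illustrative only — the production path uses `rate-limiter-flexible` over `ioredis` — and the class and field names here are assumptions, not part of the codebase:

```javascript
// Minimal in-memory sliding-window limiter — a sketch of the fallback used
// when REDIS_RATE_LIMIT_ENABLED=false. Names are illustrative.
class SlidingWindowLimiter {
  constructor({ windowMs, maxRequests }) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.hits = new Map(); // clientKey -> timestamps of requests in the window
  }

  // Returns { allowed, retryAfterMs }. retryAfterMs is the time until the
  // oldest hit leaves the window — the basis for the Retry-After header
  // on an HTTP 429 response.
  consume(clientKey, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(clientKey) || []).filter((t) => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.hits.set(clientKey, recent);
      return { allowed: false, retryAfterMs: recent[0] + this.windowMs - now };
    }
    recent.push(now);
    this.hits.set(clientKey, recent);
    return { allowed: true, retryAfterMs: 0 };
  }
}
```

The Redis-backed path shares the same semantics; only the counter storage moves out of process so all Express instances observe one window.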
### Requirement: Database connection pool is explicitly configured
The system SHALL configure `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
#### Scenario: Pool exhaustion under load
- **WHEN** all pool connections are in use and a new query is requested
- **THEN** the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if timeout is exceeded
#### Scenario: Idle connections are reaped
- **WHEN** a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- **THEN** the pool SHALL close the connection and reduce active pool size toward `min`
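A sketch of reading the pool parameters from the environment variables named above, with the specified defaults. The helper is plain Node; wiring it into `pg` is shown as a hypothetical usage comment:

```javascript
// Build pg Pool options from env vars, falling back to the requirement's
// defaults: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
function buildPoolConfig(env = process.env) {
  const int = (name, fallback) => {
    const parsed = Number.parseInt(env[name], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    max: int('DB_POOL_MAX', 20),
    min: int('DB_POOL_MIN', 2),
    idleTimeoutMillis: int('DB_POOL_IDLE_TIMEOUT_MS', 30000),
    connectionTimeoutMillis: int('DB_POOL_CONNECTION_TIMEOUT_MS', 5000),
  };
}

// Usage (assumes the `pg` package is installed):
// const { Pool } = require('pg');
// const pool = new Pool(buildPoolConfig());
```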
### Requirement: Detailed health endpoint reports per-service status
The system SHALL expose `GET /health/detailed` returning a JSON object with individual status for each dependency: `database`, `redis`, `vault` (if configured), `opa` (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response status SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.
#### Scenario: All services healthy
- **WHEN** all dependencies respond within acceptable latency
- **THEN** `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`
#### Scenario: Redis unreachable
- **WHEN** Redis does not respond within 2000ms
- **THEN** `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`
#### Scenario: Vault degraded
- **WHEN** Vault responds but with latency exceeding 1000ms
- **THEN** `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement
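The overall-status rule above reduces to a small pure function — a sketch, assuming each per-service check yields a `{ status }` object as described; unreachable takes precedence over degraded:

```javascript
// Map per-service statuses to the overall /health/detailed response:
// 200 if all healthy, 207 if any degraded, 503 if any unreachable.
function overallHealth(services) {
  const statuses = Object.values(services).map((s) => s.status);
  if (statuses.includes('unreachable')) {
    return { httpStatus: 503, status: 'unhealthy' };
  }
  if (statuses.includes('degraded')) {
    return { httpStatus: 207, status: 'degraded' };
  }
  return { httpStatus: 200, status: 'healthy' };
}
```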
### Requirement: k6 load test suite validates production readiness
The system SHALL include a k6 load test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.
#### Scenario: Token issuance load test passes thresholds
- **WHEN** the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- **THEN** p95 response time SHALL be below 500ms and error rate SHALL be below 1%
#### Scenario: Load test threshold failure surfaces clearly
- **WHEN** a k6 threshold is breached during the load test run
- **THEN** the k6 process SHALL exit with a non-zero exit code, making CI failure explicit
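A sketch of what `tests/load/token-issuance.js` could look like under the k6 runtime (not plain Node). The endpoint path and payload are assumptions; the `thresholds` block encodes the requirement's pass/fail gates, and k6 exits non-zero when a threshold is breached:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,          // 1000 virtual users
  duration: '60s',    // for 60 seconds
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 response time < 500ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  // `/token` is an assumed path — substitute the real token-issuance route.
  const res = http.post(`${__ENV.BASE_URL}/token`, JSON.stringify({}), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

Run with `k6 run -e BASE_URL=https://staging.example tests/load/token-issuance.js`; a breached threshold surfaces in CI as the non-zero exit code the final scenario requires.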