# OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks. Delivery sequence: WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6.

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, `/health/detailed`, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — `sentryagent` npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free-tier enforcement

New capabilities (8 specs): production-hardening, developer-portal, cli-tool, agent-marketplace, github-actions, billing-metering (+ deltas: web-dashboard, monitoring)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## ADDED Requirements
### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter
The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis sliding-window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate-limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.
#### Scenario: Rate limit enforced across multiple instances
- WHEN two Express instances are running behind a load balancer and a client sends requests alternating between instances
- THEN the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached
#### Scenario: Redis unavailable — graceful fallback
- WHEN Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- THEN the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests
#### Scenario: Rate limit exceeded
- WHEN a client exceeds the configured request limit within the window
- THEN the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets
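A minimal sketch of the intended middleware, assuming `rate-limiter-flexible`'s `insuranceLimiter` option as the fallback mechanism; the `REDIS_URL` variable, the default values, and the `rateLimit` name are illustrative, not part of this spec:

```typescript
// rate-limit.ts — hedged sketch; REDIS_URL and defaults are illustrative.
import Redis from 'ioredis';
import { RateLimiterRedis, RateLimiterMemory, RateLimiterRes } from 'rate-limiter-flexible';
import type { Request, Response, NextFunction } from 'express';

const windowMs = Number(process.env.RATE_LIMIT_WINDOW_MS ?? 60_000);
const points = Number(process.env.RATE_LIMIT_MAX_REQUESTS ?? 100);
const duration = windowMs / 1000; // rate-limiter-flexible windows are in seconds

// In-memory limiter: used directly in local development, and as the
// insurance (fallback) limiter when Redis is unreachable at runtime.
const memoryLimiter = new RateLimiterMemory({ points, duration });

const redisClient = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  enableOfflineQueue: false, // fail fast so the insurance limiter can take over
});
// Covers the "log a warning" scenario when Redis becomes unreachable.
redisClient.on('error', (err) => console.warn('rate-limit Redis unavailable:', err.message));

const limiter =
  process.env.REDIS_RATE_LIMIT_ENABLED === 'false'
    ? memoryLimiter
    : new RateLimiterRedis({
        storeClient: redisClient,
        points,
        duration,
        insuranceLimiter: memoryLimiter, // fall back instead of rejecting all requests
      });

export async function rateLimit(req: Request, res: Response, next: NextFunction) {
  try {
    await limiter.consume(req.ip ?? 'unknown');
    next();
  } catch (rejection) {
    // consume() rejects with a RateLimiterRes when the limit is exceeded.
    const msBeforeNext = rejection instanceof RateLimiterRes ? rejection.msBeforeNext : windowMs;
    res.set('Retry-After', String(Math.ceil(msBeforeNext / 1000)));
    res.status(429).json({ error: 'rate_limit_exceeded' });
  }
}
```

Setting `enableOfflineQueue: false` makes Redis failures surface immediately, so the insurance limiter takes over instead of requests hanging while commands queue.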
### Requirement: Database connection pool is explicitly configured
The system SHALL configure the `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: `max=20`, `min=2`, `idleTimeout=30000ms`, `connectionTimeout=5000ms`.
#### Scenario: Pool exhaustion under load
- WHEN all pool connections are in use and a new query is requested
- THEN the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if the timeout is exceeded
#### Scenario: Idle connections are reaped
- WHEN a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- THEN the pool SHALL close the connection and reduce the active pool size toward `min`
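A sketch of the explicit pool configuration with the defaults above; `DATABASE_URL` and the `intFromEnv` helper are illustrative, and the `min` option assumes a `pg-pool` release that supports it:

```typescript
// db.ts — hedged sketch of explicit pg pool configuration.
import { Pool } from 'pg';

// Parse an integer env var, falling back to the documented default.
function intFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  return raw === undefined ? fallback : Number.parseInt(raw, 10);
}

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: intFromEnv('DB_POOL_MAX', 20),
  min: intFromEnv('DB_POOL_MIN', 2), // requires a pg-pool release with `min` support
  idleTimeoutMillis: intFromEnv('DB_POOL_IDLE_TIMEOUT_MS', 30_000),
  connectionTimeoutMillis: intFromEnv('DB_POOL_CONNECTION_TIMEOUT_MS', 5_000),
});
```

Under pool exhaustion, `pool.query()` waits up to `connectionTimeoutMillis` for a free connection and then rejects; an error handler can translate that rejection into the 503 described in the scenario above.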
### Requirement: Detailed health endpoint reports per-service status
The system SHALL expose `GET /health/detailed` returning a JSON object with an individual status for each dependency: database, redis, vault (if configured), and opa (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.
#### Scenario: All services healthy
- WHEN all dependencies respond within acceptable latency
- THEN `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`
#### Scenario: Redis unreachable
- WHEN Redis does not respond within 2000ms
- THEN `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`
#### Scenario: Vault degraded
- WHEN Vault responds but with latency exceeding 1000ms
- THEN `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement
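A sketch of the handler's shape covering the database and Redis probes (Vault and OPA would be probed the same way when configured); the 1000ms/2000ms cutoffs come from the scenarios above, while the `probe` helper, the report shape, and the file names are illustrative:

```typescript
// health.ts — hedged sketch; reuses the pool from the sketch above.
import { Router, type Request, type Response } from 'express';
import Redis from 'ioredis';
import { pool } from './db';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

type ServiceStatus = 'healthy' | 'degraded' | 'unreachable';
interface ServiceReport { status: ServiceStatus; latencyMs: number; message?: string; }

const DEGRADED_MS = 1_000; // slower than this => degraded (Vault scenario)
const TIMEOUT_MS = 2_000;  // no response within this => unreachable (Redis scenario)

async function probe(check: () => Promise<unknown>): Promise<ServiceReport> {
  const started = Date.now();
  try {
    await Promise.race([
      check(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS)),
    ]);
    const latencyMs = Date.now() - started;
    return { status: latencyMs > DEGRADED_MS ? 'degraded' : 'healthy', latencyMs };
  } catch (err) {
    return { status: 'unreachable', latencyMs: Date.now() - started, message: String(err) };
  }
}

export const healthRouter = Router();

healthRouter.get('/health/detailed', async (_req: Request, res: Response) => {
  // Vault and OPA probes would be added here the same way when configured.
  const services = {
    database: await probe(() => pool.query('SELECT 1')),
    redis: await probe(() => redis.ping()),
  };
  const statuses = Object.values(services).map((s) => s.status);
  const httpStatus =
    statuses.includes('unreachable') ? 503 : statuses.includes('degraded') ? 207 : 200;
  const overall = httpStatus === 503 ? 'unhealthy' : httpStatus === 207 ? 'degraded' : 'healthy';
  res.status(httpStatus).json({ status: overall, services });
});
```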
### Requirement: k6 load test suite validates production readiness
The system SHALL include a k6 load test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.
#### Scenario: Token issuance load test passes thresholds
- WHEN the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- THEN p95 response time SHALL be below 500ms and error rate SHALL be below 1%
#### Scenario: Load test threshold failure surfaces clearly
- WHEN a k6 threshold is breached during the load test run
- THEN the k6 process SHALL exit with a non-zero exit code, making CI failure explicit
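A sketch of what `tests/load/token-issuance.js` could look like. k6 scripts are plain JavaScript; the endpoint path and request payload are assumptions, while the `options` block encodes this requirement's thresholds, and a breached threshold makes the k6 process exit non-zero:

```javascript
// tests/load/token-issuance.js — hedged sketch; endpoint and payload are assumptions.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,       // 1000 virtual users
  duration: '60s', // for 60 seconds
  thresholds: {
    // Breaching either threshold makes k6 exit non-zero, failing CI explicitly.
    http_req_duration: ['p(95)<500'], // p95 response time below 500ms
    http_req_failed: ['rate<0.01'],   // error rate below 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000';

export default function () {
  const res = http.post(
    `${BASE_URL}/api/v1/tokens`, // hypothetical token-issuance endpoint
    JSON.stringify({ agentId: __ENV.AGENT_ID, scope: 'default' }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(res, { 'token issued': (r) => r.status === 200 });
}
```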