feat(openspec): Phase 4 Developer Growth & Go-to-Market Readiness

OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks, delivery sequence:
WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, /health/detailed, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — sentryagent npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free tier enforcement

Specs touched (8): new capabilities — production-hardening, developer-portal, cli-tool,
agent-marketplace, github-actions, billing-metering — plus deltas to web-dashboard and monitoring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: SentryAgent.ai Developer
Date: 2026-04-02 04:00:34 +00:00
Parent: f1fbe0e29a
Commit: b0f70b7ac4
12 changed files with 630 additions and 0 deletions


## ADDED Requirements
### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter
The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis sliding window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.
#### Scenario: Rate limit enforced across multiple instances
- **WHEN** two Express instances are running behind a load balancer and a client sends requests alternating between instances
- **THEN** the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached
#### Scenario: Redis unavailable — graceful fallback
- **WHEN** Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- **THEN** the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests
#### Scenario: Rate limit exceeded
- **WHEN** a client exceeds the configured request limit within the window
- **THEN** the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets
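The fallback path above can be sketched as a plain in-memory sliding-window limiter. This is illustrative only — the production path uses `rate-limiter-flexible` over `ioredis` — and the class and field names here are assumptions, not part of the codebase:

```javascript
// Minimal in-memory sliding-window limiter — a sketch of the fallback used
// when REDIS_RATE_LIMIT_ENABLED=false. Names are illustrative.
class SlidingWindowLimiter {
  constructor({ windowMs, maxRequests }) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.hits = new Map(); // clientKey -> timestamps of requests in the window
  }

  // Returns { allowed, retryAfterMs }. retryAfterMs is the time until the
  // oldest hit leaves the window — the basis for the Retry-After header
  // on an HTTP 429 response.
  consume(clientKey, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(clientKey) || []).filter((t) => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.hits.set(clientKey, recent);
      return { allowed: false, retryAfterMs: recent[0] + this.windowMs - now };
    }
    recent.push(now);
    this.hits.set(clientKey, recent);
    return { allowed: true, retryAfterMs: 0 };
  }
}
```

The Redis-backed path shares the same semantics; only the counter storage moves out of process so all Express instances observe one window.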
### Requirement: Database connection pool is explicitly configured
The system SHALL configure `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
#### Scenario: Pool exhaustion under load
- **WHEN** all pool connections are in use and a new query is requested
- **THEN** the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if timeout is exceeded
#### Scenario: Idle connections are reaped
- **WHEN** a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- **THEN** the pool SHALL close the connection and reduce active pool size toward `min`
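A sketch of reading the pool parameters from the environment variables named above, with the specified defaults. The helper is plain Node; wiring it into `pg` is shown as a hypothetical usage comment:

```javascript
// Build pg Pool options from env vars, falling back to the requirement's
// defaults: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
function buildPoolConfig(env = process.env) {
  const int = (name, fallback) => {
    const parsed = Number.parseInt(env[name], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    max: int('DB_POOL_MAX', 20),
    min: int('DB_POOL_MIN', 2),
    idleTimeoutMillis: int('DB_POOL_IDLE_TIMEOUT_MS', 30000),
    connectionTimeoutMillis: int('DB_POOL_CONNECTION_TIMEOUT_MS', 5000),
  };
}

// Usage (assumes the `pg` package is installed):
// const { Pool } = require('pg');
// const pool = new Pool(buildPoolConfig());
```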
### Requirement: Detailed health endpoint reports per-service status
The system SHALL expose `GET /health/detailed` returning a JSON object with individual status for each dependency: `database`, `redis`, `vault` (if configured), `opa` (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response status SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.
#### Scenario: All services healthy
- **WHEN** all dependencies respond within acceptable latency
- **THEN** `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`
#### Scenario: Redis unreachable
- **WHEN** Redis does not respond within 2000ms
- **THEN** `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`
#### Scenario: Vault degraded
- **WHEN** Vault responds but with latency exceeding 1000ms
- **THEN** `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement
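The overall-status rule above reduces to a small pure function — a sketch, assuming each per-service check yields a `{ status }` object as described; unreachable takes precedence over degraded:

```javascript
// Map per-service statuses to the overall /health/detailed response:
// 200 if all healthy, 207 if any degraded, 503 if any unreachable.
function overallHealth(services) {
  const statuses = Object.values(services).map((s) => s.status);
  if (statuses.includes('unreachable')) {
    return { httpStatus: 503, status: 'unhealthy' };
  }
  if (statuses.includes('degraded')) {
    return { httpStatus: 207, status: 'degraded' };
  }
  return { httpStatus: 200, status: 'healthy' };
}
```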
### Requirement: k6 load test suite validates production readiness
The system SHALL include a k6 load test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.
#### Scenario: Token issuance load test passes thresholds
- **WHEN** the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- **THEN** p95 response time SHALL be below 500ms and error rate SHALL be below 1%
#### Scenario: Load test threshold failure surfaces clearly
- **WHEN** a k6 threshold is breached during the load test run
- **THEN** the k6 process SHALL exit with a non-zero exit code, making CI failure explicit
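A sketch of what `tests/load/token-issuance.js` could look like under the k6 runtime (not plain Node). The endpoint path and payload are assumptions; the `thresholds` block encodes the requirement's pass/fail gates, and k6 exits non-zero when a threshold is breached:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,          // 1000 virtual users
  duration: '60s',    // for 60 seconds
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 response time < 500ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  // `/token` is an assumed path — substitute the real token-issuance route.
  const res = http.post(`${__ENV.BASE_URL}/token`, JSON.stringify({}), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

Run with `k6 run -e BASE_URL=https://staging.example tests/load/token-issuance.js`; a breached threshold surfaces in CI as the non-zero exit code the final scenario requires.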