feat(openspec): Phase 4 Developer Growth & Go-to-Market Readiness
OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks.
Delivery sequence: WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, /health/detailed, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — sentryagent npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free-tier enforcement

New capabilities (8 specs): production-hardening, developer-portal, cli-tool, agent-marketplace, github-actions, billing-metering (+ deltas: web-dashboard, monitoring)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## ADDED Requirements

### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter

The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis-backed sliding-window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate-limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.

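A minimal sketch of such a middleware, assuming `rate-limiter-flexible`'s `RateLimiterRedis` with its built-in `insuranceLimiter` as the in-memory fallback path. The environment variable names come from this requirement; the file layout, key choice, and response body are illustrative, and note that `rate-limiter-flexible` counts in windowed buckets by default, so an exact sliding window may need extra configuration — treat this as a sketch of the wiring, not the algorithm:

```typescript
// middleware/rate-limit.ts (hypothetical location) — illustrative sketch only.
import Redis from 'ioredis';
import { RateLimiterMemory, RateLimiterRedis } from 'rate-limiter-flexible';
import type { NextFunction, Request, Response } from 'express';

const windowMs = Number(process.env.RATE_LIMIT_WINDOW_MS ?? 60_000);
const points = Number(process.env.RATE_LIMIT_MAX_REQUESTS ?? 100);

// Local-dev fallback, also used as the insurance limiter when Redis is down.
const memoryLimiter = new RateLimiterMemory({ points, duration: windowMs / 1000 });

const limiter =
  process.env.REDIS_RATE_LIMIT_ENABLED === 'false'
    ? memoryLimiter
    : new RateLimiterRedis({
        storeClient: new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379'),
        points,
        duration: windowMs / 1000,
        // Falls back to in-memory counting instead of rejecting all requests;
        // a production version would also log a warning here, per the fallback scenario.
        insuranceLimiter: memoryLimiter,
      });

export function rateLimit(req: Request, res: Response, next: NextFunction): void {
  limiter
    .consume(req.ip ?? 'unknown')
    .then(() => next())
    .catch((rejection) => {
      if (rejection instanceof Error) return next(rejection); // unexpected store error
      // Limit exceeded: 429 plus Retry-After pointing at the window reset.
      res.setHeader('Retry-After', Math.ceil(rejection.msBeforeNext / 1000));
      res.status(429).json({ error: 'rate_limit_exceeded' });
    });
}
```

Wiring it up is then a single `app.use(rateLimit)` on each instance.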
#### Scenario: Rate limit enforced across multiple instances

- **WHEN** two Express instances are running behind a load balancer and a client sends requests alternating between instances
- **THEN** the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached

#### Scenario: Redis unavailable — graceful fallback

- **WHEN** Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- **THEN** the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests

#### Scenario: Rate limit exceeded

- **WHEN** a client exceeds the configured request limit within the window
- **THEN** the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets
### Requirement: Database connection pool is explicitly configured

The system SHALL configure the `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.

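A sketch of the corresponding pool construction, mapping the environment variables from this requirement onto `pg`'s pool options with the stated defaults (the module path, export name, and `DATABASE_URL` setting are illustrative):

```typescript
// db/pool.ts (hypothetical location) — DB_POOL_* variables onto pg Pool options.
import { Pool } from 'pg';

const int = (value: string | undefined, fallback: number): number =>
  value !== undefined ? Number.parseInt(value, 10) : fallback;

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // assumed connection setting
  max: int(process.env.DB_POOL_MAX, 20),
  min: int(process.env.DB_POOL_MIN, 2),
  idleTimeoutMillis: int(process.env.DB_POOL_IDLE_TIMEOUT_MS, 30_000),
  connectionTimeoutMillis: int(process.env.DB_POOL_CONNECTION_TIMEOUT_MS, 5_000),
});
```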
#### Scenario: Pool exhaustion under load

- **WHEN** all pool connections are in use and a new query is requested
- **THEN** the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if the timeout is exceeded

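One way to satisfy the 503 path is an Express error handler that recognizes the pool's connection-timeout rejection. The error-message match below is an internal detail of `pg-pool` and should be verified against the installed version; the handler name is illustrative:

```typescript
import type { NextFunction, Request, Response } from 'express';

// Hypothetical last-in-chain Express error handler.
export function dbTimeoutHandler(err: Error, req: Request, res: Response, next: NextFunction): void {
  // pg-pool rejects with this message when connectionTimeoutMillis elapses
  // before a connection frees up — verify against your pg version.
  if (err.message.includes('timeout exceeded when trying to connect')) {
    res.status(503).json({ error: 'database_unavailable' });
    return;
  }
  next(err);
}
```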
#### Scenario: Idle connections are reaped

- **WHEN** a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- **THEN** the pool SHALL close the connection and reduce active pool size toward `min`
### Requirement: Detailed health endpoint reports per-service status

The system SHALL expose `GET /health/detailed` returning a JSON object with individual status for each dependency: `database`, `redis`, `vault` (if configured), and `opa` (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response status SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.

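A sketch of the aggregation logic under these rules. The probe cutoffs (2000ms → unreachable, 1000ms → degraded) are taken from the scenarios below; the client imports, handler name, and per-service checks are assumptions:

```typescript
import type { Request, Response } from 'express';
// `pool` and `redis` are assumed to be the app's existing pg and ioredis clients.
import { pool } from './db/pool';
import { redis } from './redis';

type ServiceStatus = 'healthy' | 'degraded' | 'unreachable';
interface ServiceHealth { status: ServiceStatus; latencyMs: number; message?: string }

const DEGRADED_MS = 1_000; // latency above this → degraded
const TIMEOUT_MS = 2_000;  // no response within this → unreachable

async function probe(check: () => Promise<unknown>): Promise<ServiceHealth> {
  const start = Date.now();
  try {
    await Promise.race([
      check(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS)),
    ]);
    const latencyMs = Date.now() - start;
    return { status: latencyMs > DEGRADED_MS ? 'degraded' : 'healthy', latencyMs };
  } catch (err) {
    return { status: 'unreachable', latencyMs: Date.now() - start, message: (err as Error).message };
  }
}

export async function detailedHealth(_req: Request, res: Response): Promise<void> {
  const services: Record<string, ServiceHealth> = {
    database: await probe(() => pool.query('SELECT 1')),
    redis: await probe(() => redis.ping()),
    // vault / opa probes would be added here when those services are configured
  };
  const statuses = Object.values(services).map((s) => s.status);
  const code = statuses.includes('unreachable') ? 503 : statuses.includes('degraded') ? 207 : 200;
  res.status(code).json({
    status: code === 200 ? 'healthy' : code === 207 ? 'degraded' : 'unhealthy',
    services,
  });
}
```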
#### Scenario: All services healthy

- **WHEN** all dependencies respond within acceptable latency
- **THEN** `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`

#### Scenario: Redis unreachable

- **WHEN** Redis does not respond within 2000ms
- **THEN** `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`

#### Scenario: Vault degraded

- **WHEN** Vault responds but with latency exceeding 1000ms
- **THEN** `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement
### Requirement: k6 load test suite validates production readiness

The system SHALL include a k6 load-test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.

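A sketch of the token-issuance scenario as a k6 script (k6 scripts are JavaScript). Only the virtual-user count, duration, and thresholds come from this requirement; the endpoint path, payload, and environment variables are placeholders:

```javascript
// tests/load/token-issuance.js — load shape and thresholds per the spec;
// endpoint path and payload are illustrative placeholders.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,
  duration: '60s',
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 response time below 500ms
    http_req_failed: ['rate<0.01'],   // error rate below 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000';

export default function () {
  const res = http.post(
    `${BASE_URL}/v1/tokens`, // placeholder route
    JSON.stringify({ agentId: __ENV.AGENT_ID }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(res, { 'token issued': (r) => r.status === 200 });
}
```

When any threshold is crossed, k6 ends the run with a non-zero exit code, which is what the CI scenario below relies on.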
#### Scenario: Token issuance load test passes thresholds

- **WHEN** the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- **THEN** p95 response time SHALL be below 500ms and error rate SHALL be below 1%

#### Scenario: Load test threshold failure surfaces clearly

- **WHEN** a k6 threshold is breached during the load test run
- **THEN** the k6 process SHALL exit with a non-zero exit code, making CI failure explicit