# OpenSpec change: phase-4-developer-growth (spec-driven, 4/4 artifacts)

6 workstreams, 90 implementation tasks. Delivery sequence: WS1 → WS2 + WS3 (parallel) → WS4 → WS5 → WS6.

Workstreams:
1. Production Hardening — ioredis rate limiting, DB pool tuning, `/health/detailed`, k6 load tests
2. Developer Portal — Next.js 14, Swagger UI explorer, onboarding wizard, pricing/SDK pages
3. CLI Tool — `sentryagent` npm CLI, 5 commands, shell completion
4. Agent Marketplace — public searchable registry powered by existing agent/DID infrastructure
5. GitHub Actions — register-agent + issue-token Actions via OIDC (no stored secrets)
6. Billing & Usage Metering — Stripe Checkout, webhook-driven state, free-tier enforcement

New capabilities (8 specs): production-hardening, developer-portal, cli-tool, agent-marketplace, github-actions, billing-metering (+ deltas: web-dashboard, monitoring)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## ADDED Requirements
### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter
The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis sliding-window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate-limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.
#### Scenario: Rate limit enforced across multiple instances
- WHEN two Express instances are running behind a load balancer and a client sends requests alternating between instances
- THEN the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached
#### Scenario: Redis unavailable — graceful fallback
- WHEN Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- THEN the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests
#### Scenario: Rate limit exceeded
- WHEN a client exceeds the configured request limit within the window
- THEN the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets
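A minimal sketch of the intended middleware, assuming `rate-limiter-flexible`'s `insuranceLimiter` option as the fallback mechanism; the `REDIS_URL` variable, the default values, and the `rateLimit` name are illustrative, not part of this spec:

```typescript
// rate-limit.ts — hedged sketch; REDIS_URL and defaults are illustrative.
import Redis from 'ioredis';
import { RateLimiterRedis, RateLimiterMemory, RateLimiterRes } from 'rate-limiter-flexible';
import type { Request, Response, NextFunction } from 'express';

const windowMs = Number(process.env.RATE_LIMIT_WINDOW_MS ?? 60_000);
const points = Number(process.env.RATE_LIMIT_MAX_REQUESTS ?? 100);
const duration = windowMs / 1000; // rate-limiter-flexible windows are in seconds

// In-memory limiter: used directly in local development, and as the
// insurance (fallback) limiter when Redis is unreachable at runtime.
const memoryLimiter = new RateLimiterMemory({ points, duration });

const redisClient = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  enableOfflineQueue: false, // fail fast so the insurance limiter can take over
});
// Covers the "log a warning" scenario when Redis becomes unreachable.
redisClient.on('error', (err) => console.warn('rate-limit Redis unavailable:', err.message));

const limiter =
  process.env.REDIS_RATE_LIMIT_ENABLED === 'false'
    ? memoryLimiter
    : new RateLimiterRedis({
        storeClient: redisClient,
        points,
        duration,
        insuranceLimiter: memoryLimiter, // fall back instead of rejecting all requests
      });

export async function rateLimit(req: Request, res: Response, next: NextFunction) {
  try {
    await limiter.consume(req.ip ?? 'unknown');
    next();
  } catch (rejection) {
    // consume() rejects with a RateLimiterRes when the limit is exceeded.
    const msBeforeNext = rejection instanceof RateLimiterRes ? rejection.msBeforeNext : windowMs;
    res.set('Retry-After', String(Math.ceil(msBeforeNext / 1000)));
    res.status(429).json({ error: 'rate_limit_exceeded' });
  }
}
```

Setting `enableOfflineQueue: false` makes Redis failures surface immediately, so the insurance limiter takes over instead of requests hanging while commands queue.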
### Requirement: Database connection pool is explicitly configured
The system SHALL configure the `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: `max=20`, `min=2`, `idleTimeout=30000ms`, `connectionTimeout=5000ms`.
#### Scenario: Pool exhaustion under load
- WHEN all pool connections are in use and a new query is requested
- THEN the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if the timeout is exceeded
#### Scenario: Idle connections are reaped
- WHEN a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- THEN the pool SHALL close the connection and reduce the active pool size toward `min`
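A sketch of the explicit pool configuration with the defaults above; `DATABASE_URL` and the `intFromEnv` helper are illustrative, and the `min` option assumes a `pg-pool` release that supports it:

```typescript
// db.ts — hedged sketch of explicit pg pool configuration.
import { Pool } from 'pg';

// Parse an integer env var, falling back to the documented default.
function intFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  return raw === undefined ? fallback : Number.parseInt(raw, 10);
}

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: intFromEnv('DB_POOL_MAX', 20),
  min: intFromEnv('DB_POOL_MIN', 2), // requires a pg-pool release with `min` support
  idleTimeoutMillis: intFromEnv('DB_POOL_IDLE_TIMEOUT_MS', 30_000),
  connectionTimeoutMillis: intFromEnv('DB_POOL_CONNECTION_TIMEOUT_MS', 5_000),
});
```

Under pool exhaustion, `pool.query()` waits up to `connectionTimeoutMillis` for a free connection and then rejects; an error handler can translate that rejection into the 503 described in the scenario above.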
### Requirement: Detailed health endpoint reports per-service status
The system SHALL expose `GET /health/detailed` returning a JSON object with an individual status for each dependency: database, redis, vault (if configured), and opa (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.
#### Scenario: All services healthy
- WHEN all dependencies respond within acceptable latency
- THEN `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`
#### Scenario: Redis unreachable
- WHEN Redis does not respond within 2000ms
- THEN `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`
#### Scenario: Vault degraded
- WHEN Vault responds but with latency exceeding 1000ms
- THEN `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement
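A sketch of the handler's shape covering the database and Redis probes (Vault and OPA would be probed the same way when configured); the 1000ms/2000ms cutoffs come from the scenarios above, while the `probe` helper, the report shape, and the file names are illustrative:

```typescript
// health.ts — hedged sketch; reuses the pool from the sketch above.
import { Router, type Request, type Response } from 'express';
import Redis from 'ioredis';
import { pool } from './db';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

type ServiceStatus = 'healthy' | 'degraded' | 'unreachable';
interface ServiceReport { status: ServiceStatus; latencyMs: number; message?: string; }

const DEGRADED_MS = 1_000; // slower than this => degraded (Vault scenario)
const TIMEOUT_MS = 2_000;  // no response within this => unreachable (Redis scenario)

async function probe(check: () => Promise<unknown>): Promise<ServiceReport> {
  const started = Date.now();
  try {
    await Promise.race([
      check(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS)),
    ]);
    const latencyMs = Date.now() - started;
    return { status: latencyMs > DEGRADED_MS ? 'degraded' : 'healthy', latencyMs };
  } catch (err) {
    return { status: 'unreachable', latencyMs: Date.now() - started, message: String(err) };
  }
}

export const healthRouter = Router();

healthRouter.get('/health/detailed', async (_req: Request, res: Response) => {
  // Vault and OPA probes would be added here the same way when configured.
  const services = {
    database: await probe(() => pool.query('SELECT 1')),
    redis: await probe(() => redis.ping()),
  };
  const statuses = Object.values(services).map((s) => s.status);
  const httpStatus =
    statuses.includes('unreachable') ? 503 : statuses.includes('degraded') ? 207 : 200;
  const overall = httpStatus === 503 ? 'unhealthy' : httpStatus === 207 ? 'degraded' : 'healthy';
  res.status(httpStatus).json({ status: overall, services });
});
```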
### Requirement: k6 load test suite validates production readiness
The system SHALL include a k6 load test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.
#### Scenario: Token issuance load test passes thresholds
- WHEN the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- THEN p95 response time SHALL be below 500ms and error rate SHALL be below 1%
#### Scenario: Load test threshold failure surfaces clearly
- WHEN a k6 threshold is breached during the load test run
- THEN the k6 process SHALL exit with a non-zero exit code, making CI failure explicit
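A sketch of what `tests/load/token-issuance.js` could look like. k6 scripts are plain JavaScript; the endpoint path and request payload are assumptions, while the `options` block encodes this requirement's thresholds, and a breached threshold makes the k6 process exit non-zero:

```javascript
// tests/load/token-issuance.js — hedged sketch; endpoint and payload are assumptions.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,       // 1000 virtual users
  duration: '60s', // for 60 seconds
  thresholds: {
    // Breaching either threshold makes k6 exit non-zero, failing CI explicitly.
    http_req_duration: ['p(95)<500'], // p95 response time below 500ms
    http_req_failed: ['rate<0.01'],   // error rate below 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000';

export default function () {
  const res = http.post(
    `${BASE_URL}/api/v1/tokens`, // hypothetical token-issuance endpoint
    JSON.stringify({ agentId: __ENV.AGENT_ID, scope: 'default' }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(res, { 'token issued': (r) => r.status === 200 });
}
```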