## ADDED Requirements

### Requirement: Redis-backed distributed rate limiting replaces in-memory limiter

The system SHALL use `ioredis` + `rate-limiter-flexible` to enforce rate limits across all Express instances using a Redis sliding window algorithm. The in-memory `express-rate-limit` store SHALL be removed. Rate limit configuration SHALL be injectable via environment variables (`RATE_LIMIT_WINDOW_MS`, `RATE_LIMIT_MAX_REQUESTS`). When `REDIS_RATE_LIMIT_ENABLED=false`, the system SHALL fall back to an in-memory limiter for local development.

#### Scenario: Rate limit enforced across multiple instances

- **WHEN** two Express instances are running behind a load balancer and a client sends requests alternating between instances
- **THEN** the rate limit counter is shared across both instances via Redis and the client is rejected after the combined limit is reached

#### Scenario: Redis unavailable — graceful fallback

- **WHEN** Redis is unreachable and `REDIS_RATE_LIMIT_ENABLED=true`
- **THEN** the system SHALL log a warning and fall back to in-memory limiting rather than rejecting all requests

#### Scenario: Rate limit exceeded

- **WHEN** a client exceeds the configured request limit within the window
- **THEN** the system SHALL respond with HTTP 429 and a `Retry-After` header indicating when the window resets

### Requirement: Database connection pool is explicitly configured

The system SHALL configure the `pg` connection pool with explicit `max`, `min`, `idleTimeoutMillis`, and `connectionTimeoutMillis` parameters via environment variables (`DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS`). Defaults SHALL be: max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
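The env-to-pool mapping above can be centralized in one small helper. A minimal sketch, assuming the documented variable names and defaults; the helper name `poolConfigFromEnv` is illustrative, not from the codebase, and the resulting object is what `pg`'s `Pool` constructor accepts:

```javascript
'use strict';

// Parse an integer env var, falling back to the requirement's default
// when the variable is unset or not a number.
function intFromEnv(env, key, fallback) {
  const raw = env[key];
  const n = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isNaN(n) ? fallback : n;
}

// Maps the documented environment variables onto pg pool options,
// applying the defaults from the requirement:
// max=20, min=2, idleTimeout=30000ms, connectionTimeout=5000ms.
function poolConfigFromEnv(env = process.env) {
  return {
    max: intFromEnv(env, 'DB_POOL_MAX', 20),
    min: intFromEnv(env, 'DB_POOL_MIN', 2),
    idleTimeoutMillis: intFromEnv(env, 'DB_POOL_IDLE_TIMEOUT_MS', 30000),
    connectionTimeoutMillis: intFromEnv(env, 'DB_POOL_CONNECTION_TIMEOUT_MS', 5000),
  };
}

// Usage (requires the `pg` package):
//   const { Pool } = require('pg');
//   const pool = new Pool(poolConfigFromEnv());
```

Keeping the parsing in one place makes the defaults testable in isolation, without a live database.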
#### Scenario: Pool exhaustion under load

- **WHEN** all pool connections are in use and a new query is requested
- **THEN** the system SHALL queue the request and resolve it within `DB_POOL_CONNECTION_TIMEOUT_MS`, or reject with a 503 if the timeout is exceeded

#### Scenario: Idle connections are reaped

- **WHEN** a connection has been idle for longer than `DB_POOL_IDLE_TIMEOUT_MS`
- **THEN** the pool SHALL close the connection and reduce the active pool size toward `min`

### Requirement: Detailed health endpoint reports per-service status

The system SHALL expose `GET /health/detailed` returning a JSON object with an individual status for each dependency: `database`, `redis`, `vault` (if configured), and `opa` (if configured). Each service SHALL report `status` (`healthy` | `degraded` | `unreachable`), `latencyMs`, and an optional `message`. The overall response status SHALL be HTTP 200 if all services are healthy, HTTP 207 if any are degraded, and HTTP 503 if any are unreachable.

#### Scenario: All services healthy

- **WHEN** all dependencies respond within acceptable latency
- **THEN** `GET /health/detailed` returns HTTP 200 with all services reporting `status: "healthy"`

#### Scenario: Redis unreachable

- **WHEN** Redis does not respond within 2000ms
- **THEN** `GET /health/detailed` returns HTTP 503 with `redis.status: "unreachable"` and overall `status: "unhealthy"`

#### Scenario: Vault degraded

- **WHEN** Vault responds but with latency exceeding 1000ms
- **THEN** `GET /health/detailed` returns HTTP 207 with `vault.status: "degraded"` and a latency measurement

### Requirement: k6 load test suite validates production readiness

The system SHALL include a k6 load test suite at `tests/load/` covering: agent registration under load (100 virtual users, 60s), token issuance under load (1000 virtual users, 60s), and credential rotation under load (50 virtual users, 60s). Each scenario SHALL define pass/fail thresholds: p95 response time < 500ms, error rate < 1%.
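The VU counts, duration, and pass/fail thresholds can be declared directly in each script's `options` block, which is also what makes k6 exit non-zero on a breach. A sketch of the token-issuance scenario under stated assumptions: the `BASE_URL` default, the `/tokens` path, and the request payload are placeholders, not taken from the actual service.

```javascript
// tests/load/token-issuance.js (sketch; endpoint and payload are assumptions)
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1000,        // 1000 virtual users, per the requirement
  duration: '60s',
  thresholds: {
    // A breached threshold makes the k6 process exit non-zero, failing CI.
    http_req_duration: ['p(95)<500'], // p95 response time < 500ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000';

export default function () {
  const res = http.post(`${BASE_URL}/tokens`, JSON.stringify({}), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

The other two scenarios would differ only in `vus` (100 for registration, 50 for rotation) and the endpoint exercised. This script runs only under the k6 runtime, e.g. `k6 run tests/load/token-issuance.js`.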
#### Scenario: Token issuance load test passes thresholds

- **WHEN** the k6 load test `token-issuance.js` runs with 1000 virtual users for 60 seconds
- **THEN** p95 response time SHALL be below 500ms and error rate SHALL be below 1%

#### Scenario: Load test threshold failure surfaces clearly

- **WHEN** a k6 threshold is breached during the load test run
- **THEN** the k6 process SHALL exit with a non-zero exit code, making CI failure explicit
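Returning to the health endpoint requirement above, the 200/207/503 mapping reduces to a pure aggregation over per-service statuses, which keeps it unit-testable without live dependencies. A minimal sketch; the function and field names are illustrative:

```javascript
'use strict';

// Aggregates per-service health into the overall HTTP status and body
// described by the requirement: 200 if all healthy, 207 if any degraded,
// 503 if any unreachable ("unreachable" takes precedence over "degraded").
function aggregateHealth(services) {
  const statuses = Object.values(services).map((s) => s.status);
  let httpStatus = 200;
  let overall = 'healthy';
  if (statuses.includes('degraded')) {
    httpStatus = 207;
    overall = 'degraded';
  }
  if (statuses.includes('unreachable')) {
    httpStatus = 503;
    overall = 'unhealthy';
  }
  return { httpStatus, body: { status: overall, services } };
}

// Usage, e.g. inside the GET /health/detailed handler:
//   const { httpStatus, body } = aggregateHealth({
//     database: { status: 'healthy', latencyMs: 4 },
//     redis: { status: 'unreachable', latencyMs: 2000, message: 'timeout' },
//   });
//   res.status(httpStatus).json(body);
```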