feat(phase-4): WS1 — Production Hardening (Redis rate limiting, DB pool, health endpoint, k6)

Rate limiting: - Replace in-memory express-rate-limit with ioredis + rate-limiter-flexible (sliding window) - Graceful fallback to RateLimiterMemory when Redis unreachable - RATE_LIMIT_WINDOW_MS / RATE_LIMIT_MAX_REQUESTS env var config - Retry-After header on 429 responses - agentidp_rate_limit_hits_total Prometheus counter Database pool: - Explicit pg.Pool config via DB_POOL_MAX/MIN/IDLE_TIMEOUT_MS/CONNECTION_TIMEOUT_MS - Defaults: max=20, min=2, idle=30s, conn timeout=5s - agentidp_db_pool_active_connections + agentidp_db_pool_waiting_requests gauges Health endpoint: - GET /health/detailed — per-service status (database, Redis, Vault, OPA) - healthy / degraded (>1000ms) / unreachable classification - HTTP 200 (all healthy) / 207 (any degraded) / 503 (any unreachable) Load tests: - tests/load/ with k6 scenarios for agent registration (100 VUs), token issuance (1000 VUs), credential rotation (50 VUs) - npm run load-test script Tests: 586 passing, zero TypeScript errors Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 04:20:37 +00:00
parent b0f70b7ac4
commit 1b682c22b2
16 changed files with 1467 additions and 113 deletions
--- a/openspec/changes/phase-4-developer-growth/tasks.md
+++ b/openspec/changes/phase-4-developer-growth/tasks.md
@@ -1,30 +1,30 @@
 ## 1. WS1: Production Hardening — Redis Rate Limiting

- [ ] 1.1 Install `ioredis` and `rate-limiter-flexible` — add to package.json dependencies
- [ ] 1.2 Create `src/infrastructure/redisClient.ts` — singleton ioredis client with connection error handling and `REDIS_RATE_LIMIT_ENABLED` env var guard
- [ ] 1.3 Replace in-memory `express-rate-limit` with `RateLimiterRedis` from `rate-limiter-flexible` — sliding window, configurable via `RATE_LIMIT_WINDOW_MS` and `RATE_LIMIT_MAX_REQUESTS`
- [ ] 1.4 Implement graceful fallback to `RateLimiterMemory` when Redis is unreachable
- [ ] 1.5 Add `agentidp_rate_limit_hits_total` Prometheus counter (labels: `endpoint`) — increment on HTTP 429
- [ ] 1.6 Update rate limiter middleware to set `Retry-After` header on rejection
- [ ] 1.7 Write unit tests for rate limiter middleware — Redis path, fallback path, 429 response shape
+- [x] 1.1 Install `ioredis` and `rate-limiter-flexible` — add to package.json dependencies
+- [x] 1.2 Create `src/infrastructure/redisClient.ts` — singleton ioredis client with connection error handling and `REDIS_RATE_LIMIT_ENABLED` env var guard
+- [x] 1.3 Replace in-memory `express-rate-limit` with `RateLimiterRedis` from `rate-limiter-flexible` — sliding window, configurable via `RATE_LIMIT_WINDOW_MS` and `RATE_LIMIT_MAX_REQUESTS`
+- [x] 1.4 Implement graceful fallback to `RateLimiterMemory` when Redis is unreachable
+- [x] 1.5 Add `agentidp_rate_limit_hits_total` Prometheus counter (labels: `endpoint`) — increment on HTTP 429
+- [x] 1.6 Update rate limiter middleware to set `Retry-After` header on rejection
+- [x] 1.7 Write unit tests for rate limiter middleware — Redis path, fallback path, 429 response shape

 ## 2. WS1: Production Hardening — Database Pool & Health

- [ ] 2.1 Add `DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS` env vars to `.env.example` and database config
- [ ] 2.2 Configure `pg.Pool` with explicit pool parameters; defaults: max=20, min=2, idle=30000ms, conn timeout=5000ms
- [ ] 2.3 Expose `agentidp_db_pool_active_connections` gauge and `agentidp_db_pool_waiting_requests` gauge — update on pool events
- [ ] 2.4 Create `GET /health/detailed` route and controller — check database, Redis, Vault (if configured), OPA (if configured)
- [ ] 2.5 Implement per-service health checks with latency measurement — `healthy` / `degraded` (>1000ms) / `unreachable` (timeout/error)
- [ ] 2.6 Return HTTP 200 (all healthy), HTTP 207 (any degraded), HTTP 503 (any unreachable)
- [ ] 2.7 Write unit tests for health controller — all healthy, degraded, unreachable scenarios
+- [x] 2.1 Add `DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS` env vars to `.env.example` and database config
+- [x] 2.2 Configure `pg.Pool` with explicit pool parameters; defaults: max=20, min=2, idle=30000ms, conn timeout=5000ms
+- [x] 2.3 Expose `agentidp_db_pool_active_connections` gauge and `agentidp_db_pool_waiting_requests` gauge — update on pool events
+- [x] 2.4 Create `GET /health/detailed` route and controller — check database, Redis, Vault (if configured), OPA (if configured)
+- [x] 2.5 Implement per-service health checks with latency measurement — `healthy` / `degraded` (>1000ms) / `unreachable` (timeout/error)
+- [x] 2.6 Return HTTP 200 (all healthy), HTTP 207 (any degraded), HTTP 503 (any unreachable)
+- [x] 2.7 Write unit tests for health controller — all healthy, degraded, unreachable scenarios

 ## 3. WS1: Production Hardening — Load Tests

- [ ] 3.1 Install k6 and create `tests/load/` directory with `README.md` explaining how to run tests
- [ ] 3.2 Write `tests/load/agent-registration.js` — 100 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
- [ ] 3.3 Write `tests/load/token-issuance.js` — 1000 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
- [ ] 3.4 Write `tests/load/credential-rotation.js` — 50 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
- [ ] 3.5 Add `npm run load-test` script to package.json running all three k6 scenarios sequentially
+- [x] 3.1 Install k6 and create `tests/load/` directory with `README.md` explaining how to run tests
+- [x] 3.2 Write `tests/load/agent-registration.js` — 100 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.3 Write `tests/load/token-issuance.js` — 1000 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.4 Write `tests/load/credential-rotation.js` — 50 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.5 Add `npm run load-test` script to package.json running all three k6 scenarios sequentially

 ## 4. WS2: Developer Portal — Setup & Core Pages