feat(phase-4): WS1 — Production Hardening (Redis rate limiting, DB pool, health endpoint, k6)

Rate limiting:
- Replace in-memory express-rate-limit with ioredis + rate-limiter-flexible (sliding window)
- Graceful fallback to RateLimiterMemory when Redis unreachable
- RATE_LIMIT_WINDOW_MS / RATE_LIMIT_MAX_REQUESTS env var config
- Retry-After header on 429 responses
- agentidp_rate_limit_hits_total Prometheus counter
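
A minimal sketch of the fallback and Retry-After behaviour listed above. It makes several assumptions: the `Limiter` interface is an illustrative stand-in (not the library's own types) for a limiter whose `consume` rejects with an `Error` on store failure and with an object carrying `msBeforeNext` when the limit is exceeded; the default window and request count are hypothetical because the commit does not state them; and all function names are invented for illustration.

```typescript
// Illustrative stand-ins; these are NOT rate-limiter-flexible's own type definitions.
interface LimiterResult {
  msBeforeNext: number; // milliseconds until the next request would be allowed
}

interface Limiter {
  // Assumed contract: resolves when the request is allowed, rejects with a
  // LimiterResult when the limit is exceeded, rejects with an Error on store failure.
  consume(key: string): Promise<LimiterResult>;
}

type Env = Record<string, string | undefined>;
type Decision = { allowed: true } | { allowed: false; retryAfter: number };

// Hypothetical defaults; the commit does not say what the real ones are.
const ASSUMED_WINDOW_MS = 60000;
const ASSUMED_MAX_REQUESTS = 100;

function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Map the two env vars onto a points-per-duration pair (duration in seconds).
function rateLimitConfig(env: Env): { points: number; durationSec: number } {
  const windowMs = positiveInt(env["RATE_LIMIT_WINDOW_MS"], ASSUMED_WINDOW_MS);
  const points = positiveInt(env["RATE_LIMIT_MAX_REQUESTS"], ASSUMED_MAX_REQUESTS);
  return { points, durationSec: Math.ceil(windowMs / 1000) };
}

// Retry-After is expressed in whole seconds; round up and never return 0.
function retryAfterSeconds(msBeforeNext: number): number {
  return Math.max(1, Math.ceil(msBeforeNext / 1000));
}

async function tryConsume(limiter: Limiter, key: string): Promise<Decision> {
  try {
    await limiter.consume(key);
    return { allowed: true };
  } catch (rejection) {
    if (rejection instanceof Error) throw rejection; // store failure, caller decides
    const { msBeforeNext } = rejection as LimiterResult;
    return { allowed: false, retryAfter: retryAfterSeconds(msBeforeNext) };
  }
}

// Use the primary (Redis-backed) limiter; fall back to the in-memory one
// only when the primary fails with a store error.
async function checkLimit(primary: Limiter, fallback: Limiter, key: string): Promise<Decision> {
  try {
    return await tryConsume(primary, key);
  } catch {
    return tryConsume(fallback, key);
  }
}
```

In a middleware, a `{ allowed: false }` decision would translate to a 429 response with `Retry-After` set to `retryAfter`, plus an increment of the hit counter. rate-limiter-flexible also documents an `insuranceLimiter` option that may cover the same fallback case; whether this commit uses it is not stated.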

Database pool:
- Explicit pg.Pool config via DB_POOL_MAX/MIN/IDLE_TIMEOUT_MS/CONNECTION_TIMEOUT_MS
- Defaults: max=20, min=2, idle=30s, conn timeout=5s
- agentidp_db_pool_active_connections + agentidp_db_pool_waiting_requests gauges
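
A sketch of how the env vars and defaults above could be resolved into pool settings, and how the two gauge values could be derived. The option names `max`, `idleTimeoutMillis` and `connectionTimeoutMillis` and the counters `totalCount`, `idleCount` and `waitingCount` match node-postgres; whether `min` is honoured depends on the installed `pg` version and should be verified. The function names and the clamping of `min` to `max` are assumptions for illustration.

```typescript
type Env = Record<string, string | undefined>;

interface PoolSettings {
  max: number;
  min: number;
  idleTimeoutMillis: number;
  connectionTimeoutMillis: number;
}

// Defaults as stated in the commit message: max=20, min=2, idle=30s, conn timeout=5s.
const DEFAULTS: PoolSettings = {
  max: 20,
  min: 2,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
};

// Parse a non-negative integer, falling back when the value is missing or invalid.
function intFromEnv(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(n) && n >= 0 ? n : fallback;
}

function poolSettings(env: Env): PoolSettings {
  const max = intFromEnv(env["DB_POOL_MAX"], DEFAULTS.max);
  const min = intFromEnv(env["DB_POOL_MIN"], DEFAULTS.min);
  return {
    max,
    min: Math.min(min, max), // assumption: never ask for more idle clients than max
    idleTimeoutMillis: intFromEnv(env["DB_POOL_IDLE_TIMEOUT_MS"], DEFAULTS.idleTimeoutMillis),
    connectionTimeoutMillis: intFromEnv(
      env["DB_POOL_CONNECTION_TIMEOUT_MS"],
      DEFAULTS.connectionTimeoutMillis,
    ),
  };
}

// Subset of the counters a pg.Pool instance exposes.
interface PoolCounters {
  totalCount: number;   // all clients, idle or checked out
  idleCount: number;    // clients currently idle in the pool
  waitingCount: number; // queued requests waiting for a client
}

// Values for the active-connections and waiting-requests gauges.
function poolGauges(pool: PoolCounters): { active: number; waiting: number } {
  return { active: pool.totalCount - pool.idleCount, waiting: pool.waitingCount };
}
```

The result of `poolSettings(process.env)` could be spread into the `pg.Pool` constructor, and `poolGauges(pool)` called from pool event handlers to refresh the gauges.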

Health endpoint:
- GET /health/detailed — per-service status (database, Redis, Vault, OPA)
- healthy / degraded (>1000ms) / unreachable classification
- HTTP 200 (all healthy) / 207 (any degraded) / 503 (any unreachable)
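
The classification and status-code rules above can be sketched as pure functions. Two assumptions: an unreachable service takes precedence over a degraded one when both occur (the commit does not say which wins), and the timeout handling is omitted, so only a thrown error marks a service unreachable here. All names are illustrative.

```typescript
type ServiceStatus = "healthy" | "degraded" | "unreachable";

const DEGRADED_THRESHOLD_MS = 1000; // "degraded (>1000ms)" per the commit message

// null latency means the check failed outright.
function classify(latencyMs: number | null): ServiceStatus {
  if (latencyMs === null) return "unreachable";
  return latencyMs > DEGRADED_THRESHOLD_MS ? "degraded" : "healthy";
}

// Assumed precedence: unreachable beats degraded beats healthy.
function overallHttpStatus(statuses: ServiceStatus[]): 200 | 207 | 503 {
  if (statuses.includes("unreachable")) return 503;
  if (statuses.includes("degraded")) return 207;
  return 200;
}

interface ProbeResult {
  status: ServiceStatus;
  latencyMs: number | null;
}

// Run one health check and measure how long it took.
// The clock is injectable so the function can be tested deterministically.
async function probe(
  check: () => Promise<void>,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  const start = now();
  try {
    await check();
    const latencyMs = now() - start;
    return { status: classify(latencyMs), latencyMs };
  } catch {
    return { status: "unreachable", latencyMs: null };
  }
}
```

A controller for `GET /health/detailed` could run one `probe` per service (database, Redis, Vault, OPA), then pass the collected statuses to `overallHttpStatus`.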

Load tests:
- tests/load/ with k6 scenarios for agent registration (100 VUs), token issuance (1000 VUs), credential rotation (50 VUs)
- npm run load-test script
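
For orientation, a sketch of the k6 options each scenario might export. The 60s duration and the thresholds (p95 < 500ms, error rate < 1%) come from the task list in this commit; `http_req_duration` and `http_req_failed` are k6's built-in metrics. The helper name and the idea of sharing one options builder across the three scripts are assumptions, not something the commit confirms.

```typescript
interface LoadOptions {
  vus: number;
  duration: string;
  thresholds: Record<string, string[]>;
}

// Build a k6-style options object for a constant-VU run.
function loadOptions(vus: number): LoadOptions {
  return {
    vus,
    duration: "60s",
    thresholds: {
      http_req_duration: ["p(95)<500"], // 95th percentile latency under 500 ms
      http_req_failed: ["rate<0.01"],   // fewer than 1% failed requests
    },
  };
}

// VU counts from the commit message; the keys mirror the script file names.
const scenarios: Record<string, LoadOptions> = {
  "agent-registration": loadOptions(100),
  "token-issuance": loadOptions(1000),
  "credential-rotation": loadOptions(50),
};
```

In an actual k6 script the object would be exported as `options`, for example `export const options = loadOptions(100);`, alongside a default function that issues the HTTP requests.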

Tests: 586 passing, zero TypeScript errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-04-02 04:20:37 +00:00
parent b0f70b7ac4
commit 1b682c22b2
16 changed files with 1467 additions and 113 deletions

@@ -1,30 +1,30 @@
## 1. WS1: Production Hardening — Redis Rate Limiting
-- [ ] 1.1 Install `ioredis` and `rate-limiter-flexible` — add to package.json dependencies
-- [ ] 1.2 Create `src/infrastructure/redisClient.ts` — singleton ioredis client with connection error handling and `REDIS_RATE_LIMIT_ENABLED` env var guard
-- [ ] 1.3 Replace in-memory `express-rate-limit` with `RateLimiterRedis` from `rate-limiter-flexible` — sliding window, configurable via `RATE_LIMIT_WINDOW_MS` and `RATE_LIMIT_MAX_REQUESTS`
-- [ ] 1.4 Implement graceful fallback to `RateLimiterMemory` when Redis is unreachable
-- [ ] 1.5 Add `agentidp_rate_limit_hits_total` Prometheus counter (labels: `endpoint`) — increment on HTTP 429
-- [ ] 1.6 Update rate limiter middleware to set `Retry-After` header on rejection
-- [ ] 1.7 Write unit tests for rate limiter middleware — Redis path, fallback path, 429 response shape
+- [x] 1.1 Install `ioredis` and `rate-limiter-flexible` — add to package.json dependencies
+- [x] 1.2 Create `src/infrastructure/redisClient.ts` — singleton ioredis client with connection error handling and `REDIS_RATE_LIMIT_ENABLED` env var guard
+- [x] 1.3 Replace in-memory `express-rate-limit` with `RateLimiterRedis` from `rate-limiter-flexible` — sliding window, configurable via `RATE_LIMIT_WINDOW_MS` and `RATE_LIMIT_MAX_REQUESTS`
+- [x] 1.4 Implement graceful fallback to `RateLimiterMemory` when Redis is unreachable
+- [x] 1.5 Add `agentidp_rate_limit_hits_total` Prometheus counter (labels: `endpoint`) — increment on HTTP 429
+- [x] 1.6 Update rate limiter middleware to set `Retry-After` header on rejection
+- [x] 1.7 Write unit tests for rate limiter middleware — Redis path, fallback path, 429 response shape
## 2. WS1: Production Hardening — Database Pool & Health
-- [ ] 2.1 Add `DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS` env vars to `.env.example` and database config
-- [ ] 2.2 Configure `pg.Pool` with explicit pool parameters; defaults: max=20, min=2, idle=30000ms, conn timeout=5000ms
-- [ ] 2.3 Expose `agentidp_db_pool_active_connections` gauge and `agentidp_db_pool_waiting_requests` gauge — update on pool events
-- [ ] 2.4 Create `GET /health/detailed` route and controller — check database, Redis, Vault (if configured), OPA (if configured)
-- [ ] 2.5 Implement per-service health checks with latency measurement — `healthy` / `degraded` (>1000ms) / `unreachable` (timeout/error)
-- [ ] 2.6 Return HTTP 200 (all healthy), HTTP 207 (any degraded), HTTP 503 (any unreachable)
-- [ ] 2.7 Write unit tests for health controller — all healthy, degraded, unreachable scenarios
+- [x] 2.1 Add `DB_POOL_MAX`, `DB_POOL_MIN`, `DB_POOL_IDLE_TIMEOUT_MS`, `DB_POOL_CONNECTION_TIMEOUT_MS` env vars to `.env.example` and database config
+- [x] 2.2 Configure `pg.Pool` with explicit pool parameters; defaults: max=20, min=2, idle=30000ms, conn timeout=5000ms
+- [x] 2.3 Expose `agentidp_db_pool_active_connections` gauge and `agentidp_db_pool_waiting_requests` gauge — update on pool events
+- [x] 2.4 Create `GET /health/detailed` route and controller — check database, Redis, Vault (if configured), OPA (if configured)
+- [x] 2.5 Implement per-service health checks with latency measurement — `healthy` / `degraded` (>1000ms) / `unreachable` (timeout/error)
+- [x] 2.6 Return HTTP 200 (all healthy), HTTP 207 (any degraded), HTTP 503 (any unreachable)
+- [x] 2.7 Write unit tests for health controller — all healthy, degraded, unreachable scenarios
## 3. WS1: Production Hardening — Load Tests
-- [ ] 3.1 Install k6 and create `tests/load/` directory with `README.md` explaining how to run tests
-- [ ] 3.2 Write `tests/load/agent-registration.js` — 100 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
-- [ ] 3.3 Write `tests/load/token-issuance.js` — 1000 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
-- [ ] 3.4 Write `tests/load/credential-rotation.js` — 50 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
-- [ ] 3.5 Add `npm run load-test` script to package.json running all three k6 scenarios sequentially
+- [x] 3.1 Install k6 and create `tests/load/` directory with `README.md` explaining how to run tests
+- [x] 3.2 Write `tests/load/agent-registration.js` — 100 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.3 Write `tests/load/token-issuance.js` — 1000 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.4 Write `tests/load/credential-rotation.js` — 50 VUs, 60s, threshold: p95 < 500ms, error rate < 1%
+- [x] 3.5 Add `npm run load-test` script to package.json running all three k6 scenarios sequentially
## 4. WS2: Developer Portal — Setup & Core Pages