Files
sentryagent-idp/docs/devops/operations.md
SentryAgent.ai Developer a504964e5f feat(phase-2): workstream 7 — Prometheus + Grafana Monitoring
- Add prom-client 15; shared registry in src/metrics/registry.ts (7 metrics)
- HTTP request counter + duration histogram via metricsMiddleware
- DB query duration histogram wrapping pg Pool.query
- Redis command duration histogram via typed instrumentRedisMethod wrapper
- agentidp_tokens_issued_total in OAuth2Service
- agentidp_agents_registered_total in AgentService
- GET /metrics unauthenticated endpoint (Prometheus text format)
- docker-compose.monitoring.yml overlay (Prometheus + Grafana)
- Grafana auto-provisioned datasource + pre-built AgentIdP dashboard
- docs/devops/operations.md monitoring section added
- 36/36 unit tests passing, 100% coverage on new metrics code
- Fix pre-existing unused import in tests/integration/agents.test.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 06:13:41 +00:00

285 lines
8.2 KiB
Markdown

# Operations
Startup, shutdown, log interpretation, and troubleshooting for AgentIdP.
---
## Startup Order
Always start services in this order. Starting the application before PostgreSQL or Redis is ready will cause connection errors on first request.
```
1. PostgreSQL (must be healthy)
2. Redis (must be healthy)
3. Migrations (must complete successfully)
4. Application (start last)
```
### Startup checklist
```bash
# 1. Start PostgreSQL and Redis
docker-compose up -d postgres redis
# 2. Wait for healthy status
docker-compose ps
# Both postgres and redis must show "healthy" before proceeding
# 3. Run migrations
npm run db:migrate
# Must complete with 0 errors before starting the app
# 4. Start the application
npm run dev # development
# or
npm start # production (requires prior npm run build)
```
---
## Graceful Shutdown
The application handles `SIGTERM` and `SIGINT` gracefully:
1. Stops accepting new connections
2. Waits for in-flight requests to complete
3. Exits with code `0`
### Sending SIGTERM
```bash
# Find the PID
ps aux | grep "node.*server"
# Send SIGTERM
kill -SIGTERM <pid>
```
Expected log output:
```
Shutting down gracefully...
```
The process exits cleanly. No requests are dropped if they were already in-flight.
### Docker stop
`docker stop` sends `SIGTERM` by default with a 10-second timeout before `SIGKILL`. This is sufficient for graceful shutdown.
```bash
docker stop sentryagent-idp-app-1
```
---
## Log Reference
AgentIdP logs to stdout. In development (`NODE_ENV=development`), Morgan HTTP request logs are included. In test (`NODE_ENV=test`), Morgan is suppressed.
### Startup logs
| Log line | Meaning |
|----------|---------|
| `SentryAgent.ai AgentIdP listening on port 3000` | Server bound successfully — ready to accept requests |
| `Shutting down gracefully...` | SIGTERM/SIGINT received — draining connections |
### Error logs
| Log line | Meaning |
|----------|---------|
| `Failed to start server: Error: DATABASE_URL environment variable is required` | `DATABASE_URL` is not set in the environment |
| `Failed to start server: Error: REDIS_URL environment variable is required` | `REDIS_URL` is not set |
| `Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required` | One or both JWT keys are missing |
| `Unexpected pg pool error <err>` | PostgreSQL connection dropped after startup — check DB availability |
| `Redis client error <err>` | Redis connection error after startup — check Redis availability |
### Morgan HTTP request format (development)
```
::1 - - [28/Mar/2026:09:01:00 +0000] "POST /api/v1/token HTTP/1.1" 200 312 "-" "curl/7.88.1"
```
Format: `<ip> - - [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"`
---
## Redis Key Patterns
Three key patterns are used in Redis. Useful for debugging and manual inspection.
```bash
# Connect to Redis CLI
docker-compose exec redis redis-cli
```
| Key pattern | Example | Purpose | TTL |
|------------|---------|---------|-----|
| `revoked:<jti>` | `revoked:f1e2d3c4-b5a6-...` | Revoked token JTI | Remaining token lifetime |
| `rate:<client_id>:<window>` | `rate:a1b2c3...:29086156` | Request count per minute window | 60 seconds |
| `monthly:<client_id>:<year>:<month>` | `monthly:a1b2c3...:2026:3` | Token issuance count for free tier | End of month |
Inspect keys:
```bash
# List all revoked tokens
redis-cli KEYS "revoked:*"
# Check rate limit counter for a specific client
redis-cli GET "rate:<client_id>:<window_key>"
# Check monthly token count for a specific client
redis-cli GET "monthly:<client_id>:2026:3"
```
Where `<window_key>` is `floor(unix_ms / 60000)`. For the current window:
```bash
node -e "console.log(Math.floor(Date.now() / 60000))"
```
---
## Troubleshooting
### Application fails to start — missing environment variable
**Symptom:**
```
Failed to start server: Error: DATABASE_URL environment variable is required
```
**Fix:** Ensure your `.env` file exists in the project root and contains all required variables. Verify:
```bash
grep -E "^(DATABASE_URL|REDIS_URL|JWT_PRIVATE_KEY|JWT_PUBLIC_KEY)=" .env
```
---
### Application fails to start — JWT key error
**Symptom:**
```
Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required
```
**Fix:** Generate RSA keys and add them to `.env`. See [security.md](security.md).
---
### PostgreSQL connection refused on first request
**Symptom:**
```
Error: connect ECONNREFUSED 127.0.0.1:5432
```
**Causes and fixes:**
| Cause | Fix |
|-------|-----|
| PostgreSQL container not started | Run `docker-compose up -d postgres` |
| PostgreSQL container not yet healthy | Wait and run `docker-compose ps` — wait for `healthy` |
| Wrong `DATABASE_URL` host/port | Check `DATABASE_URL` matches the PostgreSQL port (5432) |
| PostgreSQL container exited | Run `docker-compose logs postgres` to see why it exited |
---
### Redis connection error on first request
**Symptom:**
```
Redis client error Error: connect ECONNREFUSED 127.0.0.1:6379
```
**Causes and fixes:**
| Cause | Fix |
|-------|-----|
| Redis container not started | Run `docker-compose up -d redis` |
| Redis container not yet healthy | Run `docker-compose ps` — wait for `healthy` |
| Wrong `REDIS_URL` | Check `REDIS_URL` matches the Redis port (6379) |
---
### Migration fails
**Symptom:**
```
Migration failed: Error: connect ECONNREFUSED 127.0.0.1:5432
```
**Fix:** PostgreSQL is not running or not reachable. Start it and verify health before running migrations.
**Symptom:**
```
Migration failed: Error: relation "agents" already exists
```
**Fix:** The migration has already been applied partially. Check `schema_migrations`:
```bash
psql "$DATABASE_URL" -c "SELECT name FROM schema_migrations ORDER BY name;"
```
If a migration is listed there but the table is inconsistent, manually inspect and repair the database state before re-running.
---
### All requests return 401 after key rotation
**Symptom:** Every API call returns `401 UNAUTHORIZED` with `Token signature is invalid.`
**Cause:** JWT keys were rotated. All previously issued tokens were signed with the old private key and are now invalid.
**Fix:** Clients must re-authenticate using `POST /token` with their `client_id` and `client_secret` to obtain a new token signed with the new key. This is expected behaviour after key rotation.
---
### Rate limit hit unexpectedly — 429 responses
**Symptom:** API returns `429 RATE_LIMIT_EXCEEDED` with `X-RateLimit-Reset` header.
**Check current rate limit state:**
```bash
# Find the current window key
WINDOW=$(node -e "console.log(Math.floor(Date.now() / 60000))")
# Check count for a specific client
docker-compose exec redis redis-cli GET "rate:<client_id>:$WINDOW"
```
**Fix:** Wait until `X-RateLimit-Reset` (Unix timestamp in the response header) before retrying. The window resets every 60 seconds.
---
## Monitoring
AgentIdP exposes a Prometheus metrics endpoint at `GET /metrics` (unauthenticated, plain text).
### Metrics Exposed
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | OAuth 2.0 tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration |
### Starting the Monitoring Stack
```bash
# Start the full stack with monitoring
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin / agentidp)
```
The Grafana dashboard auto-provisions on first start. Navigate to **Dashboards → AgentIdP → SentryAgent.ai — AgentIdP**.
### Security Note
`GET /metrics` is unauthenticated. In production, ensure this endpoint is:
- Only accessible from your internal network (firewall rule or reverse proxy restriction)
- Not exposed on a public-facing port