sentryagent-idp/docs/devops/operations.md
SentryAgent.ai Developer f9a6a8aafb docs(devops): update all documentation for DockerSpec compliance
- Replace all docker-compose.yml/docker-compose.monitoring.yml references with
  compose.yaml/compose.monitoring.yaml (modern Compose Spec naming)
- Replace all `docker-compose` CLI commands with `docker compose` (plugin syntax)
- Update Dockerfile stage descriptions: node:18-alpine → node:20.11-bookworm-slim,
  built-in node user → explicit nodeapp:1001 non-root user
- Update image version references: postgres:14-alpine → postgres:14.12-alpine3.19,
  redis:7-alpine → redis:7.2-alpine3.19
- Externalize postgres credentials: hardcoded values → POSTGRES_USER/PASSWORD/DB env vars
- Externalize Grafana admin password: hardcoded 'agentidp' → GF_ADMIN_PASSWORD env var
- Add Docker Compose Variables section to environment-variables.md (POSTGRES_*, GF_ADMIN_PASSWORD)
- Update local-development.md Step 3: cp .env.example .env, document POSTGRES_* purpose
- Update quick-start.md: cp .env.example .env, use awk/sed for JWT key injection
- Update 07-dev-setup.md: remove 'no .env.example' claim, reference cp .env.example
- Update docker-compose.yml key file description in 04-codebase-structure.md
- Update monitoring overlay launch commands across all docs (compose.yaml + compose.monitoring.yaml)
- Update volume names to kebab-case: postgres_data → postgres-data, redis_data → redis-data
- Fix compliance encryption-runbook: docker-compose restart agentidp → docker compose restart app

All docs now consistent with compose.yaml in repo root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 08:27:37 +00:00


# Operations
Startup, shutdown, log interpretation, and troubleshooting for AgentIdP.
---
## Startup Order
Always start services in this order. Starting the application before PostgreSQL or Redis is ready will cause connection errors on first request.
```
1. PostgreSQL (must be healthy)
2. Redis (must be healthy)
3. Migrations (must complete successfully)
4. Application (start last)
```
### Startup checklist
```bash
# 1. Start the full stack
docker compose up --build -d
# 2. Verify all three services are healthy
docker compose ps
# app, postgres, and redis must all show "healthy"
# 3. Run migrations
docker compose exec app npm run db:migrate
# 4. Verify application health
curl http://localhost:3000/health
# Expected: {"status":"ok"}
# 5. (Optional) Start the portal for local dev
cd portal && npm run dev
```
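`docker compose up -d` returns before the containers report `healthy`, so steps 2 and 4 may need to be retried. The following is a minimal sketch of a polling helper, assuming bash; `wait_until` is a hypothetical name and is not part of the repository.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Retries COMMAND once per second until it succeeds or TIMEOUT_SECONDS elapse.
# Hypothetical helper for illustration; not shipped with AgentIdP.
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( $(date +%s) >= deadline )); then
      return 1   # gave up: the command never succeeded
    fi
    sleep 1
  done
  return 0
}
```

For example, `wait_until 60 curl -fsS http://localhost:3000/health` would block for up to a minute until the health endpoint answers with a success status, which makes it usable as a gate in a startup script.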
---
## Graceful Shutdown
The application handles `SIGTERM` and `SIGINT` gracefully:
1. Stops accepting new connections
2. Waits for in-flight requests to complete
3. Exits with code `0`
### Sending SIGTERM
```bash
# Find the PID (pgrep -f matches the full command line and never lists itself)
pgrep -f "node.*server"
# Send SIGTERM
kill -SIGTERM <pid>
```
Expected log output:
```
Shutting down gracefully...
```
The process exits cleanly once in-flight requests have finished. Requests that were already being processed when the signal arrived are not dropped.
### Docker stop
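`docker stop` sends `SIGTERM` and, by default, waits 10 seconds before sending `SIGKILL`. This is sufficient for graceful shutdown as long as in-flight requests finish within that window; pass `-t <seconds>` to allow more time.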
```bash
docker stop sentryagent-idp-app-1
```
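One way to confirm that a stop was graceful is to read the container's exit code afterwards. The sketch below maps the common codes to a label; `classify_exit` is a hypothetical helper, and the mapping relies only on the shell convention that a process killed by signal N reports `128 + N`.

```bash
# classify_exit EXIT_CODE
# Hypothetical helper: interprets the exit code of a stopped container.
classify_exit() {
  case "$1" in
    0)   echo "graceful" ;;    # shutdown handler ran and exited with code 0
    137) echo "killed" ;;      # 128 + 9: SIGKILL, e.g. after the stop timeout expired
    143) echo "terminated" ;;  # 128 + 15: ended by SIGTERM without a clean exit
    *)   echo "other" ;;       # any other application error code
  esac
}
```

The code can be read with `docker inspect --format '{{.State.ExitCode}}' sentryagent-idp-app-1` and passed to the helper. A result of `killed` suggests the drain took longer than the stop timeout.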
---
## Log Reference
AgentIdP logs to stdout. In development (`NODE_ENV=development`), Morgan HTTP request logs are included. In test (`NODE_ENV=test`), Morgan is suppressed.
### Startup logs
| Log line | Meaning |
|----------|---------|
| `SentryAgent.ai AgentIdP listening on port 3000` | Server bound successfully — ready to accept requests |
| `Shutting down gracefully...` | SIGTERM/SIGINT received — draining connections |
### Error logs
| Log line | Meaning |
|----------|---------|
| `Failed to start server: Error: DATABASE_URL environment variable is required` | `DATABASE_URL` is not set in the environment |
| `Failed to start server: Error: REDIS_URL environment variable is required` | `REDIS_URL` is not set |
| `Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required` | One or both JWT keys are missing |
| `Unexpected pg pool error <err>` | PostgreSQL connection dropped after startup — check DB availability |
| `Redis client error <err>` | Redis connection error after startup — check Redis availability |
### Morgan HTTP request format (development)
```
::1 - - [28/Mar/2026:09:01:00 +0000] "POST /api/v1/token HTTP/1.1" 200 312 "-" "curl/7.88.1"
```
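Format: `<ip> - <remote-user> [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"`. This layout matches Morgan's `combined` format; the remote user is logged as `-` when the request carries no basic-auth user.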
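When scanning development logs, it can help to pull just the status code out of each request line. The sketch below assumes the quoted layout shown above; `morgan_status` is a hypothetical helper name.

```bash
# morgan_status: reads log lines on stdin and prints the HTTP status code of
# each line that follows the quoted request-log layout shown above.
# Hypothetical helper; lines without a three-digit status are skipped.
morgan_status() {
  awk -F'"' '{
    # Splitting on double quotes leaves " <status> <bytes> " in field 3.
    split($3, parts, " ")
    if (parts[1] ~ /^[0-9][0-9][0-9]$/) print parts[1]
  }'
}
```

A possible use is `docker compose logs app | morgan_status | sort | uniq -c`, which gives a rough count of responses per status code.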
---
## Redis Key Patterns
Six key patterns are used in Redis. They are useful for debugging and manual inspection.
```bash
# Connect to Redis CLI
docker compose exec redis redis-cli
```
| Key pattern | Example | Purpose | TTL |
|------------|---------|---------|-----|
| `revoked:<jti>` | `revoked:f1e2d3c4-...` | Revoked token JTI | Remaining token lifetime |
| `rate:<client_id>:<window>` | `rate:a1b2c3...:29086156` | Request count per window | `RATE_LIMIT_WINDOW_MS` |
| `monthly:<client_id>:<year>:<month>` | `monthly:a1b2c3...:2026:3` | Monthly token issuance count | End of month |
| `rate:tier:calls:<tenantId>` | `rate:tier:calls:org-uuid` | Daily API call counter for tier enforcement | Until midnight UTC |
| `rate:tier:tokens:<tenantId>` | `rate:tier:tokens:org-uuid` | Daily token issuance counter for tier enforcement | Until midnight UTC |
| `compliance:report:<tenantId>` | `compliance:report:org-uuid` | Cached compliance report JSON | 5 minutes |
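The TTL column can be checked with `redis-cli TTL <key>`, which returns `-2` when the key does not exist and `-1` when the key exists without an expiry. A small sketch for turning that reply into a readable label follows; `describe_ttl` is a hypothetical helper.

```bash
# describe_ttl TTL_REPLY
# Hypothetical helper: interprets the integer returned by the Redis TTL command.
describe_ttl() {
  case "$1" in
    -2) echo "missing" ;;             # key does not exist (expired or never set)
    -1) echo "no-expiry" ;;           # key exists but has no TTL set
    *)  echo "expires-in-${1}s" ;;    # remaining lifetime in seconds
  esac
}
```

For instance, `describe_ttl "$(docker compose exec -T redis redis-cli TTL "compliance:report:<org_id>")"` would report whether a cached compliance report is still present. A `no-expiry` result on any of the keys above would be unexpected, since every pattern in the table carries a TTL.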
Inspect keys:
```bash
# List all revoked tokens (--scan iterates incrementally; KEYS blocks Redis on large keyspaces)
redis-cli --scan --pattern "revoked:*"
# Check rate limit counter for a specific client
redis-cli GET "rate:<client_id>:<window_key>"
# Check monthly token count for a specific client
redis-cli GET "monthly:<client_id>:2026:3"
# Check tier API call counter for a tenant
redis-cli GET "rate:tier:calls:<org_id>"
# Check tier token counter for a tenant
redis-cli GET "rate:tier:tokens:<org_id>"
# Check cached compliance report for a tenant
redis-cli GET "compliance:report:<org_id>"
redis-cli TTL "compliance:report:<org_id>"
```
Where `<window_key>` is `floor(unix_ms / 60000)`, that is, the number of whole 60-second windows since the Unix epoch. For the current window:
```bash
node -e "console.log(Math.floor(Date.now() / 60000))"
```
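The same arithmetic can be done in plain bash when Node is not available on the host. This is a sketch under the assumption that the key is always the integer quotient of the millisecond timestamp and the window length; the optional second argument for other window sizes is an illustrative generalisation, not something the source confirms.

```bash
# window_key NOW_MS [WINDOW_MS]
# Integer division in bash is floor division for non-negative values.
window_key() {
  local now_ms="$1" window_ms="${2:-60000}"
  echo $(( now_ms / window_ms ))
}

# ms_until_next_window NOW_MS [WINDOW_MS]
# Milliseconds remaining before the next window (and a fresh counter) begins.
ms_until_next_window() {
  local now_ms="$1" window_ms="${2:-60000}"
  echo $(( window_ms - now_ms % window_ms ))
}
```

For the current time, `window_key "$(( $(date +%s) * 1000 ))"` yields the same value as the Node one-liner, because dropping sub-second precision does not change a 60-second quotient.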
---
## Troubleshooting
### Application fails to start — missing environment variable
**Symptom:**
```
Failed to start server: Error: DATABASE_URL environment variable is required
```
**Fix:** Ensure your `.env` file exists in the project root and contains all required variables. Verify:
```bash
grep -E "^(DATABASE_URL|REDIS_URL|JWT_PRIVATE_KEY|JWT_PUBLIC_KEY)=" .env
```
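The `grep` above prints the variables that are present; when several are required it can be quicker to list the ones that are absent. A minimal sketch, where `missing_env_vars` is a hypothetical helper and the variable list is taken from the error messages in the Log Reference:

```bash
# missing_env_vars FILE
# Prints the name of each required variable that has no NAME= line in FILE.
# Hypothetical helper; it checks for presence only, not for valid values.
missing_env_vars() {
  local file="$1" name
  for name in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    grep -q "^${name}=" "$file" || echo "$name"
  done
}
```

Running `missing_env_vars .env` in the project root prints nothing when all four variables are defined.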
---
### Application fails to start — JWT key error
**Symptom:**
```
Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required
```
**Fix:** Generate RSA keys and add them to `.env`. See [security.md](security.md).
---
### PostgreSQL connection refused on first request
**Symptom:**
```
Error: connect ECONNREFUSED 127.0.0.1:5432
```
**Causes and fixes:**
| Cause | Fix |
|-------|-----|
| PostgreSQL container not started | Run `docker compose up -d postgres` |
| PostgreSQL container not yet healthy | Re-run `docker compose ps` until the status shows `healthy` |
| Wrong `DATABASE_URL` host/port | Check `DATABASE_URL` matches the PostgreSQL port (5432) |
| PostgreSQL container exited | Run `docker compose logs postgres` to see why it exited |
---
### Redis connection error on first request
**Symptom:**
```
Redis client error Error: connect ECONNREFUSED 127.0.0.1:6379
```
**Causes and fixes:**
| Cause | Fix |
|-------|-----|
| Redis container not started | Run `docker compose up -d redis` |
| Redis container not yet healthy | Re-run `docker compose ps` until the status shows `healthy` |
| Wrong `REDIS_URL` | Check `REDIS_URL` matches the Redis port (6379) |
---
### Migration fails
**Symptom:**
```
Migration failed: Error: connect ECONNREFUSED 127.0.0.1:5432
```
**Fix:** PostgreSQL is not running or not reachable. Start it and verify health before running migrations.
**Symptom:**
```
Migration failed: Error: relation "agents" already exists
```
**Fix:** The migration was already applied, possibly only partially. Check which migrations are recorded in `schema_migrations`:
```bash
psql "$DATABASE_URL" -c "SELECT name FROM schema_migrations ORDER BY name;"
```
If a migration is listed there but the table is inconsistent, manually inspect and repair the database state before re-running.
---
### All requests return 401 after key rotation
**Symptom:** Every API call returns `401 UNAUTHORIZED` with `Token signature is invalid.`
**Cause:** JWT keys were rotated. All previously issued tokens were signed with the old private key and are now invalid.
**Fix:** Clients must re-authenticate using `POST /api/v1/token` with their `client_id` and `client_secret` to obtain a new token signed with the new key. This is expected behaviour after key rotation.
---
### Rate limit hit unexpectedly — 429 responses
**Symptom:** API returns `429 RATE_LIMIT_EXCEEDED` with `X-RateLimit-Reset` header.
**Check current rate limit state:**
```bash
# Find the current window key
WINDOW=$(node -e "console.log(Math.floor(Date.now() / 60000))")
# Check count for a specific client
docker compose exec redis redis-cli GET "rate:<client_id>:$WINDOW"
```
**Fix:** Wait until `X-RateLimit-Reset` (Unix timestamp in the response header) before retrying. The window resets every 60 seconds.
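A client script can derive its wait time from the header value. The sketch below assumes `X-RateLimit-Reset` is expressed in seconds since the Unix epoch; `retry_delay` is a hypothetical helper name.

```bash
# retry_delay RESET_EPOCH_SECONDS [NOW_EPOCH_SECONDS]
# Prints how many seconds to wait before retrying; never negative.
# Assumes the reset value is in seconds, which the source does not specify.
retry_delay() {
  local reset="$1" now="${2:-$(date +%s)}"
  local delay=$(( reset - now ))
  if (( delay < 0 )); then
    delay=0   # reset time already passed: retry immediately
  fi
  echo "$delay"
}
```

With the header value stored in `reset`, `sleep "$(retry_delay "$reset")"` pauses until the window has rolled over.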
---
## Monitoring
AgentIdP exposes a Prometheus metrics endpoint at `GET /metrics` (unauthenticated, plain text).
### Metrics Exposed
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | OAuth 2.0 tokens issued |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Agents registered |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | HTTP requests |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP latency |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration |
| `agentidp_webhook_dead_letters_total` | Counter | `event_type` | Webhook deliveries moved to dead-letter queue |
| `agentidp_credentials_expiring_soon_total` | Gauge | — | Credentials expiring within 7 days |
| `agentidp_audit_chain_integrity` | Gauge | — | `1` if audit chain is intact, `0` if broken |
| `agentidp_rate_limit_hits_total` | Counter | `client_id` | Rate limit rejections |
| `agentidp_db_pool_active_connections` | Gauge | — | Active PostgreSQL connections |
| `agentidp_db_pool_waiting_requests` | Gauge | — | Requests waiting for a pool connection |
| `agentidp_tenant_api_calls_total` | Counter | `org_id`, `tier` | API calls per tenant per tier |
| `agentidp_billing_limit_rejections_total` | Counter | `org_id`, `limit_type` | Tier limit enforcement rejections |
| `agentidp_did_documents_generated_total` | Counter | — | DID documents generated |
| `agentidp_oidc_tokens_issued_total` | Counter | — | OIDC ID tokens issued |
| `agentidp_federation_events_total` | Counter | `event_type` | Federation partner events |
| `agentidp_delegation_chains_created_total` | Counter | — | A2A delegation chains created |
| `agentidp_compliance_reports_generated_total` | Counter | — | Compliance reports generated |
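For a quick check without Prometheus, a single unlabelled metric can be read straight from the plain-text endpoint. This is a sketch only: `metric_value` is a hypothetical helper, and it deliberately ignores labelled series such as `agentidp_http_requests_total{...}`.

```bash
# metric_value NAME
# Reads Prometheus text exposition on stdin and prints the value of the
# unlabelled sample called NAME. Comment lines (# HELP, # TYPE) never match
# because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}
```

As an example, `curl -s http://localhost:3000/metrics | metric_value agentidp_audit_chain_integrity` should print `1` while the audit chain is intact.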
### Starting the Monitoring Stack
```bash
# Start the full stack with monitoring
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin / <GF_ADMIN_PASSWORD from .env>)
```
The Grafana dashboard auto-provisions on first start. Navigate to **Dashboards → AgentIdP → SentryAgent.ai — AgentIdP**.
### Security Note
`GET /metrics` is unauthenticated. In production, ensure this endpoint is:
- Only accessible from your internal network (firewall rule or reverse proxy restriction)
- Not exposed on a public-facing port
---
### Tier limit rejected — 429 with `tier_limit_exceeded` code
**Symptom:** `429 TOO_MANY_REQUESTS` with body `{"code":"tier_limit_exceeded","message":"..."}`
**Check the tenant's current tier counter:**
```bash
# Check API call counter
docker compose exec redis redis-cli GET "rate:tier:calls:<org_id>"
# Check the tenant's tier
psql "$DATABASE_URL" -c "SELECT org_id, tier FROM tenant_tiers WHERE org_id = '<org_id>';"
```
**Fix:** If the org is on the `free` tier and has hit 1,000 calls/day, upgrade the tier or wait until midnight UTC for the counter to reset.
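To estimate how long the wait will be, the time remaining until the next UTC midnight can be computed from the Unix timestamp alone, since Unix time counts exactly 86,400 seconds per day. A minimal sketch, assuming the counter resets at 00:00 UTC as stated above; `seconds_until_utc_midnight` is a hypothetical helper.

```bash
# seconds_until_utc_midnight [NOW_EPOCH_SECONDS]
# Prints the number of seconds until the next 00:00 UTC.
seconds_until_utc_midnight() {
  local now="${1:-$(date +%s)}"
  echo $(( 86400 - now % 86400 ))
}
```

Called without arguments it uses the current time; the optional argument exists only to make the calculation reproducible.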
---
### Analytics endpoints return 404
**Cause:** `ANALYTICS_ENABLED` is set to `false` in `.env`.
**Fix:** Set `ANALYTICS_ENABLED=true` and restart the application.
---
### Compliance report returns 404
**Cause:** `COMPLIANCE_ENABLED` is set to `false` in `.env`.
**Fix:** Set `COMPLIANCE_ENABLED=true` and restart the application.
---
### Portal CORS error
**Symptom:** Browser console shows an `Access-Control-Allow-Origin` error on requests to `http://localhost:3000`.
**Fix:** Ensure `CORS_ORIGIN` in `.env` includes `http://localhost:3001`:
```
CORS_ORIGIN=http://localhost:3001
```
Restart the application after changing this variable.
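The effect of the change can be verified from the command line by sending a request with an `Origin` header and inspecting the response headers. The sketch below extracts the allowed origin from a raw header dump; `allowed_origin` is a hypothetical helper, and whether a given route returns CORS headers depends on how the application's middleware is configured.

```bash
# allowed_origin: reads raw HTTP response headers on stdin and prints the
# value of Access-Control-Allow-Origin, if present.
# Hypothetical helper; header names are matched case-insensitively.
allowed_origin() {
  tr -d '\r' | awk -F': ' 'tolower($1) == "access-control-allow-origin" { print $2 }'
}
```

A possible invocation is `curl -s -o /dev/null -D - -H "Origin: http://localhost:3001" http://localhost:3000/health | allowed_origin`. Empty output would indicate that the origin is still not allowed for that route.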