# Operations

Startup, shutdown, log interpretation, and troubleshooting for AgentIdP.

---

## Startup Order

Always start services in this order. Starting the application before PostgreSQL or Redis is ready will cause connection errors on first request.

```
1. PostgreSQL (must be healthy)
2. Redis (must be healthy)
3. Migrations (must complete successfully)
4. Application (start last)
```

### Startup checklist

```bash
# 1. Start the full stack
docker compose up --build -d

# 2. Verify all three services are healthy
docker compose ps
# app, postgres, and redis must all show "healthy"

# 3. Run migrations
docker compose exec app npm run db:migrate

# 4. Verify application health
curl http://localhost:3000/health
# Expected: {"status":"ok"}

# 5. (Optional) Start the portal for local dev
cd portal && npm run dev
```
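
Step 4 of the checklist can fail if it runs while the application is still booting. The sketch below shows one way to poll instead of checking once. `retry_until` is a hypothetical helper, not something shipped with AgentIdP, and the attempt count and delay in the usage comment are arbitrary assumptions.

```bash
# Hypothetical helper (not part of the AgentIdP repo): run a command repeatedly
# until it succeeds or the attempt budget is used up.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
# Example (assumed values): retry_until 30 2 curl -fsS http://localhost:3000/health
retry_until() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```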

---

## Graceful Shutdown

The application handles `SIGTERM` and `SIGINT` gracefully:

1. Stops accepting new connections
2. Waits for in-flight requests to complete
3. Exits with code `0`

### Sending SIGTERM

```bash
# Find the PID (the [n] keeps grep from matching its own process)
ps aux | grep "[n]ode.*server"

# Send SIGTERM
kill -SIGTERM <pid>
```

Expected log output:

```
Shutting down gracefully...
```

The process exits cleanly. Requests that were already in flight are allowed to complete before exit.

### Docker stop

`docker stop` sends `SIGTERM` by default with a 10-second timeout before `SIGKILL`. This is sufficient for graceful shutdown as long as in-flight requests finish within that window.

```bash
docker stop sentryagent-idp-app-1
```

---

## Log Reference

AgentIdP logs to stdout. In development (`NODE_ENV=development`), Morgan HTTP request logs are included. In test (`NODE_ENV=test`), Morgan is suppressed.

### Startup and shutdown logs

| Log line | Meaning |
|----------|---------|
| `SentryAgent.ai AgentIdP listening on port 3000` | Server bound successfully — ready to accept requests |
| `Shutting down gracefully...` | SIGTERM/SIGINT received — draining connections |

### Error logs

| Log line | Meaning |
|----------|---------|
| `Failed to start server: Error: DATABASE_URL environment variable is required` | `DATABASE_URL` is not set in the environment |
| `Failed to start server: Error: REDIS_URL environment variable is required` | `REDIS_URL` is not set |
| `Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required` | One or both JWT keys are missing |
| `Unexpected pg pool error <err>` | PostgreSQL connection dropped after startup — check DB availability |
| `Redis client error <err>` | Redis connection error after startup — check Redis availability |

### Morgan HTTP request format (development)

```
::1 - - [28/Mar/2026:09:01:00 +0000] "POST /api/v1/token HTTP/1.1" 200 312 "-" "curl/7.88.1"
```

Format: `<ip> - - [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"`
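
When scanning a long log for failing requests, it can help to reduce each Morgan line to its method, path, and status. The sketch below does that with `awk`, assuming the exact layout shown above. `parse_morgan` is a hypothetical name, and lines without the quoted request field (startup or error messages, for example) are skipped. It reads stdin, so it could be fed from something like `docker compose logs app`.

```bash
# Hypothetical helper: print "<method> <path> <status>" for each Morgan line on stdin.
parse_morgan() {
  # Splitting on double quotes puts the request in $2 and " <status> <bytes> " in $3.
  awk -F'"' 'NF >= 5 {
    split($2, req, " ")   # req[1] = method, req[2] = path, req[3] = protocol
    split($3, res, " ")   # res[1] = status, res[2] = bytes
    print req[1], req[2], res[1]
  }'
}
```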

---

## Redis Key Patterns

AgentIdP uses six key patterns in Redis. Knowing them helps with debugging and manual inspection.

```bash
# Connect to Redis CLI
docker compose exec redis redis-cli
```

| Key pattern | Example | Purpose | TTL |
|------------|---------|---------|-----|
| `revoked:<jti>` | `revoked:f1e2d3c4-...` | Revoked token JTI | Remaining token lifetime |
| `rate:<client_id>:<window>` | `rate:a1b2c3...:29086156` | Request count per window | `RATE_LIMIT_WINDOW_MS` |
| `monthly:<client_id>:<year>:<month>` | `monthly:a1b2c3...:2026:3` | Monthly token issuance count | End of month |
| `rate:tier:calls:<tenantId>` | `rate:tier:calls:org-uuid` | Daily API call counter for tier enforcement | Until midnight UTC |
| `rate:tier:tokens:<tenantId>` | `rate:tier:tokens:org-uuid` | Daily token issuance counter for tier enforcement | Until midnight UTC |
| `compliance:report:<tenantId>` | `compliance:report:org-uuid` | Cached compliance report JSON | 5 minutes |

Inspect keys:

```bash
# List all revoked tokens
# (KEYS walks the whole keyspace and blocks Redis while it runs; on a busy
# instance, `redis-cli --scan --pattern "revoked:*"` is the incremental alternative)
redis-cli KEYS "revoked:*"

# Check rate limit counter for a specific client
redis-cli GET "rate:<client_id>:<window_key>"

# Check monthly token count for a specific client
redis-cli GET "monthly:<client_id>:2026:3"

# Check tier API call counter for a tenant
redis-cli GET "rate:tier:calls:<org_id>"

# Check tier token counter for a tenant
redis-cli GET "rate:tier:tokens:<org_id>"

# Check cached compliance report for a tenant
redis-cli GET "compliance:report:<org_id>"
redis-cli TTL "compliance:report:<org_id>"
```

Here `<window_key>` is `floor(unix_ms / 60000)`, i.e. the index of the current 60-second window. To compute it for the current window:

```bash
node -e "console.log(Math.floor(Date.now() / 60000))"
```
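
Three of the patterns in the table share the `rate:` prefix, so a naive prefix match can confuse per-client windows with tier counters. The sketch below shows one way to tell them apart when reading a key listing. `classify_key` and its output labels are hypothetical and exist only for illustration. The usage comment assumes `redis-cli` is run where it can reach the instance.

```bash
# Hypothetical helper: map a Redis key to the purpose listed in the table above.
# Usage (assumed): redis-cli --scan | while read -r key; do
#                    printf '%s\t%s\n' "$(classify_key "$key")" "$key"; done
classify_key() {
  case "$1" in
    revoked:*)           echo "revoked-token" ;;
    # The two tier counters must be matched before the generic rate:* pattern.
    rate:tier:calls:*)   echo "tier-api-calls" ;;
    rate:tier:tokens:*)  echo "tier-token-issuance" ;;
    rate:*)              echo "rate-limit-window" ;;
    monthly:*)           echo "monthly-token-count" ;;
    compliance:report:*) echo "compliance-report-cache" ;;
    *)                   echo "unknown" ;;
  esac
}
```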

---

## Troubleshooting

### Application fails to start — missing environment variable

**Symptom:**
```
Failed to start server: Error: DATABASE_URL environment variable is required
```

**Fix:** Ensure your `.env` file exists in the project root and contains all required variables. The command below prints only the variable names (`-o`), so secret values stay off the terminal; all four names should appear:
```bash
grep -oE "^(DATABASE_URL|REDIS_URL|JWT_PRIVATE_KEY|JWT_PUBLIC_KEY)=" .env
```
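
For scripts or CI, it may be more convenient to list what is missing than to eyeball what is present. The sketch below is one way to do that. `missing_env_vars` is a hypothetical helper, the variable list is taken from the error messages in this document, and a line with an empty value (`NAME=`) still counts as present.

```bash
# Hypothetical helper: print each required variable that has no NAME= line in the file.
# Usage: missing_env_vars [path-to-env-file]   (defaults to .env)
missing_env_vars() {
  local env_file=${1:-.env} name
  for name in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    # A missing or unreadable file makes grep fail, so every name is reported.
    grep -q "^${name}=" "$env_file" 2>/dev/null || echo "$name"
  done
}
```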

---

### Application fails to start — JWT key error

**Symptom:**
```
Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required
```

**Fix:** Generate RSA keys and add them to `.env`. See [security.md](security.md).

---

### PostgreSQL connection refused on first request

**Symptom:**
```
Error: connect ECONNREFUSED 127.0.0.1:5432
```

**Causes and fixes:**

| Cause | Fix |
|-------|-----|
| PostgreSQL container not started | Run `docker compose up -d postgres` |
| PostgreSQL container not yet healthy | Re-run `docker compose ps` until the service shows `healthy` |
| Wrong `DATABASE_URL` host/port | Check `DATABASE_URL` matches the PostgreSQL port (5432) |
| PostgreSQL container exited | Run `docker compose logs postgres` to see why it exited |

---

### Redis connection error on first request

**Symptom:**
```
Redis client error Error: connect ECONNREFUSED 127.0.0.1:6379
```

**Causes and fixes:**

| Cause | Fix |
|-------|-----|
| Redis container not started | Run `docker compose up -d redis` |
| Redis container not yet healthy | Re-run `docker compose ps` until the service shows `healthy` |
| Wrong `REDIS_URL` | Check `REDIS_URL` matches the Redis port (6379) |

---

### Migration fails

**Symptom:**
```
Migration failed: Error: connect ECONNREFUSED 127.0.0.1:5432
```

**Fix:** PostgreSQL is not running or not reachable. Start it and verify health before running migrations.

**Symptom:**
```
Migration failed: Error: relation "agents" already exists
```

**Fix:** The migration was already partially applied. Check `schema_migrations`:
```bash
psql "$DATABASE_URL" -c "SELECT name FROM schema_migrations ORDER BY name;"
```
If a migration is listed there but the table is inconsistent, manually inspect and repair the database state before re-running.

---

### All requests return 401 after key rotation

**Symptom:** Every API call returns `401 UNAUTHORIZED` with `Token signature is invalid.`

**Cause:** JWT keys were rotated. All previously issued tokens were signed with the old private key, so they no longer verify against the new public key.

**Fix:** Clients must re-authenticate using `POST /api/v1/token` with their `client_id` and `client_secret` to obtain a new token signed with the new key. This is expected behaviour after key rotation.

---

### Rate limit hit unexpectedly — 429 responses

**Symptom:** API returns `429 RATE_LIMIT_EXCEEDED` with `X-RateLimit-Reset` header.

**Check current rate limit state:**
```bash
# Find the current window key
WINDOW=$(node -e "console.log(Math.floor(Date.now() / 60000))")
# Check count for a specific client
docker compose exec redis redis-cli GET "rate:<client_id>:$WINDOW"
```

**Fix:** Wait until `X-RateLimit-Reset` (Unix timestamp in the response header) before retrying. The window resets every 60 seconds.
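
The window arithmetic and the wait time can both be computed in plain shell, without Node. The sketch below is illustrative only. `window_key` and `seconds_until_reset` are hypothetical names, the 60000 ms default mirrors the 60-second window described here, and `seconds_until_reset` assumes `X-RateLimit-Reset` is expressed in seconds since the epoch, which this document does not confirm.

```bash
# Hypothetical helpers. Each accepts an explicit "now" so results can be checked offline.

# window_key <unix_ms> [window_ms]: index of the rate-limit window containing unix_ms.
window_key() {
  echo $(( $1 / ${2:-60000} ))
}

# seconds_until_reset <reset_epoch_s> [now_epoch_s]: seconds to wait before retrying.
# Assumes the header value is in seconds; the result is never negative.
seconds_until_reset() {
  local reset=$1 now=${2:-$(date +%s)}
  local remaining=$(( reset - now ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}
```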

---

## Monitoring

AgentIdP exposes a Prometheus metrics endpoint at `GET /metrics` (unauthenticated, plain text).

### Metrics Exposed

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | OAuth 2.0 tokens issued |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Agents registered |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | HTTP requests |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP latency |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration |
| `agentidp_webhook_dead_letters_total` | Counter | `event_type` | Webhook deliveries moved to dead-letter queue |
| `agentidp_credentials_expiring_soon_total` | Gauge | — | Credentials expiring within 7 days |
| `agentidp_audit_chain_integrity` | Gauge | — | `1` if audit chain is intact, `0` if broken |
| `agentidp_rate_limit_hits_total` | Counter | `client_id` | Rate limit rejections |
| `agentidp_db_pool_active_connections` | Gauge | — | Active PostgreSQL connections |
| `agentidp_db_pool_waiting_requests` | Gauge | — | Requests waiting for a pool connection |
| `agentidp_tenant_api_calls_total` | Counter | `org_id`, `tier` | API calls per tenant per tier |
| `agentidp_billing_limit_rejections_total` | Counter | `org_id`, `limit_type` | Tier limit enforcement rejections |
| `agentidp_did_documents_generated_total` | Counter | — | DID documents generated |
| `agentidp_oidc_tokens_issued_total` | Counter | — | OIDC ID tokens issued |
| `agentidp_federation_events_total` | Counter | `event_type` | Federation partner events |
| `agentidp_delegation_chains_created_total` | Counter | — | A2A delegation chains created |
| `agentidp_compliance_reports_generated_total` | Counter | — | Compliance reports generated |

### Starting the Monitoring Stack

```bash
# Start the full stack with monitoring
docker compose -f compose.yaml -f compose.monitoring.yaml up -d

# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin / <GF_ADMIN_PASSWORD from .env>)
```

The Grafana dashboard auto-provisions on first start. Navigate to **Dashboards → AgentIdP → SentryAgent.ai — AgentIdP**.
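
For a quick spot check without Grafana, the plain-text `/metrics` output can be read directly. The sketch below pulls the value of a single unlabelled gauge such as `agentidp_audit_chain_integrity`. `metric_value` is a hypothetical helper, it only handles metrics that have no labels, and the URL in the usage comment assumes the endpoint is served on the same port as `/health`.

```bash
# Hypothetical helper: print the value of an unlabelled metric read from stdin.
# Usage (assumed URL): curl -fsS http://localhost:3000/metrics | metric_value agentidp_audit_chain_integrity
metric_value() {
  # Sample lines look like "<name> <value>"; "# HELP" and "# TYPE" lines start
  # with "#" in the first field, so they never match.
  awk -v name="$1" '$1 == name { print $2 }'
}
```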

### Security Note

`GET /metrics` is unauthenticated. In production, ensure this endpoint is:
- Only accessible from your internal network (firewall rule or reverse proxy restriction)
- Not exposed on a public-facing port

---

### Tier limit rejected — 429 with `tier_limit_exceeded` code

**Symptom:** `429 TOO_MANY_REQUESTS` with body `{"code":"tier_limit_exceeded","message":"..."}`

Check the tenant's current tier counter:
```bash
# Check API call counter
docker compose exec redis redis-cli GET "rate:tier:calls:<org_id>"

# Check the tenant's tier
psql "$DATABASE_URL" -c "SELECT org_id, tier FROM tenant_tiers WHERE org_id = '<org_id>';"
```

If the org is on the `free` tier and has hit 1,000 calls/day, upgrade the tier or wait until midnight UTC for the counter to reset.
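
To estimate how long the wait is, the time remaining until the next midnight UTC can be derived from the Unix clock. The sketch below is illustrative. `seconds_until_midnight_utc` is a hypothetical helper, and it assumes the counter really expires at 00:00 UTC as described in the Redis key table, in which case the result should roughly match `redis-cli TTL "rate:tier:calls:<org_id>"`.

```bash
# Hypothetical helper: seconds from now_epoch_s (default: the current time) to
# the next 00:00:00 UTC. Unix time has no leap seconds, so a UTC day is 86400 s.
# Usage: seconds_until_midnight_utc [now_epoch_s]
seconds_until_midnight_utc() {
  local now=${1:-$(date +%s)}
  echo $(( 86400 - now % 86400 ))
}
```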

---

### Analytics endpoints return 404

**Cause:** `ANALYTICS_ENABLED` is set to `false` in `.env`.

**Fix:** Set `ANALYTICS_ENABLED=true` and restart the application.

---

### Compliance report returns 404

**Cause:** `COMPLIANCE_ENABLED` is set to `false` in `.env`.

**Fix:** Set `COMPLIANCE_ENABLED=true` and restart the application.

---

### Portal CORS error

**Symptom:** Browser console shows an `Access-Control-Allow-Origin` error on requests to `http://localhost:3000`.

**Fix:** Ensure `CORS_ORIGIN` in `.env` includes `http://localhost:3001`:
```
CORS_ORIGIN=http://localhost:3001
```
Restart the application after changing this variable.