Operations
Startup, shutdown, log interpretation, and troubleshooting for AgentIdP.
Startup Order
Always start services in this order. Starting the application before PostgreSQL or Redis is ready will cause connection errors on first request.
1. PostgreSQL (must be healthy)
2. Redis (must be healthy)
3. Migrations (must complete successfully)
4. Application (start last)
Startup checklist
# 1. Start the full stack
docker compose up --build -d
# 2. Verify all three services are healthy
docker compose ps
# app, postgres, and redis must all show "healthy"
# 3. Run migrations
docker compose exec app npm run db:migrate
# 4. Verify application health
curl http://localhost:3000/health
# Expected: {"status":"ok"}
# 5. (Optional) Start the portal for local dev
cd portal && npm run dev
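Step 4 of the checklist can be automated when the stack is started from a script. The following is a minimal sketch, assuming Node 18+ (for the global fetch) and the documented {"status":"ok"} response; the function names, retry count, and delay are illustrative and not part of AgentIdP.

```javascript
// Returns true only for the documented healthy body: {"status":"ok"}.
function isHealthyBody(text) {
  try {
    return JSON.parse(text).status === "ok";
  } catch {
    return false; // not valid JSON (or not an object): treat as unhealthy
  }
}

// Polls the health endpoint up to `attempts` times, `delayMs` apart.
// The defaults are assumptions; tune them for your environment.
async function waitForHealth(
  url = "http://localhost:3000/health",
  attempts = 30,
  delayMs = 1000
) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok && isHealthyBody(await res.text())) return true;
    } catch {
      // Connection refused while the app is still starting: retry.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

A caller might run waitForHealth().then((ok) => process.exit(ok ? 0 : 1)) so that a deployment script fails fast when the application never becomes healthy.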
Graceful Shutdown
The application handles SIGTERM and SIGINT gracefully:
- Stops accepting new connections
- Waits for in-flight requests to complete
- Exits with code 0
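The three steps above could be wired up roughly as follows. This is a sketch of the general Node.js pattern, not the actual AgentIdP source: installGracefulShutdown is a hypothetical helper, and the injectable proc and log parameters exist only to make it testable.

```javascript
// Hypothetical helper: drains an http.Server on SIGTERM or SIGINT.
function installGracefulShutdown(server, { proc = process, log = console.log } = {}) {
  let shuttingDown = false;

  const shutdown = () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    log("Shutting down gracefully...");
    // server.close() stops accepting new connections; its callback runs
    // once the remaining connections have finished.
    server.close(() => proc.exit(0));
  };

  proc.on("SIGTERM", shutdown);
  proc.on("SIGINT", shutdown);
  return shutdown;
}
```

Note that idle keep-alive connections can delay the close callback on some Node.js versions, so a real implementation may also need a hard timeout shorter than the 10-second docker stop limit.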
Sending SIGTERM
# Find the PID (the [n] keeps grep from matching its own process)
ps aux | grep "[n]ode.*server"
# Send SIGTERM
kill -SIGTERM <pid>
Expected log output:
Shutting down gracefully...
The process exits cleanly. No requests are dropped if they were already in-flight.
Docker stop
docker stop sends SIGTERM by default with a 10-second timeout before SIGKILL. This is sufficient for graceful shutdown.
docker stop sentryagent-idp-app-1
Log Reference
AgentIdP logs to stdout. In development (NODE_ENV=development), Morgan HTTP request logs are included. In test (NODE_ENV=test), Morgan is suppressed.
Startup logs
| Log line | Meaning |
|---|---|
| SentryAgent.ai AgentIdP listening on port 3000 | Server bound successfully; ready to accept requests |
| Shutting down gracefully... | SIGTERM/SIGINT received; draining connections |
Error logs
| Log line | Meaning |
|---|---|
| Failed to start server: Error: DATABASE_URL environment variable is required | DATABASE_URL is not set in the environment |
| Failed to start server: Error: REDIS_URL environment variable is required | REDIS_URL is not set |
| Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required | One or both JWT keys are missing |
| Unexpected pg pool error <err> | PostgreSQL connection dropped after startup; check DB availability |
| Redis client error <err> | Redis connection error after startup; check Redis availability |
Morgan HTTP request format (development)
::1 - - [28/Mar/2026:09:01:00 +0000] "POST /api/v1/token HTTP/1.1" 200 312 "-" "curl/7.88.1"
Format: <ip> - - [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"
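When filtering development logs by status code or path, it can help to split each line into named fields. The sketch below parses the format shown above with a regular expression; parseMorganLine and its field names are illustrative, and the pattern only covers lines of exactly this shape.

```javascript
// Matches: <ip> - <user> [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"
const MORGAN_LINE =
  /^(\S+) - (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$/;

// Returns an object of named fields, or null if the line does not match.
function parseMorganLine(line) {
  const m = MORGAN_LINE.exec(line.trim());
  if (!m) return null;
  const [, ip, user, timestamp, method, path, protocol, status, bytes, referrer, userAgent] = m;
  return {
    ip,
    user,
    timestamp,
    method,
    path,
    protocol,
    status: Number(status),
    bytes: bytes === "-" ? null : Number(bytes), // "-" means no content length
    referrer,
    userAgent,
  };
}
```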
Redis Key Patterns
Six key patterns are used in Redis. They are useful for debugging and manual inspection.
# Connect to Redis CLI
docker compose exec redis redis-cli
| Key pattern | Example | Purpose | TTL |
|---|---|---|---|
| revoked:<jti> | revoked:f1e2d3c4-... | Revoked token JTI | Remaining token lifetime |
| rate:<client_id>:<window> | rate:a1b2c3...:29086156 | Request count per window | RATE_LIMIT_WINDOW_MS |
| monthly:<client_id>:<year>:<month> | monthly:a1b2c3...:2026:3 | Monthly token issuance count | End of month |
| rate:tier:calls:<tenantId> | rate:tier:calls:org-uuid | Daily API call counter for tier enforcement | Until midnight UTC |
| rate:tier:tokens:<tenantId> | rate:tier:tokens:org-uuid | Daily token issuance counter for tier enforcement | Until midnight UTC |
| compliance:report:<tenantId> | compliance:report:org-uuid | Cached compliance report JSON | 5 minutes |
Inspect keys:
# List all revoked tokens (uses SCAN, which avoids blocking Redis the way KEYS can on large keyspaces)
redis-cli --scan --pattern "revoked:*"
# Check rate limit counter for a specific client
redis-cli GET "rate:<client_id>:<window_key>"
# Check monthly token count for a specific client
redis-cli GET "monthly:<client_id>:2026:3"
# Check tier API call counter for a tenant
redis-cli GET "rate:tier:calls:<org_id>"
# Check tier token counter for a tenant
redis-cli GET "rate:tier:tokens:<org_id>"
# Check cached compliance report for a tenant
redis-cli GET "compliance:report:<org_id>"
redis-cli TTL "compliance:report:<org_id>"
Where <window_key> is floor(unix_ms / 60000). For the current window:
node -e "console.log(Math.floor(Date.now() / 60000))"
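The same arithmetic can be used to build complete key names for inspection scripts. This sketch assumes a 60,000 ms window (as in the formula above) and that the monthly key uses the UTC year and an unpadded 1-based month, which matches the monthly:a1b2c3...:2026:3 example but is not confirmed by the source. The helper names are hypothetical.

```javascript
// Builds the rate-limit key for the window containing `nowMs`.
function rateKey(clientId, nowMs = Date.now(), windowMs = 60000) {
  return `rate:${clientId}:${Math.floor(nowMs / windowMs)}`;
}

// Builds the monthly issuance key. Assumes UTC and an unpadded month (3, not 03).
function monthlyKey(clientId, now = new Date()) {
  return `monthly:${clientId}:${now.getUTCFullYear()}:${now.getUTCMonth() + 1}`;
}
```

For example, rateKey("a1b2c3", 1745169360000) yields rate:a1b2c3:29086156, which can be passed directly to redis-cli GET.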
Troubleshooting
Application fails to start — missing environment variable
Symptom:
Failed to start server: Error: DATABASE_URL environment variable is required
Fix: Ensure your .env file exists in the project root and contains all required variables. Verify:
grep -E "^(DATABASE_URL|REDIS_URL|JWT_PRIVATE_KEY|JWT_PUBLIC_KEY)=" .env
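The grep above only confirms that the variables appear in .env. To check what the running process actually sees, a small preflight function along these lines may be useful. It is a sketch: missingEnvVars is a hypothetical helper, and the list contains only the four variables named in this section.

```javascript
// The variables this section documents as required at startup.
const REQUIRED_VARS = ["DATABASE_URL", "REDIS_URL", "JWT_PRIVATE_KEY", "JWT_PUBLIC_KEY"];

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env = process.env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === "");
}
```

An empty array means all four variables are present and non-empty in the given environment.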
Application fails to start — JWT key error
Symptom:
Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required
Fix: Generate RSA keys and add them to .env. See security.md.
PostgreSQL connection refused on first request
Symptom:
Error: connect ECONNREFUSED 127.0.0.1:5432
Causes and fixes:
| Cause | Fix |
|---|---|
| PostgreSQL container not started | Run docker compose up -d postgres |
| PostgreSQL container not yet healthy | Run docker compose ps and wait for healthy |
| Wrong DATABASE_URL host/port | Check DATABASE_URL matches the PostgreSQL port (5432) |
| PostgreSQL container exited | Run docker compose logs postgres to see why it exited |
Redis connection error on first request
Symptom:
Redis client error Error: connect ECONNREFUSED 127.0.0.1:6379
Causes and fixes:
| Cause | Fix |
|---|---|
| Redis container not started | Run docker compose up -d redis |
| Redis container not yet healthy | Run docker compose ps and wait for healthy |
| Wrong REDIS_URL | Check REDIS_URL matches the Redis port (6379) |
Migration fails
Symptom:
Migration failed: Error: connect ECONNREFUSED 127.0.0.1:5432
Fix: PostgreSQL is not running or not reachable. Start it and verify health before running migrations.
Symptom:
Migration failed: Error: relation "agents" already exists
Fix: The migration was already partially applied. Check schema_migrations:
psql "$DATABASE_URL" -c "SELECT name FROM schema_migrations ORDER BY name;"
If a migration is listed there but the table is inconsistent, manually inspect and repair the database state before re-running.
All requests return 401 after key rotation
Symptom: Every API call returns 401 UNAUTHORIZED with Token signature is invalid.
Cause: JWT keys were rotated. All previously issued tokens were signed with the old private key and are now invalid.
Fix: Clients must re-authenticate using POST /token with their client_id and client_secret to obtain a new token signed with the new key. This is expected behaviour after key rotation.
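Clients can absorb a key rotation without manual intervention by re-authenticating once when they see a 401. The sketch below shows that retry pattern in isolation. Both callbacks are hypothetical and supplied by the caller: requestWithToken(token) performs the API call and resolves to an object with a status field, and fetchNewToken() is assumed to call POST /token with the client's client_id and client_secret.

```javascript
// Retries a request exactly once with a fresh token after a 401.
// Returns the final response together with the token that produced it.
async function callWithReauth(requestWithToken, fetchNewToken, currentToken) {
  let response = await requestWithToken(currentToken);
  if (response.status !== 401) return { response, token: currentToken };

  // The old token is no longer valid (for example, after key rotation).
  const freshToken = await fetchNewToken();
  response = await requestWithToken(freshToken);
  return { response, token: freshToken };
}
```

Retrying only once is deliberate: if the second attempt also returns 401, the problem is more likely bad credentials than a stale token, and looping would only add load.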
Rate limit hit unexpectedly — 429 responses
Symptom: API returns 429 RATE_LIMIT_EXCEEDED with X-RateLimit-Reset header.
Check current rate limit state:
# Find the current window key
WINDOW=$(node -e "console.log(Math.floor(Date.now() / 60000))")
# Check count for a specific client
docker compose exec redis redis-cli GET "rate:<client_id>:$WINDOW"
Fix: Wait until X-RateLimit-Reset (Unix timestamp in the response header) before retrying. The window resets every 60 seconds.
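A client can turn the header into a wait time before retrying. This is a minimal sketch that assumes X-RateLimit-Reset is a Unix timestamp in seconds (the text above says only "Unix timestamp"); msUntilReset is a hypothetical helper.

```javascript
// Returns how many milliseconds to wait before retrying, never negative.
// Assumes the header value is in seconds since the Unix epoch.
function msUntilReset(resetHeader, nowMs = Date.now()) {
  const resetSeconds = Number(resetHeader);
  if (!Number.isFinite(resetSeconds)) return 0; // missing or malformed header
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

If the header turns out to be in milliseconds in your deployment, drop the multiplication by 1000.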
Monitoring
AgentIdP exposes a Prometheus metrics endpoint at GET /metrics (unauthenticated, plain text).
Metrics Exposed
| Metric | Type | Labels | Description |
|---|---|---|---|
| agentidp_tokens_issued_total | Counter | scope | OAuth 2.0 tokens issued |
| agentidp_agents_registered_total | Counter | deployment_env | Agents registered |
| agentidp_http_requests_total | Counter | method, route, status_code | HTTP requests |
| agentidp_http_request_duration_seconds | Histogram | method, route, status_code | HTTP latency |
| agentidp_db_query_duration_seconds | Histogram | operation | PostgreSQL query duration |
| agentidp_redis_command_duration_seconds | Histogram | command | Redis command duration |
| agentidp_webhook_dead_letters_total | Counter | event_type | Webhook deliveries moved to dead-letter queue |
| agentidp_credentials_expiring_soon_total | Gauge | - | Credentials expiring within 7 days |
| agentidp_audit_chain_integrity | Gauge | - | 1 if audit chain is intact, 0 if broken |
| agentidp_rate_limit_hits_total | Counter | client_id | Rate limit rejections |
| agentidp_db_pool_active_connections | Gauge | - | Active PostgreSQL connections |
| agentidp_db_pool_waiting_requests | Gauge | - | Requests waiting for a pool connection |
| agentidp_tenant_api_calls_total | Counter | org_id, tier | API calls per tenant per tier |
| agentidp_billing_limit_rejections_total | Counter | org_id, limit_type | Tier limit enforcement rejections |
| agentidp_did_documents_generated_total | Counter | - | DID documents generated |
| agentidp_oidc_tokens_issued_total | Counter | - | OIDC ID tokens issued |
| agentidp_federation_events_total | Counter | event_type | Federation partner events |
| agentidp_delegation_chains_created_total | Counter | - | A2A delegation chains created |
| agentidp_compliance_reports_generated_total | Counter | - | Compliance reports generated |
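For a quick check without Prometheus or Grafana, the plain-text output of GET /metrics can be scanned directly, for instance to confirm that agentidp_audit_chain_integrity is 1. The sketch below is a deliberately simple reader for the Prometheus text format: readMetricValue is a hypothetical helper, it returns only the first matching sample, and it does not handle label values that contain a closing brace.

```javascript
// Returns the numeric value of the first sample named `name`, or null if absent.
function readMetricValue(metricsText, name) {
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#")) continue; // skip HELP and TYPE comment lines
    // Sample shape: metric_name{optional="labels"} value
    const m = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(line);
    if (m && m[1] === name) return Number(m[3]);
  }
  return null;
}
```

A script could fetch http://localhost:3000/metrics, pass the body to readMetricValue, and alert when the audit chain gauge is 0 or agentidp_db_pool_waiting_requests stays above zero.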
Starting the Monitoring Stack
# Start the full stack with monitoring
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin / agentidp)
The Grafana dashboard auto-provisions on first start. Navigate to Dashboards → AgentIdP → SentryAgent.ai — AgentIdP.
Security Note
GET /metrics is unauthenticated. In production, ensure this endpoint is:
- Only accessible from your internal network (firewall rule or reverse proxy restriction)
- Not exposed on a public-facing port
Tier limit rejected — 429 with tier_limit_exceeded code
Symptom: 429 TOO_MANY_REQUESTS with body {"code":"tier_limit_exceeded","message":"..."}
Check the tenant's current tier counter:
# Check API call counter
docker compose exec redis redis-cli GET "rate:tier:calls:<org_id>"
# Check the tenant's tier
psql "$DATABASE_URL" -c "SELECT org_id, tier FROM tenant_tiers WHERE org_id = '<org_id>';"
If the org is on the free tier and has hit 1,000 calls/day, upgrade the tier or wait until midnight UTC for the counter to reset.
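To tell a tenant how long they need to wait, the time remaining until the daily counters reset can be computed as shown below. This is a sketch based only on the "until midnight UTC" behaviour described in this document; secondsUntilUtcMidnight is a hypothetical helper.

```javascript
// Returns the number of whole seconds (rounded up) until the next 00:00 UTC.
function secondsUntilUtcMidnight(nowMs = Date.now()) {
  const now = new Date(nowMs);
  // Date.UTC normalises day overflow, so "day + 1" also works at month end.
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1
  );
  return Math.ceil((nextMidnight - nowMs) / 1000);
}
```

The result should roughly match what redis-cli TTL reports for the tenant's rate:tier:calls key, assuming the key's expiry is set to midnight UTC.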
Analytics endpoints return 404
Cause: ANALYTICS_ENABLED is set to false in .env.
Fix: Set ANALYTICS_ENABLED=true and restart the application.
Compliance report returns 404
Cause: COMPLIANCE_ENABLED is set to false in .env.
Fix: Set COMPLIANCE_ENABLED=true and restart the application.
Portal CORS error
Symptom: Browser console shows Access-Control-Allow-Origin error on requests to http://localhost:3000.
Fix: Ensure CORS_ORIGIN in .env includes http://localhost:3001:
CORS_ORIGIN=http://localhost:3001
Restart the application after changing this variable.