
Operations

Startup, shutdown, log interpretation, and troubleshooting for AgentIdP.


Startup Order

Always start services in this order. Starting the application before PostgreSQL or Redis is ready will cause connection errors on first request.

1. PostgreSQL   (must be healthy)
2. Redis        (must be healthy)
3. Migrations   (must complete successfully)
4. Application  (start last)

Startup checklist

# 1. Start the full stack
docker compose up --build -d

# 2. Verify all three services are healthy
docker compose ps
# app, postgres, and redis must all show "healthy"

# 3. Run migrations
docker compose exec app npm run db:migrate

# 4. Verify application health
curl http://localhost:3000/health
# Expected: {"status":"ok"}

# 5. (Optional) Start the portal for local dev
cd portal && npm run dev
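
Since `docker compose up -d` returns before the containers report healthy, the health check in step 4 can fail if it runs too early. A minimal sketch of a retry helper, assuming a POSIX shell; `wait_for` is a hypothetical name used here for illustration and is not part of the AgentIdP tooling:

```bash
# wait_for <attempts> <delay_seconds> <command...>
# Re-runs the command until it succeeds or the attempts are exhausted.
# Hypothetical helper for illustration; not shipped with AgentIdP.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0     # command succeeded
    [ "$n" -ge "$attempts" ] && return 1 # gave up after the last attempt
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: poll the health endpoint up to 30 times, 2 seconds apart.
# wait_for 30 2 curl -fsS http://localhost:3000/health
```

The helper only reports success or failure; the command's own output is discarded, so run the `curl` once more by hand if you want to see the response body.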

Graceful Shutdown

The application handles SIGTERM and SIGINT gracefully:

  1. Stops accepting new connections
  2. Waits for in-flight requests to complete
  3. Exits with code 0

Sending SIGTERM

# Find the PID (pgrep does not match itself, unlike ps | grep)
pgrep -f "node.*server"

# Send SIGTERM
kill -SIGTERM <pid>

Expected log output:

Shutting down gracefully...

The process exits cleanly. Requests that were already in flight are completed before exit.
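
When a shutdown is scripted, it can be useful to confirm that the process actually exited after SIGTERM. A rough sketch, assuming a POSIX shell; the function name `stop_gracefully` and its return codes are illustrative choices, not part of AgentIdP:

```bash
# stop_gracefully <pid> [timeout_seconds]
# Sends SIGTERM, then polls once per second until the process is gone.
# Returns 0 if it exited, 1 if the signal could not be sent,
# 2 if it was still running when the timeout elapsed.
stop_gracefully() {
  target_pid="$1"
  limit="${2:-10}"
  kill -TERM "$target_pid" 2>/dev/null || return 1
  waited=0
  while kill -0 "$target_pid" 2>/dev/null; do   # kill -0 only tests existence
    [ "$waited" -ge "$limit" ] && return 2
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example: stop_gracefully "$(pgrep -f 'node.*server')" 10
```

A return code of 2 means the process is still draining; whether to wait longer or escalate is left to the operator.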

Docker stop

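docker stop sends SIGTERM and, by default, waits 10 seconds before sending SIGKILL. This is sufficient for graceful shutdown as long as in-flight requests finish within that window; pass -t <seconds> to allow more time.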

docker stop sentryagent-idp-app-1

Log Reference

AgentIdP logs to stdout. In development (NODE_ENV=development), Morgan HTTP request logs are included. In test (NODE_ENV=test), Morgan is suppressed.

Startup logs

| Log line | Meaning |
|---|---|
| `SentryAgent.ai AgentIdP listening on port 3000` | Server bound successfully; ready to accept requests |
| `Shutting down gracefully...` | SIGTERM/SIGINT received; draining connections |

Error logs

| Log line | Meaning |
|---|---|
| `Failed to start server: Error: DATABASE_URL environment variable is required` | DATABASE_URL is not set in the environment |
| `Failed to start server: Error: REDIS_URL environment variable is required` | REDIS_URL is not set |
| `Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required` | One or both JWT keys are missing |
| `Unexpected pg pool error <err>` | PostgreSQL connection dropped after startup; check DB availability |
| `Redis client error <err>` | Redis connection error after startup; check Redis availability |

Morgan HTTP request format (development)

::1 - - [28/Mar/2026:09:01:00 +0000] "POST /api/v1/token HTTP/1.1" 200 312 "-" "curl/7.88.1"

Format: <ip> - - [<timestamp>] "<method> <path> <protocol>" <status> <bytes> "<referrer>" "<user-agent>"
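
For quick triage it can help to reduce each request log line to its method, path, and status. A small sketch using awk, assuming log lines in exactly the format shown above; `morgan_summary` is a hypothetical helper name:

```bash
# morgan_summary: read request log lines on stdin and print
# "<method> <path> <status>" for each one. Lines without the
# expected three quoted sections are skipped.
morgan_summary() {
  awk -F'"' 'NF >= 7 {
    split($2, req, " ")   # $2 is the quoted request: method path protocol
    split($3, res, " ")   # $3 is " <status> <bytes> " between the quotes
    print req[1], req[2], res[1]
  }'
}

# Example: show only 5xx responses from the app container logs.
# docker compose logs app | morgan_summary | awk '$3 >= 500'
```

Splitting on the double quote keeps the parsing independent of any prefix that the log collector adds in front of the IP address.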


Redis Key Patterns

Six key patterns are used in Redis. They are useful for debugging and manual inspection.

# Connect to Redis CLI
docker compose exec redis redis-cli

| Key pattern | Example | Purpose | TTL |
|---|---|---|---|
| `revoked:<jti>` | `revoked:f1e2d3c4-...` | Revoked token JTI | Remaining token lifetime |
| `rate:<client_id>:<window>` | `rate:a1b2c3...:29086156` | Request count per window | RATE_LIMIT_WINDOW_MS |
| `monthly:<client_id>:<year>:<month>` | `monthly:a1b2c3...:2026:3` | Monthly token issuance count | End of month |
| `rate:tier:calls:<tenantId>` | `rate:tier:calls:org-uuid` | Daily API call counter for tier enforcement | Until midnight UTC |
| `rate:tier:tokens:<tenantId>` | `rate:tier:tokens:org-uuid` | Daily token issuance counter for tier enforcement | Until midnight UTC |
| `compliance:report:<tenantId>` | `compliance:report:org-uuid` | Cached compliance report JSON | 5 minutes |

Inspect keys:

# List all revoked tokens (SCAN-based, so it does not block Redis the way KEYS can)
redis-cli --scan --pattern "revoked:*"

# Check rate limit counter for a specific client
redis-cli GET "rate:<client_id>:<window_key>"

# Check monthly token count for a specific client
redis-cli GET "monthly:<client_id>:2026:3"

# Check tier API call counter for a tenant
redis-cli GET "rate:tier:calls:<org_id>"

# Check tier token counter for a tenant
redis-cli GET "rate:tier:tokens:<org_id>"

# Check cached compliance report for a tenant
redis-cli GET "compliance:report:<org_id>"
redis-cli TTL "compliance:report:<org_id>"

Where <window_key> is floor(unix_ms / 60000), the index of the 60-second window since the Unix epoch. For the current window:

node -e "console.log(Math.floor(Date.now() / 60000))"
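
The same calculation can be done without Node. A minimal sketch in POSIX shell, assuming the 60-second window described here; `window_key` and `rate_key` are hypothetical helper names:

```bash
# window_key [unix_ms]: index of the 60-second window containing unix_ms.
# Defaults to the current time (second precision is enough for this).
window_key() {
  ms="$1"
  [ -n "$ms" ] || ms=$(( $(date +%s) * 1000 ))
  echo $(( ms / 60000 ))   # integer division is the floor for positive values
}

# rate_key <client_id> [unix_ms]: full Redis key for that client and window.
rate_key() {
  echo "rate:$1:$(window_key "$2")"
}

# Example: redis-cli GET "$(rate_key "<client_id>")"
```

Passing an explicit timestamp makes it possible to look up the key for a past window, for example one taken from a request log line.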

Troubleshooting

Application fails to start — missing environment variable

Symptom:

Failed to start server: Error: DATABASE_URL environment variable is required

Fix: Ensure your .env file exists in the project root and contains all required variables. Verify:

grep -E "^(DATABASE_URL|REDIS_URL|JWT_PRIVATE_KEY|JWT_PUBLIC_KEY)=" .env
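
The grep above prints the variables that are present; when one is missing it is easier to list what is absent instead. A small sketch, assuming the four variable names from the error logs are the full required set; `missing_env_vars` is a hypothetical helper name:

```bash
# missing_env_vars [env_file]: print each required variable that has no
# "NAME=" line in the given file (default: .env in the current directory).
missing_env_vars() {
  env_file="${1:-.env}"
  for name in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    grep -q "^${name}=" "$env_file" 2>/dev/null || echo "$name"
  done
}

# Example: missing_env_vars .env
```

Empty output means every variable has an assignment line. The check does not validate the values themselves.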

Application fails to start — JWT key error

Symptom:

Failed to start server: Error: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY environment variables are required

Fix: Generate RSA keys and add them to .env. See security.md.


PostgreSQL connection refused on first request

Symptom:

Error: connect ECONNREFUSED 127.0.0.1:5432

Causes and fixes:

| Cause | Fix |
|---|---|
| PostgreSQL container not started | Run `docker compose up -d postgres` |
| PostgreSQL container not yet healthy | Run `docker compose ps` and wait until the container shows healthy |
| Wrong DATABASE_URL host/port | Check that DATABASE_URL matches the PostgreSQL port (5432) |
| PostgreSQL container exited | Run `docker compose logs postgres` to see why it exited |

Redis connection error on first request

Symptom:

Redis client error Error: connect ECONNREFUSED 127.0.0.1:6379

Causes and fixes:

| Cause | Fix |
|---|---|
| Redis container not started | Run `docker compose up -d redis` |
| Redis container not yet healthy | Run `docker compose ps` and wait until the container shows healthy |
| Wrong REDIS_URL | Check that REDIS_URL matches the Redis port (6379) |

Migration fails

Symptom:

Migration failed: Error: connect ECONNREFUSED 127.0.0.1:5432

Fix: PostgreSQL is not running or not reachable. Start it and verify health before running migrations.

Symptom:

Migration failed: Error: relation "agents" already exists

Fix: The migration has already been partially applied. Check schema_migrations:

psql "$DATABASE_URL" -c "SELECT name FROM schema_migrations ORDER BY name;"

If a migration is listed there but the table is inconsistent, manually inspect and repair the database state before re-running.


All requests return 401 after key rotation

Symptom: Every API call returns 401 UNAUTHORIZED with the message "Token signature is invalid".

Cause: JWT keys were rotated. All previously issued tokens were signed with the old private key and are now invalid.

Fix: Clients must re-authenticate using POST /api/v1/token with their client_id and client_secret to obtain a new token signed with the new key. This is expected behaviour after key rotation.


Rate limit hit unexpectedly — 429 responses

Symptom: The API returns 429 RATE_LIMIT_EXCEEDED with an X-RateLimit-Reset header.

Check current rate limit state:

# Find the current window key
WINDOW=$(node -e "console.log(Math.floor(Date.now() / 60000))")
# Check count for a specific client
docker compose exec redis redis-cli GET "rate:<client_id>:$WINDOW"

Fix: Wait until X-RateLimit-Reset (Unix timestamp in the response header) before retrying. The window resets every 60 seconds.
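
A client script can turn the header value into a wait time. A minimal sketch, assuming X-RateLimit-Reset is expressed in seconds since the Unix epoch (the text above only says "Unix timestamp"); `seconds_until_reset` is a hypothetical helper name:

```bash
# seconds_until_reset <reset_epoch_seconds> [now_epoch_seconds]
# Prints how many seconds remain until the reset time, never less than 0.
# Assumption: the header value is in seconds, not milliseconds.
seconds_until_reset() {
  reset="$1"
  now="$2"
  [ -n "$now" ] || now=$(date +%s)
  wait_s=$(( reset - now ))
  [ "$wait_s" -gt 0 ] || wait_s=0   # reset time already passed: no wait
  echo "$wait_s"
}

# Example: sleep "$(seconds_until_reset "$RESET_HEADER_VALUE")"
```

With a 60-second window the result should never exceed 60; a much larger number suggests the header is in milliseconds and needs dividing by 1000 first.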


Monitoring

AgentIdP exposes a Prometheus metrics endpoint at GET /metrics (unauthenticated, plain text).

Metrics Exposed

| Metric | Type | Labels | Description |
|---|---|---|---|
| agentidp_tokens_issued_total | Counter | scope | OAuth 2.0 tokens issued |
| agentidp_agents_registered_total | Counter | deployment_env | Agents registered |
| agentidp_http_requests_total | Counter | method, route, status_code | HTTP requests |
| agentidp_http_request_duration_seconds | Histogram | method, route, status_code | HTTP latency |
| agentidp_db_query_duration_seconds | Histogram | operation | PostgreSQL query duration |
| agentidp_redis_command_duration_seconds | Histogram | command | Redis command duration |
| agentidp_webhook_dead_letters_total | Counter | event_type | Webhook deliveries moved to dead-letter queue |
| agentidp_credentials_expiring_soon_total | Gauge | | Credentials expiring within 7 days |
| agentidp_audit_chain_integrity | Gauge | | 1 if audit chain is intact, 0 if broken |
| agentidp_rate_limit_hits_total | Counter | client_id | Rate limit rejections |
| agentidp_db_pool_active_connections | Gauge | | Active PostgreSQL connections |
| agentidp_db_pool_waiting_requests | Gauge | | Requests waiting for a pool connection |
| agentidp_tenant_api_calls_total | Counter | org_id, tier | API calls per tenant per tier |
| agentidp_billing_limit_rejections_total | Counter | org_id, limit_type | Tier limit enforcement rejections |
| agentidp_did_documents_generated_total | Counter | | DID documents generated |
| agentidp_oidc_tokens_issued_total | Counter | | OIDC ID tokens issued |
| agentidp_federation_events_total | Counter | event_type | Federation partner events |
| agentidp_delegation_chains_created_total | Counter | | A2A delegation chains created |
| agentidp_compliance_reports_generated_total | Counter | | Compliance reports generated |
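
Without Prometheus running, a counter can still be read straight from the endpoint. A rough sketch that sums one metric across all of its label sets, assuming the standard Prometheus text format with no trailing timestamps; `metric_total` is a hypothetical helper name:

```bash
# metric_total <metric_name>: read Prometheus text format on stdin and
# print the sum of all samples for that metric, across label sets.
metric_total() {
  awk -v m="$1" '
    # Match the bare name, or the name followed by a label block.
    $1 == m || index($1, m "{") == 1 { sum += $NF }
    END { print sum + 0 }
  '
}

# Example: total rate limit rejections across all clients.
# curl -s http://localhost:3000/metrics | metric_total agentidp_rate_limit_hits_total
```

This is suited to counters and gauges; for histograms the `_bucket`, `_sum`, and `_count` series have to be queried by their full names.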

Starting the Monitoring Stack

# Start the full stack with monitoring
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# Prometheus: http://localhost:9090
# Grafana:    http://localhost:3001  (admin / agentidp)

The Grafana dashboard auto-provisions on first start. Navigate to Dashboards → AgentIdP → SentryAgent.ai — AgentIdP.

Security Note

GET /metrics is unauthenticated. In production, ensure this endpoint is:

  • Only accessible from your internal network (firewall rule or reverse proxy restriction)
  • Not exposed on a public-facing port

Tier limit rejected — 429 with tier_limit_exceeded code

Symptom: 429 TOO_MANY_REQUESTS with body {"code":"tier_limit_exceeded","message":"..."}

Check the tenant's current tier counter:

# Check API call counter
docker compose exec redis redis-cli GET "rate:tier:calls:<org_id>"

# Check the tenant's tier
psql "$DATABASE_URL" -c "SELECT org_id, tier FROM tenant_tiers WHERE org_id = '<org_id>';"

If the org is on the free tier and has hit 1,000 calls/day, upgrade the tier or wait until midnight UTC for the counter to reset.
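
To tell a tenant how long the wait is, the time remaining until the midnight UTC reset can be computed from the epoch clock. A minimal sketch in POSIX shell; `seconds_until_utc_midnight` is a hypothetical helper name:

```bash
# seconds_until_utc_midnight [now_epoch_seconds]
# Unix time counts seconds from midnight UTC, so the remainder modulo
# 86400 is the number of seconds already elapsed in the current UTC day.
seconds_until_utc_midnight() {
  now="$1"
  [ -n "$now" ] || now=$(date +%s)
  echo $(( 86400 - now % 86400 ))
}

# Example: echo "Counter resets in $(( $(seconds_until_utc_midnight) / 60 )) minutes"
```

At exactly midnight the function prints 86400, which is the full length of the day that has just started.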


Analytics endpoints return 404

Cause: ANALYTICS_ENABLED is set to false in .env.

Fix: Set ANALYTICS_ENABLED=true and restart the application.


Compliance report returns 404

Cause: COMPLIANCE_ENABLED is set to false in .env.

Fix: Set COMPLIANCE_ENABLED=true and restart the application.


Portal CORS error

Symptom: The browser console shows an Access-Control-Allow-Origin error on requests to http://localhost:3000.

Fix: Ensure CORS_ORIGIN in .env includes http://localhost:3001:

CORS_ORIGIN=http://localhost:3001

Restart the application after changing this variable.