10 — Deployment and Operations
This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
1. Docker Build and Run
The Dockerfile uses a two-stage build:
- Stage 1 (build): `node:20.11-bookworm-slim` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- Stage 2 (final): `node:20.11-bookworm-slim` — copies `dist/` and `node_modules` (production only), installs `curl` for the healthcheck, and runs as the created non-root `nodeapp` user (UID 1001).
# Build
docker build -t sentryagent-idp:1.0.0 .
# Run (supply required env vars)
docker run -d \
-p 3000:3000 \
-e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
-e REDIS_URL=redis://<host>:6379 \
-e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
-e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
sentryagent-idp:1.0.0
The container exposes port 3000; override with the `PORT` environment variable if needed. The container runs as the non-root user `nodeapp` (UID 1001) — do not mount volumes that require root ownership.
For local full-stack development, use Docker Compose instead:
docker compose up --build -d
The compose.yaml starts the app, PostgreSQL 14.12, and Redis 7.2 with health checks, resource limits, restart policies, and data volumes — per DockerSpec standards.
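For reference, a `compose.yaml` along these lines would satisfy the description above. This is an illustrative sketch, not the repo's actual file — service names, health-check commands, and the `POSTGRES_*` variable names are assumptions, and resource limits are omitted for brevity:

```yaml
services:
  app:
    build: .
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    restart: unless-stopped
  postgres:
    image: postgres:14.12-alpine3.19
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
    volumes: ["postgres-data:/var/lib/postgresql/data"]
    restart: unless-stopped
  redis:
    image: redis:7.2-alpine3.19
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    volumes: ["redis-data:/data"]
    restart: unless-stopped
volumes:
  postgres-data:
  redis-data:
```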
2. Environment Variables Reference
All variables are loaded at startup via dotenv. In production, inject them directly into the process environment — do not commit .env to version control.
| Variable | Required | Default | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | undefined | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to a specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set with `VAULT_TOKEN`, new credentials are stored in Vault KV v2 instead of bcrypt |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
Validation at startup: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY are checked in createApp() (see src/app.ts:117–121). If missing, the process exits before binding to any port. DATABASE_URL and REDIS_URL are validated when their respective singletons are first initialised.
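Putting the required variables together, a minimal `.env` for local development might look like the following — values are illustrative only (taken from the docker run example above); never commit real secrets:

```
DATABASE_URL=postgresql://sentryagent:sentryagent@localhost:5432/sentryagent_idp
REDIS_URL=redis://localhost:6379
JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..."
JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..."
PORT=3000
CORS_ORIGIN=https://app.mycompany.ai
```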
3. Database Migrations
Migrations are plain SQL files in src/db/migrations/. They are append-only — never modify an existing migration file. Always create a new numbered file.
Current migration files:
| File | What it creates |
|---|---|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
Run migrations:
npm run db:migrate
This executes scripts/migrate.ts which applies all SQL files that have not yet been recorded in the schema_migrations tracking table.
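The tracking table itself is created by the runner rather than by a migration file; a plausible shape for it — an assumption, since `scripts/migrate.ts` is not reproduced here — is one row per applied file:

```sql
-- Hypothetical sketch of the schema_migrations tracking table
CREATE TABLE IF NOT EXISTS schema_migrations (
  filename   VARCHAR(255) PRIMARY KEY,         -- e.g. 001_create_agents.sql
  applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```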
Adding a new migration:
- Create `src/db/migrations/006_<description>.sql`
- Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
- Run `npm run db:migrate`
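Putting the steps together, a hypothetical `006_add_agent_description.sql` (file name and column invented for illustration) written idempotently would look like:

```sql
-- src/db/migrations/006_add_agent_description.sql (hypothetical example)
ALTER TABLE agents
  ADD COLUMN IF NOT EXISTS description VARCHAR(1024);
```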
4. Terraform Multi-Region Deployment
The terraform/ directory contains reusable modules and two environment configurations.
Directory structure:
terraform/
modules/
agentidp/ # Core AgentIdP compute resources
lb/ # Load balancer (ALB/Cloud Load Balancer)
rds/ # RDS PostgreSQL (AWS)
redis/ # ElastiCache Redis (AWS) / Memorystore (GCP)
environments/
aws/ # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
gcp/ # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
AWS Deployment
Architecture: Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7
All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
Resources provisioned:
- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running the `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access
GCP Deployment
Architecture: Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)
All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
Resources provisioned:
- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with global anycast IP
- Cloud SQL PostgreSQL 14 with regional high-availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings
Important: All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next terraform apply and will not be tracked in state.
5. Prometheus and Grafana
Metrics endpoint: GET /metrics (unauthenticated — restrict in production to internal network or scrape from within the cluster)
The metrics endpoint is served by the prom-client library using a dedicated registry (metricsRegistry) defined in src/metrics/registry.ts. The registry is isolated from the default global registry to prevent conflicts in tests.
Metric Definitions
All 6 metrics are defined in src/metrics/registry.ts:
| Metric name | Type | Labels | What it measures |
|---|---|---|---|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5 ms–2.5 s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
The HTTP metrics (agentidp_http_requests_total and agentidp_http_request_duration_seconds) are populated by metricsMiddleware in src/middleware/metrics.ts, which is registered before all routes in src/app.ts. Route labels are normalised to replace UUIDs with :id to prevent high cardinality (e.g. /api/v1/agents/:id rather than /api/v1/agents/a1b2c3...).
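The normalisation step can be approximated with a regex substitution over path segments. The sketch below reproduces the idea in shell with `sed` — the actual TypeScript implementation in `src/middleware/metrics.ts` may differ in detail:

```shell
# Replace any UUID-shaped path segment (8-4-4-4-12 hex) with :id
# so the route label stays low-cardinality.
route="/api/v1/agents/a1b2c3d4-e5f6-4a7b-8c9d-ef1234567890/credentials"
normalised=$(printf '%s' "$route" | sed -E \
  's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/:id/g')
printf '%s\n' "$normalised"   # /api/v1/agents/:id/credentials
```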
Local Grafana
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: the `GF_ADMIN_PASSWORD` value from `.env`)
The monitoring compose overlay starts prom/prometheus:v2.53.0 and grafana/grafana:11.2.0. Grafana dashboards and datasource provisioning are loaded from monitoring/grafana/provisioning/.
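For Prometheus to collect the metrics above, its configuration needs a scrape job pointing at the app's `/metrics` endpoint. A sketch of such a `prometheus.yml` — the job name, target, and interval are assumptions, not the repo's actual config:

```yaml
scrape_configs:
  - job_name: agentidp
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["app:3000"]   # Compose service name + app port
```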
Adding a New Metric
- Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
- Export it from that file.
- Import it in the file where the instrumentation point lives.
- Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
- Verify it appears in `GET /metrics` after starting the server.
6. Operational Runbook
Health Check
curl http://<host>/health
Expected response:
{"status":"ok","postgres":"connected","redis":"connected"}
Troubleshooting:
- If `postgres: "error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis: "error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
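A simple liveness probe can key off the `status` field of that response. The sketch below checks a captured sample with `grep` — in practice you would substitute the output of `curl http://<host>/health` for the hard-coded string:

```shell
# Sample response captured from the health endpoint (stand-in for a live curl call)
resp='{"status":"ok","postgres":"connected","redis":"connected"}'
if printf '%s' "$resp" | grep -q '"status":"ok"'; then
  echo healthy
else
  echo unhealthy
fi
```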
Rotate the JWT Signing Key
All active tokens become invalid after a key rotation — agents must re-authenticate.
- Generate a new RSA-2048 key pair:
  openssl genrsa -out new-private.pem 2048
  openssl rsa -in new-private.pem -pubout -out new-public.pem
- Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
- Perform a rolling restart:
  - ECS: trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
  - Cloud Run: deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
- Tokens signed with the old key will fail verification immediately after all instances have restarted.
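When the new keys go into a `.env` file rather than a secrets manager, the multi-line PEM must be collapsed to a single line with literal `\n` escapes. A sketch using `awk` — the sample file below is a stand-in for `new-private.pem`:

```shell
# Create a sample PEM-style file for illustration (stand-in for new-private.pem)
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' 'MIIBexample' '-----END RSA PRIVATE KEY-----' > sample.pem
# Collapse to one line, joining lines with literal \n escape sequences
awk 'NR>1{printf "\\n"} {printf "%s", $0}' sample.pem
echo
```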
Revoke All Tokens for a Compromised Agent
Suspend the agent to stop new token issuance immediately:
curl -X PATCH http://<host>/api/v1/agents/<agentId> \
-H "Authorization: Bearer <admin_token>" \
-H "Content-Type: application/json" \
-d '{"status": "suspended"}'
This prevents any new POST /api/v1/token requests for that agent. Active tokens remain valid until their TTL (1 hour). To invalidate active tokens immediately, also revoke all credentials for the agent:
# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
-H "Authorization: Bearer <admin_token>"
# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
-H "Authorization: Bearer <admin_token>"
Read Audit Logs for an Incident
Query the audit log with date range and agent filter:
curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
-H "Authorization: Bearer <admin_token>"
Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: eventId, agentId, action, outcome, ipAddress, userAgent, metadata, timestamp.
Common action values: token.issued, token.revoked, token.introspected, agent.created, agent.updated, agent.suspended, agent.decommissioned, credential.generated, credential.rotated, credential.revoked, auth.failed.