sentryagent-idp/docs/engineering/10-deployment.md
SentryAgent.ai Developer eced5f8699 docs: engineering knowledge base for new hires
Complete docs/engineering/ suite — 12 documents covering company overview,
system architecture, tech stack ADRs, codebase structure, service deep dives,
annotated code walkthroughs, dev setup, engineering workflow, testing strategy,
deployment/ops, SDK guide, and README index. All content verified against
source files. All 82 tasks in openspec/changes/engineering-docs/tasks.md
marked complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 12:38:42 +00:00


10 — Deployment and Operations

This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.


1. Docker Build and Run

The Dockerfile uses a two-stage build:

  • Stage 1 (builder): node:18-alpine — installs all dependencies (including dev) and compiles TypeScript to dist/.
  • Stage 2 (production): node:18-alpine — copies dist/ and node_modules (production only), runs as the built-in non-root node user.
# Build
docker build -t sentryagent-idp:latest .

# Run (supply required env vars)
docker run -d \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
  -e REDIS_URL=redis://<host>:6379 \
  -e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
  -e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
  sentryagent-idp:latest

The container exposes port 3000. To use a different port, set the PORT environment variable and update the -p mapping to match.

For local full-stack development, use Docker Compose instead:

docker compose up -d

The docker-compose.yml starts the app, PostgreSQL 14, and Redis 7 with health checks and data volumes.


2. Environment Variables Reference

All variables are loaded at startup via dotenv. In production, inject them directly into the process environment — do not commit .env to version control.

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| DATABASE_URL | Yes | | PostgreSQL connection string. Format: postgresql://<user>:<password>@<host>:<port>/<db> |
| REDIS_URL | Yes | | Redis connection URL. Format: redis://<host>:<port> |
| JWT_PRIVATE_KEY | Yes | | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| JWT_PUBLIC_KEY | Yes | | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| PORT | No | 3000 | HTTP port the Express server listens on |
| NODE_ENV | No | undefined | Set to production in production, test in test (disables Morgan logging in test) |
| CORS_ORIGIN | No | * | Allowed CORS origin(s). Set to specific URL in production (e.g. https://app.mycompany.ai) |
| VAULT_ADDR | No | | HashiCorp Vault server address. When set with VAULT_TOKEN, new credentials are stored in Vault KV v2 instead of bcrypt |
| VAULT_TOKEN | No | | Vault authentication token. Required when VAULT_ADDR is set |
| VAULT_MOUNT | No | secret | KV v2 secrets engine mount path |
| POLICY_DIR | No | <cwd>/policies | Directory containing OPA policy files (authz.wasm or data/scopes.json) |

Validation at startup: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY are checked in createApp() (see src/app.ts:117-121). If either is missing, the process exits before binding to any port. DATABASE_URL and REDIS_URL are validated when their respective singletons are first initialised.
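
To illustrate the fail-fast idea, here is a minimal sketch of a required-variable check. It is not the code in src/app.ts: the helper name findMissingEnv is hypothetical, and checking all four required variables in one place is an assumption made for the example (the application itself validates DATABASE_URL and REDIS_URL lazily, as noted above).

```typescript
// Hypothetical helper, not part of the AgentIdP codebase.
// Variable names come from the reference table above.
const REQUIRED_VARS = [
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_PRIVATE_KEY",
  "JWT_PUBLIC_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingEnv(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

A deployment smoke test could pass the process environment to findMissingEnv and refuse to continue if the returned list is non-empty.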


3. Database Migrations

Migrations are plain SQL files in src/db/migrations/. They are append-only — never modify an existing migration file. Always create a new numbered file.

Current migration files:

| File | What it creates |
| --- | --- |
| 001_create_agents.sql | agents table with UUID primary key, email unique constraint, status enum |
| 002_create_credentials.sql | credentials table linked to agents by client_id foreign key |
| 003_create_audit_events.sql | audit_events table with JSONB metadata column |
| 004_create_tokens.sql | token_monthly_counts table for free-tier token limit tracking |
| 005_add_vault_path.sql | Adds vault_path VARCHAR(512) column to the credentials table |

Run migrations:

npm run db:migrate

This executes scripts/migrate.ts, which applies every SQL file that has not yet been recorded in the schema_migrations tracking table.

Adding a new migration:

  1. Create src/db/migrations/006_<description>.sql
  2. Write idempotent SQL (use CREATE TABLE IF NOT EXISTS, ADD COLUMN IF NOT EXISTS, etc.)
  3. Run npm run db:migrate
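
The selection step of such a runner can be sketched as follows. This is an illustration only: pendingMigrations is a hypothetical function, and scripts/migrate.ts may implement the step differently. It assumes the set of applied file names has already been read from schema_migrations.

```typescript
// Hypothetical sketch of the selection step; scripts/migrate.ts may differ.
function pendingMigrations(allFiles: string[], applied: Set<string>): string[] {
  return allFiles
    .filter((file) => file.endsWith(".sql") && !applied.has(file))
    // Zero-padded prefixes (001_, 002_, ...) make lexicographic order
    // match numeric order.
    .sort();
}
```

Because ordering relies on the zero-padded numeric prefix, a new file should keep the three-digit convention (006_, 007_, and so on).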

4. Terraform Multi-Region Deployment

The terraform/ directory contains reusable modules and two environment configurations.

Directory structure:

terraform/
  modules/
    agentidp/        # Core AgentIdP compute resources
    lb/              # Load balancer (ALB/Cloud Load Balancer)
    rds/             # RDS PostgreSQL (AWS)
    redis/           # ElastiCache Redis (AWS) / Memorystore (GCP)
  environments/
    aws/             # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
    gcp/             # GCP deployment (Cloud Run, Cloud SQL, Memorystore)

AWS Deployment

Architecture: Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7

All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.

cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply

Resources provisioned:

  • VPC with public and private subnets across multiple availability zones
  • ECS Cluster and Fargate task definition (running sentryagent-idp container)
  • Application Load Balancer with HTTPS listener and health check target group
  • RDS PostgreSQL 14 (Multi-AZ for high availability)
  • ElastiCache Redis 7 (primary + replica)
  • IAM roles and instance profiles for ECS task permissions
  • Security groups enforcing least-privilege network access

GCP Deployment

Architecture: Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)

All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.

cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply

Resources provisioned:

  • VPC network with Serverless VPC Access connector (Cloud Run → private databases)
  • Cloud Run service (auto-scales to zero, Google-managed TLS)
  • Cloud Load Balancer with global anycast IP
  • Cloud SQL PostgreSQL 14 with regional high-availability
  • Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
  • IAM service accounts and bindings

Important: All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next terraform apply and will not be tracked in state.


5. Prometheus and Grafana

Metrics endpoint: GET /metrics. The endpoint is unauthenticated; in production, restrict it to the internal network or scrape it from within the cluster.

The metrics endpoint is served by the prom-client library using a dedicated registry (metricsRegistry) defined in src/metrics/registry.ts. The registry is isolated from the default global registry to prevent conflicts in tests.

Metric Definitions

All 6 metrics are defined in src/metrics/registry.ts:

| Metric name | Type | Labels | What it measures |
| --- | --- | --- | --- |
| agentidp_tokens_issued_total | Counter | scope | Total OAuth 2.0 access tokens issued successfully |
| agentidp_agents_registered_total | Counter | deployment_env | Total AI agents registered successfully |
| agentidp_http_requests_total | Counter | method, route, status_code | Total HTTP requests received |
| agentidp_http_request_duration_seconds | Histogram | method, route, status_code | HTTP request duration in seconds (buckets: 5 ms to 2.5 s) |
| agentidp_db_query_duration_seconds | Histogram | operation | PostgreSQL query duration in seconds |
| agentidp_redis_command_duration_seconds | Histogram | command | Redis command duration in seconds |

The HTTP metrics (agentidp_http_requests_total and agentidp_http_request_duration_seconds) are populated by metricsMiddleware in src/middleware/metrics.ts, which is registered before all routes in src/app.ts. Route labels are normalised to replace UUIDs with :id to prevent high cardinality (e.g. /api/v1/agents/:id rather than /api/v1/agents/a1b2c3...).
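
The route normalisation can be pictured with a short sketch. The function name normaliseRoute and the regular expression are assumptions for illustration; the actual logic in src/middleware/metrics.ts may differ.

```typescript
// Hypothetical sketch; the real middleware may normalise routes differently.
const UUID_PATTERN =
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

// Replaces every UUID in the path with ":id" so that each agent or
// credential does not produce its own label value.
function normaliseRoute(path: string): string {
  return path.replace(UUID_PATTERN, ":id");
}
```

Without this step, every distinct agent ID would create a new time series for both HTTP metrics.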

Local Grafana

docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

The monitoring compose overlay starts prom/prometheus:v2.53.0 and grafana/grafana:11.2.0. Grafana dashboards and datasource provisioning are loaded from monitoring/grafana/provisioning/.

Adding a New Metric

  1. Define the metric in src/metrics/registry.ts using the shared metricsRegistry (not the default prom-client registry).
  2. Export it from that file.
  3. Import it in the file where the instrumentation point lives.
  4. Call .inc(labels) for Counters or .observe(labels, value) for Histograms at the instrumentation point.
  5. Verify it appears in GET /metrics after starting the server.

6. Operational Runbook

Health Check

curl http://<host>/health

Expected response:

{"status":"ok","postgres":"connected","redis":"connected"}

Troubleshooting:

  • If postgres: "error" — verify DATABASE_URL is correct and PostgreSQL is reachable. Check docker compose logs postgres for local dev.
  • If redis: "error" — verify REDIS_URL is correct and Redis is reachable. Check docker compose logs redis for local dev.
  • If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
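
For automated checks, the first two troubleshooting rules can be expressed as a small function. This is a sketch under the assumption that the response always has the three fields shown in the expected response above; diagnoseHealth is a hypothetical helper, not part of the codebase.

```typescript
// Shape of the /health response shown above.
interface HealthResponse {
  status: string;
  postgres: string;
  redis: string;
}

// Hypothetical helper: maps a health response to troubleshooting hints.
function diagnoseHealth(health: HealthResponse): string[] {
  const hints: string[] = [];
  if (health.postgres !== "connected") {
    hints.push("postgres: verify DATABASE_URL and that PostgreSQL is reachable");
  }
  if (health.redis !== "connected") {
    hints.push("redis: verify REDIS_URL and that Redis is reachable");
  }
  return hints;
}
```

An empty result means both dependencies report as connected; a 502 or timeout never reaches this function and still points to a crashed process.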

Rotate the JWT Signing Key

All active tokens become invalid after a key rotation — agents must re-authenticate.

  1. Generate a new RSA-2048 key pair:
    openssl genrsa -out new-private.pem 2048
    openssl rsa -in new-private.pem -pubout -out new-public.pem
    
  2. Update JWT_PRIVATE_KEY and JWT_PUBLIC_KEY in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or .env).
  3. Perform a rolling restart:
    • ECS: trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
    • Cloud Run: deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
  4. Each instance rejects tokens signed with the old key as soon as it has restarted with the new key. Once all instances have restarted, no old token passes verification.
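
The effect of a rotation can be demonstrated with Node's built-in crypto module. This is an illustrative sketch, not the service's token code: it signs a placeholder string rather than a real JWT, and the helper names newKeyPair, sign, and verify are made up for the example.

```typescript
import { generateKeyPairSync, createSign, createVerify } from "node:crypto";

// Generates an RSA-2048 key pair as PEM strings, analogous to the
// openssl commands above.
function newKeyPair() {
  return generateKeyPairSync("rsa", {
    modulusLength: 2048,
    publicKeyEncoding: { type: "spki", format: "pem" },
    privateKeyEncoding: { type: "pkcs1", format: "pem" },
  });
}

// RS256 is RSASSA-PKCS1-v1_5 with SHA-256; "RSA-SHA256" signs the same way.
function sign(payload: string, privateKeyPem: string) {
  return createSign("RSA-SHA256").update(payload).sign(privateKeyPem);
}

function verify(payload: string, signature: Uint8Array, publicKeyPem: string): boolean {
  return createVerify("RSA-SHA256").update(payload).verify(publicKeyPem, signature);
}

const oldKeys = newKeyPair();
const rotatedKeys = newKeyPair();

// Placeholder for the "<header>.<payload>" part of a JWT signed before rotation.
const signedBeforeRotation = sign("header.payload", oldKeys.privateKey);

// The old public key accepts the signature; the rotated public key does not.
const acceptedByOldKey = verify("header.payload", signedBeforeRotation, oldKeys.publicKey);
const acceptedByRotatedKey = verify("header.payload", signedBeforeRotation, rotatedKeys.publicKey);
```

This is why agents must re-authenticate after the rolling restart: a signature produced by the retired private key cannot be verified with the new public key.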

Revoke All Tokens for a Compromised Agent

Suspend the agent to stop new token issuance immediately:

curl -X PATCH http://<host>/api/v1/agents/<agentId> \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}'

Once the agent is suspended, new POST /api/v1/token requests for it are rejected. Tokens already issued remain valid until they expire (TTL: 1 hour). To invalidate active tokens immediately, also revoke all credentials for the agent:

# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
  -H "Authorization: Bearer <admin_token>"

# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
  -H "Authorization: Bearer <admin_token>"
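
When an agent has many credentials, the list-then-revoke loop can be scripted. The sketch below only builds the DELETE URLs. The response shape is an assumption: the field names credentialId and status, and the value "active", are guesses for illustration and must be checked against the actual API response. The function revocationUrls is hypothetical.

```typescript
// ASSUMPTION: the list endpoint returns objects with these two fields.
interface CredentialSummary {
  credentialId: string;
  status: string; // assumed values include "active" and "revoked"
}

// Builds the DELETE URL for each credential that is still active.
function revocationUrls(
  baseUrl: string,
  agentId: string,
  credentials: CredentialSummary[],
): string[] {
  return credentials
    .filter((c) => c.status === "active")
    .map((c) => `${baseUrl}/api/v1/agents/${agentId}/credentials/${c.credentialId}`);
}
```

Each returned URL corresponds to one curl -X DELETE call like the one shown above, sent with the same admin bearer token.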

Read Audit Logs for an Incident

Query the audit log with date range and agent filter:

curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
  -H "Authorization: Bearer <admin_token>"

Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: eventId, agentId, action, outcome, ipAddress, userAgent, metadata, timestamp.

Common action values: token.issued, token.revoked, token.introspected, agent.created, agent.updated, agent.suspended, agent.decommissioned, credential.generated, credential.rotated, credential.revoked, auth.failed.
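
During an incident it often helps to summarise events by action before reading them one by one. The following sketch assumes the events have already been parsed into an array; countByAction is a hypothetical helper, and the exact envelope of the API response is not specified here.

```typescript
// Only the field used here is declared; real events also carry eventId,
// agentId, outcome, ipAddress, userAgent, metadata, and timestamp.
interface AuditEvent {
  action: string;
}

// Counts events per action value, for example to spot a burst of
// auth.failed entries.
function countByAction(events: AuditEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    counts.set(event.action, (counts.get(event.action) ?? 0) + 1);
  }
  return counts;
}
```

A high auth.failed count next to a small number of token.issued events is a reasonable starting point for deciding whether to suspend the agent.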