10 — Deployment and Operations
This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
1. Docker Build and Run
The Dockerfile uses a two-stage build:
- Stage 1 (build): `node:20.11-bookworm-slim` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- Stage 2 (final): `node:20.11-bookworm-slim` — copies `dist/` and `node_modules` (production only), installs `curl` for the healthcheck, and runs as the created non-root `nodeapp` user (UID 1001).
# Build
docker build -t sentryagent-idp:1.0.0 .
# Run (supply required env vars)
docker run -d \
-p 3000:3000 \
-e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
-e REDIS_URL=redis://<host>:6379 \
-e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
-e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
sentryagent-idp:1.0.0
The container exposes port 3000; override with the `PORT` environment variable if needed. The container runs as the non-root user `nodeapp` (UID 1001) — do not mount volumes that require root ownership.
For local full-stack development, use Docker Compose instead:
docker compose up --build -d
The compose.yaml starts the app, PostgreSQL 14.12, and Redis 7.2 with health checks, resource limits, restart policies, and data volumes — per DockerSpec standards.
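For reference, a `compose.yaml` along these lines would satisfy the description above. This is an illustrative sketch, not the repo's actual file — service names, health-check commands, and the `POSTGRES_*` variable names are assumptions, and resource limits are omitted for brevity:

```yaml
services:
  app:
    build: .
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    restart: unless-stopped
  postgres:
    image: postgres:14.12-alpine3.19
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
    volumes: ["postgres-data:/var/lib/postgresql/data"]
    restart: unless-stopped
  redis:
    image: redis:7.2-alpine3.19
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    volumes: ["redis-data:/data"]
    restart: unless-stopped
volumes:
  postgres-data:
  redis-data:
```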
2. Environment Variables Reference
All variables are loaded at startup via dotenv. In production, inject them directly into the process environment — do not commit .env to version control.
| Variable | Required | Default | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | undefined | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to a specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set with `VAULT_TOKEN`, new credentials are stored in Vault KV v2 instead of bcrypt |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
Validation at startup: JWT_PRIVATE_KEY and JWT_PUBLIC_KEY are checked in createApp() (see src/app.ts:117–121). If missing, the process exits before binding to any port. DATABASE_URL and REDIS_URL are validated when their respective singletons are first initialised.
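Putting the required variables together, a minimal `.env` for local development might look like the following — values are illustrative only (taken from the docker run example above); never commit real secrets:

```
DATABASE_URL=postgresql://sentryagent:sentryagent@localhost:5432/sentryagent_idp
REDIS_URL=redis://localhost:6379
JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..."
JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..."
PORT=3000
CORS_ORIGIN=https://app.mycompany.ai
```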
3. Database Migrations
Migrations are plain SQL files in src/db/migrations/. They are append-only — never modify an existing migration file. Always create a new numbered file.
Current migration files:
| File | What it creates |
|---|---|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
Run migrations:
npm run db:migrate
This executes scripts/migrate.ts which applies all SQL files that have not yet been recorded in the schema_migrations tracking table.
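The tracking table itself is created by the runner rather than by a migration file; a plausible shape for it — an assumption, since `scripts/migrate.ts` is not reproduced here — is one row per applied file:

```sql
-- Hypothetical sketch of the schema_migrations tracking table
CREATE TABLE IF NOT EXISTS schema_migrations (
  filename   VARCHAR(255) PRIMARY KEY,         -- e.g. 001_create_agents.sql
  applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```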
Adding a new migration:
- Create `src/db/migrations/006_<description>.sql`
- Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
- Run `npm run db:migrate`
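Putting the steps together, a hypothetical `006_add_agent_description.sql` (file name and column invented for illustration) written idempotently would look like:

```sql
-- src/db/migrations/006_add_agent_description.sql (hypothetical example)
ALTER TABLE agents
  ADD COLUMN IF NOT EXISTS description VARCHAR(1024);
```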
4. Terraform Multi-Region Deployment
The terraform/ directory contains reusable modules and two environment configurations.
Directory structure:
terraform/
modules/
agentidp/ # Core AgentIdP compute resources
lb/ # Load balancer (ALB/Cloud Load Balancer)
rds/ # RDS PostgreSQL (AWS)
redis/ # ElastiCache Redis (AWS) / Memorystore (GCP)
environments/
aws/ # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
gcp/ # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
AWS Deployment
Architecture: Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7
All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
Resources provisioned:
- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running the `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access
GCP Deployment
Architecture: Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)
All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
Resources provisioned:
- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with global anycast IP
- Cloud SQL PostgreSQL 14 with regional high-availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings
Important: All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next terraform apply and will not be tracked in state.
5. Prometheus and Grafana
Metrics endpoint: GET /metrics (unauthenticated — restrict in production to internal network or scrape from within the cluster)
The metrics endpoint is served by the prom-client library using a dedicated registry (metricsRegistry) defined in src/metrics/registry.ts. The registry is isolated from the default global registry to prevent conflicts in tests.
Metric Definitions
All 6 metrics are defined in src/metrics/registry.ts:
| Metric name | Type | Labels | What it measures |
|---|---|---|---|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5 ms–2.5 s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
The HTTP metrics (agentidp_http_requests_total and agentidp_http_request_duration_seconds) are populated by metricsMiddleware in src/middleware/metrics.ts, which is registered before all routes in src/app.ts. Route labels are normalised to replace UUIDs with :id to prevent high cardinality (e.g. /api/v1/agents/:id rather than /api/v1/agents/a1b2c3...).
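The normalisation step can be approximated with a regex substitution over path segments. The sketch below reproduces the idea in shell with `sed` — the actual TypeScript implementation in `src/middleware/metrics.ts` may differ in detail:

```shell
# Replace any UUID-shaped path segment (8-4-4-4-12 hex) with :id
# so the route label stays low-cardinality.
route="/api/v1/agents/a1b2c3d4-e5f6-4a7b-8c9d-ef1234567890/credentials"
normalised=$(printf '%s' "$route" | sed -E \
  's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/:id/g')
printf '%s\n' "$normalised"   # /api/v1/agents/:id/credentials
```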
Local Grafana
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: the `GF_ADMIN_PASSWORD` value from `.env`)
The monitoring compose overlay starts prom/prometheus:v2.53.0 and grafana/grafana:11.2.0. Grafana dashboards and datasource provisioning are loaded from monitoring/grafana/provisioning/.
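For Prometheus to collect the metrics above, its configuration needs a scrape job pointing at the app's `/metrics` endpoint. A sketch of such a `prometheus.yml` — the job name, target, and interval are assumptions, not the repo's actual config:

```yaml
scrape_configs:
  - job_name: agentidp
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["app:3000"]   # Compose service name + app port
```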
Adding a New Metric
- Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
- Export it from that file.
- Import it in the file where the instrumentation point lives.
- Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
- Verify it appears in `GET /metrics` after starting the server.
6. Operational Runbook
Health Check
curl http://<host>/health
Expected response:
{"status":"ok","postgres":"connected","redis":"connected"}
Troubleshooting:
- If `postgres: "error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis: "error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
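A simple liveness probe can key off the `status` field of that response. The sketch below checks a captured sample with `grep` — in practice you would substitute the output of `curl http://<host>/health` for the hard-coded string:

```shell
# Sample response captured from the health endpoint (stand-in for a live curl call)
resp='{"status":"ok","postgres":"connected","redis":"connected"}'
if printf '%s' "$resp" | grep -q '"status":"ok"'; then
  echo healthy
else
  echo unhealthy
fi
```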
Rotate the JWT Signing Key
All active tokens become invalid after a key rotation — agents must re-authenticate.
- Generate a new RSA-2048 key pair:
  openssl genrsa -out new-private.pem 2048
  openssl rsa -in new-private.pem -pubout -out new-public.pem
- Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
- Perform a rolling restart:
  - ECS: trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
  - Cloud Run: deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
- Tokens signed with the old key will fail verification immediately after all instances have restarted.
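When the new keys go into a `.env` file rather than a secrets manager, the multi-line PEM must be collapsed to a single line with literal `\n` escapes. A sketch using `awk` — the sample file below is a stand-in for `new-private.pem`:

```shell
# Create a sample PEM-style file for illustration (stand-in for new-private.pem)
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' 'MIIBexample' '-----END RSA PRIVATE KEY-----' > sample.pem
# Collapse to one line, joining lines with literal \n escape sequences
awk 'NR>1{printf "\\n"} {printf "%s", $0}' sample.pem
echo
```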
Revoke All Tokens for a Compromised Agent
Suspend the agent to stop new token issuance immediately:
curl -X PATCH http://<host>/api/v1/agents/<agentId> \
-H "Authorization: Bearer <admin_token>" \
-H "Content-Type: application/json" \
-d '{"status": "suspended"}'
This prevents any new POST /api/v1/token requests for that agent. Active tokens remain valid until their TTL (1 hour). To invalidate active tokens immediately, also revoke all credentials for the agent:
# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
-H "Authorization: Bearer <admin_token>"
# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
-H "Authorization: Bearer <admin_token>"
Read Audit Logs for an Incident
Query the audit log with date range and agent filter:
curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
-H "Authorization: Bearer <admin_token>"
Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: eventId, agentId, action, outcome, ipAddress, userAgent, metadata, timestamp.
Common action values: token.issued, token.revoked, token.introspected, agent.created, agent.updated, agent.suspended, agent.decommissioned, credential.generated, credential.rotated, credential.revoked, auth.failed.