# 10 — Deployment and Operations

This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.

---

## 1. Docker Build and Run

The Dockerfile uses a two-stage build:

- **Stage 1 (build):** `node:20.11-bookworm-slim` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- **Stage 2 (final):** `node:20.11-bookworm-slim` — copies `dist/` and production-only `node_modules`, installs `curl` for the healthcheck, and runs as a dedicated non-root `nodeapp` user (UID 1001).

```bash
# Build
docker build -t sentryagent-idp:1.0.0 .

# Run (supply required env vars)
docker run -d \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://sentryagent:sentryagent@<db-host>:5432/sentryagent_idp \
  -e REDIS_URL=redis://<redis-host>:6379 \
  -e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
  -e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
  sentryagent-idp:1.0.0
```

The container exposes port `3000`; override it with the `PORT` environment variable if needed. Because the container runs as the non-root `nodeapp` user (UID 1001), do not mount volumes that require root ownership.

For local full-stack development, use Docker Compose instead:

```bash
docker compose up --build -d
```

The `compose.yaml` starts the app, PostgreSQL 14.12, and Redis 7.2 with health checks, resource limits, restart policies, and data volumes — per DockerSpec standards.

---

## 2. Environment Variables Reference

All variables are loaded at startup via `dotenv`. In production, inject them directly into the process environment — do not commit `.env` to version control.

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<database>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | `undefined` | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to a specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set with `VAULT_TOKEN`, new credentials are stored in Vault KV v2 instead of bcrypt |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |

**Validation at startup:** `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` are checked in `createApp()` (see `src/app.ts:117–121`). If either is missing, the process exits before binding to any port. `DATABASE_URL` and `REDIS_URL` are validated when their respective singletons are first initialised.

---

## 3. Database Migrations

Migrations are plain SQL files in `src/db/migrations/`. They are append-only — never modify an existing migration file; always create a new numbered file.
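The numbered, append-only convention is what the migration runner keys on: it applies only the files not yet recorded as applied, presumably in sorted filename order. A standalone sketch of that selection step (the helper and its signature are illustrative, not the actual `scripts/migrate.ts`):

```typescript
// Hypothetical sketch: given the .sql files on disk and the names
// already recorded in the schema_migrations tracking table, return
// the migrations still to be applied, in order.
function pendingMigrations(onDisk: string[], applied: Set<string>): string[] {
  return onDisk
    .filter((f) => f.endsWith(".sql") && !applied.has(f))
    .sort(); // zero-padded prefixes (001_, 002_, ...) sort correctly as strings
}

const files = [
  "003_create_audit_events.sql",
  "001_create_agents.sql",
  "002_create_credentials.sql",
];
const done = new Set(["001_create_agents.sql"]);
console.log(pendingMigrations(files, done));
// pending: 002_create_credentials.sql, then 003_create_audit_events.sql
```

The zero-padded prefix is load-bearing here: `010_` would sort before `002_` without it, so keep the three-digit convention as the file count grows.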
Current migration files:

| File | What it creates |
|------|-----------------|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |

**Run migrations:**

```bash
npm run db:migrate
```

This executes `scripts/migrate.ts`, which applies every SQL file not yet recorded in the `schema_migrations` tracking table.

**Adding a new migration:**

1. Create `src/db/migrations/006_<description>.sql`
2. Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
3. Run `npm run db:migrate`

---

## 4. Terraform Multi-Region Deployment

The `terraform/` directory contains reusable modules and two environment configurations.

**Directory structure:**

```
terraform/
  modules/
    agentidp/   # Core AgentIdP compute resources
    lb/         # Load balancer (ALB/Cloud Load Balancer)
    rds/        # RDS PostgreSQL (AWS)
    redis/      # ElastiCache Redis (AWS) / Memorystore (GCP)
  environments/
    aws/        # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
    gcp/        # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
```

### AWS Deployment

Architecture:

`Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7`

All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
```bash
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
```

Resources provisioned:

- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running the `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health-check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access

### GCP Deployment

Architecture:

`Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)`

All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.

```bash
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
```

Resources provisioned:

- VPC network with a Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with a global anycast IP
- Cloud SQL PostgreSQL 14 with regional high availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings

**Important:** All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next `terraform apply` and will not be tracked in state.

---

## 5. Prometheus and Grafana

**Metrics endpoint:** `GET /metrics` (unauthenticated — restrict it in production to the internal network, or scrape from within the cluster)

The metrics endpoint is served by the `prom-client` library using a dedicated registry (`metricsRegistry`) defined in `src/metrics/registry.ts`. The registry is isolated from the default global registry to prevent conflicts in tests.
### Metric Definitions

All six metrics are defined in `src/metrics/registry.ts`:

| Metric name | Type | Labels | What it measures |
|-------------|------|--------|------------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5 ms–2.5 s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |

The HTTP metrics (`agentidp_http_requests_total` and `agentidp_http_request_duration_seconds`) are populated by `metricsMiddleware` in `src/middleware/metrics.ts`, which is registered before all routes in `src/app.ts`. Route labels are normalised by replacing UUIDs with `:id` to prevent high cardinality (e.g. `/api/v1/agents/:id` rather than `/api/v1/agents/a1b2c3...`).

### Local Grafana

```bash
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
```

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: the `GF_ADMIN_PASSWORD` value from `.env`)

The monitoring compose overlay starts `prom/prometheus:v2.53.0` and `grafana/grafana:11.2.0`. Grafana dashboards and datasource provisioning are loaded from `monitoring/grafana/provisioning/`.

### Adding a New Metric

1. Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
2. Export it from that file.
3. Import it in the file where the instrumentation point lives.
4. Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
5. Verify it appears in `GET /metrics` after starting the server.

---

## 6. Operational Runbook

### Health Check

```bash
curl http://<host>/health
```

Expected response:

```json
{"status":"ok","postgres":"connected","redis":"connected"}
```

Troubleshooting:

- If `postgres` is `"error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis` is `"error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check the application logs.

---

### Rotate the JWT Signing Key

All active tokens become invalid after a key rotation — agents must re-authenticate.

1. Generate a new RSA-2048 key pair:

   ```bash
   openssl genrsa -out new-private.pem 2048
   openssl rsa -in new-private.pem -pubout -out new-public.pem
   ```

2. Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
3. Perform a rolling restart:
   - **ECS:** trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
   - **Cloud Run:** deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
4. Tokens signed with the old key fail verification once all instances have restarted.

---

### Revoke All Tokens for a Compromised Agent

Suspend the agent to stop new token issuance immediately:

```bash
curl -X PATCH http://<host>/api/v1/agents/<agent-id> \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}'
```

This prevents any new `POST /api/v1/token` requests for that agent. Active tokens remain valid until their TTL expires (1 hour).
To invalidate active tokens immediately, also revoke all credentials for the agent:

```bash
# List credentials
curl http://<host>/api/v1/agents/<agent-id>/credentials \
  -H "Authorization: Bearer <admin-token>"

# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agent-id>/credentials/<credential-id> \
  -H "Authorization: Bearer <admin-token>"
```

---

### Read Audit Logs for an Incident

Query the audit log with a date range and agent filter:

```bash
curl "http://<host>/api/v1/audit?agentId=<agent-id>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
  -H "Authorization: Bearer <admin-token>"
```

Events are returned newest-first. Audit log retention is 90 days on the free tier.

Each event includes: `eventId`, `agentId`, `action`, `outcome`, `ipAddress`, `userAgent`, `metadata`, `timestamp`.

Common `action` values: `token.issued`, `token.revoked`, `token.introspected`, `agent.created`, `agent.updated`, `agent.suspended`, `agent.decommissioned`, `credential.generated`, `credential.rotated`, `credential.revoked`, `auth.failed`.
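During triage it often helps to bucket a fetched page of events by `action` — for example, to spot a burst of `auth.failed` inside the incident window. A small sketch, assuming the JSON response has already been parsed (the helper name and the trimmed event shape are illustrative):

```typescript
// Minimal slice of the documented event shape, enough for bucketing.
interface AuditEvent {
  eventId: string;
  agentId: string;
  action: string;
  outcome: string;
  timestamp: string;
}

// Count how often each action appears in a page of audit events.
function countByAction(events: AuditEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    counts.set(e.action, (counts.get(e.action) ?? 0) + 1);
  }
  return counts;
}
```

Because events are returned newest-first, running this over successive pages walks backwards through the incident; a spike in `auth.failed` followed by `credential.rotated` is a typical compromise-and-remediation signature.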