sentryagent-idp/docs/engineering/10-deployment.md

# 10 — Deployment and Operations
This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
---
## 1. Docker Build and Run
The Dockerfile uses a two-stage build:
- **Stage 1 (build):** `node:20.11-bookworm-slim` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- **Stage 2 (final):** `node:20.11-bookworm-slim` — copies `dist/` and `node_modules` (production only), installs `curl` for healthcheck, and runs as the created non-root `nodeapp` user (UID 1001).
```bash
# Build
docker build -t sentryagent-idp:1.0.0 .
# Run (supply required env vars)
docker run -d \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
  -e REDIS_URL=redis://<host>:6379 \
  -e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
  -e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
  sentryagent-idp:1.0.0
```
The container exposes port `3000`. Override with `PORT` environment variable if needed. The container runs as non-root user `nodeapp` (UID 1001) — do not mount volumes requiring root ownership.
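The `JWT_PRIVATE_KEY`/`JWT_PUBLIC_KEY` values above embed the PEM content with literal `\n` escapes. A small helper can produce that format from a key file — this is a sketch, and `demo.pem` is a stand-in file name, not a file in the repo:

```shell
# Flatten a PEM file into one line with literal \n escapes, suitable for a
# -e value or an .env entry (sketch; file names here are illustrative).
pem_to_env() {
  awk 'NR>1 {printf "\\n"} {printf "%s", $0}' "$1"
}

# Demo with a stand-in PEM body:
printf -- '-----BEGIN X-----\nabc\n-----END X-----\n' > demo.pem
pem_to_env demo.pem   # -> -----BEGIN X-----\nabc\n-----END X-----
```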
For local full-stack development, use Docker Compose instead:
```bash
docker compose up --build -d
```
The `compose.yaml` starts the app, PostgreSQL 14.12, and Redis 7.2 with health checks, resource limits, restart policies, and data volumes — per DockerSpec standards.
---
## 2. Environment Variables Reference
All variables are loaded at startup via `dotenv`. In production, inject them directly into the process environment — do not commit `.env` to version control.
| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | `undefined` | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set together with `VAULT_TOKEN`, newly generated credentials are stored in Vault KV v2 instead of as bcrypt hashes |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
**Validation at startup:** `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` are checked in `createApp()` (see `src/app.ts:117–121`). If missing, the process exits before binding to any port. `DATABASE_URL` and `REDIS_URL` are validated when their respective singletons are first initialised.
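As a deploy-time preflight, the same four required variables can be checked from the shell. This helper is illustrative and not part of the codebase; the authoritative checks live in `createApp()` and the database/Redis singletons:

```shell
# Fail fast if any required variable is unset or empty (illustrative sketch).
check_required_env() {
  for v in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v"
      return 1
    fi
  done
  echo "ok"
}
```

Run it in the deployment environment before starting the container; it prints the first missing variable and exits non-zero.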
---
## 3. Database Migrations
Migrations are plain SQL files in `src/db/migrations/`. They are append-only — never modify an existing migration file. Always create a new numbered file.
Current migration files:
| File | What it creates |
|------|----------------|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
**Run migrations:**
```bash
npm run db:migrate
```
This executes `scripts/migrate.ts` which applies all SQL files that have not yet been recorded in the `schema_migrations` tracking table.
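The selection logic can be sketched in shell, with a plain text file standing in for the `schema_migrations` table (an illustration of the idea only, not the actual `migrate.ts` code):

```shell
# Stand-in state: two migration files, one already recorded as applied.
mkdir -p demo-migrations
touch demo-migrations/001_create_agents.sql demo-migrations/002_create_credentials.sql
printf '001_create_agents.sql\n' > applied.txt

# Emit, in filename order, every migration not yet listed in the tracking store.
pending_migrations() {
  for f in demo-migrations/*.sql; do
    grep -qx "$(basename "$f")" applied.txt || basename "$f"
  done
}

pending_migrations   # -> 002_create_credentials.sql
```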
**Adding a new migration:**
1. Create `src/db/migrations/006_<description>.sql`
2. Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
3. Run `npm run db:migrate`
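Steps 1 and 2 might look like this — the file name `006_add_agent_notes.sql` and its SQL are purely illustrative, not a real migration:

```shell
# Scaffold the next numbered migration file (illustrative name and SQL).
mkdir -p src/db/migrations
cat > src/db/migrations/006_add_agent_notes.sql <<'SQL'
ALTER TABLE agents ADD COLUMN IF NOT EXISTS notes TEXT;
SQL
```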
---
## 4. Terraform Multi-Region Deployment
The `terraform/` directory contains reusable modules and two environment configurations.
**Directory structure:**
```
terraform/
  modules/
    agentidp/        # Core AgentIdP compute resources
    lb/              # Load balancer (ALB/Cloud Load Balancer)
    rds/             # RDS PostgreSQL (AWS)
    redis/           # ElastiCache Redis (AWS) / Memorystore (GCP)
  environments/
    aws/             # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
    gcp/             # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
```
### AWS Deployment
Architecture: `Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7`
All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
```bash
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
```
Resources provisioned:
- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access
### GCP Deployment
Architecture: `Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)`
All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
```bash
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
```
Resources provisioned:
- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with global anycast IP
- Cloud SQL PostgreSQL 14 with regional high-availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings
**Important:** All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next `terraform apply` and will not be tracked in state.
---
## 5. Prometheus and Grafana
**Metrics endpoint:** `GET /metrics` (unauthenticated — restrict in production to internal network or scrape from within the cluster)
The metrics endpoint is served by the `prom-client` library using a dedicated registry (`metricsRegistry`) defined in `src/metrics/registry.ts`. The registry is isolated from the default global registry to prevent conflicts in tests.
### Metric Definitions
All 6 metrics are defined in `src/metrics/registry.ts`:
| Metric name | Type | Labels | What it measures |
|-------------|------|--------|-----------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5ms–2.5s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
The HTTP metrics (`agentidp_http_requests_total` and `agentidp_http_request_duration_seconds`) are populated by `metricsMiddleware` in `src/middleware/metrics.ts`, which is registered before all routes in `src/app.ts`. Route labels are normalised to replace UUIDs with `:id` to prevent high cardinality (e.g. `/api/v1/agents/:id` rather than `/api/v1/agents/a1b2c3...`).
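The normalisation can be pictured as a single regex substitution. This sed sketch only illustrates the idea; the actual implementation is the TypeScript middleware in `src/middleware/metrics.ts`:

```shell
# Replace any UUID path segment with :id to keep label cardinality bounded.
normalize_route() {
  printf '%s' "$1" |
    sed -E 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/:id/g'
}

normalize_route "/api/v1/agents/a1b2c3d4-e5f6-7890-abcd-ef1234567890"   # -> /api/v1/agents/:id
```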
### Local Grafana
```bash
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
```
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: `GF_ADMIN_PASSWORD` value from `.env`)
The monitoring compose overlay starts `prom/prometheus:v2.53.0` and `grafana/grafana:11.2.0`. Grafana dashboards and datasource provisioning are loaded from `monitoring/grafana/provisioning/`.
### Adding a New Metric
1. Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
2. Export it from that file.
3. Import it in the file where the instrumentation point lives.
4. Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
5. Verify it appears in `GET /metrics` after starting the server.
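Step 5 can be scripted as a grep over the scrape body. The sample below uses a captured stand-in body; against a live server, pipe `curl -s http://localhost:3000/metrics` into the same grep:

```shell
# Stand-in for a live GET /metrics response body.
metrics_body='# HELP agentidp_tokens_issued_total Total OAuth 2.0 access tokens issued
agentidp_tokens_issued_total{scope="read"} 42'

# Count exported agentidp_* series lines (# HELP/# TYPE comments are excluded
# by the line anchor).
printf '%s\n' "$metrics_body" | grep -c '^agentidp_'   # -> 1
```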
---
## 6. Operational Runbook
### Health Check
```bash
curl http://<host>/health
```
Expected response:
```json
{"status":"ok","postgres":"connected","redis":"connected"}
```
Troubleshooting:
- If `postgres: "error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis: "error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
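For scripting (for example a deploy gate), the body can be evaluated without extra tooling. This sketch assumes the exact JSON shape shown above:

```shell
# Return success only when both dependencies report "connected".
health_ok() {
  printf '%s' "$1" | grep -q '"postgres":"connected"' &&
    printf '%s' "$1" | grep -q '"redis":"connected"'
}

if health_ok '{"status":"ok","postgres":"connected","redis":"connected"}'; then
  echo "healthy"
fi
```

In practice, feed it the live response: `health_ok "$(curl -s http://<host>/health)"`.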
---
### Rotate the JWT Signing Key
All active tokens become invalid after a key rotation — agents must re-authenticate.
1. Generate a new RSA-2048 key pair:
```bash
openssl genrsa -out new-private.pem 2048
openssl rsa -in new-private.pem -pubout -out new-public.pem
```
2. Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
3. Perform a rolling restart:
- **ECS:** trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
- **Cloud Run:** deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
4. Tokens signed with the old key will fail verification immediately after all instances have restarted.
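Before updating the deployment secrets in step 2, it is worth confirming the two files are actually a pair, since deploying a mismatched public key would reject every newly signed token. A sketch, assuming `openssl` is available:

```shell
# Derive the public key from the private key and compare byte-for-byte.
verify_keypair() {
  openssl rsa -in "$1" -pubout 2>/dev/null | diff -q - "$2" >/dev/null 2>&1
}

if verify_keypair new-private.pem new-public.pem; then
  echo "key pair matches"
else
  echo "MISMATCH - do not deploy"
fi
```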
---
### Revoke All Tokens for a Compromised Agent
Suspend the agent to stop new token issuance immediately:
```bash
curl -X PATCH http://<host>/api/v1/agents/<agentId> \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}'
```
This prevents any new `POST /api/v1/token` requests for that agent. Active tokens remain valid until they expire (1-hour TTL). To invalidate active tokens immediately, also revoke all credentials for the agent:
```bash
# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
  -H "Authorization: Bearer <admin_token>"

# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
  -H "Authorization: Bearer <admin_token>"
```
---
### Read Audit Logs for an Incident
Query the audit log with date range and agent filter:
```bash
curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
  -H "Authorization: Bearer <admin_token>"
```
Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: `eventId`, `agentId`, `action`, `outcome`, `ipAddress`, `userAgent`, `metadata`, `timestamp`.
Common `action` values: `token.issued`, `token.revoked`, `token.introspected`, `agent.created`, `agent.updated`, `agent.suspended`, `agent.decommissioned`, `credential.generated`, `credential.rotated`, `credential.revoked`, `auth.failed`.