sentryagent-idp/docs/engineering/10-deployment.md

# 10 — Deployment and Operations
This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
---
## 1. Docker Build and Run
The Dockerfile uses a two-stage build:
- **Stage 1 (build):** `node:20.11-bookworm-slim` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- **Stage 2 (final):** `node:20.11-bookworm-slim` — copies `dist/` and `node_modules` (production only), installs `curl` for healthcheck, and runs as the created non-root `nodeapp` user (UID 1001).
```bash
# Build
docker build -t sentryagent-idp:1.0.0 .
# Run (supply required env vars)
docker run -d \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
  -e REDIS_URL=redis://<host>:6379 \
  -e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
  -e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
  sentryagent-idp:1.0.0
```
The container exposes port `3000`. Override with `PORT` environment variable if needed. The container runs as non-root user `nodeapp` (UID 1001) — do not mount volumes requiring root ownership.
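The `JWT_PRIVATE_KEY`/`JWT_PUBLIC_KEY` values above embed the PEM content with literal `\n` escapes. A small helper can produce that format from a key file — this is a sketch, and `demo.pem` is a stand-in file name, not a file in the repo:

```shell
# Flatten a PEM file into one line with literal \n escapes, suitable for a
# -e value or an .env entry (sketch; file names here are illustrative).
pem_to_env() {
  awk 'NR>1 {printf "\\n"} {printf "%s", $0}' "$1"
}

# Demo with a stand-in PEM body:
printf -- '-----BEGIN X-----\nabc\n-----END X-----\n' > demo.pem
pem_to_env demo.pem   # -> -----BEGIN X-----\nabc\n-----END X-----
```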
For local full-stack development, use Docker Compose instead:
```bash
docker compose up --build -d
```
The `compose.yaml` starts the app, PostgreSQL 14.12, and Redis 7.2 with health checks, resource limits, restart policies, and data volumes — per DockerSpec standards.
---
## 2. Environment Variables Reference
All variables are loaded at startup via `dotenv`. In production, inject them directly into the process environment — do not commit `.env` to version control.
| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | `undefined` | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set together with `VAULT_TOKEN`, newly generated credentials are stored in Vault KV v2 instead of as bcrypt hashes |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
**Validation at startup:** `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` are checked in `createApp()` (see `src/app.ts:117–121`). If missing, the process exits before binding to any port. `DATABASE_URL` and `REDIS_URL` are validated when their respective singletons are first initialised.
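As a deploy-time preflight, the same four required variables can be checked from the shell. This helper is illustrative and not part of the codebase; the authoritative checks live in `createApp()` and the database/Redis singletons:

```shell
# Fail fast if any required variable is unset or empty (illustrative sketch).
check_required_env() {
  for v in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v"
      return 1
    fi
  done
  echo "ok"
}
```

Run it in the deployment environment before starting the container; it prints the first missing variable and exits non-zero.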
---
## 3. Database Migrations
Migrations are plain SQL files in `src/db/migrations/`. They are append-only — never modify an existing migration file. Always create a new numbered file.
Current migration files:
| File | What it creates |
|------|----------------|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
**Run migrations:**
```bash
npm run db:migrate
```
This executes `scripts/migrate.ts` which applies all SQL files that have not yet been recorded in the `schema_migrations` tracking table.
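The selection logic can be sketched in shell, with a plain text file standing in for the `schema_migrations` table (an illustration of the idea only, not the actual `migrate.ts` code):

```shell
# Stand-in state: two migration files, one already recorded as applied.
mkdir -p demo-migrations
touch demo-migrations/001_create_agents.sql demo-migrations/002_create_credentials.sql
printf '001_create_agents.sql\n' > applied.txt

# Emit, in filename order, every migration not yet listed in the tracking store.
pending_migrations() {
  for f in demo-migrations/*.sql; do
    grep -qx "$(basename "$f")" applied.txt || basename "$f"
  done
}

pending_migrations   # -> 002_create_credentials.sql
```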
**Adding a new migration:**
1. Create `src/db/migrations/006_<description>.sql`
2. Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
3. Run `npm run db:migrate`
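Steps 1 and 2 might look like this — the file name `006_add_agent_notes.sql` and its SQL are purely illustrative, not a real migration:

```shell
# Scaffold the next numbered migration file (illustrative name and SQL).
mkdir -p src/db/migrations
cat > src/db/migrations/006_add_agent_notes.sql <<'SQL'
ALTER TABLE agents ADD COLUMN IF NOT EXISTS notes TEXT;
SQL
```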
---
## 4. Terraform Multi-Region Deployment
The `terraform/` directory contains reusable modules and two environment configurations.
**Directory structure:**
```
terraform/
  modules/
    agentidp/        # Core AgentIdP compute resources
    lb/              # Load balancer (ALB/Cloud Load Balancer)
    rds/             # RDS PostgreSQL (AWS)
    redis/           # ElastiCache Redis (AWS) / Memorystore (GCP)
  environments/
    aws/             # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
    gcp/             # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
```
### AWS Deployment
Architecture: `Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7`
All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
```bash
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
```
Resources provisioned:
- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access
### GCP Deployment
Architecture: `Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)`
All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
```bash
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
```
Resources provisioned:
- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with global anycast IP
- Cloud SQL PostgreSQL 14 with regional high-availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings
**Important:** All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next `terraform apply` and will not be tracked in state.
---
## 5. Prometheus and Grafana
**Metrics endpoint:** `GET /metrics` (unauthenticated — restrict in production to internal network or scrape from within the cluster)
The metrics endpoint is served by the `prom-client` library using a dedicated registry (`metricsRegistry`) defined in `src/metrics/registry.ts`. The registry is isolated from the default global registry to prevent conflicts in tests.
### Metric Definitions
All 6 metrics are defined in `src/metrics/registry.ts`:
| Metric name | Type | Labels | What it measures |
|-------------|------|--------|-----------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5ms–2.5s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
The HTTP metrics (`agentidp_http_requests_total` and `agentidp_http_request_duration_seconds`) are populated by `metricsMiddleware` in `src/middleware/metrics.ts`, which is registered before all routes in `src/app.ts`. Route labels are normalised to replace UUIDs with `:id` to prevent high cardinality (e.g. `/api/v1/agents/:id` rather than `/api/v1/agents/a1b2c3...`).
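The normalisation can be pictured as a single regex substitution. This sed sketch only illustrates the idea; the actual implementation is the TypeScript middleware in `src/middleware/metrics.ts`:

```shell
# Replace any UUID path segment with :id to keep label cardinality bounded.
normalize_route() {
  printf '%s' "$1" |
    sed -E 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/:id/g'
}

normalize_route "/api/v1/agents/a1b2c3d4-e5f6-7890-abcd-ef1234567890"   # -> /api/v1/agents/:id
```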
### Local Grafana
```bash
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
```
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: `GF_ADMIN_PASSWORD` value from `.env`)
The monitoring compose overlay starts `prom/prometheus:v2.53.0` and `grafana/grafana:11.2.0`. Grafana dashboards and datasource provisioning are loaded from `monitoring/grafana/provisioning/`.
### Adding a New Metric
1. Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
2. Export it from that file.
3. Import it in the file where the instrumentation point lives.
4. Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
5. Verify it appears in `GET /metrics` after starting the server.
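Step 5 can be scripted as a grep over the scrape body. The sample below uses a captured stand-in body; against a live server, pipe `curl -s http://localhost:3000/metrics` into the same grep:

```shell
# Stand-in for a live GET /metrics response body.
metrics_body='# HELP agentidp_tokens_issued_total Total OAuth 2.0 access tokens issued
agentidp_tokens_issued_total{scope="read"} 42'

# Count exported agentidp_* series lines (# HELP/# TYPE comments are excluded
# by the line anchor).
printf '%s\n' "$metrics_body" | grep -c '^agentidp_'   # -> 1
```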
---
## 6. Operational Runbook
### Health Check
```bash
curl http://<host>/health
```
Expected response:
```json
{"status":"ok","postgres":"connected","redis":"connected"}
```
Troubleshooting:
- If `postgres: "error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis: "error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
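For scripting (for example a deploy gate), the body can be evaluated without extra tooling. This sketch assumes the exact JSON shape shown above:

```shell
# Return success only when both dependencies report "connected".
health_ok() {
  printf '%s' "$1" | grep -q '"postgres":"connected"' &&
    printf '%s' "$1" | grep -q '"redis":"connected"'
}

if health_ok '{"status":"ok","postgres":"connected","redis":"connected"}'; then
  echo "healthy"
fi
```

In practice, feed it the live response: `health_ok "$(curl -s http://<host>/health)"`.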
---
### Rotate the JWT Signing Key
All active tokens become invalid after a key rotation — agents must re-authenticate.
1. Generate a new RSA-2048 key pair:
```bash
openssl genrsa -out new-private.pem 2048
openssl rsa -in new-private.pem -pubout -out new-public.pem
```
2. Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
3. Perform a rolling restart:
- **ECS:** trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
- **Cloud Run:** deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
4. Tokens signed with the old key will fail verification immediately after all instances have restarted.
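Before updating the deployment secrets in step 2, it is worth confirming the two files are actually a pair, since deploying a mismatched public key would reject every newly signed token. A sketch, assuming `openssl` is available:

```shell
# Derive the public key from the private key and compare byte-for-byte.
verify_keypair() {
  openssl rsa -in "$1" -pubout 2>/dev/null | diff -q - "$2" >/dev/null 2>&1
}

if verify_keypair new-private.pem new-public.pem; then
  echo "key pair matches"
else
  echo "MISMATCH - do not deploy"
fi
```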
---
### Revoke All Tokens for a Compromised Agent
Suspend the agent to stop new token issuance immediately:
```bash
curl -X PATCH http://<host>/api/v1/agents/<agentId> \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "suspended"}'
```
This prevents any new `POST /api/v1/token` requests for that agent. Active tokens remain valid until they expire (1-hour TTL). To invalidate active tokens immediately, also revoke all credentials for the agent:
```bash
# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
  -H "Authorization: Bearer <admin_token>"

# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
  -H "Authorization: Bearer <admin_token>"
```
---
### Read Audit Logs for an Incident
Query the audit log with date range and agent filter:
```bash
curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
  -H "Authorization: Bearer <admin_token>"
```
Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: `eventId`, `agentId`, `action`, `outcome`, `ipAddress`, `userAgent`, `metadata`, `timestamp`.
Common `action` values: `token.issued`, `token.revoked`, `token.introspected`, `agent.created`, `agent.updated`, `agent.suspended`, `agent.decommissioned`, `credential.generated`, `credential.rotated`, `credential.revoked`, `auth.failed`.