docs: engineering knowledge base for new hires

Complete docs/engineering/ suite: 12 documents covering company overview, system architecture, tech stack ADRs, codebase structure, service deep dives, annotated code walkthroughs, dev setup, engineering workflow, testing strategy, deployment/ops, SDK guide, and README index. All content verified against source files. All 82 tasks in openspec/changes/engineering-docs/tasks.md marked complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 12:38:42 +00:00
parent 1f95cfe89d
commit eced5f8699
13 changed files with 3820 additions and 0 deletions
--- a/docs/engineering/10-deployment.md
+++ b/docs/engineering/10-deployment.md
@@ -0,0 +1,273 @@
+# 10 — Deployment and Operations
+
+This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
+
+---
+
+## 1. Docker Build and Run
+
+The Dockerfile uses a two-stage build:
+
+- **Stage 1 (builder):** `node:18-alpine` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
+- **Stage 2 (production):** `node:18-alpine` — copies `dist/` and `node_modules` (production only), runs as the built-in non-root `node` user.
+
+```bash
+# Build
+docker build -t sentryagent-idp:latest .
+
+# Run (supply required env vars)
+docker run -d \
+  -p 3000:3000 \
+  -e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
+  -e REDIS_URL=redis://<host>:6379 \
+  -e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
+  -e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
+  sentryagent-idp:latest
+```
+
+The container listens on port `3000` by default. To change it, set the `PORT` environment variable and adjust the `-p` mapping to match.
+
+For local full-stack development, use Docker Compose instead:
+
+```bash
+docker compose up -d
+```
+
+The `docker-compose.yml` starts the app, PostgreSQL 14, and Redis 7 with health checks and data volumes.
+
+---
+
+## 2. Environment Variables Reference
+
+All variables are loaded at startup via `dotenv`. In production, inject them directly into the process environment — do not commit `.env` to version control.
+
+| Variable | Required | Default | Purpose |
+|----------|----------|---------|---------|
+| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
+| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
+| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
+| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
+| `PORT` | No | `3000` | HTTP port the Express server listens on |
+| `NODE_ENV` | No | `undefined` | Set to `production` in production, `test` in test (disables Morgan logging in test) |
+| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to specific URL in production (e.g. `https://app.mycompany.ai`) |
+| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set with `VAULT_TOKEN`, new credentials are stored in Vault KV v2 instead of bcrypt |
+| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
+| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
+| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
+
+**Validation at startup:** `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` are checked in `createApp()` (see `src/app.ts:117–121`). If missing, the process exits before binding to any port. `DATABASE_URL` and `REDIS_URL` are validated when their respective singletons are first initialised.
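
The startup checks described above can be sketched as a pure function over the environment. This is a minimal illustration, not the code in `src/app.ts`: the names `findEnvProblems` and `resolvePort` are hypothetical, and the sketch assumes that a blank value counts as missing.

```typescript
// Hypothetical helpers for illustration; names are not taken from src/app.ts.
type Env = Record<string, string | undefined>;

// Variables the reference table marks as required.
const REQUIRED_VARS = [
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_PRIVATE_KEY",
  "JWT_PUBLIC_KEY",
] as const;

// Returns one message per problem: required variables that are unset or blank,
// plus a Vault message when VAULT_ADDR is set without VAULT_TOKEN.
function findEnvProblems(env: Env): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_VARS) {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      problems.push(`${name} is required but not set`);
    }
  }
  if (env["VAULT_ADDR"] && !env["VAULT_TOKEN"]) {
    problems.push("VAULT_TOKEN is required when VAULT_ADDR is set");
  }
  return problems;
}

// Resolves the listen port, falling back to the documented default of 3000
// when PORT is unset, blank, or not a positive integer.
function resolvePort(env: Env): number {
  const parsed = Number(env["PORT"]);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 3000;
}
```

In a Node process this would typically be called as `findEnvProblems(process.env)` before the server binds to a port, exiting if the returned list is non-empty.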
+
+---
+
+## 3. Database Migrations
+
+Migrations are plain SQL files in `src/db/migrations/`. They are append-only — never modify an existing migration file. Always create a new numbered file.
+
+Current migration files:
+
+| File | What it creates |
+|------|----------------|
+| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
+| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
+| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
+| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
+| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
+
+**Run migrations:**
+
+```bash
+npm run db:migrate
+```
+
+This runs `scripts/migrate.ts`, which applies every SQL file that has not yet been recorded in the `schema_migrations` tracking table.
+
+**Adding a new migration:**
+
+1. Create `src/db/migrations/006_<description>.sql`
+2. Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
+3. Run `npm run db:migrate`
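
The selection logic behind "apply what has not been recorded yet" can be sketched as follows. This is an assumption about how such a runner might work, not the contents of `scripts/migrate.ts`. The function names are hypothetical, and the sketch assumes file names follow the `NNN_description.sql` pattern shown in the table above.

```typescript
// Hypothetical sketch of pending-migration selection; not the code in scripts/migrate.ts.

// Extracts the numeric prefix from a name such as "005_add_vault_path.sql".
function migrationNumber(fileName: string): number {
  const match = /^(\d+)_.+\.sql$/.exec(fileName);
  if (match === null) {
    throw new Error(`Unexpected migration file name: ${fileName}`);
  }
  return Number(match[1]);
}

// Returns the files not yet recorded as applied, in ascending numeric order.
// filter() creates a new array, so the caller's list is not mutated by sort().
function pendingMigrations(allFiles: string[], applied: ReadonlySet<string>): string[] {
  return allFiles
    .filter((name) => !applied.has(name))
    .sort((a, b) => migrationNumber(a) - migrationNumber(b));
}

// Builds the next file name, zero-padded to three digits like the existing files.
function nextMigrationName(allFiles: string[], description: string): string {
  const highest = allFiles.reduce((max, name) => Math.max(max, migrationNumber(name)), 0);
  const padded = ("000" + String(highest + 1)).slice(-3);
  return `${padded}_${description}.sql`;
}
```

Ordering by the numeric prefix rather than by directory listing order keeps the result deterministic regardless of how the file system returns entries.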
+
+---
+
+## 4. Terraform Multi-Region Deployment
+
+The `terraform/` directory contains reusable modules and two environment configurations.
+
+**Directory structure:**
+
+```
+terraform/
+  modules/
+    agentidp/        # Core AgentIdP compute resources
+    lb/              # Load balancer (ALB/Cloud Load Balancer)
+    rds/             # RDS PostgreSQL (AWS)
+    redis/           # ElastiCache Redis (AWS) / Memorystore (GCP)
+  environments/
+    aws/             # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
+    gcp/             # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
+```
+
+### AWS Deployment
+
+Architecture: `Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7`
+
+All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
+
+```bash
+cd terraform/environments/aws
+terraform init
+terraform plan -var="aws_region=us-east-1"
+terraform apply
+```
+
+Resources provisioned:
+- VPC with public and private subnets across multiple availability zones
+- ECS Cluster and Fargate task definition (running `sentryagent-idp` container)
+- Application Load Balancer with HTTPS listener and health check target group
+- RDS PostgreSQL 14 (Multi-AZ for high availability)
+- ElastiCache Redis 7 (primary + replica)
+- IAM roles and instance profiles for ECS task permissions
+- Security groups enforcing least-privilege network access
+
+### GCP Deployment
+
+Architecture: `Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)`
+
+All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
+
+```bash
+cd terraform/environments/gcp
+terraform init
+terraform plan -var="gcp_region=us-central1"
+terraform apply
+```
+
+Resources provisioned:
+- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
+- Cloud Run service (auto-scales to zero, Google-managed TLS)
+- Cloud Load Balancer with global anycast IP
+- Cloud SQL PostgreSQL 14 with regional high-availability
+- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
+- IAM service accounts and bindings
+
+**Important:** All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console: changes to Terraform-managed resources will be reverted on the next `terraform apply`, and resources created by hand will not be tracked in state.
+
+---
+
+## 5. Prometheus and Grafana
+
+**Metrics endpoint:** `GET /metrics`. The endpoint is unauthenticated, so in production restrict it to the internal network or scrape it from within the cluster.
+
+The metrics endpoint is served by the `prom-client` library using a dedicated registry (`metricsRegistry`) defined in `src/metrics/registry.ts`. The registry is isolated from the default global registry to prevent conflicts in tests.
+
+### Metric Definitions
+
+All 6 metrics are defined in `src/metrics/registry.ts`:
+
+| Metric name | Type | Labels | What it measures |
+|-------------|------|--------|-----------------|
+| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
+| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
+| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
+| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5ms–2.5s) |
+| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
+| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
+
+The HTTP metrics (`agentidp_http_requests_total` and `agentidp_http_request_duration_seconds`) are populated by `metricsMiddleware` in `src/middleware/metrics.ts`, which is registered before all routes in `src/app.ts`. Route labels are normalised to replace UUIDs with `:id` to prevent high cardinality (e.g. `/api/v1/agents/:id` rather than `/api/v1/agents/a1b2c3...`).
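
The route-label normalisation described above might look like the sketch below. It is an illustration only: the real logic lives in `src/middleware/metrics.ts` and may differ, and the name `normaliseRoute` and the decision to strip the query string are assumptions.

```typescript
// Hypothetical sketch; the actual middleware in src/middleware/metrics.ts may differ.

// Matches a canonical 8-4-4-4-12 hexadecimal UUID anywhere in the path.
const UUID_SEGMENT = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

// Replaces every UUID in the path with ":id" so that each agent or credential
// does not produce its own label value (and its own time series).
function normaliseRoute(path: string): string {
  // Assumption: query strings are dropped so they never reach the label.
  const queryStart = path.indexOf("?");
  const withoutQuery = queryStart === -1 ? path : path.slice(0, queryStart);
  return withoutQuery.replace(UUID_SEGMENT, ":id");
}
```

Without this step, the `route` label would take a new value for every agent ID, and the number of time series would grow with the number of agents.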
+
+### Local Grafana
+
+```bash
+docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
+```
+
+- Prometheus: http://localhost:9090
+- Grafana: http://localhost:3001 (admin password: `agentidp`)
+
+The monitoring compose overlay starts `prom/prometheus:v2.53.0` and `grafana/grafana:11.2.0`. Grafana dashboards and datasource provisioning are loaded from `monitoring/grafana/provisioning/`.
+
+### Adding a New Metric
+
+1. Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
+2. Export it from that file.
+3. Import it in the file where the instrumentation point lives.
+4. Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
+5. Verify it appears in `GET /metrics` after starting the server.
+
+---
+
+## 6. Operational Runbook
+
+### Health Check
+
+```bash
+curl http://<host>/health
+```
+
+Expected response:
+
+```json
+{"status":"ok","postgres":"connected","redis":"connected"}
+```
+
+Troubleshooting:
+- If `postgres` is `"error"`, verify that `DATABASE_URL` is correct and PostgreSQL is reachable. For local dev, check `docker compose logs postgres`.
+- If `redis` is `"error"`, verify that `REDIS_URL` is correct and Redis is reachable. For local dev, check `docker compose logs redis`.
+- If the health endpoint returns 502 or times out, the app process has most likely crashed or is unreachable behind the load balancer; check the application logs.
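
For automated checks, the troubleshooting rules above can be expressed as a small function over the parsed response body. This is a sketch under assumptions: the `HealthResponse` shape is inferred from the example response, `diagnoseHealth` is a hypothetical name, and any value other than `"connected"` is treated as unhealthy.

```typescript
// Shape inferred from the example /health response; treat it as an assumption.
interface HealthResponse {
  status: string;
  postgres: string;
  redis: string;
}

// Maps a health response to the troubleshooting hints from the runbook.
// An empty list means both dependencies report "connected".
function diagnoseHealth(body: HealthResponse): string[] {
  const hints: string[] = [];
  if (body.postgres !== "connected") {
    hints.push("Verify DATABASE_URL and that PostgreSQL is reachable");
  }
  if (body.redis !== "connected") {
    hints.push("Verify REDIS_URL and that Redis is reachable");
  }
  return hints;
}
```

A monitoring script could call this on the JSON returned by `GET /health` and alert when the list is non-empty.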
+
+---
+
+### Rotate the JWT Signing Key
+
+All active tokens become invalid after a key rotation — agents must re-authenticate.
+
+1. Generate a new RSA-2048 key pair:
+   ```bash
+   openssl genrsa -out new-private.pem 2048
+   openssl rsa -in new-private.pem -pubout -out new-public.pem
+   ```
+2. Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
+3. Perform a rolling restart:
+   - **ECS:** trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
+   - **Cloud Run:** deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
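+4. Tokens signed with the old key fail verification once all instances have restarted. While the rollout is in progress, old and new instances hold different keys, so a token may be accepted by one instance and rejected by another.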
+
+---
+
+### Revoke All Tokens for a Compromised Agent
+
+Suspend the agent to stop new token issuance immediately:
+
+```bash
+curl -X PATCH http://<host>/api/v1/agents/<agentId> \
+  -H "Authorization: Bearer <admin_token>" \
+  -H "Content-Type: application/json" \
+  -d '{"status": "suspended"}'
+```
+
+This blocks new `POST /api/v1/token` requests for that agent. Tokens that have already been issued remain valid until they expire (1-hour TTL). To invalidate them immediately, also revoke all of the agent's credentials:
+
+```bash
+# List credentials
+curl http://<host>/api/v1/agents/<agentId>/credentials \
+  -H "Authorization: Bearer <admin_token>"
+
+# Revoke each active credential
+curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
+  -H "Authorization: Bearer <admin_token>"
+```
+
+---
+
+### Read Audit Logs for an Incident
+
+Query the audit log with date range and agent filter:
+
+```bash
+curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
+  -H "Authorization: Bearer <admin_token>"
+```
+
+Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: `eventId`, `agentId`, `action`, `outcome`, `ipAddress`, `userAgent`, `metadata`, `timestamp`.
+
+Common `action` values: `token.issued`, `token.revoked`, `token.introspected`, `agent.created`, `agent.updated`, `agent.suspended`, `agent.decommissioned`, `credential.generated`, `credential.rotated`, `credential.revoked`, `auth.failed`.
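
When scripting incident investigations, the audit query URL shown earlier can be assembled programmatically instead of by hand. The sketch below is illustrative: `buildAuditUrl` and `AuditQuery` are hypothetical names. It also assumes the API accepts ISO 8601 timestamps with milliseconds, as produced by `Date.prototype.toISOString()`, which the documentation above does not confirm.

```typescript
// Hypothetical helper; parameter names mirror the curl example in this section.
interface AuditQuery {
  agentId: string;
  startDate: Date;
  endDate: Date;
}

// Builds the audit query URL with percent-encoded parameter values.
function buildAuditUrl(baseUrl: string, query: AuditQuery): string {
  if (query.startDate.getTime() > query.endDate.getTime()) {
    throw new Error("startDate must not be after endDate");
  }
  const params: Array<[string, string]> = [
    ["agentId", query.agentId],
    // Assumption: the API accepts timestamps with a milliseconds component.
    ["startDate", query.startDate.toISOString()],
    ["endDate", query.endDate.toISOString()],
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  // Trim trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/audit?${queryString}`;
}
```

Encoding the values guards against agent IDs or timestamps containing characters that would otherwise break the query string.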