docs: engineering knowledge base for new hires

Complete docs/engineering/ suite — 12 documents covering company overview,
system architecture, tech stack ADRs, codebase structure, service deep dives,
annotated code walkthroughs, dev setup, engineering workflow, testing strategy,
deployment/ops, SDK guide, and README index. All content verified against
source files. All 82 tasks in openspec/changes/engineering-docs/tasks.md
marked complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-03-29 12:38:42 +00:00
parent 1f95cfe89d
commit eced5f8699
13 changed files with 3820 additions and 0 deletions

@@ -0,0 +1,273 @@
# 10 — Deployment and Operations
This document covers building and running AgentIdP in production: Docker, environment variables, database migrations, Terraform multi-region deployment, Prometheus/Grafana monitoring, and operational runbooks for common incidents.
---
## 1. Docker Build and Run
The Dockerfile uses a two-stage build:
- **Stage 1 (builder):** `node:18-alpine` — installs all dependencies (including dev) and compiles TypeScript to `dist/`.
- **Stage 2 (production):** `node:18-alpine` — copies `dist/` and `node_modules` (production only), runs as the built-in non-root `node` user.
```bash
# Build
docker build -t sentryagent-idp:latest .
# Run (supply required env vars)
docker run -d \
-p 3000:3000 \
-e DATABASE_URL=postgresql://sentryagent:sentryagent@<host>:5432/sentryagent_idp \
-e REDIS_URL=redis://<host>:6379 \
-e JWT_PRIVATE_KEY="-----BEGIN RSA PRIVATE KEY-----\n..." \
-e JWT_PUBLIC_KEY="-----BEGIN PUBLIC KEY-----\n..." \
sentryagent-idp:latest
```
The container exposes port `3000`. To use a different port, set the `PORT` environment variable and adjust the `-p` mapping to match.
For local full-stack development, use Docker Compose instead:
```bash
docker compose up -d
```
The `docker-compose.yml` starts the app, PostgreSQL 14, and Redis 7 with health checks and data volumes.
---
## 2. Environment Variables Reference
All variables are loaded at startup via `dotenv`. In production, inject them directly into the process environment — do not commit `.env` to version control.
| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `DATABASE_URL` | Yes | — | PostgreSQL connection string. Format: `postgresql://<user>:<password>@<host>:<port>/<db>` |
| `REDIS_URL` | Yes | — | Redis connection URL. Format: `redis://<host>:<port>` |
| `JWT_PRIVATE_KEY` | Yes | — | PEM-encoded RSA-2048 private key for signing RS256 JWT tokens |
| `JWT_PUBLIC_KEY` | Yes | — | PEM-encoded RSA-2048 public key for verifying tokens on every authenticated request |
| `PORT` | No | `3000` | HTTP port the Express server listens on |
| `NODE_ENV` | No | `undefined` | Set to `production` in production, `test` in test (disables Morgan logging in test) |
| `CORS_ORIGIN` | No | `*` | Allowed CORS origin(s). Set to specific URL in production (e.g. `https://app.mycompany.ai`) |
| `VAULT_ADDR` | No | — | HashiCorp Vault server address. When set with `VAULT_TOKEN`, new credentials are stored in Vault KV v2 instead of bcrypt |
| `VAULT_TOKEN` | No | — | Vault authentication token. Required when `VAULT_ADDR` is set |
| `VAULT_MOUNT` | No | `secret` | KV v2 secrets engine mount path |
| `POLICY_DIR` | No | `<cwd>/policies` | Directory containing OPA policy files (`authz.wasm` or `data/scopes.json`) |
**Validation at startup:** `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` are checked in `createApp()` (see `src/app.ts:117-121`). If either is missing, the process exits before binding to any port. `DATABASE_URL` and `REDIS_URL` are validated when their respective singletons are first initialised.
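The fail-fast check described above can be sketched as follows. This is a minimal illustration, not the code in `src/app.ts`: the helper name `findMissingEnv` is hypothetical, and the environment is passed in as a plain object so the check can be exercised without touching the real process environment.

```typescript
// Hypothetical helper (not taken from the codebase): returns the names of
// required variables that are unset or blank.
const REQUIRED_AT_STARTUP = ["JWT_PRIVATE_KEY", "JWT_PUBLIC_KEY"] as const;

type Env = Record<string, string | undefined>;

function findMissingEnv(
  env: Env,
  required: readonly string[] = REQUIRED_AT_STARTUP,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Only the private key is supplied, so the public key is reported as missing.
const missing = findMissingEnv({
  JWT_PRIVATE_KEY: "-----BEGIN RSA PRIVATE KEY-----...",
});
console.log(missing);
```

A caller would typically log the missing names and exit with a non-zero status before the server binds to a port.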
---
## 3. Database Migrations
Migrations are plain SQL files in `src/db/migrations/`. They are append-only — never modify an existing migration file. Always create a new numbered file.
Current migration files:
| File | What it creates |
|------|----------------|
| `001_create_agents.sql` | `agents` table with UUID primary key, email unique constraint, status enum |
| `002_create_credentials.sql` | `credentials` table linked to `agents` by `client_id` foreign key |
| `003_create_audit_events.sql` | `audit_events` table with JSONB `metadata` column |
| `004_create_tokens.sql` | `token_monthly_counts` table for free-tier token limit tracking |
| `005_add_vault_path.sql` | Adds `vault_path VARCHAR(512)` column to the `credentials` table |
**Run migrations:**
```bash
npm run db:migrate
```
This executes `scripts/migrate.ts`, which applies every SQL file that has not yet been recorded in the `schema_migrations` tracking table.
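The selection step can be pictured as a set difference followed by a sort. The sketch below is an assumption about the general approach, not a copy of `scripts/migrate.ts`: the function name `pendingMigrations` is hypothetical, and it assumes `schema_migrations` records migrations by filename.

```typescript
// Hypothetical sketch: decide which migration files still need to run.
// `applied` stands in for the filenames recorded in `schema_migrations`.
function pendingMigrations(files: string[], applied: Set<string>): string[] {
  return files
    .filter((file) => file.endsWith(".sql") && !applied.has(file))
    .sort(); // zero-padded numeric prefixes sort correctly as plain strings
}

const onDisk = [
  "003_create_audit_events.sql",
  "001_create_agents.sql",
  "005_add_vault_path.sql",
  "002_create_credentials.sql",
  "004_create_tokens.sql",
];
const alreadyApplied = new Set([
  "001_create_agents.sql",
  "002_create_credentials.sql",
  "003_create_audit_events.sql",
]);

// Leaves 004_create_tokens.sql followed by 005_add_vault_path.sql.
const pending = pendingMigrations(onDisk, alreadyApplied);
console.log(pending);
```

Sorting by filename is what makes the three-digit numeric prefix matter: the prefix defines the order in which files are applied.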
**Adding a new migration:**
1. Create `src/db/migrations/006_<description>.sql`
2. Write idempotent SQL (use `CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`, etc.)
3. Run `npm run db:migrate`
---
## 4. Terraform Multi-Region Deployment
The `terraform/` directory contains reusable modules and two environment configurations.
**Directory structure:**
```
terraform/
modules/
agentidp/ # Core AgentIdP compute resources
lb/ # Load balancer (ALB/Cloud Load Balancer)
rds/ # RDS PostgreSQL (AWS)
redis/ # ElastiCache Redis (AWS) / Memorystore (GCP)
environments/
aws/ # AWS deployment (ECS Fargate, ALB, RDS, ElastiCache)
gcp/ # GCP deployment (Cloud Run, Cloud SQL, Memorystore)
```
### AWS Deployment
Architecture: `Internet → Route 53 → ALB (public subnets, HTTPS) → ECS Fargate tasks (private subnets) → RDS PostgreSQL 14 (Multi-AZ) + ElastiCache Redis 7`
All secrets are stored in AWS Secrets Manager and injected into ECS task definitions at launch time.
```bash
cd terraform/environments/aws
terraform init
terraform plan -var="aws_region=us-east-1"
terraform apply
```
Resources provisioned:
- VPC with public and private subnets across multiple availability zones
- ECS Cluster and Fargate task definition (running `sentryagent-idp` container)
- Application Load Balancer with HTTPS listener and health check target group
- RDS PostgreSQL 14 (Multi-AZ for high availability)
- ElastiCache Redis 7 (primary + replica)
- IAM roles and instance profiles for ECS task permissions
- Security groups enforcing least-privilege network access
### GCP Deployment
Architecture: `Internet → Cloud Run (Google-managed TLS, auto-scaling) → Cloud SQL PostgreSQL 14 (REGIONAL HA) + Memorystore Redis 7 (STANDARD_HA)`
All secrets are stored in GCP Secret Manager and mounted into the Cloud Run service at startup.
```bash
cd terraform/environments/gcp
terraform init
terraform plan -var="gcp_region=us-central1"
terraform apply
```
Resources provisioned:
- VPC network with Serverless VPC Access connector (Cloud Run → private databases)
- Cloud Run service (auto-scales to zero, Google-managed TLS)
- Cloud Load Balancer with global anycast IP
- Cloud SQL PostgreSQL 14 with regional high-availability
- Memorystore Redis 7 (STANDARD_HA with in-transit encryption)
- IAM service accounts and bindings
**Important:** All infrastructure changes must go through Terraform. Never make manual edits in the AWS console or GCP Cloud Console — they will be overwritten on the next `terraform apply` and will not be tracked in state.
---
## 5. Prometheus and Grafana
**Metrics endpoint:** `GET /metrics`. The endpoint is unauthenticated, so in production restrict it to the internal network or scrape it from within the cluster.
The metrics endpoint is served by the `prom-client` library using a dedicated registry (`metricsRegistry`) defined in `src/metrics/registry.ts`. The registry is isolated from the default global registry to prevent conflicts in tests.
### Metric Definitions
All 6 metrics are defined in `src/metrics/registry.ts`:
| Metric name | Type | Labels | What it measures |
|-------------|------|--------|-----------------|
| `agentidp_tokens_issued_total` | Counter | `scope` | Total OAuth 2.0 access tokens issued successfully |
| `agentidp_agents_registered_total` | Counter | `deployment_env` | Total AI agents registered successfully |
| `agentidp_http_requests_total` | Counter | `method`, `route`, `status_code` | Total HTTP requests received |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `route`, `status_code` | HTTP request duration in seconds (buckets: 5 ms to 2.5 s) |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` | PostgreSQL query duration in seconds |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` | Redis command duration in seconds |
The HTTP metrics (`agentidp_http_requests_total` and `agentidp_http_request_duration_seconds`) are populated by `metricsMiddleware` in `src/middleware/metrics.ts`, which is registered before all routes in `src/app.ts`. Route labels are normalised to replace UUIDs with `:id` to prevent high cardinality (e.g. `/api/v1/agents/:id` rather than `/api/v1/agents/a1b2c3...`).
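Route normalisation of this kind can be sketched with a single regular expression. The function name `normaliseRoute` and the exact pattern are assumptions for illustration; the implementation in `src/middleware/metrics.ts` may differ.

```typescript
// Matches canonical 8-4-4-4-12 hexadecimal UUIDs, case-insensitively.
const UUID_PATTERN =
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

// Hypothetical helper: collapse every UUID path segment into ":id" so that
// each logical route produces exactly one label value.
function normaliseRoute(path: string): string {
  return path.replace(UUID_PATTERN, ":id");
}

const label = normaliseRoute(
  "/api/v1/agents/3f2b8c1e-9d4a-4f6b-8a1c-2e5d7f9b0c3a/credentials",
);
console.log(label); // /api/v1/agents/:id/credentials
```

Without this step every agent ID would create its own time series, and the number of series would grow with the number of agents.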
### Local Grafana
```bash
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
```
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin password: `agentidp`)
The monitoring compose overlay starts `prom/prometheus:v2.53.0` and `grafana/grafana:11.2.0`. Grafana dashboards and datasource provisioning are loaded from `monitoring/grafana/provisioning/`.
### Adding a New Metric
1. Define the metric in `src/metrics/registry.ts` using the shared `metricsRegistry` (not the default prom-client registry).
2. Export it from that file.
3. Import it in the file where the instrumentation point lives.
4. Call `.inc(labels)` for Counters or `.observe(labels, value)` for Histograms at the instrumentation point.
5. Verify it appears in `GET /metrics` after starting the server.
---
## 6. Operational Runbook
### Health Check
```bash
curl http://<host>/health
```
Expected response:
```json
{"status":"ok","postgres":"connected","redis":"connected"}
```
Troubleshooting:
- If `postgres: "error"` — verify `DATABASE_URL` is correct and PostgreSQL is reachable. Check `docker compose logs postgres` for local dev.
- If `redis: "error"` — verify `REDIS_URL` is correct and Redis is reachable. Check `docker compose logs redis` for local dev.
- If the health endpoint returns 502 or times out — the app process has crashed; check application logs.
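For scripted checks, the response body can be reduced to a list of failing dependencies. The sketch below relies only on the field names shown in the expected response; the function name `failingDependencies` is hypothetical, and any value other than `"connected"` is treated as a failure.

```typescript
type DependencyName = "postgres" | "redis";
type HealthBody = Partial<Record<DependencyName | "status", string>>;

// Hypothetical helper: parse the /health body and list every dependency
// whose state is anything other than "connected" (including absent fields).
function failingDependencies(rawBody: string): DependencyName[] {
  const body = JSON.parse(rawBody) as HealthBody;
  const dependencies: DependencyName[] = ["postgres", "redis"];
  return dependencies.filter((name) => body[name] !== "connected");
}

const healthy = failingDependencies(
  '{"status":"ok","postgres":"connected","redis":"connected"}',
);
const degraded = failingDependencies(
  '{"postgres":"error","redis":"connected"}',
);
console.log(healthy, degraded);
```

An empty list means both dependencies reported `connected`; otherwise the list names which troubleshooting step to start with.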
---
### Rotate the JWT Signing Key
All active tokens become invalid after a key rotation — agents must re-authenticate.
1. Generate a new RSA-2048 key pair:
```bash
openssl genrsa -out new-private.pem 2048
openssl rsa -in new-private.pem -pubout -out new-public.pem
```
2. Update `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` in your deployment environment (AWS Secrets Manager, GCP Secret Manager, or `.env`).
3. Perform a rolling restart:
- **ECS:** trigger a new task deployment — ECS drains existing tasks and starts new ones with the updated secret values.
- **Cloud Run:** deploy a new revision — Cloud Run gradually shifts traffic to the new revision.
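4. Tokens signed with the old key will fail verification once all instances have restarted. While the rolling restart is in progress, old and new instances hold different keys, so a token may verify on one instance and fail on another.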
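The effect of a rotation on existing tokens can be reproduced with Node's built-in `crypto` module. This is an illustrative sketch rather than the service's token code: the helper names are hypothetical, and the string `header.payload` stands in for a real JWT signing input.

```typescript
import { createSign, createVerify, generateKeyPairSync } from "node:crypto";

// Generate an RSA-2048 key pair as PEM strings.
function newKeyPair() {
  return generateKeyPairSync("rsa", {
    modulusLength: 2048,
    publicKeyEncoding: { type: "spki", format: "pem" },
    privateKeyEncoding: { type: "pkcs8", format: "pem" },
  });
}

// RS256 is RSASSA-PKCS1-v1_5 with SHA-256, which "RSA-SHA256" selects here.
function sign(data: string, privateKeyPem: string): Buffer {
  return createSign("RSA-SHA256").update(data).sign(privateKeyPem);
}

function verify(data: string, signature: Buffer, publicKeyPem: string): boolean {
  return createVerify("RSA-SHA256").update(data).verify(publicKeyPem, signature);
}

const oldKeys = newKeyPair();
const newKeys = newKeyPair();
const signingInput = "header.payload"; // placeholder for a JWT signing input
const oldSignature = sign(signingInput, oldKeys.privateKey);

// Before rotation the old public key accepts the signature.
const validBefore = verify(signingInput, oldSignature, oldKeys.publicKey);
// After rotation only the new public key is configured, so it is rejected.
const validAfter = verify(signingInput, oldSignature, newKeys.publicKey);
console.log({ validBefore, validAfter });
```

Because each instance holds a single key pair, there is no overlap window in which both old and new signatures are accepted, which is why agents must re-authenticate.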
---
### Revoke All Tokens for a Compromised Agent
Suspend the agent to stop new token issuance immediately:
```bash
curl -X PATCH http://<host>/api/v1/agents/<agentId> \
-H "Authorization: Bearer <admin_token>" \
-H "Content-Type: application/json" \
-d '{"status": "suspended"}'
```
This prevents any new `POST /api/v1/token` requests for that agent. Tokens that are already active remain valid until they expire (TTL: 1 hour). To invalidate active tokens immediately, also revoke all credentials for the agent:
```bash
# List credentials
curl http://<host>/api/v1/agents/<agentId>/credentials \
-H "Authorization: Bearer <admin_token>"
# Revoke each active credential
curl -X DELETE http://<host>/api/v1/agents/<agentId>/credentials/<credentialId> \
-H "Authorization: Bearer <admin_token>"
```
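When an agent has many credentials, the list of `DELETE` requests can be derived from the listing response. The sketch below is built on assumptions: the field names `credentialId` and `status`, and the status value `"active"`, are guesses at the response shape and should be checked against the actual API before use. The function name `revocationPaths` is hypothetical.

```typescript
// ASSUMED response item shape; verify against the real listing endpoint.
interface CredentialSummary {
  credentialId: string;
  status: string;
}

// Hypothetical helper: build the DELETE path for every active credential.
function revocationPaths(
  agentId: string,
  credentials: CredentialSummary[],
): string[] {
  return credentials
    .filter((credential) => credential.status === "active")
    .map(
      (credential) =>
        `/api/v1/agents/${encodeURIComponent(agentId)}` +
        `/credentials/${encodeURIComponent(credential.credentialId)}`,
    );
}

const paths = revocationPaths("agent-123", [
  { credentialId: "cred-1", status: "active" },
  { credentialId: "cred-2", status: "revoked" },
  { credentialId: "cred-3", status: "active" },
]);
console.log(paths);
```

Each resulting path would then be sent as a `DELETE` request with the admin bearer token, as in the `curl` example above.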
---
### Read Audit Logs for an Incident
Query the audit log with date range and agent filter:
```bash
curl "http://<host>/api/v1/audit?agentId=<agentId>&startDate=2026-01-01T00:00:00Z&endDate=2026-01-31T23:59:59Z" \
-H "Authorization: Bearer <admin_token>"
```
Events are returned newest-first. Audit log retention is 90 days on the free tier. Each event includes: `eventId`, `agentId`, `action`, `outcome`, `ipAddress`, `userAgent`, `metadata`, `timestamp`.
Common `action` values: `token.issued`, `token.revoked`, `token.introspected`, `agent.created`, `agent.updated`, `agent.suspended`, `agent.decommissioned`, `credential.generated`, `credential.rotated`, `credential.revoked`, `auth.failed`.
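For repeated incident queries, the audit URL can be assembled programmatically. This sketch uses only the query parameters shown in the `curl` example (`agentId`, `startDate`, `endDate`); the function name `buildAuditUrl` is hypothetical. Note that `Date.prototype.toISOString()` includes milliseconds, and whether the API accepts that form is an assumption.

```typescript
// Hypothetical helper: build an audit query URL with percent-encoded values.
function buildAuditUrl(
  baseUrl: string,
  agentId: string,
  start: Date,
  end: Date,
): string {
  const params: Array<[string, string]> = [
    ["agentId", agentId],
    ["startDate", start.toISOString()],
    ["endDate", end.toISOString()],
  ];
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/api/v1/audit?${query}`;
}

const auditUrl = buildAuditUrl(
  "http://localhost:3000",
  "agent-123",
  new Date("2026-01-01T00:00:00Z"),
  new Date("2026-01-31T23:59:59Z"),
);
console.log(auditUrl);
```

Encoding each value keeps the URL valid even if an agent ID contains characters that are reserved in query strings.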