sentryagent-idp/docs/devops/deployment.md
SentryAgent.ai Developer 8cabc0191c docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00

# Deployment Guide — SentryAgent.ai AgentIdP
End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in `terraform/`.
---
## Table of Contents
1. [Prerequisites](#1-prerequisites)
2. [AWS Deployment](#2-aws-deployment)
3. [GCP Deployment](#3-gcp-deployment)
4. [Post-Deploy Verification](#4-post-deploy-verification)
5. [Rollback Procedure](#5-rollback-procedure)
6. [Environment Variable Reference](#6-environment-variable-reference)
---
## 1. Prerequisites
### Tools
| Tool | Minimum Version | Install |
|------|-----------------|---------|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |
Verify all tools are available:
```bash
terraform version
aws --version
gcloud version
docker version
openssl version
```
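The commands above only print versions; they do not compare them with the minimums in the table. A minimal sketch of such a comparison follows, assuming a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS). The helper name `version_ge` is hypothetical and not part of this repository.

```bash
# version_ge INSTALLED MINIMUM
# Succeeds when INSTALLED >= MINIMUM, using version-aware sorting.
# Assumption: `sort -V` is available on this machine.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare a version string read from `terraform version`
# against the 1.6.0 minimum from the table.
if version_ge "1.7.5" "1.6.0"; then
  echo "terraform OK"
else
  echo "terraform too old"
fi
```

Version-aware sorting matters here because a plain string comparison would rank `2.9` above `2.13`.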
### Container Image
Build and push the `sentryagent/agentidp` image to your registry before deploying. Terraform references the image by tag — it does not build it.
```bash
# From the project root
docker build -t sentryagent/agentidp:1.0.0 .
# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag sentryagent/agentidp:1.0.0 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
```
Update `app_image_tag` in your `terraform.tfvars` to match.
### JWT Key Pair
Generate the RSA-2048 key pair used for signing and verifying JWTs:
```bash
openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem
# Verify
openssl rsa -in jwt_private.pem -check -noout
```
Keep `jwt_private.pem` secure — treat it with the same sensitivity as a TLS private key. You will paste its contents into `terraform.tfvars`.
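The `-check` step above validates only the private key. As a hedged sketch, the helper below (the name `keys_match` is hypothetical) confirms that `jwt_public.pem` really is the public half of `jwt_private.pem` by re-deriving the public key and comparing the two.

```bash
# keys_match PRIVATE_PEM PUBLIC_PEM
# Succeeds when the public key derived from PRIVATE_PEM is identical
# to the contents of PUBLIC_PEM.
keys_match() {
  [ "$(openssl rsa -in "$1" -pubout 2>/dev/null)" = "$(cat "$2")" ]
}

# Example usage (not executed here):
# chmod 600 jwt_private.pem   # restrict the private key to the current user
# keys_match jwt_private.pem jwt_public.pem && echo "key pair matches"
```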
---
## 2. AWS Deployment
### 2.1 Configure AWS CLI
```bash
aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)
# Verify credentials
aws sts get-caller-identity
```
The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.
### 2.2 Provision an ACM Certificate
The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.
```bash
aws acm request-certificate \
--domain-name idp.example.com \
--validation-method DNS \
--region us-east-1
```
Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become `ISSUED` before proceeding.
```bash
# Monitor validation status
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
--region us-east-1 \
--query 'Certificate.Status'
```
### 2.3 Prepare tfvars
```bash
cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`. All fields marked `REPLACE_WITH_*` are required. Key fields:
- `region` — AWS region (must match the ACM certificate region)
- `domain_name` — your domain (e.g. `idp.example.com`)
- `certificate_arn` — ARN from step 2.2
- `app_image_tag` — tag of the image you pushed in step 1
- `db_password` — strong random password (no `@`, `#`, `?`, `/` characters — they break URL parsing)
- `redis_auth_token` — minimum 16 characters, no spaces
- `jwt_private_key` — full PEM contents of `jwt_private.pem` with literal `\n` for newlines
- `jwt_public_key` — full PEM contents of `jwt_public.pem` with literal `\n` for newlines
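Two of the rules in this list are easy to check mechanically before running Terraform: no leftover `REPLACE_WITH_*` placeholders, and no URL-breaking characters in `db_password`. The helpers below are an illustrative sketch; their names are hypothetical, and generating the password with `openssl rand -hex` is one possible approach rather than a project requirement.

```bash
# has_placeholders FILE
# Succeeds when FILE still contains an unreplaced REPLACE_WITH_* value.
has_placeholders() {
  grep -q 'REPLACE_WITH_' "$1"
}

# password_is_url_safe PASSWORD
# Succeeds when PASSWORD contains none of the characters @ # ? /
password_is_url_safe() {
  case "$1" in
    *[@#?/]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example usage (not executed here):
# has_placeholders terraform.tfvars && echo "placeholders remain"
# Hex output contains only 0-9 and a-f, so it avoids the characters above.
# db_password=$(openssl rand -hex 24)
```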
Example for encoding PEM keys in tfvars:
```bash
# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
```
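To confirm that the single-line form still decodes to the original key, a round-trip check can help. This is a sketch under the assumption that the PEM file contains no blank lines and no backslashes (true for standard PEM output); the function names are hypothetical.

```bash
# pem_to_tfvars FILE: print FILE as one line with literal \n separators.
pem_to_tfvars() {
  awk 'NF {printf "%s\\n", $0}' "$1"
}

# tfvars_to_pem STRING: expand literal \n sequences back into newlines.
tfvars_to_pem() {
  printf '%b' "$1"
}

# Example usage (not executed here): the decoded value should be
# byte-identical to the file, so cmp prints nothing and exits 0.
# tfvars_to_pem "$(pem_to_tfvars jwt_private.pem)" | cmp - jwt_private.pem
```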
**Never commit `terraform.tfvars` to version control.**
### 2.4 Configure Remote State (Recommended)
Uncomment and configure the `backend "s3"` block in `terraform/environments/aws/main.tf`:
```hcl
backend "s3" {
bucket = "your-terraform-state-bucket"
key = "agentidp/aws/production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "your-terraform-locks-table"
}
```
Create the S3 bucket and DynamoDB table if they do not exist:
```bash
# S3 bucket with versioning and encryption
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
--bucket your-terraform-state-bucket \
--versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
--bucket your-terraform-state-bucket \
--server-side-encryption-configuration \
'{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# DynamoDB table for state locking
aws dynamodb create-table \
--table-name your-terraform-locks-table \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1
```
### 2.5 Terraform Init
```bash
cd terraform/environments/aws
terraform init
```
Expected output: provider plugins downloaded, backend initialized.
### 2.6 Terraform Plan
```bash
terraform plan -out=tfplan
```
Review the plan carefully before applying. Expected resources on first apply: ~50-60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).
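One way to compare a plan with the resource estimate above is to read the summary line that Terraform prints at the end of `terraform plan`. The sketch below assumes the usual `Plan: N to add, N to change, N to destroy.` wording and plain output (`-no-color`); `plan_counts` is a hypothetical helper name.

```bash
# plan_counts: read `terraform plan -no-color` output on stdin and print
# "ADD CHANGE DESTROY" taken from the "Plan: ..." summary line.
# Assumption: the summary line uses the wording shown in the lead-in.
plan_counts() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy.*/\1 \2 \3/p'
}

# Example usage (not executed here):
# terraform plan -no-color -out=tfplan | tee plan.log
# plan_counts < plan.log
```

On a first apply, a non-zero change or destroy count is worth investigating before you continue.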
### 2.7 Terraform Apply
```bash
terraform apply tfplan
```
**First apply takes 20-30 minutes.** RDS Multi-AZ provisioning is the longest step (~15 min). Do not interrupt the apply.
When complete, note the outputs:
```bash
terraform output
```
Key outputs:
- `service_url` — the HTTPS URL of your deployed service
- `alb_dns_name` — ALB DNS name (verify Route 53 alias is pointing here)
- `ecs_service_name` — use for ECS deployment commands
- `cloudwatch_log_group` — where container logs appear
### 2.8 Run Database Migrations
After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:
```bash
# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
--cluster sentryagent-agentidp-production \
--service-name sentryagent-agentidp-production \
--query 'taskArns[0]' \
--output text)
# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
--cluster sentryagent-agentidp-production \
--task $TASK_ARN \
--container agentidp \
--command "node scripts/db-migrate.js" \
--interactive
```
Alternatively, run a one-off ECS task with the migration command as the container override.
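A rough sketch of the one-off task approach: the helper below only builds the `--overrides` JSON, reusing the container name and migration command from the `execute-command` example. The function name is hypothetical, and the subnet and security group IDs in the usage comment are placeholders you would take from your own Terraform outputs.

```bash
# migration_overrides: print the --overrides JSON for a one-off migration task.
# Container name and command come from the execute-command example above.
migration_overrides() {
  printf '%s\n' '{"containerOverrides":[{"name":"agentidp","command":["node","scripts/db-migrate.js"]}]}'
}

# Example usage (not executed here; subnet-XXXX and sg-XXXX are placeholders):
# aws ecs run-task \
#   --cluster sentryagent-agentidp-production \
#   --task-definition sentryagent-agentidp-production \
#   --launch-type FARGATE \
#   --network-configuration 'awsvpcConfiguration={subnets=[subnet-XXXX],securityGroups=[sg-XXXX]}' \
#   --overrides "$(migration_overrides)"
```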
---
## 3. GCP Deployment
### 3.1 Configure gcloud CLI
```bash
gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login
```
Verify:
```bash
gcloud config list
gcloud projects describe your-gcp-project-id
```
The principal running Terraform requires the following roles on the project:
- `roles/owner` or a custom role covering: Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, Service Networking Admin.
### 3.2 Prepare tfvars
```bash
cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`. Key fields:
- `project_id` — your GCP project ID
- `region` — GCP region (e.g. `us-central1`)
- `app_image_tag` — tag of the image you built
- `db_password` — strong random password for Cloud SQL
- `jwt_private_key` / `jwt_public_key` — same PEM keys used for AWS (same key pair for both regions)
**Never commit `terraform.tfvars` to version control.**
### 3.3 Configure Remote State (Recommended)
Uncomment and configure the `backend "gcs"` block in `terraform/environments/gcp/main.tf`:
```hcl
backend "gcs" {
bucket = "your-terraform-state-bucket"
prefix = "agentidp/gcp/production"
}
```
Create the GCS bucket:
```bash
gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
```
### 3.4 Terraform Init
```bash
cd terraform/environments/gcp
terraform init
```
### 3.5 Terraform Plan
```bash
terraform plan -out=tfplan
```
Review the plan. Expected resources: ~35-45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).
### 3.6 Terraform Apply
```bash
terraform apply tfplan
```
**First apply takes 15-20 minutes.** Cloud SQL provisioning is the longest step.
When complete:
```bash
terraform output
```
Key outputs:
- `service_url` — Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
- `cloud_sql_connection_name` — for Cloud SQL Proxy if needed
- `memorystore_host` — Redis private IP
### 3.7 Run Database Migrations
Cloud Run does not support exec. Use a one-off Cloud Run Job for migrations:
```bash
gcloud run jobs create agentidp-migrate \
--image sentryagent/agentidp:1.0.0 \
--region us-central1 \
--command node \
--args "scripts/db-migrate.js" \
--set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
--vpc-connector sentryagent-agentidp-production-connector \
--service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com
gcloud run jobs execute agentidp-migrate --region us-central1 --wait
```
---
## 4. Post-Deploy Verification
Run these checks after deploying to either environment. Replace `https://idp.example.com` with your actual service URL.
### 4.1 Health Check
```bash
curl -si https://idp.example.com/health
```
Expected response:
```
HTTP/2 200
content-type: application/json
{"status":"ok"}
```
If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60-90 seconds and retry. ECS tasks or Cloud Run instances take time to pass health checks.
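The wait-and-retry advice can be automated with a small polling loop. This is a minimal sketch, not project tooling: `wait_for_healthy` is a hypothetical name, and the defaults (9 attempts, 10 seconds apart, about 90 seconds in total) are assumptions you can adjust.

```bash
# wait_for_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns HTTP 200 or the attempts run out.
wait_for_healthy() {
  url=$1
  attempts=${2:-9}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -o /dev/null: drop the body, -w: print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${code:-000}, retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage (not executed here):
# wait_for_healthy https://idp.example.com/health
```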
### 4.2 Metrics Endpoint
```bash
curl -si https://idp.example.com/metrics
```
Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with `# HELP`, `# TYPE`, and metric values).
### 4.3 Token Endpoint (Smoke Test)
This test needs a registered agent client. Registration requires a valid JWT or admin credentials; see the [developers guide](../developers/). Then request a token with that client's credentials:
```bash
# Issue a client credentials token (replace CLIENT_ID and CLIENT_SECRET with real values)
curl -s -X POST https://idp.example.com/api/v1/token \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&scope=read"
```
Expected response (abbreviated):
```json
{
"access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"token_type": "Bearer",
"expires_in": 3600,
"scope": "read"
}
```
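To confirm that the token is signed with the expected algorithm, you can decode its header segment. The helper below is an illustrative sketch (`jwt_header` is a hypothetical name) that assumes a `base64` command accepting `-d`; it only decodes and does not verify the signature.

```bash
# jwt_header TOKEN
# Prints the decoded JOSE header (first dot-separated segment) of a JWT.
# Decoding does not verify the signature.
jwt_header() {
  seg=${1%%.*}
  # Convert base64url to standard base64 and restore the padding.
  seg=$(printf '%s' "$seg" | tr '_-' '/+')
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
  echo
}

# Example usage (not executed here; ACCESS_TOKEN holds the access_token value):
# jwt_header "$ACCESS_TOKEN"
```

For the abbreviated token shown above, the header segment decodes to `{"alg":"RS256","typ":"JWT"}`, which is consistent with the RSA key pair generated in section 1.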
### 4.4 JWKS Endpoint
```bash
curl -si https://idp.example.com/.well-known/jwks.json
```
Expected: HTTP 200 with a JSON object containing a `keys` array with at least one RSA public key entry.
### 4.5 TLS Verification
```bash
# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"
```
Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.
### 4.6 AWS-Specific: ECS Service Status
```bash
aws ecs describe-services \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production \
--query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
```
Expected: `running` equals `desired`, `status` is `ACTIVE`.
### 4.7 GCP-Specific: Cloud Run Service Status
```bash
gcloud run services describe sentryagent-agentidp-production \
--region us-central1 \
--format='value(status.conditions[0].type,status.conditions[0].status)'
```
Expected: `Ready True`.
---
## 5. Rollback Procedure
### 5.1 Image Rollback (Recommended — fastest)
To roll back to a previous image tag without modifying infrastructure:
**AWS:**
```bash
# Find the previous task definition revision
aws ecs list-task-definitions \
--family-prefix sentryagent-agentidp-production \
--sort DESC \
--query 'taskDefinitionArns[:5]'
# Update the service to use the previous task definition
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
--force-new-deployment
# Monitor the rollout
aws ecs wait services-stable \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production
```
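If the service's current task definition is known, a candidate value for `PREVIOUS_REVISION` can be derived from it. This sketch assumes the revision immediately before the current one is the one you want, which is not guaranteed if revisions were registered without being deployed; check it against the `list-task-definitions` output. `previous_revision` is a hypothetical helper name.

```bash
# previous_revision TASK_DEFINITION
# Given "family:N" or a full task definition ARN ending in ":N",
# prints "family:N-1". Fails if N is 1 or not a number.
previous_revision() {
  rev=${1##*:}
  name=${1%:*}
  name=${name##*/}
  case "$rev" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$rev" -gt 1 ] || return 1
  echo "$name:$((rev - 1))"
}

# Example usage (not executed here):
# current=$(aws ecs describe-services \
#   --cluster sentryagent-agentidp-production \
#   --services sentryagent-agentidp-production \
#   --query 'services[0].taskDefinition' --output text)
# previous_revision "$current"
```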
**GCP:**
```bash
# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:PREVIOUS_TAG
# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
--region us-central1 \
--to-revisions PREVIOUS_REVISION_NAME=100
```
### 5.2 Infrastructure Rollback via Terraform
If an infrastructure change (not an image update) caused the problem:
```bash
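# Check the state and plan to understand what changed
terraform show
terraform plan
# If bucket versioning is enabled, retrieve an earlier state version.
# AWS: list versions, then download one (PREVIOUS_VERSION_ID is a placeholder)
aws s3api list-object-versions \
  --bucket your-terraform-state-bucket \
  --prefix agentidp/aws/production/terraform.tfstate
aws s3api get-object \
  --bucket your-terraform-state-bucket \
  --key agentidp/aws/production/terraform.tfstate \
  --version-id PREVIOUS_VERSION_ID \
  previous.tfstate
# GCP: list generations, then download one (GENERATION is a placeholder;
# default.tfstate is the object name for the default workspace)
gsutil ls -a gs://your-terraform-state-bucket/agentidp/gcp/production/
gsutil cp "gs://your-terraform-state-bucket/agentidp/gcp/production/default.tfstate#GENERATION" previous.tfstate
# With a remote backend, Terraform ignores a local state copy. After reviewing
# previous.tfstate, upload it with `terraform state push`, then re-apply:
terraform apply -target=<affected_resource>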
```
**Never run `terraform destroy` in production without CEO approval.**
### 5.3 Database Rollback
RDS (AWS) and Cloud SQL (GCP) both support point-in-time restore. Use this only as a last resort — it creates a new DB instance and requires updating the `DATABASE_URL` secret.
**AWS:**
```bash
# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sentryagent-agentidp-production \
--target-db-instance-identifier sentryagent-agentidp-production-restored \
--restore-time 2026-01-01T12:00:00Z
```
**GCP:**
```bash
# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14
# Restore from a backup
gcloud sql backups restore BACKUP_ID \
--restore-instance sentryagent-agentidp-production-pg14
```
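After a restore, the `DATABASE_URL` secret has to point at the new instance. The sketch below swaps only the host portion of an existing URL. It assumes a `scheme://user:password@host:port/dbname` layout, which this guide does not spell out, and it relies on the password containing no `@` (see the `db_password` guidance in 2.3). `swap_db_host` is a hypothetical helper name.

```bash
# swap_db_host DATABASE_URL NEW_HOST
# Prints DATABASE_URL with the hostname replaced by NEW_HOST.
# Assumption: the first "@" in the URL separates credentials from the host.
swap_db_host() {
  printf '%s\n' "$1" | sed "s|@[^:/]*|@$2|"
}

# Example usage (not executed here; NEW_ENDPOINT is the restored instance's host):
# new_url=$(swap_db_host "$OLD_DATABASE_URL" "$NEW_ENDPOINT")
# aws secretsmanager put-secret-value \
#   --secret-id /sentryagent-agentidp/production/database-url \
#   --secret-string "$new_url"
```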
---
## 6. Environment Variable Reference
All environment variables injected into the AgentIdP container are documented in full at:
**[docs/devops/environment-variables.md](./environment-variables.md)**
### Quick Reference
| Variable | Required | Source (AWS) | Source (GCP) |
|----------|----------|--------------|--------------|
| `DATABASE_URL` | Yes | Secrets Manager: `/<project>/<env>/database-url` | Secret Manager: `<name-prefix>-database-url` |
| `REDIS_URL` | Yes | Secrets Manager: `/<project>/<env>/redis-url` | Secret Manager: `<name-prefix>-redis-url` |
| `JWT_PRIVATE_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-private-key` | Secret Manager: `<name-prefix>-jwt-private-key` |
| `JWT_PUBLIC_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-public-key` | Secret Manager: `<name-prefix>-jwt-public-key` |
| `PORT` | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| `NODE_ENV` | No | Task definition env var (`production`) | Cloud Run env var (`production`) |
| `CORS_ORIGIN` | No | Task definition env var | Cloud Run env var |
| `POLICY_DIR` | No | Task definition env var (`/app/policies`) | Cloud Run env var (`/app/policies`) |
| `VAULT_ADDR` | No | Task definition env var | Cloud Run env var |
| `VAULT_TOKEN` | No | Secrets Manager: `/<project>/<env>/vault-token` | Secret Manager: `<name-prefix>-vault-token` |
| `VAULT_MOUNT` | No | Task definition env var (default: `secret`) | Cloud Run env var (default: `secret`) |
| `BILLING_ENABLED` | No | Task definition env var | Cloud Run env var |
| `STRIPE_SECRET_KEY` | Only if billing enabled | Secrets Manager: `/<project>/<env>/stripe-secret-key` | Secret Manager: `<name-prefix>-stripe-secret-key` |
| `STRIPE_WEBHOOK_SECRET` | Only if billing enabled | Secrets Manager: `/<project>/<env>/stripe-webhook-secret` | Secret Manager: `<name-prefix>-stripe-webhook-secret` |
| `STRIPE_PRICE_ID` | Only if billing enabled | Task definition env var | Cloud Run env var |
| `ANALYTICS_ENABLED` | No | Task definition env var (default: `true`) | Cloud Run env var |
| `TIER_ENFORCEMENT` | No | Task definition env var (default: `true`) | Cloud Run env var |
| `COMPLIANCE_ENABLED` | No | Task definition env var (default: `true`) | Cloud Run env var |
| `REDIS_RATE_LIMIT_ENABLED` | No | Task definition env var | Cloud Run env var |
| `RATE_LIMIT_WINDOW_MS` | No | Task definition env var (default: `60000`) | Cloud Run env var |
| `RATE_LIMIT_MAX_REQUESTS` | No | Task definition env var (default: `100`) | Cloud Run env var |
| `DB_POOL_MAX` | No | Task definition env var (default: `20`) | Cloud Run env var |
| `DB_POOL_MIN` | No | Task definition env var (default: `2`) | Cloud Run env var |
| `DB_POOL_IDLE_TIMEOUT_MS` | No | Task definition env var (default: `30000`) | Cloud Run env var |
| `DB_POOL_CONNECTION_TIMEOUT_MS` | No | Task definition env var (default: `5000`) | Cloud Run env var |
| `KAFKA_BROKERS` | No | Task definition env var | Cloud Run env var |
| `ENFORCE_TLS` | No | Task definition env var | Cloud Run env var |
| `OPA_URL` | No | Task definition env var | Cloud Run env var |
| `VAULT_KV_MOUNT` | No | Task definition env var (default: `secret`) | Cloud Run env var |
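A small preflight check can catch a missing required variable before the container starts. The sketch below encodes only the "Required" column of the table; the function name is hypothetical, and treating `BILLING_ENABLED=true` as the value that enables billing is an assumption.

```bash
# missing_required_vars: print the name of each required variable that is
# unset or empty in the current environment; succeed only if none are missing.
# Assumption: BILLING_ENABLED=true is the value that enables billing.
missing_required_vars() {
  required="DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY"
  if [ "${BILLING_ENABLED:-}" = "true" ]; then
    required="$required STRIPE_SECRET_KEY STRIPE_WEBHOOK_SECRET STRIPE_PRICE_ID"
  fi
  missing=0
  for name in $required; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example usage (not executed here):
# missing_required_vars || echo "set the variables listed above"
```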
### Updating a Secret
**AWS:**
```bash
# Update a secret value (e.g. rotate JWT keys)
aws secretsmanager put-secret-value \
--secret-id /sentryagent-agentidp/production/jwt-private-key \
--secret-string "$(cat new_jwt_private.pem)"
# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--force-new-deployment
```
**GCP:**
```bash
# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
--data-file=new_jwt_private.pem
# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:CURRENT_TAG
```
---
## Architecture Summary
### AWS
```
Route 53 (A alias)
└── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
└── Target Group
└── ECS Fargate Service (private subnets, 2+ tasks)
├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
└── ElastiCache Redis 7 (private subnets, primary+replica, TLS)
```
### GCP
```
Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
├── Serverless VPC Connector
│ ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
│ └── Memorystore Redis 7 (STANDARD_HA, TLS)
```
Both environments share the same Docker image (`sentryagent/agentidp`) and the same JWT key pair, so tokens issued in one environment can be verified in the other.