feat(phase-2): workstream 8 — Multi-Region Terraform Deployment

AWS environment:
- VPC (3-AZ, public + private subnets, NAT gateways, VPC endpoints for ECR/SM/CW)
- ECS Fargate service (sentryagent/agentidp) — secrets from Secrets Manager
- RDS PostgreSQL 14 (Multi-AZ, encrypted, VPC-internal, storage autoscaling)
- ElastiCache Redis 7 (primary + replica, at-rest + in-transit encryption)
- ALB with HTTPS/443, HTTP→HTTPS redirect, ACM certificate
- Route 53 alias record

GCP environment:
- VPC + private services access + Serverless VPC connector
- Cloud Run service — secrets from Secret Manager
- Cloud SQL PostgreSQL 14 (private IP, no public endpoint)
- Cloud Memorystore Redis 7 (VPC-internal, AUTH enabled)

Shared:
- 4 reusable modules: agentidp (dual AWS/GCP), rds, redis, lb
- No hardcoded secrets; all sensitive vars marked sensitive=true
- terraform.tfvars.example for both environments
- docs/devops/deployment.md — AWS + GCP step-by-step walkthrough, rollback procedures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-03-29 06:25:14 +00:00
parent a504964e5f
commit 6913d62648
22 changed files with 4138 additions and 8 deletions

docs/devops/deployment.md Normal file

@@ -0,0 +1,603 @@
# Deployment Guide — SentryAgent.ai AgentIdP
End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in `terraform/`.
---
## Table of Contents
1. [Prerequisites](#1-prerequisites)
2. [AWS Deployment](#2-aws-deployment)
3. [GCP Deployment](#3-gcp-deployment)
4. [Post-Deploy Verification](#4-post-deploy-verification)
5. [Rollback Procedure](#5-rollback-procedure)
6. [Environment Variable Reference](#6-environment-variable-reference)
---
## 1. Prerequisites
### Tools
| Tool | Minimum Version | Install |
|------|-----------------|---------|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |
Verify all tools are available:
```bash
terraform version
aws --version
gcloud version
docker version
openssl version
```
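The commands above only print versions; they do not compare them against the minimums in the table. A minimal sketch of such a comparison follows, assuming GNU `sort` with `-V` is available. The helper name `version_ge` is hypothetical and not part of the repository.

```bash
# version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL is greater than or equal to MINIMUM under version ordering.
# Assumes GNU sort (the -V flag); the function name is illustrative only.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (not executed here): the first line of `terraform version`
# looks like "Terraform v1.6.0", so the prefix can be stripped with sed:
#   version_ge "$(terraform version | head -n 1 | sed 's/^Terraform v//')" 1.6.0 \
#     || echo "Terraform is older than 1.6.0" >&2
```

Version ordering matters here because a plain string comparison would rank `1.10.0` below `1.6.0`.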
### Container Image
Build and push the `sentryagent/agentidp` image to your registry before deploying. Terraform references the image by tag — it does not build it.
```bash
# From the project root
docker build -t sentryagent/agentidp:1.0.0 .
# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag sentryagent/agentidp:1.0.0 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
```
Update `app_image_tag` in your `terraform.tfvars` to match.
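Because the same registry path appears in the login, tag, and push commands, it can help to build it once. The sketch below assembles the ECR image URI from its parts, following the pattern used in the example above. The function name `ecr_image_uri` is hypothetical.

```bash
# ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG
# Prints the full ECR image reference in the form used by the commands above.
# The function name is illustrative only.
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$account_id" "$region" "$repo" "$tag"
}

# Example (not executed here):
#   IMAGE=$(ecr_image_uri 123456789012 us-east-1 sentryagent/agentidp 1.0.0)
#   docker tag sentryagent/agentidp:1.0.0 "$IMAGE" && docker push "$IMAGE"
```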
### JWT Key Pair
Generate the RSA-2048 key pair used for signing and verifying JWTs:
```bash
openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem
# Verify
openssl rsa -in jwt_private.pem -check -noout
```
Keep `jwt_private.pem` secure — treat it with the same sensitivity as a TLS private key. You will paste its contents into `terraform.tfvars`.
---
## 2. AWS Deployment
### 2.1 Configure AWS CLI
```bash
aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)
# Verify credentials
aws sts get-caller-identity
```
The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.
### 2.2 Provision an ACM Certificate
The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.
```bash
aws acm request-certificate \
--domain-name idp.example.com \
--validation-method DNS \
--region us-east-1
```
Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become `ISSUED` before proceeding.
```bash
# Monitor validation status
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
--region us-east-1 \
--query 'Certificate.Status'
```
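Rather than re-running the status query by hand, the wait can be scripted. The following is a generic polling sketch, not something shipped with the repository: `poll_until` is a hypothetical helper that runs any command repeatedly until its output equals an expected value.

```bash
# poll_until EXPECTED MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until its stdout equals EXPECTED, trying at most MAX_TRIES times.
# Returns 0 on a match and 1 if the tries are exhausted.
# The helper name and argument order are illustrative only.
poll_until() {
  local expected="$1" max_tries="$2" delay="$3"
  shift 3
  local i=1 value
  while [ "$i" -le "$max_tries" ]; do
    value=$("$@" 2>/dev/null) || value=""
    if [ "$value" = "$expected" ]; then
      return 0
    fi
    if [ "$i" -lt "$max_tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here). --output text makes the CLI print ISSUED
# without the surrounding JSON quotes:
#   poll_until ISSUED 40 30 aws acm describe-certificate \
#     --certificate-arn "$CERT_ARN" --region us-east-1 \
#     --query 'Certificate.Status' --output text
```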
### 2.3 Prepare tfvars
```bash
cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`. All fields marked `REPLACE_WITH_*` are required. Key fields:
- `region` — AWS region (must match the ACM certificate region)
- `domain_name` — your domain (e.g. `idp.example.com`)
- `certificate_arn` — ARN from step 2.2
- `app_image_tag` — tag of the image you pushed in step 1
- `db_password` — strong random password (no `@`, `#`, `?`, `/` characters — they break URL parsing)
- `redis_auth_token` — minimum 16 characters, no spaces
- `jwt_private_key` — full PEM contents of `jwt_private.pem` with literal `\n` for newlines
- `jwt_public_key` — full PEM contents of `jwt_public.pem` with literal `\n` for newlines
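The character and length constraints on `db_password` and `redis_auth_token` listed above are easy to violate when generating values by hand. A minimal pre-flight check could look like the sketch below. Both function names are hypothetical, and the rules encode only what this list states.

```bash
# valid_db_password VALUE
# Rejects empty values and any value containing @, #, ? or /,
# the characters this guide says break URL parsing.
valid_db_password() {
  case "$1" in
    "") return 1 ;;
    *@*|*'#'*|*'?'*|*/*) return 1 ;;
    *) return 0 ;;
  esac
}

# valid_redis_token VALUE
# Requires at least 16 characters and no spaces.
valid_redis_token() {
  local t="$1"
  [ "${#t}" -ge 16 ] || return 1
  case "$t" in
    *" "*) return 1 ;;
  esac
  return 0
}
```

Running both checks before `terraform plan` catches a bad value earlier than a failed connection at container startup would.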
Example for encoding PEM keys in tfvars:
```bash
# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
```
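To confirm that the single-line form still represents the same key, it can be expanded back and compared with the original file. The sketch below wraps the `awk` command above and adds the reverse step. The function names are hypothetical, and the decode step assumes the key contains no backslashes, which holds for PEM files.

```bash
# pem_to_tfvars_line FILE
# Prints FILE as one line, with each newline replaced by a literal \n.
pem_to_tfvars_line() {
  awk 'NF {printf "%s\\n", $0}' "$1"
}

# tfvars_line_to_pem LINE
# Expands literal \n sequences back into real newlines.
tfvars_line_to_pem() {
  printf '%b' "$1"
}

# Example (not executed here):
#   line=$(pem_to_tfvars_line jwt_private.pem)
#   [ "$(tfvars_line_to_pem "$line")" = "$(cat jwt_private.pem)" ] && echo "round trip OK"
```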
**Never commit `terraform.tfvars` to version control.**
### 2.4 Configure Remote State (Recommended)
Uncomment and configure the `backend "s3"` block in `terraform/environments/aws/main.tf`:
```hcl
backend "s3" {
  bucket         = "your-terraform-state-bucket"
  key            = "agentidp/aws/production/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "your-terraform-locks-table"
}
```
Create the S3 bucket and DynamoDB table if they do not exist:
```bash
# S3 bucket with versioning and encryption
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
--bucket your-terraform-state-bucket \
--versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
--bucket your-terraform-state-bucket \
--server-side-encryption-configuration \
'{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# DynamoDB table for state locking
aws dynamodb create-table \
--table-name your-terraform-locks-table \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1
```
### 2.5 Terraform Init
```bash
cd terraform/environments/aws
terraform init
```
Expected output: provider plugins downloaded, backend initialized.
### 2.6 Terraform Plan
```bash
terraform plan -out=tfplan
```
Review the plan carefully before applying. Expected resources on first apply: ~50–60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).
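Part of reviewing a first-apply plan is confirming that nothing is scheduled for destruction. A minimal sketch of an automated guard follows. It parses the `Plan: N to add, N to change, N to destroy.` summary line that Terraform prints, and assumes the plan was run with `-no-color` so the line has no colour codes. The function names are hypothetical.

```bash
# plan_destroy_count
# Reads plan output on stdin and prints the "to destroy" count from the
# summary line. Prints nothing when there is no summary line.
plan_destroy_count() {
  sed -n '/Plan:/s/.*[^0-9]\([0-9][0-9]*\) to destroy\..*/\1/p' | head -n 1
}

# assert_no_destroys
# Succeeds when the plan on stdin schedules no resources for destruction.
assert_no_destroys() {
  local n
  n=$(plan_destroy_count)
  [ "${n:-0}" -eq 0 ]
}

# Example (not executed here):
#   terraform plan -no-color -out=tfplan | tee plan.txt
#   assert_no_destroys < plan.txt || echo "plan destroys resources; review it" >&2
```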
### 2.7 Terraform Apply
```bash
terraform apply tfplan
```
**First apply takes 20–30 minutes.** RDS Multi-AZ provisioning is the longest step (~15 min). Do not interrupt the apply.
When complete, note the outputs:
```bash
terraform output
```
Key outputs:
- `service_url` — the HTTPS URL of your deployed service
- `alb_dns_name` — ALB DNS name (verify Route 53 alias is pointing here)
- `ecs_service_name` — use for ECS deployment commands
- `cloudwatch_log_group` — where container logs appear
### 2.8 Run Database Migrations
After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:
```bash
# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
--cluster sentryagent-agentidp-production \
--service-name sentryagent-agentidp-production \
--query 'taskArns[0]' \
--output text)
# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
--cluster sentryagent-agentidp-production \
--task $TASK_ARN \
--container agentidp \
--command "node scripts/db-migrate.js" \
--interactive
```
Alternatively, run a one-off ECS task with the migration command as the container override.
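When the service has no running tasks, `aws ecs list-tasks` with `--output text` prints the string `None` rather than an empty value, and the `execute-command` call then fails with a confusing error. A small guard can catch that before the exec step. The helper name `require_task_arn` is hypothetical.

```bash
# require_task_arn VALUE
# Prints VALUE when it looks like a usable task ARN; fails when it is
# empty or the literal "None" that the AWS CLI prints for a null result.
require_task_arn() {
  local arn="$1"
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no running task found; check the ECS service status first" >&2
    return 1
  fi
  printf '%s\n' "$arn"
}

# Example (not executed here):
#   TASK_ARN=$(require_task_arn "$TASK_ARN") || exit 1
```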
---
## 3. GCP Deployment
### 3.1 Configure gcloud CLI
```bash
gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login
```
Verify:
```bash
gcloud config list
gcloud projects describe your-gcp-project-id
```
The principal running Terraform requires the following roles on the project:
- `roles/owner` or a custom role covering: Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, Service Networking Admin.
### 3.2 Prepare tfvars
```bash
cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`. Key fields:
- `project_id` — your GCP project ID
- `region` — GCP region (e.g. `us-central1`)
- `app_image_tag` — tag of the image you built
- `db_password` — strong random password for Cloud SQL
- `jwt_private_key` / `jwt_public_key` — same PEM keys used for AWS (same key pair for both regions)
**Never commit `terraform.tfvars` to version control.**
### 3.3 Configure Remote State (Recommended)
Uncomment and configure the `backend "gcs"` block in `terraform/environments/gcp/main.tf`:
```hcl
backend "gcs" {
  bucket = "your-terraform-state-bucket"
  prefix = "agentidp/gcp/production"
}
```
Create the GCS bucket:
```bash
gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
```
### 3.4 Terraform Init
```bash
cd terraform/environments/gcp
terraform init
```
### 3.5 Terraform Plan
```bash
terraform plan -out=tfplan
```
Review the plan. Expected resources: ~35–45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).
### 3.6 Terraform Apply
```bash
terraform apply tfplan
```
**First apply takes 15–20 minutes.** Cloud SQL provisioning is the longest step.
When complete:
```bash
terraform output
```
Key outputs:
- `service_url` — Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
- `cloud_sql_connection_name` — for Cloud SQL Proxy if needed
- `memorystore_host` — Redis private IP
### 3.7 Run Database Migrations
Cloud Run does not support exec. Use a one-off Cloud Run Job for migrations:
```bash
gcloud run jobs create agentidp-migrate \
--image sentryagent/agentidp:1.0.0 \
--region us-central1 \
--command node \
--args "scripts/db-migrate.js" \
--set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
--vpc-connector sentryagent-agentidp-production-connector \
--service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com
gcloud run jobs execute agentidp-migrate --region us-central1 --wait
```
---
## 4. Post-Deploy Verification
Run these checks after deploying to either environment. Replace `https://idp.example.com` with your actual service URL.
### 4.1 Health Check
```bash
curl -si https://idp.example.com/health
```
Expected response:
```
HTTP/2 200
content-type: application/json
{"status":"ok"}
```
If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60–90 seconds and retry; ECS tasks and Cloud Run instances take time to pass health checks.
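The retry decision described above can be scripted by extracting the status code from the `curl -si` output. The sketch below shows one way to do that. Both function names are hypothetical, and only 502 and 503 are treated as retryable because those are the codes this guide names.

```bash
# http_status
# Reads `curl -si` output on stdin and prints the numeric status code
# from the first line (works for "HTTP/2 200" and "HTTP/1.1 200 OK").
http_status() {
  awk 'NR == 1 { sub(/\r$/, ""); print $2; exit }'
}

# should_retry CODE
# Succeeds for the status codes that mean targets are not yet healthy.
should_retry() {
  case "$1" in
    502|503) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not executed here):
#   code=$(curl -si https://idp.example.com/health | http_status)
#   if should_retry "$code"; then sleep 60; fi
```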
### 4.2 Metrics Endpoint
```bash
curl -si https://idp.example.com/metrics
```
Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with `# HELP`, `# TYPE`, and metric values).
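A quick way to check the response body without reading it by eye is to look for the two comment prefixes mentioned above. This is only a shape check, not a full parser for the Prometheus exposition format, and the function name is hypothetical.

```bash
# looks_like_prometheus
# Reads a response body on stdin and succeeds when it contains at least
# one "# HELP" line and one "# TYPE" line.
looks_like_prometheus() {
  local body
  body=$(cat)
  printf '%s\n' "$body" | grep -q '^# HELP ' &&
    printf '%s\n' "$body" | grep -q '^# TYPE '
}

# Example (not executed here). -s without -i returns the body only:
#   curl -s https://idp.example.com/metrics | looks_like_prometheus && echo "metrics OK"
```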
### 4.3 Token Endpoint (Smoke Test)
First, register a test agent client (this requires a valid JWT or admin credentials; see the [developers guide](../developers/)). Then request a token with that client's credentials:
```bash
# Issue a client credentials token (replace CLIENT_ID and CLIENT_SECRET with real values)
curl -s -X POST https://idp.example.com/api/v1/token \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&scope=read"
```
Expected response (abbreviated):
```json
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "read"
}
```
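The first dot-separated segment of `access_token` is a base64url-encoded JSON header, so the signing algorithm can be inspected locally. The sketch below decodes it, assuming a `base64` that accepts `-d` (GNU coreutils does). The function names are hypothetical. It only decodes the header and does not verify the signature.

```bash
# b64url_decode SEGMENT
# Decodes one base64url segment: maps the URL-safe alphabet back to the
# standard one and restores the padding that JWTs omit.
b64url_decode() {
  local s
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="$s==" ;;
    3) s="$s=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

# jwt_header TOKEN
# Prints the decoded JSON header of a JWT (the part before the first dot).
jwt_header() {
  b64url_decode "${1%%.*}"
}
```

For the abbreviated token shown above, the header segment decodes to `{"alg":"RS256","typ":"JWT"}`, which matches the RSA key pair generated in section 1.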
### 4.4 JWKS Endpoint
```bash
curl -si https://idp.example.com/.well-known/jwks.json
```
Expected: HTTP 200 with a JSON object containing a `keys` array with at least one RSA public key entry.
### 4.5 TLS Verification
```bash
# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"
```
Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.
### 4.6 AWS-Specific: ECS Service Status
```bash
aws ecs describe-services \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production \
--query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
```
Expected: `running` equals `desired`, `status` is `ACTIVE`.
### 4.7 GCP-Specific: Cloud Run Service Status
```bash
gcloud run services describe sentryagent-agentidp-production \
--region us-central1 \
--format='value(status.conditions[0].type,status.conditions[0].status)'
```
Expected: `Ready True`.
---
## 5. Rollback Procedure
### 5.1 Image Rollback (Recommended — fastest)
To roll back to a previous image tag without modifying infrastructure:
**AWS:**
```bash
# Find the previous task definition revision
aws ecs list-task-definitions \
--family-prefix sentryagent-agentidp-production \
--sort DESC \
--query 'taskDefinitionArns[:5]'
# Update the service to use the previous task definition
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
--force-new-deployment
# Monitor the rollout
aws ecs wait services-stable \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production
```
**GCP:**
```bash
# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:PREVIOUS_TAG
# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
--region us-central1 \
--to-revisions PREVIOUS_REVISION_NAME=100
```
### 5.2 Infrastructure Rollback via Terraform
If an infrastructure change (not an image update) caused the problem:
```bash
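# Check the state and plan to understand what changed
terraform show
terraform plan
# If the state bucket has versioning enabled, fetch an earlier state version.
# AWS: list the object versions, then download the one you want by version ID
aws s3api list-object-versions \
--bucket your-state-bucket \
--prefix agentidp/aws/production/terraform.tfstate
aws s3api get-object \
--bucket your-state-bucket \
--key agentidp/aws/production/terraform.tfstate \
--version-id PREVIOUS_VERSION_ID previous.tfstate
# GCP: list the object generations, then copy the one you want
gsutil ls -a gs://your-state-bucket/agentidp/gcp/production/
gsutil cp "gs://your-state-bucket/agentidp/gcp/production/default.tfstate#PREVIOUS_GENERATION" previous.tfstate
# Push the restored state to the remote backend (-force is required because
# its serial is older than the current one), then re-apply the affected resource
terraform state push -force previous.tfstate
terraform apply -target=<affected_resource>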
```
**Never run `terraform destroy` in production without CEO approval.**
### 5.3 Database Rollback
RDS (AWS) and Cloud SQL (GCP) both support point-in-time restore. Use this only as a last resort — it creates a new DB instance and requires updating the `DATABASE_URL` secret.
**AWS:**
```bash
# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sentryagent-agentidp-production \
--target-db-instance-identifier sentryagent-agentidp-production-restored \
--restore-time 2026-01-01T12:00:00Z
```
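The `--restore-time` value must be a UTC timestamp in the format shown above. When the target is "shortly before the bad deployment", it can be computed instead of typed. The sketch below assumes GNU `date` (the `-d` relative-time syntax is not portable to BSD/macOS), and the function name is hypothetical.

```bash
# restore_time_minutes_ago MINUTES
# Prints a UTC timestamp MINUTES in the past, formatted like
# 2026-01-01T12:00:00Z. Assumes GNU date.
restore_time_minutes_ago() {
  date -u -d "$1 minutes ago" '+%Y-%m-%dT%H:%M:%SZ'
}

# Example (not executed here):
#   aws rds restore-db-instance-to-point-in-time \
#     --source-db-instance-identifier sentryagent-agentidp-production \
#     --target-db-instance-identifier sentryagent-agentidp-production-restored \
#     --restore-time "$(restore_time_minutes_ago 45)"
```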
**GCP:**
```bash
# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14
# Restore from a backup (this overwrites the current data on the target instance)
gcloud sql backups restore BACKUP_ID \
--restore-instance sentryagent-agentidp-production-pg14
```
---
## 6. Environment Variable Reference
All environment variables injected into the AgentIdP container are documented in full at:
**[docs/devops/environment-variables.md](./environment-variables.md)**
### Quick Reference
| Variable | Required | Source (AWS) | Source (GCP) |
|----------|----------|--------------|--------------|
| `DATABASE_URL` | Yes | Secrets Manager: `/<project>/<env>/database-url` | Secret Manager: `<name-prefix>-database-url` |
| `REDIS_URL` | Yes | Secrets Manager: `/<project>/<env>/redis-url` | Secret Manager: `<name-prefix>-redis-url` |
| `JWT_PRIVATE_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-private-key` | Secret Manager: `<name-prefix>-jwt-private-key` |
| `JWT_PUBLIC_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-public-key` | Secret Manager: `<name-prefix>-jwt-public-key` |
| `PORT` | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| `NODE_ENV` | No | Task definition env var (`production`) | Cloud Run env var (`production`) |
| `CORS_ORIGIN` | No | Task definition env var | Cloud Run env var |
| `POLICY_DIR` | No | Task definition env var (`/app/policies`) | Cloud Run env var (`/app/policies`) |
| `VAULT_ADDR` | No | Task definition env var | Cloud Run env var |
| `VAULT_TOKEN` | No | Secrets Manager: `/<project>/<env>/vault-token` | Secret Manager: `<name-prefix>-vault-token` |
| `VAULT_MOUNT` | No | Task definition env var (default: `secret`) | Cloud Run env var (default: `secret`) |
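The two naming schemes in the table differ only in their separators, so the secret identifier for either cloud can be derived from the same three parts. The sketch below assumes that the GCP `<name-prefix>` is `<project>-<env>`, which is inferred from the example identifiers in this guide rather than confirmed by the Terraform modules. The function names are hypothetical.

```bash
# aws_secret_id PROJECT ENV NAME
# Prints the Secrets Manager identifier: /<project>/<env>/<name>
aws_secret_id() {
  printf '/%s/%s/%s\n' "$1" "$2" "$3"
}

# gcp_secret_id PROJECT ENV NAME
# Prints the Secret Manager identifier: <project>-<env>-<name>
# (assumes the name prefix is "<project>-<env>").
gcp_secret_id() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}
```

For example, `aws_secret_id sentryagent-agentidp production database-url` prints `/sentryagent-agentidp/production/database-url`.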
### Updating a Secret
**AWS:**
```bash
# Update a secret value (e.g. rotate JWT keys)
aws secretsmanager put-secret-value \
--secret-id /sentryagent-agentidp/production/jwt-private-key \
--secret-string "$(cat new_jwt_private.pem)"
# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--force-new-deployment
```
**GCP:**
```bash
# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
--data-file=new_jwt_private.pem
# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:CURRENT_TAG
```
---
## Architecture Summary
### AWS
```
Route 53 (A alias)
└── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
    └── Target Group
        └── ECS Fargate Service (private subnets, 2+ tasks)
            ├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
            ├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
            └── ElastiCache Redis 7 (private subnets, primary+replica, TLS)
```
### GCP
```
Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
    ├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
    ├── Serverless VPC Connector
    │   ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
    │   └── Memorystore Redis 7 (STANDARD_HA, TLS)
```
Both environments share the same Docker image (`sentryagent/agentidp`) and the same JWT key pair — tokens issued in one region are verifiable in the other.