sentryagent-idp/docs/devops/deployment.md
SentryAgent.ai Developer 8cabc0191c docs: commit all Phase 6 documentation updates and OpenSpec archives
- devops docs: 8 files updated for Phase 6 state; field-trial.md added (946-line runbook)
- developer docs: api-reference (50+ endpoints), quick-start, 5 existing guides updated, 5 new guides added
- engineering docs: all 12 files updated (services, architecture, SDK guide, testing, overview)
- OpenSpec archives: phase-7-devops-field-trial, developer-docs-phase6-update, engineering-docs-phase6-update
- VALIDATOR.md + scripts/start-validator.sh: V&V Architect tooling added
- .gitignore: exclude session artifacts, build artifacts, and agent workspaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 02:24:24 +00:00


Deployment Guide — SentryAgent.ai AgentIdP

End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in terraform/.


Table of Contents

  1. Prerequisites
  2. AWS Deployment
  3. GCP Deployment
  4. Post-Deploy Verification
  5. Rollback Procedure
  6. Environment Variable Reference

1. Prerequisites

Tools

| Tool | Minimum Version | Install / Notes |
|------|-----------------|-----------------|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |

Verify all tools are available:

terraform version
aws --version
gcloud version
docker version
openssl version
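
Comparing the reported versions against the minimums in the table can be scripted. The sketch below is one possible approach, assuming GNU sort is available for its -V version ordering. The helper names version_ge and check_min are hypothetical, and the version strings are passed in by hand rather than parsed from each tool's output.

```shell
# version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL is at least MINIMUM under GNU sort's version ordering.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# check_min NAME ACTUAL MINIMUM
# Prints a one-line verdict and returns non-zero when the tool is too old.
check_min() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: ok (>= $3)"
  else
    echo "$1 $2: too old (need >= $3)"
    return 1
  fi
}

# Usage with hand-entered versions (hypothetical values):
#   check_min terraform 1.7.5 1.6.0
#   check_min docker 24.0.7 24.0
```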

Container Image

Build and push the sentryagent/agentidp image to your registry before deploying. Terraform references the image by tag — it does not build it.

# From the project root
docker build -t sentryagent/agentidp:1.0.0 .

# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag sentryagent/agentidp:1.0.0 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0

docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0

Update app_image_tag in your terraform.tfvars to match.

JWT Key Pair

Generate the RSA-2048 key pair used for signing and verifying JWTs:

openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem

# Verify
openssl rsa -in jwt_private.pem -check -noout

Keep jwt_private.pem secure — treat it with the same sensitivity as a TLS private key. You will paste its contents into terraform.tfvars.
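
Before the keys go into terraform.tfvars, it can be worth confirming that the two files really belong together. The following is a minimal sketch that re-derives the public key from the private key and compares it with the saved public key. The function name keys_match is hypothetical.

```shell
# keys_match PRIVATE_PEM PUBLIC_PEM
# Succeeds when the public key derived from PRIVATE_PEM equals PUBLIC_PEM.
keys_match() {
  derived=$(openssl rsa -in "$1" -pubout 2>/dev/null) || return 1
  [ "$derived" = "$(cat "$2")" ]
}

# Usage:
#   keys_match jwt_private.pem jwt_public.pem && echo "key pair matches"
```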


2. AWS Deployment

2.1 Configure AWS CLI

aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)

# Verify credentials
aws sts get-caller-identity

The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.

2.2 Provision an ACM Certificate

The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.

aws acm request-certificate \
  --domain-name idp.example.com \
  --validation-method DNS \
  --region us-east-1

Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become ISSUED before proceeding.

# Monitor validation status
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
  --region us-east-1 \
  --query 'Certificate.Status'
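
Rather than re-running the status command by hand, the check can be wrapped in a polling loop. This is a sketch only: wait_for_issued is a hypothetical helper, and the default attempt count and delay are arbitrary values chosen for illustration.

```shell
# wait_for_issued CERT_ARN REGION [ATTEMPTS] [DELAY_SECONDS]
# Polls the certificate status until it is ISSUED, a terminal failure
# status is seen, or the attempts run out.
wait_for_issued() {
  arn=$1; region=$2; attempts=${3:-40}; delay=${4:-30}
  n=1
  while [ "$n" -le "$attempts" ]; do
    status=$(aws acm describe-certificate \
      --certificate-arn "$arn" \
      --region "$region" \
      --query 'Certificate.Status' \
      --output text)
    echo "attempt $n: $status"
    case "$status" in
      ISSUED) return 0 ;;
      FAILED|VALIDATION_TIMED_OUT) return 1 ;;
    esac
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_issued arn:aws:acm:us-east-1:123456789012:certificate/XXXX us-east-1
```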

2.3 Prepare tfvars

cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars. All fields marked REPLACE_WITH_* are required. Key fields:

  • region — AWS region (must match the ACM certificate region)
  • domain_name — your domain (e.g. idp.example.com)
  • certificate_arn — ARN from step 2.2
  • app_image_tag — tag of the image you pushed in step 1
  • db_password — strong random password (no @, #, ?, / characters — they break URL parsing)
  • redis_auth_token — minimum 16 characters, no spaces
  • jwt_private_key — full PEM contents of jwt_private.pem with literal \n for newlines
  • jwt_public_key — full PEM contents of jwt_public.pem with literal \n for newlines
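
The db_password and redis_auth_token constraints listed above can be checked before the values are written to terraform.tfvars. A minimal sketch, with hypothetical function names, and treating an empty password as invalid (an assumption not stated in the list):

```shell
# valid_db_password VALUE
# Fails when VALUE is empty or contains @, #, ?, or / (they break URL parsing).
valid_db_password() {
  case "$1" in
    ''|*[@#?/]*) return 1 ;;
    *) return 0 ;;
  esac
}

# valid_redis_token VALUE
# Requires at least 16 characters and no spaces.
valid_redis_token() {
  case "$1" in
    *' '*) return 1 ;;
  esac
  [ "${#1}" -ge 16 ]
}

# Usage:
#   valid_db_password "$DB_PASSWORD" || echo "db_password contains a forbidden character"
```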

Example for encoding PEM keys in tfvars:

# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
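
To avoid copy-and-paste mistakes, the same awk one-liner can be wrapped so that it emits a complete tfvars assignment. This is a sketch; pem_to_line and tfvars_entry are hypothetical helper names.

```shell
# pem_to_line FILE
# Prints FILE on a single line, each line break replaced by a literal \n.
pem_to_line() {
  awk 'NF {printf "%s\\n", $0}' "$1"
}

# tfvars_entry NAME FILE
# Prints a NAME = "..." assignment suitable for pasting into terraform.tfvars.
tfvars_entry() {
  printf '%s = "%s"\n' "$1" "$(pem_to_line "$2")"
}

# Usage:
#   tfvars_entry jwt_private_key jwt_private.pem
#   tfvars_entry jwt_public_key jwt_public.pem
```

Print the output to the terminal and paste it, rather than redirecting it into a file that might be committed by accident.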

Never commit terraform.tfvars to version control.

2.4 Configure Remote State

Uncomment and configure the backend "s3" block in terraform/environments/aws/main.tf:

backend "s3" {
  bucket         = "your-terraform-state-bucket"
  key            = "agentidp/aws/production/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "your-terraform-locks-table"
}

Create the S3 bucket and DynamoDB table if they do not exist:

# S3 bucket with versioning and encryption
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket your-terraform-state-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# DynamoDB table for state locking
aws dynamodb create-table \
  --table-name your-terraform-locks-table \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

2.5 Terraform Init

cd terraform/environments/aws
terraform init

Expected output: provider plugins downloaded, backend initialized.

2.6 Terraform Plan

terraform plan -out=tfplan

Review the plan carefully before applying. Expected resources on first apply: ~50-60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).
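
Part of the review can be automated by reading the summary line that Terraform prints at the end of a plan. The sketch below assumes the usual "Plan: N to add, N to change, N to destroy." wording in uncoloured output. The helpers plan_count and no_destroys are hypothetical.

```shell
# plan_count ACTION
# Reads plan text on stdin and prints the resource count for ACTION
# (add, change, or destroy) taken from the "Plan: ..." summary line.
plan_count() {
  sed -n "s/^Plan:.*[^0-9]\([0-9][0-9]*\) to $1.*/\1/p" | head -n 1
}

# no_destroys
# Succeeds only when the plan on stdin reports exactly 0 resources to destroy.
# It also fails when no summary line is found at all.
no_destroys() {
  [ "$(plan_count destroy)" = "0" ]
}

# Usage:
#   terraform show -no-color tfplan | no_destroys || echo "plan destroys resources, review it"
```

On a first apply nothing should be destroyed, so a non-zero destroy count is a signal to stop and read the plan again.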

2.7 Terraform Apply

terraform apply tfplan

First apply takes 20-30 minutes; RDS Multi-AZ provisioning is the longest step (~15 min). Do not interrupt the apply.

When complete, note the outputs:

terraform output

Key outputs:

  • service_url — the HTTPS URL of your deployed service
  • alb_dns_name — ALB DNS name (verify Route 53 alias is pointing here)
  • ecs_service_name — use for ECS deployment commands
  • cloudwatch_log_group — where container logs appear

2.8 Run Database Migrations

After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:

# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
  --cluster sentryagent-agentidp-production \
  --service-name sentryagent-agentidp-production \
  --query 'taskArns[0]' \
  --output text)

# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
  --cluster sentryagent-agentidp-production \
  --task "$TASK_ARN" \
  --container agentidp \
  --command "node scripts/db-migrate.js" \
  --interactive

Alternatively, run a one-off ECS task with the migration command as the container override.
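
The one-off task alternative could look roughly like the sketch below. The helper migration_overrides is hypothetical, and TASK_DEF, SUBNET_ID, and SG_ID are placeholders you would need to fill in from your own deployment; none of them come from the Terraform outputs listed earlier.

```shell
# migration_overrides CONTAINER_NAME
# Prints the --overrides JSON that replaces the container command
# with the migration script.
migration_overrides() {
  printf '{"containerOverrides":[{"name":"%s","command":["node","scripts/db-migrate.js"]}]}\n' "$1"
}

# Usage (TASK_DEF, SUBNET_ID, and SG_ID are placeholders):
#   aws ecs run-task \
#     --cluster sentryagent-agentidp-production \
#     --task-definition "$TASK_DEF" \
#     --launch-type FARGATE \
#     --network-configuration "awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID],assignPublicIp=DISABLED}" \
#     --overrides "$(migration_overrides agentidp)"
```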


3. GCP Deployment

3.1 Configure gcloud CLI

gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login

Verify:

gcloud config list
gcloud projects describe your-gcp-project-id

The principal running Terraform requires the following roles on the project:

  • roles/owner or a custom role covering: Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, Service Networking Admin.

3.2 Prepare tfvars

cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars. Key fields:

  • project_id — your GCP project ID
  • region — GCP region (e.g. us-central1)
  • app_image_tag — tag of the image you built
  • db_password — strong random password for Cloud SQL
  • jwt_private_key / jwt_public_key — same PEM keys used for AWS (same key pair for both regions)

Never commit terraform.tfvars to version control.

3.3 Configure Remote State

Uncomment and configure the backend "gcs" block in terraform/environments/gcp/main.tf:

backend "gcs" {
  bucket = "your-terraform-state-bucket"
  prefix = "agentidp/gcp/production"
}

Create the GCS bucket:

gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket

3.4 Terraform Init

cd terraform/environments/gcp
terraform init

3.5 Terraform Plan

terraform plan -out=tfplan

Review the plan. Expected resources: ~35-45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).

3.6 Terraform Apply

terraform apply tfplan

First apply takes 15-20 minutes; Cloud SQL provisioning is the longest step.

When complete:

terraform output

Key outputs:

  • service_url — Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
  • cloud_sql_connection_name — for Cloud SQL Proxy if needed
  • memorystore_host — Redis private IP

3.7 Run Database Migrations

Cloud Run does not support exec. Use a one-off Cloud Run Job for migrations:

gcloud run jobs create agentidp-migrate \
  --image sentryagent/agentidp:1.0.0 \
  --region us-central1 \
  --command node \
  --args "scripts/db-migrate.js" \
  --set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
  --vpc-connector sentryagent-agentidp-production-connector \
  --service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com

gcloud run jobs execute agentidp-migrate --region us-central1 --wait

4. Post-Deploy Verification

Run these checks after deploying to either environment. Replace https://idp.example.com with your actual service URL.

4.1 Health Check

curl -si https://idp.example.com/health

Expected response:

HTTP/2 200
content-type: application/json

{"status":"ok"}

If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60-90 seconds and retry; ECS tasks or Cloud Run instances take time to pass health checks.
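
The wait-and-retry step can be scripted. The following is a sketch, not part of the project tooling: wait_healthy is a hypothetical helper, and the default of 9 attempts at 10-second intervals is an arbitrary choice that covers roughly 90 seconds.

```shell
# wait_healthy BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Requests BASE_URL/health until it returns HTTP 200 or the attempts run out.
wait_healthy() {
  url=$1; attempts=${2:-9}; delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; a failed connection is treated as 000
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url/health") || code=000
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_healthy https://idp.example.com || echo "service did not become healthy"
```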

4.2 Metrics Endpoint

curl -si https://idp.example.com/metrics

Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with # HELP, # TYPE, and metric values).

4.3 Token Endpoint (Smoke Test)

First, register a test agent client (requires a valid JWT or admin credentials; see the developer guide). Then request a token:

# Issue a client credentials token (replace CLIENT_ID and CLIENT_SECRET with real values)
curl -s -X POST https://idp.example.com/api/v1/token \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&scope=read"

Expected response (abbreviated):

{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "read"
}
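
To inspect what the returned token claims, its first two segments can be decoded locally. This is a sketch with hypothetical helper names; it assumes a base64 command that accepts -d, and it only decodes the token without verifying the signature.

```shell
# b64url_decode STRING
# Decodes a base64url string: maps the URL-safe alphabet back to the standard
# one and restores the padding that JWTs omit.
b64url_decode() {
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="$s==" ;;
    3) s="$s=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

# jwt_header TOKEN: prints the decoded header (first segment).
jwt_header() {
  b64url_decode "${1%%.*}"
}

# jwt_payload TOKEN: prints the decoded claims (second segment).
jwt_payload() {
  rest=${1#*.}
  b64url_decode "${rest%%.*}"
}

# Usage:
#   jwt_header "$ACCESS_TOKEN"
#   jwt_payload "$ACCESS_TOKEN"
```

For the abbreviated token shown in the expected response, the header segment decodes to {"alg":"RS256","typ":"JWT"}, which matches the RSA key pair generated in the prerequisites.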

4.4 JWKS Endpoint

curl -si https://idp.example.com/.well-known/jwks.json

Expected: HTTP 200 with a JSON object containing a keys array with at least one RSA public key entry.
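
A quick text-level check of the response body is sketched below. It is a pattern match, not a JSON parser, so treat a pass as a hint rather than proof. The function name has_rsa_key is hypothetical.

```shell
# has_rsa_key
# Reads a JWKS document on stdin and succeeds when it contains a "keys"
# member and at least one entry whose "kty" is "RSA".
has_rsa_key() {
  body=$(cat)
  printf '%s\n' "$body" | grep -q '"keys"[[:space:]]*:' || return 1
  printf '%s\n' "$body" | grep -q '"kty"[[:space:]]*:[[:space:]]*"RSA"'
}

# Usage:
#   curl -s https://idp.example.com/.well-known/jwks.json | has_rsa_key && echo "JWKS looks valid"
```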

4.5 TLS Verification

# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"

Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.

4.6 AWS-Specific: ECS Service Status

aws ecs describe-services \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'

Expected: running equals desired, status is ACTIVE.

4.7 GCP-Specific: Cloud Run Service Status

gcloud run services describe sentryagent-agentidp-production \
  --region us-central1 \
  --format='value(status.conditions[0].type,status.conditions[0].status)'

Expected: Ready True.


5. Rollback Procedure

5.1 Application Rollback

To roll back to a previous image tag without modifying infrastructure:

AWS:

# Find the previous task definition revision
aws ecs list-task-definitions \
  --family-prefix sentryagent-agentidp-production \
  --sort DESC \
  --query 'taskDefinitionArns[:5]'

# Update the service to use the previous task definition
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
  --force-new-deployment

# Monitor the rollout
aws ecs wait services-stable \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production
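
If the revision to roll back to is simply the one before the current revision, its family:revision name can be derived from the current task definition ARN. This is a sketch under that assumption; it does not check that the earlier revision still exists or that it is the one you want. The helper name previous_revision is hypothetical.

```shell
# previous_revision TASK_DEF_ARN
# Prints FAMILY:REVISION-1 for a task definition ARN.
# Fails when the revision is 1, since there is nothing earlier.
previous_revision() {
  name=${1##*/}          # strip everything up to the last slash
  family=${name%:*}      # text before the final colon
  revision=${name##*:}   # number after the final colon
  [ "$revision" -gt 1 ] || return 1
  echo "$family:$((revision - 1))"
}

# Usage:
#   previous_revision arn:aws:ecs:us-east-1:123456789012:task-definition/sentryagent-agentidp-production:42
```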

GCP:

# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:PREVIOUS_TAG

# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION_NAME=100

5.2 Infrastructure Rollback via Terraform

If an infrastructure change (not an image update) caused the problem:

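# Check the state and plan to understand what changed
terraform show
terraform plan

# If bucket versioning is enabled (S3/GCS), fetch an earlier version of the state file.
# AWS: list the stored versions, then download the one you need by version ID
aws s3api list-object-versions \
  --bucket your-state-bucket \
  --prefix agentidp/aws/production/terraform.tfstate
aws s3api get-object \
  --bucket your-state-bucket \
  --key agentidp/aws/production/terraform.tfstate \
  --version-id PREVIOUS_VERSION_ID \
  ./terraform.tfstate
terraform apply -target=<affected_resource>

# GCP: list the stored generations, then copy the one you need
# (default.tfstate assumes the default Terraform workspace)
gsutil ls -a gs://your-state-bucket/agentidp/gcp/production/
gsutil cp "gs://your-state-bucket/agentidp/gcp/production/default.tfstate#PREVIOUS_GENERATION" ./terraform.tfstate
terraform apply -target=<affected_resource>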

Never run terraform destroy in production without CEO approval.

5.3 Database Rollback

RDS (AWS) and Cloud SQL (GCP) both support point-in-time restore. Use this only as a last resort — it creates a new DB instance and requires updating the DATABASE_URL secret.

AWS:

# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sentryagent-agentidp-production \
  --target-db-instance-identifier sentryagent-agentidp-production-restored \
  --restore-time 2026-01-01T12:00:00Z

GCP:

# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14

# Restore from a backup
gcloud sql backups restore BACKUP_ID \
  --restore-instance sentryagent-agentidp-production-pg14

6. Environment Variable Reference

All environment variables injected into the AgentIdP container are documented in full at:

docs/devops/environment-variables.md

Quick Reference

| Variable | Required | Source (AWS) | Source (GCP) |
|----------|----------|--------------|--------------|
| DATABASE_URL | Yes | Secrets Manager: /<project>/<env>/database-url | Secret Manager: <name-prefix>-database-url |
| REDIS_URL | Yes | Secrets Manager: /<project>/<env>/redis-url | Secret Manager: <name-prefix>-redis-url |
| JWT_PRIVATE_KEY | Yes | Secrets Manager: /<project>/<env>/jwt-private-key | Secret Manager: <name-prefix>-jwt-private-key |
| JWT_PUBLIC_KEY | Yes | Secrets Manager: /<project>/<env>/jwt-public-key | Secret Manager: <name-prefix>-jwt-public-key |
| PORT | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| NODE_ENV | No | Task definition env var (production) | Cloud Run env var (production) |
| CORS_ORIGIN | No | Task definition env var | Cloud Run env var |
| POLICY_DIR | No | Task definition env var (/app/policies) | Cloud Run env var (/app/policies) |
| VAULT_ADDR | No | Task definition env var | Cloud Run env var |
| VAULT_TOKEN | No | Secrets Manager: /<project>/<env>/vault-token | Secret Manager: <name-prefix>-vault-token |
| VAULT_MOUNT | No | Task definition env var (default: secret) | Cloud Run env var (default: secret) |
| BILLING_ENABLED | No | Task definition env var | Cloud Run env var |
| STRIPE_SECRET_KEY | Only if billing enabled | Secrets Manager: /<project>/<env>/stripe-secret-key | Secret Manager: <name-prefix>-stripe-secret-key |
| STRIPE_WEBHOOK_SECRET | Only if billing enabled | Secrets Manager: /<project>/<env>/stripe-webhook-secret | Secret Manager: <name-prefix>-stripe-webhook-secret |
| STRIPE_PRICE_ID | Only if billing enabled | Task definition env var | Cloud Run env var |
| ANALYTICS_ENABLED | No | Task definition env var (default: true) | Cloud Run env var |
| TIER_ENFORCEMENT | No | Task definition env var (default: true) | Cloud Run env var |
| COMPLIANCE_ENABLED | No | Task definition env var (default: true) | Cloud Run env var |
| REDIS_RATE_LIMIT_ENABLED | No | Task definition env var | Cloud Run env var |
| RATE_LIMIT_WINDOW_MS | No | Task definition env var (default: 60000) | Cloud Run env var |
| RATE_LIMIT_MAX_REQUESTS | No | Task definition env var (default: 100) | Cloud Run env var |
| DB_POOL_MAX | No | Task definition env var (default: 20) | Cloud Run env var |
| DB_POOL_MIN | No | Task definition env var (default: 2) | Cloud Run env var |
| DB_POOL_IDLE_TIMEOUT_MS | No | Task definition env var (default: 30000) | Cloud Run env var |
| DB_POOL_CONNECTION_TIMEOUT_MS | No | Task definition env var (default: 5000) | Cloud Run env var |
| KAFKA_BROKERS | No | Task definition env var | Cloud Run env var |
| ENFORCE_TLS | No | Task definition env var | Cloud Run env var |
| OPA_URL | No | Task definition env var | Cloud Run env var |
| VAULT_KV_MOUNT | No | Task definition env var (default: secret) | Cloud Run env var |
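
The Required column can be turned into a preflight check that runs before the application starts. The sketch below is not part of the project: missing_vars is a hypothetical helper, and it assumes billing is switched on by setting BILLING_ENABLED to the string true.

```shell
# missing_vars
# Prints the name of every required variable that is unset or empty,
# one per line. Prints nothing when all required variables are present.
missing_vars() {
  required="DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY"
  # Assumption: the literal value "true" enables billing
  if [ "${BILLING_ENABLED:-}" = "true" ]; then
    required="$required STRIPE_SECRET_KEY STRIPE_WEBHOOK_SECRET STRIPE_PRICE_ID"
  fi
  for name in $required; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Usage:
#   missing=$(missing_vars)
#   [ -z "$missing" ] || { echo "missing: $missing"; exit 1; }
```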

Updating a Secret

AWS:

# Update a secret value (e.g. rotate JWT keys)
aws secretsmanager put-secret-value \
  --secret-id /sentryagent-agentidp/production/jwt-private-key \
  --secret-string "$(cat new_jwt_private.pem)"

# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --force-new-deployment

GCP:

# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
  --data-file=new_jwt_private.pem

# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:CURRENT_TAG

Architecture Summary

AWS

Route 53 (A alias)
  └── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
        └── Target Group
              └── ECS Fargate Service (private subnets, 2+ tasks)
                    ├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
                    ├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
                    └── ElastiCache Redis 7 (private subnets, primary+replica, TLS)

GCP

Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
                    ├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
                    ├── Serverless VPC Connector
                    │     ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
                    │     └── Memorystore Redis 7 (STANDARD_HA, TLS)

Both environments share the same Docker image (sentryagent/agentidp) and the same JWT key pair, so tokens issued in one environment can be verified in the other.