Deployment Guide — SentryAgent.ai AgentIdP
End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in terraform/.
Table of Contents
- Prerequisites
- AWS Deployment
- GCP Deployment
- Post-Deploy Verification
- Rollback Procedure
- Environment Variable Reference
1. Prerequisites
Tools
| Tool | Minimum Version | Install |
|---|---|---|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |
Verify all tools are available:
terraform version
aws --version
gcloud version
docker version
openssl version
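The commands above only print versions; comparing them against the minimums in the table is left to the reader. A minimal sketch of that comparison follows. The helper name `version_ge` is hypothetical, and it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS builds do).

```shell
# version_ge ACTUAL MINIMUM -> exit 0 when ACTUAL >= MINIMUM.
# Hypothetical helper; relies on version-aware sorting (sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (assumes jq is installed; each tool prints its version in a
# different format, so extracting the number is left to the caller):
#   version_ge "$(terraform version -json | jq -r .terraform_version)" 1.6.0 \
#     || echo "Terraform is older than 1.6.0"
```

Sorting the two versions and checking which comes first avoids hand-written parsing of dotted version strings, so 1.10.0 is correctly treated as newer than 1.6.0.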
Container Image
Build and push the sentryagent/agentidp image to your registry before deploying. Terraform references the image by tag — it does not build it.
# From the project root
docker build -t sentryagent/agentidp:1.0.0 .
# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag sentryagent/agentidp:1.0.0 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
Update app_image_tag in your terraform.tfvars to match.
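The ECR example repeats the registry hostname three times, which is easy to get wrong when the account ID or region changes. As a sketch, the hostname and full image reference can be derived once from their parts. The function names `ecr_registry` and `ecr_image_uri` are hypothetical, and the account ID below is the placeholder used in this guide.

```shell
# ecr_registry ACCOUNT_ID REGION -> registry hostname
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com' "$1" "$2"
}

# ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG -> full image reference
ecr_image_uri() {
  printf '%s/%s:%s' "$(ecr_registry "$1" "$2")" "$3" "$4"
}

# Example usage with the placeholder values from this guide:
#   IMAGE=$(ecr_image_uri 123456789012 us-east-1 sentryagent/agentidp 1.0.0)
#   docker tag sentryagent/agentidp:1.0.0 "$IMAGE"
#   docker push "$IMAGE"
```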
JWT Key Pair
Generate the RSA-2048 key pair used for signing and verifying JWTs:
openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem
# Verify
openssl rsa -in jwt_private.pem -check -noout
Keep jwt_private.pem secure — treat it with the same sensitivity as a TLS private key. You will paste its contents into terraform.tfvars.
2. AWS Deployment
2.1 Configure AWS CLI
aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)
# Verify credentials
aws sts get-caller-identity
The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.
2.2 Provision an ACM Certificate
The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.
aws acm request-certificate \
--domain-name idp.example.com \
--validation-method DNS \
--region us-east-1
Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become ISSUED before proceeding.
# Monitor validation status
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
--region us-east-1 \
--query 'Certificate.Status'
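Rather than re-running the status query by hand, the wait can be scripted. The sketch below is a generic polling helper, not part of the repository; the name `wait_for` and its argument order are assumptions. Note that the query needs `--output text` so the status is printed without JSON quotes. The AWS CLI also ships a built-in waiter, `aws acm wait certificate-validated`, which may be simpler if its default timeout suits you.

```shell
# wait_for EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until its stdout equals EXPECTED or ATTEMPTS is used up.
wait_for() {
  expected="$1"; attempts="$2"; delay="$3"
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example usage (CERT_ARN is a placeholder for the ARN returned in step 2.2):
#   wait_for ISSUED 40 30 aws acm describe-certificate \
#     --certificate-arn "$CERT_ARN" --region us-east-1 \
#     --query 'Certificate.Status' --output text
```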
2.3 Prepare tfvars
cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars. All fields marked REPLACE_WITH_* are required. Key fields:
- region: AWS region (must match the ACM certificate region)
- domain_name: your domain (e.g. idp.example.com)
- certificate_arn: ARN from step 2.2
- app_image_tag: tag of the image you pushed in step 1
- db_password: strong random password (no @, #, ?, / characters; they break URL parsing)
- redis_auth_token: minimum 16 characters, no spaces
- jwt_private_key: full PEM contents of jwt_private.pem with literal \n for newlines
- jwt_public_key: full PEM contents of jwt_public.pem with literal \n for newlines
Example for encoding PEM keys in tfvars:
# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
Never commit terraform.tfvars to version control.
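The character and length constraints on db_password and redis_auth_token listed above can be checked before running Terraform. The following is a minimal sketch; the function names are hypothetical, and rejecting an empty password is an added assumption rather than a rule stated in this guide.

```shell
# valid_db_password VALUE -> exit 0 when VALUE is non-empty and contains
# none of the characters @ # ? / (which break URL parsing).
valid_db_password() {
  case "$1" in
    ''|*[@#?/]*) return 1 ;;
    *) return 0 ;;
  esac
}

# valid_redis_token VALUE -> exit 0 when VALUE has at least 16 characters
# and contains no spaces.
valid_redis_token() {
  case "$1" in
    *' '*) return 1 ;;
  esac
  [ "${#1}" -ge 16 ]
}

# Example usage:
#   valid_db_password "$DB_PASSWORD" || echo "db_password contains @, #, ? or /"
#   valid_redis_token "$REDIS_AUTH_TOKEN" || echo "redis_auth_token too short or has spaces"
```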
2.4 Configure Remote State (Recommended)
Uncomment and configure the backend "s3" block in terraform/environments/aws/main.tf:
backend "s3" {
bucket = "your-terraform-state-bucket"
key = "agentidp/aws/production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "your-terraform-locks-table"
}
Create the S3 bucket and DynamoDB table if they do not exist:
# S3 bucket with versioning and encryption
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
--bucket your-terraform-state-bucket \
--versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
--bucket your-terraform-state-bucket \
--server-side-encryption-configuration \
'{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# DynamoDB table for state locking
aws dynamodb create-table \
--table-name your-terraform-locks-table \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1
2.5 Terraform Init
cd terraform/environments/aws
terraform init
Expected output: provider plugins downloaded, backend initialized.
2.6 Terraform Plan
terraform plan -out=tfplan
Review the plan carefully before applying. Expected resources on first apply: ~50–60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).
2.7 Terraform Apply
terraform apply tfplan
First apply takes 20–30 minutes — RDS Multi-AZ provisioning is the longest step (~15 min). Do not interrupt the apply.
When complete, note the outputs:
terraform output
Key outputs:
- service_url: the HTTPS URL of your deployed service
- alb_dns_name: ALB DNS name (verify the Route 53 alias is pointing here)
- ecs_service_name: use for ECS deployment commands
- cloudwatch_log_group: where container logs appear
2.8 Run Database Migrations
After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:
# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
--cluster sentryagent-agentidp-production \
--service-name sentryagent-agentidp-production \
--query 'taskArns[0]' \
--output text)
# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
--cluster sentryagent-agentidp-production \
--task $TASK_ARN \
--container agentidp \
--command "node scripts/db-migrate.js" \
--interactive
Alternatively, run a one-off ECS task with the migration command as the container override.
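A sketch of the one-off task approach is shown below. It assumes the task definition family shares the service name used elsewhere in this guide and that the container is named agentidp, as in the ECS Exec example; the subnet and security group arguments are placeholders you would take from your own VPC. The function names are hypothetical.

```shell
# migration_overrides CONTAINER_NAME -> JSON for the --overrides flag,
# replacing the container command with the migration script.
migration_overrides() {
  printf '{"containerOverrides":[{"name":"%s","command":["node","scripts/db-migrate.js"]}]}' "$1"
}

# run_migration_task SUBNET_IDS SECURITY_GROUP_ID
# SUBNET_IDS is a comma-separated list of private subnet IDs (placeholder).
run_migration_task() {
  aws ecs run-task \
    --cluster sentryagent-agentidp-production \
    --task-definition sentryagent-agentidp-production \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[$1],securityGroups=[$2],assignPublicIp=DISABLED}" \
    --overrides "$(migration_overrides agentidp)"
}

# Example usage (IDs are placeholders):
#   run_migration_task subnet-aaaa1111,subnet-bbbb2222 sg-cccc3333
```

Because the task exits when the script finishes, this variant does not depend on ECS Exec being enabled on the service.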
3. GCP Deployment
3.1 Configure gcloud CLI
gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login
Verify:
gcloud config list
gcloud projects describe your-gcp-project-id
The principal running Terraform requires the following roles on the project:
roles/owner, or a custom role covering: Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, Service Networking Admin.
3.2 Prepare tfvars
cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars. Key fields:
- project_id: your GCP project ID
- region: GCP region (e.g. us-central1)
- app_image_tag: tag of the image you built
- db_password: strong random password for Cloud SQL
- jwt_private_key / jwt_public_key: same PEM keys used for AWS (one key pair shared by both environments)
Never commit terraform.tfvars to version control.
3.3 Configure Remote State (Recommended)
Uncomment and configure the backend "gcs" block in terraform/environments/gcp/main.tf:
backend "gcs" {
bucket = "your-terraform-state-bucket"
prefix = "agentidp/gcp/production"
}
Create the GCS bucket:
gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
3.4 Terraform Init
cd terraform/environments/gcp
terraform init
3.5 Terraform Plan
terraform plan -out=tfplan
Review the plan. Expected resources: ~35–45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).
3.6 Terraform Apply
terraform apply tfplan
First apply takes 15–20 minutes — Cloud SQL provisioning is the longest step.
When complete:
terraform output
Key outputs:
- service_url: Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
- cloud_sql_connection_name: for Cloud SQL Proxy if needed
- memorystore_host: Redis private IP
3.7 Run Database Migrations
Cloud Run does not support exec. Use a one-off Cloud Run Job for migrations:
gcloud run jobs create agentidp-migrate \
--image sentryagent/agentidp:1.0.0 \
--region us-central1 \
--command node \
--args "scripts/db-migrate.js" \
--set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
--vpc-connector sentryagent-agentidp-production-connector \
--service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com
gcloud run jobs execute agentidp-migrate --region us-central1 --wait
4. Post-Deploy Verification
Run these checks after deploying to either environment. Replace https://idp.example.com with your actual service URL.
4.1 Health Check
curl -si https://idp.example.com/health
Expected response:
HTTP/2 200
content-type: application/json
{"status":"ok"}
If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60–90 seconds and retry — ECS tasks or Cloud Run instances take time to pass health checks.
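The retry guidance above can be scripted so a deploy pipeline waits for healthy targets instead of failing on the first 502. This is a sketch under stated assumptions: the function names are hypothetical, the 9 attempts at 10-second intervals are chosen to cover the 60 to 90 second window, and any status other than 200, 502 or 503 (including curl's 000 for a failed connection) is treated as a hard failure.

```shell
# classify_health HTTP_CODE -> prints ok, retry, or fail
classify_health() {
  case "$1" in
    200) echo ok ;;
    502|503) echo retry ;;
    *) echo fail ;;
  esac
}

# check_health BASE_URL -> exit 0 once /health returns 200
check_health() {
  tries=0
  while [ "$tries" -lt 9 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1/health")
    case "$(classify_health "$code")" in
      ok) return 0 ;;
      fail) echo "unexpected HTTP $code from $1/health" >&2; return 1 ;;
    esac
    tries=$((tries + 1))
    sleep 10
  done
  echo "still unhealthy after $tries attempts" >&2
  return 1
}

# Example usage:
#   check_health https://idp.example.com && echo "service is healthy"
```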
4.2 Metrics Endpoint
curl -si https://idp.example.com/metrics
Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with # HELP, # TYPE, and metric values).
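To check the response body automatically, one option is to look for the two comment prefixes that Prometheus text output contains. The helper below is a rough sketch with a hypothetical name; it only checks for the presence of # HELP and # TYPE lines and does not validate the metrics themselves.

```shell
# looks_like_prometheus: reads a response body on stdin and exits 0 when it
# contains at least one "# HELP" line and one "# TYPE" line.
looks_like_prometheus() {
  body=$(cat)
  printf '%s\n' "$body" | grep -q '^# HELP ' &&
    printf '%s\n' "$body" | grep -q '^# TYPE '
}

# Example usage:
#   curl -s https://idp.example.com/metrics | looks_like_prometheus \
#     && echo "metrics endpoint OK"
```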
4.3 Token Endpoint (Smoke Test)
First, register a test agent client (requires a valid JWT or admin credentials; see the developer guide). Then request a token with that client's credentials:
# Issue a client credentials token (replace test-client and test-secret with the real client ID and secret)
curl -s -X POST https://idp.example.com/api/v1/token \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=client_credentials&client_id=test-client&client_secret=test-secret&scope=read"
Expected response (abbreviated):
{
"access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"token_type": "Bearer",
"expires_in": 3600,
"scope": "read"
}
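To confirm that the returned access_token carries the claims you expect, its payload (the middle, base64url-encoded segment of the JWT) can be decoded locally. This sketch only decodes; it does not verify the signature. The function name `jwt_payload` is hypothetical, and it assumes a `base64` that accepts `-d` (GNU coreutils and BusyBox do; older macOS versions use `-D`).

```shell
# jwt_payload TOKEN -> prints the decoded JSON payload of a JWT.
# No signature verification is performed.
jwt_payload() {
  # Take the second dot-separated segment and map base64url to base64
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url omits
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example usage (TOKEN holds the access_token value from the response):
#   jwt_payload "$TOKEN"
```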
4.4 JWKS Endpoint
curl -si https://idp.example.com/.well-known/jwks.json
Expected: HTTP 200 with a JSON object containing a keys array with at least one RSA public key entry.
4.5 TLS Verification
# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"
Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.
4.6 AWS-Specific: ECS Service Status
aws ecs describe-services \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production \
--query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
Expected: running equals desired, status is ACTIVE.
4.7 GCP-Specific: Cloud Run Service Status
gcloud run services describe sentryagent-agentidp-production \
--region us-central1 \
--format='value(status.conditions[0].type,status.conditions[0].status)'
Expected: Ready True.
5. Rollback Procedure
5.1 Image Rollback (Recommended — fastest)
To roll back to a previous image tag without modifying infrastructure:
AWS:
# Find the previous task definition revision
aws ecs list-task-definitions \
--family-prefix sentryagent-agentidp-production \
--sort DESC \
--query 'taskDefinitionArns[:5]'
# Update the service to use the previous task definition
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
--force-new-deployment
# Monitor the rollout
aws ecs wait services-stable \
--cluster sentryagent-agentidp-production \
--services sentryagent-agentidp-production
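When the target of the AWS rollback is simply the revision before the one currently deployed, the PREVIOUS_REVISION value can be derived from the current task definition ARN. The helper below is a sketch with a hypothetical name. It assumes the immediately preceding revision still exists and is the one you want, which is not guaranteed if revisions were deregistered or several bad revisions were deployed in a row.

```shell
# previous_revision ARN_OR_FAMILY:REVISION -> prints FAMILY:(REVISION - 1)
# Fails when the revision is 1 or is not a number.
previous_revision() {
  ref="${1##*/}"        # drop everything up to the last slash of the ARN
  family="${ref%:*}"    # text before the final colon
  rev="${ref##*:}"      # text after the final colon
  [ "$rev" -gt 1 ] 2>/dev/null || return 1
  printf '%s:%s\n' "$family" "$((rev - 1))"
}

# Example usage:
#   CURRENT=$(aws ecs describe-services \
#     --cluster sentryagent-agentidp-production \
#     --services sentryagent-agentidp-production \
#     --query 'services[0].taskDefinition' --output text)
#   aws ecs update-service \
#     --cluster sentryagent-agentidp-production \
#     --service sentryagent-agentidp-production \
#     --task-definition "$(previous_revision "$CURRENT")"
```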
GCP:
# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:PREVIOUS_TAG
# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
--region us-central1 \
--to-revisions PREVIOUS_REVISION_NAME=100
5.2 Infrastructure Rollback via Terraform
If an infrastructure change (not an image update) caused the problem:
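# Check the state and plan to understand what changed
terraform show
terraform plan
# If the state itself must be rolled back, fetch an earlier version from the
# versioned state bucket (S3/GCS object versioning), inspect it, and push it
# with "terraform state push" (an older serial is rejected unless -force is given).
# AWS:
aws s3api list-object-versions --bucket your-state-bucket \
--prefix agentidp/aws/production/terraform.tfstate
aws s3api get-object --bucket your-state-bucket \
--key agentidp/aws/production/terraform.tfstate \
--version-id PREVIOUS_VERSION_ID ./terraform.tfstate.previous
# GCP (the gcs backend stores the default workspace as <prefix>/default.tfstate):
gsutil ls -a gs://your-state-bucket/agentidp/gcp/production/
gsutil cp "gs://your-state-bucket/agentidp/gcp/production/default.tfstate#GENERATION" ./terraform.tfstate.previous
# Then re-apply only what was affected:
terraform apply -target=<affected_resource>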
Never run terraform destroy in production without CEO approval.
5.3 Database Rollback
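RDS (AWS) and Cloud SQL (GCP) both support point-in-time restore. Use this only as a last resort. On AWS, a point-in-time restore creates a new DB instance and requires updating the DATABASE_URL secret to point at it. The gcloud sql backups restore command shown for GCP instead overwrites the current data on the target instance.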
AWS:
# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sentryagent-agentidp-production \
--target-db-instance-identifier sentryagent-agentidp-production-restored \
--restore-time 2026-01-01T12:00:00Z
GCP:
# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14
# Restore from a backup
gcloud sql backups restore BACKUP_ID \
--restore-instance sentryagent-agentidp-production-pg14
6. Environment Variable Reference
All environment variables injected into the AgentIdP container are documented in full at:
docs/devops/environment-variables.md
Quick Reference
| Variable | Required | Source (AWS) | Source (GCP) |
|---|---|---|---|
| DATABASE_URL | Yes | Secrets Manager: /<project>/<env>/database-url | Secret Manager: <name-prefix>-database-url |
| REDIS_URL | Yes | Secrets Manager: /<project>/<env>/redis-url | Secret Manager: <name-prefix>-redis-url |
| JWT_PRIVATE_KEY | Yes | Secrets Manager: /<project>/<env>/jwt-private-key | Secret Manager: <name-prefix>-jwt-private-key |
| JWT_PUBLIC_KEY | Yes | Secrets Manager: /<project>/<env>/jwt-public-key | Secret Manager: <name-prefix>-jwt-public-key |
| PORT | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| NODE_ENV | No | Task definition env var (production) | Cloud Run env var (production) |
| CORS_ORIGIN | No | Task definition env var | Cloud Run env var |
| POLICY_DIR | No | Task definition env var (/app/policies) | Cloud Run env var (/app/policies) |
| VAULT_ADDR | No | Task definition env var | Cloud Run env var |
| VAULT_TOKEN | No | Secrets Manager: /<project>/<env>/vault-token | Secret Manager: <name-prefix>-vault-token |
| VAULT_MOUNT | No | Task definition env var (default: secret) | Cloud Run env var (default: secret) |
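The four variables marked Required in the table can be checked at container start so a missing secret fails fast with a clear message. The following is a sketch only: the function name `missing_env` is hypothetical, and whether the image's entrypoint already performs such a check is not stated in this guide.

```shell
# missing_env NAME... -> prints each NAME that is unset or empty,
# and exits non-zero if any were missing.
missing_env() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh compatible)
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example usage in a hypothetical entrypoint script:
#   missing_env DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY \
#     || { echo "required environment variables are missing" >&2; exit 1; }
```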
Updating a Secret
AWS:
# Update a secret value (e.g. rotate JWT keys)
aws secretsmanager put-secret-value \
--secret-id /sentryagent-agentidp/production/jwt-private-key \
--secret-string "$(cat new_jwt_private.pem)"
# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
--cluster sentryagent-agentidp-production \
--service sentryagent-agentidp-production \
--force-new-deployment
GCP:
# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
--data-file=new_jwt_private.pem
# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
--region us-central1 \
--image sentryagent/agentidp:CURRENT_TAG
Architecture Summary
AWS
Route 53 (A alias)
└── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
    └── Target Group
        └── ECS Fargate Service (private subnets, 2+ tasks)
            ├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
            ├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
            └── ElastiCache Redis 7 (private subnets, primary+replica, TLS)
GCP
Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
    ├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
    └── Serverless VPC Connector
        ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
        └── Memorystore Redis 7 (STANDARD_HA, TLS)
Both environments share the same Docker image (sentryagent/agentidp) and the same JWT key pair, so tokens issued in one environment can be verified in the other.