# Deployment Guide: SentryAgent.ai AgentIdP

End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in `terraform/`.

---

## Table of Contents

1. [Prerequisites](#1-prerequisites)
2. [AWS Deployment](#2-aws-deployment)
3. [GCP Deployment](#3-gcp-deployment)
4. [Post-Deploy Verification](#4-post-deploy-verification)
5. [Rollback Procedure](#5-rollback-procedure)
6. [Environment Variable Reference](#6-environment-variable-reference)

---

## 1. Prerequisites

### Tools

| Tool | Minimum Version | Install |
|------|-----------------|---------|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |

Verify all tools are available:

```bash
terraform version
aws --version
gcloud version
docker version
openssl version
```

### Container Image

Build and push the `sentryagent/agentidp` image to your registry before deploying. Terraform references the image by tag; it does not build it.

```bash
# From the project root
docker build -t sentryagent/agentidp:1.0.0 .

# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag sentryagent/agentidp:1.0.0 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
```

Update `app_image_tag` in your `terraform.tfvars` to match.
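
The registry hostname is repeated in each of the ECR commands above. As a minimal sketch, assuming the usual `ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com` hostname layout shown in those commands, a small shell function can compose the full image reference once so that the tag and push steps stay in sync. The function name `ecr_image_uri` is hypothetical and not part of the project.

```bash
# Hypothetical helper (not part of the project): compose a full ECR image reference.
# Usage: ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG
ecr_image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Example with the placeholder values used in this guide
IMAGE_URI="$(ecr_image_uri 123456789012 us-east-1 sentryagent/agentidp 1.0.0)"
echo "$IMAGE_URI"
```

The resulting `IMAGE_URI` value could then be passed to `docker tag` and `docker push` in place of the repeated literal.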
### JWT Key Pair

Generate the RSA-2048 key pair used for signing and verifying JWTs:

```bash
openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem

# Verify
openssl rsa -in jwt_private.pem -check -noout
```

Keep `jwt_private.pem` secure: treat it with the same sensitivity as a TLS private key. You will paste its contents into `terraform.tfvars`.

---

## 2. AWS Deployment

### 2.1 Configure AWS CLI

```bash
aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)

# Verify credentials
aws sts get-caller-identity
```

The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.

### 2.2 Provision an ACM Certificate

The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.

```bash
aws acm request-certificate \
  --domain-name idp.example.com \
  --validation-method DNS \
  --region us-east-1
```

Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become `ISSUED` before proceeding.

```bash
# Monitor validation status
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
  --region us-east-1 \
  --query 'Certificate.Status'
```

### 2.3 Prepare tfvars

```bash
cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`. All fields marked `REPLACE_WITH_*` are required. Key fields:

- `region`: AWS region (must match the ACM certificate region)
- `domain_name`: your domain (e.g.
`idp.example.com`)
- `certificate_arn`: ARN from step 2.2
- `app_image_tag`: tag of the image you pushed in step 1
- `db_password`: strong random password (no `@`, `#`, `?`, `/` characters; they break URL parsing)
- `redis_auth_token`: minimum 16 characters, no spaces
- `jwt_private_key`: full PEM contents of `jwt_private.pem` with literal `\n` for newlines
- `jwt_public_key`: full PEM contents of `jwt_public.pem` with literal `\n` for newlines

Example for encoding PEM keys in tfvars:

```bash
# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
```

**Never commit `terraform.tfvars` to version control.**

### 2.4 Configure Remote State (Recommended)

Uncomment and configure the `backend "s3"` block in `terraform/environments/aws/main.tf`:

```hcl
backend "s3" {
  bucket         = "your-terraform-state-bucket"
  key            = "agentidp/aws/production/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "your-terraform-locks-table"
}
```

Create the S3 bucket and DynamoDB table if they do not exist:

```bash
# S3 bucket with versioning and encryption
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket your-terraform-state-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# DynamoDB table for state locking
aws dynamodb create-table \
  --table-name your-terraform-locks-table \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```

### 2.5 Terraform Init

```bash
cd terraform/environments/aws
terraform init
```

Expected output: provider plugins downloaded, backend initialized.
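
Before moving on to the plan, the character constraints that step 2.3 lists for `db_password` and `redis_auth_token` can be checked locally. The following is a minimal sketch that encodes only the rules stated there (no `@`, `#`, `?`, `/` in the database password; at least 16 characters and no spaces in the Redis token). The function names are hypothetical, and the checks say nothing about overall password strength.

```bash
# Hypothetical pre-flight checks for values destined for terraform.tfvars.

# Succeeds when the password is non-empty and contains none of @ # ? /
valid_db_password() {
  case "$1" in
    '') return 1 ;;
    *[@#?/]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Succeeds when the token has at least 16 characters and no spaces
valid_redis_auth_token() {
  [ "${#1}" -ge 16 ] || return 1
  case "$1" in
    *' '*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example usage with throwaway values (never hard-code real secrets)
valid_db_password 'Example-Passw0rd_only' && echo "db_password format OK"
valid_redis_auth_token 'example-token-0123456789' && echo "redis_auth_token format OK"
```

Rejecting a value here is cheaper than discovering a malformed connection URL after a 20-minute apply.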
### 2.6 Terraform Plan

```bash
terraform plan -out=tfplan
```

Review the plan carefully before applying. Expected resources on first apply: ~50–60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).

### 2.7 Terraform Apply

```bash
terraform apply tfplan
```

**First apply takes 20–30 minutes**, with RDS Multi-AZ provisioning the longest step (~15 min). Do not interrupt the apply.

When complete, note the outputs:

```bash
terraform output
```

Key outputs:

- `service_url`: the HTTPS URL of your deployed service
- `alb_dns_name`: ALB DNS name (verify Route 53 alias is pointing here)
- `ecs_service_name`: use for ECS deployment commands
- `cloudwatch_log_group`: where container logs appear

### 2.8 Run Database Migrations

After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:

```bash
# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
  --cluster sentryagent-agentidp-production \
  --service-name sentryagent-agentidp-production \
  --query 'taskArns[0]' \
  --output text)

# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
  --cluster sentryagent-agentidp-production \
  --task $TASK_ARN \
  --container agentidp \
  --command "node scripts/db-migrate.js" \
  --interactive
```

Alternatively, run a one-off ECS task with the migration command as the container override.

---

## 3. GCP Deployment

### 3.1 Configure gcloud CLI

```bash
gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login
```

Verify:

```bash
gcloud config list
gcloud projects describe your-gcp-project-id
```

The principal running Terraform requires the following roles on the project:

- `roles/owner` or a custom role covering: Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, Service Networking Admin.
### 3.2 Prepare tfvars

```bash
cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`. Key fields:

- `project_id`: your GCP project ID
- `region`: GCP region (e.g. `us-central1`)
- `app_image_tag`: tag of the image you built
- `db_password`: strong random password for Cloud SQL
- `jwt_private_key` / `jwt_public_key`: same PEM keys used for AWS (one key pair shared by both environments)

**Never commit `terraform.tfvars` to version control.**

### 3.3 Configure Remote State (Recommended)

Uncomment and configure the `backend "gcs"` block in `terraform/environments/gcp/main.tf`:

```hcl
backend "gcs" {
  bucket = "your-terraform-state-bucket"
  prefix = "agentidp/gcp/production"
}
```

Create the GCS bucket:

```bash
gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
```

### 3.4 Terraform Init

```bash
cd terraform/environments/gcp
terraform init
```

### 3.5 Terraform Plan

```bash
terraform plan -out=tfplan
```

Review the plan. Expected resources: ~35–45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).

### 3.6 Terraform Apply

```bash
terraform apply tfplan
```

**First apply takes 15–20 minutes**, with Cloud SQL provisioning the longest step.

When complete:

```bash
terraform output
```

Key outputs:

- `service_url`: Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
- `cloud_sql_connection_name`: for Cloud SQL Proxy if needed
- `memorystore_host`: Redis private IP

### 3.7 Run Database Migrations

Cloud Run does not support exec.
Use a one-off Cloud Run Job for migrations:

```bash
gcloud run jobs create agentidp-migrate \
  --image sentryagent/agentidp:1.0.0 \
  --region us-central1 \
  --command node \
  --args "scripts/db-migrate.js" \
  --set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
  --vpc-connector sentryagent-agentidp-production-connector \
  --service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com

gcloud run jobs execute agentidp-migrate --region us-central1 --wait
```

---

## 4. Post-Deploy Verification

Run these checks after deploying to either environment. Replace `https://idp.example.com` with your actual service URL.

### 4.1 Health Check

```bash
curl -si https://idp.example.com/health
```

Expected response:

```
HTTP/2 200
content-type: application/json

{"status":"ok"}
```

If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60–90 seconds and retry; ECS tasks or Cloud Run instances take time to pass health checks.

### 4.2 Metrics Endpoint

```bash
curl -si https://idp.example.com/metrics
```

Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with `# HELP`, `# TYPE`, and metric values).
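
For scripted verification, the metrics expectation can be reduced to a simple text check. This is a rough sketch, assuming it is enough to confirm that the body contains at least one `# HELP` line and one `# TYPE` line; it does not validate the Prometheus exposition format in full. The function name `looks_like_prometheus` is hypothetical.

```bash
# Hypothetical check: read a response body on stdin and succeed only if it
# contains at least one "# HELP" line and one "# TYPE" line.
looks_like_prometheus() {
  body="$(cat)"
  # grep -c reads all input and exits non-zero when nothing matches
  printf '%s\n' "$body" | grep -c '^# HELP ' >/dev/null || return 1
  printf '%s\n' "$body" | grep -c '^# TYPE ' >/dev/null || return 1
  return 0
}

# Example usage against the deployed service (replace the URL with your own):
#   curl -s https://idp.example.com/metrics | looks_like_prometheus && echo "metrics OK"
```

Note that `curl -s` is used without `-i` here so that response headers are not mixed into the body being checked.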
### 4.3 Token Endpoint (Smoke Test)

First, register a test agent client (requires a valid JWT or admin credentials; see the [developers guide](../developers/)). Then request a client-credentials token:

```bash
# Issue a client credentials token (replace test-client and test-secret with your client ID and secret)
curl -s -X POST https://idp.example.com/api/v1/token \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=test-client&client_secret=test-secret&scope=read"
```

Expected response (abbreviated):

```json
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "read"
}
```

### 4.4 JWKS Endpoint

```bash
curl -si https://idp.example.com/.well-known/jwks.json
```

Expected: HTTP 200 with a JSON object containing a `keys` array with at least one RSA public key entry.

### 4.5 TLS Verification

```bash
# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"
```

Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.

### 4.6 AWS-Specific: ECS Service Status

```bash
aws ecs describe-services \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
```

Expected: `running` equals `desired`, `status` is `ACTIVE`.

### 4.7 GCP-Specific: Cloud Run Service Status

```bash
gcloud run services describe sentryagent-agentidp-production \
  --region us-central1 \
  --format='value(status.conditions[0].type,status.conditions[0].status)'
```

Expected: `Ready True`.

---

## 5. Rollback Procedure

### 5.1 Image Rollback (Recommended, fastest)

To roll back to a previous image tag without modifying infrastructure:

**AWS:**

```bash
# Find the previous task definition revision
aws ecs list-task-definitions \
  --family-prefix sentryagent-agentidp-production \
  --sort DESC \
  --query 'taskDefinitionArns[:5]'

# Update the service to use the previous task definition
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
  --force-new-deployment

# Monitor the rollout
aws ecs wait services-stable \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production
```

**GCP:**

```bash
# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:PREVIOUS_TAG

# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION_NAME=100
```

### 5.2 Infrastructure Rollback via Terraform

If an infrastructure change (not an image update) caused the problem:

```bash
# Check the state and plan to understand what changed
terraform show
terraform plan

# If you have a previous state file (S3/GCS versioning), restore it:
# AWS:
aws s3 cp s3://your-state-bucket/agentidp/aws/production/terraform.tfstate.PREVIOUS ./terraform.tfstate
terraform apply -target=

# GCP:
gsutil cp gs://your-state-bucket/agentidp/gcp/production/PREVIOUS_VERSION ./terraform.tfstate
terraform apply -target=
```

**Never run `terraform destroy` in production without CEO approval.**

### 5.3 Database Rollback

RDS (AWS) and Cloud SQL (GCP) both support point-in-time restore. Use this only as a last resort: it creates a new DB instance and requires updating the `DATABASE_URL` secret.
**AWS:**

```bash
# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sentryagent-agentidp-production \
  --target-db-instance-identifier sentryagent-agentidp-production-restored \
  --restore-time 2026-01-01T12:00:00Z
```

**GCP:**

```bash
# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14

# Restore from a backup
gcloud sql backups restore BACKUP_ID \
  --restore-instance sentryagent-agentidp-production-pg14
```

---

## 6. Environment Variable Reference

All environment variables injected into the AgentIdP container are documented in full at:
**[docs/devops/environment-variables.md](./environment-variables.md)**

### Quick Reference

| Variable | Required | Source (AWS) | Source (GCP) |
|----------|----------|--------------|--------------|
| `DATABASE_URL` | Yes | Secrets Manager: `///database-url` | Secret Manager: `-database-url` |
| `REDIS_URL` | Yes | Secrets Manager: `///redis-url` | Secret Manager: `-redis-url` |
| `JWT_PRIVATE_KEY` | Yes | Secrets Manager: `///jwt-private-key` | Secret Manager: `-jwt-private-key` |
| `JWT_PUBLIC_KEY` | Yes | Secrets Manager: `///jwt-public-key` | Secret Manager: `-jwt-public-key` |
| `PORT` | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| `NODE_ENV` | No | Task definition env var (`production`) | Cloud Run env var (`production`) |
| `CORS_ORIGIN` | No | Task definition env var | Cloud Run env var |
| `POLICY_DIR` | No | Task definition env var (`/app/policies`) | Cloud Run env var (`/app/policies`) |
| `VAULT_ADDR` | No | Task definition env var | Cloud Run env var |
| `VAULT_TOKEN` | No | Secrets Manager: `///vault-token` | Secret Manager: `-vault-token` |
| `VAULT_MOUNT` | No | Task definition env var (default: `secret`) | Cloud Run env var (default: `secret`) |

### Updating a Secret

**AWS:**

```bash
# Update a secret value (e.g. rotate JWT keys)
aws secretsmanager put-secret-value \
  --secret-id /sentryagent-agentidp/production/jwt-private-key \
  --secret-string "$(cat new_jwt_private.pem)"

# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --force-new-deployment
```

**GCP:**

```bash
# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
  --data-file=new_jwt_private.pem

# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:CURRENT_TAG
```

---

## Architecture Summary

### AWS

```
Route 53 (A alias)
  └── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
        └── Target Group
              └── ECS Fargate Service (private subnets, 2+ tasks)
                    ├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
                    ├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
                    └── ElastiCache Redis 7 (private subnets, primary+replica, TLS)
```

### GCP

```
Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
             ├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
             ├── Serverless VPC Connector
             │     ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
             │     └── Memorystore Redis 7 (STANDARD_HA, TLS)
```

Both environments share the same Docker image (`sentryagent/agentidp`) and the same JWT key pair, so tokens issued in one environment are verifiable in the other.
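
Because both environments must hold the same key pair, it can be worth confirming that a public key file really is the public half of a given private key before loading either into a secret store. The following is a minimal sketch, assuming both files are PEM files produced by the `openssl` commands in the prerequisites; it compares the public key derived from the private key with the contents of the public key file. The function name `jwt_keys_match` is hypothetical, and files with different line endings or extra whitespace would be reported as a mismatch.

```bash
# Hypothetical helper: succeed only if PUBLIC_PEM is the public half of PRIVATE_PEM.
# Usage: jwt_keys_match PRIVATE_PEM PUBLIC_PEM
jwt_keys_match() {
  # Derive the public key from the private key (PEM on stdout, notices on stderr)
  derived="$(openssl rsa -in "$1" -pubout 2>/dev/null)" || return 1
  [ -n "$derived" ] && [ "$derived" = "$(cat "$2")" ]
}

# Example usage with the file names from the prerequisites:
#   jwt_keys_match jwt_private.pem jwt_public.pem && echo "key pair matches"
```

Running the same check against the key files used for each environment is one way to catch a mismatched pair before tokens start failing verification.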