# Deployment Guide — SentryAgent.ai AgentIdP

End-to-end guide for deploying AgentIdP to AWS (primary) and GCP (secondary) using the Terraform infrastructure-as-code in `terraform/`.

---

## Table of Contents

1. [Prerequisites](#1-prerequisites)
2. [AWS Deployment](#2-aws-deployment)
3. [GCP Deployment](#3-gcp-deployment)
4. [Post-Deploy Verification](#4-post-deploy-verification)
5. [Rollback Procedure](#5-rollback-procedure)
6. [Environment Variable Reference](#6-environment-variable-reference)

---

## 1. Prerequisites

### Tools

| Tool | Minimum Version | Install |
|------|-----------------|---------|
| Terraform | 1.6.0 | https://developer.hashicorp.com/terraform/install |
| AWS CLI | 2.13 | https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html |
| gcloud CLI | 460.0 | https://cloud.google.com/sdk/docs/install |
| Docker | 24.0 | Required only for building and pushing images |
| openssl | any | Required for generating JWT key pairs |

Verify all tools are available:

```bash
terraform version
aws --version
gcloud version
docker version
openssl version
```
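
The commands above print each version but do not compare it with the minimums in the table. A minimal bash sketch of that comparison follows. The helper names (`version_ge`, `first_version`, `check_tool`) are hypothetical, and the sketch assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# version_ge A B: succeed when version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# first_version TEXT: print the first dotted version number found in TEXT.
first_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

# check_tool NAME MINIMUM CMD...: run CMD and compare its reported version.
check_tool() {
  local name=$1 min=$2 found
  shift 2
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "MISSING  $name (need >= $min)"
    return 1
  fi
  found=$(first_version "$("$@" 2>&1)")
  if version_ge "$found" "$min"; then
    echo "OK       $name $found"
  else
    echo "TOO OLD  $name $found (need >= $min)"
    return 1
  fi
}

# Example calls (minimums taken from the Tools table):
#   check_tool Terraform 1.6.0 terraform version
#   check_tool "AWS CLI" 2.13 aws --version
#   check_tool gcloud 460.0 gcloud version
```

The first dotted number in a tool's output is treated as its version, which holds for the three tools in the example calls but is not guaranteed for every CLI.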

### Container Image

Build and push the `sentryagent/agentidp` image to your registry before deploying. Terraform references the image by tag — it does not build it.

```bash
# From the project root
docker build -t sentryagent/agentidp:1.0.0 .

# Push to your registry (ECR example):
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag sentryagent/agentidp:1.0.0 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0

docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sentryagent/agentidp:1.0.0
```

Update `app_image_tag` in your `terraform.tfvars` to match.

### JWT Key Pair

Generate the RSA-2048 key pair used for signing and verifying JWTs:

```bash
openssl genrsa -out jwt_private.pem 2048
openssl rsa -in jwt_private.pem -pubout -out jwt_public.pem

# Verify
openssl rsa -in jwt_private.pem -check -noout
```

Keep `jwt_private.pem` secure — treat it with the same sensitivity as a TLS private key. You will paste its contents into `terraform.tfvars`.
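
Since both files are later pasted into `terraform.tfvars`, it can be worth confirming that they really belong to the same key pair. The sketch below re-derives the public key from the private key and compares it with the public key file. `keypair_matches` is a hypothetical helper name, not part of the project.

```shell
# keypair_matches PRIVATE_PEM PUBLIC_PEM: succeed when PUBLIC_PEM is the
# public half of PRIVATE_PEM. Returns 2 if the private key cannot be read.
keypair_matches() {
  local derived
  derived=$(openssl rsa -in "$1" -pubout 2>/dev/null) || return 2
  [ "$derived" = "$(cat "$2")" ]
}

# Example:
#   keypair_matches jwt_private.pem jwt_public.pem && echo "key pair matches"
```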

---

## 2. AWS Deployment

### 2.1 Configure AWS CLI

```bash
aws configure
# Provide: AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), output format (json)

# Verify credentials
aws sts get-caller-identity
```

The IAM principal running Terraform requires permissions to manage: VPC, ECS, RDS, ElastiCache, ALB, IAM roles, Secrets Manager, Route 53, CloudWatch, and VPC endpoints.

### 2.2 Provision an ACM Certificate

The ALB requires an ACM certificate for your domain. Create it in the same region as your deployment.

```bash
aws acm request-certificate \
  --domain-name idp.example.com \
  --validation-method DNS \
  --region us-east-1
```

Complete DNS validation by adding the CNAME record shown in the ACM console. Wait for the status to become `ISSUED` before proceeding.

```bash
# Monitor validation status
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/XXXX \
  --region us-east-1 \
  --query 'Certificate.Status'
```

### 2.3 Prepare tfvars

```bash
cd terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`. All fields marked `REPLACE_WITH_*` are required. Key fields:

- `region` — AWS region (must match the ACM certificate region)
- `domain_name` — your domain (e.g. `idp.example.com`)
- `certificate_arn` — ARN from step 2.2
- `app_image_tag` — tag of the image you pushed in step 1
- `db_password` — strong random password (no `@`, `#`, `?`, `/` characters — they break URL parsing)
- `redis_auth_token` — minimum 16 characters, no spaces
- `jwt_private_key` — full PEM contents of `jwt_private.pem` with literal `\n` for newlines
- `jwt_public_key` — full PEM contents of `jwt_public.pem` with literal `\n` for newlines
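
The character and length constraints on `db_password` and `redis_auth_token` are easy to violate by accident. The following bash sketch generates alphanumeric secrets and checks candidates against the rules listed above. All three function names are hypothetical, and restricting generated values to letters and digits is a simplifying choice, not a project requirement.

```shell
# gen_secret LENGTH: print LENGTH random alphanumeric characters.
gen_secret() {
  LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c "$1"
}

# valid_db_password VALUE: reject empty values and the URL-breaking
# characters @, #, ? and /.
valid_db_password() {
  case $1 in
    ''|*[@#?/]*) return 1 ;;
  esac
}

# valid_redis_token VALUE: require at least 16 characters and no whitespace.
valid_redis_token() {
  [ "${#1}" -ge 16 ] || return 1
  case $1 in
    *[[:space:]]*) return 1 ;;
  esac
}

# Example:
#   pw=$(gen_secret 32); valid_db_password "$pw" && echo "db_password ok"
```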

Example for encoding PEM keys in tfvars:

```bash
# Output the private key as a single line with \n separators (for pasting into tfvars)
awk 'NF {printf "%s\\n", $0}' jwt_private.pem
```
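
To confirm that the single-line form still decodes to the original key, the encoding can be round-tripped before pasting. This is a small sketch with hypothetical function names. It relies on `printf '%b'` expanding the literal `\n` sequences, which works here because PEM content contains no other backslashes.

```shell
# pem_to_tfvars FILE: print the PEM as one line with literal \n separators.
pem_to_tfvars() {
  awk 'NF {printf "%s\\n", $0}' "$1"
}

# tfvars_to_pem STRING: expand the literal \n separators back into newlines.
tfvars_to_pem() {
  printf '%b' "$1"
}

# Example round trip (prints nothing when the files are identical):
#   tfvars_to_pem "$(pem_to_tfvars jwt_private.pem)" | diff - jwt_private.pem
```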

**Never commit `terraform.tfvars` to version control.**

### 2.4 Configure Remote State (Recommended)

Uncomment and configure the `backend "s3"` block in `terraform/environments/aws/main.tf`:

```hcl
backend "s3" {
  bucket         = "your-terraform-state-bucket"
  key            = "agentidp/aws/production/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "your-terraform-locks-table"
}
```

Create the S3 bucket and DynamoDB table if they do not exist:

```bash
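# S3 bucket with versioning and encryption.
# Outside us-east-1, create-bucket also needs: --create-bucket-configuration LocationConstraint=<region>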
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket your-terraform-state-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# DynamoDB table for state locking
aws dynamodb create-table \
  --table-name your-terraform-locks-table \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```

### 2.5 Terraform Init

```bash
cd terraform/environments/aws
terraform init
```

Expected output: provider plugins downloaded, backend initialized.

### 2.6 Terraform Plan

```bash
terraform plan -out=tfplan
```

Review the plan carefully before applying. Expected resources on first apply: ~50–60 resources (VPC, subnets, NAT gateways, VPC endpoints, IAM roles, secrets, RDS, ElastiCache, ALB, ECS cluster, task definition, service, Route 53 record).
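
One way to make the review less error-prone is to count the planned actions and look closely at any replacements or destroys. The sketch below parses the human-readable plan text. `plan_summary` is a hypothetical helper, and the line patterns it matches reflect Terraform's current text output, which is not a stable interface (`terraform show -json` is the more robust source for automation).

```shell
# plan_summary: read `terraform show -no-color tfplan` output on stdin and
# print how many resources are created, updated, replaced and destroyed.
plan_summary() {
  awk '
    /^ *# .* will be created$/          { c++ }
    /^ *# .* will be updated in-place$/ { u++ }
    /^ *# .* must be replaced$/         { r++ }
    /^ *# .* will be destroyed$/        { d++ }
    END { printf "create=%d update=%d replace=%d destroy=%d\n", c, u, r, d }
  '
}

# Example:
#   terraform show -no-color tfplan | plan_summary
```

On a first apply, every count except `create` would be expected to be zero.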

### 2.7 Terraform Apply

```bash
terraform apply tfplan
```

**First apply takes 20–30 minutes** — RDS Multi-AZ provisioning is the longest step (~15 min). Do not interrupt the apply.

When complete, note the outputs:

```bash
terraform output
```

Key outputs:
- `service_url` — the HTTPS URL of your deployed service
- `alb_dns_name` — ALB DNS name (verify Route 53 alias is pointing here)
- `ecs_service_name` — use for ECS deployment commands
- `cloudwatch_log_group` — where container logs appear

### 2.8 Run Database Migrations

After first deploy, run migrations against the new RDS instance. The easiest approach is to exec into a running ECS task:

```bash
# Get a running task ARN
TASK_ARN=$(aws ecs list-tasks \
  --cluster sentryagent-agentidp-production \
  --service-name sentryagent-agentidp-production \
  --query 'taskArns[0]' \
  --output text)

# Run migrations via ECS Exec (requires enableExecuteCommand on the service)
aws ecs execute-command \
  --cluster sentryagent-agentidp-production \
  --task "$TASK_ARN" \
  --container agentidp \
  --command "node scripts/db-migrate.js" \
  --interactive
```

Alternatively, run a one-off ECS task with the migration command as the container override.
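
A rough sketch of the one-off task approach is shown below. The function names are hypothetical. The subnet and security group IDs are placeholders you would take from your own VPC, and the task definition family is assumed to share the service name used elsewhere in this guide.

```shell
# migration_overrides CONTAINER: print the --overrides JSON that replaces the
# container command with the migration script.
migration_overrides() {
  printf '{"containerOverrides":[{"name":"%s","command":["node","scripts/db-migrate.js"]}]}\n' "$1"
}

# run_migration_task SUBNET_ID SECURITY_GROUP_ID: launch the one-off task in
# a private subnet. Assumes the family name matches the service name.
run_migration_task() {
  aws ecs run-task \
    --cluster sentryagent-agentidp-production \
    --task-definition sentryagent-agentidp-production \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[$1],securityGroups=[$2],assignPublicIp=DISABLED}" \
    --overrides "$(migration_overrides agentidp)"
}

# Example (placeholder IDs):
#   run_migration_task subnet-0123456789abcdef0 sg-0123456789abcdef0
```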

---

## 3. GCP Deployment

### 3.1 Configure gcloud CLI

```bash
gcloud auth login
gcloud config set project your-gcp-project-id
gcloud auth application-default login
```

Verify:

```bash
gcloud config list
gcloud projects describe your-gcp-project-id
```

The principal running Terraform requires `roles/owner` on the project, or a custom role covering Cloud Run Admin, Cloud SQL Admin, Redis Admin, Secret Manager Admin, IAM Admin, Compute Admin, and Service Networking Admin.

### 3.2 Prepare tfvars

```bash
cd terraform/environments/gcp
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`. Key fields:

- `project_id` — your GCP project ID
- `region` — GCP region (e.g. `us-central1`)
- `app_image_tag` — tag of the image you built
- `db_password` — strong random password for Cloud SQL
- `jwt_private_key` / `jwt_public_key`: the same PEM key pair used for AWS (both clouds must share one key pair)

**Never commit `terraform.tfvars` to version control.**

### 3.3 Configure Remote State (Recommended)

Uncomment and configure the `backend "gcs"` block in `terraform/environments/gcp/main.tf`:

```hcl
backend "gcs" {
  bucket = "your-terraform-state-bucket"
  prefix = "agentidp/gcp/production"
}
```

Create the GCS bucket:

```bash
gsutil mb -l us-central1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
```

### 3.4 Terraform Init

```bash
cd terraform/environments/gcp
terraform init
```

### 3.5 Terraform Plan

```bash
terraform plan -out=tfplan
```

Review the plan. Expected resources: ~35–45 resources (VPC, subnet, VPC connector, service accounts, secrets, Cloud SQL, Memorystore, Cloud Run service, IAM bindings, API enablement).

### 3.6 Terraform Apply

```bash
terraform apply tfplan
```

**First apply takes 15–20 minutes** — Cloud SQL provisioning is the longest step.

When complete:

```bash
terraform output
```

Key outputs:
- `service_url` — Cloud Run HTTPS URL (Google-managed TLS, no cert setup required)
- `cloud_sql_connection_name` — for Cloud SQL Proxy if needed
- `memorystore_host` — Redis private IP

### 3.7 Run Database Migrations

Cloud Run does not support exec. Use a one-off Cloud Run Job for migrations:

```bash
gcloud run jobs create agentidp-migrate \
  --image sentryagent/agentidp:1.0.0 \
  --region us-central1 \
  --command node \
  --args "scripts/db-migrate.js" \
  --set-secrets "DATABASE_URL=sentryagent-agentidp-production-database-url:latest" \
  --vpc-connector sentryagent-agentidp-production-connector \
  --service-account sentryagent-agentidp-production-run-sa@your-gcp-project-id.iam.gserviceaccount.com

gcloud run jobs execute agentidp-migrate --region us-central1 --wait
```

---

## 4. Post-Deploy Verification

Run these checks after deploying to either environment. Replace `https://idp.example.com` with your actual service URL.

### 4.1 Health Check

```bash
curl -si https://idp.example.com/health
```

Expected response:

```
HTTP/2 200
content-type: application/json

{"status":"ok"}
```

If you receive a 502 or 503, the load balancer has not yet registered healthy targets. Wait 60–90 seconds and retry — ECS tasks or Cloud Run instances take time to pass health checks.
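
The wait-and-retry step can be scripted rather than done by hand. The following bash sketch polls the health endpoint until it returns 200 or a limit is reached. `retry` and `health_ok` are hypothetical helpers, and the attempt count and delay in the example are arbitrary.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping DELAY
# seconds between tries. Fails after ATTEMPTS unsuccessful tries.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# health_ok BASE_URL: succeed only when BASE_URL/health answers HTTP 200.
health_ok() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' "$1/health")" = "200" ]
}

# Example: up to 12 tries, 10 seconds apart (about two minutes in total).
#   retry 12 10 health_ok https://idp.example.com || echo "service not healthy"
```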

### 4.2 Metrics Endpoint

```bash
curl -si https://idp.example.com/metrics
```

Expected: HTTP 200 with Prometheus-format metrics text (lines beginning with `# HELP`, `# TYPE`, and metric values).
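
For a scripted check, the response body can be tested for the two comment line types mentioned above. This is a loose sketch with a hypothetical function name. It only confirms that the output looks like the Prometheus text format, not that any particular metric is present.

```shell
# looks_like_prometheus: read a /metrics response body on stdin and succeed
# when it contains at least one "# HELP" line and one "# TYPE" line.
looks_like_prometheus() {
  local body
  body=$(cat)
  printf '%s\n' "$body" | grep -q '^# HELP ' &&
    printf '%s\n' "$body" | grep -q '^# TYPE '
}

# Example:
#   curl -s https://idp.example.com/metrics | looks_like_prometheus && echo "metrics ok"
```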

### 4.3 Token Endpoint (Smoke Test)

Register a test agent client first (this requires a valid JWT or admin credentials; see the [developers guide](../developers/)). Then request a token with the client credentials grant:

```bash
# Issue a client credentials token (replace test-client and test-secret with real values)
curl -s -X POST https://idp.example.com/api/v1/token \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=test-client&client_secret=test-secret&scope=read"
```

Expected response (abbreviated):

```json
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "read"
}
```
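
To inspect what the token actually contains, its header and payload segments can be decoded locally. The sketch below is a hypothetical helper that handles the base64url alphabet and missing padding. It assumes a `base64` that accepts `-d`, and it does not verify the signature.

```shell
# jwt_segment TOKEN INDEX: decode segment INDEX of a JWT
# (1 = header, 2 = payload) and print the JSON it contains.
jwt_segment() {
  local part
  part=$(printf '%s' "$1" | cut -d. -f"$2" | tr '_-' '/+')
  # base64url drops the trailing '=' padding, so restore it before decoding.
  case $(( ${#part} % 4 )) in
    2) part="$part==" ;;
    3) part="$part=" ;;
  esac
  printf '%s' "$part" | base64 -d
}

# Example, with TOKEN holding the access_token value from the response:
#   jwt_segment "$TOKEN" 1   # header
#   jwt_segment "$TOKEN" 2   # payload (claims)
```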

### 4.4 JWKS Endpoint

```bash
curl -si https://idp.example.com/.well-known/jwks.json
```

Expected: HTTP 200 with a JSON object containing a `keys` array with at least one RSA public key entry.

### 4.5 TLS Verification

```bash
# Verify TLS certificate is valid and matches your domain
curl -vI https://idp.example.com 2>&1 | grep -E "(SSL|TLS|certificate|issuer|subject)"
```

Expected: TLS 1.2 or 1.3, certificate issued by a trusted CA, subject matching your domain.

### 4.6 AWS-Specific: ECS Service Status

```bash
aws ecs describe-services \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
```

Expected: `running` equals `desired`, `status` is `ACTIVE`.

### 4.7 GCP-Specific: Cloud Run Service Status

```bash
gcloud run services describe sentryagent-agentidp-production \
  --region us-central1 \
  --format='value(status.conditions[0].type,status.conditions[0].status)'
```

Expected: `Ready True`.

---

## 5. Rollback Procedure

### 5.1 Image Rollback (Recommended — fastest)

To roll back to a previous image tag without modifying infrastructure:

**AWS:**

```bash
# Find the previous task definition revision
aws ecs list-task-definitions \
  --family-prefix sentryagent-agentidp-production \
  --sort DESC \
  --query 'taskDefinitionArns[:5]'

# Update the service to use the previous task definition
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --task-definition sentryagent-agentidp-production:PREVIOUS_REVISION \
  --force-new-deployment

# Monitor the rollout
aws ecs wait services-stable \
  --cluster sentryagent-agentidp-production \
  --services sentryagent-agentidp-production
```
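
If the service is known to be running the revision that caused the problem, the `PREVIOUS_REVISION` value can be derived from the current task definition ARN. The helper below is a hypothetical sketch. It assumes the immediately preceding revision is the one you want and that it has not been deregistered.

```shell
# previous_task_definition ARN: print "family:revision-1" for a task
# definition ARN. Fails when the ARN already points at revision 1.
previous_task_definition() {
  local family_rev family rev
  family_rev=${1##*/}     # e.g. sentryagent-agentidp-production:42
  family=${family_rev%:*}
  rev=${family_rev##*:}
  [ "$rev" -gt 1 ] || return 1
  printf '%s:%s\n' "$family" "$((rev - 1))"
}

# Example:
#   CURRENT=$(aws ecs describe-services \
#     --cluster sentryagent-agentidp-production \
#     --services sentryagent-agentidp-production \
#     --query 'services[0].taskDefinition' --output text)
#   previous_task_definition "$CURRENT"
```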

**GCP:**

```bash
# Deploy the previous image tag directly
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:PREVIOUS_TAG

# Or route 100% of traffic to a specific revision
gcloud run services update-traffic sentryagent-agentidp-production \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION_NAME=100
```

### 5.2 Infrastructure Rollback via Terraform

If an infrastructure change (not an image update) caused the problem:

```bash
# Check the state and plan to understand what changed
terraform show
terraform plan

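# If the state itself is damaged, restore an earlier object version from the
# versioned bucket. With a remote backend a local terraform.tfstate is ignored,
# so the restored file has to be pushed back.
# AWS: find the VersionId, then download that version
aws s3api list-object-versions --bucket your-terraform-state-bucket \
  --prefix agentidp/aws/production/terraform.tfstate
aws s3api get-object --bucket your-terraform-state-bucket \
  --key agentidp/aws/production/terraform.tfstate \
  --version-id PREVIOUS_VERSION_ID previous.tfstate

# GCP: find the generation number (default.tfstate assumes the default workspace)
gsutil ls -a gs://your-terraform-state-bucket/agentidp/gcp/production/
gsutil cp "gs://your-terraform-state-bucket/agentidp/gcp/production/default.tfstate#GENERATION" previous.tfstate

# Either cloud: push the restored state (-force is needed because its serial is older)
terraform state push -force previous.tfstate
terraform apply -target='<affected_resource>'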
```

**Never run `terraform destroy` in production without CEO approval.**

### 5.3 Database Rollback

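RDS (AWS) and Cloud SQL (GCP) can both be restored to an earlier state. Use this only as a last resort. On AWS, point-in-time restore creates a new DB instance, so the `DATABASE_URL` secret must be updated to point at it. On GCP, the `gcloud sql backups restore` command shown here restores a backup onto the named instance and overwrites its current data.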

**AWS:**

```bash
# Restore to a point before the problematic deployment
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sentryagent-agentidp-production \
  --target-db-instance-identifier sentryagent-agentidp-production-restored \
  --restore-time 2026-01-01T12:00:00Z
```
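
Once the restored instance is available, the `DATABASE_URL` secret has to point at its endpoint. The sketch below swaps only the host part of an existing URL. `swap_db_host` is a hypothetical helper, and the URL shape (`scheme://user:password@host:port/db`) is an assumption about how the secret is stored. It relies on the password containing no `@`, as required for `db_password`.

```shell
# swap_db_host URL NEW_HOST: replace the host in URL, keeping the
# credentials, port and database name unchanged.
swap_db_host() {
  printf '%s\n' "$1" | sed "s|@[^:/]*|@$2|"
}

# Example (secret name follows the pattern in the environment variable table):
#   NEW_HOST=$(aws rds describe-db-instances \
#     --db-instance-identifier sentryagent-agentidp-production-restored \
#     --query 'DBInstances[0].Endpoint.Address' --output text)
#   OLD_URL=$(aws secretsmanager get-secret-value \
#     --secret-id /sentryagent-agentidp/production/database-url \
#     --query SecretString --output text)
#   swap_db_host "$OLD_URL" "$NEW_HOST"
```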

**GCP:**

```bash
# List available backups
gcloud sql backups list --instance sentryagent-agentidp-production-pg14

# Restore from a backup
gcloud sql backups restore BACKUP_ID \
  --restore-instance sentryagent-agentidp-production-pg14
```

---

## 6. Environment Variable Reference

All environment variables injected into the AgentIdP container are documented in full at:

**[docs/devops/environment-variables.md](./environment-variables.md)**

### Quick Reference

| Variable | Required | Source (AWS) | Source (GCP) |
|----------|----------|--------------|--------------|
| `DATABASE_URL` | Yes | Secrets Manager: `/<project>/<env>/database-url` | Secret Manager: `<name-prefix>-database-url` |
| `REDIS_URL` | Yes | Secrets Manager: `/<project>/<env>/redis-url` | Secret Manager: `<name-prefix>-redis-url` |
| `JWT_PRIVATE_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-private-key` | Secret Manager: `<name-prefix>-jwt-private-key` |
| `JWT_PUBLIC_KEY` | Yes | Secrets Manager: `/<project>/<env>/jwt-public-key` | Secret Manager: `<name-prefix>-jwt-public-key` |
| `PORT` | No | Task definition env var (default: 3000) | Cloud Run env var (default: 3000) |
| `NODE_ENV` | No | Task definition env var (`production`) | Cloud Run env var (`production`) |
| `CORS_ORIGIN` | No | Task definition env var | Cloud Run env var |
| `POLICY_DIR` | No | Task definition env var (`/app/policies`) | Cloud Run env var (`/app/policies`) |
| `VAULT_ADDR` | No | Task definition env var | Cloud Run env var |
| `VAULT_TOKEN` | No | Secrets Manager: `/<project>/<env>/vault-token` | Secret Manager: `<name-prefix>-vault-token` |
| `VAULT_MOUNT` | No | Task definition env var (default: `secret`) | Cloud Run env var (default: `secret`) |
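
As a quick preflight, the four variables marked as required in the table can be checked from a shell inside the container (for example through ECS Exec). This is a minimal bash sketch with a hypothetical function name. It only checks that the variables are non-empty, not that their values are valid.

```shell
# missing_required_env: print each required variable that is unset or empty.
# Succeeds only when all four are present.
missing_required_env() {
  local name missing=0
  for name in DATABASE_URL REDIS_URL JWT_PRIVATE_KEY JWT_PUBLIC_KEY; do
    if [ -z "${!name:-}" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   missing_required_env && echo "all required variables are set"
```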

### Updating a Secret

**AWS:**

```bash
# Update a secret value (e.g. rotate JWT keys; update jwt-public-key the same way so the pair stays matched)
aws secretsmanager put-secret-value \
  --secret-id /sentryagent-agentidp/production/jwt-private-key \
  --secret-string "$(cat new_jwt_private.pem)"

# Force new ECS deployment to pick up the new secret value
aws ecs update-service \
  --cluster sentryagent-agentidp-production \
  --service sentryagent-agentidp-production \
  --force-new-deployment
```

**GCP:**

```bash
# Add a new version of a secret
gcloud secrets versions add sentryagent-agentidp-production-jwt-private-key \
  --data-file=new_jwt_private.pem

# Deploy a new Cloud Run revision to pick up the latest secret version
gcloud run services update sentryagent-agentidp-production \
  --region us-central1 \
  --image sentryagent/agentidp:CURRENT_TAG
```

---

## Architecture Summary

### AWS

```
Route 53 (A alias)
  └── ALB (public subnets, HTTPS/443, ACM cert, HTTP→HTTPS redirect)
        └── Target Group
              └── ECS Fargate Service (private subnets, 2+ tasks)
                    ├── Secrets Manager (DATABASE_URL, REDIS_URL, JWT keys)
                    ├── RDS PostgreSQL 14 (private subnets, Multi-AZ, encrypted)
                    └── ElastiCache Redis 7 (private subnets, primary+replica, TLS)
```

### GCP

```
Internet → Cloud Run Service (Google-managed TLS, auto-scaling)
                    ├── Secret Manager (DATABASE_URL, REDIS_URL, JWT keys)
                    └── Serverless VPC Connector
                          ├── Cloud SQL PostgreSQL 14 (private IP, REGIONAL HA)
                          └── Memorystore Redis 7 (STANDARD_HA, TLS)
```

Both environments share the same Docker image (`sentryagent/agentidp`) and the same JWT key pair, so tokens issued in one environment can be verified in the other.