
# Phase 2: Production-Ready — Technical Design

**Date**: 2026-03-28
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal

---

## 1. HashiCorp Vault Integration

### Architecture

```
AgentIdP Server
  └── CredentialService
        └── VaultClient (new)
              └── HashiCorp Vault (sidecar or external)
                    └── KV Secrets Engine v2
```

### Design Decisions

**ADR-001: Vault over AWS KMS/GCP Secret Manager**
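Vault is cloud-agnostic and widely deployed in enterprise environments, so using it keeps Phase 2 independent of any single cloud provider. Note that Vault has been distributed under the Business Source License since 2023, so it is source-available rather than open-source in the OSI sense.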

**ADR-002: KV Secrets Engine v2**
KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.

**ADR-003: AgentIdP stores Vault path, not secret**
`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
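
As a rough illustration of ADR-002 and ADR-003, the sketch below builds the stored path and maps it to a KV v2 HTTP API URL. KV v2 places a `data/` segment after the mount in its API paths and accepts a `version` query parameter for reading older versions. The helper names `buildVaultPath` and `kvV2DataUrl` are hypothetical and not part of the design.

```typescript
// Hypothetical helpers; names are illustrative only.

/** Builds the value stored in `credentials.vault_path`. */
function buildVaultPath(mount: string, agentId: string, credentialId: string): string {
  return `${mount}/agentidp/agents/${agentId}/credentials/${credentialId}`;
}

/**
 * Maps a stored path to the KV v2 HTTP API URL.
 * KV v2 inserts `data/` after the mount; `?version=N` reads a retained older version.
 */
function kvV2DataUrl(vaultAddr: string, mount: string, vaultPath: string, version?: number): string {
  const prefix = `${mount}/`;
  if (!vaultPath.startsWith(prefix)) {
    throw new Error(`vault_path "${vaultPath}" is not under mount "${mount}"`);
  }
  const relative = vaultPath.slice(prefix.length);
  const base = `${vaultAddr.replace(/\/+$/, "")}/v1/${mount}/data/${relative}`;
  return version === undefined ? base : `${base}?version=${version}`;
}
```

Only the path string would be persisted in PostgreSQL; the URL is derived at request time from `VAULT_ADDR` and `VAULT_MOUNT`.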

### New environment variables
| Variable | Description |
|----------|-------------|
| `VAULT_ADDR` | Vault server address |
| `VAULT_TOKEN` | Vault token used by AgentIdP (prefer a scoped service token; avoid the root token outside local development) |
| `VAULT_MOUNT` | KV mount path (default: `secret`) |
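
A minimal sketch of how these variables might be validated at startup. The `VaultConfig` type and `loadVaultConfig` function are assumptions made for illustration; the environment is passed in as a plain object so the function can be exercised without touching `process.env`.

```typescript
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

/** Reads the Vault settings from an environment-like object and fails fast if any are missing. */
function loadVaultConfig(env: Record<string, string | undefined>): VaultConfig {
  const missing = ["VAULT_ADDR", "VAULT_TOKEN"].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    addr: env.VAULT_ADDR as string,
    token: env.VAULT_TOKEN as string,
    mount: env.VAULT_MOUNT || "secret", // default from the table above
  };
}
```

In the server this would presumably be called once as `loadVaultConfig(process.env)` before the `VaultClient` is constructed.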

### Migration
Add `vault_path` column to `credentials` table (`005_add_vault_path.sql`). Existing credentials retain bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials are rotated (migration guide provided).
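
The coexistence of the two code paths could be sketched as a small dispatch on the row contents. The `CredentialRow` field names (in particular `secretHash` for the legacy bcrypt column) are assumptions, since the existing schema is not shown here.

```typescript
// Hypothetical row shape; the real column names may differ.
interface CredentialRow {
  credentialId: string;
  secretHash: string | null; // legacy bcrypt hash, null for Vault-backed credentials
  vaultPath: string | null;  // set for credentials created after the migration
}

type SecretSource =
  | { kind: "vault"; path: string }
  | { kind: "bcrypt"; hash: string };

/** Decides which verification path applies to a credential during the migration window. */
function resolveSecretSource(row: CredentialRow): SecretSource {
  if (row.vaultPath) return { kind: "vault", path: row.vaultPath };
  if (row.secretHash) return { kind: "bcrypt", hash: row.secretHash };
  throw new Error(`Credential ${row.credentialId} has neither vault_path nor a bcrypt hash`);
}
```

Preferring `vaultPath` when both columns are populated means a rotated credential switches to Vault immediately, even if the old hash has not yet been cleared.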

---

## 2. Multi-Language SDKs

### Shared contract (all SDKs implement identically)

```
AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
  .agents     → AgentRegistryClient   (5 methods)
  .credentials → CredentialClient     (4 methods)
  .tokens     → TokenClient           (2 methods)
  .audit      → AuditClient           (2 methods)
  .clearTokenCache()

TokenManager — auto-refresh 60s before expiry
AgentIdPError — code, message, httpStatus, details
```
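
To make the refresh rule and the error shape concrete, here is a minimal TypeScript sketch. Only the 60-second window and the four error fields come from the contract above; the `CachedToken` type, the injected `fetchToken` callback, and the `needsRefresh` helper are assumptions for illustration.

```typescript
const REFRESH_SKEW_MS = 60_000; // refresh 60s before expiry, per the contract

interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry in epoch milliseconds
}

class AgentIdPError extends Error {
  readonly code: string;
  readonly httpStatus: number;
  readonly details: unknown;

  constructor(code: string, message: string, httpStatus: number, details?: unknown) {
    super(message);
    this.name = "AgentIdPError";
    this.code = code;
    this.httpStatus = httpStatus;
    this.details = details;
    Object.setPrototypeOf(this, AgentIdPError.prototype); // keeps instanceof working on older targets
  }
}

/** True when there is no token or it is within the refresh window. */
function needsRefresh(token: CachedToken | undefined, nowMs: number): boolean {
  return token === undefined || nowMs >= token.expiresAtMs - REFRESH_SKEW_MS;
}

class TokenManager {
  private cached: CachedToken | undefined;
  private readonly fetchToken: () => Promise<CachedToken>;

  constructor(fetchToken: () => Promise<CachedToken>) {
    this.fetchToken = fetchToken; // hypothetical callback that calls the token endpoint
  }

  async getToken(): Promise<string> {
    if (needsRefresh(this.cached, Date.now())) {
      this.cached = await this.fetchToken();
    }
    return (this.cached as CachedToken).accessToken;
  }

  clearTokenCache(): void {
    this.cached = undefined;
  }
}
```

Keeping `needsRefresh` as a pure function makes the expiry rule easy to port unchanged to the Python, Go, and Java SDKs.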

### Python SDK (`sentryagent-idp`)
- Python 3.9+ (httpx for async, requests for sync)
- Both sync and async client variants
- PyPI package: `sentryagent-idp`
- Type hints throughout (`mypy --strict` clean)

### Go SDK (`github.com/sentryagent/idp-sdk-go`)
- Go 1.21+, standard library `net/http`
- Context-aware methods (`context.Context` first arg)
- Idiomatic Go error handling (`error` return, no panic)
- Go module: `github.com/sentryagent/idp-sdk-go`

### Java SDK (`ai.sentryagent:idp-sdk`)
- Java 17+, Apache HttpClient 5
- Synchronous and CompletableFuture async variants
- Maven Central: `ai.sentryagent:idp-sdk`
- Fully typed with generics

---

## 3. OPA Policy Engine

### Architecture

```
HTTP Request
  → Auth Middleware (JWT verify) — unchanged
  → OPA Middleware (new) — evaluates policy
      → OPA Wasm (embedded, no network call)
          → Rego policy files (hot-reloadable)
  → Controller
```

### Design Decisions

**ADR-004: OPA Wasm over OPA sidecar**
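Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policies are loaded from the `policies/` directory at startup and reloaded on SIGHUP. Because the Wasm runtime executes compiled modules rather than raw Rego, the `.rego` sources must first be compiled to Wasm (for example with `opa build -t wasm`).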
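
One way the reload-on-SIGHUP behaviour might be structured is a holder that swaps the active policy only when loading succeeds. This is a sketch under simplifying assumptions: the `Policy` type reduces a loaded policy to a single decision function, and the loader is synchronous, whereas a real Wasm loader would likely be asynchronous.

```typescript
/** A loaded policy, reduced to one decision function for this sketch. */
type Policy = (input: unknown) => boolean;

class PolicyHolder {
  private current: Policy;

  constructor(initial: Policy) {
    this.current = initial;
  }

  evaluate(input: unknown): boolean {
    return this.current(input);
  }

  /** Swaps in a freshly loaded policy. If loading fails, the old policy stays active. */
  reload(load: () => Policy): boolean {
    try {
      this.current = load();
      return true;
    } catch {
      return false;
    }
  }
}

// Possible wiring in the server (Node.js; `loadPolicyFromDisk` is hypothetical):
// process.on("SIGHUP", () => holder.reload(loadPolicyFromDisk));
```

Keeping the previous policy on a failed reload avoids an outage caused by a malformed policy file, at the cost of silently serving stale rules unless the failure is logged.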

**ADR-005: Policy replaces, does not wrap, scope check**
The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.

### Policy structure (`policies/`)
```
policies/
  authz.rego          — main policy: allow/deny
  data/
    scopes.json       — scope → permission mapping
```
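
The decision that `authz.rego` and `scopes.json` are expected to make can be approximated in TypeScript as below. This is a stand-in for illustration, not the Rego policy itself: the `ScopeMap` shape (scope mapped to `"METHOD /path"` strings) and the `PolicyInput` fields are assumptions about data the design does not yet specify.

```typescript
// Hypothetical shape of policies/data/scopes.json: scope -> permitted "METHOD /path" entries.
type ScopeMap = Record<string, string[]>;

interface PolicyInput {
  method: string;
  path: string;     // route template, e.g. "/agents/:id"
  scopes: string[]; // scopes taken from the verified JWT
}

/** Default deny: allow only if some granted scope maps to the requested operation. */
function allow(input: PolicyInput, scopeMap: ScopeMap): boolean {
  const wanted = `${input.method.toUpperCase()} ${input.path}`;
  return input.scopes.some((scope) => (scopeMap[scope] ?? []).includes(wanted));
}
```

The OPA middleware would build a `PolicyInput`-like object after the auth middleware has verified the JWT, then respond with 403 when the policy denies.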

---

## 4. Web Dashboard UI

### Architecture

```
dashboard/            (new — separate from sdk/)
  src/
    components/       — reusable UI components
    pages/            — Agents, Credentials, Audit, Health
    hooks/            — useAgents, useCredentials, useAudit
    lib/
      client.ts       — wraps @sentryagent/idp-sdk
      auth.ts         — credential entry and storage
```

### Tech Stack
- React 18 + TypeScript strict
- Vite 5 (build tool)
- TanStack Query v5 (server state)
- shadcn/ui components (Radix UI + Tailwind CSS)

### Pages
| Page | Scope Required | Features |
|------|---------------|----------|
| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
| Health | None | Server uptime, Redis/PostgreSQL connectivity |
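
The scope requirements in this table could drive navigation directly. The sketch below assumes the dashboard knows the scopes granted to the current token; `DashboardPage` and `visiblePages` are hypothetical names.

```typescript
interface DashboardPage {
  name: string;
  requiredScope: string | null; // null means no scope is required
}

// Mirrors the Pages table above.
const PAGES: DashboardPage[] = [
  { name: "Agents", requiredScope: "agents:read" },
  { name: "Credentials", requiredScope: "agents:read" },
  { name: "Audit Log", requiredScope: "audit:read" },
  { name: "Health", requiredScope: null },
];

/** Returns the names of pages the current token is allowed to see. */
function visiblePages(grantedScopes: string[]): string[] {
  return PAGES
    .filter((page) => page.requiredScope === null || grantedScopes.includes(page.requiredScope))
    .map((page) => page.name);
}
```

Hiding a page is only a usability measure; the server-side policy check remains the actual access control.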

### Authentication
The dashboard accepts `clientId` + `clientSecret` via a login form. The `@sentryagent/idp-sdk` `TokenManager` handles token acquisition and caching in `sessionStorage`. No backend session — all state is client-side.
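
A minimal sketch of the `sessionStorage` caching described above. The storage key and the `StoredToken` shape are assumptions; the `KeyValueStore` interface is a subset of the Web Storage API, so `window.sessionStorage` can be passed in directly and a fake can be used in tests.

```typescript
/** Subset of the Web Storage API used here; `window.sessionStorage` satisfies it. */
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface StoredToken {
  accessToken: string;
  expiresAtMs: number;
}

const TOKEN_KEY = "agentidp.token"; // hypothetical storage key

function saveToken(store: KeyValueStore, token: StoredToken): void {
  store.setItem(TOKEN_KEY, JSON.stringify(token));
}

/** Returns the cached token if present, well-formed, and unexpired; otherwise clears it. */
function loadToken(store: KeyValueStore, nowMs: number): StoredToken | null {
  const raw = store.getItem(TOKEN_KEY);
  if (raw === null) return null;
  try {
    const token = JSON.parse(raw) as StoredToken;
    if (
      typeof token.accessToken === "string" &&
      typeof token.expiresAtMs === "number" &&
      token.expiresAtMs > nowMs
    ) {
      return token;
    }
  } catch {
    // Corrupt entry: fall through and clear it.
  }
  store.removeItem(TOKEN_KEY);
  return null;
}
```

Because `sessionStorage` is readable by any script running on the page, this approach relies on the dashboard being free of XSS; only the short-lived token is stored in this sketch, not the `clientSecret`.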

---

## 5. Prometheus + Grafana Monitoring

### Metrics exposed at `GET /metrics`

| Metric | Type | Description |
|--------|------|-------------|
| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
| `agentidp_agents_registered_total` | Counter | Agent registrations |
| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |
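
To show what a labelled counter looks like on the wire, here is a dependency-free sketch that renders the Prometheus text exposition format. The server itself would presumably use `prom-client` rather than a hand-rolled class; the `Counter` class below is illustrative only and does not escape label values.

```typescript
type Labels = Record<string, string>;

class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  /** Increments the series identified by the given label set. */
  inc(labels: Labels = {}, by = 1): void {
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`) // note: label values are not escaped in this sketch
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  /** Renders HELP/TYPE headers followed by one line per label combination. */
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      lines.push(key === "" ? `${this.name} ${value}` : `${this.name}{${key}} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}
```

For `agentidp_http_requests_total`, labelling by route template (such as `/agents/:id`) rather than the raw URL is one way to keep the number of distinct series bounded.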

### Grafana dashboard
Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.

### Docker Compose extension
Add `prometheus` and `grafana` services in a `docker-compose.monitoring.yml` overlay. This keeps the base `docker-compose.yml` clean for developers who don't need monitoring.

---

## 6. Multi-Region Deployment (Terraform)

### Structure

```
terraform/
  modules/
    agentidp/         — reusable module: compute + networking
    rds/              — managed PostgreSQL
    redis/            — managed Redis
    lb/               — load balancer + TLS
  environments/
    aws/              — AWS-specific config (ECS + RDS + ElastiCache)
    gcp/              — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
```

### Design Decisions

**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
AWS and GCP cover the majority of developer deployments. An Azure module is deferred to Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.

**ADR-007: Terraform over Pulumi/CDK**
Terraform is the most widely used IaC tool and is familiar to most DevOps teams. Its declarative HCL syntax is also easier to present in documentation than Pulumi or CDK programs.

---

## Component Interaction Map (Phase 2)

```
                      ┌────────────────────┐
                      │   Web Dashboard    │
                      │  (React + Vite)    │
                      └────────┬───────────┘
                               │ HTTPS
              ┌────────────────▼────────────────┐
              │         AgentIdP Server         │
              │  Auth MW → OPA MW → Controllers │
              │  /metrics (prom-client)         │
              └──┬──────────┬──────────┬────────┘
                 │          │          │
           ┌─────▼──┐  ┌────▼───┐  ┌──▼───────┐
           │Postgres│  │ Redis  │  │  Vault   │
           └────────┘  └────────┘  └──────────┘
                 │
        ┌────────▼────────┐
        │   Prometheus    │
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │    Grafana      │
        └─────────────────┘
```