sentryagent-idp/openspec/changes/phase-2-production-ready/design.md
SentryAgent.ai Developer 7593bfe1c1 chore: Phase 2 OpenSpec scoping — proposal, design, specs, tasks
8 workstreams scoped per OpenSpec standards:
1. HashiCorp Vault integration (secret management)
2. Python SDK (sentryagent-idp)
3. Go SDK (idp-sdk-go)
4. Java SDK (ai.sentryagent:idp-sdk)
5. OPA policy engine (dynamic ABAC, hot-reload Rego)
6. Web Dashboard UI (React 18 + TypeScript)
7. Prometheus + Grafana monitoring (7 metrics, pre-built dashboard)
8. Multi-region Terraform deployment (AWS + GCP)

Status: proposed — awaiting CEO dependency approvals (A0.1–A0.5)
before any implementation begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 14:53:09 +00:00


# Phase 2: Production-Ready — Technical Design
**Date**: 2026-03-28
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal
---
## 1. HashiCorp Vault Integration
### Architecture
```
AgentIdP Server
└── CredentialService
    └── VaultClient (new)
        └── HashiCorp Vault (sidecar or external)
            └── KV Secrets Engine v2
```
### Design Decisions
**ADR-001: Vault over AWS KMS/GCP Secret Manager**
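Vault is cloud-agnostic and already standard in enterprise environments, so using it keeps Phase 2 independent of any cloud provider. Note that since 2023 Vault has been distributed as source-available software under the Business Source License rather than under an open-source license.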
**ADR-002: KV Secrets Engine v2**
KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.
**ADR-003: AgentIdP stores Vault path, not secret**
`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
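As a minimal sketch of how the stored path could map onto Vault's KV v2 HTTP API, the helpers below build the `vault_path` value and the corresponding read URL. The function names and the identifier validation rule are hypothetical; the only Vault-specific fact relied on is that KV v2 inserts a `data/` segment after the mount when reading or writing a secret.

```typescript
// Hypothetical helpers for illustration; not part of the AgentIdP codebase.

// Assumption: agent and credential IDs are simple URL-safe tokens (e.g. UUIDs).
const SAFE_SEGMENT = /^[A-Za-z0-9_-]+$/;

/** Builds the logical KV path that would be stored in credentials.vault_path. */
function credentialVaultPath(agentId: string, credentialId: string, mount: string = "secret"): string {
  for (const id of [agentId, credentialId]) {
    if (!SAFE_SEGMENT.test(id)) {
      throw new Error(`invalid path segment: ${id}`);
    }
  }
  return `${mount}/agentidp/agents/${agentId}/credentials/${credentialId}`;
}

/** Converts a stored path into the KV v2 read URL: /v1/<mount>/data/<rest>. */
function kvV2ReadUrl(vaultAddr: string, vaultPath: string, mount: string = "secret"): string {
  const prefix = `${mount}/`;
  if (vaultPath.indexOf(prefix) !== 0) {
    throw new Error(`path is not under mount "${mount}": ${vaultPath}`);
  }
  const rest = vaultPath.slice(prefix.length);
  const base = vaultAddr.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/v1/${mount}/data/${rest}`;
}
```

Keeping the logical path (without `data/`) in PostgreSQL means the database row does not depend on the KV engine version.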
### New environment variables
| Variable | Description |
|----------|-------------|
| `VAULT_ADDR` | Vault server address |
| `VAULT_TOKEN` | Vault root/service token |
| `VAULT_MOUNT` | KV mount path (default: `secret`) |
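A possible way to read these variables at startup is sketched below. The `VaultConfig` shape and `loadVaultConfig` name are hypothetical, and treating `VAULT_ADDR` and `VAULT_TOKEN` as required is an assumption; only the `secret` default for `VAULT_MOUNT` comes from the table above.

```typescript
// Hypothetical config loader; names are illustrative.
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

function loadVaultConfig(env: Record<string, string | undefined>): VaultConfig {
  const addr = env["VAULT_ADDR"];
  const token = env["VAULT_TOKEN"];
  // Assumption: the server should fail fast when Vault is not configured.
  if (!addr) throw new Error("VAULT_ADDR is required");
  if (!token) throw new Error("VAULT_TOKEN is required");
  // Default taken from the table: KV mount path is "secret".
  return { addr, token, mount: env["VAULT_MOUNT"] || "secret" };
}
```

In a Node.js process this would typically be called once as `loadVaultConfig(process.env)`.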
### Migration
Add a `vault_path` column to the `credentials` table (`005_add_vault_path.sql`). Existing credentials retain their bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials have been rotated (migration guide provided).
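The coexistence of the two code paths could be sketched as a per-row dispatch. The row shape below is an assumption: `secret_hash` is a hypothetical name for the existing bcrypt column, and a non-null `vault_path` is taken to mean the credential has already moved to Vault.

```typescript
// Hypothetical row shape; only vault_path is named in the design.
interface CredentialRow {
  id: string;
  secret_hash: string | null; // assumed name of the legacy bcrypt column
  vault_path: string | null;  // set for credentials created or rotated in Phase 2
}

type CredentialBackend =
  | { kind: "vault"; path: string }
  | { kind: "bcrypt"; hash: string };

/** Picks the verification path for one credential; Vault wins when present. */
function resolveBackend(row: CredentialRow): CredentialBackend {
  if (row.vault_path) return { kind: "vault", path: row.vault_path };
  if (row.secret_hash) return { kind: "bcrypt", hash: row.secret_hash };
  throw new Error(`credential ${row.id} has neither vault_path nor a hash`);
}

/** Lists credentials that still need rotation before the bcrypt path can be removed. */
function pendingRotation(rows: CredentialRow[]): string[] {
  return rows.filter((r) => !r.vault_path).map((r) => r.id);
}
```

Once `pendingRotation` returns an empty list for the whole table, the legacy branch is no longer reachable.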
---
## 2. Multi-Language SDKs
### Shared contract (all SDKs implement identically)
```
AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
  .agents → AgentRegistryClient (5 methods)
  .credentials → CredentialClient (4 methods)
  .tokens → TokenClient (2 methods)
  .audit → AuditClient (2 methods)
  .clearTokenCache()
TokenManager — auto-refresh 60s before expiry
AgentIdPError — code, message, httpStatus, details
```
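The "auto-refresh 60s before expiry" rule in the contract could be sketched as follows. This is an illustrative TypeScript version under stated assumptions, not the SDK's actual implementation: the `AccessToken` shape, the injected `fetchToken` callback, and the sharing of one in-flight request among concurrent callers are all hypothetical choices.

```typescript
// Illustrative sketch; the token shape and fetch callback are assumptions.
interface AccessToken {
  value: string;
  expiresAtMs: number; // absolute expiry, epoch milliseconds
}

const REFRESH_MARGIN_MS = 60000; // refresh 60s before expiry, per the contract

/** True when there is no token or it is within the refresh margin. */
function needsRefresh(token: AccessToken | null, nowMs: number): boolean {
  return token === null || nowMs >= token.expiresAtMs - REFRESH_MARGIN_MS;
}

class TokenManager {
  private cached: AccessToken | null = null;
  private inFlight: Promise<AccessToken> | null = null;
  private readonly fetchToken: () => Promise<AccessToken>;
  private readonly now: () => number;

  constructor(fetchToken: () => Promise<AccessToken>, now: () => number = () => Date.now()) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  async getToken(): Promise<string> {
    const cached = this.cached;
    if (cached !== null && !needsRefresh(cached, this.now())) {
      return cached.value;
    }
    // Share one request among concurrent callers instead of fetching per call.
    let pending = this.inFlight;
    if (pending === null) {
      pending = this.fetchToken().then(
        (token) => { this.cached = token; this.inFlight = null; return token; },
        (err) => { this.inFlight = null; throw err; },
      );
      this.inFlight = pending;
    }
    return (await pending).value;
  }

  /** Counterpart of clearTokenCache() in the shared contract. */
  clear(): void {
    this.cached = null;
  }
}
```

Injecting the clock keeps the expiry logic testable without waiting for real time to pass.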
### Python SDK (`sentryagent-idp`)
- Python 3.9+ (httpx for async, requests for sync)
- Both sync and async client variants
- PyPI package: `sentryagent-idp`
- Type hints throughout (`mypy --strict` clean)
### Go SDK (`github.com/sentryagent/idp-sdk-go`)
- Go 1.21+, standard library `net/http`
- Context-aware methods (`context.Context` first arg)
- Idiomatic Go error handling (`error` return, no panic)
- Go module: `github.com/sentryagent/idp-sdk-go`
### Java SDK (`ai.sentryagent:idp-sdk`)
- Java 17+, Apache HttpClient 5
- Synchronous and CompletableFuture async variants
- Maven Central: `ai.sentryagent:idp-sdk`
- Fully typed with generics
---
## 3. OPA Policy Engine
### Architecture
```
HTTP Request
  → Auth Middleware (JWT verify) — unchanged
  → OPA Middleware (new) — evaluates policy
      → OPA Wasm (embedded, no network call)
      → Rego policy files (hot-reloadable)
  → Controller
```
### Design Decisions
**ADR-004: OPA Wasm over OPA sidecar**
Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policy files are loaded from the `policies/` directory at startup and reloaded on SIGHUP.
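One way the reload-on-SIGHUP behaviour could work is an in-memory holder that swaps the active policy only when the new one loads successfully. The sketch below is hypothetical: `Policy`, `PolicyLoader`, and `PolicyStore` are illustrative names, and the loader stands in for whatever reads `policies/` and instantiates the Wasm module.

```typescript
// Hypothetical sketch; the loader abstracts reading policies/ and building the evaluator.
interface Policy {
  version: number;
  evaluate: (input: unknown) => boolean;
}
type PolicyLoader = () => Policy;

class PolicyStore {
  private current: Policy;
  private readonly load: PolicyLoader;

  constructor(load: PolicyLoader) {
    this.load = load;
    this.current = load(); // a failure at startup propagates: no policy, no server
  }

  getPolicy(): Policy {
    return this.current;
  }

  /** Swaps in a freshly loaded policy; keeps the old one if loading fails. */
  reload(): boolean {
    try {
      this.current = this.load();
      return true;
    } catch (_err) {
      return false; // assumption: a broken policy file must not take down authorization
    }
  }
}

// Wiring in the Node.js process (shown as a comment to keep the sketch dependency-free):
//   process.on("SIGHUP", () => { store.reload(); });
```

Requests that are mid-evaluation keep the policy object they already obtained, so the swap needs no locking in a single-threaded Node.js process.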
**ADR-005: Policy replaces, does not wrap, scope check**
The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.
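To illustrate what replacing the scope check might look like, the sketch below builds a policy input from verified JWT claims and maps the decision to an HTTP status. Everything here is an assumption for illustration: the input fields, the space-delimited `scope` claim, and the `Evaluate` callback that stands in for the embedded OPA evaluator.

```typescript
// Hypothetical shapes; the real policy input is not specified in this design.
interface VerifiedClaims {
  sub: string;   // agent identifier
  scope: string; // assumed space-delimited, as in OAuth 2.0 access tokens
}

interface PolicyInput {
  method: string;
  path: string;
  agentId: string;
  scopes: string[];
}

type Evaluate = (input: PolicyInput) => boolean;

function buildPolicyInput(method: string, path: string, claims: VerifiedClaims): PolicyInput {
  return {
    method: method.toUpperCase(),
    path,
    agentId: claims.sub,
    scopes: claims.scope.split(" ").filter((s) => s.length > 0),
  };
}

/** Returns the status the middleware would send: 200 to continue, 403 to reject. */
function authorize(evaluate: Evaluate, input: PolicyInput): 200 | 403 {
  try {
    return evaluate(input) ? 200 : 403;
  } catch (_err) {
    return 403; // deny by default if the policy cannot be evaluated
  }
}
```

Because the policy is the single source of truth, the middleware itself contains no scope names; it only forwards facts about the request.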
### Policy structure (`policies/`)
```
policies/
  authz.rego — main policy: allow/deny
  data/
    scopes.json — scope → permission mapping
```
---
## 4. Web Dashboard UI
### Architecture
```
dashboard/ (new — separate from sdk/)
  src/
    components/ — reusable UI components
    pages/ — Agents, Credentials, Audit, Health
    hooks/ — useAgents, useCredentials, useAudit
    lib/
      client.ts — wraps @sentryagent/idp-sdk
      auth.ts — credential entry and storage
```
### Tech Stack
- React 18 + TypeScript strict
- Vite 5 (build tool)
- TanStack Query v5 (server state)
- shadcn/ui components (Radix UI + Tailwind CSS)
### Pages
| Page | Scope Required | Features |
|------|---------------|----------|
| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
| Health | None | Server uptime, Redis/PostgreSQL connectivity |
### Authentication
The dashboard accepts `clientId` and `clientSecret` via a login form. The `TokenManager` from `@sentryagent/idp-sdk` handles token acquisition and caches the token in `sessionStorage`. There is no backend session; all state is client-side.
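Client-side token caching could be sketched as below. The storage key, the `CachedToken` shape, and the choice to persist only the access token (not the `clientSecret`) are assumptions; the `KeyValueStore` interface mirrors the subset of the Web Storage API used, so `window.sessionStorage` can be passed in directly in a browser.

```typescript
// Hypothetical sketch; key name and token shape are illustrative.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface CachedToken {
  accessToken: string;
  expiresAtMs: number;
}

const TOKEN_CACHE_KEY = "agentidp.token";

function saveToken(store: KeyValueStore, token: CachedToken): void {
  store.setItem(TOKEN_CACHE_KEY, JSON.stringify(token));
}

/** Returns the cached token if present, well-formed, and unexpired; otherwise clears it. */
function loadToken(store: KeyValueStore, nowMs: number): CachedToken | null {
  const raw = store.getItem(TOKEN_CACHE_KEY);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<CachedToken>;
    if (
      typeof parsed.accessToken === "string" &&
      typeof parsed.expiresAtMs === "number" &&
      parsed.expiresAtMs > nowMs
    ) {
      return { accessToken: parsed.accessToken, expiresAtMs: parsed.expiresAtMs };
    }
  } catch (_err) {
    // Corrupt entry: fall through and clear it.
  }
  store.removeItem(TOKEN_CACHE_KEY);
  return null;
}
```

Because `sessionStorage` is scoped to a single tab and cleared when it closes, a user would log in again in each new tab under this approach.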
---
## 5. Prometheus + Grafana Monitoring
### Metrics exposed at `GET /metrics`
| Metric | Type | Description |
|--------|------|-------------|
| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
| `agentidp_agents_registered_total` | Counter | Agent registrations |
| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |
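To make the "labelled by" columns concrete, here is a dependency-free sketch of a labelled counter rendered in the Prometheus text exposition format. It is for illustration only: the component map names `prom-client`, which would normally handle registration and rendering, and the `LabelledCounter` class and its methods are hypothetical.

```typescript
// Illustrative only; a real server would likely rely on prom-client instead.
function escapeLabelValue(value: string): string {
  // The exposition format requires escaping backslash, double quote, and newline.
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

class LabelledCounter {
  private readonly metricName: string;
  private readonly help: string;
  private readonly values: Record<string, number> = {};

  constructor(metricName: string, help: string) {
    this.metricName = metricName;
    this.help = help;
  }

  /** Increments the series identified by the given label set. */
  inc(labels: Record<string, string>, by: number = 1): void {
    const key = Object.keys(labels)
      .sort() // stable label order so the same label set always maps to one series
      .map((k) => `${k}="${escapeLabelValue(String(labels[k]))}"`)
      .join(",");
    this.values[key] = (this.values[key] || 0) + by;
  }

  /** Renders HELP, TYPE, and one sample line per label set. */
  render(): string {
    const lines = [
      `# HELP ${this.metricName} ${this.help}`,
      `# TYPE ${this.metricName} counter`,
    ];
    for (const key of Object.keys(this.values)) {
      const labelPart = key === "" ? "" : `{${key}}`;
      lines.push(`${this.metricName}${labelPart} ${this.values[key]}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

High-cardinality label values, such as raw paths containing agent IDs, multiply the number of series, so the `path` label is usually the route template rather than the concrete URL.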
### Grafana dashboard
Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.
### Docker Compose extension
Add `prometheus` and `grafana` services in a `docker-compose.monitoring.yml` overlay. This keeps the base `docker-compose.yml` clean for developers who don't need monitoring.
---
## 6. Multi-Region Deployment (Terraform)
### Structure
```
terraform/
  modules/
    agentidp/ — reusable module: compute + networking
    rds/ — managed PostgreSQL
    redis/ — managed Redis
    lb/ — load balancer + TLS
  environments/
    aws/ — AWS-specific config (ECS + RDS + ElastiCache)
    gcp/ — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
```
### Design Decisions
**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
AWS and GCP are expected to cover the majority of developer deployments; an Azure module is deferred to Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.
**ADR-007: Terraform over Pulumi/CDK**
Terraform is among the most widely used IaC tools and is familiar to most DevOps teams. Its HCL syntax is also simpler to present in documentation.
---
## Component Interaction Map (Phase 2)
```
           ┌────────────────────┐
           │   Web Dashboard    │
           │   (React + Vite)   │
           └────────┬───────────┘
                    │ HTTPS
   ┌────────────────▼────────────────┐
   │         AgentIdP Server         │
   │ Auth MW → OPA MW → Controllers  │
   │     /metrics (prom-client)      │
   └──┬──────────┬──────────┬────────┘
      │          │          │
┌─────▼──┐  ┌────▼───┐   ┌──▼───────┐
│Postgres│  │ Redis  │   │  Vault   │
└────────┘  └────────┘   └──────────┘
           ┌────────▼────────┐
           │   Prometheus    │
           └────────┬────────┘
           ┌────────▼────────┐
           │     Grafana     │
           └─────────────────┘
```