chore: Phase 2 OpenSpec scoping — proposal, design, specs, tasks
8 workstreams scoped per OpenSpec standards:
1. HashiCorp Vault integration (secret management)
2. Python SDK (sentryagent-idp)
3. Go SDK (idp-sdk-go)
4. Java SDK (ai.sentryagent:idp-sdk)
5. OPA policy engine (dynamic ABAC, hot-reload Rego)
6. Web Dashboard UI (React 18 + TypeScript)
7. Prometheus + Grafana monitoring (7 metrics, pre-built dashboard)
8. Multi-region Terraform deployment (AWS + GCP)

Status: proposed — awaiting CEO dependency approvals (A0.1–A0.5) before any implementation begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
218
openspec/changes/phase-2-production-ready/design.md
Normal file
@@ -0,0 +1,218 @@
# Phase 2: Production-Ready — Technical Design

**Date**: 2026-03-28
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal

---

## 1. HashiCorp Vault Integration

### Architecture

```
AgentIdP Server
  └── CredentialService
        └── VaultClient (new)
              └── HashiCorp Vault (sidecar or external)
                    └── KV Secrets Engine v2
```

### Design Decisions

**ADR-001: Vault over AWS KMS/GCP Secret Manager**
Vault is cloud-agnostic, open-source, and already standard in enterprise environments. Using Vault keeps Phase 2 cloud-provider independent.

**ADR-002: KV Secrets Engine v2**
KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.

**ADR-003: AgentIdP stores Vault path, not secret**
`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
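
The path layout in ADR-003 could be derived along the following lines. This is a minimal sketch: the helper names (`vaultPath`, `kvV2ApiPath`) and the ID pattern are assumptions, not part of the design. It also shows that the KV v2 HTTP API places `data/` after the mount, so the stored logical path and the request path differ.

```typescript
// Sketch only: function names and the ID pattern are assumptions, not part of the design.
interface CredentialRef {
  agentId: string;
  credentialId: string;
}

const SAFE_ID = /^[A-Za-z0-9_-]+$/;

// Reject IDs that could alter the path (slashes, dots, empty strings).
function assertSafe(id: string, label: string): void {
  if (!SAFE_ID.test(id)) {
    throw new Error(`invalid ${label}: ${JSON.stringify(id)}`);
  }
}

// Path relative to the mount, shared by both helpers below.
function relativePath(ref: CredentialRef): string {
  assertSafe(ref.agentId, "agentId");
  assertSafe(ref.credentialId, "credentialId");
  return `agentidp/agents/${ref.agentId}/credentials/${ref.credentialId}`;
}

// Value written to credentials.vault_path (mount + relative path).
function vaultPath(mount: string, ref: CredentialRef): string {
  return `${mount}/${relativePath(ref)}`;
}

// Request path for the KV v2 HTTP API, which inserts "data/" after the mount.
function kvV2ApiPath(mount: string, ref: CredentialRef): string {
  return `/v1/${mount}/data/${relativePath(ref)}`;
}
```

Validating the IDs before interpolation keeps a malformed identifier from addressing a different secret.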

### New environment variables

| Variable | Description |
|----------|-------------|
| `VAULT_ADDR` | Vault server address |
| `VAULT_TOKEN` | Vault service token (a root token is acceptable only for local development) |
| `VAULT_MOUNT` | KV mount path (default: `secret`) |
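
A minimal sketch of reading these variables at startup. The function name `loadVaultConfig` is hypothetical, and treating `VAULT_ADDR` and `VAULT_TOKEN` as required is an assumption; only the `secret` default comes from the table above.

```typescript
// Sketch only: loadVaultConfig and the "required" rules are assumptions.
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

// Takes the environment as a parameter so it can be exercised without process.env.
function loadVaultConfig(env: Record<string, string | undefined>): VaultConfig {
  const addr = env.VAULT_ADDR;
  const token = env.VAULT_TOKEN;
  if (!addr) throw new Error("VAULT_ADDR is required");
  if (!token) throw new Error("VAULT_TOKEN is required");
  return {
    addr: addr.replace(/\/+$/, ""), // drop trailing slashes so paths join cleanly
    token,
    mount: env.VAULT_MOUNT || "secret", // documented default
  };
}
```

In the server this would typically be called once with `process.env`, failing fast if Vault is misconfigured.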

### Migration
Add `vault_path` column to `credentials` table (`005_add_vault_path.sql`). Existing credentials retain bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials are rotated (migration guide provided).
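
The coexistence of both code paths can be sketched as a per-row backend choice. The row shape below is an assumption (in particular the `secretHash` field name); only `vault_path` is named by the design.

```typescript
// Sketch only: the row shape and function names are assumptions.
interface CredentialRow {
  id: string;
  secretHash: string | null; // legacy bcrypt hash (field name assumed)
  vaultPath: string | null;  // populated for Vault-backed credentials
}

type Backend = "vault" | "bcrypt";

// Prefer Vault when a path is present; fall back to the legacy hash otherwise.
function selectBackend(row: CredentialRow): Backend {
  if (row.vaultPath) return "vault";
  if (row.secretHash) return "bcrypt";
  throw new Error(`credential ${row.id} has neither vault_path nor a hash`);
}

// Credentials that still need rotating before the bcrypt path can be removed.
function pendingRotation(rows: CredentialRow[]): string[] {
  return rows.filter((r) => !r.vaultPath).map((r) => r.id);
}
```

Once `pendingRotation` returns an empty list for the whole table, the legacy path is no longer exercised.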

---

## 2. Multi-Language SDKs

### Shared contract (all SDKs implement identically)

```
AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
  .agents        → AgentRegistryClient (5 methods)
  .credentials   → CredentialClient (4 methods)
  .tokens        → TokenClient (2 methods)
  .audit         → AuditClient (2 methods)
  .clearTokenCache()

TokenManager   — auto-refresh 60s before expiry
AgentIdPError  — code, message, httpStatus, details
```
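
The refresh rule and the error shape in the contract could look roughly like this. It is a sketch under assumptions: the `CachedToken` fields, the injected `fetchToken` callback, and the injectable clock are illustrative and not taken from any existing SDK.

```typescript
// Sketch only: field names and the injected fetcher/clock are assumptions.
const REFRESH_SKEW_MS = 60 * 1000; // refresh 60s before expiry, per the contract

interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry, epoch milliseconds
}

function needsRefresh(token: CachedToken | null, nowMs: number): boolean {
  return token === null || nowMs >= token.expiresAtMs - REFRESH_SKEW_MS;
}

class TokenManager {
  private cached: CachedToken | null = null;
  private readonly fetchToken: () => Promise<CachedToken>;
  private readonly now: () => number;

  constructor(fetchToken: () => Promise<CachedToken>, now: () => number = () => Date.now()) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  // Returns a cached token, fetching a new one inside the refresh window.
  async getToken(): Promise<string> {
    let token = this.cached;
    if (token === null || needsRefresh(token, this.now())) {
      token = await this.fetchToken();
      this.cached = token;
    }
    return token.accessToken;
  }

  clear(): void {
    this.cached = null;
  }
}

class AgentIdPError extends Error {
  readonly code: string;
  readonly httpStatus: number;
  readonly details?: unknown;

  constructor(code: string, message: string, httpStatus: number, details?: unknown) {
    super(message);
    this.name = "AgentIdPError";
    this.code = code;
    this.httpStatus = httpStatus;
    this.details = details;
  }
}
```

This version does not deduplicate concurrent refreshes; two simultaneous `getToken()` calls inside the refresh window would each fetch a token.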

### Python SDK (`sentryagent-idp`)
- Python 3.9+ (httpx for async, requests for sync)
- Both sync and async client variants
- PyPI package: `sentryagent-idp`
- Type hints throughout (`mypy --strict` clean)

### Go SDK (`github.com/sentryagent/idp-sdk-go`)
- Go 1.21+, standard library `net/http`
- Context-aware methods (`context.Context` first arg)
- Idiomatic Go error handling (`error` return, no panic)
- Go module: `github.com/sentryagent/idp-sdk-go`

### Java SDK (`ai.sentryagent:idp-sdk`)
- Java 17+, Apache HttpClient 5
- Synchronous and CompletableFuture async variants
- Maven Central: `ai.sentryagent:idp-sdk`
- Fully typed with generics

---

## 3. OPA Policy Engine

### Architecture

```
HTTP Request
  → Auth Middleware (JWT verify)   — unchanged
  → OPA Middleware (new)           — evaluates policy
      → OPA Wasm (embedded, no network call)
      → Rego policy files (hot-reloadable)
  → Controller
```

### Design Decisions

**ADR-004: OPA Wasm over OPA sidecar**
Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policy files are loaded from the `policies/` directory at startup and reloaded on SIGHUP.
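
Reload-on-SIGHUP can be sketched as an atomic swap that keeps the previous policy if the new one fails to load. The `PolicyHolder` class and the evaluator signature are assumptions. OPA's Wasm target executes compiled modules rather than raw `.rego` files, so the loader is assumed to handle compilation or to read a prebuilt bundle.

```typescript
// Sketch only: PolicyHolder and the evaluator shape are assumptions.
type PolicyInput = { method: string; path: string; scopes: string[] };
type Evaluator = (input: PolicyInput) => boolean;

class PolicyHolder {
  private current: Evaluator;

  constructor(initial: Evaluator) {
    this.current = initial;
  }

  evaluate(input: PolicyInput): boolean {
    return this.current(input);
  }

  // Build the new evaluator fully before swapping; on failure keep the old one.
  reload(load: () => Evaluator): boolean {
    try {
      const next = load();
      this.current = next;
      return true;
    } catch (e) {
      return false;
    }
  }
}

// Wiring in the server (Node.js), with loadFromDisk as a hypothetical loader:
//   process.on("SIGHUP", () => holder.reload(loadFromDisk));
```

Requests in flight during a reload see either the old or the new policy, never a partially loaded one.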

**ADR-005: Policy replaces, does not wrap, scope check**
The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.

### Policy structure (`policies/`)
```
policies/
  authz.rego        — main policy: allow/deny
  data/
    scopes.json     — scope → permission mapping
```
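
To illustrate the decision that `authz.rego` would make against `scopes.json`, here is a TypeScript stand-in, not Rego and not the actual policy. The shape of the mapping and the permission names (`agents.list` and so on) are assumptions.

```typescript
// Sketch only: the scopes.json shape and permission names are hypothetical.
type ScopeMap = Record<string, string[]>;

const exampleScopes: ScopeMap = {
  "agents:read": ["agents.list", "agents.get"],
  "audit:read": ["audit.list"],
};

// Default deny: allow only if some granted scope maps to the requested permission.
function allow(tokenScopes: string[], permission: string, data: ScopeMap): boolean {
  return tokenScopes.some((scope) => (data[scope] ?? []).includes(permission));
}
```

Unknown scopes map to no permissions, so a token carrying an unrecognised scope is denied rather than causing an error.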

---

## 4. Web Dashboard UI

### Architecture

```
dashboard/              (new — separate from sdk/)
  src/
    components/         — reusable UI components
    pages/              — Agents, Credentials, Audit, Health
    hooks/              — useAgents, useCredentials, useAudit
    lib/
      client.ts         — wraps @sentryagent/idp-sdk
      auth.ts           — credential entry and storage
```

### Tech Stack
- React 18 + TypeScript strict
- Vite 5 (build tool)
- TanStack Query v5 (server state)
- shadcn/ui components (Radix UI + Tailwind CSS)

### Pages

| Page | Scope Required | Features |
|------|---------------|----------|
| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
| Health | None | Server uptime, Redis/PostgreSQL connectivity |
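
The scope column above could drive navigation with a small lookup. The page keys and function names in this sketch are assumptions; the scope values are taken from the table.

```typescript
// Sketch only: page keys and function names are assumptions.
type Page = "agents" | "credentials" | "audit" | "health";

// null means the page needs no scope.
const REQUIRED_SCOPE: Record<Page, string | null> = {
  agents: "agents:read",
  credentials: "agents:read",
  audit: "audit:read",
  health: null,
};

function canView(page: Page, granted: string[]): boolean {
  const required = REQUIRED_SCOPE[page];
  return required === null || granted.includes(required);
}

// Pages to show in the navigation for a given token.
function visiblePages(granted: string[]): Page[] {
  return (Object.keys(REQUIRED_SCOPE) as Page[]).filter((p) => canView(p, granted));
}
```

A client-side check like this only hides navigation; the server-side policy remains the actual enforcement point.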

### Authentication
The dashboard accepts `clientId` + `clientSecret` via a login form. The `@sentryagent/idp-sdk` `TokenManager` handles token acquisition and caching in `sessionStorage`. No backend session — all state is client-side.
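
Token caching in `sessionStorage` might be sketched as below. The storage key, the stored shape, and the helper names are assumptions, and this is not the SDK's actual `TokenManager`. The store is passed in as an interface that `window.sessionStorage` satisfies, which keeps the logic testable outside a browser.

```typescript
// Sketch only: key name, stored shape, and helpers are assumptions.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface StoredToken {
  accessToken: string;
  expiresAtMs: number;
}

const TOKEN_KEY = "agentidp.token"; // hypothetical storage key

function saveToken(store: KeyValueStore, token: StoredToken): void {
  store.setItem(TOKEN_KEY, JSON.stringify(token));
}

// Returns the cached token, or null (and clears the entry) if missing, malformed, or expired.
function loadToken(store: KeyValueStore, nowMs: number): StoredToken | null {
  const raw = store.getItem(TOKEN_KEY);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<StoredToken>;
    if (
      typeof parsed.accessToken === "string" &&
      typeof parsed.expiresAtMs === "number" &&
      parsed.expiresAtMs > nowMs
    ) {
      return { accessToken: parsed.accessToken, expiresAtMs: parsed.expiresAtMs };
    }
  } catch (e) {
    // Malformed JSON: fall through and clear the entry.
  }
  store.removeItem(TOKEN_KEY);
  return null;
}
```

Only the access token is persisted here; the `clientSecret` entered in the login form is deliberately not written to storage in this sketch.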

---

## 5. Prometheus + Grafana Monitoring

### Metrics exposed at `GET /metrics`

| Metric | Type | Description |
|--------|------|-------------|
| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
| `agentidp_agents_registered_total` | Counter | Agent registrations |
| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |
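
To show what a scrape of `/metrics` returns for one of these counters, the sketch below renders the Prometheus text exposition format by hand. In the server a metrics library would normally do this; the `LabelledCounter` class is purely illustrative and is not that library's API.

```typescript
// Sketch only: a hand-rolled counter to illustrate the exposition format.
class LabelledCounter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(labels: Record<string, string> = {}, by: number = 1): void {
    if (by < 0) throw new Error("counters can only increase");
    const key = serializeLabels(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // One HELP line, one TYPE line, then one sample per label combination.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      lines.push(`${this.name}${key} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}

// Escape backslash, double quote, and newline in label values.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Sort label names so the same label set always yields the same series key.
function serializeLabels(labels: Record<string, string>): string {
  const names = Object.keys(labels).sort();
  if (names.length === 0) return "";
  return `{${names.map((n) => `${n}="${escapeLabel(labels[n])}"`).join(",")}}`;
}
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels such as raw request paths are usually normalised before being recorded.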

### Grafana dashboard
Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.

### Docker Compose extension
Add `prometheus` and `grafana` services to a `docker-compose.monitoring.yml` overlay — keeps the base `docker-compose.yml` clean for developers who don't need monitoring.

---

## 6. Multi-Region Deployment (Terraform)

### Structure

```
terraform/
  modules/
    agentidp/       — reusable module: compute + networking
    rds/            — managed PostgreSQL
    redis/          — managed Redis
    lb/             — load balancer + TLS
  environments/
    aws/            — AWS-specific config (ECS + RDS + ElastiCache)
    gcp/            — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
```

### Design Decisions

**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
AWS and GCP cover the majority of developer deployments. An Azure module is deferred to Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.

**ADR-007: Terraform over Pulumi/CDK**
Terraform is the most widely used IaC tool and is familiar to most DevOps teams. Its HCL syntax is also simpler to present in documentation.

---

## Component Interaction Map (Phase 2)

```
           ┌────────────────────┐
           │   Web Dashboard    │
           │   (React + Vite)   │
           └────────┬───────────┘
                    │ HTTPS
   ┌────────────────▼────────────────┐
   │         AgentIdP Server         │
   │ Auth MW → OPA MW → Controllers  │
   │     /metrics (prom-client)      │
   └──┬──────────┬──────────┬────────┘
      │          │          │
┌─────▼──┐  ┌────▼───┐   ┌──▼───────┐
│Postgres│  │ Redis  │   │  Vault   │
└────────┘  └────────┘   └──────────┘
                    │
           ┌────────▼────────┐
           │   Prometheus    │
           └────────┬────────┘
                    │
           ┌────────▼────────┐
           │     Grafana     │
           └─────────────────┘
```