diff --git a/openspec/changes/phase-2-production-ready/.openspec.yaml b/openspec/changes/phase-2-production-ready/.openspec.yaml
new file mode 100644
index 0000000..fd1e0d3
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/.openspec.yaml
@@ -0,0 +1,3 @@
+change: phase-2-production-ready
+status: proposed
+date: 2026-03-28
diff --git a/openspec/changes/phase-2-production-ready/design.md b/openspec/changes/phase-2-production-ready/design.md
new file mode 100644
index 0000000..a5f4663
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/design.md
@@ -0,0 +1,218 @@
+# Phase 2: Production-Ready — Technical Design
+
+**Date**: 2026-03-28
+**Author**: Virtual Architect
+**Status**: Draft — pending CEO approval of proposal
+
+---
+
+## 1. HashiCorp Vault Integration
+
+### Architecture
+
+```
+AgentIdP Server
+  └── CredentialService
+        └── VaultClient (new)
+              └── HashiCorp Vault (sidecar or external)
+                    └── KV Secrets Engine v2
+```
+
+### Design Decisions
+
+**ADR-001: Vault over AWS KMS/GCP Secret Manager**
+Vault is cloud-agnostic, open-source, and already standard in enterprise environments. Using Vault keeps Phase 2 cloud-provider independent.
+
+**ADR-002: KV Secrets Engine v2**
+KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.
+
+**ADR-003: AgentIdP stores Vault path, not secret**
+`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
+
+### New environment variables
+| Variable | Description |
+|----------|-------------|
+| `VAULT_ADDR` | Vault server address |
+| `VAULT_TOKEN` | Vault root/service token |
+| `VAULT_MOUNT` | KV mount path (default: `secret`) |
+
+### Migration
+Add `vault_path` column to `credentials` table (`005_add_vault_path.sql`). Existing credentials retain bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials are rotated (migration guide provided).
+
+---
+
+## 2. Multi-Language SDKs
+
+### Shared contract (all SDKs implement identically)
+
+```
+AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
+  .agents       → AgentRegistryClient (5 methods)
+  .credentials  → CredentialClient (4 methods)
+  .tokens       → TokenClient (2 methods)
+  .audit        → AuditClient (2 methods)
+  .clearTokenCache()
+
+TokenManager — auto-refresh 60s before expiry
+AgentIdPError — code, message, httpStatus, details
+```
+
+### Python SDK (`sentryagent-idp`)
+- Python 3.9+ (httpx for async, requests for sync)
+- Both sync and async client variants
+- PyPI package: `sentryagent-idp`
+- Type hints throughout (`mypy --strict` clean)
+
+### Go SDK (`github.com/sentryagent/idp-sdk-go`)
+- Go 1.21+, standard library `net/http`
+- Context-aware methods (`context.Context` first arg)
+- Idiomatic Go error handling (`error` return, no panic)
+- Go module: `github.com/sentryagent/idp-sdk-go`
+
+### Java SDK (`ai.sentryagent:idp-sdk`)
+- Java 17+, Apache HttpClient 5
+- Synchronous and CompletableFuture async variants
+- Maven Central: `ai.sentryagent:idp-sdk`
+- Fully typed with generics
+
+---
+
+## 3. OPA Policy Engine
+
+### Architecture
+
+```
+HTTP Request
+  → Auth Middleware (JWT verify) — unchanged
+  → OPA Middleware (new) — evaluates policy
+      → OPA Wasm (embedded, no network call)
+      → Rego policy files (hot-reloadable)
+  → Controller
+```
+
+### Design Decisions
+
+**ADR-004: OPA Wasm over OPA sidecar**
+Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policy files are loaded from `policies/` directory at startup and reloaded on SIGHUP.
+
+**ADR-005: Policy replaces, does not wrap, scope check**
+The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.
+
+### Policy structure (`policies/`)
+```
+policies/
+  authz.rego — main policy: allow/deny
+  data/
+    scopes.json — scope → permission mapping
+```
+
+---
+
+## 4. Web Dashboard UI
+
+### Architecture
+
+```
+dashboard/ (new — separate from sdk/)
+  src/
+    components/ — reusable UI components
+    pages/ — Agents, Credentials, Audit, Health
+    hooks/ — useAgents, useCredentials, useAudit
+    lib/
+      client.ts — wraps @sentryagent/idp-sdk
+      auth.ts — credential entry and storage
+```
+
+### Tech Stack
+- React 18 + TypeScript strict
+- Vite 5 (build tool)
+- TanStack Query v5 (server state)
+- shadcn/ui components (Radix UI + Tailwind CSS)
+
+### Pages
+| Page | Scope Required | Features |
+|------|---------------|----------|
+| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
+| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
+| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
+| Health | None | Server uptime, Redis/PostgreSQL connectivity |
+
+### Authentication
+The dashboard accepts `clientId` + `clientSecret` via a login form. The `@sentryagent/idp-sdk` `TokenManager` handles token acquisition and caching in `sessionStorage`. No backend session — all state is client-side.
+
+---
+
+## 5. Prometheus + Grafana Monitoring
+
+### Metrics exposed at `GET /metrics`
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
+| `agentidp_agents_registered_total` | Counter | Agent registrations |
+| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
+| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
+| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
+| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
+| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |
+
+### Grafana dashboard
+Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.
+
+### Docker Compose extension
+Add `prometheus` and `grafana` services to a `docker-compose.monitoring.yml` overlay — keeps the base `docker-compose.yml` clean for developers who don't need monitoring.
+
+---
+
+## 6. Multi-Region Deployment (Terraform)
+
+### Structure
+
+```
+terraform/
+  modules/
+    agentidp/ — reusable module: compute + networking
+    rds/ — managed PostgreSQL
+    redis/ — managed Redis
+    lb/ — load balancer + TLS
+  environments/
+    aws/ — AWS-specific config (ECS + RDS + ElastiCache)
+    gcp/ — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
+```
+
+### Design Decisions
+
+**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
+AWS and GCP cover the majority of developer deployments. Azure module is Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.
+
+**ADR-007: Terraform over Pulumi/CDK**
+Terraform is the most widely-used IaC tool, familiar to most DevOps teams. The HCL syntax is simpler for documentation purposes.
+
+---
+
+## Component Interaction Map (Phase 2)
+
+```
+                  ┌────────────────────┐
+                  │   Web Dashboard    │
+                  │   (React + Vite)   │
+                  └────────┬───────────┘
+                           │ HTTPS
+          ┌────────────────▼────────────────┐
+          │         AgentIdP Server         │
+          │  Auth MW → OPA MW → Controllers │
+          │  /metrics (prom-client)         │
+          └──┬──────────┬──────────┬────────┘
+             │          │          │
+       ┌─────▼──┐  ┌────▼───┐   ┌──▼───────┐
+       │Postgres│  │ Redis  │   │  Vault   │
+       └────────┘  └────────┘   └──────────┘
+                           │
+                  ┌────────▼────────┐
+                  │   Prometheus    │
+                  └────────┬────────┘
+                           │
+                  ┌────────▼────────┐
+                  │     Grafana     │
+                  └─────────────────┘
+```
diff --git a/openspec/changes/phase-2-production-ready/proposal.md b/openspec/changes/phase-2-production-ready/proposal.md
new file mode 100644
index 0000000..1d02f8d
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/proposal.md
@@ -0,0 +1,96 @@
+# Phase 2: Production-Ready — Change Proposal
+
+**Date**: 2026-03-28
+**Author**: Virtual CTO
+**Status**: Proposed — awaiting CEO approval
+
+---
+
+## Summary
+
+Phase 1 delivered a complete, working AgentIdP MVP. Phase 2 makes it production-ready: hardened secrets management, multi-language SDKs, a policy engine, a web dashboard, observability, and multi-region deployment.
+
+---
+
+## Problem Statement
+
+Phase 1 is functional but has the following production gaps:
+
+| Gap | Risk |
+|-----|------|
+| Credentials stored as bcrypt hashes in PostgreSQL | No HSM/KMS — acceptable for MVP, not for enterprise |
+| Only Node.js SDK | Developers in Python/Go/Java cannot use the SDK |
+| No policy engine | Scope enforcement is static — no dynamic ABAC/RBAC |
+| No web UI | Operators must use `curl` to manage agents |
+| No observability | No metrics, no dashboards, no alerting |
+| Single-region deployment | No HA, no geo-redundancy |
+
+---
+
+## Proposed Changes
+
+### 1. HashiCorp Vault Integration
+Replace raw bcrypt credential storage with Vault-backed secret management. Vault handles secret generation, versioning, and revocation. AgentIdP stores only Vault secret paths, not the secrets themselves.
+
+### 2. Multi-Language SDKs
+Add Python, Go, and Java SDKs with identical API surface to the existing Node.js SDK: `AgentIdPClient`, `TokenManager`, service clients for all 14 endpoints, typed error hierarchy.
+
+### 3. Advanced Policy Engine (OPA)
+Integrate Open Policy Agent (OPA), embedded as Wasm in the server process, for dynamic scope and attribute-based access control. Policies are hot-reloadable Rego files — no server restart required.
+
+### 4. Web Dashboard UI
+A React + TypeScript dashboard for operators: agent list and management, credential overview, audit log viewer, system health panel. Read-only by default; write operations require `agents:write` scope.
+
+### 5. Prometheus + Grafana Monitoring
+Instrument all services with Prometheus metrics (`/metrics` endpoint). Ship a pre-built Grafana dashboard for: token issuance rate, agent registration rate, error rates, Redis latency, PostgreSQL query latency.
+
+### 6. Multi-Region Deployment
+Terraform modules for AWS/GCP deployment with: managed PostgreSQL (RDS/Cloud SQL), managed Redis (ElastiCache/Memorystore), container orchestration (ECS/Cloud Run), load balancer, and a deployment guide.
+
+---
+
+## Out of Scope for Phase 2
+
+- AGNTCY federation (Phase 3)
+- W3C DID support (Phase 3)
+- SOC 2 certification (Phase 3)
+- Rust/C++ SDKs (Phase 3)
+
+---
+
+## Dependencies
+
+| New Dependency | Purpose | CEO Approval Required |
+|---------------|---------|----------------------|
+| `@openpolicyagent/opa-wasm` | OPA policy evaluation | Yes |
+| `node-vault` | HashiCorp Vault client | Yes |
+| React 18 + Vite | Web dashboard | Yes |
+| `prom-client` | Prometheus metrics | Yes |
+| Terraform | Infrastructure as code | Yes |
+
+---
+
+## Delivery Sequence (per OpenSpec spec-first workflow)
+
+```
+1. Vault integration (highest security impact)
+2. Python SDK (highest developer demand)
+3. Go SDK
+4. Java SDK
+5. OPA policy engine
+6. Web dashboard UI
+7. Prometheus + Grafana monitoring
+8. Multi-region deployment (Terraform)
+```
+
+---
+
+## Success Criteria
+
+- All new dependencies CEO-approved before implementation begins
+- All new API endpoints have OpenAPI 3.0 specs before implementation
+- TypeScript strict mode + zero `any` maintained throughout
+- >80% test coverage on all new services
+- All SDKs pass the same QA gate: 14-endpoint coverage, typed errors, zero `any`
+- Web dashboard passes OWASP Top 10 security review
+- Monitoring stack ships with pre-built dashboards — zero manual setup required
diff --git a/openspec/changes/phase-2-production-ready/specs/deployment/spec.md b/openspec/changes/phase-2-production-ready/specs/deployment/spec.md
new file mode 100644
index 0000000..ddede2c
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/deployment/spec.md
@@ -0,0 +1,44 @@
+# Spec: Multi-Region Deployment (Terraform)
+
+**Status**: Pending CEO approval
+**Workstream**: 8 of 8
+
+## Scope
+- `terraform/` directory at project root
+- Shared `agentidp` module (compute, networking, secrets)
+- `environments/aws/` — ECS Fargate + RDS PostgreSQL + ElastiCache Redis
+- `environments/gcp/` — Cloud Run + Cloud SQL + Memorystore Redis
+- Deployment guide: `docs/devops/deployment.md`
+
+## Module structure
+
+```
+terraform/
+  modules/
+    agentidp/
+      main.tf — compute (ECS task or Cloud Run service)
+      networking.tf — VPC, subnets, security groups
+      variables.tf — all configurable inputs
+      outputs.tf — service URL, DB endpoint, Redis endpoint
+    rds/ — managed PostgreSQL
+    redis/ — managed Redis
+    lb/ — ALB (AWS) or Cloud LB (GCP), TLS cert
+  environments/
+    aws/
+      main.tf — calls modules, sets AWS-specific vars
+      variables.tf
+      terraform.tfvars.example
+    gcp/
+      main.tf
+      variables.tf
+      terraform.tfvars.example
+```
+
+## Acceptance Criteria
+- [ ] `terraform validate` passes for both aws and gcp environments
+- [ ] `terraform plan` produces no errors against a live AWS/GCP account (test in dev env)
+- [ ] JWT_PRIVATE_KEY and JWT_PUBLIC_KEY injected as environment secrets (not hardcoded)
+- [ ] TLS termination at load balancer — HTTPS only in production modules
+- [ ] PostgreSQL and Redis not publicly accessible — VPC-internal only
+- [ ] `docs/devops/deployment.md` — end-to-end deployment walkthrough for AWS and GCP
+- [ ] `terraform.tfvars.example` provided for both environments — no secrets in version control
diff --git a/openspec/changes/phase-2-production-ready/specs/go-sdk/spec.md b/openspec/changes/phase-2-production-ready/specs/go-sdk/spec.md
new file mode 100644
index 0000000..60c12b7
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/go-sdk/spec.md
@@ -0,0 +1,23 @@
+# Spec: Go SDK (`github.com/sentryagent/idp-sdk-go`)
+
+**Status**: Pending CEO approval
+**Workstream**: 3 of 8
+
+## Scope
+- `sdk-go/` directory at project root
+- Context-aware `AgentIdPClient` using standard library `net/http`
+- `TokenManager` with mutex-guarded cache and 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- Idiomatic Go error type `AgentIdPError` implementing `error` interface
+- `go.mod` module: `github.com/sentryagent/idp-sdk-go`
+- `sdk-go/README.md`
+
+## Acceptance Criteria
+- [ ] All 14 endpoints covered
+- [ ] All methods take `context.Context` as first argument
+- [ ] No panics — all errors returned as `error`
+- [ ] `AgentIdPError` implements `error` and exposes `.Code`, `.HTTPStatus`, `.Details`
+- [ ] `TokenManager` is goroutine-safe (`sync.Mutex` on cache)
+- [ ] `go vet` and `staticcheck` pass with zero warnings
+- [ ] `go test ./...` with >80% coverage
+- [ ] README matches Node.js SDK structure
diff --git a/openspec/changes/phase-2-production-ready/specs/java-sdk/spec.md b/openspec/changes/phase-2-production-ready/specs/java-sdk/spec.md
new file mode 100644
index 0000000..ba7cb9d
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/java-sdk/spec.md
@@ -0,0 +1,23 @@
+# Spec: Java SDK (`ai.sentryagent:idp-sdk`)
+
+**Status**: Pending CEO approval
+**Workstream**: 4 of 8
+
+## Scope
+- `sdk-java/` directory at project root
+- `AgentIdPClient` with sync and `CompletableFuture` async variants
+- `TokenManager` with thread-safe cache and 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- `AgentIdPException` extending `RuntimeException` with `code`, `httpStatus`, `details`
+- `pom.xml`: groupId=`ai.sentryagent`, artifactId=`idp-sdk`, Java 17+
+- `sdk-java/README.md`
+
+## Acceptance Criteria
+- [ ] All 14 endpoints covered
+- [ ] Sync methods return typed POJOs; async methods return `CompletableFuture`
+- [ ] `AgentIdPException` thrown (not raw IOException) on all failure paths
+- [ ] `TokenManager` is thread-safe (`synchronized` on cache)
+- [ ] Apache HttpClient 5 for HTTP transport
+- [ ] Jackson for JSON serialization
+- [ ] `mvn verify` passes with >80% coverage (JUnit 5)
+- [ ] README matches Node.js SDK structure
diff --git a/openspec/changes/phase-2-production-ready/specs/monitoring/spec.md b/openspec/changes/phase-2-production-ready/specs/monitoring/spec.md
new file mode 100644
index 0000000..502064e
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/monitoring/spec.md
@@ -0,0 +1,32 @@
+# Spec: Prometheus + Grafana Monitoring
+
+**Status**: Pending CEO approval
+**Workstream**: 7 of 8
+
+## Scope
+- `prom-client` integration — expose `GET /metrics`
+- 7 metrics (counters + histograms) across all services
+- `monitoring/` directory: Prometheus config + Grafana provisioning
+- `docker-compose.monitoring.yml` overlay (adds prometheus + grafana services)
+- Pre-built Grafana dashboard JSON (`monitoring/grafana/dashboards/agentidp.json`)
+
+## Metrics
+
+| Metric | Type | Labels |
+|--------|------|--------|
+| `agentidp_tokens_issued_total` | Counter | `outcome` (success/failure) |
+| `agentidp_agents_registered_total` | Counter | `outcome` |
+| `agentidp_http_requests_total` | Counter | `method`, `path`, `status_code` |
+| `agentidp_http_request_duration_seconds` | Histogram | `method`, `path` |
+| `agentidp_rate_limit_rejections_total` | Counter | — |
+| `agentidp_db_query_duration_seconds` | Histogram | `operation` |
+| `agentidp_redis_command_duration_seconds` | Histogram | `command` |
+
+## Acceptance Criteria
+- [ ] `GET /metrics` returns Prometheus text format
+- [ ] `/metrics` endpoint does NOT require Bearer auth (Prometheus scrapes it)
+- [ ] All 7 metrics present and updating under load
+- [ ] Grafana dashboard auto-provisions on `docker compose -f docker-compose.monitoring.yml up`
+- [ ] Grafana runs on port 3001 (no conflict with AgentIdP on 3000)
+- [ ] `docs/devops/operations.md` updated with monitoring section
+- [ ] `prom-client` added as new dependency — CEO approval gate
diff --git a/openspec/changes/phase-2-production-ready/specs/opa-policy/spec.md b/openspec/changes/phase-2-production-ready/specs/opa-policy/spec.md
new file mode 100644
index 0000000..102854a
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/opa-policy/spec.md
@@ -0,0 +1,37 @@
+# Spec: OPA Policy Engine Integration
+
+**Status**: Pending CEO approval
+**Workstream**: 5 of 8
+
+## Scope
+- New `OpaMiddleware` replacing static scope check in `auth.ts`
+- `@openpolicyagent/opa-wasm` integration (embedded Wasm, no sidecar)
+- `policies/authz.rego` — main allow/deny policy
+- `policies/data/scopes.json` — scope to permission mapping
+- SIGHUP handler to hot-reload policies without restart
+- New env var: `POLICY_DIR` (default: `./policies`)
+
+## Policy interface
+
+```
+input = {
+  "method": "GET",
+  "path": "/api/v1/agents",
+  "scopes": ["agents:read"],
+  "agentId": "uuid"
+}
+
+output = {
+  "allow": true | false,
+  "reason": "string" // populated when allow=false
+}
+```
+
+## Acceptance Criteria
+- [ ] All existing scope checks replaced by OPA evaluation
+- [ ] Policy files hot-reloadable on SIGHUP (no restart required)
+- [ ] OPA Wasm loaded at startup — fail-fast if `POLICY_DIR` invalid
+- [ ] `allow=false` responses return `403` with `reason` in error body
+- [ ] Existing test suite passes unchanged (OPA evaluates same rules as before)
+- [ ] New unit tests for OPA middleware: allow/deny cases, missing scope, invalid input
+- [ ] `POLICY_DIR` env var documented in `docs/devops/environment-variables.md`
diff --git a/openspec/changes/phase-2-production-ready/specs/python-sdk/spec.md b/openspec/changes/phase-2-production-ready/specs/python-sdk/spec.md
new file mode 100644
index 0000000..4e8df1f
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/python-sdk/spec.md
@@ -0,0 +1,24 @@
+# Spec: Python SDK (`sentryagent-idp`)
+
+**Status**: Pending CEO approval
+**Workstream**: 2 of 8
+
+## Scope
+- `sdk-python/` directory at project root
+- `AgentIdPClient` with sync and async variants
+- `TokenManager` with 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- `AgentIdPError` typed exception
+- Full type hints — `mypy --strict` clean
+- `sdk-python/README.md` with installation and usage
+
+## Acceptance Criteria
+- [ ] All 14 API endpoints covered
+- [ ] Sync client: `requests` library
+- [ ] Async client: `httpx` library
+- [ ] `mypy --strict` passes with zero errors
+- [ ] Zero untyped code
+- [ ] `AgentIdPError` raised (not raw requests/httpx exceptions) on all failure paths
+- [ ] `TokenManager` tested: caches token, refreshes at exp-60s
+- [ ] `pyproject.toml` with: name=sentryagent-idp, python>=3.9, dependencies declared
+- [ ] README matches Node.js SDK structure
diff --git a/openspec/changes/phase-2-production-ready/specs/vault/spec.md b/openspec/changes/phase-2-production-ready/specs/vault/spec.md
new file mode 100644
index 0000000..6653f41
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/vault/spec.md
@@ -0,0 +1,21 @@
+# Spec: HashiCorp Vault Integration
+
+**Status**: Pending CEO approval
+**Workstream**: 1 of 8
+
+## Scope
+- VaultClient class wrapping `node-vault`
+- `005_add_vault_path.sql` migration
+- Updated CredentialService to write secrets to Vault instead of PostgreSQL
+- New env vars: VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
+- Migration guide: bcrypt → Vault coexistence strategy
+
+## Acceptance Criteria
+- [ ] New credentials: secret written to Vault KV v2, `vault_path` stored in PostgreSQL
+- [ ] Credential rotation: Vault versioned update, `vault_path` unchanged
+- [ ] Credential revocation: Vault secret deleted, DB status = `revoked`
+- [ ] Existing bcrypt credentials continue to work until rotated
+- [ ] VaultClient follows existing service interface pattern (DRY, SOLID)
+- [ ] Zero `any` types, TypeScript strict
+- [ ] `VAULT_ADDR` / `VAULT_TOKEN` validation at startup (fail-fast)
+- [ ] DevOps docs updated with Vault setup section
diff --git a/openspec/changes/phase-2-production-ready/specs/web-dashboard/spec.md b/openspec/changes/phase-2-production-ready/specs/web-dashboard/spec.md
new file mode 100644
index 0000000..1c493f6
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/specs/web-dashboard/spec.md
@@ -0,0 +1,34 @@
+# Spec: Web Dashboard UI
+
+**Status**: Pending CEO approval
+**Workstream**: 6 of 8
+
+## Scope
+- `dashboard/` directory at project root
+- React 18 + TypeScript strict, built with Vite 5
+- TanStack Query v5 for server state
+- shadcn/ui (Radix UI + Tailwind CSS) for components
+- Four pages: Agents, Credentials, Audit Log, Health
+- Client-side auth: `clientId` + `clientSecret` → `TokenManager`
+- Served from AgentIdP server at `GET /dashboard` (static build)
+
+## Pages
+
+| Page | Route | Scope Required |
+|------|-------|---------------|
+| Login | `/dashboard/login` | None |
+| Agents | `/dashboard/agents` | `agents:read` |
+| Agent Detail | `/dashboard/agents/:id` | `agents:read` |
+| Credentials | `/dashboard/agents/:id/credentials` | `agents:read` |
+| Audit Log | `/dashboard/audit` | `audit:read` |
+| Health | `/dashboard/health` | None |
+
+## Acceptance Criteria
+- [ ] TypeScript strict — zero `any` across all dashboard files
+- [ ] `dashboard/tsconfig.json` with `strict: true`
+- [ ] Login form stores token in `sessionStorage` only (not `localStorage`)
+- [ ] All write operations (suspend, revoke, rotate) require confirmation dialog
+- [ ] OWASP Top 10 review: no XSS, no CSRF, no sensitive data in URL params
+- [ ] Vite build outputs to `dashboard/dist/`; AgentIdP serves it as static
+- [ ] `dashboard/README.md` — how to build and serve
+- [ ] Responsive layout — functional on desktop and tablet
diff --git a/openspec/changes/phase-2-production-ready/tasks.md b/openspec/changes/phase-2-production-ready/tasks.md
new file mode 100644
index 0000000..8e98bdb
--- /dev/null
+++ b/openspec/changes/phase-2-production-ready/tasks.md
@@ -0,0 +1,127 @@
+# Phase 2: Production-Ready — Tasks
+
+**Status**: Awaiting CEO dependency approvals before any implementation begins.
+
+## CEO Approval Gates (required before implementation)
+
+- [ ] A0.1 Approve dependency: `node-vault` (Vault integration)
+- [ ] A0.2 Approve dependency: `@openpolicyagent/opa-wasm` (OPA policy engine)
+- [ ] A0.3 Approve dependency: React 18 + Vite 5 (web dashboard)
+- [ ] A0.4 Approve dependency: `prom-client` (Prometheus metrics)
+- [ ] A0.5 Approve dependency: Terraform (infrastructure as code)
+
+---
+
+## Workstream 1: HashiCorp Vault Integration
+
+- [ ] 1.1 Write `src/vault/VaultClient.ts` — wraps `node-vault`; methods: writeSecret, readSecret, deleteSecret, rotateSecret
+- [ ] 1.2 Write `src/db/migrations/005_add_vault_path.sql` — add `vault_path` column to `credentials`
+- [ ] 1.3 Update `CredentialService.ts` — new credentials use Vault; existing bcrypt credentials continue to work
+- [ ] 1.4 Update `docs/devops/environment-variables.md` — add VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
+- [ ] 1.5 Write `docs/devops/vault-setup.md` — Vault dev server setup, production Vault config, migration guide
+- [ ] 1.6 Write unit tests for VaultClient (mocked Vault) and updated CredentialService
+- [ ] 1.7 QA sign-off: zero `any`, TypeScript strict, >80% coverage, coexistence verified
+
+## Workstream 2: Python SDK
+
+- [ ] 2.1 Create `sdk-python/` with `pyproject.toml` — name: sentryagent-idp, python>=3.9
+- [ ] 2.2 Write `sdk-python/src/sentryagent_idp/types.py` — all request/response dataclasses
+- [ ] 2.3 Write `sdk-python/src/sentryagent_idp/errors.py` — AgentIdPError exception
+- [ ] 2.4 Write `sdk-python/src/sentryagent_idp/token_manager.py` — sync TokenManager
+- [ ] 2.5 Write `sdk-python/src/sentryagent_idp/async_token_manager.py` — async TokenManager (httpx)
+- [ ] 2.6 Write `sdk-python/src/sentryagent_idp/services/agents.py` — AgentRegistryClient (sync + async)
+- [ ] 2.7 Write `sdk-python/src/sentryagent_idp/services/credentials.py` — CredentialClient (sync + async)
+- [ ] 2.8 Write `sdk-python/src/sentryagent_idp/services/token.py` — TokenClient (sync + async)
+- [ ] 2.9 Write `sdk-python/src/sentryagent_idp/services/audit.py` — AuditClient (sync + async)
+- [ ] 2.10 Write `sdk-python/src/sentryagent_idp/client.py` — AgentIdPClient (sync) + AsyncAgentIdPClient
+- [ ] 2.11 Write `sdk-python/src/sentryagent_idp/__init__.py` — barrel exports
+- [ ] 2.12 Write `sdk-python/README.md`
+- [ ] 2.13 QA: `mypy --strict` clean, all 14 endpoints, AgentIdPError on all failure paths, pytest >80%
+
+## Workstream 3: Go SDK
+
+- [ ] 3.1 Create `sdk-go/` with `go.mod` — module: github.com/sentryagent/idp-sdk-go, go 1.21
+- [ ] 3.2 Write `sdk-go/types.go` — all request/response structs
+- [ ] 3.3 Write `sdk-go/errors.go` — AgentIdPError type implementing error interface
+- [ ] 3.4 Write `sdk-go/token_manager.go` — mutex-guarded TokenManager
+- [ ] 3.5 Write `sdk-go/services/agents.go` — AgentRegistryClient
+- [ ] 3.6 Write `sdk-go/services/credentials.go` — CredentialClient
+- [ ] 3.7 Write `sdk-go/services/token.go` — TokenClient
+- [ ] 3.8 Write `sdk-go/services/audit.go` — AuditClient
+- [ ] 3.9 Write `sdk-go/client.go` — AgentIdPClient
+- [ ] 3.10 Write `sdk-go/README.md`
+- [ ] 3.11 QA: `go vet` clean, `staticcheck` clean, all 14 endpoints, goroutine-safe, `go test ./...` >80%
+
+## Workstream 4: Java SDK
+
+- [ ] 4.1 Create `sdk-java/` with `pom.xml` — groupId: ai.sentryagent, artifactId: idp-sdk, Java 17
+- [ ] 4.2 Write all POJO request/response model classes
+- [ ] 4.3 Write `AgentIdPException.java` extending RuntimeException
+- [ ] 4.4 Write `TokenManager.java` — synchronized cache with 60s refresh buffer
+- [ ] 4.5 Write `AgentRegistryClient.java` — sync + CompletableFuture methods
+- [ ] 4.6 Write `CredentialClient.java` — sync + CompletableFuture methods
+- [ ] 4.7 Write `TokenClient.java` — sync + CompletableFuture methods
+- [ ] 4.8 Write `AuditClient.java` — sync + CompletableFuture methods
+- [ ] 4.9 Write `AgentIdPClient.java` — composes all service clients
+- [ ] 4.10 Write `sdk-java/README.md`
+- [ ] 4.11 QA: `mvn verify` passes, all 14 endpoints, AgentIdPException on all failure paths, JUnit 5 >80%
+
+## Workstream 5: OPA Policy Engine
+
+- [ ] 5.1 Write `policies/authz.rego` — allow/deny rules matching all current scope checks
+- [ ] 5.2 Write `policies/data/scopes.json` — scope to endpoint permission mapping
+- [ ] 5.3 Write `src/middleware/opa.ts` — OpaMiddleware: loads Wasm, evaluates input, returns allow/deny
+- [ ] 5.4 Replace static scope check in `src/middleware/auth.ts` with OpaMiddleware
+- [ ] 5.5 Add SIGHUP handler in `src/server.ts` to hot-reload policy files
+- [ ] 5.6 Update `docs/devops/environment-variables.md` — add POLICY_DIR
+- [ ] 5.7 QA: all existing auth tests pass unchanged, new OPA unit tests, hot-reload verified
+
+## Workstream 6: Web Dashboard UI
+
+- [ ] 6.1 Create `dashboard/` with Vite 5 + React 18 + TypeScript strict configuration
+- [ ] 6.2 Set up shadcn/ui with Tailwind CSS
+- [ ] 6.3 Write `dashboard/src/lib/auth.ts` — credential entry, TokenManager, sessionStorage
+- [ ] 6.4 Write `dashboard/src/lib/client.ts` — wraps @sentryagent/idp-sdk AgentIdPClient
+- [ ] 6.5 Write Login page (`/dashboard/login`)
+- [ ] 6.6 Write Agents page (`/dashboard/agents`) — list, search, filter by status
+- [ ] 6.7 Write Agent Detail page (`/dashboard/agents/:id`) — suspend/reactivate with confirm dialog
+- [ ] 6.8 Write Credentials page (`/dashboard/agents/:id/credentials`) — rotate/revoke with confirm
+- [ ] 6.9 Write Audit Log page (`/dashboard/audit`) — filters, pagination
+- [ ] 6.10 Write Health page (`/dashboard/health`) — PostgreSQL + Redis connectivity status
+- [ ] 6.11 Configure AgentIdP Express app to serve `dashboard/dist/` at `/dashboard`
+- [ ] 6.12 Write `dashboard/README.md`
+- [ ] 6.13 QA: TypeScript strict, zero `any`, OWASP Top 10 review, responsive layout verified
+
+## Workstream 7: Prometheus + Grafana Monitoring
+
+- [ ] 7.1 Add `prom-client` to dependencies (after CEO approval A0.4)
+- [ ] 7.2 Write `src/metrics/registry.ts` — shared Prometheus Registry with all 7 metric definitions
+- [ ] 7.3 Instrument `OAuth2Service.ts` — increment `agentidp_tokens_issued_total`
+- [ ] 7.4 Instrument `AgentService.ts` — increment `agentidp_agents_registered_total`
+- [ ] 7.5 Instrument `src/middleware/` — HTTP request counter and duration histogram
+- [ ] 7.6 Instrument `src/db/pool.ts` — DB query duration histogram
+- [ ] 7.7 Instrument `src/cache/redis.ts` — Redis command duration histogram
+- [ ] 7.8 Add `GET /metrics` route (unauthenticated, Prometheus text format)
+- [ ] 7.9 Write `monitoring/prometheus/prometheus.yml` — scrape config
+- [ ] 7.10 Write `monitoring/grafana/provisioning/` — datasource + dashboard provisioning
+- [ ] 7.11 Write `monitoring/grafana/dashboards/agentidp.json` — pre-built Grafana dashboard
+- [ ] 7.12 Write `docker-compose.monitoring.yml` overlay
+- [ ] 7.13 Update `docs/devops/operations.md` — monitoring section
+- [ ] 7.14 QA: all 7 metrics verified under load, Grafana auto-provisions, no auth leak on /metrics
+
+## Workstream 8: Multi-Region Deployment (Terraform)
+
+- [ ] 8.1 Write `terraform/modules/agentidp/main.tf` + `variables.tf` + `outputs.tf`
+- [ ] 8.2 Write `terraform/modules/rds/` — managed PostgreSQL module
+- [ ] 8.3 Write `terraform/modules/redis/` — managed Redis module
+- [ ] 8.4 Write `terraform/modules/lb/` — load balancer + TLS module
+- [ ] 8.5 Write `terraform/environments/aws/main.tf` + `variables.tf` + `terraform.tfvars.example`
+- [ ] 8.6 Write `terraform/environments/gcp/main.tf` + `variables.tf` + `terraform.tfvars.example`
+- [ ] 8.7 Write `docs/devops/deployment.md` — end-to-end AWS and GCP deployment walkthrough
+- [ ] 8.8 QA: `terraform validate` passes, secrets not hardcoded, TLS enforced, DB/Redis VPC-internal
+
+---
+
+## Phase 2 Complete Criteria
+
+All 8 workstreams done. All tasks checked. All QA gates passed. CEO reviewed.
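
Review note on the shared SDK contract in `design.md`: every SDK is required to ship a `TokenManager` that caches the access token and refreshes it 60s before expiry. The TypeScript sketch below shows one way that rule could look, so the Python, Go, and Java ports have a common reference. It is an illustration only, not the existing Node.js SDK code. The `AccessToken` shape, the `TokenFetcher` callback, and the injectable `now` clock are hypothetical names introduced here; the spec defines none of them.

```typescript
// Hypothetical token shape; the spec does not define this type.
interface AccessToken {
  value: string;
  expiresAtMs: number; // absolute expiry time, epoch milliseconds
}

// Hypothetical callback that performs the actual token request.
type TokenFetcher = () => Promise<AccessToken>;

// Refresh window taken from the shared contract: 60s before expiry.
const REFRESH_BUFFER_MS = 60 * 1000;

class TokenManager {
  private cached: AccessToken | null = null;
  private inFlight: Promise<AccessToken> | null = null;
  private readonly fetchToken: TokenFetcher;
  private readonly now: () => number;

  constructor(fetchToken: TokenFetcher, now: () => number = Date.now) {
    this.fetchToken = fetchToken;
    this.now = now; // injectable clock, an assumption made for testability
  }

  async getToken(): Promise<string> {
    const token = this.cached;
    // Serve from cache while the token is more than 60s away from expiry.
    if (token !== null && this.now() < token.expiresAtMs - REFRESH_BUFFER_MS) {
      return token.value;
    }
    // Concurrent callers share a single refresh request.
    if (this.inFlight === null) {
      this.inFlight = this.fetchToken()
        .then((fresh) => {
          this.cached = fresh;
          return fresh;
        })
        .finally(() => {
          this.inFlight = null;
        });
    }
    return (await this.inFlight).value;
  }

  // Mirrors `.clearTokenCache()` from the shared contract.
  clearTokenCache(): void {
    this.cached = null;
  }
}
```

Sharing the in-flight promise plays the same role as the `sync.Mutex` and `synchronized` requirements in the Go and Java specs: a burst of callers near expiry triggers one token request instead of many.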