chore: Phase 2 OpenSpec scoping — proposal, design, specs, tasks

8 workstreams scoped per OpenSpec standards:
1. HashiCorp Vault integration (secret management)
2. Python SDK (sentryagent-idp)
3. Go SDK (idp-sdk-go)
4. Java SDK (ai.sentryagent:idp-sdk)
5. OPA policy engine (dynamic ABAC, hot-reload Rego)
6. Web Dashboard UI (React 18 + TypeScript)
7. Prometheus + Grafana monitoring (7 metrics, pre-built dashboard)
8. Multi-region Terraform deployment (AWS + GCP)

Status: proposed — awaiting CEO dependency approvals (A0.1–A0.5)
before any implementation begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SentryAgent.ai Developer
2026-03-28 14:53:09 +00:00
parent 0d56895fae
commit 7593bfe1c1
12 changed files with 682 additions and 0 deletions

@@ -0,0 +1,3 @@
change: phase-2-production-ready
status: proposed
date: 2026-03-28

@@ -0,0 +1,218 @@
# Phase 2: Production-Ready — Technical Design
**Date**: 2026-03-28
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal
---
## 1. HashiCorp Vault Integration
### Architecture
```
AgentIdP Server
└── CredentialService
└── VaultClient (new)
└── HashiCorp Vault (sidecar or external)
└── KV Secrets Engine v2
```
### Design Decisions
**ADR-001: Vault over AWS KMS/GCP Secret Manager**
Vault is cloud-agnostic, source-available (Business Source License since 2023), and already common in enterprise environments. Using Vault keeps Phase 2 independent of any single cloud provider.
**ADR-002: KV Secrets Engine v2**
KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.
**ADR-003: AgentIdP stores Vault path, not secret**
`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
### New environment variables
| Variable | Description |
|----------|-------------|
| `VAULT_ADDR` | Vault server address |
| `VAULT_TOKEN` | Vault root/service token |
| `VAULT_MOUNT` | KV mount path (default: `secret`) |
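
The variables above and the path layout from ADR-003 could be handled roughly as follows. This is a minimal sketch, not the planned `VaultClient`: the helper names `loadVaultConfig` and `credentialVaultPath` are hypothetical, and the env map is passed in so the logic can be exercised without a running server.

```typescript
// Hypothetical helpers: only the env var names and the path layout come from the design.
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

// Reads Vault settings from an env-like map and throws if a required one is missing,
// so a misconfigured server fails at startup rather than on the first credential write.
function loadVaultConfig(env: Record<string, string | undefined>): VaultConfig {
  const addr = env["VAULT_ADDR"];
  const token = env["VAULT_TOKEN"];
  if (!addr) throw new Error("VAULT_ADDR is required");
  if (!token) throw new Error("VAULT_TOKEN is required");
  // VAULT_MOUNT is optional and defaults to "secret".
  return { addr, token, mount: env["VAULT_MOUNT"] || "secret" };
}

// Builds the value stored in credentials.vault_path.
function credentialVaultPath(mount: string, agentId: string, credentialId: string): string {
  return `${mount}/agentidp/agents/${agentId}/credentials/${credentialId}`;
}
```

Note that the KV v2 HTTP API addresses secrets under a `data/` segment after the mount; the sketch treats that as the client library's concern and stores only the logical path.
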
### Migration
Add `vault_path` column to `credentials` table (`005_add_vault_path.sql`). Existing credentials retain bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials are rotated (migration guide provided).
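
One way to picture the coexistence of the two code paths is a small selector that inspects each credential row. This is a sketch under assumptions: only `vault_path` is named in the design, so the `secretHash` field and the function name are hypothetical.

```typescript
// Hypothetical row shape: vaultPath mirrors the new vault_path column,
// secretHash is an assumed name for the legacy bcrypt hash column.
interface CredentialRow {
  credentialId: string;
  vaultPath: string | null;  // set for credentials created after migration 005
  secretHash: string | null; // legacy bcrypt hash, kept until the credential is rotated
}

type VerificationPath =
  | { kind: "vault"; path: string }
  | { kind: "bcrypt"; hash: string };

// Prefers Vault when a path is present and falls back to the legacy hash otherwise.
function selectVerificationPath(row: CredentialRow): VerificationPath {
  if (row.vaultPath !== null) return { kind: "vault", path: row.vaultPath };
  if (row.secretHash !== null) return { kind: "bcrypt", hash: row.secretHash };
  throw new Error(`credential ${row.credentialId} has neither vault_path nor a hash`);
}
```
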
---
## 2. Multi-Language SDKs
### Shared contract (all SDKs implement identically)
```
AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
.agents → AgentRegistryClient (5 methods)
.credentials → CredentialClient (4 methods)
.tokens → TokenClient (2 methods)
.audit → AuditClient (2 methods)
.clearTokenCache()
TokenManager — auto-refresh 60s before expiry
AgentIdPError — code, message, httpStatus, details
```
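
The 60-second refresh rule in the contract can be illustrated with a small TypeScript sketch. It is not the SDK's implementation: the fetcher signature, the `CachedToken` fields, and the injectable clock are assumptions made so the logic is self-contained.

```typescript
// Sketch only: field names and the fetcher signature are assumptions, not the SDK's API.
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry in epoch milliseconds
}

const REFRESH_BUFFER_MS = 60_000; // refresh 60s before expiry, per the contract

// True when there is no token or the token is within the refresh buffer.
function needsRefresh(token: CachedToken | null, nowMs: number): boolean {
  return token === null || nowMs >= token.expiresAtMs - REFRESH_BUFFER_MS;
}

class TokenManager {
  private cached: CachedToken | null = null;
  private inFlight: Promise<CachedToken> | null = null;
  private readonly fetchToken: () => Promise<CachedToken>;
  private readonly now: () => number;

  constructor(fetchToken: () => Promise<CachedToken>, now: () => number = Date.now) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  async getToken(): Promise<string> {
    const current = this.cached;
    if (current !== null && !needsRefresh(current, this.now())) {
      return current.accessToken;
    }
    // Share one in-flight request so concurrent callers do not each hit the token endpoint.
    if (this.inFlight === null) {
      this.inFlight = this.fetchToken().finally(() => {
        this.inFlight = null;
      });
    }
    this.cached = await this.inFlight;
    return this.cached.accessToken;
  }

  clearTokenCache(): void {
    this.cached = null;
  }
}
```

The Go and Java variants would need a mutex or `synchronized` block around the cache, as their specs require; a single-threaded event loop only needs the shared in-flight promise.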
### Python SDK (`sentryagent-idp`)
- Python 3.9+ (httpx for async, requests for sync)
- Both sync and async client variants
- PyPI package: `sentryagent-idp`
- Type hints throughout (`mypy --strict` clean)
### Go SDK (`github.com/sentryagent/idp-sdk-go`)
- Go 1.21+, standard library `net/http`
- Context-aware methods (`context.Context` first arg)
- Idiomatic Go error handling (`error` return, no panic)
- Go module: `github.com/sentryagent/idp-sdk-go`
### Java SDK (`ai.sentryagent:idp-sdk`)
- Java 17+, Apache HttpClient 5
- Synchronous and CompletableFuture async variants
- Maven Central: `ai.sentryagent:idp-sdk`
- Fully typed with generics
---
## 3. OPA Policy Engine
### Architecture
```
HTTP Request
→ Auth Middleware (JWT verify) — unchanged
→ OPA Middleware (new) — evaluates policy
→ OPA Wasm (embedded, no network call)
→ Rego policy files (hot-reloadable)
→ Controller
```
### Design Decisions
**ADR-004: OPA Wasm over OPA sidecar**
Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policy files are loaded from the `policies/` directory at startup and reloaded on SIGHUP.
**ADR-005: Policy replaces, does not wrap, scope check**
The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.
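
The reload-on-SIGHUP behaviour from ADR-004 might be structured as a holder that swaps the active policy only when loading succeeds. This is a sketch with stand-in types: `PolicyEvaluator` and the loader callback are hypothetical, and a real loader would likely be asynchronous because it has to read (and, for Rego sources, compile) the Wasm policy.

```typescript
// Stand-in for a loaded policy: takes an input document, returns the allow decision.
type PolicyEvaluator = (input: unknown) => boolean;

class PolicyHolder {
  private current: PolicyEvaluator;

  constructor(initial: PolicyEvaluator) {
    this.current = initial;
  }

  evaluate(input: unknown): boolean {
    return this.current(input);
  }

  // Swaps in the new policy only if loading succeeds; on failure the previous
  // policy stays active so a bad edit cannot take access control offline.
  reload(load: () => PolicyEvaluator): boolean {
    try {
      this.current = load();
      return true;
    } catch {
      return false;
    }
  }
}
```

In the server, a `SIGHUP` handler would presumably call `holder.reload(...)` and log the result; keeping the last good policy on failure is a design choice of this sketch, not something the design text specifies.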
### Policy structure (`policies/`)
```
policies/
authz.rego — main policy: allow/deny
data/
scopes.json — scope → permission mapping
```
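
To make the scope → permission mapping concrete, here is a plain-TypeScript stand-in for the rule that `authz.rego` would express. The JSON shape, the `/api/v1/audit` path, and all names below are assumptions for illustration; the actual layout of `scopes.json` is not specified in this design.

```typescript
// Hypothetical shape for policies/data/scopes.json.
interface Permission {
  method: string;
  pathPrefix: string;
}
type ScopeMap = Record<string, Permission[]>;

const scopeMap: ScopeMap = {
  "agents:read": [{ method: "GET", pathPrefix: "/api/v1/agents" }],
  "audit:read": [{ method: "GET", pathPrefix: "/api/v1/audit" }], // assumed path
};

// Allow when any granted scope maps to a permission matching the request.
// Prefix matching is segment-aware so "/api/v1/agentsX" does not match "/api/v1/agents".
function isAllowed(map: ScopeMap, scopes: string[], method: string, path: string): boolean {
  return scopes.some((scope) =>
    (map[scope] ?? []).some(
      (p) => p.method === method && (path === p.pathPrefix || path.startsWith(p.pathPrefix + "/")),
    ),
  );
}
```

A reference function like this could also serve as an oracle in tests that check the Rego policy returns the same decisions as the static scope check it replaces.
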
---
## 4. Web Dashboard UI
### Architecture
```
dashboard/ (new — separate from sdk/)
src/
components/ — reusable UI components
pages/ — Agents, Credentials, Audit, Health
hooks/ — useAgents, useCredentials, useAudit
lib/
client.ts — wraps @sentryagent/idp-sdk
auth.ts — credential entry and storage
```
### Tech Stack
- React 18 + TypeScript strict
- Vite 5 (build tool)
- TanStack Query v5 (server state)
- shadcn/ui components (Radix UI + Tailwind CSS)
### Pages
| Page | Scope Required | Features |
|------|---------------|----------|
| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
| Health | None | Server uptime, Redis/PostgreSQL connectivity |
### Authentication
The dashboard accepts `clientId` + `clientSecret` via a login form. The `@sentryagent/idp-sdk` `TokenManager` handles token acquisition and caching in `sessionStorage`. No backend session — all state is client-side.
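
A sketch of the `sessionStorage` token cache described above. The storage key, the `StoredToken` fields, and the function names are hypothetical; the storage object is passed in so the same code works with `window.sessionStorage` in the browser and with an in-memory stand-in in tests.

```typescript
// Minimal subset of the Web Storage API used here; window.sessionStorage satisfies it.
interface KeyValueStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface StoredToken {
  accessToken: string;
  expiresAtMs: number;
}

const STORAGE_KEY = "agentidp.token"; // hypothetical key name

function isStoredToken(value: unknown): value is StoredToken {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["accessToken"] === "string" && typeof v["expiresAtMs"] === "number";
}

// Only the token is persisted; the client secret is never written to storage.
function saveToken(storage: KeyValueStorage, token: StoredToken): void {
  storage.setItem(STORAGE_KEY, JSON.stringify(token));
}

// Returns the cached token, or null (and clears the entry) if it is malformed or expired.
function loadToken(storage: KeyValueStorage, nowMs: number): StoredToken | null {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    parsed = null;
  }
  if (!isStoredToken(parsed) || nowMs >= parsed.expiresAtMs) {
    storage.removeItem(STORAGE_KEY);
    return null;
  }
  return parsed;
}
```
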
---
## 5. Prometheus + Grafana Monitoring
### Metrics exposed at `GET /metrics`
| Metric | Type | Description |
|--------|------|-------------|
| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
| `agentidp_agents_registered_total` | Counter | Agent registrations |
| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |
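
The three `_seconds` histograms all need the same measurement: time an operation and record the duration in seconds. A dependency-free sketch of that wrapper follows; `observeSeconds` is a stand-in callback (with `prom-client` it would typically delegate to a histogram's `observe` method), and the function names are hypothetical.

```typescript
// Times a synchronous operation and reports the elapsed time in seconds,
// whether the operation returns or throws.
function timed<T>(
  observeSeconds: (seconds: number) => void,
  fn: () => T,
  nowMs: () => number = Date.now,
): T {
  const start = nowMs();
  try {
    return fn();
  } finally {
    observeSeconds((nowMs() - start) / 1000);
  }
}

// Async variant for database queries and Redis commands.
async function timedAsync<T>(
  observeSeconds: (seconds: number) => void,
  fn: () => Promise<T>,
  nowMs: () => number = Date.now,
): Promise<T> {
  const start = nowMs();
  try {
    return await fn();
  } finally {
    observeSeconds((nowMs() - start) / 1000);
  }
}
```

Recording in the `finally` branch means failed queries are measured too, which keeps latency histograms honest during incidents.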
### Grafana dashboard
Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.
### Docker Compose extension
Add `prometheus` and `grafana` services to a `docker-compose.monitoring.yml` overlay — keeps the base `docker-compose.yml` clean for developers who don't need monitoring.
---
## 6. Multi-Region Deployment (Terraform)
### Structure
```
terraform/
modules/
agentidp/ — reusable module: compute + networking
rds/ — managed PostgreSQL
redis/ — managed Redis
lb/ — load balancer + TLS
environments/
aws/ — AWS-specific config (ECS + RDS + ElastiCache)
gcp/ — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
```
### Design Decisions
**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
AWS and GCP cover the majority of developer deployments. An Azure module is deferred to Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.
**ADR-007: Terraform over Pulumi/CDK**
Terraform is among the most widely used IaC tools and is familiar to most DevOps teams. Its HCL syntax is also simpler to present in documentation.
---
## Component Interaction Map (Phase 2)
```
┌────────────────────┐
│ Web Dashboard │
│ (React + Vite) │
└────────┬───────────┘
│ HTTPS
┌────────────────▼────────────────┐
│ AgentIdP Server │
│ Auth MW → OPA MW → Controllers │
│ /metrics (prom-client) │
└──┬──────────┬──────────┬────────┘
│ │ │
┌─────▼──┐ ┌────▼───┐ ┌──▼───────┐
│Postgres│ │ Redis │ │ Vault │
└────────┘ └────────┘ └──────────┘
┌────────▼────────┐
│ Prometheus │
└────────┬────────┘
┌────────▼────────┐
│ Grafana │
└─────────────────┘
```

@@ -0,0 +1,96 @@
# Phase 2: Production-Ready — Change Proposal
**Date**: 2026-03-28
**Author**: Virtual CTO
**Status**: Proposed — awaiting CEO approval
---
## Summary
Phase 1 delivered a complete, working AgentIdP MVP. Phase 2 makes it production-ready: hardened secrets management, multi-language SDKs, a policy engine, a web dashboard, observability, and multi-region deployment.
---
## Problem Statement
Phase 1 is functional but has the following production gaps:
| Gap | Risk |
|-----|------|
| Credentials stored as bcrypt hashes in PostgreSQL | No HSM/KMS — acceptable for MVP, not for enterprise |
| Only Node.js SDK | Developers in Python/Go/Java cannot use the SDK |
| No policy engine | Scope enforcement is static — no dynamic ABAC/RBAC |
| No web UI | Operators must use `curl` to manage agents |
| No observability | No metrics, no dashboards, no alerting |
| Single-region deployment | No HA, no geo-redundancy |
---
## Proposed Changes
### 1. HashiCorp Vault Integration
Replace bcrypt-hashed credential storage in PostgreSQL with Vault-backed secret management for new credentials; existing credentials keep their bcrypt hashes until rotated. Vault handles secret generation, versioning, and revocation. AgentIdP stores only Vault secret paths, not the secrets themselves.
### 2. Multi-Language SDKs
Add Python, Go, and Java SDKs with identical API surface to the existing Node.js SDK: `AgentIdPClient`, `TokenManager`, service clients for all 14 endpoints, typed error hierarchy.
### 3. Advanced Policy Engine (OPA)
Integrate Open Policy Agent (OPA), embedded as Wasm in the server process rather than run as a sidecar, for dynamic scope and attribute-based access control. Policies are hot-reloadable Rego files, so no server restart is required.
### 4. Web Dashboard UI
A React + TypeScript dashboard for operators: agent list and management, credential overview, audit log viewer, system health panel. Read-only by default; write operations require `agents:write` scope.
### 5. Prometheus + Grafana Monitoring
Instrument all services with Prometheus metrics (`/metrics` endpoint). Ship a pre-built Grafana dashboard for: token issuance rate, agent registration rate, error rates, Redis latency, PostgreSQL query latency.
### 6. Multi-Region Deployment
Terraform modules for AWS/GCP deployment with: managed PostgreSQL (RDS/Cloud SQL), managed Redis (ElastiCache/Memorystore), container orchestration (ECS/Cloud Run), load balancer, and a deployment guide.
---
## Out of Scope for Phase 2
- AGNTCY federation (Phase 3)
- W3C DID support (Phase 3)
- SOC 2 certification (Phase 3)
- Rust/C++ SDKs (Phase 3)
---
## Dependencies
| New Dependency | Purpose | CEO Approval Required |
|---------------|---------|----------------------|
| `@openpolicyagent/opa-wasm` | OPA policy evaluation | Yes |
| `node-vault` | HashiCorp Vault client | Yes |
| React 18 + Vite | Web dashboard | Yes |
| `prom-client` | Prometheus metrics | Yes |
| Terraform | Infrastructure as code | Yes |
---
## Delivery Sequence (per OpenSpec spec-first workflow)
```
1. Vault integration (highest security impact)
2. Python SDK (highest developer demand)
3. Go SDK
4. Java SDK
5. OPA policy engine
6. Web dashboard UI
7. Prometheus + Grafana monitoring
8. Multi-region deployment (Terraform)
```
---
## Success Criteria
- All new dependencies CEO-approved before implementation begins
- All new API endpoints have OpenAPI 3.0 specs before implementation
- TypeScript strict mode + zero `any` maintained throughout
- >80% test coverage on all new services
- All SDKs pass the same QA gate: 14-endpoint coverage, typed errors, zero `any`
- Web dashboard passes OWASP Top 10 security review
- Monitoring stack ships with pre-built dashboards — zero manual setup required

@@ -0,0 +1,44 @@
# Spec: Multi-Region Deployment (Terraform)
**Status**: Pending CEO approval
**Workstream**: 8 of 8
## Scope
- `terraform/` directory at project root
- Shared `agentidp` module (compute, networking, secrets)
- `environments/aws/` — ECS Fargate + RDS PostgreSQL + ElastiCache Redis
- `environments/gcp/` — Cloud Run + Cloud SQL + Memorystore Redis
- Deployment guide: `docs/devops/deployment.md`
## Module structure
```
terraform/
modules/
agentidp/
main.tf — compute (ECS task or Cloud Run service)
networking.tf — VPC, subnets, security groups
variables.tf — all configurable inputs
outputs.tf — service URL, DB endpoint, Redis endpoint
rds/ — managed PostgreSQL
redis/ — managed Redis
lb/ — ALB (AWS) or Cloud LB (GCP), TLS cert
environments/
aws/
main.tf — calls modules, sets AWS-specific vars
variables.tf
terraform.tfvars.example
gcp/
main.tf
variables.tf
terraform.tfvars.example
```
## Acceptance Criteria
- [ ] `terraform validate` passes for both aws and gcp environments
- [ ] `terraform plan` produces no errors against a live AWS/GCP account (test in dev env)
- [ ] JWT_PRIVATE_KEY and JWT_PUBLIC_KEY injected as environment secrets (not hardcoded)
- [ ] TLS termination at load balancer — HTTPS only in production modules
- [ ] PostgreSQL and Redis not publicly accessible — VPC-internal only
- [ ] `docs/devops/deployment.md` — end-to-end deployment walkthrough for AWS and GCP
- [ ] `terraform.tfvars.example` provided for both environments — no secrets in version control

@@ -0,0 +1,23 @@
# Spec: Go SDK (`github.com/sentryagent/idp-sdk-go`)
**Status**: Pending CEO approval
**Workstream**: 3 of 8
## Scope
- `sdk-go/` directory at project root
- Context-aware `AgentIdPClient` using standard library `net/http`
- `TokenManager` with mutex-guarded cache and 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- Idiomatic Go error type `AgentIdPError` implementing `error` interface
- `go.mod` module: `github.com/sentryagent/idp-sdk-go`
- `sdk-go/README.md`
## Acceptance Criteria
- [ ] All 14 endpoints covered
- [ ] All methods take `context.Context` as first argument
- [ ] No panics — all errors returned as `error`
- [ ] `AgentIdPError` implements `error` and exposes `.Code`, `.HTTPStatus`, `.Details`
- [ ] `TokenManager` is goroutine-safe (`sync.Mutex` on cache)
- [ ] `go vet` and `staticcheck` pass with zero warnings
- [ ] `go test ./...` with >80% coverage
- [ ] README matches Node.js SDK structure

@@ -0,0 +1,23 @@
# Spec: Java SDK (`ai.sentryagent:idp-sdk`)
**Status**: Pending CEO approval
**Workstream**: 4 of 8
## Scope
- `sdk-java/` directory at project root
- `AgentIdPClient` with sync and `CompletableFuture` async variants
- `TokenManager` with thread-safe cache and 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- `AgentIdPException` extending `RuntimeException` with `code`, `httpStatus`, `details`
- `pom.xml`: groupId=`ai.sentryagent`, artifactId=`idp-sdk`, Java 17+
- `sdk-java/README.md`
## Acceptance Criteria
- [ ] All 14 endpoints covered
- [ ] Sync methods return typed POJOs; async methods return `CompletableFuture<T>`
- [ ] `AgentIdPException` thrown (not raw IOException) on all failure paths
- [ ] `TokenManager` is thread-safe (`synchronized` on cache)
- [ ] Apache HttpClient 5 for HTTP transport
- [ ] Jackson for JSON serialization
- [ ] `mvn verify` passes with >80% coverage (JUnit 5)
- [ ] README matches Node.js SDK structure

@@ -0,0 +1,32 @@
# Spec: Prometheus + Grafana Monitoring
**Status**: Pending CEO approval
**Workstream**: 7 of 8
## Scope
- `prom-client` integration — expose `GET /metrics`
- 7 metrics (counters + histograms) across all services
- `monitoring/` directory: Prometheus config + Grafana provisioning
- `docker-compose.monitoring.yml` overlay (adds prometheus + grafana services)
- Pre-built Grafana dashboard JSON (`monitoring/grafana/dashboards/agentidp.json`)
## Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `agentidp_tokens_issued_total` | Counter | `outcome` (success/failure) |
| `agentidp_agents_registered_total` | Counter | `outcome` |
| `agentidp_http_requests_total` | Counter | `method`, `path`, `status_code` |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `path` |
| `agentidp_rate_limit_rejections_total` | Counter | — |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` |
## Acceptance Criteria
- [ ] `GET /metrics` returns Prometheus text format
- [ ] `/metrics` endpoint does NOT require Bearer auth (Prometheus scrapes it)
- [ ] All 7 metrics present and updating under load
- [ ] Grafana dashboard auto-provisions on `docker compose -f docker-compose.monitoring.yml up`
- [ ] Grafana runs on port 3001 (no conflict with AgentIdP on 3000)
- [ ] `docs/devops/operations.md` updated with monitoring section
- [ ] `prom-client` added as new dependency — CEO approval gate

@@ -0,0 +1,37 @@
# Spec: OPA Policy Engine Integration
**Status**: Pending CEO approval
**Workstream**: 5 of 8
## Scope
- New `OpaMiddleware` replacing static scope check in `auth.ts`
- `@openpolicyagent/opa-wasm` integration (embedded Wasm, no sidecar)
- `policies/authz.rego` — main allow/deny policy
- `policies/data/scopes.json` — scope to permission mapping
- SIGHUP handler to hot-reload policies without restart
- New env var: `POLICY_DIR` (default: `./policies`)
## Policy interface
```
input = {
"method": "GET",
"path": "/api/v1/agents",
"scopes": ["agents:read"],
"agentId": "uuid"
}
output = {
"allow": true | false,
"reason": "string" // populated when allow=false
}
```
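
The interface above maps naturally onto TypeScript types, and the `allow=false` → `403` rule from the acceptance criteria can be written as a pure function. This is a sketch: the error body shape and the `FORBIDDEN` code are assumptions, since the existing error format is not shown in this spec.

```typescript
interface PolicyInput {
  method: string;
  path: string;
  scopes: string[];
  agentId: string;
}

interface PolicyOutput {
  allow: boolean;
  reason?: string; // populated when allow is false
}

type MiddlewareResult =
  | { proceed: true }
  | { proceed: false; status: 403; body: { code: string; message: string } };

// Translates a policy decision into what the middleware should do next.
function toMiddlewareResult(output: PolicyOutput): MiddlewareResult {
  if (output.allow) return { proceed: true };
  return {
    proceed: false,
    status: 403,
    // "FORBIDDEN" and the fallback message are placeholders for illustration.
    body: { code: "FORBIDDEN", message: output.reason ?? "access denied by policy" },
  };
}
```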
## Acceptance Criteria
- [ ] All existing scope checks replaced by OPA evaluation
- [ ] Policy files hot-reloadable on SIGHUP (no restart required)
- [ ] OPA Wasm loaded at startup — fail-fast if `POLICY_DIR` invalid
- [ ] `allow=false` responses return `403` with `reason` in error body
- [ ] Existing test suite passes unchanged (OPA evaluates same rules as before)
- [ ] New unit tests for OPA middleware: allow/deny cases, missing scope, invalid input
- [ ] `POLICY_DIR` env var documented in `docs/devops/environment-variables.md`

@@ -0,0 +1,24 @@
# Spec: Python SDK (`sentryagent-idp`)
**Status**: Pending CEO approval
**Workstream**: 2 of 8
## Scope
- `sdk-python/` directory at project root
- `AgentIdPClient` with sync and async variants
- `TokenManager` with 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- `AgentIdPError` typed exception
- Full type hints — `mypy --strict` clean
- `sdk-python/README.md` with installation and usage
## Acceptance Criteria
- [ ] All 14 API endpoints covered
- [ ] Sync client: `requests` library
- [ ] Async client: `httpx` library
- [ ] `mypy --strict` passes with zero errors
- [ ] Zero untyped code
- [ ] `AgentIdPError` raised (not raw requests/httpx exceptions) on all failure paths
- [ ] `TokenManager` tested: caches token, refreshes at exp-60s
- [ ] `pyproject.toml` with: name=sentryagent-idp, python>=3.9, dependencies declared
- [ ] README matches Node.js SDK structure

@@ -0,0 +1,21 @@
# Spec: HashiCorp Vault Integration
**Status**: Pending CEO approval
**Workstream**: 1 of 8
## Scope
- VaultClient class wrapping `node-vault`
- `005_add_vault_path.sql` migration
- Updated CredentialService to write secrets to Vault instead of PostgreSQL
- New env vars: VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
- Migration guide: bcrypt → Vault coexistence strategy
## Acceptance Criteria
- [ ] New credentials: secret written to Vault KV v2, `vault_path` stored in PostgreSQL
- [ ] Credential rotation: Vault versioned update, `vault_path` unchanged
- [ ] Credential revocation: Vault secret deleted, DB status = `revoked`
- [ ] Existing bcrypt credentials continue to work until rotated
- [ ] VaultClient follows existing service interface pattern (DRY, SOLID)
- [ ] Zero `any` types, TypeScript strict
- [ ] `VAULT_ADDR` / `VAULT_TOKEN` validation at startup (fail-fast)
- [ ] DevOps docs updated with Vault setup section
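
The rotation and revocation criteria can be checked against an in-memory fake before a real Vault is involved. In this sketch the method names follow the Workstream 1 task list (`writeSecret`, `readSecret`, `deleteSecret`, `rotateSecret`), but the signatures, the `SecretStore` interface name, and the synchronous style are assumptions; a real client would return promises.

```typescript
// Hypothetical interface; synchronous for brevity.
interface SecretStore {
  writeSecret(path: string, value: string): number; // returns the new version number
  readSecret(path: string): string | null;
  rotateSecret(path: string, value: string): number;
  deleteSecret(path: string): void;
}

// In-memory fake that mimics KV v2 versioning: each write appends a version.
class InMemorySecretStore implements SecretStore {
  private readonly versions = new Map<string, string[]>();

  writeSecret(path: string, value: string): number {
    const history = this.versions.get(path) ?? [];
    history.push(value);
    this.versions.set(path, history);
    return history.length;
  }

  readSecret(path: string): string | null {
    const history = this.versions.get(path);
    if (history === undefined || history.length === 0) return null;
    return history[history.length - 1] ?? null;
  }

  // Rotation writes a new version at the same path, so vault_path is unchanged.
  rotateSecret(path: string, value: string): number {
    if (!this.versions.has(path)) throw new Error(`no secret at ${path}`);
    return this.writeSecret(path, value);
  }

  deleteSecret(path: string): void {
    this.versions.delete(path);
  }
}
```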

@@ -0,0 +1,34 @@
# Spec: Web Dashboard UI
**Status**: Pending CEO approval
**Workstream**: 6 of 8
## Scope
- `dashboard/` directory at project root
- React 18 + TypeScript strict, built with Vite 5
- TanStack Query v5 for server state
- shadcn/ui (Radix UI + Tailwind CSS) for components
- Four pages: Agents, Credentials, Audit Log, Health
- Client-side auth: `clientId` + `clientSecret` → `TokenManager`
- Served from AgentIdP server at `GET /dashboard` (static build)
## Pages
| Page | Route | Scope Required |
|------|-------|---------------|
| Login | `/dashboard/login` | None |
| Agents | `/dashboard/agents` | `agents:read` |
| Agent Detail | `/dashboard/agents/:id` | `agents:read` |
| Credentials | `/dashboard/agents/:id/credentials` | `agents:read` |
| Audit Log | `/dashboard/audit` | `audit:read` |
| Health | `/dashboard/health` | None |
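
The route table above could back a client-side guard along these lines. It is a sketch: the names `ROUTE_SCOPES` and `canAccess` are hypothetical, and denying unknown routes is an assumption rather than a stated requirement. A guard like this only controls what the UI shows; the server still enforces scopes on every API call.

```typescript
// One entry per row of the Pages table; scope null means no scope is required.
const ROUTE_SCOPES: ReadonlyArray<{ pattern: RegExp; scope: string | null }> = [
  { pattern: /^\/dashboard\/login$/, scope: null },
  { pattern: /^\/dashboard\/agents$/, scope: "agents:read" },
  { pattern: /^\/dashboard\/agents\/[^/]+$/, scope: "agents:read" },
  { pattern: /^\/dashboard\/agents\/[^/]+\/credentials$/, scope: "agents:read" },
  { pattern: /^\/dashboard\/audit$/, scope: "audit:read" },
  { pattern: /^\/dashboard\/health$/, scope: null },
];

// Returns true when the path is a known route and the token carries its required scope.
function canAccess(path: string, grantedScopes: readonly string[]): boolean {
  const route = ROUTE_SCOPES.find((r) => r.pattern.test(path));
  if (route === undefined) return false; // unknown routes are denied in this sketch
  return route.scope === null || grantedScopes.includes(route.scope);
}
```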
## Acceptance Criteria
- [ ] TypeScript strict — zero `any` across all dashboard files
- [ ] `dashboard/tsconfig.json` with `strict: true`
- [ ] Login form stores token in `sessionStorage` only (not `localStorage`)
- [ ] All write operations (suspend, revoke, rotate) require confirmation dialog
- [ ] OWASP Top 10 review: no XSS, no CSRF, no sensitive data in URL params
- [ ] Vite build outputs to `dashboard/dist/`; AgentIdP serves it as static
- [ ] `dashboard/README.md` — how to build and serve
- [ ] Responsive layout — functional on desktop and tablet

@@ -0,0 +1,127 @@
# Phase 2: Production-Ready — Tasks
**Status**: Awaiting CEO dependency approvals before any implementation begins.
## CEO Approval Gates (required before implementation)
- [ ] A0.1 Approve dependency: `node-vault` (Vault integration)
- [ ] A0.2 Approve dependency: `@openpolicyagent/opa-wasm` (OPA policy engine)
- [ ] A0.3 Approve dependency: React 18 + Vite 5 (web dashboard)
- [ ] A0.4 Approve dependency: `prom-client` (Prometheus metrics)
- [ ] A0.5 Approve dependency: Terraform (infrastructure as code)
---
## Workstream 1: HashiCorp Vault Integration
- [ ] 1.1 Write `src/vault/VaultClient.ts` — wraps `node-vault`; methods: writeSecret, readSecret, deleteSecret, rotateSecret
- [ ] 1.2 Write `src/db/migrations/005_add_vault_path.sql` — add `vault_path` column to `credentials`
- [ ] 1.3 Update `CredentialService.ts` — new credentials use Vault; existing bcrypt credentials continue to work
- [ ] 1.4 Update `docs/devops/environment-variables.md` — add VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
- [ ] 1.5 Write `docs/devops/vault-setup.md` — Vault dev server setup, production Vault config, migration guide
- [ ] 1.6 Write unit tests for VaultClient (mocked Vault) and updated CredentialService
- [ ] 1.7 QA sign-off: zero `any`, TypeScript strict, >80% coverage, coexistence verified
## Workstream 2: Python SDK
- [ ] 2.1 Create `sdk-python/` with `pyproject.toml` — name: sentryagent-idp, python>=3.9
- [ ] 2.2 Write `sdk-python/src/sentryagent_idp/types.py` — all request/response dataclasses
- [ ] 2.3 Write `sdk-python/src/sentryagent_idp/errors.py` — AgentIdPError exception
- [ ] 2.4 Write `sdk-python/src/sentryagent_idp/token_manager.py` — sync TokenManager
- [ ] 2.5 Write `sdk-python/src/sentryagent_idp/async_token_manager.py` — async TokenManager (httpx)
- [ ] 2.6 Write `sdk-python/src/sentryagent_idp/services/agents.py` — AgentRegistryClient (sync + async)
- [ ] 2.7 Write `sdk-python/src/sentryagent_idp/services/credentials.py` — CredentialClient (sync + async)
- [ ] 2.8 Write `sdk-python/src/sentryagent_idp/services/token.py` — TokenClient (sync + async)
- [ ] 2.9 Write `sdk-python/src/sentryagent_idp/services/audit.py` — AuditClient (sync + async)
- [ ] 2.10 Write `sdk-python/src/sentryagent_idp/client.py` — AgentIdPClient (sync) + AsyncAgentIdPClient
- [ ] 2.11 Write `sdk-python/src/sentryagent_idp/__init__.py` — barrel exports
- [ ] 2.12 Write `sdk-python/README.md`
- [ ] 2.13 QA: `mypy --strict` clean, all 14 endpoints, AgentIdPError on all failure paths, pytest >80%
## Workstream 3: Go SDK
- [ ] 3.1 Create `sdk-go/` with `go.mod` — module: github.com/sentryagent/idp-sdk-go, go 1.21
- [ ] 3.2 Write `sdk-go/types.go` — all request/response structs
- [ ] 3.3 Write `sdk-go/errors.go` — AgentIdPError type implementing error interface
- [ ] 3.4 Write `sdk-go/token_manager.go` — mutex-guarded TokenManager
- [ ] 3.5 Write `sdk-go/services/agents.go` — AgentRegistryClient
- [ ] 3.6 Write `sdk-go/services/credentials.go` — CredentialClient
- [ ] 3.7 Write `sdk-go/services/token.go` — TokenClient
- [ ] 3.8 Write `sdk-go/services/audit.go` — AuditClient
- [ ] 3.9 Write `sdk-go/client.go` — AgentIdPClient
- [ ] 3.10 Write `sdk-go/README.md`
- [ ] 3.11 QA: `go vet` clean, `staticcheck` clean, all 14 endpoints, goroutine-safe, `go test ./...` >80%
## Workstream 4: Java SDK
- [ ] 4.1 Create `sdk-java/` with `pom.xml` — groupId: ai.sentryagent, artifactId: idp-sdk, Java 17
- [ ] 4.2 Write all POJO request/response model classes
- [ ] 4.3 Write `AgentIdPException.java` extending RuntimeException
- [ ] 4.4 Write `TokenManager.java` — synchronized cache with 60s refresh buffer
- [ ] 4.5 Write `AgentRegistryClient.java` — sync + CompletableFuture methods
- [ ] 4.6 Write `CredentialClient.java` — sync + CompletableFuture methods
- [ ] 4.7 Write `TokenClient.java` — sync + CompletableFuture methods
- [ ] 4.8 Write `AuditClient.java` — sync + CompletableFuture methods
- [ ] 4.9 Write `AgentIdPClient.java` — composes all service clients
- [ ] 4.10 Write `sdk-java/README.md`
- [ ] 4.11 QA: `mvn verify` passes, all 14 endpoints, AgentIdPException on all failure paths, JUnit 5 >80%
## Workstream 5: OPA Policy Engine
- [ ] 5.1 Write `policies/authz.rego` — allow/deny rules matching all current scope checks
- [ ] 5.2 Write `policies/data/scopes.json` — scope to endpoint permission mapping
- [ ] 5.3 Write `src/middleware/opa.ts` — OpaMiddleware: loads Wasm, evaluates input, returns allow/deny
- [ ] 5.4 Replace static scope check in `src/middleware/auth.ts` with OpaMiddleware
- [ ] 5.5 Add SIGHUP handler in `src/server.ts` to hot-reload policy files
- [ ] 5.6 Update `docs/devops/environment-variables.md` — add POLICY_DIR
- [ ] 5.7 QA: all existing auth tests pass unchanged, new OPA unit tests, hot-reload verified
## Workstream 6: Web Dashboard UI
- [ ] 6.1 Create `dashboard/` with Vite 5 + React 18 + TypeScript strict configuration
- [ ] 6.2 Set up shadcn/ui with Tailwind CSS
- [ ] 6.3 Write `dashboard/src/lib/auth.ts` — credential entry, TokenManager, sessionStorage
- [ ] 6.4 Write `dashboard/src/lib/client.ts` — wraps @sentryagent/idp-sdk AgentIdPClient
- [ ] 6.5 Write Login page (`/dashboard/login`)
- [ ] 6.6 Write Agents page (`/dashboard/agents`) — list, search, filter by status
- [ ] 6.7 Write Agent Detail page (`/dashboard/agents/:id`) — suspend/reactivate with confirm dialog
- [ ] 6.8 Write Credentials page (`/dashboard/agents/:id/credentials`) — rotate/revoke with confirm
- [ ] 6.9 Write Audit Log page (`/dashboard/audit`) — filters, pagination
- [ ] 6.10 Write Health page (`/dashboard/health`) — PostgreSQL + Redis connectivity status
- [ ] 6.11 Configure AgentIdP Express app to serve `dashboard/dist/` at `/dashboard`
- [ ] 6.12 Write `dashboard/README.md`
- [ ] 6.13 QA: TypeScript strict, zero `any`, OWASP Top 10 review, responsive layout verified
## Workstream 7: Prometheus + Grafana Monitoring
- [ ] 7.1 Add `prom-client` to dependencies (after CEO approval A0.4)
- [ ] 7.2 Write `src/metrics/registry.ts` — shared Prometheus Registry with all 7 metric definitions
- [ ] 7.3 Instrument `OAuth2Service.ts` — increment `agentidp_tokens_issued_total`
- [ ] 7.4 Instrument `AgentService.ts` — increment `agentidp_agents_registered_total`
- [ ] 7.5 Instrument `src/middleware/` — HTTP request counter and duration histogram
- [ ] 7.6 Instrument `src/db/pool.ts` — DB query duration histogram
- [ ] 7.7 Instrument `src/cache/redis.ts` — Redis command duration histogram
- [ ] 7.8 Add `GET /metrics` route (unauthenticated, Prometheus text format)
- [ ] 7.9 Write `monitoring/prometheus/prometheus.yml` — scrape config
- [ ] 7.10 Write `monitoring/grafana/provisioning/` — datasource + dashboard provisioning
- [ ] 7.11 Write `monitoring/grafana/dashboards/agentidp.json` — pre-built Grafana dashboard
- [ ] 7.12 Write `docker-compose.monitoring.yml` overlay
- [ ] 7.13 Update `docs/devops/operations.md` — monitoring section
- [ ] 7.14 QA: all 7 metrics verified under load, Grafana auto-provisions, no auth leak on /metrics
## Workstream 8: Multi-Region Deployment (Terraform)
- [ ] 8.1 Write `terraform/modules/agentidp/main.tf` + `variables.tf` + `outputs.tf`
- [ ] 8.2 Write `terraform/modules/rds/` — managed PostgreSQL module
- [ ] 8.3 Write `terraform/modules/redis/` — managed Redis module
- [ ] 8.4 Write `terraform/modules/lb/` — load balancer + TLS module
- [ ] 8.5 Write `terraform/environments/aws/main.tf` + `variables.tf` + `terraform.tfvars.example`
- [ ] 8.6 Write `terraform/environments/gcp/main.tf` + `variables.tf` + `terraform.tfvars.example`
- [ ] 8.7 Write `docs/devops/deployment.md` — end-to-end AWS and GCP deployment walkthrough
- [ ] 8.8 QA: `terraform validate` passes, secrets not hardcoded, TLS enforced, DB/Redis VPC-internal
---
## Phase 2 Complete Criteria
All 8 workstreams done. All tasks checked. All QA gates passed. CEO reviewed.