chore: Phase 2 OpenSpec scoping — proposal, design, specs, tasks

8 workstreams scoped per OpenSpec standards:

1. HashiCorp Vault integration (secret management)
2. Python SDK (sentryagent-idp)
3. Go SDK (idp-sdk-go)
4. Java SDK (ai.sentryagent:idp-sdk)
5. OPA policy engine (dynamic ABAC, hot-reload Rego)
6. Web Dashboard UI (React 18 + TypeScript)
7. Prometheus + Grafana monitoring (7 metrics, pre-built dashboard)
8. Multi-region Terraform deployment (AWS + GCP)

Status: proposed — awaiting CEO dependency approvals (A0.1–A0.5) before any implementation begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3
openspec/changes/phase-2-production-ready/.openspec.yaml
Normal file
@@ -0,0 +1,3 @@
change: phase-2-production-ready
status: proposed
date: 2026-03-28
218
openspec/changes/phase-2-production-ready/design.md
Normal file
@@ -0,0 +1,218 @@
# Phase 2: Production-Ready — Technical Design

**Date**: 2026-03-28
**Author**: Virtual Architect
**Status**: Draft — pending CEO approval of proposal

---

## 1. HashiCorp Vault Integration

### Architecture

```
AgentIdP Server
  └── CredentialService
        └── VaultClient (new)
              └── HashiCorp Vault (sidecar or external)
                    └── KV Secrets Engine v2
```

### Design Decisions

**ADR-001: Vault over AWS KMS/GCP Secret Manager**
Vault is cloud-agnostic, open-source, and already standard in enterprise environments. Using Vault keeps Phase 2 cloud-provider independent.

**ADR-002: KV Secrets Engine v2**
KV v2 provides versioned secrets and metadata. When a credential is rotated, the old version is retained in Vault history, enabling audit-grade secret lifecycle tracking.

**ADR-003: AgentIdP stores Vault path, not secret**
`credentials.vault_path` stores the Vault KV path (e.g. `secret/agentidp/agents/{agentId}/credentials/{credentialId}`). The secret itself is never written to PostgreSQL.
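
As a rough illustration of ADR-003, the sketch below builds the value that would be stored in `credentials.vault_path`. The helper name `buildVaultPath` and the segment validation rule are assumptions made for this example, not part of the design. Note that the KV v2 HTTP API addresses secrets under `<mount>/data/<path>`, so the logical path shown above may need that segment added when calling Vault directly.

```typescript
// Hypothetical helper: name and validation rule are assumptions for illustration.
const SAFE_SEGMENT = /^[A-Za-z0-9_-]+$/;

// Reject segments that could escape the intended path (e.g. "../other").
function assertSafeSegment(name: string, value: string): void {
  if (!SAFE_SEGMENT.test(value)) {
    throw new Error(`invalid ${name}: ${JSON.stringify(value)}`);
  }
}

// Builds the logical KV path described in ADR-003:
//   {mount}/agentidp/agents/{agentId}/credentials/{credentialId}
function buildVaultPath(mount: string, agentId: string, credentialId: string): string {
  assertSafeSegment("mount", mount);
  assertSafeSegment("agentId", agentId);
  assertSafeSegment("credentialId", credentialId);
  return `${mount}/agentidp/agents/${agentId}/credentials/${credentialId}`;
}
```

Validating each segment before interpolation keeps a malformed identifier from addressing a secret outside the agent's own subtree.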

### New environment variables
| Variable | Description |
|----------|-------------|
| `VAULT_ADDR` | Vault server address |
| `VAULT_TOKEN` | Vault root/service token |
| `VAULT_MOUNT` | KV mount path (default: `secret`) |

### Migration
Add `vault_path` column to `credentials` table (`005_add_vault_path.sql`). Existing credentials retain bcrypt hashes; new credentials use Vault. Both code paths coexist until all credentials are rotated (migration guide provided).
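
The coexistence rule can be pictured as a per-row dispatch: a credential with a `vault_path` is verified against Vault, anything else falls back to its bcrypt hash. The following is a minimal sketch under assumed names; the `CredentialRow` shape, the `secretHash` field and `resolveBackend` are hypothetical and only illustrate the branching.

```typescript
// Assumed row shape: field names are illustrative, not the real schema.
interface CredentialRow {
  credentialId: string;
  secretHash: string | null; // bcrypt hash for pre-migration credentials
  vaultPath: string | null;  // set once the credential is Vault-backed
}

type SecretBackend =
  | { kind: "vault"; path: string }
  | { kind: "bcrypt"; hash: string };

// Decide which verification path a credential uses during the migration window.
function resolveBackend(row: CredentialRow): SecretBackend {
  if (row.vaultPath !== null) {
    return { kind: "vault", path: row.vaultPath };
  }
  if (row.secretHash !== null) {
    return { kind: "bcrypt", hash: row.secretHash };
  }
  throw new Error(`credential ${row.credentialId} has neither vault_path nor a bcrypt hash`);
}
```

Preferring `vault_path` when both values are present means a rotated credential switches to Vault immediately, even if the old hash has not yet been cleared.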

---

## 2. Multi-Language SDKs

### Shared contract (all SDKs implement identically)

```
AgentIdPClient(baseUrl, clientId, clientSecret, scopes?)
  .agents       → AgentRegistryClient (5 methods)
  .credentials  → CredentialClient (4 methods)
  .tokens       → TokenClient (2 methods)
  .audit        → AuditClient (2 methods)
  .clearTokenCache()

TokenManager — auto-refresh 60s before expiry
AgentIdPError — code, message, httpStatus, details
```
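
To make the "auto-refresh 60s before expiry" rule concrete, here is a minimal sketch of the cache logic. It is not the existing Node.js SDK's `TokenManager`; the names `CachedToken`, `needsRefresh` and `TokenCache`, and the injected `fetchToken` and `now` functions, are assumptions chosen so the behaviour can be tested without a server.

```typescript
const REFRESH_BUFFER_MS = 60 * 1000; // refresh 60s before expiry, per the shared contract

interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time in epoch milliseconds
}

// True when there is no token or it is inside the 60s refresh window.
function needsRefresh(cached: CachedToken | null, nowMs: number): boolean {
  return cached === null || nowMs >= cached.expiresAtMs - REFRESH_BUFFER_MS;
}

class TokenCache {
  private cached: CachedToken | null = null;
  private inFlight: Promise<CachedToken> | null = null;
  private readonly fetchToken: () => Promise<CachedToken>;
  private readonly now: () => number;

  constructor(fetchToken: () => Promise<CachedToken>, now: () => number = Date.now) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  async getToken(): Promise<string> {
    const current = this.cached;
    if (current !== null && !needsRefresh(current, this.now())) {
      return current.accessToken;
    }
    // Share one request between concurrent callers instead of fetching twice.
    let pending = this.inFlight;
    if (pending === null) {
      pending = this.fetchToken();
      this.inFlight = pending;
    }
    try {
      const fresh = await pending;
      this.cached = fresh;
      return fresh.accessToken;
    } finally {
      this.inFlight = null;
    }
  }

  clearTokenCache(): void {
    this.cached = null;
  }
}
```

Injecting the clock keeps the refresh boundary testable; the Go and Java variants would guard the same state with a mutex or `synchronized` block instead of relying on a single-threaded event loop.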

### Python SDK (`sentryagent-idp`)
- Python 3.9+ (httpx for async, requests for sync)
- Both sync and async client variants
- PyPI package: `sentryagent-idp`
- Type hints throughout (`mypy --strict` clean)

### Go SDK (`github.com/sentryagent/idp-sdk-go`)
- Go 1.21+, standard library `net/http`
- Context-aware methods (`context.Context` first arg)
- Idiomatic Go error handling (`error` return, no panic)
- Go module: `github.com/sentryagent/idp-sdk-go`

### Java SDK (`ai.sentryagent:idp-sdk`)
- Java 17+, Apache HttpClient 5
- Synchronous and CompletableFuture async variants
- Maven Central: `ai.sentryagent:idp-sdk`
- Fully typed with generics

---

## 3. OPA Policy Engine

### Architecture

```
HTTP Request
  → Auth Middleware (JWT verify) — unchanged
  → OPA Middleware (new) — evaluates policy
      → OPA Wasm (embedded, no network call)
      → Rego policy files (hot-reloadable)
  → Controller
```

### Design Decisions

**ADR-004: OPA Wasm over OPA sidecar**
Embedding OPA as Wasm in the Node.js process eliminates a network hop and removes a runtime dependency. Policy files are loaded from `policies/` directory at startup and reloaded on SIGHUP.
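
One way to picture the load-at-startup, reload-on-SIGHUP behaviour in ADR-004 is a holder that swaps its evaluator only when the new policy loads successfully. This is a sketch under assumptions: `ReloadablePolicy` and the synchronous `load` callback are hypothetical stand-ins (a real Wasm loader would most likely be asynchronous), and the signal wiring is shown only as a comment.

```typescript
// An evaluator takes a policy input and returns allow (true) or deny (false).
type Evaluate = (input: unknown) => boolean;

class ReloadablePolicy {
  private current: Evaluate;
  private readonly load: () => Evaluate;

  // Loading in the constructor gives fail-fast startup: a bad policy throws here.
  constructor(load: () => Evaluate) {
    this.load = load;
    this.current = load();
  }

  evaluate(input: unknown): boolean {
    return this.current(input);
  }

  // Swap in the new policy only if it loads; otherwise keep serving the old one.
  reload(): boolean {
    try {
      this.current = this.load();
      return true;
    } catch (_err) {
      return false;
    }
  }
}

// Assumed wiring in the server entry point (not executed in this sketch):
//   process.on("SIGHUP", () => { policy.reload(); });
```

Keeping the previous evaluator on a failed reload means a syntax error in an edited policy degrades to "stale policy" rather than "no policy".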

**ADR-005: Policy replaces, does not wrap, scope check**
The existing static scope check in `auth.ts` is replaced by an OPA policy evaluation. This keeps the policy as the single source of truth for access control.

### Policy structure (`policies/`)
```
policies/
  authz.rego — main policy: allow/deny
  data/
    scopes.json — scope → permission mapping
```

---

## 4. Web Dashboard UI

### Architecture

```
dashboard/ (new — separate from sdk/)
  src/
    components/ — reusable UI components
    pages/ — Agents, Credentials, Audit, Health
    hooks/ — useAgents, useCredentials, useAudit
    lib/
      client.ts — wraps @sentryagent/idp-sdk
      auth.ts — credential entry and storage
```

### Tech Stack
- React 18 + TypeScript strict
- Vite 5 (build tool)
- TanStack Query v5 (server state)
- shadcn/ui components (Radix UI + Tailwind CSS)

### Pages
| Page | Scope Required | Features |
|------|---------------|----------|
| Agents | `agents:read` | List, search, view detail, suspend/reactivate |
| Credentials | `agents:read` | List credentials per agent, rotate, revoke |
| Audit Log | `audit:read` | Filter by agent/action/outcome/date, paginate |
| Health | None | Server uptime, Redis/PostgreSQL connectivity |

### Authentication
The dashboard accepts `clientId` + `clientSecret` via a login form. The `@sentryagent/idp-sdk` `TokenManager` handles token acquisition and caching in `sessionStorage`. No backend session — all state is client-side.
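
The client-side caching described above could look roughly like the sketch below. It is written against a small `KeyValueStore` interface that the browser's `sessionStorage` satisfies, so it can be exercised outside a browser. The storage key, the `StoredToken` shape and the function names are assumptions, not the SDK's actual API.

```typescript
// Minimal subset of the Web Storage API; window.sessionStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface StoredToken {
  accessToken: string;
  expiresAtMs: number;
}

const TOKEN_KEY = "agentidp.token"; // hypothetical storage key

function isStoredToken(value: unknown): value is StoredToken {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  return typeof record["accessToken"] === "string" && typeof record["expiresAtMs"] === "number";
}

function saveToken(store: KeyValueStore, token: StoredToken): void {
  store.setItem(TOKEN_KEY, JSON.stringify(token));
}

// Returns the cached token, or null (and clears the entry) if it is missing,
// malformed, or already expired.
function loadToken(store: KeyValueStore, nowMs: number): StoredToken | null {
  const raw = store.getItem(TOKEN_KEY);
  if (raw === null) {
    return null;
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    if (isStoredToken(parsed) && parsed.expiresAtMs > nowMs) {
      return parsed;
    }
  } catch (_err) {
    // Malformed JSON falls through to cleanup below.
  }
  store.removeItem(TOKEN_KEY);
  return null;
}
```

In the dashboard this would be called as `loadToken(window.sessionStorage, Date.now())`; only the token is persisted, never the `clientSecret`.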

---

## 5. Prometheus + Grafana Monitoring

### Metrics exposed at `GET /metrics`

| Metric | Type | Description |
|--------|------|-------------|
| `agentidp_tokens_issued_total` | Counter | Tokens issued, labelled by outcome |
| `agentidp_agents_registered_total` | Counter | Agent registrations |
| `agentidp_http_requests_total` | Counter | All requests, labelled by method/path/status |
| `agentidp_http_request_duration_seconds` | Histogram | Request latency |
| `agentidp_rate_limit_rejections_total` | Counter | 429 responses |
| `agentidp_db_query_duration_seconds` | Histogram | PostgreSQL query latency |
| `agentidp_redis_command_duration_seconds` | Histogram | Redis command latency |

### Grafana dashboard
Pre-built JSON dashboard shipped in `monitoring/grafana/dashboards/agentidp.json`. Auto-provisioned via `monitoring/grafana/provisioning/`.

### Docker Compose extension
Add `prometheus` and `grafana` services to a `docker-compose.monitoring.yml` overlay — keeps the base `docker-compose.yml` clean for developers who don't need monitoring.

---

## 6. Multi-Region Deployment (Terraform)

### Structure

```
terraform/
  modules/
    agentidp/ — reusable module: compute + networking
    rds/ — managed PostgreSQL
    redis/ — managed Redis
    lb/ — load balancer + TLS
  environments/
    aws/ — AWS-specific config (ECS + RDS + ElastiCache)
    gcp/ — GCP-specific config (Cloud Run + Cloud SQL + Memorystore)
```

### Design Decisions

**ADR-006: Two provider targets (AWS + GCP) in Phase 2**
AWS and GCP cover the majority of developer deployments. Azure module is Phase 3. Each environment is a thin wrapper over the shared `agentidp` module.

**ADR-007: Terraform over Pulumi/CDK**
Terraform is the most widely-used IaC tool, familiar to most DevOps teams. The HCL syntax is simpler for documentation purposes.

---

## Component Interaction Map (Phase 2)

```
           ┌────────────────────┐
           │   Web Dashboard    │
           │   (React + Vite)   │
           └────────┬───────────┘
                    │ HTTPS
   ┌────────────────▼────────────────┐
   │         AgentIdP Server         │
   │  Auth MW → OPA MW → Controllers │
   │  /metrics (prom-client)         │
   └──┬──────────┬──────────┬────────┘
      │          │          │
┌─────▼──┐  ┌────▼───┐   ┌──▼───────┐
│Postgres│  │ Redis  │   │  Vault   │
└────────┘  └────────┘   └──────────┘
                    │
           ┌────────▼────────┐
           │   Prometheus    │
           └────────┬────────┘
                    │
           ┌────────▼────────┐
           │     Grafana     │
           └─────────────────┘
```
96
openspec/changes/phase-2-production-ready/proposal.md
Normal file
@@ -0,0 +1,96 @@
# Phase 2: Production-Ready — Change Proposal

**Date**: 2026-03-28
**Author**: Virtual CTO
**Status**: Proposed — awaiting CEO approval

---

## Summary

Phase 1 delivered a complete, working AgentIdP MVP. Phase 2 makes it production-ready: hardened secrets management, multi-language SDKs, a policy engine, a web dashboard, observability, and multi-region deployment.

---

## Problem Statement

Phase 1 is functional but has the following production gaps:

| Gap | Risk |
|-----|------|
| Credentials stored as bcrypt hashes in PostgreSQL | No HSM/KMS — acceptable for MVP, not for enterprise |
| Only Node.js SDK | Developers in Python/Go/Java cannot use the SDK |
| No policy engine | Scope enforcement is static — no dynamic ABAC/RBAC |
| No web UI | Operators must use `curl` to manage agents |
| No observability | No metrics, no dashboards, no alerting |
| Single-region deployment | No HA, no geo-redundancy |

---

## Proposed Changes

### 1. HashiCorp Vault Integration
Replace bcrypt-hash credential storage in PostgreSQL with Vault-backed secret management. Vault handles secret generation, versioning, and revocation. AgentIdP stores only Vault secret paths, not the secrets themselves.

### 2. Multi-Language SDKs
Add Python, Go, and Java SDKs with identical API surface to the existing Node.js SDK: `AgentIdPClient`, `TokenManager`, service clients for all 14 endpoints, typed error hierarchy.

### 3. Advanced Policy Engine (OPA)
Integrate Open Policy Agent (OPA), embedded as a Wasm module in the server process rather than as a sidecar (see ADR-004 in the technical design), for dynamic scope and attribute-based access control. Policies are hot-reloadable Rego files; no server restart is required.

### 4. Web Dashboard UI
A React + TypeScript dashboard for operators: agent list and management, credential overview, audit log viewer, system health panel. Read-only by default; write operations require `agents:write` scope.

### 5. Prometheus + Grafana Monitoring
Instrument all services with Prometheus metrics (`/metrics` endpoint). Ship a pre-built Grafana dashboard for: token issuance rate, agent registration rate, error rates, Redis latency, PostgreSQL query latency.

### 6. Multi-Region Deployment
Terraform modules for AWS/GCP deployment with: managed PostgreSQL (RDS/Cloud SQL), managed Redis (ElastiCache/Memorystore), container orchestration (ECS/Cloud Run), load balancer, and a deployment guide.

---

## Out of Scope for Phase 2

- AGNTCY federation (Phase 3)
- W3C DID support (Phase 3)
- SOC 2 certification (Phase 3)
- Rust/C++ SDKs (Phase 3)

---

## Dependencies

| New Dependency | Purpose | CEO Approval Required |
|---------------|---------|----------------------|
| `@open-policy-agent/opa-wasm` | OPA policy evaluation | Yes |
| `node-vault` | HashiCorp Vault client | Yes |
| React 18 + Vite | Web dashboard | Yes |
| `prom-client` | Prometheus metrics | Yes |
| Terraform | Infrastructure as code | Yes |

---

## Delivery Sequence (per OpenSpec spec-first workflow)

```
1. Vault integration (highest security impact)
2. Python SDK (highest developer demand)
3. Go SDK
4. Java SDK
5. OPA policy engine
6. Web dashboard UI
7. Prometheus + Grafana monitoring
8. Multi-region deployment (Terraform)
```

---

## Success Criteria

- All new dependencies CEO-approved before implementation begins
- All new API endpoints have OpenAPI 3.0 specs before implementation
- TypeScript strict mode + zero `any` maintained throughout
- >80% test coverage on all new services
- All SDKs pass the same QA gate: 14-endpoint coverage, typed errors, zero `any`
- Web dashboard passes OWASP Top 10 security review
- Monitoring stack ships with pre-built dashboards — zero manual setup required
@@ -0,0 +1,44 @@
# Spec: Multi-Region Deployment (Terraform)

**Status**: Pending CEO approval
**Workstream**: 8 of 8

## Scope
- `terraform/` directory at project root
- Shared `agentidp` module (compute, networking, secrets)
- `environments/aws/` — ECS Fargate + RDS PostgreSQL + ElastiCache Redis
- `environments/gcp/` — Cloud Run + Cloud SQL + Memorystore Redis
- Deployment guide: `docs/devops/deployment.md`

## Module structure

```
terraform/
  modules/
    agentidp/
      main.tf — compute (ECS task or Cloud Run service)
      networking.tf — VPC, subnets, security groups
      variables.tf — all configurable inputs
      outputs.tf — service URL, DB endpoint, Redis endpoint
    rds/ — managed PostgreSQL
    redis/ — managed Redis
    lb/ — ALB (AWS) or Cloud LB (GCP), TLS cert
  environments/
    aws/
      main.tf — calls modules, sets AWS-specific vars
      variables.tf
      terraform.tfvars.example
    gcp/
      main.tf
      variables.tf
      terraform.tfvars.example
```

## Acceptance Criteria
- [ ] `terraform validate` passes for both aws and gcp environments
- [ ] `terraform plan` produces no errors against a live AWS/GCP account (test in dev env)
- [ ] JWT_PRIVATE_KEY and JWT_PUBLIC_KEY injected as environment secrets (not hardcoded)
- [ ] TLS termination at load balancer — HTTPS only in production modules
- [ ] PostgreSQL and Redis not publicly accessible — VPC-internal only
- [ ] `docs/devops/deployment.md` — end-to-end deployment walkthrough for AWS and GCP
- [ ] `terraform.tfvars.example` provided for both environments — no secrets in version control
@@ -0,0 +1,23 @@
# Spec: Go SDK (`github.com/sentryagent/idp-sdk-go`)

**Status**: Pending CEO approval
**Workstream**: 3 of 8

## Scope
- `sdk-go/` directory at project root
- Context-aware `AgentIdPClient` using standard library `net/http`
- `TokenManager` with mutex-guarded cache and 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- Idiomatic Go error type `AgentIdPError` implementing `error` interface
- `go.mod` module: `github.com/sentryagent/idp-sdk-go`
- `sdk-go/README.md`

## Acceptance Criteria
- [ ] All 14 endpoints covered
- [ ] All methods take `context.Context` as first argument
- [ ] No panics — all errors returned as `error`
- [ ] `AgentIdPError` implements `error` and exposes `.Code`, `.HTTPStatus`, `.Details`
- [ ] `TokenManager` is goroutine-safe (`sync.Mutex` on cache)
- [ ] `go vet` and `staticcheck` pass with zero warnings
- [ ] `go test ./...` with >80% coverage
- [ ] README matches Node.js SDK structure
@@ -0,0 +1,23 @@
# Spec: Java SDK (`ai.sentryagent:idp-sdk`)

**Status**: Pending CEO approval
**Workstream**: 4 of 8

## Scope
- `sdk-java/` directory at project root
- `AgentIdPClient` with sync and `CompletableFuture` async variants
- `TokenManager` with thread-safe cache and 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- `AgentIdPException` extending `RuntimeException` with `code`, `httpStatus`, `details`
- `pom.xml`: groupId=`ai.sentryagent`, artifactId=`idp-sdk`, Java 17+
- `sdk-java/README.md`

## Acceptance Criteria
- [ ] All 14 endpoints covered
- [ ] Sync methods return typed POJOs; async methods return `CompletableFuture<T>`
- [ ] `AgentIdPException` thrown (not raw IOException) on all failure paths
- [ ] `TokenManager` is thread-safe (`synchronized` on cache)
- [ ] Apache HttpClient 5 for HTTP transport
- [ ] Jackson for JSON serialization
- [ ] `mvn verify` passes with >80% coverage (JUnit 5)
- [ ] README matches Node.js SDK structure
@@ -0,0 +1,32 @@
# Spec: Prometheus + Grafana Monitoring

**Status**: Pending CEO approval
**Workstream**: 7 of 8

## Scope
- `prom-client` integration — expose `GET /metrics`
- 7 metrics (counters + histograms) across all services
- `monitoring/` directory: Prometheus config + Grafana provisioning
- `docker-compose.monitoring.yml` overlay (adds prometheus + grafana services)
- Pre-built Grafana dashboard JSON (`monitoring/grafana/dashboards/agentidp.json`)

## Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `agentidp_tokens_issued_total` | Counter | `outcome` (success/failure) |
| `agentidp_agents_registered_total` | Counter | `outcome` |
| `agentidp_http_requests_total` | Counter | `method`, `path`, `status_code` |
| `agentidp_http_request_duration_seconds` | Histogram | `method`, `path` |
| `agentidp_rate_limit_rejections_total` | Counter | — |
| `agentidp_db_query_duration_seconds` | Histogram | `operation` |
| `agentidp_redis_command_duration_seconds` | Histogram | `command` |
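
To show what a labelled counter from the table looks like on the wire, the sketch below renders one in the Prometheus text exposition format. It is a self-contained stand-in written for illustration, not `prom-client`; the class name `SketchCounter` and its methods are assumptions, and the real implementation would register metrics with `prom-client` instead.

```typescript
type Labels = Record<string, string>;

// Escape a label value per the Prometheus text format (backslash, quote, newline).
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render labels as {a="1",b="2"}; names are sorted so equal label sets share a series.
function formatLabels(labels: Labels): string {
  const names = Object.keys(labels).sort();
  if (names.length === 0) {
    return "";
  }
  const pairs = names.map((name) => `${name}="${escapeLabelValue(String(labels[name]))}"`);
  return `{${pairs.join(",")}}`;
}

class SketchCounter {
  private readonly name: string;
  private readonly help: string;
  private readonly series: { labelText: string; value: number }[] = [];

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Counters are monotonic, so negative increments are rejected.
  inc(labels: Labels = {}, amount: number = 1): void {
    if (amount < 0) {
      throw new Error("counters can only increase");
    }
    const labelText = formatLabels(labels);
    for (const entry of this.series) {
      if (entry.labelText === labelText) {
        entry.value += amount;
        return;
      }
    }
    this.series.push({ labelText, value: amount });
  }

  // Produce the HELP/TYPE header followed by one sample line per label set.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const entry of this.series) {
      lines.push(`${this.name}${entry.labelText} ${entry.value}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

Each distinct label combination becomes its own time series, which is why high-cardinality values such as raw request paths with embedded IDs are usually normalised before being used as the `path` label.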

## Acceptance Criteria
- [ ] `GET /metrics` returns Prometheus text format
- [ ] `/metrics` endpoint does NOT require Bearer auth (Prometheus scrapes it)
- [ ] All 7 metrics present and updating under load
- [ ] Grafana dashboard auto-provisions on `docker compose -f docker-compose.monitoring.yml up`
- [ ] Grafana runs on port 3001 (no conflict with AgentIdP on 3000)
- [ ] `docs/devops/operations.md` updated with monitoring section
- [ ] `prom-client` added as new dependency — CEO approval gate
@@ -0,0 +1,37 @@
# Spec: OPA Policy Engine Integration

**Status**: Pending CEO approval
**Workstream**: 5 of 8

## Scope
- New `OpaMiddleware` replacing static scope check in `auth.ts`
- `@open-policy-agent/opa-wasm` integration (embedded Wasm, no sidecar)
- `policies/authz.rego` — main allow/deny policy
- `policies/data/scopes.json` — scope to permission mapping
- SIGHUP handler to hot-reload policies without restart
- New env var: `POLICY_DIR` (default: `./policies`)

## Policy interface

```
input = {
  "method": "GET",
  "path": "/api/v1/agents",
  "scopes": ["agents:read"],
  "agentId": "uuid"
}

output = {
  "allow": true | false,
  "reason": "string" // populated when allow=false
}
```
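
The input/output contract above, together with the `403`-with-`reason` criterion below, can be sketched in TypeScript as follows. This is a plain-TypeScript stand-in for the Rego evaluation, not the OPA integration itself: the `REQUIRED_SCOPES` table (modelled on what `scopes.json` might contain), the deny-by-default behaviour for unknown routes, and the `FORBIDDEN` error code are all assumptions.

```typescript
interface PolicyInput {
  method: string;
  path: string;
  scopes: string[];
  agentId: string;
}

interface PolicyDecision {
  allow: boolean;
  reason?: string; // populated when allow is false
}

interface DenyResponse {
  status: 403;
  body: { code: string; message: string };
}

// Assumed shape of the scope mapping: "<METHOD> <path>" -> required scope.
const REQUIRED_SCOPES: Record<string, string> = {
  "GET /api/v1/agents": "agents:read",
};

// Stand-in for the policy: deny unknown routes, otherwise require the mapped scope.
function decide(input: PolicyInput): PolicyDecision {
  const key = `${input.method} ${input.path}`;
  if (!Object.prototype.hasOwnProperty.call(REQUIRED_SCOPES, key)) {
    return { allow: false, reason: `no policy rule for ${key}` };
  }
  const required = String(REQUIRED_SCOPES[key]);
  if (input.scopes.indexOf(required) === -1) {
    return { allow: false, reason: `missing scope ${required}` };
  }
  return { allow: true };
}

// Returns null when the request may continue to the controller.
function toDenyResponse(decision: PolicyDecision): DenyResponse | null {
  if (decision.allow) {
    return null;
  }
  const message = decision.reason === undefined ? "access denied" : decision.reason;
  return { status: 403, body: { code: "FORBIDDEN", message } };
}
```

A middleware would build `PolicyInput` from the verified JWT and the request, call the evaluator, and short-circuit with the deny response when one is returned.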

## Acceptance Criteria
- [ ] All existing scope checks replaced by OPA evaluation
- [ ] Policy files hot-reloadable on SIGHUP (no restart required)
- [ ] OPA Wasm loaded at startup — fail-fast if `POLICY_DIR` invalid
- [ ] `allow=false` responses return `403` with `reason` in error body
- [ ] Existing test suite passes unchanged (OPA evaluates same rules as before)
- [ ] New unit tests for OPA middleware: allow/deny cases, missing scope, invalid input
- [ ] `POLICY_DIR` env var documented in `docs/devops/environment-variables.md`
@@ -0,0 +1,24 @@
# Spec: Python SDK (`sentryagent-idp`)

**Status**: Pending CEO approval
**Workstream**: 2 of 8

## Scope
- `sdk-python/` directory at project root
- `AgentIdPClient` with sync and async variants
- `TokenManager` with 60s auto-refresh
- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
- `AgentIdPError` typed exception
- Full type hints — `mypy --strict` clean
- `sdk-python/README.md` with installation and usage

## Acceptance Criteria
- [ ] All 14 API endpoints covered
- [ ] Sync client: `requests` library
- [ ] Async client: `httpx` library
- [ ] `mypy --strict` passes with zero errors
- [ ] Zero untyped code
- [ ] `AgentIdPError` raised (not raw requests/httpx exceptions) on all failure paths
- [ ] `TokenManager` tested: caches token, refreshes at exp-60s
- [ ] `pyproject.toml` with: name=sentryagent-idp, python>=3.9, dependencies declared
- [ ] README matches Node.js SDK structure
@@ -0,0 +1,21 @@
# Spec: HashiCorp Vault Integration

**Status**: Pending CEO approval
**Workstream**: 1 of 8

## Scope
- VaultClient class wrapping `node-vault`
- `005_add_vault_path.sql` migration
- Updated CredentialService to write secrets to Vault instead of PostgreSQL
- New env vars: VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
- Migration guide: bcrypt → Vault coexistence strategy

## Acceptance Criteria
- [ ] New credentials: secret written to Vault KV v2, `vault_path` stored in PostgreSQL
- [ ] Credential rotation: Vault versioned update, `vault_path` unchanged
- [ ] Credential revocation: Vault secret deleted, DB status = `revoked`
- [ ] Existing bcrypt credentials continue to work until rotated
- [ ] VaultClient follows existing service interface pattern (DRY, SOLID)
- [ ] Zero `any` types, TypeScript strict
- [ ] `VAULT_ADDR` / `VAULT_TOKEN` validation at startup (fail-fast)
- [ ] DevOps docs updated with Vault setup section
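
The fail-fast startup check in the criteria above might be sketched as a pure function over the environment, so it can be unit-tested without touching the real process. The names `VaultConfig` and `loadVaultConfig`, and the URL-scheme check, are assumptions; only the three variable names and the `secret` default come from the spec.

```typescript
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

// Validate Vault settings once at startup and throw a single descriptive error
// if anything required is missing, instead of failing on the first request.
function loadVaultConfig(env: Record<string, string | undefined>): VaultConfig {
  const addr = env["VAULT_ADDR"];
  const token = env["VAULT_TOKEN"];
  if (addr === undefined || addr === "" || token === undefined || token === "") {
    const missing: string[] = [];
    if (addr === undefined || addr === "") {
      missing.push("VAULT_ADDR");
    }
    if (token === undefined || token === "") {
      missing.push("VAULT_TOKEN");
    }
    throw new Error(`missing required environment variables: ${missing.join(", ")}`);
  }
  if (!/^https?:\/\//.test(addr)) {
    throw new Error("VAULT_ADDR must be an http(s) URL");
  }
  const mount = env["VAULT_MOUNT"];
  return { addr, token, mount: mount === undefined || mount === "" ? "secret" : mount };
}
```

At server start this would be called as `loadVaultConfig(process.env)`, letting the process exit before it accepts traffic if Vault is not configured.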
@@ -0,0 +1,34 @@
# Spec: Web Dashboard UI

**Status**: Pending CEO approval
**Workstream**: 6 of 8

## Scope
- `dashboard/` directory at project root
- React 18 + TypeScript strict, built with Vite 5
- TanStack Query v5 for server state
- shadcn/ui (Radix UI + Tailwind CSS) for components
- Four pages: Agents, Credentials, Audit Log, Health
- Client-side auth: `clientId` + `clientSecret` → `TokenManager`
- Served from AgentIdP server at `GET /dashboard` (static build)

## Pages

| Page | Route | Scope Required |
|------|-------|---------------|
| Login | `/dashboard/login` | None |
| Agents | `/dashboard/agents` | `agents:read` |
| Agent Detail | `/dashboard/agents/:id` | `agents:read` |
| Credentials | `/dashboard/agents/:id/credentials` | `agents:read` |
| Audit Log | `/dashboard/audit` | `audit:read` |
| Health | `/dashboard/health` | None |
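
The route table above lends itself to a small lookup that a route guard could consult before rendering a page. The sketch below is one possible shape, assuming the guard receives the matched route pattern and the scopes granted to the current token; `PAGE_SCOPES` and `canAccess` are hypothetical names. This is a UI convenience only, since the server still enforces scopes on every API call.

```typescript
// Route pattern -> required scope, copied from the Pages table (null = no scope needed).
const PAGE_SCOPES: Record<string, string | null> = {
  "/dashboard/login": null,
  "/dashboard/agents": "agents:read",
  "/dashboard/agents/:id": "agents:read",
  "/dashboard/agents/:id/credentials": "agents:read",
  "/dashboard/audit": "audit:read",
  "/dashboard/health": null,
};

// Returns true when the page may be shown; unknown routes are treated as inaccessible.
function canAccess(routePattern: string, grantedScopes: readonly string[]): boolean {
  if (!Object.prototype.hasOwnProperty.call(PAGE_SCOPES, routePattern)) {
    return false;
  }
  const required = PAGE_SCOPES[routePattern];
  if (required === undefined) {
    return false;
  }
  return required === null || grantedScopes.indexOf(required) !== -1;
}
```

Keeping the mapping in one table means the navigation menu and the route guard can share it, so a page is never linked when it cannot be opened.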

## Acceptance Criteria
- [ ] TypeScript strict — zero `any` across all dashboard files
- [ ] `dashboard/tsconfig.json` with `strict: true`
- [ ] Login form stores token in `sessionStorage` only (not `localStorage`)
- [ ] All write operations (suspend, revoke, rotate) require confirmation dialog
- [ ] OWASP Top 10 review: no XSS, no CSRF, no sensitive data in URL params
- [ ] Vite build outputs to `dashboard/dist/`; AgentIdP serves it as static
- [ ] `dashboard/README.md` — how to build and serve
- [ ] Responsive layout — functional on desktop and tablet
127
openspec/changes/phase-2-production-ready/tasks.md
Normal file
@@ -0,0 +1,127 @@
# Phase 2: Production-Ready — Tasks

**Status**: Awaiting CEO dependency approvals before any implementation begins.

## CEO Approval Gates (required before implementation)

- [ ] A0.1 Approve dependency: `node-vault` (Vault integration)
- [ ] A0.2 Approve dependency: `@open-policy-agent/opa-wasm` (OPA policy engine)
- [ ] A0.3 Approve dependency: React 18 + Vite 5 (web dashboard)
- [ ] A0.4 Approve dependency: `prom-client` (Prometheus metrics)
- [ ] A0.5 Approve dependency: Terraform (infrastructure as code)

---

## Workstream 1: HashiCorp Vault Integration

- [ ] 1.1 Write `src/vault/VaultClient.ts` — wraps `node-vault`; methods: writeSecret, readSecret, deleteSecret, rotateSecret
- [ ] 1.2 Write `src/db/migrations/005_add_vault_path.sql` — add `vault_path` column to `credentials`
- [ ] 1.3 Update `CredentialService.ts` — new credentials use Vault; existing bcrypt credentials continue to work
- [ ] 1.4 Update `docs/devops/environment-variables.md` — add VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
- [ ] 1.5 Write `docs/devops/vault-setup.md` — Vault dev server setup, production Vault config, migration guide
- [ ] 1.6 Write unit tests for VaultClient (mocked Vault) and updated CredentialService
- [ ] 1.7 QA sign-off: zero `any`, TypeScript strict, >80% coverage, coexistence verified

## Workstream 2: Python SDK

- [ ] 2.1 Create `sdk-python/` with `pyproject.toml` — name: sentryagent-idp, python>=3.9
- [ ] 2.2 Write `sdk-python/src/sentryagent_idp/types.py` — all request/response dataclasses
- [ ] 2.3 Write `sdk-python/src/sentryagent_idp/errors.py` — AgentIdPError exception
- [ ] 2.4 Write `sdk-python/src/sentryagent_idp/token_manager.py` — sync TokenManager
- [ ] 2.5 Write `sdk-python/src/sentryagent_idp/async_token_manager.py` — async TokenManager (httpx)
- [ ] 2.6 Write `sdk-python/src/sentryagent_idp/services/agents.py` — AgentRegistryClient (sync + async)
- [ ] 2.7 Write `sdk-python/src/sentryagent_idp/services/credentials.py` — CredentialClient (sync + async)
- [ ] 2.8 Write `sdk-python/src/sentryagent_idp/services/token.py` — TokenClient (sync + async)
- [ ] 2.9 Write `sdk-python/src/sentryagent_idp/services/audit.py` — AuditClient (sync + async)
- [ ] 2.10 Write `sdk-python/src/sentryagent_idp/client.py` — AgentIdPClient (sync) + AsyncAgentIdPClient
- [ ] 2.11 Write `sdk-python/src/sentryagent_idp/__init__.py` — barrel exports
- [ ] 2.12 Write `sdk-python/README.md`
- [ ] 2.13 QA: `mypy --strict` clean, all 14 endpoints, AgentIdPError on all failure paths, pytest >80%

## Workstream 3: Go SDK

- [ ] 3.1 Create `sdk-go/` with `go.mod` — module: github.com/sentryagent/idp-sdk-go, go 1.21
- [ ] 3.2 Write `sdk-go/types.go` — all request/response structs
- [ ] 3.3 Write `sdk-go/errors.go` — AgentIdPError type implementing error interface
- [ ] 3.4 Write `sdk-go/token_manager.go` — mutex-guarded TokenManager
- [ ] 3.5 Write `sdk-go/services/agents.go` — AgentRegistryClient
- [ ] 3.6 Write `sdk-go/services/credentials.go` — CredentialClient
- [ ] 3.7 Write `sdk-go/services/token.go` — TokenClient
- [ ] 3.8 Write `sdk-go/services/audit.go` — AuditClient
- [ ] 3.9 Write `sdk-go/client.go` — AgentIdPClient
- [ ] 3.10 Write `sdk-go/README.md`
- [ ] 3.11 QA: `go vet` clean, `staticcheck` clean, all 14 endpoints, goroutine-safe, `go test ./...` >80%
|
|
||||||
|
## Workstream 4: Java SDK
|
||||||
|
|
||||||
|
- [ ] 4.1 Create `sdk-java/` with `pom.xml` — groupId: ai.sentryagent, artifactId: idp-sdk, Java 17
|
||||||
|
- [ ] 4.2 Write all POJO request/response model classes
|
||||||
|
- [ ] 4.3 Write `AgentIdPException.java` extending RuntimeException
|
||||||
|
- [ ] 4.4 Write `TokenManager.java` — synchronized cache with 60s refresh buffer
|
||||||
|
- [ ] 4.5 Write `AgentRegistryClient.java` — sync + CompletableFuture methods
|
||||||
|
- [ ] 4.6 Write `CredentialClient.java` — sync + CompletableFuture methods
|
||||||
|
- [ ] 4.7 Write `TokenClient.java` — sync + CompletableFuture methods
|
||||||
|
- [ ] 4.8 Write `AuditClient.java` — sync + CompletableFuture methods
- [ ] 4.9 Write `AgentIdPClient.java` — composes all service clients
- [ ] 4.10 Write `sdk-java/README.md`
- [ ] 4.11 QA: `mvn verify` passes, all 14 endpoints covered, AgentIdPException thrown on all failure paths, JUnit 5 coverage >80%
## Workstream 5: OPA Policy Engine
- [ ] 5.1 Write `policies/authz.rego` — allow/deny rules matching all current scope checks
- [ ] 5.2 Write `policies/data/scopes.json` — scope to endpoint permission mapping
- [ ] 5.3 Write `src/middleware/opa.ts` — OpaMiddleware: loads Wasm, evaluates input, returns allow/deny
- [ ] 5.4 Replace static scope check in `src/middleware/auth.ts` with OpaMiddleware
- [ ] 5.5 Add SIGHUP handler in `src/server.ts` to hot-reload policy files
- [ ] 5.6 Update `docs/devops/environment-variables.md` — add POLICY_DIR
- [ ] 5.7 QA: all existing auth tests pass unchanged, new OPA unit tests, hot-reload verified
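
Tasks 5.3 and 5.5 together imply a reload path that swaps policies while the server keeps running. A minimal TypeScript sketch of that idea follows. It assumes a hypothetical `PolicyStore` class and `installHotReload` function (neither name appears in the spec) and uses a stand-in `Policy` type in place of the compiled Wasm module. Keeping the last good policy when a reload throws is one reasonable choice here, not a stated requirement.

```typescript
// Stand-in for a compiled policy. ASSUMPTION: the real OpaMiddleware
// would hold a loaded Wasm module rather than a plain function.
type Policy = {
  version: number;
  allow: (scope: string, endpoint: string) => boolean;
};

// Minimal view of anything that can emit signals (Node's `process` in practice).
interface SignalSource {
  on(signal: string, listener: () => void): unknown;
}

// Hypothetical holder for the active policy.
class PolicyStore {
  private current: Policy;

  constructor(private readonly load: () => Policy) {
    // Fail fast at startup: an invalid initial policy should stop the server.
    this.current = load();
  }

  get policy(): Policy {
    return this.current;
  }

  // Load a fresh policy and swap it in. On failure the previous
  // policy stays active and false is returned.
  reload(): boolean {
    try {
      this.current = this.load();
      return true;
    } catch {
      return false;
    }
  }
}

// Hypothetical wiring for task 5.5: reload policies whenever SIGHUP arrives.
function installHotReload(store: PolicyStore, source: SignalSource): void {
  source.on("SIGHUP", () => {
    store.reload();
  });
}
```

In `src/server.ts` the signal source would presumably be Node's `process` object, and the loader would read from the directory named by `POLICY_DIR`.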
## Workstream 6: Web Dashboard UI
- [ ] 6.1 Create `dashboard/` with Vite 5 + React 18 + TypeScript strict configuration
- [ ] 6.2 Set up shadcn/ui with Tailwind CSS
- [ ] 6.3 Write `dashboard/src/lib/auth.ts` — credential entry, TokenManager, sessionStorage
- [ ] 6.4 Write `dashboard/src/lib/client.ts` — wraps @sentryagent/idp-sdk AgentIdPClient
- [ ] 6.5 Write Login page (`/dashboard/login`)
- [ ] 6.6 Write Agents page (`/dashboard/agents`) — list, search, filter by status
- [ ] 6.7 Write Agent Detail page (`/dashboard/agents/:id`) — suspend/reactivate with confirm dialog
- [ ] 6.8 Write Credentials page (`/dashboard/agents/:id/credentials`) — rotate/revoke with confirm
- [ ] 6.9 Write Audit Log page (`/dashboard/audit`) — filters, pagination
- [ ] 6.10 Write Health page (`/dashboard/health`) — PostgreSQL + Redis connectivity status
- [ ] 6.11 Configure AgentIdP Express app to serve `dashboard/dist/` at `/dashboard`
- [ ] 6.12 Write `dashboard/README.md`
- [ ] 6.13 QA: TypeScript strict, zero `any`, OWASP Top 10 review, responsive layout verified
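
Task 6.3 calls for a TokenManager in the dashboard. The sketch below shows only its caching decision, under the assumption that the dashboard reuses the 60s refresh buffer named in task 4.4. `TokenCache`, `store`, and `get` are hypothetical names, and the injectable clock exists purely to make the behaviour easy to test.

```typescript
// ASSUMPTION: the same 60-second refresh buffer as the Java TokenManager (task 4.4).
const REFRESH_BUFFER_MS = 60 * 1000;

interface CachedToken {
  accessToken: string;
  expiresAtMs: number;
}

// Hypothetical cache: returns a token only while it is safely inside its lifetime.
class TokenCache {
  private token: CachedToken | null = null;

  // The clock is injectable so expiry logic can be exercised without waiting.
  constructor(private readonly now: () => number = () => Date.now()) {}

  // Record a freshly issued token together with its absolute expiry time.
  store(accessToken: string, expiresInSeconds: number): void {
    this.token = {
      accessToken,
      expiresAtMs: this.now() + expiresInSeconds * 1000,
    };
  }

  // Return the cached token, or null when the caller should fetch a new one.
  // A token counts as stale once it is within the refresh buffer of expiry.
  get(): string | null {
    if (this.token === null) {
      return null;
    }
    const stillFresh = this.now() < this.token.expiresAtMs - REFRESH_BUFFER_MS;
    return stillFresh ? this.token.accessToken : null;
  }
}
```

Refreshing slightly before expiry avoids sending a token that expires while the request is in flight. Persisting the token in `sessionStorage`, as task 6.3 mentions, is left out of this sketch.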
## Workstream 7: Prometheus + Grafana Monitoring
- [ ] 7.1 Add `prom-client` to dependencies (after CEO approval A0.4)
- [ ] 7.2 Write `src/metrics/registry.ts` — shared Prometheus Registry with all 7 metric definitions
- [ ] 7.3 Instrument `OAuth2Service.ts` — increment `agentidp_tokens_issued_total`
- [ ] 7.4 Instrument `AgentService.ts` — increment `agentidp_agents_registered_total`
- [ ] 7.5 Instrument `src/middleware/` — HTTP request counter and duration histogram
- [ ] 7.6 Instrument `src/db/pool.ts` — DB query duration histogram
- [ ] 7.7 Instrument `src/cache/redis.ts` — Redis command duration histogram
- [ ] 7.8 Add `GET /metrics` route (unauthenticated, Prometheus text format)
- [ ] 7.9 Write `monitoring/prometheus/prometheus.yml` — scrape config
- [ ] 7.10 Write `monitoring/grafana/provisioning/` — datasource + dashboard provisioning
- [ ] 7.11 Write `monitoring/grafana/dashboards/agentidp.json` — pre-built Grafana dashboard
- [ ] 7.12 Write `docker-compose.monitoring.yml` overlay
- [ ] 7.13 Update `docs/devops/operations.md` — monitoring section
- [ ] 7.14 QA: all 7 metrics verified under load, Grafana auto-provisions, no auth leak on /metrics
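
For orientation, this is roughly what one counter looks like in the Prometheus text format that task 7.8 refers to. The sketch hand-renders a single unlabelled counter and is illustrative only: in the actual workstream `prom-client` would produce this output. `renderCounter`, `CounterSample`, and the help string are assumptions, while the metric name comes from task 7.3.

```typescript
// Hypothetical shape for one unlabelled counter sample.
interface CounterSample {
  name: string;
  help: string;
  value: number;
}

// Render a counter as Prometheus text exposition:
// a HELP line, a TYPE line, then the sample itself.
function renderCounter(sample: CounterSample): string {
  const lines = [
    `# HELP ${sample.name} ${sample.help}`,
    `# TYPE ${sample.name} counter`,
    `${sample.name} ${sample.value}`,
  ];
  // The exposition body ends with a trailing newline.
  return lines.join("\n") + "\n";
}
```

Calling `renderCounter({ name: "agentidp_tokens_issued_total", help: "Total tokens issued.", value: 3 })` yields three lines: the `# HELP` line, the `# TYPE ... counter` line, and `agentidp_tokens_issued_total 3`.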
## Workstream 8: Multi-Region Deployment (Terraform)
- [ ] 8.1 Write `terraform/modules/agentidp/main.tf` + `variables.tf` + `outputs.tf`
- [ ] 8.2 Write `terraform/modules/rds/` — managed PostgreSQL module
- [ ] 8.3 Write `terraform/modules/redis/` — managed Redis module
- [ ] 8.4 Write `terraform/modules/lb/` — load balancer + TLS module
- [ ] 8.5 Write `terraform/environments/aws/main.tf` + `variables.tf` + `terraform.tfvars.example`
- [ ] 8.6 Write `terraform/environments/gcp/main.tf` + `variables.tf` + `terraform.tfvars.example`
- [ ] 8.7 Write `docs/devops/deployment.md` — end-to-end AWS and GCP deployment walkthrough
- [ ] 8.8 QA: `terraform validate` passes, secrets not hardcoded, TLS enforced, DB/Redis VPC-internal
---
## Phase 2 Complete Criteria
All 8 workstreams done. All tasks checked. All QA gates passed. CEO reviewed.