# Phase 2: Production-Ready (Change Proposal)

**Date**: 2026-03-28
**Author**: Virtual CTO
**Status**: Proposed, awaiting CEO approval

---

## Summary

Phase 1 delivered a complete, working AgentIdP MVP. Phase 2 makes it production-ready: hardened secrets management, multi-language SDKs, a policy engine, a web dashboard, observability, and multi-region deployment.

---

## Problem Statement

Phase 1 is functional but has the following production gaps:

| Gap | Risk |
|-----|------|
| Credentials stored as bcrypt hashes in PostgreSQL | No HSM/KMS backing; acceptable for an MVP, not for enterprise use |
| Only Node.js SDK | Developers working in Python, Go, or Java cannot use the SDK |
| No policy engine | Scope enforcement is static; no dynamic ABAC/RBAC |
| No web UI | Operators must use `curl` to manage agents |
| No observability | No metrics, no dashboards, no alerting |
| Single-region deployment | No HA, no geo-redundancy |

---

## Proposed Changes

### 1. HashiCorp Vault Integration

Replace bcrypt-hash credential storage in PostgreSQL with Vault-backed secret management. Vault handles secret generation, versioning, and revocation. AgentIdP stores only Vault secret paths, not the secrets themselves.

### 2. Multi-Language SDKs

Add Python, Go, and Java SDKs with the same API surface as the existing Node.js SDK: `AgentIdPClient`, `TokenManager`, service clients for all 14 endpoints, and a typed error hierarchy.

### 3. Advanced Policy Engine (OPA)

Integrate Open Policy Agent (OPA) as a sidecar for dynamic scope and attribute-based access control. Policies are hot-reloadable Rego files, so no server restart is required.

### 4. Web Dashboard UI

A React + TypeScript dashboard for operators: agent list and management, credential overview, audit log viewer, and system health panel. Read-only by default; write operations require the `agents:write` scope.

### 5. Prometheus + Grafana Monitoring

Instrument all services with Prometheus metrics (`/metrics` endpoint). Ship a pre-built Grafana dashboard covering token issuance rate, agent registration rate, error rates, Redis latency, and PostgreSQL query latency.

### 6. Multi-Region Deployment

Provide Terraform modules for AWS/GCP deployment covering managed PostgreSQL (RDS/Cloud SQL), managed Redis (ElastiCache/Memorystore), container orchestration (ECS/Cloud Run), and a load balancer, plus a deployment guide.

---

## Out of Scope for Phase 2

- AGNTCY federation (Phase 3)
- W3C DID support (Phase 3)
- SOC 2 certification (Phase 3)
- Rust/C++ SDKs (Phase 3)

---

## Dependencies

| New Dependency | Purpose | CEO Approval Required |
|----------------|---------|-----------------------|
| `@open-policy-agent/opa-wasm` | OPA policy evaluation | Yes |
| `node-vault` | HashiCorp Vault client | Yes |
| React 18 + Vite | Web dashboard | Yes |
| `prom-client` | Prometheus metrics | Yes |
| Terraform | Infrastructure as code | Yes |

---

## Delivery Sequence (per OpenSpec spec-first workflow)

```
1. Vault integration (highest security impact)
2. Python SDK (highest developer demand)
3. Go SDK
4. Java SDK
5. OPA policy engine
6. Web dashboard UI
7. Prometheus + Grafana monitoring
8. Multi-region deployment (Terraform)
```

---

## Success Criteria

- All new dependencies CEO-approved before implementation begins
- All new API endpoints have OpenAPI 3.0 specs before implementation
- TypeScript strict mode + zero `any` maintained throughout
- >80% test coverage on all new services
- All SDKs pass the same QA gate: 14-endpoint coverage, typed errors, zero `any`
- Web dashboard passes OWASP Top 10 security review
- Monitoring stack ships with pre-built dashboards that require zero manual setup
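
---

## Appendix: Typed Error Hierarchy Sketch

The multi-language SDK change asks every SDK to expose the same typed error hierarchy. As a rough illustration of what that could look like on the TypeScript side, the sketch below maps HTTP status codes to error subclasses. The class names, the `errorFromStatus` helper, and the status-to-class mapping are all assumptions made for this example; the proposal does not define them, and the real SDK may differ.

```typescript
// Hypothetical sketch: these class names and the status mapping are
// illustrative assumptions, not the actual AgentIdP SDK API.

/** Base class so callers can catch every SDK failure with one check. */
class AgentIdPError extends Error {
  readonly status: number;

  constructor(message: string, status: number) {
    super(message);
    // Restore the prototype chain so `instanceof` also works on ES5 targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.status = status;
  }
}

class AuthenticationError extends AgentIdPError {}
class PermissionDeniedError extends AgentIdPError {}
class NotFoundError extends AgentIdPError {}
class RateLimitError extends AgentIdPError {}

/** Translate an HTTP status code into the most specific error type. */
function errorFromStatus(status: number, message: string): AgentIdPError {
  switch (status) {
    case 401:
      return new AuthenticationError(message, status);
    case 403:
      return new PermissionDeniedError(message, status);
    case 404:
      return new NotFoundError(message, status);
    case 429:
      return new RateLimitError(message, status);
    default:
      return new AgentIdPError(message, status);
  }
}
```

With a shared base class, callers can handle one specific failure (for example `err instanceof RateLimitError` to back off and retry) and still fall through to a generic `AgentIdPError` handler. The Python, Go, and Java SDKs could mirror the same shape with their own idioms.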
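
---

## Appendix: Dashboard Write-Scope Check Sketch

The web dashboard is described as read-only by default, with write operations requiring the `agents:write` scope. One possible way to express that rule is a small guard that decides, per request, whether the write scope is needed. Only the `agents:write` scope name comes from this proposal; the `HttpMethod` type, the function names, and the assumption that "write operations" means mutating HTTP methods are hypothetical.

```typescript
// Hypothetical sketch: only the "agents:write" scope name comes from the
// proposal; the method-to-scope mapping below is an assumption.

type HttpMethod = "GET" | "HEAD" | "POST" | "PUT" | "PATCH" | "DELETE";

const WRITE_SCOPE = "agents:write";

/** Methods treated as read-only (assumed never to mutate agent state). */
const READ_ONLY_METHODS: readonly HttpMethod[] = ["GET", "HEAD"];

/** Returns the scope a request needs, or null when no write scope is needed. */
function requiredScope(method: HttpMethod): string | null {
  return READ_ONLY_METHODS.includes(method) ? null : WRITE_SCOPE;
}

/** True when the scopes carried by the token permit the request. */
function isAllowed(method: HttpMethod, tokenScopes: readonly string[]): boolean {
  const scope = requiredScope(method);
  return scope === null || tokenScopes.includes(scope);
}
```

A check like this would need to run on the server as well as in the UI, since hiding a button in the browser does not stop a direct API call. This sketch also does not model any read scope; whether reads need their own scope is left open here.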