chore(openspec): archive engineering-docs and phase-2-production-ready changes

- engineering-docs → archive/2026-03-29-engineering-docs (63/63 tasks complete)
- phase-2-production-ready → archive/2026-03-29-phase-2-production-ready (89/89 tasks complete)
- openspec/specs/ synced with all Phase 1 + Phase 2 + engineering-docs capabilities (22 specs total)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 12:41:53 +00:00
parent eced5f8699
commit d42c653eea
44 changed files with 999 additions and 0 deletions
--- a/openspec/specs/architecture-guide/spec.md
+++ b/openspec/specs/architecture-guide/spec.md
@@ -0,0 +1,35 @@
+## ADDED Requirements
+
+### Requirement: System architecture document
+The system SHALL include a document (`docs/engineering/02-architecture.md`) that describes the full system architecture: components, their responsibilities, how they communicate, and the deployment topology.
+
+#### Scenario: Component diagram present
+- **WHEN** a new engineer reads 02-architecture.md
+- **THEN** they SHALL find an ASCII or Mermaid component diagram showing all major components (API server, PostgreSQL, Redis, Vault, OPA, Web Dashboard, Prometheus, Grafana) and their connections
+
+#### Scenario: Request lifecycle explained
+- **WHEN** a new engineer reads 02-architecture.md
+- **THEN** they SHALL understand how an incoming HTTP request flows from client → Express router → middleware chain → controller → service → repository → database and back
+
+#### Scenario: Data flow for authentication described
+- **WHEN** a new engineer reads 02-architecture.md
+- **THEN** they SHALL understand the OAuth 2.0 Client Credentials flow: client presents credentials → token service validates → Redis checked for existing token → JWT signed and returned
+
+#### Scenario: Deployment topology covered
+- **WHEN** a new engineer reads 02-architecture.md
+- **THEN** they SHALL understand the multi-region deployment model (US, EU, APAC) and how Terraform provisions it
+
+### Requirement: Technology stack and ADR document
+The system SHALL include a document (`docs/engineering/03-tech-stack.md`) that lists every technology in the stack and explains why it was chosen over alternatives.
+
+#### Scenario: Every major technology documented with rationale
+- **WHEN** a new engineer reads 03-tech-stack.md
+- **THEN** they SHALL find an entry for each technology (Node.js 18, TypeScript 5.3, Express 4.18, PostgreSQL 14, Redis 7, HashiCorp Vault, OPA, React 18, Vite 5, Prometheus, Grafana, Terraform) with: what it does in the system, why it was chosen, and what was considered but rejected
+
+#### Scenario: TypeScript strict mode rationale explained
+- **WHEN** a new engineer reads 03-tech-stack.md
+- **THEN** they SHALL understand why strict mode is mandatory (safety, correctness, no implicit any) and what the consequences of violating it are
+
+#### Scenario: PostgreSQL vs Redis responsibility boundary clear
+- **WHEN** a new engineer reads 03-tech-stack.md
+- **THEN** they SHALL understand what is stored in PostgreSQL (persistent state: agents, credentials, audit logs) vs Redis (ephemeral state: active tokens, rate limit counters)
--- a/openspec/specs/code-walkthroughs/spec.md
+++ b/openspec/specs/code-walkthroughs/spec.md
@@ -0,0 +1,27 @@
+## ADDED Requirements
+
+### Requirement: Annotated code walkthrough documents
+The system SHALL include a document (`docs/engineering/06-walkthroughs.md`) containing three annotated end-to-end walkthroughs of the system's critical flows, with file:line references to actual source code.
+
+#### Scenario: Token issuance walkthrough complete
+- **WHEN** a new engineer reads the token issuance walkthrough
+- **THEN** they SHALL be guided step by step from: HTTP POST /oauth2/token → Express router → auth middleware → OAuth2Controller → OAuth2Service → CredentialRepository → Vault/bcrypt credential check → Redis token cache check → JWT signing (src/utils/jwt.ts) → AuditService.logEvent → HTTP 200 response
+- **AND** every step SHALL reference the actual file and line number where it occurs
+
+#### Scenario: Agent registration walkthrough complete
+- **WHEN** a new engineer reads the agent registration walkthrough
+- **THEN** they SHALL be guided step by step from: HTTP POST /agents → auth middleware → validation middleware → AgentController → AgentService.createAgent → input validation (src/utils/validators.ts) → AgentRepository.create → PostgreSQL INSERT → AuditService.logEvent → HTTP 201 response with agent object
+- **AND** every step SHALL reference the actual file and line number
+
+#### Scenario: Credential rotation walkthrough complete
+- **WHEN** a new engineer reads the credential rotation walkthrough
+- **THEN** they SHALL be guided step by step from: HTTP POST /agents/:id/credentials/:credId/rotate → auth middleware → CredentialController → CredentialService.rotateCredential → old credential revocation → new secret generation (src/utils/crypto.ts) → Vault write or bcrypt hash → CredentialRepository.update → token revocation for old credentials → AuditService.logEvent → HTTP 200 response
+- **AND** every step SHALL reference the actual file and line number
+
+#### Scenario: Walkthroughs include version reference
+- **WHEN** a new engineer reads any walkthrough
+- **THEN** the document SHALL include a header stating the commit hash it was last verified against, so engineers know if the walkthrough may have drifted from the current code
+
+#### Scenario: Each walkthrough annotates why, not just what
+- **WHEN** a new engineer reads a walkthrough step
+- **THEN** each step SHALL explain not just what the code does but WHY — e.g., why Redis is checked before signing a new JWT, why constant-time comparison is used for credential verification, why audit logging happens after persistence not before
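Note on the code-walkthroughs spec above: the last scenario asks each walkthrough to explain why constant-time comparison is used for credential verification. A minimal sketch of that idea follows, assuming Node's built-in `crypto` module. The helper name `secretsMatch` is hypothetical and is not taken from the repository. An ordinary `===` on strings can return as soon as the first differing character is found, which may leak how much of a guess was correct; `timingSafeEqual` avoids that early exit.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper for illustration only; not the repository's implementation.
function secretsMatch(presented: string, expected: string): boolean {
  // Hash both values first so the two buffers always have the same length:
  // timingSafeEqual throws when the byte lengths differ.
  const a = createHash("sha256").update(presented, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  // Compares every byte regardless of where the first mismatch occurs.
  return timingSafeEqual(a, b);
}
```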
--- a/openspec/specs/codebase-structure/spec.md
+++ b/openspec/specs/codebase-structure/spec.md
@@ -0,0 +1,24 @@
+## ADDED Requirements
+
+### Requirement: Codebase structure document
+The system SHALL include a document (`docs/engineering/04-codebase-structure.md`) that provides an annotated map of every top-level directory and key file in the repository, explaining what lives where and why.
+
+#### Scenario: Full directory tree annotated
+- **WHEN** a new engineer reads 04-codebase-structure.md
+- **THEN** they SHALL find an annotated directory tree covering: `src/`, `tests/`, `docs/`, `sdk/`, `sdk-python/`, `sdk-go/`, `sdk-java/`, `terraform/`, `dashboard/`, `migrations/`, `openspec/`, `scripts/`
+
+#### Scenario: src/ subdirectory roles explained
+- **WHEN** a new engineer reads 04-codebase-structure.md
+- **THEN** they SHALL understand the role of each `src/` subdirectory: `controllers/` (HTTP layer), `services/` (business logic), `repositories/` (data access), `middleware/` (cross-cutting concerns), `utils/` (shared utilities), `types/` (TypeScript interfaces), `routes/` (Express router definitions)
+
+#### Scenario: Where to add new code explained
+- **WHEN** a new engineer needs to add a new feature
+- **THEN** the document SHALL tell them exactly where each type of code belongs: new endpoint → controller + route; new business logic → service; new DB query → repository; new shared utility → utils/
+
+#### Scenario: Key files identified and explained
+- **WHEN** a new engineer reads 04-codebase-structure.md
+- **THEN** they SHALL find explanations of: `src/app.ts` (Express app setup), `src/server.ts` (entry point), `src/types/index.ts` (canonical type definitions), `src/utils/errors.ts` (error hierarchy), `docker-compose.yml` (local dev stack), `tsconfig.json` (TypeScript config)
+
+#### Scenario: DRY principle mapped to structure
+- **WHEN** a new engineer reads 04-codebase-structure.md
+- **THEN** they SHALL understand how the directory structure enforces DRY: one location for types, one for crypto utilities, one for JWT utilities, one for validators — and why duplication across these is a blocking PR issue
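Note on the codebase-structure spec above: the "where to add new code" rule (business logic in a service, queries in a repository) can be sketched as below. Only the names `AgentService.createAgent`, `AgentRepository.create` and `AgentAlreadyExistsError` appear in these specs. The method bodies, the `findByName` method and the in-memory repository are assumptions for illustration, and everything is kept synchronous for brevity, whereas real data access would be asynchronous.

```typescript
interface Agent {
  id: string;
  name: string;
}

class AgentAlreadyExistsError extends Error {
  constructor(name: string) {
    super(`Agent already exists: ${name}`);
    this.name = "AgentAlreadyExistsError";
  }
}

// Data-access layer: the only place that would talk to the database.
interface AgentRepository {
  findByName(name: string): Agent | undefined; // hypothetical method
  create(name: string): Agent;
}

// In-memory stand-in so the sketch runs without PostgreSQL.
class InMemoryAgentRepository implements AgentRepository {
  private readonly agents = new Map<string, Agent>();

  findByName(name: string): Agent | undefined {
    return this.agents.get(name);
  }

  create(name: string): Agent {
    const agent: Agent = { id: String(this.agents.size + 1), name };
    this.agents.set(name, agent);
    return agent;
  }
}

// Business-logic layer: enforces rules, delegates storage to the repository.
class AgentService {
  constructor(private readonly repo: AgentRepository) {}

  createAgent(name: string): Agent {
    if (this.repo.findByName(name) !== undefined) {
      throw new AgentAlreadyExistsError(name);
    }
    return this.repo.create(name);
  }
}
```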
--- a/openspec/specs/deployment-operations/spec.md
+++ b/openspec/specs/deployment-operations/spec.md
@@ -0,0 +1,28 @@
+## ADDED Requirements
+
+### Requirement: Deployment and operations guide
+The system SHALL include a document (`docs/engineering/10-deployment.md`) that explains how the application is built, deployed, and operated — covering Docker, Terraform, environment configuration, and monitoring.
+
+#### Scenario: Docker build and run documented
+- **WHEN** a new engineer reads 10-deployment.md
+- **THEN** they SHALL understand the multi-stage Dockerfile (builder stage compiles TypeScript, production stage runs compiled JS with node:18-alpine and non-root USER node), how to build the image, and how to run it with the required environment variables
+
+#### Scenario: Environment variables fully documented
+- **WHEN** a new engineer needs to configure the application
+- **THEN** the guide SHALL provide a complete table of all environment variables: name, purpose, required/optional, example value — covering database, Redis, JWT signing key, Vault, OPA, and rate limiting config
+
+#### Scenario: Database migrations documented
+- **WHEN** a new engineer needs to run or write migrations
+- **THEN** the guide SHALL explain: where migration files live (`migrations/`), the naming convention, how to run them (`npm run migrate`), and how to write a new migration following the existing pattern
+
+#### Scenario: Terraform multi-region deployment explained
+- **WHEN** a new engineer reads 10-deployment.md
+- **THEN** they SHALL understand the Terraform structure: what modules exist, what the three regions (US, EU, APAC) deploy, how to run `terraform plan` and `terraform apply`, and what AWS/GCP resources are provisioned
+
+#### Scenario: Prometheus metrics and Grafana explained
+- **WHEN** a new engineer reads 10-deployment.md
+- **THEN** they SHALL find: which endpoint exposes metrics (`/metrics`), the key metrics tracked, how to access the Grafana dashboard locally (port, login), and how to add a new metric counter or histogram to the API server
+
+#### Scenario: Operational runbook for common tasks
+- **WHEN** a new engineer is on-call or supporting operations
+- **THEN** the guide SHALL include a runbook covering: how to check application health, how to rotate the JWT signing key, how to revoke all tokens for a compromised agent, and how to read audit logs for an incident
--- a/openspec/specs/deployment/spec.md
+++ b/openspec/specs/deployment/spec.md
@@ -0,0 +1,44 @@
+# Spec: Multi-Region Deployment (Terraform)
+
+**Status**: Pending CEO approval
+**Workstream**: 8 of 8
+
+## Scope
+- `terraform/` directory at project root
+- Shared `agentidp` module (compute, networking, secrets)
+- `environments/aws/` — ECS Fargate + RDS PostgreSQL + ElastiCache Redis
+- `environments/gcp/` — Cloud Run + Cloud SQL + Memorystore Redis
+- Deployment guide: `docs/devops/deployment.md`
+
+## Module structure
+
+```
+terraform/
+  modules/
+    agentidp/
+      main.tf       — compute (ECS task or Cloud Run service)
+      networking.tf — VPC, subnets, security groups
+      variables.tf  — all configurable inputs
+      outputs.tf    — service URL, DB endpoint, Redis endpoint
+    rds/            — managed PostgreSQL
+    redis/          — managed Redis
+    lb/             — ALB (AWS) or Cloud LB (GCP), TLS cert
+  environments/
+    aws/
+      main.tf       — calls modules, sets AWS-specific vars
+      variables.tf
+      terraform.tfvars.example
+    gcp/
+      main.tf
+      variables.tf
+      terraform.tfvars.example
+```
+
+## Acceptance Criteria
+- [ ] `terraform validate` passes for both aws and gcp environments
+- [ ] `terraform plan` produces no errors against a live AWS/GCP account (test in dev env)
+- [ ] JWT_PRIVATE_KEY and JWT_PUBLIC_KEY injected as environment secrets (not hardcoded)
+- [ ] TLS termination at load balancer — HTTPS only in production modules
+- [ ] PostgreSQL and Redis not publicly accessible — VPC-internal only
+- [ ] `docs/devops/deployment.md` — end-to-end deployment walkthrough for AWS and GCP
+- [ ] `terraform.tfvars.example` provided for both environments — no secrets in version control
--- a/openspec/specs/dev-environment-setup/spec.md
+++ b/openspec/specs/dev-environment-setup/spec.md
@@ -0,0 +1,32 @@
+## ADDED Requirements
+
+### Requirement: Development environment setup guide
+The system SHALL include a document (`docs/engineering/07-dev-setup.md`) that takes a new engineer from zero to a fully running local stack in under 30 minutes, with no prior knowledge of the project assumed.
+
+#### Scenario: Prerequisites listed completely
+- **WHEN** a new engineer reads 07-dev-setup.md
+- **THEN** they SHALL find a complete prerequisites list: Node.js 18+, Docker Desktop, Git, a PostgreSQL client (optional), and links to install each — with no undocumented dependencies
+
+#### Scenario: Repository clone and setup steps complete
+- **WHEN** a new engineer follows the clone and setup steps
+- **THEN** they SHALL be able to: clone the repo, copy `.env.example` to `.env`, run `npm install`, and have all dependencies installed with zero manual configuration
+
+#### Scenario: Docker Compose local stack starts successfully
+- **WHEN** a new engineer runs `docker-compose up -d`
+- **THEN** all services (PostgreSQL, Redis, API server) SHALL start, migrations SHALL run automatically, and the guide SHALL show how to verify each service is healthy
+
+#### Scenario: Smoke test confirms working stack
+- **WHEN** a new engineer follows the smoke test section
+- **THEN** they SHALL run a curl command to POST /oauth2/token with the seed credentials and receive a valid JWT — confirming the full stack is operational
+
+#### Scenario: Common setup errors documented
+- **WHEN** a new engineer encounters a setup error
+- **THEN** the guide SHALL include a troubleshooting section covering the 5 most common errors: port already in use, migration failure, Node version mismatch, Docker not running, and missing .env variables
+
+#### Scenario: Running tests locally documented
+- **WHEN** a new engineer wants to run the test suite
+- **THEN** the guide SHALL show: `npm test` (unit tests only, no services needed), `npm run test:integration` (requires Docker stack), and how to run a single test file
+
+#### Scenario: Web dashboard local development documented
+- **WHEN** a new engineer wants to run the web dashboard
+- **THEN** the guide SHALL show how to start the Vite dev server (`npm run dev` in `dashboard/`) and which port it runs on, and confirm it connects to the local API server
--- a/openspec/specs/engineering-overview/spec.md
+++ b/openspec/specs/engineering-overview/spec.md
@@ -0,0 +1,28 @@
+## ADDED Requirements
+
+### Requirement: Company and product overview document
+The system SHALL include a document (`docs/engineering/01-overview.md`) that explains SentryAgent.ai's mission, the AgentIdP product, target users, and why the product exists — providing new engineers with business and product context before they read any technical content.
+
+#### Scenario: Mission and vision covered
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL understand what SentryAgent.ai builds, why it exists, and what problem it solves for AI developers
+
+#### Scenario: AGNTCY alignment explained
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL understand what AGNTCY is, why SentryAgent.ai aligns to it, and what "first-class agent identity" means
+
+#### Scenario: Product features listed
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL see a summary of all product capabilities: agent registry, OAuth 2.0 auth, credential management, audit logs, SDKs, web dashboard, policy engine, and monitoring
+
+#### Scenario: Phase roadmap visible
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL understand which capabilities belong to Phase 1, Phase 2, and Phase 3
+
+#### Scenario: Engineering team structure explained
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL understand the Virtual Engineering Team model (CTO → Architect → Developer → QA) and how Claude operates as the engineering partner
+
+#### Scenario: Free tier limits documented
+- **WHEN** a new engineer reads 01-overview.md
+- **THEN** they SHALL see the free tier limits (100 agents, 10,000 token requests/month, 90-day audit retention, 100 req/min) and understand the product's positioning
--- a/openspec/specs/engineering-workflow/spec.md
+++ b/openspec/specs/engineering-workflow/spec.md
@@ -0,0 +1,32 @@
+## ADDED Requirements
+
+### Requirement: Engineering workflow and contribution guide
+The system SHALL include a document (`docs/engineering/08-workflow.md`) that prescribes the exact steps an engineer MUST follow to contribute any new feature or change, from idea to merged code.
+
+#### Scenario: OpenSpec spec-first workflow explained
+- **WHEN** a new engineer reads 08-workflow.md
+- **THEN** they SHALL understand that NO implementation begins without an approved OpenAPI spec — and the exact sequence: CEO approves → Architect writes spec → CTO reviews → Developer implements → QA signs off → CEO approves merge
+
+#### Scenario: OpenSpec CLI commands documented
+- **WHEN** a new engineer wants to start a new change
+- **THEN** the guide SHALL provide the exact commands: `openspec new change <name>`, `openspec status --change <name>`, `openspec instructions <artifact> --change <name>`, and what each command does
+
+#### Scenario: Branching strategy documented
+- **WHEN** a new engineer creates a branch
+- **THEN** the guide SHALL prescribe: feature branches from `develop`, naming convention `feature/<change-name>`, PR targets `develop`, `develop` → `main` requires CTO + CEO approval
+
+#### Scenario: TypeScript and code standards enforced in workflow
+- **WHEN** a new engineer writes code
+- **THEN** the guide SHALL state the non-negotiable standards: strict mode, no `any`, DRY, SOLID, JSDoc on all public methods — and that PRs violating these are blocked by the CTO regardless of functionality
+
+#### Scenario: PR checklist documented
+- **WHEN** a new engineer opens a PR
+- **THEN** the guide SHALL provide a PR checklist: TypeScript compiles with zero errors, ESLint passes with zero warnings, unit tests pass, coverage gate met (>80%), integration tests pass, OpenAPI spec updated if endpoint changed, engineering docs updated if architecture changed
+
+#### Scenario: Virtual engineering team roles explained for contributors
+- **WHEN** a new engineer reads 08-workflow.md
+- **THEN** they SHALL understand the role separation: they contribute as the Principal Developer role, the CTO reviews all PRs, the Architect owns spec changes, and QA owns the test sign-off — and how to interact with each role in practice
+
+#### Scenario: Commit message conventions documented
+- **WHEN** a new engineer writes a commit message
+- **THEN** the guide SHALL prescribe the Conventional Commits format: `feat:`, `fix:`, `docs:`, `test:`, `chore:`, `refactor:` prefixes — with examples for each
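Note on the engineering-workflow spec above: the commit-prefix rule can be checked mechanically. The sketch below is one possible check, assuming the six prefixes listed in the spec plus the optional `(scope)` and `!` markers from the Conventional Commits format. The function name `isConventionalSubject` is hypothetical, and nothing in the spec says such a check exists in the repository.

```typescript
// Prefixes taken from the spec; the scope and "!" parts follow the
// Conventional Commits format and are an assumption about what is allowed here.
const ALLOWED_TYPES = ["feat", "fix", "docs", "test", "chore", "refactor"];

const SUBJECT_PATTERN = new RegExp(
  `^(${ALLOWED_TYPES.join("|")})(\\([\\w./-]+\\))?!?: \\S.*$`
);

// Returns true when the first line of a commit message has the form
// "type: subject" or "type(scope): subject".
function isConventionalSubject(subject: string): boolean {
  return SUBJECT_PATTERN.test(subject);
}

// The subject line of this very commit passes the check.
console.log(
  isConventionalSubject(
    "chore(openspec): archive engineering-docs and phase-2-production-ready changes"
  )
);
```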
--- a/openspec/specs/go-sdk/spec.md
+++ b/openspec/specs/go-sdk/spec.md
@@ -0,0 +1,23 @@
+# Spec: Go SDK (`github.com/sentryagent/idp-sdk-go`)
+
+**Status**: Pending CEO approval
+**Workstream**: 3 of 8
+
+## Scope
+- `sdk-go/` directory at project root
+- Context-aware `AgentIdPClient` using standard library `net/http`
+- `TokenManager` with mutex-guarded cache and 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- Idiomatic Go error type `AgentIdPError` implementing `error` interface
+- `go.mod` module: `github.com/sentryagent/idp-sdk-go`
+- `sdk-go/README.md`
+
+## Acceptance Criteria
+- [ ] All 14 endpoints covered
+- [ ] All methods take `context.Context` as first argument
+- [ ] No panics — all errors returned as `error`
+- [ ] `AgentIdPError` implements `error` and exposes `.Code`, `.HTTPStatus`, `.Details`
+- [ ] `TokenManager` is goroutine-safe (`sync.Mutex` on cache)
+- [ ] `go vet` and `staticcheck` pass with zero warnings
+- [ ] `go test ./...` with >80% coverage
+- [ ] README matches Node.js SDK structure
--- a/openspec/specs/java-sdk/spec.md
+++ b/openspec/specs/java-sdk/spec.md
@@ -0,0 +1,23 @@
+# Spec: Java SDK (`ai.sentryagent:idp-sdk`)
+
+**Status**: Pending CEO approval
+**Workstream**: 4 of 8
+
+## Scope
+- `sdk-java/` directory at project root
+- `AgentIdPClient` with sync and `CompletableFuture` async variants
+- `TokenManager` with thread-safe cache and 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- `AgentIdPException` extending `RuntimeException` with `code`, `httpStatus`, `details`
+- `pom.xml`: groupId=`ai.sentryagent`, artifactId=`idp-sdk`, Java 17+
+- `sdk-java/README.md`
+
+## Acceptance Criteria
+- [ ] All 14 endpoints covered
+- [ ] Sync methods return typed POJOs; async methods return `CompletableFuture<T>`
+- [ ] `AgentIdPException` thrown (not raw IOException) on all failure paths
+- [ ] `TokenManager` is thread-safe (`synchronized` on cache)
+- [ ] Apache HttpClient 5 for HTTP transport
+- [ ] Jackson for JSON serialization
+- [ ] `mvn verify` passes with >80% coverage (JUnit 5)
+- [ ] README matches Node.js SDK structure
--- a/openspec/specs/monitoring/spec.md
+++ b/openspec/specs/monitoring/spec.md
@@ -0,0 +1,32 @@
+# Spec: Prometheus + Grafana Monitoring
+
+**Status**: Pending CEO approval
+**Workstream**: 7 of 8
+
+## Scope
+- `prom-client` integration — expose `GET /metrics`
+- 7 metrics (counters + histograms) across all services
+- `monitoring/` directory: Prometheus config + Grafana provisioning
+- `docker-compose.monitoring.yml` overlay (adds prometheus + grafana services)
+- Pre-built Grafana dashboard JSON (`monitoring/grafana/dashboards/agentidp.json`)
+
+## Metrics
+
+| Metric | Type | Labels |
+|--------|------|--------|
+| `agentidp_tokens_issued_total` | Counter | `outcome` (success/failure) |
+| `agentidp_agents_registered_total` | Counter | `outcome` |
+| `agentidp_http_requests_total` | Counter | `method`, `path`, `status_code` |
+| `agentidp_http_request_duration_seconds` | Histogram | `method`, `path` |
+| `agentidp_rate_limit_rejections_total` | Counter | — |
+| `agentidp_db_query_duration_seconds` | Histogram | `operation` |
+| `agentidp_redis_command_duration_seconds` | Histogram | `command` |
+
+## Acceptance Criteria
+- [ ] `GET /metrics` returns Prometheus text format
+- [ ] `/metrics` endpoint does NOT require Bearer auth (Prometheus scrapes it)
+- [ ] All 7 metrics present and updating under load
+- [ ] Grafana dashboard auto-provisions on `docker compose -f docker-compose.monitoring.yml up`
+- [ ] Grafana runs on port 3001 (no conflict with AgentIdP on 3000)
+- [ ] `docs/devops/operations.md` updated with monitoring section
+- [ ] `prom-client` added as new dependency — CEO approval gate
--- a/openspec/specs/opa-policy/spec.md
+++ b/openspec/specs/opa-policy/spec.md
@@ -0,0 +1,37 @@
+# Spec: OPA Policy Engine Integration
+
+**Status**: Pending CEO approval
+**Workstream**: 5 of 8
+
+## Scope
+- New `OpaMiddleware` replacing static scope check in `auth.ts`
+- `@openpolicyagent/opa-wasm` integration (embedded Wasm, no sidecar)
+- `policies/authz.rego` — main allow/deny policy
+- `policies/data/scopes.json` — scope to permission mapping
+- SIGHUP handler to hot-reload policies without restart
+- New env var: `POLICY_DIR` (default: `./policies`)
+
+## Policy interface
+
+```
+input = {
+  "method": "GET",
+  "path": "/api/v1/agents",
+  "scopes": ["agents:read"],
+  "agentId": "uuid"
+}
+
+output = {
+  "allow": true | false,
+  "reason": "string"   // populated when allow=false
+}
+```
+
+## Acceptance Criteria
+- [ ] All existing scope checks replaced by OPA evaluation
+- [ ] Policy files hot-reloadable on SIGHUP (no restart required)
+- [ ] OPA Wasm loaded at startup — fail-fast if `POLICY_DIR` invalid
+- [ ] `allow=false` responses return `403` with `reason` in error body
+- [ ] Existing test suite passes unchanged (OPA evaluates same rules as before)
+- [ ] New unit tests for OPA middleware: allow/deny cases, missing scope, invalid input
+- [ ] `POLICY_DIR` env var documented in `docs/devops/environment-variables.md`
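Note on the opa-policy spec above: the policy interface and the `allow=false` → `403` criterion can be expressed as types plus a small mapping function. This is a sketch under stated assumptions. The `evaluate` function is a plain TypeScript stand-in for the compiled Rego policy, not a call into `@openpolicyagent/opa-wasm`; the `REQUIRED_SCOPE` table and the response body shape are invented for illustration.

```typescript
// Shapes copied from the "Policy interface" section of the spec.
interface PolicyInput {
  method: string;
  path: string;
  scopes: string[];
  agentId: string;
}

interface PolicyDecision {
  allow: boolean;
  reason?: string; // populated when allow is false
}

// Hypothetical route-to-scope table standing in for policies/data/scopes.json.
const REQUIRED_SCOPE: Record<string, string> = {
  "GET /api/v1/agents": "agents:read",
};

// Stand-in for policy evaluation: deny by default, allow on a scope match.
function evaluate(input: PolicyInput): PolicyDecision {
  const required = REQUIRED_SCOPE[`${input.method} ${input.path}`];
  if (required === undefined) {
    return { allow: false, reason: "no policy rule matches this route" };
  }
  if (input.scopes.indexOf(required) === -1) {
    return { allow: false, reason: `missing scope: ${required}` };
  }
  return { allow: true };
}

// Maps a decision to what the middleware would send: null means "continue",
// otherwise a 403 carrying the reason in the error body.
function denyResponse(
  decision: PolicyDecision
): { status: 403; body: { reason: string } } | null {
  if (decision.allow) {
    return null;
  }
  return { status: 403, body: { reason: decision.reason ?? "forbidden" } };
}
```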
--- a/openspec/specs/python-sdk/spec.md
+++ b/openspec/specs/python-sdk/spec.md
@@ -0,0 +1,24 @@
+# Spec: Python SDK (`sentryagent-idp`)
+
+**Status**: Pending CEO approval
+**Workstream**: 2 of 8
+
+## Scope
+- `sdk-python/` directory at project root
+- `AgentIdPClient` with sync and async variants
+- `TokenManager` with 60s auto-refresh
+- Service clients: `AgentRegistryClient`, `CredentialClient`, `TokenClient`, `AuditClient`
+- `AgentIdPError` typed exception
+- Full type hints — `mypy --strict` clean
+- `sdk-python/README.md` with installation and usage
+
+## Acceptance Criteria
+- [ ] All 14 API endpoints covered
+- [ ] Sync client: `requests` library
+- [ ] Async client: `httpx` library
+- [ ] `mypy --strict` passes with zero errors
+- [ ] Zero untyped code
+- [ ] `AgentIdPError` raised (not raw requests/httpx exceptions) on all failure paths
+- [ ] `TokenManager` tested: caches token, refreshes at exp-60s
+- [ ] `pyproject.toml` with: name=sentryagent-idp, python>=3.9, dependencies declared
+- [ ] README matches Node.js SDK structure
--- a/openspec/specs/sdk-guide/spec.md
+++ b/openspec/specs/sdk-guide/spec.md
@@ -0,0 +1,28 @@
+## ADDED Requirements
+
+### Requirement: SDK integration guide
+The system SHALL include a document (`docs/engineering/11-sdk-guide.md`) that explains how each of the four language SDKs is structured, how to use them, and how to contribute to or extend them.
+
+#### Scenario: SDK architecture overview present
+- **WHEN** a new engineer reads 11-sdk-guide.md
+- **THEN** they SHALL understand that all four SDKs (Node.js, Python, Go, Java) implement the same API surface (14 endpoints, 4 service clients, 1 TokenManager, 1 error type) with identical semantics, and why consistency across SDKs is a non-negotiable standard
+
+#### Scenario: Node.js SDK documented
+- **WHEN** a new engineer reads the Node.js SDK section
+- **THEN** they SHALL find: installation (`npm install @sentryagent/idp-sdk`), the AgentIdPClient constructor, all 4 service clients (agents, credentials, tokens, audit), TokenManager auto-refresh behaviour, AgentIdPError structure, and a complete working code example for the most common flow (register agent → generate credential → issue token)
+
+#### Scenario: Python SDK documented
+- **WHEN** a new engineer reads the Python SDK section
+- **THEN** they SHALL find: installation (`pip install sentryagent-idp`), both sync (AgentIdPClient) and async (AsyncAgentIdPClient) variants, TokenManager and AsyncTokenManager auto-refresh, AgentIdPError, and a complete working example for sync and async usage
+
+#### Scenario: Go SDK documented
+- **WHEN** a new engineer reads the Go SDK section
+- **THEN** they SHALL find: installation (`go get github.com/sentryagent/idp-sdk-go`), AgentIdPClient construction, goroutine-safe TokenManager, context.Context usage pattern, AgentIdPError with Code/HTTPStatus/Details, and a complete working example
+
+#### Scenario: Java SDK documented
+- **WHEN** a new engineer reads the Java SDK section
+- **THEN** they SHALL find: Maven/Gradle dependency snippet, AgentIdPClient construction with builder pattern, sync methods and CompletableFuture async counterparts, thread-safe TokenManager, AgentIdPException, and a complete working example
+
+#### Scenario: SDK contribution guide included
+- **WHEN** a new engineer needs to add a new endpoint to all SDKs
+- **THEN** the guide SHALL provide a step-by-step checklist for adding a new method to all four SDKs consistently: where to add the method, what the signature pattern is, how to write the corresponding test, and how to verify it compiles/passes in each language
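Note on the sdk-guide spec above: every SDK spec in this commit describes a `TokenManager` that caches a token and refreshes it 60 seconds before expiry. A minimal sketch of that caching rule is shown below. The constructor parameters, the `CachedToken` shape and the synchronous `fetchToken` callback are assumptions made to keep the example short and testable; a real SDK would fetch the token over HTTP asynchronously.

```typescript
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time in epoch milliseconds
}

// Refresh this long before expiry, per the "60s auto-refresh" wording in the specs.
const REFRESH_MARGIN_MS = 60 * 1000;

class TokenManager {
  private cached: CachedToken | undefined;

  constructor(
    // Hypothetical callback that obtains a fresh token.
    private readonly fetchToken: () => CachedToken,
    // Injectable clock so the refresh rule can be tested deterministically.
    private readonly now: () => number = () => Date.now()
  ) {}

  getToken(): string {
    let token = this.cached;
    // Fetch when nothing is cached or the token is within 60s of expiring.
    if (token === undefined || this.now() >= token.expiresAtMs - REFRESH_MARGIN_MS) {
      token = this.fetchToken();
      this.cached = token;
    }
    return token.accessToken;
  }
}
```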
--- a/openspec/specs/service-deep-dives/spec.md
+++ b/openspec/specs/service-deep-dives/spec.md
@@ -0,0 +1,40 @@
+## ADDED Requirements
+
+### Requirement: Service deep-dive documents
+The system SHALL include a document (`docs/engineering/05-services.md`) providing a deep-dive reference for every core service and component, following a consistent template: Purpose → Responsibility boundary → Public interface → Key methods → Database schema (if applicable) → Error types → Configuration.
+
+#### Scenario: AgentService documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the AgentService section covering: responsibility (agent CRUD only), public methods (createAgent, getAgent, listAgents, updateAgent, deleteAgent), the `agents` table schema, AgentNotFoundError and AgentAlreadyExistsError, and what AgentService does NOT do (no auth, no credentials — Single Responsibility)
+
+#### Scenario: OAuth2Service documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the OAuth2Service section covering: responsibility (token issuance and revocation only), public methods (issueToken, validateToken, revokeToken), Redis token storage schema, JWT payload structure, token TTL configuration, and the Vault credential verification path vs bcrypt path
+
+#### Scenario: CredentialService documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the CredentialService section covering: responsibility (credential lifecycle only), public methods (generateCredential, rotateCredential, revokeCredential, listCredentials), the `credentials` table schema, bcrypt vs Vault storage decision, and the `vault_path` column purpose
+
+#### Scenario: AuditService documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the AuditService section covering: responsibility (immutable audit logging only), public methods (logEvent, queryLogs), the `audit_logs` table schema, event types enum, 90-day retention policy, and why audit records are never updated or deleted
+
+#### Scenario: VaultClient documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the VaultClient section covering: purpose (wraps node-vault for KV v2 operations), public methods (writeSecret, readSecret, verifySecret, deleteSecret), the opt-in configuration (VAULT_ADDR env var), and the constant-time comparison in verifySecret and why it matters (timing attack prevention)
+
+#### Scenario: OPA policy engine documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the OPA section covering: purpose (dynamic access control beyond static OAuth scopes), how policies are loaded, how authorization decisions are made, the policy file locations, and how to write and test a new policy
+
+#### Scenario: Web Dashboard documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the Web Dashboard section covering: React 18 + Vite 5 + TypeScript structure, how it authenticates against the AgentIdP API, the main views (agent list, credential management, audit log viewer, policy editor), and how to run it locally
+
+#### Scenario: Monitoring stack documented
+- **WHEN** a new engineer reads 05-services.md
+- **THEN** they SHALL find the monitoring section covering: Prometheus metrics exposed by the API server (`/metrics`), the key metrics (request count, latency histograms, active tokens, agent count), Grafana dashboard structure, and how to add a new metric to the API server
+
+#### Scenario: Consistent template enforced
+- **WHEN** a new engineer looks up any service
+- **THEN** every service section SHALL follow the same template so the engineer knows exactly where to find each type of information
--- a/openspec/specs/testing-strategy/spec.md
+++ b/openspec/specs/testing-strategy/spec.md
@@ -0,0 +1,32 @@
+## ADDED Requirements
+
+### Requirement: Testing strategy document
+The system SHALL include a document (`docs/engineering/09-testing.md`) that explains the test architecture, how to run tests, coverage requirements, and how to write new tests following project conventions.
+
+#### Scenario: Test types and their purposes explained
+- **WHEN** a new engineer reads 09-testing.md
+- **THEN** they SHALL understand the distinction between: unit tests (test one service/util in isolation, mock all dependencies, no running services needed) and integration tests (test full HTTP request/response cycle with real PostgreSQL + Redis)
+
+#### Scenario: Test framework stack documented
+- **WHEN** a new engineer reads 09-testing.md
+- **THEN** they SHALL find the test stack listed and explained: Jest 29.7 (test runner + assertions), ts-jest (TypeScript compilation), Supertest 6.3 (HTTP integration testing), and how each is configured
+
+#### Scenario: Coverage gates documented
+- **WHEN** a new engineer reads 09-testing.md
+- **THEN** they SHALL know the mandatory gates: >80% statements, >80% branches, >80% functions, >80% lines — and that PRs below these thresholds are blocked
+
+#### Scenario: How to run the test suite documented
+- **WHEN** a new engineer wants to run tests
+- **THEN** the guide SHALL show: `npm test` (unit tests, no services), `npm run test:coverage` (unit tests + coverage report), `npm run test:integration` (requires Docker stack), and `npx jest src/services/agentService.test.ts` (single file)
+
+#### Scenario: Unit test writing conventions shown
+- **WHEN** a new engineer writes a new unit test
+- **THEN** the guide SHALL show a complete example: how to mock a repository with `jest.mock()`, how to structure `describe`/`it` blocks, how to assert on thrown errors, and how to verify mock calls — using an actual test from the codebase as the example
+
+#### Scenario: Integration test writing conventions shown
+- **WHEN** a new engineer writes a new integration test
+- **THEN** the guide SHALL show a complete example using Supertest: how to boot the Express app, how to seed test data, how to make authenticated requests (including getting a JWT first), and how to clean up after the test
+
+#### Scenario: OWASP security testing reference included
+- **WHEN** a new engineer writes security-relevant code
+- **THEN** the guide SHALL include a reference to the OWASP Top 10 checks that are verified in QA sign-off and what each means in the context of this codebase (SQL injection, JWT attacks, credential exposure, etc.)
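
Note on the coverage-gates scenario above: one way such gates are commonly enforced is Jest's `coverageThreshold` option. The fragment below is a sketch of what a `jest.config.ts` could contain, assuming the `ts-jest` preset. The project's real configuration is not shown in this diff and may differ.

```typescript
// Sketch of a jest.config.ts (assumed layout, not copied from the repository).
const config = {
  preset: "ts-jest",
  testEnvironment: "node",
  collectCoverage: true,
  // Jest treats positive numbers as minimum percentages and exits non-zero
  // when any of them is not met, which is what lets CI block a PR.
  coverageThreshold: {
    global: { statements: 80, branches: 80, functions: 80, lines: 80 },
  },
};

export default config;
```

Jest's check is "at least 80", whereas the spec says ">80%"; if the strict inequality matters, the guide would need to state how the boundary case is handled.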
--- a/openspec/specs/vault/spec.md
+++ b/openspec/specs/vault/spec.md
@@ -0,0 +1,21 @@
+# Spec: HashiCorp Vault Integration
+
+**Status**: Pending CEO approval
+**Workstream**: 1 of 8
+
+## Scope
+- VaultClient class wrapping `node-vault`
+- `005_add_vault_path.sql` migration
+- Updated CredentialService to write secrets to Vault instead of PostgreSQL
+- New env vars: VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT
+- Migration guide: bcrypt → Vault coexistence strategy
+
+## Acceptance Criteria
+- [ ] New credentials: secret written to Vault KV v2, `vault_path` stored in PostgreSQL
+- [ ] Credential rotation: Vault versioned update, `vault_path` unchanged
+- [ ] Credential revocation: Vault secret deleted, DB status = `revoked`
+- [ ] Existing bcrypt credentials continue to work until rotated
+- [ ] VaultClient follows existing service interface pattern (DRY, SOLID)
+- [ ] Zero `any` types, TypeScript strict
+- [ ] `VAULT_ADDR` / `VAULT_TOKEN` validation at startup (fail-fast)
+- [ ] DevOps docs updated with Vault setup section
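
Note on the fail-fast criterion above: a startup check might look like the sketch below. The function name `loadVaultConfig`, the `VaultConfig` shape, and the default mount `"secret"` are all assumptions for illustration. The sketch also assumes Vault stays opt-in (disabled when `VAULT_ADDR` is unset) and that `VAULT_TOKEN` becomes mandatory once `VAULT_ADDR` is present.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface VaultConfig {
  addr: string;
  token: string;
  mount: string;
}

// Returns undefined when Vault is not enabled, throws on a partial or
// malformed configuration so the process stops before serving traffic.
function loadVaultConfig(
  env: Record<string, string | undefined>,
): VaultConfig | undefined {
  const addr = env.VAULT_ADDR;
  if (addr === undefined || addr === "") {
    return undefined; // Vault not configured: assumed to mean "feature off"
  }
  if (!/^https?:\/\//.test(addr)) {
    throw new Error(`VAULT_ADDR must be an http(s) URL, got "${addr}"`);
  }
  const token = env.VAULT_TOKEN;
  if (token === undefined || token === "") {
    throw new Error("VAULT_TOKEN is required when VAULT_ADDR is set");
  }
  // "secret" as the fallback mount is an assumption, not from the spec.
  return { addr, token, mount: env.VAULT_MOUNT ?? "secret" };
}
```

At startup this would be called once, for example as `loadVaultConfig(process.env)`, before the HTTP server begins listening.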
--- a/openspec/specs/web-dashboard/spec.md
+++ b/openspec/specs/web-dashboard/spec.md
@@ -0,0 +1,34 @@
+# Spec: Web Dashboard UI
+
+**Status**: Pending CEO approval
+**Workstream**: 6 of 8
+
+## Scope
+- `dashboard/` directory at project root
+- React 18 + TypeScript strict, built with Vite 5
+- TanStack Query v5 for server state
+- shadcn/ui (Radix UI + Tailwind CSS) for components
+- Four main pages: Agents, Credentials, Audit Log, Health (the route table below also lists Login and Agent Detail)
+- Client-side auth: `clientId` + `clientSecret` → `TokenManager`
+- Served from AgentIdP server at `GET /dashboard` (static build)
+
+## Pages
+
+| Page | Route | Scope Required |
+|------|-------|---------------|
+| Login | `/dashboard/login` | None |
+| Agents | `/dashboard/agents` | `agents:read` |
+| Agent Detail | `/dashboard/agents/:id` | `agents:read` |
+| Credentials | `/dashboard/agents/:id/credentials` | `agents:read` |
+| Audit Log | `/dashboard/audit` | `audit:read` |
+| Health | `/dashboard/health` | None |
+
+## Acceptance Criteria
+- [ ] TypeScript strict — zero `any` across all dashboard files
+- [ ] `dashboard/tsconfig.json` with `strict: true`
+- [ ] Login form stores token in `sessionStorage` only (not `localStorage`)
+- [ ] All write operations (suspend, revoke, rotate) require confirmation dialog
+- [ ] OWASP Top 10 review: no XSS, no CSRF, no sensitive data in URL params
+- [ ] Vite build outputs to `dashboard/dist/`; AgentIdP serves it as static files
+- [ ] `dashboard/README.md` — how to build and serve
+- [ ] Responsive layout — functional on desktop and tablet
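
Note on the `sessionStorage` criterion above: the sketch below shows one way to keep the token in session-scoped storage only. `SessionTokenStore`, the `KeyValueStore` interface, and the storage key are hypothetical names. This is not the spec's `TokenManager`, whose interface is not shown in this diff. The storage backend is injected so the class can be exercised without a browser.

```typescript
// Minimal subset of the Web Storage API that the store relies on.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const TOKEN_KEY = "agentidp.token"; // hypothetical key name

class SessionTokenStore {
  private readonly store: KeyValueStore;
  private readonly now: () => number;

  constructor(store: KeyValueStore, now: () => number = Date.now) {
    this.store = store;
    this.now = now;
  }

  // Persist the token together with its absolute expiry time (ms since epoch).
  save(accessToken: string, expiresInSeconds: number): void {
    const expiresAt = this.now() + expiresInSeconds * 1000;
    this.store.setItem(TOKEN_KEY, JSON.stringify({ accessToken, expiresAt }));
  }

  // Return the token, or null if none is stored or it has expired.
  get(): string | null {
    const raw = this.store.getItem(TOKEN_KEY);
    if (raw === null) return null;
    const parsed = JSON.parse(raw) as { accessToken: string; expiresAt: number };
    if (this.now() >= parsed.expiresAt) {
      this.clear();
      return null;
    }
    return parsed.accessToken;
  }

  clear(): void {
    this.store.removeItem(TOKEN_KEY);
  }
}
```

In the browser this would be constructed as `new SessionTokenStore(window.sessionStorage)`, so the token disappears when the tab closes and is never written to `localStorage`.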