# 06 — Code Walkthroughs

Last verified against commit: `1f95cfe89d1f45fa43b9fb7cff237f07bf9e889e`

These walkthroughs trace five real production code paths from the HTTP request
to the database and back. Every step includes a `file:line` reference and a
"why" annotation explaining the design decision.

---

## Walkthrough 1 — Token Issuance

**Request:** `POST /api/v1/token` with `grant_type=client_credentials`

This is the most security-critical path in the codebase. An AI agent calling this
endpoint is proving its identity and receiving a token that grants access to the
entire API for one hour.

---

### Step 1 — Express middleware stack

**File:** `src/app.ts` lines 57–83

```
helmet()                                → security headers
cors()                                  → CORS headers
morgan()                                → access log line (skipped in test env)
express.json()                          → parse JSON bodies
express.urlencoded({ extended: false }) → parse form-encoded bodies
metricsMiddleware                       → start request timer, record counters on finish
```

**Why `extended: false`?** The token endpoint receives `application/x-www-form-urlencoded`
bodies (RFC 6749 mandates this format for OAuth 2.0). The `express.urlencoded`
middleware parses them into `req.body`. `extended: false` uses the native `querystring`
parser, which is sufficient and avoids `qs` library complexity for flat key-value data.
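The practical difference is easy to see with Node's built-in parser (a standalone sketch, not code from this codebase):

```typescript
import querystring from 'node:querystring';

// extended: false → Express parses bodies with node:querystring, which
// handles flat key-value pairs only.
const flat = querystring.parse('grant_type=client_credentials&client_id=abc');
// flat['grant_type'] === 'client_credentials'

// Bracket syntax is NOT expanded into nested objects (the `qs` library,
// used when extended: true, would produce { a: { b: '1' } } instead).
const bracketed = querystring.parse('a[b]=1');
// bracketed['a[b]'] === '1'
```

For an OAuth token request every field is a flat string, so nothing is lost by the simpler parser.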

---

### Step 2 — Route dispatch

**File:** `src/routes/token.ts` line 24

```typescript
router.post('/', asyncHandler(rateLimitMiddleware), asyncHandler(tokenController.issueToken.bind(tokenController)));
```

**Why no `authMiddleware` here?** The token endpoint is where the agent _gets_ its
token — it cannot present a Bearer token to authenticate. Instead, credentials go
in the request body (`client_id`, `client_secret`). `POST /token` is deliberately
unauthenticated at the transport layer; authentication happens inside the controller.

**Why `asyncHandler`?** Express 4 does not catch rejected promises from async
handlers. `asyncHandler` wraps the async function and calls `next(err)` if the
promise rejects, routing the error to `errorHandler`.
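A minimal `asyncHandler` can be sketched as follows (illustrative; the project's own implementation may differ in details):

```typescript
// Minimal stand-in types so the sketch is self-contained (in the app these
// are Express's Request, Response, and NextFunction).
type Req = unknown;
type Res = unknown;
type Next = (err?: unknown) => void;

// Wrap an async middleware so a rejected promise is routed to next(err)
// instead of becoming an unhandled rejection that bypasses errorHandler.
function asyncHandler(fn: (req: Req, res: Res, next: Next) => Promise<void>) {
  return (req: Req, res: Res, next: Next): void => {
    fn(req, res, next).catch(next);
  };
}

// A handler that always rejects; the error reaches next() instead of crashing.
const boom = asyncHandler(async () => { throw new Error('db down'); });
```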

---

### Step 3 — Rate limit check

**File:** `src/middleware/rateLimit.ts`

The rate limiter checks a Redis sliding-window counter for the client's IP address.
If the counter exceeds 100 requests/minute, it throws `RateLimitError` (429).

**Why Redis, not in-memory?** If the server restarts or scales horizontally to multiple
instances, an in-memory counter would reset. Redis maintains the counter across
instances and restarts.
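The sliding-window idea can be sketched with an in-memory timestamp list (illustrative only; the real limiter keeps this state in Redis, typically via sorted-set commands such as `ZADD` and `ZREMRANGEBYSCORE`):

```typescript
const WINDOW_MS = 60_000; // one minute
const LIMIT = 100;        // requests per window

// Map of client IP → request timestamps inside the current window.
const hits = new Map<string, number[]>();

// Returns true if the request is allowed, false if the limit is exceeded.
function allowRequest(ip: string, now: number = Date.now()): boolean {
  const windowStart = now - WINDOW_MS;
  // Drop timestamps that have slid out of the window.
  const recent = (hits.get(ip) ?? []).filter((t) => t > windowStart);
  if (recent.length >= LIMIT) {
    hits.set(ip, recent);
    return false; // caller throws RateLimitError → 429
  }
  recent.push(now);
  hits.set(ip, recent);
  return true;
}
```

Unlike a fixed-window counter, this never allows a burst of 2× the limit straddling a window boundary.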

---

### Step 4 — Controller: validate grant_type

**File:** `src/controllers/TokenController.ts` lines 84–103

```typescript
issueToken = async (req: Request, res: Response, _next: NextFunction): Promise<void> => {
  const body = req.body as ITokenRequest;

  if (!body.grant_type) { ... return res.status(400).json({error: 'invalid_request', ...}) }
  if (body.grant_type !== 'client_credentials') { ... return res.status(400).json(...) }
```

**Why does this method catch errors itself instead of calling `next(err)`?** The token
endpoint must return errors in the **OAuth 2.0 error format** (`{ error, error_description }`)
per RFC 6749 §5.2, not the standard SentryAgent.ai format (`{ code, message }`). The
`mapToOAuth2Error()` helper translates `AuthenticationError` and `AuthorizationError`
into OAuth2 error codes. The `_next` parameter is intentionally unused for the error path.
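A sketch of what such a mapping helper might look like (hypothetical; the real `mapToOAuth2Error()` and its status-code choices may differ):

```typescript
class AuthenticationError extends Error {}
class AuthorizationError extends Error {}

interface IOAuth2Error {
  error: string;
  error_description: string;
}

// Translate internal error types into the RFC 6749 §5.2 wire format.
function mapToOAuth2Error(err: Error): { status: number; body: IOAuth2Error } {
  if (err instanceof AuthenticationError) {
    return { status: 401, body: { error: 'invalid_client', error_description: err.message } };
  }
  if (err instanceof AuthorizationError) {
    return { status: 403, body: { error: 'unauthorized_client', error_description: err.message } };
  }
  return { status: 500, body: { error: 'server_error', error_description: 'Internal error.' } };
}
```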

---

### Step 5 — Controller: Joi validation and credential extraction

**File:** `src/controllers/TokenController.ts` lines 106–138

```typescript
const { error, value } = tokenRequestSchema.validate(body, { abortEarly: false });
// ...
// Support HTTP Basic auth fallback (RFC 6749 §2.3.1)
const authHeader = req.headers['authorization'];
if (authHeader?.startsWith('Basic ')) {
  const base64 = authHeader.slice(6);
  const decoded = Buffer.from(base64, 'base64').toString('utf-8');
  const colonIndex = decoded.indexOf(':');
  clientId = decoded.slice(0, colonIndex);
  clientSecret = decoded.slice(colonIndex + 1);
}
```

**Why `abortEarly: false`?** This returns all validation errors at once, so the
client can fix all problems in one round trip.

**Why Basic auth support?** RFC 6749 §2.3.1 specifies that client credentials MAY
be sent via HTTP Basic authentication. Some OAuth libraries default to this method.

---

### Step 6 — Controller: scope validation

**File:** `src/controllers/TokenController.ts` lines 141–151

```typescript
const requestedScope = tokenBody.scope ?? 'agents:read';
const validScopes = ['agents:read', 'agents:write', 'tokens:read', 'audit:read'];
const scopeList = requestedScope.split(' ');
const invalidScope = scopeList.find((s) => !validScopes.includes(s));
if (invalidScope) { return res.status(400).json({error: 'invalid_scope', ...}) }
```

**Why validate scopes here?** Scope validation at the controller layer provides an
RFC 6749-compliant `invalid_scope` error before we even look up the agent. This is
faster and gives the client a clearer error message.

---

### Step 7 — Service: agent lookup

**File:** `src/services/OAuth2Service.ts` lines 83–94

```typescript
const agent = await this.agentRepository.findById(clientId);
if (!agent) {
  void this.auditService.logEvent(clientId, 'auth.failed', 'failure', ..., { reason: 'agent_not_found', clientId });
  throw new AuthenticationError('Client authentication failed...');
}
```

**Why log auth failures?** Failed authentication attempts may indicate a brute-force
attack or a misconfigured client. Having them in the audit log enables incident
investigation and alerting.

**Why not distinguish between "agent not found" and "wrong secret" in the error message?**
Revealing which is wrong gives an attacker information — they can enumerate valid
`client_id` values by checking whether they get "agent not found" vs "wrong secret".
Both cases return the same message.

---

### Step 8 — Service: credential verification

**File:** `src/services/OAuth2Service.ts` lines 97–131

```typescript
const { credentials } = await this.credentialRepository.findByAgentId(clientId, { status: 'active', page: 1, limit: 100 });

for (const cred of credentials) {
  const credRow = await this.credentialRepository.findById(cred.credentialId);
  if (credRow) {
    if (credRow.expiresAt !== null && credRow.expiresAt < new Date()) { continue; }

    let matches: boolean;
    if (credRow.vaultPath !== null && this.vaultClient !== null) {
      matches = await this.vaultClient.verifySecret(clientId, credRow.credentialId, clientSecret);
    } else {
      matches = await verifySecret(clientSecret, credRow.secretHash);
    }
    if (matches) { credentialVerified = true; break; }
  }
}
```

**Why iterate over multiple credentials?** An agent can have multiple active
credentials (e.g. one per service that calls it). The agent rotates credentials
one at a time — if credential A is rotated while service X is still using it,
service X will fail. By checking all active credentials, we allow overlapping rotation.

**Why check expiry before hashing?** Bcrypt is intentionally slow (~100ms). Checking
expiry first is a cheap early exit that avoids the bcrypt computation on expired
credentials.

---

### Step 9 — Service: status and monthly limit checks

**File:** `src/services/OAuth2Service.ts` lines 144–176

```typescript
if (agent.status === 'suspended') { throw new AuthorizationError(...) }
if (agent.status === 'decommissioned') { throw new AuthorizationError(...) }

const monthlyCount = await this.tokenRepository.getMonthlyCount(clientId);
if (monthlyCount >= FREE_TIER_MAX_MONTHLY_TOKENS) { throw new FreeTierLimitError(...) }
```

**Why check status after credential verification?** We verify credentials first so
a suspended agent with a wrong secret gets `AuthenticationError` (401) not
`AuthorizationError` (403). This prevents leaking which agents are suspended to
unauthenticated callers.

---

### Step 10 — Service: sign the JWT

**File:** `src/services/OAuth2Service.ts` lines 179–190

```typescript
const jti = uuidv4();
const payload: Omit<ITokenPayload, 'iat' | 'exp'> = { sub: clientId, client_id: clientId, scope, jti };
const accessToken = signToken(payload, this.privateKey);
```

**File:** `src/utils/jwt.ts` lines 19–31

```typescript
export function signToken(payload: Omit<ITokenPayload, 'iat' | 'exp'>, privateKey: string): string {
  const now = Math.floor(Date.now() / 1000);
  const fullPayload: ITokenPayload = { ...payload, iat: now, exp: now + TOKEN_EXPIRES_IN };
  return jwt.sign(fullPayload, privateKey, { algorithm: 'RS256' });
}
```

**Why RS256 instead of HS256?** RS256 (RSA asymmetric) allows any consumer of the
token to verify it using the public key without needing the private signing key.
HS256 (HMAC symmetric) would require sharing the secret with every service that
verifies tokens.
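The asymmetric property is easy to demonstrate with Node's built-in `crypto` module (a standalone sketch, independent of the JWT library the codebase uses):

```typescript
import { createSign, createVerify, generateKeyPairSync } from 'node:crypto';

// Generate an RSA pair; in production the private key stays with the issuer
// and only the public key is distributed to verifiers.
const { privateKey, publicKey } = generateKeyPairSync('rsa', { modulusLength: 2048 });

const payload = JSON.stringify({ sub: 'agent-123', scope: 'agents:read' });

// The issuer signs with the private key (RS256 does this under the hood,
// over the base64url-encoded JWT header and payload).
const signature = createSign('RSA-SHA256').update(payload).sign(privateKey);

// Any consumer verifies with the public key alone — no shared secret needed.
const valid = createVerify('RSA-SHA256').update(payload).verify(publicKey, signature);
// valid === true; a tampered payload verifies as false
```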

**Why `jti` (JWT ID)?** The `jti` is a unique identifier for this specific token.
It is used as the key in the Redis revocation list. Without `jti`, you cannot
revoke a single token without revoking all tokens for the agent.
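Revocation can be sketched as a keyed deny-list whose entries become irrelevant once the token itself expires (illustrative; the real list lives in Redis, e.g. a key of the form `revoked:<jti>` with a TTL):

```typescript
// Map of revoked jti → unix ms timestamp after which the entry no longer
// matters (the token has expired on its own by then).
const revoked = new Map<string, number>();

function revokeToken(jti: string, tokenExpMs: number): void {
  revoked.set(jti, tokenExpMs);
}

function isRevoked(jti: string, now: number = Date.now()): boolean {
  const exp = revoked.get(jti);
  if (exp === undefined) return false;
  if (exp < now) {
    revoked.delete(jti); // token expired anyway; prune the entry
    return false;
  }
  return true;
}
```

Because each entry only needs to live as long as the token's remaining lifetime, the deny-list stays small — exactly the property Redis key TTLs give for free.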

---

### Step 11 — Service: fire-and-forget operations

**File:** `src/services/OAuth2Service.ts` lines 193–207

```typescript
void this.tokenRepository.incrementMonthlyCount(clientId);
void this.auditService.logEvent(clientId, 'token.issued', 'success', ..., { scope, expiresAt });
tokensIssuedTotal.inc({ scope });
```

**Why `void` (fire-and-forget)?** The token has been signed and is ready to return.
Waiting for the Redis increment and audit write would add ~5–10ms to every token
request. These operations are best-effort — if they fail, the token is still valid.

**Why is the Prometheus `.inc()` call synchronous?** Prometheus counters are
in-process memory operations — they do not write to Redis or PostgreSQL. They are
O(1) and sub-microsecond.

---

### Step 12 — Response

**File:** `src/controllers/TokenController.ts` lines 163–167

```typescript
res.setHeader('Cache-Control', 'no-store');
res.setHeader('Pragma', 'no-cache');
res.status(200).json(tokenResponse);
```

**Why `Cache-Control: no-store`?** RFC 6749 §5.1 mandates that token responses
must not be cached. Without this header, a shared proxy or CDN could cache the
response and replay it to another client.

Final response:

```json
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "agents:read agents:write"
}
```

---

## Walkthrough 2 — Agent Registration

**Request:** `POST /api/v1/agents` with Bearer token and agent data JSON body

After token issuance, registering an agent is the second most common operation.
This walkthrough shows a request that goes through all three auth middleware layers.

---

### Step 1 — Middleware stack

**File:** `src/app.ts` lines 57–83 (same security and parsing middleware as Walkthrough 1)

---

### Step 2 — Route dispatch

**File:** `src/routes/agents.ts` lines 22–27

```typescript
router.use(asyncHandler(authMiddleware));
router.use(opaMiddleware);
router.use(asyncHandler(rateLimitMiddleware));
router.post('/', asyncHandler(agentController.registerAgent.bind(agentController)));
```

All three middleware run on every request to the agents router before the handler.

---

### Step 3 — Auth middleware: Bearer token verification

**File:** `src/middleware/auth.ts` lines 28–77

```typescript
const authHeader = req.headers['authorization'];
if (!authHeader || !authHeader.startsWith('Bearer ')) { throw new AuthenticationError(...) }

const token = authHeader.slice(7).trim();
const publicKey = process.env['JWT_PUBLIC_KEY'];
let payload: ITokenPayload;
try {
  payload = verifyToken(token, publicKey);
} catch (err) {
  if (err instanceof TokenExpiredError) { throw new AuthenticationError('Token has expired.') }
  if (err instanceof JsonWebTokenError) { throw new AuthenticationError('Token signature is invalid.') }
  throw err; // unexpected errors propagate to errorHandler
}

const redis = await getRedisClient();
const revocationKey = `revoked:${payload.jti}`;
const isRevoked = await redis.get(revocationKey);
if (isRevoked !== null) { throw new AuthenticationError('Token has been revoked.') }

req.user = payload;
next();
```

**Why check Redis after signature verification?** Signature verification is a pure
cryptographic operation (no I/O). If the token is expired or has a bad signature,
there is no need to hit Redis. The fast path exits early; Redis is the slower
secondary check.

**Why `await getRedisClient()` instead of storing the client?** `getRedisClient()`
returns the same singleton every time — the connection is created once and reused.
The `await` is fast (no I/O after the first call).

---

### Step 4 — OPA middleware: scope enforcement

**File:** `src/middleware/opa.ts` lines 230–257

```typescript
const input: OpaInput = {
  method: req.method,                  // "POST"
  path: req.baseUrl + req.path,        // "/api/v1/agents"
  scopes: req.user.scope.split(' '),   // ["agents:read", "agents:write"]
};

if (!evaluate(input)) {
  next(new AuthorizationError());
  return;
}
```

For `POST /api/v1/agents`, the policy requires `["agents:write"]`. If `agents:write`
is not in the token's scope, the request is rejected with 403 before the controller
runs.

**Why reconstruct the full path with `req.baseUrl + req.path`?** The OPA policy
uses full paths (`/api/v1/agents/:id`). Inside a nested router, `req.path` is
relative to the router's mount point (e.g. `/`). `req.baseUrl` is the mount prefix
(`/api/v1/agents`). Concatenating them gives the full path the policy expects.
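The normalisation from a concrete path to a policy pattern can be sketched like this (a hypothetical helper; the real `opa.ts` may use different placeholders per position, as Walkthrough 3's `:id`/`:credId` example suggests):

```typescript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace UUID path segments with ':id' so concrete request paths match
// the patterns the policy is written against.
function normalisePath(path: string): string {
  return path
    .split('/')
    .map((seg) => (UUID_RE.test(seg) ? ':id' : seg))
    .join('/');
}
```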

---

### Step 5 — Controller: validation

**File:** `src/controllers/AgentController.ts` lines 37–60

```typescript
registerAgent = async (req: Request, res: Response, next: NextFunction): Promise<void> => {
  if (!req.user) { throw new AuthorizationError() }

  const { error, value } = createAgentSchema.validate(req.body, { abortEarly: false });
  if (error) {
    throw new ValidationError('Request validation failed.', {
      details: error.details.map((d) => ({ field: d.path.join('.'), reason: d.message })),
    });
  }

  const data = value as ICreateAgentRequest;
  const ipAddress = req.ip ?? '0.0.0.0';
  const userAgent = req.headers['user-agent'] ?? 'unknown';

  const agent = await this.agentService.registerAgent(data, ipAddress, userAgent);
  res.status(201).json(agent);
```

**Why check `req.user` in the controller when `authMiddleware` already set it?**
TypeScript's type system marks `req.user` as `ITokenPayload | undefined`. The check
at line 39 narrows the type so subsequent code can use `req.user` without null
assertions. It is a guard, not redundant authentication.

**Why pass `ipAddress` and `userAgent` to the service?** The service logs audit events.
Audit events include the client IP and User-Agent for forensic value. These values
come from the HTTP request, which the service has no access to — so the controller
extracts them and passes them down.

---

### Step 6 — Service: free-tier limit check

**File:** `src/services/AgentService.ts` lines 59–65

```typescript
const currentCount = await this.agentRepository.countActive();
if (currentCount >= FREE_TIER_MAX_AGENTS) {
  throw new FreeTierLimitError('Free tier limit of 100 registered agents has been reached.', ...);
}
```

**Why count before checking email uniqueness?** If the limit is reached, there is
no point checking whether the email already exists. Doing the cheaper check (count)
first avoids an unnecessary query.

---

### Step 7 — Service: email uniqueness check

**File:** `src/services/AgentService.ts` lines 68–71

```typescript
const existing = await this.agentRepository.findByEmail(data.email);
if (existing !== null) { throw new AgentAlreadyExistsError(data.email) }
```

**Why not rely on the database UNIQUE constraint?** We could, but catching a
PostgreSQL `23505` error code in the repository would be less readable and would
not produce a typed `AgentAlreadyExistsError` with a structured `details` field.
The explicit check gives better error messages and keeps the repository layer clean.
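For contrast, the constraint-translation alternative would look roughly like this (a hypothetical sketch; note that the UNIQUE constraint still acts as the backstop for the race where two registrations pass the explicit check at the same time):

```typescript
class AgentAlreadyExistsError extends Error {
  constructor(public readonly email: string) {
    super(`An agent with email ${email} already exists.`);
  }
}

// node-postgres errors are Error instances carrying a SQLSTATE `code` field.
interface PgError { code?: string; constraint?: string }

// Translate PostgreSQL's unique_violation (23505) into the typed domain error.
function translatePgError(err: Error & PgError, email: string): Error {
  if (err.code === '23505') {
    return new AgentAlreadyExistsError(email);
  }
  return err;
}
```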

---

### Step 8 — Repository: INSERT

**File:** `src/repositories/AgentRepository.ts` lines 67–85

```typescript
async create(data: ICreateAgentRequest): Promise<IAgent> {
  const agentId = uuidv4();
  const result: QueryResult<AgentRow> = await this.pool.query(
    `INSERT INTO agents (agent_id, email, agent_type, version, capabilities, owner, deployment_env, status, created_at, updated_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7, 'active', NOW(), NOW())
     RETURNING *`,
    [agentId, data.email, data.agentType, data.version, data.capabilities, data.owner, data.deploymentEnv],
  );
  return mapRowToAgent(result.rows[0]);
}
```

**Why generate `agentId` in application code instead of relying on `gen_random_uuid()`?**
Because we use the UUID as the OAuth 2.0 `client_id`. We need the UUID before writing
to the database so we can use it in the audit event and the response. Having it in
application code avoids a separate SELECT after the INSERT.

**Why `RETURNING *`?** PostgreSQL's `RETURNING` clause sends back the inserted row
in the same round trip as the INSERT. This avoids a second SELECT to fetch the
newly created record.

---

### Step 9 — Service: audit event

**File:** `src/services/AgentService.ts` lines 76–83

```typescript
await this.auditService.logEvent(
  agent.agentId,
  'agent.created',
  'success',
  ipAddress,
  userAgent,
  { agentType: agent.agentType, owner: agent.owner },
);
```

**Why `await` here but `void` for token audit events?** Agent registration is a
database write operation that happens once. Adding ~5ms for the audit write is
acceptable and ensures the audit event is recorded before the 201 response is sent.
Token issuance happens far more frequently — audit is fire-and-forget there.

---

### Step 10 — Response

**File:** `src/controllers/AgentController.ts` line 56

```typescript
res.status(201).json(agent);
```

Returns the full `IAgent` object with HTTP 201 Created.

---

## Walkthrough 3 — Credential Rotation

**Request:** `POST /api/v1/agents/:agentId/credentials/:credentialId/rotate`

Credential rotation is the process of replacing an existing client secret with a
new one without changing the `credentialId`. This is the recommended security
practice — rotate periodically and rotate immediately after suspected compromise.

---

### Step 1 — Route dispatch

**File:** `src/routes/credentials.ts` line 34

```typescript
router.post('/:credentialId/rotate', asyncHandler(credentialController.rotateCredential.bind(credentialController)));
```

The credentials router is mounted at `/api/v1/agents/:agentId/credentials` in `app.ts`.
The full path becomes `POST /api/v1/agents/:agentId/credentials/:credentialId/rotate`.

---

### Step 2 — Auth middleware

Same as Walkthrough 2, Step 3. The Bearer token is verified via RS256 and the Redis revocation check.
`req.user` is populated with the JWT payload.

---

### Step 3 — OPA middleware

The path `/api/v1/agents/:agentId/credentials/:credId/rotate` is normalised to
`/api/v1/agents/:id/credentials/:credId/rotate`. The policy requires `["agents:write"]`.

---

### Step 4 — Controller: ownership check

**File:** `src/controllers/CredentialController.ts` lines 127–137

```typescript
rotateCredential = async (req: Request, res: Response, next: NextFunction): Promise<void> => {
  if (!req.user) { throw new AuthenticationError() }

  const { agentId, credentialId } = req.params;

  if (req.user.sub !== agentId) {
    throw new AuthorizationError('You do not have permission to manage credentials for this agent.');
  }
```

**Why check `req.user.sub !== agentId`?** An agent's token contains its own
`agentId` as the `sub` claim. This check enforces that an agent can only manage
its own credentials. Even if an agent has `agents:write` scope, it cannot rotate
another agent's credentials. This is Phase 1 behaviour — there is no admin scope yet.

---

### Step 5 — Controller: request validation

**File:** `src/controllers/CredentialController.ts` lines 139–157

```typescript
const { error, value } = generateCredentialSchema.validate(req.body ?? {}, { abortEarly: false });
// generateCredentialSchema validates optional `expiresAt` field
const data = value as IGenerateCredentialRequest;
const result = await this.credentialService.rotateCredential(agentId, credentialId, data, ipAddress, userAgent);
res.status(200).json(result);
```

**Why `req.body ?? {}`?** The rotation body is optional — an agent may rotate a
credential without an expiry date, in which case the body may be empty. Passing
`undefined` to Joi would cause a different error than passing `{}`.

---

### Step 6 — Service: existence checks

**File:** `src/services/CredentialService.ts` lines 163–177

```typescript
const agent = await this.agentRepository.findById(agentId);
if (!agent) { throw new AgentNotFoundError(agentId) }

const existing = await this.credentialRepository.findById(credentialId);
if (!existing || existing.clientId !== agentId) { throw new CredentialNotFoundError(credentialId) }

if (existing.status === 'revoked') {
  throw new CredentialAlreadyRevokedError(credentialId, existing.revokedAt?.toISOString() ?? ...);
}
```

**Why check `existing.clientId !== agentId`?** Even though OPA restricts the agent
to its own credentials, a malicious actor could craft a request with a valid
`agentId` in the path but a `credentialId` belonging to another agent. This check
ensures that a credential is only accessible to the agent it was created for.

---

### Step 7 — Service: generate new secret and write to Vault or bcrypt

**File:** `src/services/CredentialService.ts` lines 180–192

```typescript
const expiresAt = data.expiresAt !== undefined ? new Date(data.expiresAt) : null;
const plainSecret = generateClientSecret(); // sk_live_<64 hex chars>

let updated: ICredential | null;

if (this.vaultClient !== null) {
  // Phase 2: overwrite the existing Vault secret (KV v2 creates a new version)
  const vaultPath = await this.vaultClient.writeSecret(agentId, credentialId, plainSecret);
  updated = await this.credentialRepository.updateVaultPath(credentialId, vaultPath, expiresAt);
} else {
  // Phase 1: use bcrypt
  const newHash = await hashSecret(plainSecret);
  updated = await this.credentialRepository.updateHash(credentialId, newHash, expiresAt);
}
```
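Given the `sk_live_<64 hex chars>` format noted in the comment above, `generateClientSecret()` can be sketched with Node's CSPRNG (illustrative; the project's helper may differ):

```typescript
import { randomBytes } from 'node:crypto';

// 32 cryptographically random bytes → 64 hex characters, prefixed so
// secrets are recognisable in logs and to secret scanners.
function generateClientSecret(): string {
  return `sk_live_${randomBytes(32).toString('hex')}`;
}
```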

**Why does Vault rotation write to the same path?** Vault KV v2 is versioned — writing
to an existing path creates a new version without overwriting previous versions.
This preserves an audit trail in Vault itself.

**Why does the Vault path stay the same after rotation?** The `vault_path` column
stores the path, not the secret. The path is deterministic:
`{mount}/data/agentidp/agents/{agentId}/credentials/{credentialId}`. Since the
`credentialId` does not change on rotation, the path does not change either.
Only the Vault version at that path changes.
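The deterministic path construction is a pure function of identifiers that never change (a sketch of a hypothetical helper matching the template above):

```typescript
// Build the KV v2 data path for a credential's secret. The inputs never
// change across rotations, so rotation rewrites the same Vault location.
function buildVaultPath(mount: string, agentId: string, credentialId: string): string {
  return `${mount}/data/agentidp/agents/${agentId}/credentials/${credentialId}`;
}
```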

---

### Step 8 — Repository: UPDATE the credential

**File:** `src/repositories/CredentialRepository.ts` lines 180–218

```sql
-- Bcrypt path (updateHash):
UPDATE credentials
SET secret_hash = $1, vault_path = NULL, expires_at = $2, status = 'active', revoked_at = NULL
WHERE credential_id = $3
RETURNING *

-- Vault path (updateVaultPath):
UPDATE credentials
SET vault_path = $1, secret_hash = '', expires_at = $2, status = 'active', revoked_at = NULL
WHERE credential_id = $3
RETURNING *
```

**Why `status = 'active'` in the UPDATE?** A credential could theoretically be
in any state when rotated. The UPDATE explicitly sets it to active. This handles
edge cases where a revoked credential is being "un-revoked" by rotation (though
the service layer prevents this — revoked credentials throw `CredentialAlreadyRevokedError`).
The belt-and-suspenders approach at the SQL layer ensures data integrity.

---

### Step 9 — Service: audit event

**File:** `src/services/CredentialService.ts` lines 199–206

```typescript
await this.auditService.logEvent(
  agentId,
  'credential.rotated',
  'success',
  ipAddress,
  userAgent,
  { credentialId },
);
```

The audit event records which credential was rotated. Combined with the timestamp,
this gives a complete rotation history for each credential.

---

### Step 10 — Response

**File:** `src/controllers/CredentialController.ts` line 161

```typescript
res.status(200).json(result);
```

Returns `ICredentialWithSecret` — the updated credential including the new
`clientSecret`. This is the only time the new secret is ever returned. The caller
must store it securely.

```json
{
  "credentialId": "d4e5f6a7-...",
  "clientId": "a1b2c3d4-...",
  "status": "active",
  "clientSecret": "sk_live_4f8a2e9b...",
  "createdAt": "2026-01-15T10:00:00Z",
  "expiresAt": "2027-01-15T10:00:00Z",
  "revokedAt": null
}
```

---

## Walkthrough 4 — A2A Delegation End-to-End

**Request:** `POST /api/v1/oauth2/token/delegate` — one AI agent delegating a scoped capability to another

This walkthrough traces how agent A (an orchestrator) issues a delegation token that grants agent B (a sub-agent) the right to act on its behalf with a restricted scope.

---

### Step 1 — Route dispatch

**File:** `src/routes/delegation.ts`

```typescript
router.post(
  '/token/delegate',
  asyncHandler(authMiddleware),
  opaMiddleware,
  asyncHandler(delegationController.createDelegation.bind(delegationController))
);
```

Both `authMiddleware` and `opaMiddleware` run. The OPA policy requires scope `agents:write` for delegation creation.

---

### Step 2 — Controller: extract delegator and validate

**File:** `src/controllers/DelegationController.ts`

```typescript
const delegatorId = req.user.sub; // From the Bearer token's sub claim
const { delegatee_id, scope, expires_at } = req.body;
```

The controller validates that `delegatee_id` is a non-empty UUID, `scope` is a non-empty string, and `expires_at` (if provided) is a valid ISO 8601 datetime in the future. It passes these to `DelegationService.createDelegation()`.

---

### Step 3 — Service: verify both agents exist

**File:** `src/services/DelegationService.ts`

```typescript
const delegator = await this.agentRepository.findById(delegatorId);
if (!delegator || delegator.status !== 'active') { throw new AgentNotFoundError(delegatorId) }

const delegatee = await this.agentRepository.findById(delegateeId);
if (!delegatee || delegatee.status !== 'active') { throw new AgentNotFoundError(delegateeId) }
```

Both agents must exist and be in `active` status. A suspended or decommissioned agent cannot participate in delegation.

---

### Step 4 — Service: insert delegation chain record

**File:** `src/services/DelegationService.ts`

```typescript
await this.pool.query(
  `INSERT INTO delegation_chains (chain_id, delegator_id, delegatee_id, scope, status, expires_at)
   VALUES ($1, $2, $3, $4, 'active', $5)`,
  [chainId, delegatorId, delegateeId, scope, expiresAt]
);
```

The `chain_id` is a UUID generated by the service. The `delegation_chains` table provides the authoritative source of truth for which delegations are active, independent of any token.

---

### Step 5 — Response

```json
{
  "chain_id": "f1e2d3c4-...",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "delegator_id": "a1b2c3d4-...",
  "delegatee_id": "b2c3d4e5-...",
  "scope": "agents:read",
  "status": "active",
  "expires_at": "2026-04-05T00:00:00Z"
}
```

The `token` field is the signed delegation JWT. The delegatee presents this token to `POST /api/v1/oauth2/token/verify-delegation` to prove it has authority to act on the delegator's behalf.

**Why store both the DB record and the JWT?** The DB record allows revocation — when the delegator calls `DELETE /api/v1/delegation-chains/:chainId`, the record is soft-deleted and all subsequent `verify-delegation` calls will fail even if the JWT itself has not yet expired.

---
|
||
|
||
## Walkthrough 5 — Tier Enforcement Request Lifecycle
|
||
|
||
**Request:** Any authenticated API request when the organisation's daily call limit is reached
|
||
|
||
This walkthrough traces how `tierMiddleware` intercepts a request before it reaches the OPA middleware, preventing quota-exceeded traffic from consuming service resources.
|
||
|
||
---
|
||
|
||
### Step 1 — Auth middleware passes
|
||
|
||
Same as Walkthrough 2, Step 3. The Bearer JWT is verified and `req.user` is populated with `sub` (agentId) and `organization_id`.
|
||
|
||

---

### Step 2 — Tier middleware: fetch org tier

**File:** `src/middleware/tier.ts`

```typescript
const orgId = req.user.organization_id;
const tier = await tierService.fetchTier(orgId);
const config = TIER_CONFIG[tier];
```

`fetchTier()` issues `SELECT tier FROM organizations WHERE organization_id = $1`. It returns `'free'` if no row is found (a safe default).
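
The shape of `TIER_CONFIG` is not shown in the source; a minimal sketch consistent with this walkthrough might look like the following. Only the free-tier limit of 1,000 calls/day is confirmed by the 429 payload in Step 5; the other tier names and values are illustrative assumptions.

```typescript
// Hypothetical sketch of TIER_CONFIG. Tier names beyond 'free' and all
// non-free limits are illustrative placeholders, not confirmed values.
type Tier = 'free' | 'pro' | 'enterprise';

interface TierConfig {
  maxCallsPerDay: number; // daily call budget enforced by tierMiddleware
}

const TIER_CONFIG: Record<Tier, TierConfig> = {
  free: { maxCallsPerDay: 1000 },          // matches the 429 example in Step 5
  pro: { maxCallsPerDay: 50000 },          // illustrative
  enterprise: { maxCallsPerDay: 1000000 }, // illustrative
};
```
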

---

### Step 3 — Tier middleware: read daily counter

**File:** `src/middleware/tier.ts`

```typescript
const callsKey = `rate:tier:calls:${orgId}`;
const callsToday = await redis.get(callsKey);
const count = callsToday !== null ? parseInt(callsToday, 10) : 0;

if (count >= config.maxCallsPerDay) {
  throw new TierLimitError('calls', config.maxCallsPerDay, { orgId, tier, current: count });
}
```

The Redis key `rate:tier:calls:<orgId>` is read. If it is null (first call of the day), the count is 0. When the count equals or exceeds the tier limit, `TierLimitError` (HTTP 429) is thrown immediately; no further middleware runs.

---

### Step 4 — Tier middleware: increment counter (fire-and-forget)

**File:** `src/middleware/tier.ts`

```typescript
// Set TTL to next UTC midnight if key is new
void redis.multi()
  .incr(callsKey)
  .expireAt(callsKey, nextUtcMidnightUnix())
  .exec();
next();
```

The counter is incremented atomically using a Redis MULTI block. The `EXPIREAT` command sets the key to auto-delete at the next UTC midnight, resetting the daily counter without any scheduled job. The increment is fire-and-forget: the request proceeds immediately to `opaMiddleware`.

**Why expire at UTC midnight rather than a rolling 24-hour window?** Tier limits are documented as "per day", which users interpret as resetting at midnight. A rolling window would allow a user to consume their full daily quota twice within a 48-hour period straddling midnight, which is counterintuitive. UTC midnight is predictable and easy to reason about.
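
The source does not show `nextUtcMidnightUnix()`; a plausible implementation, assuming it returns a Unix timestamp in seconds as `EXPIREAT` expects, is:

```typescript
// Hypothetical helper: Unix timestamp (seconds) of the next UTC midnight.
// Date.UTC normalises day overflow, so Jan 31 + 1 day rolls into Feb 1.
function nextUtcMidnightUnix(now: Date = new Date()): number {
  const nextMidnightMs = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1, // midnight at the start of the next UTC day
  );
  return Math.floor(nextMidnightMs / 1000);
}
```
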

---

### Step 5 — Error handler serialises TierLimitError

**File:** `src/middleware/errorHandler.ts`

```json
HTTP 429
{
  "code": "TIER_LIMIT_EXCEEDED",
  "message": "Daily API call limit reached for your tier.",
  "details": {
    "tier": "free",
    "limit": 1000,
    "current": 1000
  }
}
```

The `Retry-After` header is set to the number of seconds until the next UTC midnight so clients can implement automatic backoff.
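
On the client side, honouring the header can be sketched with a small hypothetical helper (not part of this codebase): parse `Retry-After`, and fall back to a conservative default when it is missing or malformed.

```typescript
// Hypothetical client-side helper: how long to wait before retrying a
// request that was rejected with HTTP 429. Illustrative only.
function backoffSeconds(status: number, retryAfterHeader: string | null): number {
  if (status !== 429) return 0; // only back off on rate-limit responses
  const parsed = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
  // Fall back to 60s if the header is missing or not a non-negative number
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : 60;
}
```
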

---

## Walkthrough 6 — Analytics Event Capture Flow

**Trigger:** Any successful token issuance (`POST /api/v1/token`)

This walkthrough traces how an analytics event is captured without affecting the latency of the primary token issuance response.

---

### Step 1 — Token issuance completes

**File:** `src/services/OAuth2Service.ts`

```typescript
const accessToken = signToken(payload, this.privateKey);
// Primary response is ready — analytics is now fire-and-forget
void this.analyticsService.recordEvent(tenantId, 'token_issued');
tokensIssuedTotal.inc({ scope });
```

The `signToken()` call completes synchronously (RSA signing is CPU-bound, not I/O). The controller can now send the response. `analyticsService.recordEvent()` is called with `void`; the `await` is deliberately omitted.

**Why `void` instead of `await`?** Token issuance latency must remain below 100ms (per the QA performance gate). A PostgreSQL write adds 5–15ms. Since analytics data is aggregated (not transactional), losing an occasional event due to an error is acceptable. The response is never delayed for analytics.

---

### Step 2 — AnalyticsService: UPSERT daily counter

**File:** `src/services/AnalyticsService.ts`

```typescript
async recordEvent(tenantId: string, metricType: string): Promise<void> {
  try {
    await this.pool.query(
      `INSERT INTO analytics_events (organization_id, date, metric_type, count)
       VALUES ($1, CURRENT_DATE, $2, 1)
       ON CONFLICT (organization_id, date, metric_type)
       DO UPDATE SET count = analytics_events.count + 1`,
      [tenantId, metricType],
    );
  } catch (err) {
    console.error('[AnalyticsService] recordEvent failed — primary path unaffected', err);
  }
}
```

The `ON CONFLICT DO UPDATE` upsert is atomic. Whether this is the first or the ten-thousandth `token_issued` event for this tenant today, the row is updated correctly. All errors are caught and swallowed; the token has already been returned to the caller.

**Why one row per day per metric, not one row per event?** Storing a row per event would create millions of rows. The daily aggregate model keeps the table compact while still providing daily trend data (the granularity that analytics dashboards need). Sub-day granularity is available from the Prometheus `agentidp_tokens_issued_total` counter if needed.
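
The accumulation semantics of the upsert can be modelled in memory: one counter per `(organization, date, metric)` key, created at 1 on the first event and incremented thereafter. This is an illustrative model, not code from the service.

```typescript
// In-memory model of the ON CONFLICT DO UPDATE upsert: one counter per
// (organization, date, metric) key. Illustrative only.
const counters = new Map<string, number>();

function recordEventModel(orgId: string, date: string, metricType: string): number {
  const key = `${orgId}|${date}|${metricType}`;
  // INSERT ... VALUES (..., 1) on first sight, count + 1 on conflict
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next);
  return next;
}
```
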

---

### Step 3 — Dashboard query (deferred)

When a developer visits the analytics page in the developer portal, the portal calls:

```
GET /api/v1/analytics/token-trend?days=30
```

**File:** `src/services/AnalyticsService.ts`, `getTokenTrend(tenantId, 30)`

```sql
SELECT
  gs.date::DATE::TEXT AS date,
  COALESCE(ae.count, 0)::INTEGER AS count
FROM generate_series(
  CURRENT_DATE - 29 * INTERVAL '1 day',
  CURRENT_DATE,
  INTERVAL '1 day'
) AS gs(date)
LEFT JOIN analytics_events ae
  ON ae.date = gs.date::DATE
  AND ae.organization_id = $2
  AND ae.metric_type = 'token_issued'
ORDER BY gs.date ASC
```

The `generate_series` + `LEFT JOIN` pattern ensures all 30 days appear in the result, with `count: 0` for days with no events. This avoids the need for the client to fill in gaps.
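
The effect of that zero-filling can be illustrated in TypeScript: given the sparse daily rows the table stores, produce a dense, ordered series. This is an illustration of the result shape, not code from the service.

```typescript
// Illustration of the generate_series + LEFT JOIN effect: zero-fill a
// sparse map of daily counts into a dense, ordered series. Dates are
// ISO 'YYYY-MM-DD' strings; endDate is inclusive, like CURRENT_DATE.
function zeroFill(
  sparse: Map<string, number>,
  endDate: string,
  days: number,
): { date: string; count: number }[] {
  const end = new Date(`${endDate}T00:00:00Z`);
  const out: { date: string; count: number }[] = [];
  for (let i = days - 1; i >= 0; i--) {
    const d = new Date(end.getTime() - i * 86_400_000); // step back i days
    const iso = d.toISOString().slice(0, 10);
    out.push({ date: iso, count: sparse.get(iso) ?? 0 }); // 0 fills the gaps
  }
  return out;
}
```
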