docs(architecture): add ADR-023, resolve OQ-24 — operation error schemas
ADR-023 adds error_schemas to OperationSpec so operations can declare their domain-level failure modes (FILE_NOT_FOUND, RATE_LIMITED, etc.) distinct from protocol-level codes (NOT_FOUND, FORBIDDEN, etc.). The call.error payload gains an optional 'details' field carrying the typed error payload conforming to the declared schema. from_openapi/to_openapi map OpenAPI response status codes to/from ErrorDefinitions, making the adapter contract from ADR-017 faithful on the error axis.

Also fixes W2 (KeyVersionMismatch stale comment in encryption.md — ADR-021 implements rotation without this variant) and W4 (derive_encryption_key_for_version missing from service.md method list).

Spec updates: operation-registry.md (OperationSpec, ErrorDefinition, Handler error mapping, services/schema), call-protocol.md (call.error payload, CallError, ResponseEnvelope), README.md, overview.md, open-questions.md (OQ-24), call/README.md, encryption.md, service.md.
@@ -7,7 +7,7 @@ last_updated: 2026-06-20

## Current State

**Pre-implementation.** The project has completed a pivot from a three-layer model to an ALPN-as-service model. The greenfield workspace contains only `alknet-vault` (stable — implementation exists) and research/reference material. Foundational ADRs (001–022) are in place, including the BiStream type definition (ADR-007), vault integration (ADR-008), ALPN router/endpoint (ADR-010), AuthContext structure (ADR-011), call protocol stream model (ADR-012), Rust as canonical implementation language (ADR-013), secret material flow with capability injection (ADR-014), privilege model with authority context (ADR-015), abort cascade for nested calls (ADR-016), call protocol client and adapter contract (ADR-017), vault standalone crate (ADR-018), vault assembly-layer-only access (ADR-019), HD derivation for encryption keys (ADR-020), key rotation via version-indexed paths (ADR-021), and handler registration, provenance, and composition authority (ADR-022). The alknet-core, alknet-call, and alknet-vault crate specs are in draft.
**Pre-implementation.** The project has completed a pivot from a three-layer model to an ALPN-as-service model. The greenfield workspace contains only `alknet-vault` (stable — implementation exists) and research/reference material. Foundational ADRs (001–023) are in place, including the BiStream type definition (ADR-007), vault integration (ADR-008), ALPN router/endpoint (ADR-010), AuthContext structure (ADR-011), call protocol stream model (ADR-012), Rust as canonical implementation language (ADR-013), secret material flow with capability injection (ADR-014), privilege model with authority context (ADR-015), abort cascade for nested calls (ADR-016), call protocol client and adapter contract (ADR-017), vault standalone crate (ADR-018), vault assembly-layer-only access (ADR-019), HD derivation for encryption keys (ADR-020), key rotation via version-indexed paths (ADR-021), handler registration, provenance, and composition authority (ADR-022), and operation error schemas (ADR-023). The alknet-core, alknet-call, and alknet-vault crate specs are in draft.

**Next step**: Review the vault spec documents, then begin implementation. All open questions for the core and call crates are resolved; the vault crate has one deferred OQ (OQ-21, remote vault administration) that does not block implementation.
@@ -57,6 +57,7 @@ last_updated: 2026-06-20
| [020](decisions/020-hd-derivation-for-encryption-keys.md) | HD Derivation for Encryption Keys | Accepted |
| [021](decisions/021-key-rotation-via-version-indexed-paths.md) | Key Rotation via Version-Indexed Paths | Accepted |
| [022](decisions/022-handler-registration-provenance-and-composition-authority.md) | Handler Registration, Provenance, and Composition Authority | Proposed |
| [023](decisions/023-operation-error-schemas.md) | Operation Error Schemas | Proposed |

## Open Questions
@@ -85,6 +86,7 @@ See [open-questions.md](open-questions.md) for the full tracker.
- **OQ-20**: Encryption key derivation — HD derivation from BIP39 seed, not PBKDF2; salt field unused in v2 (wire-format compat) (ADR-020)
- **OQ-22**: Key rotation — version-indexed derivation paths; `rotate` method re-encrypts (ADR-021)
- **OQ-23**: Handler identity registration path — registration bundle with provenance, composition authority, scoped env, capabilities (ADR-022)
- **OQ-24**: Operation error schemas — declared domain errors with typed `details` payload; adapter fidelity for `from_openapi`/`to_openapi` (ADR-023)

**Deferred (not active):**
- **OQ-09**: WASM target boundaries — design constraint, not deliverable
@@ -33,6 +33,7 @@ Structured RPC over QUIC: operations, request/response, streaming subscriptions,
| [016](../../decisions/016-abort-cascade-for-nested-calls.md) | Abort Cascade for Nested Calls | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in |
| [017](../../decisions/017-call-protocol-client-and-adapter-contract.md) | Call Protocol Client and Adapter Contract | `CallClient` opens connections; `from_call` imports remote ops; connection direction independent of call direction |
| [022](../../decisions/022-handler-registration-provenance-and-composition-authority.md) | Handler Registration, Provenance, and Composition Authority | Registration bundle carries provenance, composition authority, scoped env, capabilities |
| [023](../../decisions/023-operation-error-schemas.md) | Operation Error Schemas | Operations declare domain errors; `call.error` carries typed `details`; adapter fidelity |

## Relevant Open Questions
@@ -1,6 +1,6 @@
---
status: draft
last_updated: 2026-06-21
last_updated: 2026-06-22
---

# Call Protocol
@@ -127,19 +127,28 @@ The `payload` of a `call.requested` event has this shape:

```json
{
  "code": "NOT_FOUND",
  "message": "operation not found: /fs/readFile",
  "retryable": false
  "code": "FILE_NOT_FOUND",
  "message": "file not found: /etc/nonexistent",
  "retryable": false,
  "details": { "path": "/etc/nonexistent", "errno": 2 }
}
```

Error codes use an extensible string enum. The protocol defines the following codes:
- `NOT_FOUND` — operation not in registry
Error codes use an extensible string enum. The protocol defines the following **protocol-level codes** (emitted by the dispatch machinery, not by handlers):
- `NOT_FOUND` — operation not in registry (or Internal op called from wire)
- `FORBIDDEN` — access denied (insufficient scopes or unauthenticated)
- `INVALID_INPUT` — input doesn't match the operation's JSON Schema
- `INTERNAL` — handler error
- `INTERNAL` — handler error, panic, connection failure
- `TIMEOUT` — request timed out (retryable: true)

Operations may also declare **operation-level domain codes** in their `error_schemas` (ADR-023) — e.g., `FILE_NOT_FOUND`, `RATE_LIMITED`, `INSUFFICIENT_CREDITS`. These are emitted by handlers and carry a `details` payload conforming to the declared `ErrorDefinition.schema`. Protocol-level errors omit `details` or carry protocol-specific context (e.g., the operation name for `NOT_FOUND`).

Fields:
- `code` — the error code (protocol-level or operation-level)
- `message` — human-readable error message. For logging and debugging, not for programmatic handling. Clients should switch on `code`, not parse `message`.
- `retryable` — whether the caller should retry. `true` for transient failures, `false` for permanent ones.
- `details` — optional. When the code matches a declared `ErrorDefinition`, `details` conforms to that definition's schema. This is the typed error payload — it makes errors structured instead of string-matched. See ADR-023.

New error codes may be added in future versions. Clients should treat unknown error codes as `INTERNAL` with `retryable: false`.

### Protocol Operations
@@ -304,13 +313,14 @@ pub struct ResponseEnvelope {
}

pub struct CallError {
    pub code: String,
    pub message: String,
    pub code: String, // protocol-level (NOT_FOUND, FORBIDDEN, ...) or operation-level (ADR-023)
    pub message: String, // human-readable, for logging — not for programmatic handling
    pub retryable: bool,
    pub details: Option<Value>, // typed error payload, conforms to ErrorDefinition.schema (ADR-023)
}
```

Local dispatch produces `ResponseEnvelope` with no serialization overhead. The `CallAdapter` converts `ResponseEnvelope` to `EventEnvelope` for the wire.
Local dispatch produces `ResponseEnvelope` with no serialization overhead. The `CallAdapter` converts `ResponseEnvelope` to `EventEnvelope` for the wire. When a handler returns a `CallError` whose `code` matches a declared `ErrorDefinition`, the `details` field carries the typed error payload. See ADR-023.

### Connection and Stream Lifecycle
@@ -356,6 +366,7 @@ Handlers clean up resources when their call is cancelled (in Rust, the future is
| Abort cascade for nested calls | [ADR-016](../../decisions/016-abort-cascade-for-nested-calls.md) | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in |
| Call protocol client and adapter contract | [ADR-017](../../decisions/017-call-protocol-client-and-adapter-contract.md) | `CallClient` opens connections; `from_call` imports remote ops; connection direction independent of call direction |
| Handler registration, provenance, and composition authority | [ADR-022](../../decisions/022-handler-registration-provenance-and-composition-authority.md) | Registration bundle carries provenance, composition authority, scoped env, capabilities; dispatch path reads from bundle |
| Operation error schemas | [ADR-023](../../decisions/023-operation-error-schemas.md) | Operations declare domain errors; `call.error` carries typed `details` |

## Open Questions
@@ -37,6 +37,7 @@ pub struct OperationSpec {
    pub visibility: Visibility, // External (wire-callable) or Internal (composition-only)
    pub input_schema: Value, // JSON Schema for input
    pub output_schema: Value, // JSON Schema for output
    pub error_schemas: Vec<ErrorDefinition>, // Declared domain errors (ADR-023)
    pub access_control: AccessControl,
}
@@ -50,6 +51,14 @@ pub enum Visibility {
    External, // Callable from the wire (call.requested from a client)
    Internal, // Composition-only (env.invoke from a handler)
}

/// A declared operation-level error. See ADR-023.
pub struct ErrorDefinition {
    pub code: String, // e.g., "FILE_NOT_FOUND", "RATE_LIMITED"
    pub description: String, // Human-readable description
    pub schema: Value, // JSON Schema for the error detail payload
    pub http_status: Option<u16>, // HTTP status for adapter projection (from_openapi/to_openapi)
}
```

Operation names use slash-based paths without a leading slash, aligned with URL path conventions: `fs/readFile`, `agent/chat`, `services/list`. The leading slash is added when needed for display (`spec.path()` returns `/fs/readFile`) and for wire format (the `call.requested` payload uses `/fs/readFile`). See OQ-13 for the path format decision (single-node `service/op` vs head/worker `node/service/op`).
@@ -94,6 +103,8 @@ A handler receives:

And returns a `ResponseEnvelope` containing the result or an error. `ResponseEnvelope` is defined in [call-protocol.md](call-protocol.md#responseenvelope) — it carries the request ID and a `Result<Value, CallError>`. Local dispatch produces it with no serialization overhead; the `CallAdapter` converts it to `EventEnvelope` for the wire.

When a handler returns an error, the `CallError.code` is matched against the operation's declared `error_schemas` (ADR-023). If the code matches a declared `ErrorDefinition`, the `call.error` event carries that code and the error's detail payload. If it doesn't match, the `call.error` carries `INTERNAL`. This is how handler failures become typed errors on the wire instead of string-matched messages.

### OperationContext

```rust
@@ -272,7 +283,7 @@ These are read-only — no admin operations are exposed through the call protoco
}
```

`services/schema` accepts `{ "name": "fs/readFile" }` and returns the full `OperationSpec` including input/output JSON Schemas.
`services/schema` accepts `{ "name": "fs/readFile" }` and returns the full `OperationSpec` including input/output JSON Schemas and declared `error_schemas` (ADR-023). This enables client code generation: a client reading the schema can produce typed error enums instead of generic error handling.

### irpc Integration
@@ -392,6 +403,7 @@ The `Capabilities` type holds non-serializable, zeroized secret material. It doe
| Secret material flow and capability injection | [ADR-014](../../decisions/014-secret-material-flow-and-capability-injection.md) | Capabilities carry outbound credentials; call protocol carries no secret material |
| Privilege model and authority context | [ADR-015](../../decisions/015-privilege-model-and-authority-context.md) | `internal` = authority switch not ACL skip; External/Internal visibility; composition authority + scoped env |
| Handler registration, provenance, and composition authority | [ADR-022](../../decisions/022-handler-registration-provenance-and-composition-authority.md) | Registration bundle carries provenance, composition authority, scoped env, capabilities; dispatch path reads from bundle |
| Operation error schemas | [ADR-023](../../decisions/023-operation-error-schemas.md) | Operations declare domain errors; `call.error` carries typed `details`; adapter fidelity for `from_openapi`/`to_openapi` |

## Open Questions
@@ -1,6 +1,6 @@
---
status: draft
last_updated: 2026-06-19
last_updated: 2026-06-20
---

# Encryption
@@ -194,7 +194,7 @@ pub enum EncryptionError {
    Encryption(String), // encryption failed
    Decryption(String), // decryption failed (wrong key, tampered data, bad UTF-8)
    Decoding(String), // base64 decoding failed
    KeyVersionMismatch { expected: u32, actual: u32 }, // reserved for future rotation (OQ-22)
    KeyVersionMismatch { expected: u32, actual: u32 }, // unused — see note below
}
```
@@ -202,12 +202,17 @@ Decryption failures are intentionally generic — they don't distinguish
"wrong key" from "tampered data" from "corrupted storage" to avoid
leaking information to an attacker.

`KeyVersionMismatch` is **defined but unused in v2** — neither `encrypt()`
nor `decrypt()` returns it. It is reserved for future key rotation
enforcement (OQ-22), where the vault may enforce version matching before
decrypting. In v2, the `key_version` is stamped onto `EncryptedData` and
`EncryptionKey` for forward compatibility but does not gate decryption. An
implementer should not expect this variant to fire in v2.
`KeyVersionMismatch` is **defined but unused.** ADR-021 implements key
rotation via version-indexed derivation paths — `decrypt` derives the key
at the path indicated by `encrypted.key_version`, so there is no
version-mismatch to detect at the error level (every blob carries its own
version, and every version has a derivable key). This variant predates
ADR-021's rotation mechanism and is retained in the enum for source
compatibility but is not emitted by any code path in v2. An implementer
should not wire it up or expect it to fire. If a future use case requires
enforcing version constraints (e.g., "refuse to decrypt blobs older than
v3"), this variant could be repurposed — but that would be a new decision,
not part of ADR-021's rotation scheme.

## Design Decisions
@@ -1,6 +1,6 @@
---
status: draft
last_updated: 2026-06-19
last_updated: 2026-06-20
---

# Service
@@ -126,6 +126,23 @@ Derive an AES-256-GCM encryption key at the given path. Same cache
behavior as `derive_ed25519`. Returns a `DerivedKey` with
`KeyType::Aes256Gcm`.

### derive_encryption_key_for_version(version) → EncryptionKey

```rust
pub fn derive_encryption_key_for_version(&self, version: u32) -> Result<EncryptionKey, VaultServiceError>;
```

Derive the encryption key for a specific key version. Maps the version to
its derivation path via `encryption_path_for_version(version)` (ADR-021):
v2 → `m/74'/2'/0'/0'`, v3 → `m/74'/2'/0'/1'`, etc. Cached by path. This is
the version-aware method that `decrypt` uses to select the correct key for
each blob — see [encryption.md](encryption.md) and ADR-021.

`derive_encryption_key(path)` (above) remains as the path-based API for
deriving at arbitrary paths. `derive_encryption_key_for_version(version)`
is the version-aware API used by `encrypt` and `decrypt`. The two share
the same cache (keyed by derivation path).
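
The version-to-path mapping and the shared cache described above can be sketched as follows. This is a minimal illustration, not the vault's implementation: `KeyCache`, the `derivations` counter, and the `Option` return for versions below 2 are assumptions made for the example, and the "key" is placeholder bytes rather than a real HD derivation.

```rust
use std::collections::HashMap;

/// Sketch of the version-to-path mapping from ADR-021: v2 is index 0, v3 is index 1.
/// Assumption: versions below 2 have no version-indexed path, modeled here as `None`.
pub fn encryption_path_for_version(version: u32) -> Option<String> {
    let index = version.checked_sub(2)?;
    Some(format!("m/74'/2'/0'/{}'", index))
}

/// Hypothetical stand-in for the vault's key cache, keyed by derivation path.
pub struct KeyCache {
    by_path: HashMap<String, Vec<u8>>,
    /// Counts cache misses so the sharing behavior is observable.
    pub derivations: usize,
}

impl KeyCache {
    pub fn new() -> Self {
        Self { by_path: HashMap::new(), derivations: 0 }
    }

    /// Path-based API: derive at an arbitrary path, caching by path.
    pub fn derive_encryption_key(&mut self, path: &str) -> Vec<u8> {
        if let Some(key) = self.by_path.get(path) {
            return key.clone();
        }
        self.derivations += 1;
        // Placeholder bytes only. A real vault derives key material from the seed here.
        let key = path.as_bytes().to_vec();
        self.by_path.insert(path.to_string(), key.clone());
        key
    }

    /// Version-aware API: resolve the path, then reuse the path-keyed cache.
    pub fn derive_encryption_key_for_version(&mut self, version: u32) -> Option<Vec<u8>> {
        let path = encryption_path_for_version(version)?;
        Some(self.derive_encryption_key(&path))
    }
}
```

Because both methods go through the same path-keyed map, deriving version 3 and then deriving at `m/74'/2'/0'/1'` directly triggers a single derivation in this sketch.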

### derive_ethereum_key(path) → DerivedKey (feature-gated)

```rust
@@ -173,10 +190,10 @@ pub fn decrypt(&self, encrypted: &EncryptedData) -> Result<String, VaultServiceE
```

Decrypt an `EncryptedData` blob. Derives (and caches) the encryption key
at the version-indexed path indicated by `encrypted.key_version` (ADR-021).
Each version maps to a distinct path (`m/74'/2'/0'/{version-2}'`), so old
and new keys can coexist during partial rotation. See
[encryption.md](encryption.md).
at the version-indexed path indicated by `encrypted.key_version` via
`derive_encryption_key_for_version` (ADR-021). Each version maps to a
distinct path (`m/74'/2'/0'/{version-2}'`), so old and new keys can
coexist during partial rotation. See [encryption.md](encryption.md).

### rotate(encrypted, to_version) → EncryptedData
394
docs/architecture/decisions/023-operation-error-schemas.md
Normal file
@@ -0,0 +1,394 @@
# ADR-023: Operation Error Schemas

## Status

Proposed

## Context

The `OperationSpec` in alknet-call has `input_schema` and `output_schema` but
no `error_schemas`. The `call.error` payload (call-protocol.md L128–134)
carries a `code` and `message`, where `code` is one of five infrastructure
codes: `NOT_FOUND`, `FORBIDDEN`, `INVALID_INPUT`, `INTERNAL`, `TIMEOUT`.

These five codes cover **protocol-level failures** — the call protocol
itself can always fail to find an operation, deny access, reject bad input,
time out, or hit an internal error. They are emitted by the dispatch
machinery (the registry, the adapter), not by operation handlers.

But operations also have **domain-level failures** that are not covered:

- `/fs/readFile` can fail because the file doesn't exist, the path is
  invalid, or the caller lacks OS-level read permission. These are
  operation-specific failures distinct from the protocol-level
  `INVALID_INPUT` (schema mismatch) or `FORBIDDEN` (scope mismatch).
- `/vastai/createMachine` can fail because the account has insufficient
  credits, the machine type is unavailable in the requested region, or the
  upstream API rate-limited the request.
- `/agent/chat` can fail because the LLM provider returned an error, the
  context window overflowed, or the model refused the request.

Today, these failures collapse into `INTERNAL` with a `message` string.
A client calling `/fs/readFile` has no way to know from the schema that it
might return `FILE_NOT_FOUND` vs `PERMISSION_DENIED` vs `INVALID_PATH`. The
caller has to parse `message` strings — the exact anti-pattern that typed
RPC is meant to avoid. This is a **type safety gap**: inputs and outputs are
typed, but errors are untyped strings.

### Why this matters for adapters

OpenAPI specs naturally include error information — response status codes
with schemas (e.g., `404: { schema: NotFoundError }`, `422: { schema:
ValidationError }`). MCP tool definitions carry error descriptions. The
`from_openapi` adapter (ADR-017 L113–124) imports operations and mirrors
"the remote operation's name, namespace, type, schemas, and access control"
— but with no error schema field, error responses from the OpenAPI source
are dropped on import. `to_openapi` has nowhere to project error information
to. The same gap applies to `from_mcp`/`to_mcp`.

An OpenAPI operation that declares:

```yaml
responses:
  '200': { schema: MachineList }
  '401': { schema: AuthError }
  '429': { schema: RateLimitError }
```

cannot be faithfully represented in alknet's `OperationSpec` today. The
adapter would import the `200` output schema and drop the error schemas —
a lossy import that silently discards the operation's failure contract.

### Prior art

The TypeScript reference (`/workspace/@alkdev/operations/src/types.ts`
L38–47, L94, L112) defines `ErrorDefinitionSchema` and an optional
`errorSchemas?: ErrorDefinition[]` on `OperationSpec`:

```typescript
export const ErrorDefinitionSchema = Type.Object({
  code: Type.String({ description: "Error Code e.g., INVALID_INPUT, NOT_FOUND, UNAUTHORIZED" }),
  description: Type.String(),
  schema: Type.Unknown(),
  httpStatus: Type.Optional(Type.Number()),
});
```

The `mapError()` function (`error.ts` L25–51) matches thrown errors against
the declared error schemas by code prefix — if a handler throws an error
whose message starts with a declared code, `mapError` rewrites it to a
typed `CallError` with that code. This is a proven pattern: operations
declare their error contract, the dispatch machinery maps runtime failures
to the declared codes, and clients get typed errors instead of string
parsing.

The translator agent omitted `errorSchemas` from the Rust spec, likely
because it's `Optional` in the TS schema (so dropping it doesn't break the
happy path) and because error schemas are semantically different from
input/output schemas (an operation returns one output but could return any
of several errors). That's a reasonable judgment call for a first
translation pass, but it leaves a real gap for adapters and clients.

### The general principle

This is the same principle as the Safe Exit protocol in the SDD process
(docs/sdd_process.md L19, L423): **make failure a typed, declared thing
rather than an untyped exception that crashes into whoever's listening.**
An operation that declares "I can fail with `FILE_NOT_FOUND`" is the same
shape as an agent that declares "I can fail with `TASK_AMBIGUOUS`" — both
turn an unknown unknown into a known known that the caller can handle
deliberately.

Complex systems survive not because every component is reliable, but
because failure is expected and typed. Cells have apoptosis (a declared
failure mode that protects the organism). Operations have error schemas (a
declared failure mode that lets the caller handle it). The alternative —
components that fail with untyped strings — is how you get brittle clients
that string-match error messages and break when the message wording
changes.

## Decision

### 1. `OperationSpec` gains an optional `error_schemas` field

```rust
pub struct OperationSpec {
    pub name: String,
    pub namespace: String,
    pub op_type: OperationType,
    pub visibility: Visibility,
    pub input_schema: Value,
    pub output_schema: Value,
    pub access_control: AccessControl,
    pub error_schemas: Vec<ErrorDefinition>, // NEW — empty vec = no declared errors
}

pub struct ErrorDefinition {
    /// Machine-readable error code. e.g., "FILE_NOT_FOUND", "RATE_LIMITED",
    /// "INSUFFICIENT_CREDITS". Distinct from the protocol-level codes
    /// (NOT_FOUND, FORBIDDEN, etc.) — these are operation-level domain codes.
    pub code: String,

    /// Human-readable description of when this error occurs.
    pub description: String,

    /// JSON Schema for the error detail payload. The `call.error` event's
    /// `details` field conforms to this schema when this error code is
    /// returned. `Value` (serde_json::Value) carrying a JSON Schema, same
    /// as input_schema/output_schema.
    pub schema: Value,

    /// HTTP status code for adapter projection. `from_openapi` maps OpenAPI
    /// response status codes to error definitions; `to_openapi` projects
    /// error definitions back to response status codes. Optional — not all
    /// error sources are HTTP-backed.
    pub http_status: Option<u16>,
}
```

`error_schemas` is a `Vec<ErrorDefinition>`, not `Option<Vec<...>>`. An
empty vec means "this operation declares no specific domain errors" (it may
still fail with protocol-level codes like `INTERNAL`). This avoids the
`None` vs `Some([])` ambiguity and matches the TypeScript reference's
optional-array convention.

### 2. The `call.error` payload gains an optional `details` field

```json
{
  "code": "FILE_NOT_FOUND",
  "message": "file not found: /etc/nonexistent",
  "retryable": false,
  "details": { "path": "/etc/nonexistent", "errno": 2 }
}
```

- `code` — the error code. Either a protocol-level code (`NOT_FOUND`,
  `FORBIDDEN`, `INVALID_INPUT`, `INTERNAL`, `TIMEOUT`) or an
  operation-level domain code from `error_schemas` (e.g.,
  `FILE_NOT_FOUND`, `RATE_LIMITED`).
- `message` — human-readable error message. Unstructured — for logging and
  debugging, not for programmatic handling. Clients should switch on
  `code`, not parse `message`.
- `retryable` — whether the caller should retry. `true` for transient
  failures (`TIMEOUT`, `RATE_LIMITED`), `false` for permanent ones
  (`NOT_FOUND`, `FORBIDDEN`, `FILE_NOT_FOUND`).
- `details` — optional. When the error code matches a declared
  `ErrorDefinition`, `details` conforms to that definition's `schema`. When
  the error is protocol-level (`NOT_FOUND`, `FORBIDDEN`, etc.), `details`
  is absent or carries protocol-specific context (e.g., the operation name
  for `NOT_FOUND`). This field is the typed error payload — it's what
  makes errors structured instead of string-matched.

### 3. Protocol-level vs operation-level error codes

The five existing codes are **protocol-level** — emitted by the dispatch
machinery, not by handlers:

| Code | Emitted by | Meaning |
|------|-----------|---------|
| `NOT_FOUND` | Registry | Operation not registered (or Internal op called from wire) |
| `FORBIDDEN` | Registry / ACL | Caller lacks required scopes, or unauthenticated |
| `INVALID_INPUT` | Registry | Input doesn't match `input_schema` |
| `INTERNAL` | Registry / Adapter | Handler panic, unhandled error, connection failure |
| `TIMEOUT` | Adapter | Request timed out |

Operation-level domain codes are emitted by **handlers** — the operation's
own logic determines what went wrong. They are declared in `error_schemas`
and appear in the `code` field of `call.error`. Examples: `FILE_NOT_FOUND`,
`PERMISSION_DENIED`, `RATE_LIMITED`, `INSUFFICIENT_CREDITS`,
`CONTEXT_OVERFLOW`.

The two namespaces are distinct but share the `code` field. Clients
should handle protocol-level codes uniformly (they mean the same thing
regardless of operation) and operation-level codes per-operation (they
mean what the operation's `error_schemas` says they mean). Unknown codes
— whether a future protocol code or an undeclared operation code — should
be treated as `INTERNAL` with `retryable: false` (same as the current
guidance in call-protocol.md L143).
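
The client-side guidance above can be sketched roughly as follows. The names `CodeClass`, `classify`, and `normalize` are hypothetical and exist only for this illustration; the one behavior taken from the text is that an unknown code is treated as `INTERNAL` with `retryable: false`.

```rust
/// The five protocol-level codes from the table above.
const PROTOCOL_CODES: [&str; 5] =
    ["NOT_FOUND", "FORBIDDEN", "INVALID_INPUT", "INTERNAL", "TIMEOUT"];

/// Hypothetical classification a client might apply to an incoming `code`.
#[derive(Debug, PartialEq)]
pub enum CodeClass {
    /// Means the same thing for every operation.
    Protocol,
    /// Declared in the called operation's `error_schemas`.
    Operation,
    /// Neither a known protocol code nor declared by the operation.
    Unknown,
}

/// `declared` holds the codes listed in the operation's `error_schemas`.
pub fn classify(code: &str, declared: &[&str]) -> CodeClass {
    if PROTOCOL_CODES.contains(&code) {
        CodeClass::Protocol
    } else if declared.contains(&code) {
        CodeClass::Operation
    } else {
        CodeClass::Unknown
    }
}

/// Returns the (code, retryable) pair the client should act on.
/// Unknown codes degrade to INTERNAL and are never retried.
pub fn normalize<'a>(code: &'a str, retryable: bool, declared: &[&str]) -> (&'a str, bool) {
    match classify(code, declared) {
        CodeClass::Unknown => ("INTERNAL", false),
        _ => (code, retryable),
    }
}
```

Checking protocol codes first means an operation that declares a code colliding with a protocol-level name would still be handled as protocol-level in this sketch; whether such collisions should be rejected at registration is not something the text settles.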

### 4. Handler error mapping

When a handler returns an error, the dispatch machinery maps it to a
`call.error` event. The mapping:

1. If the handler returns a structured error with a `code` that matches a
   declared `ErrorDefinition.code`, the `call.error` carries that code and
   the error's detail payload (validated against the definition's `schema`).
2. If the handler returns a structured error with a `code` that doesn't
   match any declared `ErrorDefinition`, the `call.error` carries
   `INTERNAL` with the original code in `details`. This is an undeclared
   error — the handler returned a typed error but didn't declare it.
3. If the handler returns an unstructured error (a string, a generic
   `Error`, a panic), the `call.error` carries `INTERNAL` with
   `retryable: false`. This is the current behavior for all handler
   errors.

The TypeScript `mapError()` function (error.ts L25–51) implements cases 1
and 3 by matching error messages against declared codes. The Rust
implementation can use a typed error return from the handler (`Result<Value,
CallError>` where `CallError` carries a `code`), which is cleaner than
message-string matching — the handler returns a typed error, the registry
checks whether the code is declared, and the `call.error` is constructed
accordingly.
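
The three cases above can be sketched as a pair of functions. This is an illustration under stated assumptions, not the registry's code: `Value` is a std-only stand-in for `serde_json::Value`, `ErrorDefinition` is trimmed to its `code`, the `original_code` key is a hypothetical name, schema validation of `details` is left out, and `retryable: false` for case 2 is an assumption because the text only fixes it for case 3.

```rust
use std::collections::BTreeMap;

/// Stand-in for `serde_json::Value` so the sketch needs only std.
pub type Value = BTreeMap<String, String>;

/// Trimmed to the one field the mapping needs; the ADR's struct has more.
pub struct ErrorDefinition {
    pub code: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct CallError {
    pub code: String,
    pub message: String,
    pub retryable: bool,
    pub details: Option<Value>,
}

/// Cases 1 and 2: the handler returned a structured `CallError`.
pub fn map_handler_error(declared: &[ErrorDefinition], err: CallError) -> CallError {
    if declared.iter().any(|d| d.code == err.code) {
        // Case 1: declared code passes through unchanged.
        // Validating `details` against the definition's schema is omitted here.
        return err;
    }
    // Case 2: undeclared code becomes INTERNAL, with the original code kept in details.
    let mut details = Value::new();
    details.insert("original_code".to_string(), err.code); // key name is hypothetical
    CallError {
        code: "INTERNAL".to_string(),
        message: err.message,
        retryable: false, // assumption: the text does not specify this for case 2
        details: Some(details),
    }
}

/// Case 3: an unstructured failure (a string error or a panic message).
pub fn map_unstructured_error(message: &str) -> CallError {
    CallError {
        code: "INTERNAL".to_string(),
        message: message.to_string(),
        retryable: false,
        details: None,
    }
}
```

Keeping the original code in `details` for case 2 lets operators spot undeclared errors in logs without exposing an undeclared code to clients as if it were part of the contract.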

### 5. `from_openapi` and `to_openapi` error fidelity

`from_openapi` maps OpenAPI response status codes to `ErrorDefinition`s:

```rust
// OpenAPI: 404: { schema: NotFoundError }
// → ErrorDefinition { code: "NOT_FOUND", http_status: Some(404), schema: NotFoundError }
```

The adapter maps the OpenAPI error schema to alknet's JSON Schema format
(same conversion as input/output schemas). The `http_status` field records
the original status code so `to_openapi` can project it back.

`to_openapi` projects `error_schemas` back to OpenAPI response definitions:

```yaml
responses:
  '200': { schema: <output_schema> }
  '404': { schema: <error_schemas[0].schema> } # where http_status = 404
  '429': { schema: <error_schemas[1].schema> } # where http_status = 429
```

This makes the adapter contract from ADR-017 faithful on the error axis —
no silent dropping of error contracts.
|
||||
|
||||
`from_mcp` and `to_mcp` follow the same pattern: MCP tool definitions carry
|
||||
error descriptions, and the adapters map them to/from `ErrorDefinition`s.
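The round trip described in this section can be sketched as follows. The function names, the use of a status-to-schema map as a stand-in for an OpenAPI `responses` object, and the `code_for_status` naming rule are all assumptions made for illustration; the ADR only shows the 404 case and does not spell out how other statuses are named.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ErrorDefinition {
    pub code: String,
    pub description: String,
    /// Schema name or JSON text; kept as a string in this sketch.
    pub schema: String,
    pub http_status: Option<u16>,
}

/// Hypothetical naming rule: the ADR shows 404 -> "NOT_FOUND"; the 429
/// mapping and the `HTTP_<status>` fallback are invented for this sketch.
fn code_for_status(status: u16) -> String {
    match status {
        404 => "NOT_FOUND".to_string(),
        429 => "RATE_LIMITED".to_string(),
        other => format!("HTTP_{}", other),
    }
}

/// Import direction: every 4xx/5xx response becomes an `ErrorDefinition`
/// that remembers its original status.
pub fn from_openapi_responses(responses: &BTreeMap<u16, String>) -> Vec<ErrorDefinition> {
    responses
        .iter()
        .filter(|(status, _)| **status >= 400)
        .map(|(status, schema)| ErrorDefinition {
            code: code_for_status(*status),
            description: format!("HTTP {} response", status),
            schema: schema.clone(),
            http_status: Some(*status),
        })
        .collect()
}

/// Export direction: project the output schema and each error definition
/// back into a status-keyed response map.
pub fn to_openapi_responses(
    output_schema: &str,
    errors: &[ErrorDefinition],
) -> BTreeMap<u16, String> {
    let mut responses = BTreeMap::new();
    responses.insert(200, output_schema.to_string());
    for def in errors {
        // Definitions without `http_status` (local, session, from_mcp) have
        // no HTTP projection and are skipped.
        if let Some(status) = def.http_status {
            responses.insert(status, def.schema.clone());
        }
    }
    responses
}
```

Under these assumptions, importing a response map and exporting the result yields the original map, which is the fidelity property this section is after.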

### 6. `services/schema` exposes error schemas

`services/schema` returns the full `OperationSpec` including `error_schemas`.
A client querying `/services/schema` for `/fs/readFile` gets:

```json
{
  "name": "fs/readFile",
  "namespace": "fs",
  "op_type": "query",
  "input_schema": { ... },
  "output_schema": { ... },
  "error_schemas": [
    { "code": "FILE_NOT_FOUND", "description": "The file does not exist",
      "schema": { "type": "object", "properties": { "path": { "type": "string" } } },
      "http_status": null },
    { "code": "PERMISSION_DENIED", "description": "OS-level read permission denied",
      "schema": { "type": "object", "properties": { "path": { "type": "string" }, "errno": { "type": "integer" } } },
      "http_status": null }
  ]
}
```

This enables client code generation: a TypeScript or Rust client generator
reading the schema can produce a `Result<Output, FsReadFileError>` whose
error type is a typed enum, instead of a generic `Result<Output, string>`.
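As an illustration of what such a generator might emit for the `fs/readFile` schema above, here is a hand-written sketch. `Detail`, `decode_error`, and the `Other` fallback variant are hypothetical; a real client would parse `details` with a JSON library rather than receive a pre-parsed map.

```rust
use std::collections::HashMap;

/// Stand-in for a parsed JSON value; only the two kinds this schema uses.
#[derive(Debug, Clone, PartialEq)]
pub enum Detail {
    Str(String),
    Int(i64),
}

/// One variant per declared `ErrorDefinition`, plus a catch-all.
#[derive(Debug, PartialEq)]
pub enum FsReadFileError {
    FileNotFound { path: String },
    PermissionDenied { path: String, errno: i64 },
    /// Codes this client does not know, or payloads missing expected fields.
    Other { code: String, message: String },
}

/// Turns the `code`, `message`, and parsed `details` of a `call.error`
/// into the typed enum.
pub fn decode_error(
    code: &str,
    message: &str,
    details: &HashMap<String, Detail>,
) -> FsReadFileError {
    let path = match details.get("path") {
        Some(Detail::Str(p)) => Some(p.clone()),
        _ => None,
    };
    let errno = match details.get("errno") {
        Some(Detail::Int(n)) => Some(*n),
        _ => None,
    };
    match (code, path, errno) {
        ("FILE_NOT_FOUND", Some(path), _) => FsReadFileError::FileNotFound { path },
        ("PERMISSION_DENIED", Some(path), Some(errno)) => {
            FsReadFileError::PermissionDenied { path, errno }
        }
        // The example schemas list properties without marking them required,
        // so a missing field falls back to the untyped variant.
        _ => FsReadFileError::Other {
            code: code.to_string(),
            message: message.to_string(),
        },
    }
}
```

Callers can then `match` on `FsReadFileError` variants rather than inspecting the `message` string.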

## Consequences

**Positive:**

- Operations declare their failure modes. Clients get typed errors instead
  of string-matched messages. This is the same type-safety property that
  `input_schema` and `output_schema` provide, extended to the error axis.
- `from_openapi` and `to_openapi` are faithful on the error axis. An
  OpenAPI operation's error contract is no longer silently dropped on
  import or absent on export. The adapter contract from ADR-017 is now
  complete.
- Client code generation can produce typed error enums. A client calling
  `/fs/readFile` can match on `FILE_NOT_FOUND` vs `PERMISSION_DENIED`
  instead of parsing `message` strings.
- The protocol-level vs operation-level distinction is explicit. Protocol
  codes (`NOT_FOUND`, `FORBIDDEN`, etc.) mean the same thing regardless of
  operation. Operation codes (`FILE_NOT_FOUND`, `RATE_LIMITED`) mean what
  the operation declares. No conflation.
- The `details` field carries structured error context that conforms to a
  schema: the error payload is typed, not a bare string. This enables
  programmatic error handling (retry logic, user-facing error messages,
  logging) without string parsing.
- The principle generalizes: making failure a typed, declared thing is the
  same pattern as the SDD process's Safe Exit protocol (typed agent
  failure) and the same pattern complex biological systems use (apoptosis
  as a declared cell failure mode). The more components declare their
  failure modes, the more robust the system.

**Negative:**

- `OperationSpec` gains a field. Operations that don't declare errors
  (empty `error_schemas` vec) still work, since the field is additive. But
  operations that *should* declare errors and don't will produce `INTERNAL`
  with `retryable: false`, same as today. The gap is visible but not
  enforced: an operation can ship without error schemas and clients get
  untyped errors for it. This is a documentation/guidance issue, not a
  type-system issue.
- The `call.error` payload gains a `details` field. This is a wire-format
  addition. Existing clients that only read `code` and `message` are
  unaffected (they ignore `details`). New clients can read `details` for
  structured error context. This is backward-compatible: `details` is
  optional and absent for protocol-level errors.
- Handler error mapping adds a step to the dispatch path: the registry
  checks whether the handler's error code matches a declared
  `ErrorDefinition`. This is a `HashMap` lookup by code, so the cost is
  negligible.
- The `http_status` field on `ErrorDefinition` is HTTP-specific. Operations
  that aren't HTTP-backed (local, session, from_mcp) leave it as `None`.
  This is a pragmatic choice: `from_openapi`/`to_openapi` need it, and it's
  optional for everything else. A future non-HTTP adapter that needs a
  different error projection field would add it, but `http_status` covers
  the immediate use case.
- The TypeScript `mapError()` uses message-string matching to map thrown
  errors to codes. The Rust implementation can do better (typed `CallError`
  return from handlers), but this means the `Handler` type's return is
  `Result<Value, CallError>` rather than `Result<Value, Box<dyn Error>>`.
  This is a cleaner API but a slight constraint on handler authors: they
  return typed errors, not generic ones. Mitigated: `CallError::internal()`
  is available for errors that don't fit a declared code.
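To make the last point concrete, here is a rough sketch of what a handler author might write under the typed-return constraint. The `CallError` layout, the `domain` constructor, the `internal` signature, and the use of `String` in place of `Value` are all assumptions for illustration; only the name `CallError::internal()` comes from the text above.

```rust
use std::fmt;
use std::fs;
use std::io;

#[derive(Debug, PartialEq)]
pub struct CallError {
    pub code: String,
    pub message: String,
    /// Key/value pairs standing in for the JSON `details` object.
    pub details: Vec<(String, String)>,
}

impl CallError {
    /// Hypothetical constructor for a declared domain error.
    pub fn domain(code: &str, message: &str, details: Vec<(String, String)>) -> Self {
        CallError { code: code.to_string(), message: message.to_string(), details }
    }

    /// The ADR names `CallError::internal()`; this signature is assumed.
    pub fn internal(cause: impl fmt::Display) -> Self {
        CallError {
            code: "INTERNAL".to_string(),
            message: cause.to_string(),
            details: Vec::new(),
        }
    }
}

/// A handler body for something like `fs/readFile`. Known I/O failures map
/// to the declared domain codes; everything else falls back to INTERNAL.
pub fn read_file(path: &str) -> Result<String, CallError> {
    fs::read_to_string(path).map_err(|e| {
        let path_detail = ("path".to_string(), path.to_string());
        match e.kind() {
            io::ErrorKind::NotFound => CallError::domain(
                "FILE_NOT_FOUND",
                "The file does not exist",
                vec![path_detail],
            ),
            io::ErrorKind::PermissionDenied => {
                // `raw_os_error` is None for errors not produced by the OS.
                let errno = e.raw_os_error().unwrap_or(0).to_string();
                CallError::domain(
                    "PERMISSION_DENIED",
                    "OS-level read permission denied",
                    vec![path_detail, ("errno".to_string(), errno)],
                )
            }
            _ => CallError::internal(e),
        }
    })
}
```

The fallback arm is where the constraint shows up: a handler cannot return an arbitrary error type, but converting through `CallError::internal` is a one-liner.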

## Assumptions

1. **Operations can enumerate their meaningful failure modes at
   registration time.** If an operation has failure modes that are only
   discoverable at runtime (e.g., a dynamic API that returns novel error
   codes), those would be `INTERNAL` with `details` carrying the upstream
   error. The assumption is that most operations have a knowable set of
   domain errors.

2. **Error codes are stable per operation.** Once an operation declares
   `FILE_NOT_FOUND`, clients depend on that code. Changing it (renaming to
   `NOT_FOUND_FILE`) is a breaking change for clients that match on it.
   This is the same stability property as `input_schema` and
   `output_schema`: the operation's interface is its contract. Adding new
   error codes is additive (clients that don't know the new code treat it
   as `INTERNAL`); removing or renaming codes is breaking.

3. **Protocol-level codes are distinct from operation-level codes.** If an
   operation declares a code that collides with a protocol code (e.g., an
   operation declares `NOT_FOUND` as a domain error), the protocol code
   takes precedence in the dispatch machinery (the registry's `NOT_FOUND`
   for "operation not registered" is emitted before the handler runs). The
   assumption is that operations use domain-specific codes (`FILE_NOT_FOUND`)
   rather than reusing protocol codes (`NOT_FOUND`). This is a naming
   convention, not a type-system enforcement.

4. **`details` is optional and backward-compatible.** Existing clients that
   ignore `details` continue to work. New clients read `details` for
   structured context. The wire format addition is additive.
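Assumptions 2 and 3 are conventions rather than enforced rules, but both are easy to check in code. The sketch below shows an optional registration-time lint for protocol-code collisions and a client-side fallback for unknown codes. Both function names are hypothetical, and the protocol code list is the one given in the OQ-24 resolution.

```rust
/// The protocol-level codes named in the OQ-24 resolution.
pub const PROTOCOL_CODES: [&str; 5] =
    ["NOT_FOUND", "FORBIDDEN", "INVALID_INPUT", "INTERNAL", "TIMEOUT"];

/// Registration-time lint for assumption 3: returns the declared codes
/// that reuse a protocol code. An empty result means no collisions.
pub fn colliding_codes<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared
        .iter()
        .copied()
        .filter(|code| PROTOCOL_CODES.contains(code))
        .collect()
}

/// Client-side view of assumption 2: a code the client was not generated
/// against is handled as INTERNAL.
pub fn normalize_code<'a>(known: &[&str], received: &'a str) -> &'a str {
    if known.contains(&received) {
        received
    } else {
        "INTERNAL"
    }
}
```

Whether a collision should be a warning or a registration error is left open here, since the ADR treats it as a naming convention.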

## References

- ADR-017: Call protocol client and adapter contract (adapter fidelity;
  this ADR makes `from_openapi`/`to_openapi` faithful on the error axis)
- ADR-014: Secret material flow (the `details` field must not carry secret
  material; same constraint as `metadata`)
- ADR-015: Privilege model (the `FORBIDDEN` protocol code covers ACL
  denial; operation-level `PERMISSION_DENIED` is a distinct domain error
  for OS-level permission issues)
- docs/reviews/001-pre-implementation-architecture-sanity-check.md
  (finding C5, which this ADR resolves)
- docs/sdd_process.md L19, L423 (Safe Exit protocol: the general principle
  of making failure typed and declared)
- TypeScript reference: `/workspace/@alkdev/operations/src/types.ts`
  L38–47 (`ErrorDefinitionSchema`), L94, L112 (`errorSchemas` on
  `OperationSpec`), `error.ts` L25–51 (`mapError`)
@@ -300,4 +300,13 @@ These questions are acknowledged but not active. They will be promoted to open w
- **Door type**: One-way (security model), two-way (bundle shape)
- **Priority**: high
- **Resolution**: ADR-015 said handler identity was "set at registration by the assembly layer" but the registration API (`register(spec, handler)`) had no place for it, meaning every internal call would check ACL against `None`, reproducing the escalation gap ADR-015 was written to close. ADR-022 resolves this with a registration bundle (`HandlerRegistration`) carrying `provenance`, `composition_authority` (replacing `handler_identity: Identity`; it's a declared authority bundle, not a peer identity), `scoped_env`, and `capabilities`. The dispatch path (`build_root_context` and `OperationEnv::invoke()`) reads from the bundle. Provenance determines which ops can compose: only `Local` and `Session` get composition authority; leaves (`FromOpenAPI`, `FromMCP`, `FromCall`) get `None` because they don't compose, so they don't need it. Capabilities are per-request on `OperationContext`, populated from the bundle (resolving the closure-capture vs context ambiguity). The kernel/user analogy: user's authority checked once at the External gate; handler's composition authority used for all composition inside; scoped env bounds reachability. No intersection: the user's authority does not limit internal calls. See ADR-022.
- **Cross-references**: ADR-014, ADR-015, ADR-022, docs/reviews/001-pre-implementation-architecture-sanity-check.md (C1–C4), [operation-registry.md](crates/call/operation-registry.md), [call-protocol.md](crates/call/call-protocol.md)

### OQ-24: Operation Error Schemas

- **Origin**: [operation-registry.md](crates/call/operation-registry.md), [call-protocol.md](crates/call/call-protocol.md), ADR-017
- **Status**: resolved
- **Door type**: One-way (wire format), two-way (mapping mechanism)
- **Priority**: high
- **Resolution**: `OperationSpec` gains `error_schemas: Vec<ErrorDefinition>` where each `ErrorDefinition` carries a `code`, `description`, `schema` (JSON Schema for the error detail payload), and optional `http_status` (for adapter projection). The `call.error` payload gains an optional `details` field carrying the typed error payload. Protocol-level codes (`NOT_FOUND`, `FORBIDDEN`, `INVALID_INPUT`, `INTERNAL`, `TIMEOUT`) are distinct from operation-level domain codes (`FILE_NOT_FOUND`, `RATE_LIMITED`, etc.): protocol codes are emitted by the dispatch machinery, operation codes by handlers. `from_openapi`/`to_openapi` map OpenAPI response status codes to/from `ErrorDefinition`s, making the adapter contract from ADR-017 faithful on the error axis. `services/schema` exposes `error_schemas` for client code generation. See ADR-023.
- **Cross-references**: ADR-017, ADR-023, docs/reviews/001-pre-implementation-architecture-sanity-check.md (C5), [operation-registry.md](crates/call/operation-registry.md), [call-protocol.md](crates/call/call-protocol.md)
@@ -213,6 +213,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [020](decisions/020-hd-derivation-for-encryption-keys.md) | HD Derivation for Encryption Keys | SLIP-0010 derivation from seed, not PBKDF2; salt field unused in v2 |
| [021](decisions/021-key-rotation-via-version-indexed-paths.md) | Key Rotation via Version-Indexed Paths | Version-indexed derivation paths; `rotate` re-encrypts between versions |
| [022](decisions/022-handler-registration-provenance-and-composition-authority.md) | Handler Registration, Provenance, and Composition Authority | Registration bundle carries provenance, composition authority, scoped env, capabilities; dispatch path reads from bundle |
| [023](decisions/023-operation-error-schemas.md) | Operation Error Schemas | Operations declare domain errors; `call.error` carries typed `details`; adapter fidelity for `from_openapi`/`to_openapi` |

## Open Questions