ADR-023 adds error_schemas to OperationSpec so operations can declare their domain-level failure modes (FILE_NOT_FOUND, RATE_LIMITED, etc.) distinct from protocol-level codes (NOT_FOUND, FORBIDDEN, etc.). The call.error payload gains an optional 'details' field carrying the typed error payload conforming to the declared schema. from_openapi/to_openapi map OpenAPI response status codes to/from ErrorDefinitions, making the adapter contract from ADR-017 faithful on the error axis. Also fixes W2 (KeyVersionMismatch stale comment in encryption.md — ADR-021 implements rotation without this variant) and W4 (derive_encryption_key_for_version missing from service.md method list). Spec updates: operation-registry.md (OperationSpec, ErrorDefinition, Handler error mapping, services/schema), call-protocol.md (call.error payload, CallError, ResponseEnvelope), README.md, overview.md, open-questions.md (OQ-24), call/README.md, encryption.md, service.md.
ADR-023: Operation Error Schemas
Status
Proposed
Context
The OperationSpec in alknet-call has input_schema and output_schema but
no error_schemas. The call.error payload (call-protocol.md L128–134)
carries a code and message, where code is one of five infrastructure
codes: NOT_FOUND, FORBIDDEN, INVALID_INPUT, INTERNAL, TIMEOUT.
These five codes cover protocol-level failures — the call protocol itself can always fail to find an operation, deny access, reject bad input, time out, or hit an internal error. They are emitted by the dispatch machinery (the registry, the adapter), not by operation handlers.
But operations also have domain-level failures that are not covered:
- /fs/readFile can fail because the file doesn't exist, the path is invalid, or the caller lacks OS-level read permission. These are operation-specific failures distinct from the protocol-level INVALID_INPUT (schema mismatch) or FORBIDDEN (scope mismatch).
- /vastai/createMachine can fail because the account has insufficient credits, the machine type is unavailable in the requested region, or the upstream API rate-limited the request.
- /agent/chat can fail because the LLM provider returned an error, the context window overflowed, or the model refused the request.
Today, these failures collapse into INTERNAL with a message string.
A client calling /fs/readFile has no way to know from the schema that it
might return FILE_NOT_FOUND vs PERMISSION_DENIED vs INVALID_PATH. The
caller has to parse message strings — the exact anti-pattern that typed
RPC is meant to avoid. This is a type safety gap: inputs and outputs are
typed, but errors are untyped strings.
Why this matters for adapters
OpenAPI specs naturally include error information — response status codes
with schemas (e.g., 404: { schema: NotFoundError }, 422: { schema: ValidationError }). MCP tool definitions carry error descriptions. The
from_openapi adapter (ADR-017 L113–124) imports operations and mirrors
"the remote operation's name, namespace, type, schemas, and access control"
— but with no error schema field, error responses from the OpenAPI source
are dropped on import. to_openapi has nowhere to project error information
to. The same gap applies to from_mcp/to_mcp.
An OpenAPI operation that declares:
responses:
  '200': { schema: MachineList }
  '401': { schema: AuthError }
  '429': { schema: RateLimitError }
cannot be faithfully represented in alknet's OperationSpec today. The
adapter would import the 200 output schema and drop the error schemas —
a lossy import that silently discards the operation's failure contract.
Prior art
The TypeScript reference (/workspace/@alkdev/operations/src/types.ts
L38–47, L94, L112) defines ErrorDefinitionSchema and an optional
errorSchemas?: ErrorDefinition[] on OperationSpec:
export const ErrorDefinitionSchema = Type.Object({
  code: Type.String({ description: "Error Code e.g., INVALID_INPUT, NOT_FOUND, UNAUTHORIZED" }),
  description: Type.String(),
  schema: Type.Unknown(),
  httpStatus: Type.Optional(Type.Number()),
});
The mapError() function (error.ts L25–51) matches thrown errors against
the declared error schemas by code prefix — if a handler throws an error
whose message starts with a declared code, mapError rewrites it to a
typed CallError with that code. This is a proven pattern: operations
declare their error contract, the dispatch machinery maps runtime failures
to the declared codes, and clients get typed errors instead of string
parsing.
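The prefix-matching idea can be sketched in Rust as follows. This is an illustration only, not a port of the TypeScript mapError(); the function name match_declared_code is hypothetical, and it assumes the first declared code that prefixes the message wins.

```rust
/// Hypothetical helper illustrating prefix matching; not the real mapError().
/// Returns the first declared code that the error message starts with.
pub fn match_declared_code<'a>(message: &str, declared: &[&'a str]) -> Option<&'a str> {
    declared
        .iter()
        .copied()
        // A message such as "FILE_NOT_FOUND: /etc/x" matches "FILE_NOT_FOUND".
        .find(|code| message.starts_with(*code))
}
```

Under this sketch, a message such as "NOT_FOUND_IN_CACHE: key" would also match a declared NOT_FOUND code, which shows how fragile message-based matching can be.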
The translator agent omitted errorSchemas from the Rust spec, likely
because it's Optional in the TS schema (so dropping it doesn't break the
happy path) and because error schemas are semantically different from
input/output schemas (an operation returns one output but could return any
of several errors). That's a reasonable judgment call for a first
translation pass, but it leaves a real gap for adapters and clients.
The general principle
This is the same principle as the Safe Exit protocol in the SDD process
(docs/sdd_process.md L19, L423): make failure a typed, declared thing
rather than an untyped exception that crashes into whoever's listening.
An operation that declares "I can fail with FILE_NOT_FOUND" is the same
shape as an agent that declares "I can fail with TASK_AMBIGUOUS" — both
turn an unknown unknown into a known known that the caller can handle
deliberately.
Complex systems survive not because every component is reliable, but because failure is expected and typed. Cells have apoptosis (a declared failure mode that protects the organism). Operations have error schemas (a declared failure mode that lets the caller handle it). The alternative — components that fail with untyped strings — is how you get brittle clients that string-match error messages and break when the message wording changes.
Decision
1. OperationSpec gains an optional error_schemas field
pub struct OperationSpec {
    pub name: String,
    pub namespace: String,
    pub op_type: OperationType,
    pub visibility: Visibility,
    pub input_schema: Value,
    pub output_schema: Value,
    pub access_control: AccessControl,
    pub error_schemas: Vec<ErrorDefinition>, // NEW: empty vec = no declared errors
}

pub struct ErrorDefinition {
    /// Machine-readable error code. e.g., "FILE_NOT_FOUND", "RATE_LIMITED",
    /// "INSUFFICIENT_CREDITS". Distinct from the protocol-level codes
    /// (NOT_FOUND, FORBIDDEN, etc.); these are operation-level domain codes.
    pub code: String,
    /// Human-readable description of when this error occurs.
    pub description: String,
    /// JSON Schema for the error detail payload. The `call.error` event's
    /// `details` field conforms to this schema when this error code is
    /// returned. `Value` (serde_json::Value) carrying a JSON Schema, same
    /// as input_schema/output_schema.
    pub schema: Value,
    /// HTTP status code for adapter projection. `from_openapi` maps OpenAPI
    /// response status codes to error definitions; `to_openapi` projects
    /// error definitions back to response status codes. Optional, because
    /// not all error sources are HTTP-backed.
    pub http_status: Option<u16>,
}
error_schemas is a Vec<ErrorDefinition>, not Option<Vec<...>>. An
empty vec means "this operation declares no specific domain errors" (it may
still fail with protocol-level codes like INTERNAL). This avoids the
None vs Some([]) ambiguity and matches the TypeScript reference's
optional-array convention.
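A minimal sketch of why a plain Vec is convenient for lookups. To keep it dependency-free, Value is replaced by a String stand-in (the real field is serde_json::Value), and find_declared and fs_read_file_errors are hypothetical helper names.

```rust
// Stand-in for serde_json::Value so the sketch has no external dependencies.
type Value = String;

#[derive(Debug, Clone, PartialEq)]
pub struct ErrorDefinition {
    pub code: String,
    pub description: String,
    pub schema: Value,
    pub http_status: Option<u16>,
}

/// Hypothetical lookup helper. With a plain Vec there is a single
/// "nothing declared" state (the empty vec), so no Option unwrapping is needed.
pub fn find_declared<'a>(
    error_schemas: &'a [ErrorDefinition],
    code: &str,
) -> Option<&'a ErrorDefinition> {
    error_schemas.iter().find(|def| def.code == code)
}

/// Example declaration for an operation such as /fs/readFile.
pub fn fs_read_file_errors() -> Vec<ErrorDefinition> {
    vec![ErrorDefinition {
        code: "FILE_NOT_FOUND".to_string(),
        description: "The file does not exist".to_string(),
        schema: r#"{"type":"object","properties":{"path":{"type":"string"}}}"#.to_string(),
        http_status: None,
    }]
}
```

Calling find_declared on an empty vec returns None through the same code path as an unknown code, so callers need no special case for operations that declare nothing.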
2. The call.error payload gains an optional details field
{
  "code": "FILE_NOT_FOUND",
  "message": "file not found: /etc/nonexistent",
  "retryable": false,
  "details": { "path": "/etc/nonexistent", "errno": 2 }
}
- code: the error code. Either a protocol-level code (NOT_FOUND, FORBIDDEN, INVALID_INPUT, INTERNAL, TIMEOUT) or an operation-level domain code from error_schemas (e.g., FILE_NOT_FOUND, RATE_LIMITED).
- message: human-readable error message. Unstructured; intended for logging and debugging, not for programmatic handling. Clients should switch on code, not parse message.
- retryable: whether the caller should retry. true for transient failures (TIMEOUT, RATE_LIMITED), false for permanent ones (NOT_FOUND, FORBIDDEN, FILE_NOT_FOUND).
- details: optional. When the error code matches a declared ErrorDefinition, details conforms to that definition's schema. When the error is protocol-level (NOT_FOUND, FORBIDDEN, etc.), details is absent or carries protocol-specific context (e.g., the operation name for NOT_FOUND). This field is the typed error payload; it is what makes errors structured instead of string-matched.
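As a rough illustration of how a client might consume these four fields, the sketch below branches on code and retryable and never inspects message. CallErrorPayload, NextStep and next_step are hypothetical names, and details is simplified to a flat string map instead of an arbitrary JSON value.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the `call.error` payload; `details` is a flat
/// string map here rather than an arbitrary JSON value.
#[derive(Debug, Clone)]
pub struct CallErrorPayload {
    pub code: String,
    pub message: String,
    pub retryable: bool,
    pub details: Option<BTreeMap<String, String>>,
}

/// What the caller decides to do next (hypothetical client-side type).
#[derive(Debug, PartialEq)]
pub enum NextStep {
    Retry,
    ReportMissingFile(String),
    GiveUp,
}

pub fn next_step(err: &CallErrorPayload) -> NextStep {
    // Branch on the machine-readable code; `message` is only for logs.
    if err.code == "FILE_NOT_FOUND" {
        if let Some(path) = err.details.as_ref().and_then(|d| d.get("path")) {
            return NextStep::ReportMissingFile(path.clone());
        }
    }
    // For every other code, fall back to the retryable flag.
    if err.retryable {
        NextStep::Retry
    } else {
        NextStep::GiveUp
    }
}
```

A TIMEOUT with retryable: true yields NextStep::Retry, while a FILE_NOT_FOUND with a path in details yields NextStep::ReportMissingFile with that path.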
3. Protocol-level vs operation-level error codes
The five existing codes are protocol-level — emitted by the dispatch machinery, not by handlers:
| Code | Emitted by | Meaning |
|---|---|---|
| NOT_FOUND | Registry | Operation not registered (or Internal op called from wire) |
| FORBIDDEN | Registry / ACL | Caller lacks required scopes, or unauthenticated |
| INVALID_INPUT | Registry | Input doesn't match input_schema |
| INTERNAL | Registry / Adapter | Handler panic, unhandled error, connection failure |
| TIMEOUT | Adapter | Request timed out |
Operation-level domain codes are emitted by handlers — the operation's
own logic determines what went wrong. They are declared in error_schemas
and appear in the code field of call.error. Examples: FILE_NOT_FOUND,
PERMISSION_DENIED, RATE_LIMITED, INSUFFICIENT_CREDITS,
CONTEXT_OVERFLOW.
The two namespaces are distinct but share the code field. Clients
should handle protocol-level codes uniformly (they mean the same thing
regardless of operation) and operation-level codes per-operation (they
mean what the operation's error_schemas says they mean). Unknown codes
— whether a future protocol code or an undeclared operation code — should
be treated as INTERNAL with retryable: false (same as the current
guidance in call-protocol.md L143).
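The client-side rule for unknown codes could look roughly like this. CodeKind, classify and normalize are hypothetical names; the declared slice stands for the codes taken from the operation's error_schemas.

```rust
/// The five protocol-level codes listed in the table above.
pub const PROTOCOL_CODES: [&str; 5] =
    ["NOT_FOUND", "FORBIDDEN", "INVALID_INPUT", "INTERNAL", "TIMEOUT"];

#[derive(Debug, PartialEq)]
pub enum CodeKind {
    Protocol,
    Operation,
    Unknown,
}

/// Classifies a received code against the protocol set and the codes the
/// operation declared in its error_schemas.
pub fn classify(code: &str, declared: &[&str]) -> CodeKind {
    if PROTOCOL_CODES.iter().any(|c| *c == code) {
        CodeKind::Protocol
    } else if declared.iter().any(|c| *c == code) {
        CodeKind::Operation
    } else {
        CodeKind::Unknown
    }
}

/// Unknown codes are treated as INTERNAL with retryable = false;
/// known codes pass through unchanged.
pub fn normalize(code: &str, retryable: bool, declared: &[&str]) -> (String, bool) {
    match classify(code, declared) {
        CodeKind::Unknown => ("INTERNAL".to_string(), false),
        _ => (code.to_string(), retryable),
    }
}
```

Checking the protocol set first means a protocol code keeps its uniform meaning even if an operation happens to declare the same string.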
4. Handler error mapping
When a handler returns an error, the dispatch machinery maps it to a
call.error event. The mapping:
- If the handler returns a structured error with a code that matches a declared ErrorDefinition.code, the call.error carries that code and the error's detail payload (validated against the definition's schema).
- If the handler returns a structured error with a code that doesn't match any declared ErrorDefinition, the call.error carries INTERNAL with the original code in details. This is an undeclared error: the handler returned a typed error but didn't declare it.
- If the handler returns an unstructured error (a string, a generic Error, a panic), the call.error carries INTERNAL with retryable: false. This is the current behavior for all handler errors.
The TypeScript mapError() function (error.ts L25–51) implements case 2
and 3 by matching error messages against declared codes. The Rust
implementation can use a typed error return from the handler (Result<Value, CallError> where CallError carries a code), which is cleaner than
message-string matching — the handler returns a typed error, the registry
checks whether the code is declared, and the call.error is constructed
accordingly.
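A minimal sketch of the three mapping cases, under several stated assumptions: HandlerError and map_handler_error are hypothetical names, CallError here is a simplified stand-in for the real type, details is a flat string map instead of serde_json::Value, schema validation is omitted, and the "original_code" key and retryable: false in the undeclared case are illustrative choices the text does not specify.

```rust
use std::collections::{BTreeMap, HashSet};

/// Flat string map standing in for a serde_json::Value detail payload.
pub type Details = BTreeMap<String, String>;

/// What a handler can hand back on failure (hypothetical shape).
pub enum HandlerError {
    Structured { code: String, message: String, retryable: bool, details: Option<Details> },
    Unstructured(String),
}

/// Simplified `call.error` payload.
#[derive(Debug, PartialEq)]
pub struct CallError {
    pub code: String,
    pub message: String,
    pub retryable: bool,
    pub details: Option<Details>,
}

pub fn map_handler_error(err: HandlerError, declared: &HashSet<String>) -> CallError {
    match err {
        // Declared code: pass it through with its detail payload.
        // (Validating `details` against the declared schema is omitted here.)
        HandlerError::Structured { code, message, retryable, details }
            if declared.contains(&code) =>
        {
            CallError { code, message, retryable, details }
        }
        // Undeclared code: INTERNAL, keeping the original code in details.
        // The key name and retryable = false are assumptions of this sketch.
        HandlerError::Structured { code, message, .. } => {
            let mut details = Details::new();
            details.insert("original_code".to_string(), code);
            CallError {
                code: "INTERNAL".to_string(),
                message,
                retryable: false,
                details: Some(details),
            }
        }
        // Unstructured error: INTERNAL with retryable = false.
        HandlerError::Unstructured(message) => CallError {
            code: "INTERNAL".to_string(),
            message,
            retryable: false,
            details: None,
        },
    }
}
```

The declared set would be built once per operation from its error_schemas, so the check on the dispatch path is a single hash lookup.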
5. from_openapi and to_openapi error fidelity
from_openapi maps OpenAPI response status codes to ErrorDefinitions:
// OpenAPI: 404: { schema: NotFoundError }
// → ErrorDefinition { code: "NOT_FOUND", http_status: Some(404), schema: NotFoundError }
The adapter maps the OpenAPI error schema to alknet's JSON Schema format
(same conversion as input/output schemas). The http_status field records
the original status code so to_openapi can project it back.
to_openapi projects error_schemas back to OpenAPI response definitions:
responses:
  '200': { schema: <output_schema> }
  '404': { schema: <error_schemas[0].schema> }  # where http_status = 404
  '429': { schema: <error_schemas[1].schema> }  # where http_status = 429
This makes the adapter contract from ADR-017 faithful on the error axis — no silent dropping of error contracts.
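The round trip can be sketched as two small functions. Everything here is an assumption made for illustration: responses are modeled as a map from status code to a schema name, Schema is a String stand-in for a JSON Schema value, and the status-to-code naming rule in code_for_status is hypothetical (only the 404 example appears in this ADR).

```rust
use std::collections::BTreeMap;

/// Stand-in for a JSON Schema value; here just the schema's name.
pub type Schema = String;

#[derive(Debug, Clone, PartialEq)]
pub struct ErrorDefinition {
    pub code: String,
    pub description: String,
    pub schema: Schema,
    pub http_status: Option<u16>,
}

/// Hypothetical naming rule. Only the 404 mapping comes from the text;
/// the other arms are illustrative assumptions.
fn code_for_status(status: u16) -> String {
    match status {
        401 => "UNAUTHORIZED".to_string(),
        404 => "NOT_FOUND".to_string(),
        429 => "RATE_LIMITED".to_string(),
        other => format!("HTTP_{}", other),
    }
}

/// from_openapi direction: every non-2xx response becomes an ErrorDefinition.
pub fn errors_from_responses(responses: &BTreeMap<u16, Schema>) -> Vec<ErrorDefinition> {
    responses
        .iter()
        .filter(|&(&status, _)| status < 200 || status >= 300)
        .map(|(&status, schema)| ErrorDefinition {
            code: code_for_status(status),
            description: format!("HTTP {} response", status),
            schema: schema.clone(),
            http_status: Some(status),
        })
        .collect()
}

/// to_openapi direction: definitions that carry an HTTP status are projected
/// back under that status; definitions without one are skipped.
pub fn responses_from_errors(error_schemas: &[ErrorDefinition]) -> BTreeMap<u16, Schema> {
    error_schemas
        .iter()
        .filter_map(|def| def.http_status.map(|status| (status, def.schema.clone())))
        .collect()
}
```

Because http_status is preserved on import, exporting the imported definitions reproduces the original error responses under the same status codes.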
from_mcp and to_mcp follow the same pattern: MCP tool definitions carry
error descriptions, and the adapters map them to/from ErrorDefinitions.
6. services/schema exposes error schemas
services/schema returns the full OperationSpec including error_schemas.
A client querying /services/schema for /fs/readFile gets:
{
  "name": "fs/readFile",
  "namespace": "fs",
  "op_type": "query",
  "input_schema": { ... },
  "output_schema": { ... },
  "error_schemas": [
    { "code": "FILE_NOT_FOUND", "description": "The file does not exist",
      "schema": { "type": "object", "properties": { "path": { "type": "string" } } },
      "http_status": null },
    { "code": "PERMISSION_DENIED", "description": "OS-level read permission denied",
      "schema": { "type": "object", "properties": { "path": { "type": "string" }, "errno": { "type": "integer" } } },
      "http_status": null }
  ]
}
This enables client code generation: a TypeScript or Rust client generator
reading the schema can produce a typed Result<Output, FsReadFileError>
enum instead of a generic Result<Output, string>.
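What a generator might emit for the two declared errors above is sketched below. FsReadFileError's variant layout and decode_fs_read_file_error are hypothetical, details is simplified to a flat string map, and the sketch assumes both path and errno are present even though the example schema does not mark them as required.

```rust
use std::collections::BTreeMap;

/// Hypothetical output of a client generator for /fs/readFile.
#[derive(Debug, PartialEq)]
pub enum FsReadFileError {
    FileNotFound { path: String },
    PermissionDenied { path: String, errno: i64 },
    /// Protocol-level codes, undeclared codes, or details that fail to decode.
    Other { code: String, message: String },
}

/// Decodes a `call.error` into the typed enum. `details` is a flat string
/// map here, standing in for a JSON value.
pub fn decode_fs_read_file_error(
    code: &str,
    message: &str,
    details: &BTreeMap<String, String>,
) -> FsReadFileError {
    let other = || FsReadFileError::Other {
        code: code.to_string(),
        message: message.to_string(),
    };
    match code {
        "FILE_NOT_FOUND" => match details.get("path") {
            Some(path) => FsReadFileError::FileNotFound { path: path.clone() },
            None => other(),
        },
        "PERMISSION_DENIED" => {
            let errno = details.get("errno").and_then(|e| e.parse::<i64>().ok());
            match (details.get("path"), errno) {
                (Some(path), Some(errno)) => {
                    FsReadFileError::PermissionDenied { path: path.clone(), errno }
                }
                _ => other(),
            }
        }
        _ => other(),
    }
}
```

Callers can then match exhaustively on FsReadFileError, with the Other variant absorbing protocol-level and unknown codes.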
Consequences
Positive:
- Operations declare their failure modes. Clients get typed errors instead of string-matched messages. This is the same type-safety property that input_schema and output_schema provide, extended to the error axis.
- from_openapi and to_openapi are faithful on the error axis. An OpenAPI operation's error contract is no longer silently dropped on import or absent on export. The adapter contract from ADR-017 is now complete.
- Client code generation can produce typed error enums. A client calling /fs/readFile can match on FILE_NOT_FOUND vs PERMISSION_DENIED instead of parsing message strings.
- The protocol-level vs operation-level distinction is explicit. Protocol codes (NOT_FOUND, FORBIDDEN, etc.) mean the same thing regardless of operation. Operation codes (FILE_NOT_FOUND, RATE_LIMITED) mean what the operation declares. No conflation.
- The details field carries structured error context that conforms to a schema: the error payload is typed, not a bare string. This enables programmatic error handling (retry logic, user-facing error messages, logging) without string parsing.
- The principle generalizes: making failure a typed, declared thing is the same pattern as the SDD process's Safe Exit protocol (typed agent failure) and the same pattern complex biological systems use (apoptosis as a declared cell failure mode). The more components declare their failure modes, the more robust the system.
Negative:
- OperationSpec gains a field. Operations that don't declare errors (empty error_schemas vec) still work; the field is additive. But operations that should declare errors and don't will produce INTERNAL with retryable: false, same as today. The gap is visible but not enforced: an operation can ship without error schemas and clients get untyped errors for it. This is a documentation/guidance issue, not a type-system issue.
- The call.error payload gains a details field. This is a wire-format addition. Existing clients that only read code and message are unaffected (they ignore details). New clients can read details for structured error context. This is backward-compatible: details is optional and absent for protocol-level errors.
- Handler error mapping adds a step to the dispatch path: the registry checks whether the handler's error code matches a declared ErrorDefinition. This is a HashMap lookup by code, at negligible cost.
- The http_status field on ErrorDefinition is HTTP-specific. Operations that aren't HTTP-backed (local, session, from_mcp) leave it as None. This is a pragmatic choice: from_openapi/to_openapi need it, and it's optional for everything else. A future non-HTTP adapter that needs a different error projection field would add it, but http_status covers the immediate use case.
- The TypeScript mapError() uses message-string matching to map thrown errors to codes. The Rust implementation can do better (typed CallError return from handlers), but this means the Handler type's return is Result<Value, CallError> rather than Result<Value, Box<dyn Error>>. This is a cleaner API but a slight constraint on handler authors: they return typed errors, not generic ones. Mitigated: CallError::internal() is available for errors that don't fit a declared code.
Assumptions
- Operations can enumerate their meaningful failure modes at registration time. If an operation has failure modes that are only discoverable at runtime (e.g., a dynamic API that returns novel error codes), those would be INTERNAL with details carrying the upstream error. The assumption is that most operations have a knowable set of domain errors.
- Error codes are stable per operation. Once an operation declares FILE_NOT_FOUND, clients depend on that code. Changing it (renaming to NOT_FOUND_FILE) is a breaking change for clients that match on it. This is the same stability property as input_schema and output_schema: the operation's interface is its contract. Adding new error codes is additive (clients that don't know the new code treat it as INTERNAL); removing or renaming codes is breaking.
- Protocol-level codes are distinct from operation-level codes. If an operation declares a code that collides with a protocol code (e.g., an operation declares NOT_FOUND as a domain error), the protocol code takes precedence in the dispatch machinery (the registry's NOT_FOUND for "operation not registered" is emitted before the handler runs). The assumption is that operations use domain-specific codes (FILE_NOT_FOUND) rather than reusing protocol codes (NOT_FOUND). This is a naming convention, not a type-system enforcement.
- details is optional and backward-compatible. Existing clients that ignore details continue to work. New clients read details for structured context. The wire format addition is additive.
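Since the protocol/operation separation is a naming convention, a registry could optionally surface collisions as a warning at registration time. The following is a sketch of such a check, not something this ADR mandates; protocol_code_collisions is a hypothetical name.

```rust
/// The five protocol-level codes reserved by the call protocol.
const PROTOCOL_CODES: [&str; 5] =
    ["NOT_FOUND", "FORBIDDEN", "INVALID_INPUT", "INTERNAL", "TIMEOUT"];

/// Hypothetical registration-time lint: returns the declared codes that
/// reuse a protocol-level code, in declaration order.
pub fn protocol_code_collisions<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared
        .iter()
        .copied()
        .filter(|code| PROTOCOL_CODES.iter().any(|p| *p == *code))
        .collect()
}
```

An empty result means the operation follows the convention; a non-empty one could be logged without rejecting the registration.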
References
- ADR-017: Call protocol client and adapter contract (adapter fidelity; this ADR makes from_openapi/to_openapi faithful on the error axis)
- ADR-014: Secret material flow (the details field must not carry secret material, the same constraint as metadata)
- ADR-015: Privilege model (the FORBIDDEN protocol code covers ACL denial; operation-level PERMISSION_DENIED is a distinct domain error for OS-level permission issues)
- docs/reviews/001-pre-implementation-architecture-sanity-check.md (finding C5, which this ADR resolves)
- docs/sdd_process.md L19, L423 (Safe Exit protocol: the general principle of making failure typed and declared)
- TypeScript reference: /workspace/@alkdev/operations/src/types.ts L38–47 (ErrorDefinitionSchema), L94, L112 (errorSchemas on OperationSpec), error.ts L25–51 (mapError)