glm-5.1 27ebbd491e add open questions tracker: compile all unresolved questions across architecture docs into one cross-referenced view organized by theme and priority

2026-05-20 05:27:19 +00:00

status	last_updated
draft	2026-05-20

Open Questions Tracker

Cross-cutting compilation of all unresolved questions across the flowgraph architecture documents, organized by theme. Questions that appear in multiple documents are unified here with cross-references.

How to Use This Document

Each question has an ID (e.g., OQ-01), status, origin (which doc(s)), and priority assessment
Cross-references link related questions that may conflict or answer each other
When a question is resolved, update its status to resolved and add a resolution note
Once all questions in a theme are resolved, the theme section can be removed

Theme 1: Edge Semantics and Type Compatibility

OQ-01: Should `fromSpecs()` add ALL edges or only compatible ones?

Origin: operation-graph.md Q1
Status: open
Priority: high — affects storage size, API surface, and diagnostic value
Options:
- (a) Add both compatible and incompatible edges (current design). Pro: diagnostic information visible. Con: graph is larger.
- (b) Only add compatible edges, with a potentialEdges() query computing incompatible connections on demand. Pro: smaller graph. Con: loses diagnostic information.
Notes: This decision affects buildTypeEdges() in analysis.md and OperationEdgeAttrs in schema.md. The compatible: false attribute on edges only makes sense if option (a) is chosen.
Cross-references: OQ-04
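
The storage trade-off between options (a) and (b) can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the flowgraph API: `OpSpec`, `TypedEdge`, and the name-equality `isCompatible` check are stand-ins, and self-edges are skipped for simplicity.

```typescript
// Hypothetical, simplified stand-ins for the real spec and edge types.
interface OpSpec { id: string; inputType: string; outputType: string }
interface TypedEdge { source: string; target: string; compatible: boolean }

// Stand-in compatibility check: the output type name must equal the input type name.
const isCompatible = (from: OpSpec, to: OpSpec): boolean =>
  from.outputType === to.inputType;

// Option (a): store every ordered pair, flagged with `compatible`.
function allEdges(specs: OpSpec[]): TypedEdge[] {
  const edges: TypedEdge[] = [];
  for (const from of specs) {
    for (const to of specs) {
      if (from.id === to.id) continue; // assumption: no self-edges
      edges.push({ source: from.id, target: to.id, compatible: isCompatible(from, to) });
    }
  }
  return edges;
}

// Option (b): store only the compatible edges...
function compatibleEdges(specs: OpSpec[]): TypedEdge[] {
  return allEdges(specs).filter((e) => e.compatible);
}

// ...and compute the incompatible ones on demand.
function potentialEdges(specs: OpSpec[]): TypedEdge[] {
  return allEdges(specs).filter((e) => !e.compatible);
}

const sampleSpecs: OpSpec[] = [
  { id: "task.fetch", inputType: "Url", outputType: "Document" },
  { id: "task.classify", inputType: "Document", outputType: "Label" },
  { id: "task.store", inputType: "Label", outputType: "Receipt" },
];
```

With these three sample operations, option (a) stores six edges while option (b) stores two and recomputes the other four when asked, which is the size-versus-diagnostics trade-off described above.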

OQ-02: How granular should type compatibility results be?

Origin: operation-graph.md Q4, analysis.md Q1
Status: open
Priority: high — directly shapes the typeCompat() return type and OperationEdgeAttrs
Question (merged): How deep should typeCompat check? Should it be fully recursive? And should the result be { compatible, detail? } or { compatible, mismatches: TypeMismatch[] }?
Current design: The schema already defines TypeMismatch with { path, expected, actual } and OperationEdgeAttrs has an optional mismatches field. The analysis doc describes deep recursive structural comparison. But there's a tension: full recursive checking is more thorough but may produce false negatives for schemas with dynamic structures.
Notes: The schema doc already has mismatches?: TypeMismatch[] in OperationEdgeAttrs. The analysis doc already defines TypeCompatResult with mismatches. This suggests the design has already converged toward structured mismatch reporting. What remains is confirming: (a) recursive depth limits, (b) handling of Type.Unknown() and complex types (unions, intersections), (c) whether the detail string field is still needed alongside mismatches.
Cross-references: OQ-01 (incompatible edges need mismatch detail)
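
To illustrate what structured mismatch reporting with a depth limit might look like, here is a minimal sketch. It assumes a toy `Schema` model in place of the real schema library, assumes the `TypeMismatch` fields are strings, and treats `unknown` as always compatible; the `maxDepth` parameter and its default are hypothetical.

```typescript
// Simplified schema model (assumption): the real design uses a schema library.
type Schema =
  | { kind: "string" | "number" | "boolean" | "unknown" }
  | { kind: "object"; properties: Record<string, Schema> };

interface TypeMismatch { path: string; expected: string; actual: string }
interface TypeCompatResult { compatible: boolean; mismatches: TypeMismatch[] }

function typeCompat(expected: Schema, actual: Schema, maxDepth = 8): TypeCompatResult {
  const mismatches: TypeMismatch[] = [];

  const visit = (exp: Schema, act: Schema, path: string, depth: number): void => {
    // Assumption: `unknown` on either side is accepted without inspection.
    if (exp.kind === "unknown" || act.kind === "unknown") return;
    if (exp.kind !== act.kind) {
      mismatches.push({ path, expected: exp.kind, actual: act.kind });
      return;
    }
    if (exp.kind === "object" && act.kind === "object") {
      // Assumption: beyond maxDepth, nested objects are treated as compatible.
      if (depth >= maxDepth) return;
      for (const [key, expProp] of Object.entries(exp.properties)) {
        const actProp = act.properties[key];
        const childPath = path === "" ? key : `${path}.${key}`;
        if (actProp === undefined) {
          mismatches.push({ path: childPath, expected: expProp.kind, actual: "missing" });
        } else {
          visit(expProp, actProp, childPath, depth + 1);
        }
      }
    }
  };

  visit(expected, actual, "", 0);
  return { compatible: mismatches.length === 0, mismatches };
}

const expectedSchema: Schema = {
  kind: "object",
  properties: {
    user: { kind: "object", properties: { id: { kind: "number" }, name: { kind: "string" } } },
  },
};
const actualSchema: Schema = {
  kind: "object",
  properties: { user: { kind: "object", properties: { id: { kind: "string" } } } },
};
const compatResult = typeCompat(expectedSchema, actualSchema);
```

In this sample, the full check reports two mismatches (`user.id` and `user.name`), while `typeCompat(expectedSchema, actualSchema, 1)` stops before the nested object and reports the pair as compatible. That difference is why the depth limit in point (a) needs an explicit decision.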

OQ-03: Should subscription operations be treated differently in type compatibility?

Origin: operation-graph.md Q3
Status: open
Priority: medium — affects operation graph edge semantics for streaming operations
Question: A subscription produces a stream, not a single output. Its outputSchema describes a single stream element, but the data flow semantics are different. Should the type compatibility check for subscriptions account for this?
Notes: This has downstream implications for call-graph population (subscriptions produce multiple call.responded events) and template authoring (a subscription feeding into a mutation has different semantics than a query feeding into a mutation). May want to defer to v2 but should at least document the current behavior (subscriptions are treated the same as queries/mutations).

OQ-04: Edge type consistency — should `edgeType` be required on ALL edges?

Origin: schema.md Q1
Status: open
Priority: medium — affects serialization format and edge handling across all graph types
Options:
- (a) edgeType required on all edges. Pro: consistent, self-describing. Con: operation graph edges are always typed, making the field redundant there.
- (b) Separate edge attribute types per graph mode (current implicit design — CallEdgeAttrs is a union, OperationEdgeAttrs doesn't include edge type). Con: graphology edges must carry attributes from a single schema.
- (c) Union type on edge attributes, letting the consumer tag the edge. Pro: flexible. Con: runtime discrimination burden.
Notes: The current schema already stores edgeType alongside the edge-specific attributes in graphology (see schema.md's "Edge type storage" section), which is effectively option (a) at the storage level. The question is really about the TypeScript type API: should OperationEdgeAttrs include edgeType: "typed" or should that be a separate concern?
Cross-references: OQ-01 (if incompatible edges exist, they need tagging)
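
At the TypeScript level, putting `edgeType` on every attribute type would turn the edge attributes into a discriminated union. The sketch below is only an illustration: the three interfaces are hypothetical simplifications, not the definitions in schema.md.

```typescript
// Hypothetical attribute shapes; only `edgeType` and `compatible` come from the text above.
interface TypedEdgeAttrs { edgeType: "typed"; compatible: boolean }
interface SequentialEdgeAttrs { edgeType: "sequential" }
interface ConditionalEdgeAttrs { edgeType: "conditional"; condition: unknown }

type AnyEdgeAttrs = TypedEdgeAttrs | SequentialEdgeAttrs | ConditionalEdgeAttrs;

// The compiler narrows `attrs` in each branch, so `compatible` is only
// accessible where the edge is known to be a typed edge.
function describeEdge(attrs: AnyEdgeAttrs): string {
  switch (attrs.edgeType) {
    case "typed":
      return attrs.compatible ? "typed (compatible)" : "typed (incompatible)";
    case "sequential":
      return "sequential";
    case "conditional":
      return "conditional";
  }
}
```

The cost noted for option (a) is visible here: an operation graph would carry `edgeType: "typed"` on every edge even though no other value can occur.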

Theme 2: Structural Container Transparency

OQ-05: Should `Sequential` and `Parallel` be transparent in the graph?

Origin: workflow-templates.md Q1, host-configs.md Q1
Status: open
Priority: high — fundamental to how the DAG is structured and how the reactive engine computes preconditions
Question (merged): Currently, structural containers (Sequential, Parallel, Conditional) produce edges but no nodes. The reactive engine then has to reconstruct structural context to compute preconditions. Should they create "virtual" nodes instead?
Options:
- (a) Transparent (current design): No nodes for containers. Edges carry the structure. Pro: smaller DAG, cleaner topology. Con: precondition computation needs structural context (parentStack, siblingMap).
- (b) Virtual nodes: Containers create nodes with signal<NodeStatus>. Pro: every node has a status and preconditions, simpler reactive engine. Con: more nodes, containers with no call protocol equivalent, slightly more complex graph queries.
Notes: The host-configs doc identifies this as a "known gap": Sequential, Parallel, Conditional are transparent in the DAG but create complexity for the reactive engine's "previous sibling" precondition logic. The reactive-execution doc's WorkflowReactiveRoot.initializeSignals() assumes it operates on the flattened DAG (all nodes are operations), which aligns with option (a). The question is whether the reactive engine's context maps (parentMap, siblingMap) are sufficient or if virtual nodes would simplify things.
Cross-references: OQ-25 (partial re-rendering)
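
A rough sketch of option (a) shows how containers can disappear from the DAG while still determining its edges. The `TemplateNode` tree and the `flatten` function are hypothetical simplifications (no Conditional, no props), not the host config's actual rendering logic.

```typescript
// Hypothetical template tree: operations are leaves, containers only group children.
type TemplateNode =
  | { type: "operation"; id: string }
  | { type: "sequential" | "parallel"; children: TemplateNode[] };

interface Flattened {
  nodes: string[];
  edges: Array<[string, string]>;
  entries: string[]; // operations with no predecessor inside this subtree
  exits: string[];   // operations with no successor inside this subtree
}

function flatten(node: TemplateNode): Flattened {
  if (node.type === "operation") {
    return { nodes: [node.id], edges: [], entries: [node.id], exits: [node.id] };
  }
  const parts = node.children.map((child) => flatten(child));
  const nodes: string[] = [];
  const edges: Array<[string, string]> = [];
  for (const part of parts) {
    nodes.push(...part.nodes);
    edges.push(...part.edges);
  }
  if (node.type === "parallel") {
    // Parallel adds no edges: every child can start and finish independently.
    const entries: string[] = [];
    const exits: string[] = [];
    for (const part of parts) {
      entries.push(...part.entries);
      exits.push(...part.exits);
    }
    return { nodes, edges, entries, exits };
  }
  // Sequential: connect every exit of one child to every entry of the next.
  for (let i = 0; i + 1 < parts.length; i++) {
    for (const from of parts[i].exits) {
      for (const to of parts[i + 1].entries) edges.push([from, to]);
    }
  }
  const first = parts[0];
  const last = parts[parts.length - 1];
  return { nodes, edges, entries: first ? first.entries : [], exits: last ? last.exits : [] };
}

const sampleTemplate: TemplateNode = {
  type: "sequential",
  children: [
    { type: "operation", id: "A" },
    { type: "parallel", children: [{ type: "operation", id: "B" }, { type: "operation", id: "C" }] },
    { type: "operation", id: "D" },
  ],
};
const flatDag = flatten(sampleTemplate);
```

The resulting DAG has four operation nodes and the edges A→B, A→C, B→D, C→D, with no node for either container. The structural context that the reactive engine would have to reconstruct (which container an operation came from) is exactly what this flattening discards.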

Theme 3: Call Protocol Integration

OQ-06: How does template instantiation interact with the call protocol?

Origin: workflow-templates.md Q4, host-configs.md Q3
Status: open
Priority: high — this is a fundamental integration point between flowgraph and the call protocol
Question (merged): When a template is instantiated as a call graph, each <Operation> becomes a call. But the call protocol's call.requested events include parentRequestId — who is the parent? Is it the template instance? The hub coordinator? And how does the ReactiveHostConfig bridge to registry.execute() or PendingRequestMap.call()?
Notes: The consumer-integration doc shows the coordinator calling registry.execute() inside an effect(), but doesn't specify the parentRequestId semantics. This is a consumer-side decision, but flowgraph needs to document: (a) whether the template has its own requestId, (b) how the reactive engine signals the coordinator to start a call, (c) whether ReactiveHostConfig has a callback prop for this.
Cross-references: OQ-07, OQ-08

OQ-07: Should the reactive engine own the call graph?

Origin: host-configs.md Q4
Status: open
Priority: high — affects the separation between flowgraph and the call protocol
Question: Currently the call graph (from call-graph.md) and the reactive engine (from reactive-execution.md) are separate concepts. But at runtime, every <Operation> in a template becomes a call graph node. Should the reactive engine populate the call graph as a side effect?
Options:
- (a) Separate: Call graph is populated by call protocol events. Reactive engine uses signals only. Coordinator bridges them.
- (b) Unified: Reactive engine creates call graph nodes when nodes transition to running, updates them on completion. Call graph is derived from reactive state.
Notes: Option (a) matches ADR-003 (flowgraph doesn't do storage/persistence) and the current design where the call graph is populated by updateFromEvent(). Option (b) would couple the reactive engine to the call protocol. The current design's separation is cleaner but requires the coordinator to maintain both reactive state and call graph state.

OQ-08: Should `depends_on` edges be auto-populated from workflow templates?

Origin: call-graph.md Q2
Status: open
Priority: medium — affects how the call graph and template system relate
Question: When a call graph is instantiated from a workflow template, the template's sequential/parallel structure implies data dependencies. Should the template instantiation automatically create depends_on edges in the call graph?
Notes: Currently depends_on edges must be added explicitly. Auto-population would couple the call graph to the template system. The alternative is for the coordinator to add depends_on edges when it instantiates a template.
Cross-references: OQ-06, workflow-templates Q3 (explicit depends_on in templates)

Theme 4: Failure and Retry Semantics

OQ-09: How are retries handled at the signal level?

Origin: reactive-execution.md Q2
Status: open
Priority: high — affects the core status state machine
Question: If an operation fails and should be retried, the status would need to go running → failed → ready → running. But the current state machine marks failed as terminal with no exit transitions. How should this work?
Options:
- (a) A retried status that allows re-entering ready. Con: adds another state to NodeStatus.
- (b) A separate retryCount attribute. A node can reset its status from failed to ready if retryCount < maxRetries. Con: breaks the terminal-state invariant.
- (c) Retry creates a new node (new requestId). The old node stays failed. Con: increases graph size but preserves state machine integrity.
Notes: Option (c) aligns with the call protocol, where each retry is a new call with a new requestId. This is likely the right answer but needs confirmation.
Cross-references: OQ-10
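
Option (c) can be sketched as follows. The `NodeStatus` members, the `CallNode` shape, the `retryOf` back-reference, and the choice to start the new node in `ready` are all assumptions made for illustration; only the idea that a retry is a new node with a new requestId comes from the text above.

```typescript
// Assumed status set; the real NodeStatus may differ.
type NodeStatus =
  | "idle" | "waiting" | "ready" | "running"
  | "completed" | "failed" | "aborted" | "skipped";

interface CallNode {
  requestId: string;
  operationId: string;
  status: NodeStatus;
  retryOf?: string; // hypothetical link back to the failed attempt
}

const TERMINAL: ReadonlySet<NodeStatus> = new Set<NodeStatus>([
  "completed", "failed", "aborted", "skipped",
]);

// Terminal states have no exit transitions, so the invariant stays intact.
function setStatus(node: CallNode, next: NodeStatus): void {
  if (TERMINAL.has(node.status)) {
    throw new Error(`${node.requestId} is terminal (${node.status})`);
  }
  node.status = next;
}

// A retry never mutates the failed node; it adds a fresh one.
function retryAsNewNode(
  nodesById: Map<string, CallNode>,
  failedId: string,
  newRequestId: string,
): CallNode {
  const old = nodesById.get(failedId);
  if (old === undefined || old.status !== "failed") {
    throw new Error(`cannot retry ${failedId}: not a failed node`);
  }
  const fresh: CallNode = {
    requestId: newRequestId,
    operationId: old.operationId,
    status: "ready",
    retryOf: failedId,
  };
  nodesById.set(newRequestId, fresh);
  return fresh;
}

const callNodes = new Map<string, CallNode>([
  ["req-1", { requestId: "req-1", operationId: "task.classify", status: "failed" }],
]);
const secondAttempt = retryAsNewNode(callNodes, "req-1", "req-2");
```

After the retry the graph holds two nodes: `req-1` remains `failed` and `req-2` carries the new attempt, which is the graph-size cost the option names.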

OQ-10: What happens to running nodes when a predecessor fails?

Origin: reactive-execution.md Q6
Status: open
Priority: high — affects failure propagation correctness
Question: The current spec transitions idle and waiting nodes to aborted when blockedByFailure becomes true. But what about a node that's already running? Should it be cancelled?
Options:
- (a) Running nodes are NOT affected. A predecessor's failure blocks dependents that haven't started, but running nodes continue. The coordinator can cancel them via prm.abort() if desired.
- (b) Running nodes automatically transition to aborted. This requires the effect() to check for running nodes.
Notes: Option (a) is consistent with "failure follows dependency edges, not structural scope" — a running node has already passed its preconditions, so it should be allowed to complete. The coordinator can choose to abort it. Option (b) would be more aggressive. The reactive-execution doc's constraint says "abort is immediate in signals, delayed in protocol," suggesting option (a) is intended.
Cross-references: OQ-09 (retries need to know if a running node can be restarted)
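
The behaviour of option (a) can be sketched as a traversal over dependency edges. This is a plain-function illustration under assumed names (`propagateFailure`, the `dependents` map, the `NodeStatus` members); the real engine would express the same rule through signals and effects.

```typescript
// Assumed status set; the real NodeStatus may differ.
type NodeStatus =
  | "idle" | "waiting" | "ready" | "running"
  | "completed" | "failed" | "aborted" | "skipped";

// Aborts not-yet-started dependents of a failed node and returns their ids.
function propagateFailure(
  statuses: Map<string, NodeStatus>,
  dependents: Map<string, string[]>,
  failedId: string,
): string[] {
  const aborted: string[] = [];
  const queue: string[] = [failedId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of dependents.get(current) ?? []) {
      const nextStatus = statuses.get(next);
      if (nextStatus === "idle" || nextStatus === "waiting") {
        statuses.set(next, "aborted");
        aborted.push(next);
        queue.push(next); // an aborted node blocks its own dependents in turn
      }
      // Running and terminal nodes are left untouched (option a).
    }
  }
  return aborted;
}

const nodeStatuses = new Map<string, NodeStatus>([
  ["A", "failed"], ["B", "running"], ["C", "waiting"], ["D", "idle"], ["E", "idle"],
]);
const dependentsOf = new Map<string, string[]>([
  ["A", ["B", "C"]], ["C", ["D"]], ["B", ["E"]],
]);
const abortedIds = propagateFailure(nodeStatuses, dependentsOf, "A");
```

Here C and D are aborted, B keeps running, and E stays idle because its only predecessor has not failed. Under option (b), the branch that skips running nodes would instead abort B as well.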

Theme 5: Preconditions and Scheduling

OQ-11: Should preconditions support OR logic?

Origin: reactive-execution.md Q1
Status: open
Priority: medium — affects the precondition computation model
Question: Currently all predecessors must complete (AND logic). An anyOf predicate would allow "start this node as soon as any predecessor completes."
Notes: OR preconditions would require either: (a) an edge attribute indicating allOf vs anyOf, (b) a node-level configuration, or (c) a separate anyOfPredecessors computed per node. This is a semantic change that affects both the DAG structure and the reactive engine. Might be a v2 feature.
Cross-references: OQ-12
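
The difference between AND and OR preconditions is small in code, whichever of the three placements is chosen. The sketch below assumes a node-level `mode` parameter (placement (b)) and an assumed `NodeStatus` set; it is not the engine's actual precondition computation.

```typescript
// Assumed status set; the real NodeStatus may differ.
type NodeStatus =
  | "idle" | "waiting" | "ready" | "running"
  | "completed" | "failed" | "aborted" | "skipped";

type PreconditionMode = "allOf" | "anyOf";

// A predecessor counts as done when it completed or was skipped.
const isDone = (s: NodeStatus): boolean => s === "completed" || s === "skipped";

function preconditionsMet(
  predecessorStatuses: NodeStatus[],
  mode: PreconditionMode = "allOf",
): boolean {
  // Assumption: a node without predecessors is always startable.
  if (predecessorStatuses.length === 0) return true;
  return mode === "allOf"
    ? predecessorStatuses.every(isDone)
    : predecessorStatuses.some(isDone);
}
```

For predecessors in states `["completed", "running"]`, `allOf` keeps the node waiting while `anyOf` lets it start. What `anyOf` should do when the remaining predecessors later fail is a separate question that this sketch does not answer.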

OQ-12: How does `maxConcurrency` interact with preconditions?

Origin: reactive-execution.md Q4
Status: open
Priority: medium
Question: A Parallel group with maxConcurrency: 3 should run at most 3 nodes at a time. How is that limit enforced alongside preconditions?
Notes: maxConcurrency is a scheduling concern, not a structural one. The DAG doesn't encode it. Options: (a) a semaphore signal in the reactive layer, (b) coordinator-enforced throttling, (c) a maxConcurrency prop on Parallel that the reactive engine respects. The <Parallel> component already has maxConcurrency as an optional prop in its definition (workflow-templates.md).
Cross-references: OQ-11, workflow-templates Parallel component
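
Whichever layer ends up owning the limit, the core rule is the same: among nodes whose preconditions are met, start only as many as there are free slots. A minimal sketch, with `selectToStart` as a hypothetical helper name:

```typescript
// Picks which ready nodes may start now, given how many are already running.
// `maxConcurrency` undefined means "no limit", matching an optional prop.
function selectToStart(
  readyIds: string[],
  runningCount: number,
  maxConcurrency?: number,
): string[] {
  if (maxConcurrency === undefined) return [...readyIds];
  const freeSlots = Math.max(0, maxConcurrency - runningCount);
  return readyIds.slice(0, freeSlots);
}
```

With five ready nodes, one already running, and a limit of 3, this returns the first two. The nodes left over remain ready rather than waiting, so the scheduler has to re-run the selection whenever a running node finishes.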

OQ-13: Should `blockedByFailure` be a separate `computed` or derived from `preconditions`?

Origin: reactive-execution.md Q5
Status: open
Priority: low — implementation detail, can be decided during implementation
Question: Currently there are two separate computed values — preconditions (all predecessors completed/skipped) and blockedByFailure (any predecessor failed/aborted). An alternative is a single computed<NodeReadiness> returning "ready" | "blocked" | "failed".
Notes: Two separate computed values are more composable (you can check preconditions independently of failure status) but require two effects per node. A single computed is simpler (one effect) but its parts cannot be queried independently. This is largely an implementation choice that doesn't affect the public API, so it can be deferred to implementation.
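
As a sketch of the single-value alternative, the function below derives a `NodeReadiness` from predecessor statuses. It is written as a plain function so it stays independent of any signal library; in the engine it would be the body of a computed. The `NodeStatus` members and the rule that failure takes precedence are assumptions.

```typescript
// Assumed status set; the real NodeStatus may differ.
type NodeStatus =
  | "idle" | "waiting" | "ready" | "running"
  | "completed" | "failed" | "aborted" | "skipped";

type NodeReadiness = "ready" | "blocked" | "failed";

function readinessOf(predecessorStatuses: NodeStatus[]): NodeReadiness {
  // Assumption: failure wins over everything else.
  if (predecessorStatuses.some((s) => s === "failed" || s === "aborted")) return "failed";
  if (predecessorStatuses.every((s) => s === "completed" || s === "skipped")) return "ready";
  return "blocked";
}

// The two existing values can still be recovered from the single one.
const preconditionsFrom = (r: NodeReadiness): boolean => r === "ready";
const blockedByFailureFrom = (r: NodeReadiness): boolean => r === "failed";
```

Because both booleans are recoverable from the single value, the choice mainly changes how many effects subscribe per node, not what information is available.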

Theme 6: Graph Construction and API Surface

OQ-14: Should the call graph support unknown `operationId`?

Origin: call-graph.md Q1
Status: open (with a proposed answer)
Priority: medium — affects fromCallEvents() and updateFromEvent() behavior
Proposed answer: Yes. The call graph records what happened, not what should have happened. Nodes with unknown operationId get status: "pending" and may later transition to "failed" with an OPERATION_NOT_FOUND error code.
Notes: The doc already has a proposed answer. This just needs confirmation and the behavior documented in the fromCallEvents() spec.

OQ-15: Should the call graph support multiple graphs simultaneously?

Origin: call-graph.md Q3
Status: open
Priority: low — can be deferred to v2
Question: Currently one FlowGraph instance = one call graph. If the hub needs to track multiple concurrent workflows, it uses multiple instances. An alternative is a single graph with workflow-scoped subgraphs.
Notes: The current design (multiple instances) is simpler and matches graphology's model. Subgraphs would require a scoping mechanism. This can be deferred unless early usage shows it's needed.

OQ-16: Should `filterByStatus` use an index?

Origin: call-graph.md Q4
Status: open
Priority: low — premature optimization for small graphs
Notes: Call graphs at hub level are typically tens of nodes. O(n) filter is fast enough. An index can be added later if performance becomes an issue. Can be deferred.

OQ-17: Should `FlowGraph` expose graphology's traversal methods directly?

Origin: flowgraph-api.md Q1
Status: open
Priority: medium — affects the public API surface
Question: The current plan is to offer convenience methods that delegate to graphology. But some consumers may find it inconvenient to go through .graph.forEachNode().
Options:
- (a) Convenience methods only (current plan). Direct access via .graph for power users.
- (b) Expose graphology's traversal methods directly on FlowGraph (e.g., flowGraph.forEachNode()).
- (c) Expose only the most common traversal methods and let .graph handle the rest.
Notes: This is a UX decision. Option (a) keeps the API surface small. Option (b) is more convenient but increases the delegation surface. Option (c) is a middle ground. The decision can be made during implementation based on actual consumer usage patterns.

OQ-18: Should `addOperation` auto-populate type-compat edges?

Origin: flowgraph-api.md Q2
Status: open
Priority: low — affects incremental construction behavior
Question: fromSpecs() calls buildTypeEdges() which adds all type-compatibility edges. Should addOperation() (incremental) also attempt auto-type-compat edge creation?
Notes: This is only relevant for incremental construction (rare use case). The operation graph is typically built once via fromSpecs(). If incremental construction is needed, the consumer can call buildTypeEdges() manually after adding operations. Can be deferred.

OQ-28: Should `FlowGraph` share analysis functions across instances?

Origin: flowgraph-api.md Q3
Status: open
Priority: low — optimization concern, not blocking
Question: Currently each FlowGraph instance owns its own DirectedGraph. A future optimization could pool analysis functions across instances.
Notes: Distinct from OQ-15 (multiple graphs per instance) — this is about sharing analysis logic, not about graph scoping. Can be deferred.

OQ-19: Should `parallelGroups` account for resource constraints?

Origin: analysis.md Q4
Status: open
Priority: low — feature enhancement, not a core concern
Question: Currently parallelGroups() returns the theoretical maximum parallelism. An optional maxConcurrency parameter could limit group sizes for realistic scheduling.
Notes: Can be added later as an optional parameter. Not blocking.
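
One possible shape for the optional parameter is to compute topological layers and then split each layer into chunks. This sketch assumes an input of node ids and edge pairs and a return type of `string[][]`; the actual `parallelGroups()` signature in analysis.md may differ.

```typescript
// Groups a DAG into layers of nodes that can run together, optionally
// capping each group at `maxConcurrency` nodes.
// Assumption: every edge endpoint appears in `nodes` and the graph is acyclic.
function parallelGroups(
  nodes: string[],
  edges: Array<[string, string]>,
  maxConcurrency?: number,
): string[][] {
  if (maxConcurrency !== undefined && maxConcurrency < 1) {
    throw new RangeError("maxConcurrency must be at least 1");
  }
  const indegree = new Map<string, number>();
  const successors = new Map<string, string[]>();
  for (const n of nodes) {
    indegree.set(n, 0);
    successors.set(n, []);
  }
  for (const [from, to] of edges) {
    indegree.set(to, (indegree.get(to) ?? 0) + 1);
    successors.get(from)?.push(to);
  }

  const groups: string[][] = [];
  let layer = nodes.filter((n) => indegree.get(n) === 0);
  while (layer.length > 0) {
    // Without a limit the whole layer is one group (theoretical maximum).
    const size = maxConcurrency ?? layer.length;
    for (let i = 0; i < layer.length; i += size) {
      groups.push(layer.slice(i, i + size));
    }
    const nextLayer: string[] = [];
    for (const n of layer) {
      for (const s of successors.get(n) ?? []) {
        const remaining = (indegree.get(s) ?? 0) - 1;
        indegree.set(s, remaining);
        if (remaining === 0) nextLayer.push(s);
      }
    }
    layer = nextLayer;
  }
  return groups;
}

const dagNodes = ["A", "B", "C", "D", "E"];
const dagEdges: Array<[string, string]> = [["A", "D"], ["B", "D"], ["C", "D"], ["D", "E"]];
```

For this sample, the unlimited call yields `[A, B, C]`, `[D]`, `[E]`, and a limit of 2 splits the first layer into `[A, B]` and `[C]`. Chunking whole layers is a simplification: a real scheduler could start D's siblings as soon as any slot frees up rather than waiting for a chunk to finish.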

OQ-27: Should `validateTemplate` check runtime preconditions?

Origin: analysis.md Q2
Status: open (intentionally deferred)
Priority: low — explicitly out of scope for static analysis
Question: Currently validateTemplate only checks structural validity and type compatibility. Runtime preconditions (e.g., "operation B requires an API key that operation A doesn't have access to") are beyond the scope of static analysis and belong to the access control layer.
Notes: This is a deliberate scope boundary, not a design gap. Documented here to confirm that this is an intentional deferral, not an oversight.

Theme 7: Conditional and Template Semantics

OQ-29: Should GraphologyHostConfig produce a separate graph per edge type?

Origin: host-configs.md Q2
Status: open
Priority: medium — affects implementation of the GraphologyHostConfig
Question: Currently all edge types (sequential, conditional, typed) share the same graph. An alternative is a separate graph per edge type, enabling type-specific queries without filtering.
Notes: Related to OQ-04 (edge type consistency at the schema level) but distinct — this is about the runtime graph structure, not the type design. Multiple graphs would make type-specific queries faster (no filtering) but increase complexity and memory usage.
Cross-references: OQ-04

OQ-20: How should conditional edge conditions be represented?

Origin: schema.md Q3
Status: open
Priority: medium — affects TemplateEdgeAttrs.condition type safety
Options:
- (a) Type.Unknown() with documentation (current). Pro: maximally flexible. Con: no type safety.
- (b) Type.Union([Type.String(), Type.Function(...)]) for expression strings and function references. Pro: documents both forms. Con: functions don't serialize.
- (c) A dedicated ConditionSchema that flowgraph defines. Pro: type safe, consistent. Con: may be overly prescriptive.
Notes: The workflow-templates doc already specifies Conditional.test as ((results: Record<string, CallResult>) => boolean) | string, and the host-configs doc notes that function props need runtime resolution. Option (b) seems like the pragmatic choice that matches the existing design, but the schema representation is what needs deciding.
Known Gap (from host-configs.md): "Conditional Test Evaluation" — the Conditional.test function needs access to the WorkflowContext/ReactiveContext at runtime to evaluate against predecessor results. This is a concrete sub-problem of OQ-06 (how the reactive host config bridges to execution).
Cross-references: OQ-05 (conditional branch behavior in reactive engine), OQ-06 (runtime resolution of function props)

OQ-21: Should templates support explicit `depends_on` edges?

Origin: workflow-templates.md Q3
Status: open
Priority: medium — affects template composition expressiveness
Question: Currently dependencies are inferred from structure (sequential implies dependency). An explicit <DependsOn target="operation-name" /> component would make data dependencies visible in the template without relying on sequential ordering.
Notes: This would add expressiveness but also complexity. Implicit dependency from structure is simpler and covers the most common cases. Explicit depends_on would be needed when a node depends on a non-adjacent predecessor in a way that can't be expressed by a Sequential group. Can be deferred to v2.
Cross-references: OQ-08 (call graph depends_on edges)

Theme 8: Identity and Serialization

OQ-22: Should `CallNodeAttrs.identity` be a structured type or `Type.Record`?

Origin: schema.md Q2
Status: open
Priority: medium — affects the @alkdev/operations peer dependency
Options:
- (a) Import Identity from @alkdev/operations (peer dep). Pro: matches call protocol. Con: creates a direct type dependency.
- (b) Duplicate the type in flowgraph. Pro: no dependency. Con: divergence risk.
- (c) Use Type.Record(Type.String(), Type.Array(Type.String())) for the resources field. Pro: flexible. Con: weaker typing.
Notes: Since @alkdev/operations is already a peer dependency for type imports, option (a) seems reasonable. The concern is version alignment, but semver ranges handle this. This could also be a Type.Unknown() with documentation, letting the consumer validate.

OQ-23: Multiple graphs per `FlowGraph` instance?

Origin: call-graph.md Q3 (same as OQ-15)
Status: open (duplicate of OQ-15 — see above)

OQ-24: Async analysis functions?

Origin: analysis.md Q3
Status: open
Priority: low — premature for current scale
Question: Should analysis functions be async for large graphs? Current graphs are small (50-200 nodes), so synchronous analysis is fine.
Notes: Can be deferred. If large graphs become common, async analysis can be added with an optional async variant.

Theme 9: Reactive Execution Mechanics

OQ-25: Should the reactive graph support partial re-rendering?

Origin: reactive-execution.md Q3
Status: open (blocked on ujsx reconciler)
Priority: low — blocked on ujsx reconciler implementation
Question: If a template changes mid-execution, the ujsx reconciler could diff and apply changes. Currently only mount rendering is supported.
Known Gap (from host-configs.md): "ujsx Reconciler Not Yet Available" — the current HostConfig is mount-only: no incremental template updates, no prepareUpdate/commitUpdate flow. This gap is broader than just re-rendering.
Notes: This is entirely dependent on the ujsx reconciler, which is not yet implemented. The host-configs doc notes "currently mount-only." When the reconciler is available, flowgraph gets re-rendering "for free." This question should be revisited after the reconciler is implemented.
Cross-references: OQ-05 (structural container handling during re-render), host-configs.md "Known Gaps"

Theme 10: Version and Scale Concerns

OQ-26: How to handle version conflicts?

Origin: operation-graph.md Q2
Status: open
Priority: low — can be deferred to a versioning use case
Question: If two versions of the same operation exist in the registry, should they be separate nodes (task.classify@1.0.0 vs task.classify@2.0.0) or should the latest version win?
Notes: The current design uses namespace.name (no version) as the node key, meaning only one version per operation. This is intentional simplicity. Version conflicts are a niche concern that can be addressed when a concrete use case arises.
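
The consequence of the unversioned key is easy to see in a short sketch. `OperationRef` and both key functions are hypothetical helpers written for this illustration; the `name@version` format simply mirrors the example in the question.

```typescript
// Hypothetical minimal operation reference.
interface OperationRef { namespace: string; name: string; version: string }

// Current design: the key ignores the version.
const nodeKey = (op: OperationRef): string => `${op.namespace}.${op.name}`;

// Alternative: one node per version.
const versionedNodeKey = (op: OperationRef): string => `${nodeKey(op)}@${op.version}`;

const registeredOps: OperationRef[] = [
  { namespace: "task", name: "classify", version: "1.0.0" },
  { namespace: "task", name: "classify", version: "2.0.0" },
];

const unversionedKeys = new Set(registeredOps.map(nodeKey));
const versionedKeys = new Set(registeredOps.map(versionedNodeKey));
```

Both registered versions collapse to the single key `task.classify` under the current scheme, while the versioned scheme keeps two distinct nodes. Which version should back the single node when both are present is the part the current design leaves unspecified.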

Summary Table

ID	Question	Origin	Priority	Status
OQ-01	All edges or only compatible edges?	operation-graph	high	open
OQ-02	Type compatibility depth and granularity	operation-graph, analysis	high	open
OQ-03	Subscription operations in type compat	operation-graph	medium	open
OQ-04	`edgeType` on all edges?	schema	medium	open
OQ-05	Structural container transparency	workflow-templates, host-configs	high	open
OQ-06	Template ↔ call protocol interaction	workflow-templates, host-configs	high	open
OQ-07	Should reactive engine own call graph?	host-configs	high	open
OQ-08	Auto-populate `depends_on` from templates?	call-graph	medium	open
OQ-09	Retries at signal level	reactive-execution	high	open
OQ-10	Running nodes when predecessor fails	reactive-execution	high	open
OQ-11	OR logic for preconditions	reactive-execution	medium	open
OQ-12	`maxConcurrency` interaction with preconditions	reactive-execution	medium	open
OQ-13	`blockedByFailure` vs single computed	reactive-execution	low	open
OQ-14	Unknown `operationId` in call graph	call-graph	medium	open (proposed)
OQ-15	Multiple graphs per instance	call-graph	low	open
OQ-16	`filterByStatus` index	call-graph	low	open
OQ-17	Expose graphology traversal directly?	flowgraph-api	medium	open
OQ-18	Auto-populate type edges on `addOperation`?	flowgraph-api	low	open
OQ-19	`parallelGroups` with resource constraints	analysis	low	open
OQ-20	Conditional edge condition representation	schema	medium	open
OQ-21	Explicit `depends_on` in templates	workflow-templates	medium	open
OQ-22	`CallNodeAttrs.identity` type	schema	medium	open
OQ-24	Async analysis functions	analysis	low	open
OQ-25	Partial re-rendering	reactive-execution	low	open (blocked)
OQ-26	Operation version conflicts	operation-graph	low	open
OQ-27	Runtime preconditions in validateTemplate?	analysis	low	open (deferred)
OQ-28	Share analysis functions across instances?	flowgraph-api	low	open
OQ-29	Separate graph per edge type?	host-configs	medium	open

Priority Assessment

High priority (should resolve before implementation):

OQ-01: All edges or only compatible — shapes the entire operation graph API
OQ-02: Type compatibility depth — shapes typeCompat() return type
OQ-05: Structural container transparency — fundamental to DAG and reactive engine
OQ-06: Template ↔ call protocol — fundamental integration point
OQ-07: Reactive engine owns call graph? — affects architecture boundaries
OQ-09: Retries — shapes the state machine
OQ-10: Running node failure handling — shapes failure propagation

Medium priority (should resolve before v1 release):

OQ-03, OQ-04, OQ-08, OQ-11, OQ-12, OQ-14, OQ-17, OQ-20, OQ-21, OQ-22, OQ-29

Low priority (can defer or decide during implementation):

OQ-13, OQ-15, OQ-16, OQ-18, OQ-19, OQ-24, OQ-25, OQ-26, OQ-27, OQ-28

Cross-Cutting Themes

These groups of questions interact with each other and should be resolved together:

Edge semantics group (OQ-01, OQ-02, OQ-04): All affect the operation graph's edge structure and the type compatibility API.
Call protocol integration group (OQ-06, OQ-07, OQ-08): All about how flowgraph connects to the live call protocol.
Failure semantics group (OQ-09, OQ-10): Both about how failure and retry propagate through the reactive engine. Resolving one may resolve or constrain the other.
Scheduling group (OQ-11, OQ-12): Both about how preconditions interact with scheduling constraints.
Template expressiveness group (OQ-05, OQ-20, OQ-21): All about what the template system can express and how it renders.
Graph structure group (OQ-04, OQ-29): Both about how edge types are represented in the graph — OQ-04 at the schema/type level, OQ-29 at the runtime graph structure level. Resolution of one constrains the other.
Known gaps from host-configs.md — not all "known gaps" are "open questions" (the reconciler gap is a dependency, not a design question), but they should be tracked here for completeness.
