docs(research): add iroh suite deep-dive references for iroh, irpc, iroh-blobs, iroh-gossip, iroh-live, and iroh-docs
docs/research/references/iroh/iroh-blobs/04-storage.md
# iroh-blobs: Storage Architecture

## Overview

iroh-blobs provides three store implementations sharing a common `Store` API surface:

| Store | Location | Mutable | Use Case |
|-------|----------|---------|----------|
| `MemStore` | In-memory | ✅ | Small data, testing, WASM |
| `FsStore` | Filesystem + redb | ✅ | Production, large data |
| `ReadonlyMemStore` | In-memory | ❌ | Static data serving |

All stores implement the same RPC-based command protocol (`Command` enum), allowing both local in-process and remote RPC access through the same `Store` type.
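
The command pattern can be pictured with a minimal sketch. It uses only std threads and channels in place of the tokio and RPC machinery, and every name in it (the `Command` variants, `StoreClient`, `spawn_store`, `FakeHash`) is hypothetical rather than taken from iroh-blobs; `DefaultHasher` stands in for BLAKE3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical stand-in for a BLAKE3 hash.
type FakeHash = u64;

/// Hypothetical command set; each request carries its own reply channel.
enum Command {
    AddBytes { data: Vec<u8>, reply: Sender<FakeHash> },
    GetBytes { hash: FakeHash, reply: Sender<Option<Vec<u8>>> },
    Shutdown,
}

/// Client handle: the store is only reachable by sending commands.
#[derive(Clone)]
struct StoreClient {
    tx: Sender<Command>,
}

impl StoreClient {
    fn add_bytes(&self, data: &[u8]) -> FakeHash {
        let (reply, rx) = channel();
        self.tx.send(Command::AddBytes { data: data.to_vec(), reply }).unwrap();
        rx.recv().unwrap()
    }

    fn get_bytes(&self, hash: FakeHash) -> Option<Vec<u8>> {
        let (reply, rx) = channel();
        self.tx.send(Command::GetBytes { hash, reply }).unwrap();
        rx.recv().unwrap()
    }

    fn shutdown(&self) {
        let _ = self.tx.send(Command::Shutdown);
    }
}

/// Spawn the actor thread that owns all state and serves commands in order.
fn spawn_store() -> (StoreClient, thread::JoinHandle<()>) {
    let (tx, rx) = channel::<Command>();
    let handle = thread::spawn(move || {
        let mut blobs: HashMap<FakeHash, Vec<u8>> = HashMap::new();
        for cmd in rx {
            match cmd {
                Command::AddBytes { data, reply } => {
                    let mut hasher = DefaultHasher::new();
                    data.hash(&mut hasher);
                    let hash = hasher.finish();
                    blobs.insert(hash, data);
                    let _ = reply.send(hash);
                }
                Command::GetBytes { hash, reply } => {
                    let _ = reply.send(blobs.get(&hash).cloned());
                }
                Command::Shutdown => break,
            }
        }
    });
    (StoreClient { tx }, handle)
}
```

Callers only ever hold a `StoreClient`, so in this sketch the in-process channel could be replaced by a network transport without touching call sites. That is the property the shared `Store` type relies on.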

## Store API Surface

The `Store` type (from `api::Store`) is the primary interface. It's accessed via typed sub-APIs:

```rust
let store: Store = /* ... */;

// Blob operations
store.blobs()           // → Blobs API (add, export, read, delete, observe, etc.)
store.tags()            // → Tags API (create, list, set, delete, rename)

// Direct operations
store.add_bytes(data)   // → AddProgress
store.add_slice(data)   // → TempTag (convenience)
store.get_bytes(hash)   // → Result<Bytes>
store.has(hash)         // → bool
store.shutdown()        // Clean shutdown
store.wait_idle()       // Wait for all tasks to complete
store.sync_db()         // Sync database to disk (FsStore)
```

## Blobs API

```rust
let blobs = store.blobs();

// Import
blobs.add_slice(data)                           // → AddProgress (raw format)
blobs.add_bytes(data)                           // → AddProgress (raw format)
blobs.add_bytes_with_opts(AddBytesOptions{..})  // → AddProgress (with format)
blobs.import_byte_stream(format)                // → streaming import

// Export
blobs.reader(hash)                              // → BlobReader (AsyncRead + AsyncSeek)
blobs.export(hash, path)                        // → export to filesystem
blobs.export_bao(hash, ranges)                  // → ExportBao (BLAKE3 verified stream)
blobs.export_ranges(hash, ranges)               // → ExportRanges (raw data ranges)

// Observe (subscribe to chunk availability)
blobs.observe(hash)                             // → ObserveAt (bitfield stream)

// Status
blobs.status(hash)                              // → BlobStatus (NotFound/Partial/Complete)

// Import BAO-encoded data
blobs.import_bao_bytes(hash, ranges, data)      // → import verified BAO stream
blobs.import_bao_reader(hash, ranges, reader)   // → import from async reader

// Batch operations (scoped temp tags)
blobs.batch()                                   // → Batch (auto-cleanup scope)

// Delete
blobs.delete(hashes)                            // → force delete (use GC normally)
```

## Tags API

```rust
let tags = store.tags();

tags.set(name, value)        // Set a persistent tag
tags.create(value)           // Auto-generate a tag name, return Tag
tags.get(name)               // → Option<TagInfo>
tags.list()                  // → Stream<TagInfo>
tags.list_hash_seq()         // → Stream<TagInfo> (only HashSeq format)
tags.delete(name)            // Delete a tag
tags.delete_range(range)     // Delete tags in range
tags.delete_prefix(prefix)   // Delete tags with prefix
tags.rename(from, to)        // Atomically rename a tag
tags.temp_tag(value)         // → TempTag (ephemeral protection)
```
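
Range and prefix deletion are cheap to express when tags live in an ordered map, as the MemStore state does with `BTreeMap<Tag, HashAndFormat>`. The following is only a sketch of that idea, not the crate's implementation: tag names are plain `Vec<u8>`, the value is a placeholder `u64`, and the free function `delete_prefix` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Remove every tag whose name starts with `prefix`.
/// Returns the number of tags removed.
fn delete_prefix(tags: &mut BTreeMap<Vec<u8>, u64>, prefix: &[u8]) -> usize {
    // Keys are sorted, so all names sharing the prefix form one contiguous
    // run that begins at the first key >= prefix.
    let doomed: Vec<Vec<u8>> = tags
        .range(prefix.to_vec()..)
        .take_while(|(name, _)| name.starts_with(prefix))
        .map(|(name, _)| name.clone())
        .collect();
    for name in &doomed {
        tags.remove(name);
    }
    doomed.len()
}
```

The scan stops at the first key that no longer matches, so the cost depends on the number of matching tags rather than on the total tag count.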

## MemStore Architecture

The in-memory store uses a simple actor pattern:

```
MemStore (ApiClient)
│
└── Actor (tokio task)
    ├── State
    │   ├── data: HashMap<Hash, BaoFileHandle>    // All blob data
    │   ├── tags: BTreeMap<Tag, HashAndFormat>    // Persistent tags
    │   └── empty_hash: BaoFileHandle             // Special entry for empty blob
    ├── tasks: JoinSet<TaskResult>                // Spawned import/export tasks
    ├── temp_tags: TempTags                       // Ephemeral protection
    ├── protected: HashSet<Hash>                  // GC-protected hashes
    └── idle_waiters: Vec<oneshot::Sender<()>>    // Wait-idle notifications
```

### BaoFileHandle / BaoFileStorage

```rust
pub enum BaoFileStorage {
    Partial(PartialMemStorage),   // Still downloading
    Complete(CompleteStorage),    // Fully available
}

pub struct PartialMemStorage {
    data: SparseMemFile,          // Sparse byte array for data
    outboard: SparseMemFile,      // Sparse byte array for BLAKE3 hash tree
    size: SizeInfo,               // Known/estimated size
    bitfield: Bitfield,           // Which chunks are verified
}

pub struct CompleteStorage {
    data: Bytes,                  // Complete data
    outboard: Bytes,              // Complete outboard (hash tree)
}
```

The storage is held in a `watch::Sender<BaoFileStorage>`, which lets subscribers observe state changes (this backs the `observe` API).

### Data Flow (Import)

1. `add_bytes(data)` → compute outboard via `PreOrderMemOutboard::create()` → transition `Partial → Complete`
2. `import_bao(hash, size, stream)` → receive `BaoContentItem` stream → write to `PartialMemStorage` → update bitfield → transition to `Complete` when all chunks present
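
The second flow is essentially a small state machine. A minimal sketch, assuming a known total size, a dense buffer instead of `SparseMemFile`, a `Vec<bool>` instead of the real `Bitfield`, and groups that were already verified upstream. `Storage`, `new_partial`, and `write_group` are hypothetical names.

```rust
/// Size of one tracked unit in this sketch: 16 KiB, an assumption that
/// mirrors the chunk-group size used for the inline thresholds.
const GROUP: usize = 16 * 1024;

#[derive(Debug, PartialEq)]
enum Storage {
    Partial { data: Vec<u8>, have: Vec<bool> },
    Complete { data: Vec<u8> },
}

impl Storage {
    /// Start an empty partial entry for a blob of known total size.
    fn new_partial(size: usize) -> Self {
        let groups = size.div_ceil(GROUP).max(1);
        Storage::Partial { data: vec![0; size], have: vec![false; groups] }
    }

    /// Record one already-verified group and promote the entry to
    /// `Complete` once every group is present.
    /// Panics if `index` or `bytes` fall outside the blob.
    fn write_group(self, index: usize, bytes: &[u8]) -> Self {
        match self {
            Storage::Partial { mut data, mut have } => {
                let start = index * GROUP;
                data[start..start + bytes.len()].copy_from_slice(bytes);
                have[index] = true;
                if have.iter().all(|present| *present) {
                    Storage::Complete { data }
                } else {
                    Storage::Partial { data, have }
                }
            }
            // Writing to a complete entry is a no-op in this sketch.
            complete => complete,
        }
    }
}
```

Groups may arrive in any order; the entry stays `Partial` until the last missing group is written.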

### Data Flow (Export)

1. `export_bao(hash, ranges)` → look up `BaoFileHandle` → `traverse_ranges_validated(data, outboard, &ranges, tx)`, which streams validated BAO data

## FsStore Architecture (Hybrid Store)

The filesystem store uses a **hybrid approach** that stores small data inline in redb and large data as files on disk.

### Design Rationale (from DESIGN.md)

- **Databases** are good for small blobs (low per-entry overhead, fast random access)
- **Filesystems** are good for large blobs (OS-level caching, direct file access)
- **Neither alone** works well for both cases

### Layout

```
<data_dir>/
├── db/                          # redb database
│   ├── metadata table           # Hash → EntryState
│   ├── inline_data table        # Hash → Bytes (for small blobs)
│   ├── inline_outboard table    # Hash → Bytes (for small outboards)
│   └── tags table               # Tag → HashAndFormat
├── data/<hash>.data             # Large blob data files
├── data/<hash>.outboard         # Large outboard files
├── data/<hash>.sizes            # Size tracking for partial files
└── data/<hash>.bitfield         # Validated chunk tracking for partial files
```

### EntryState

```rust
// Simplified from src/store/fs/entry_state.rs
pub enum EntryState {
    Complete(CompleteEntryState),
    Partial(PartialEntryState),
}

pub struct CompleteEntryState {
    pub data: DataLocation,         // Inline, Owned (canonical path), or External (user path)
    pub outboard: OutboardLocation, // Inline, Owned, or NotNeeded
    pub size: u64,
}

pub enum DataLocation {
    Inline,                 // Stored in redb inline_data table
    Owned,                  // File at canonical path <hash>.data
    External(Vec<PathBuf>), // User-owned file paths
}

pub enum OutboardLocation {
    Inline,    // Stored in redb inline_outboard table
    Owned,     // File at canonical path <hash>.outboard
    NotNeeded, // Data ≤ 16 KiB, no outboard needed
}

pub struct PartialEntryState {
    // Either we know the verified size, or we don't yet
    pub verified_size: Option<NonZeroU64>,
}
```

### Thresholds

- **Data inline threshold**: 16 KiB (default). Blobs up to this size are stored entirely in redb.
- **Outboard inline threshold**: 16 KiB (default). Outboards smaller than this are stored in redb.
- Data ≤ 16 KiB has no outboard (none is needed to verify a single chunk group)
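
To make the thresholds concrete, here is a sketch of how a complete blob might be placed. It rests on assumptions not stated in the source: both comparisons are inclusive, the outboard holds one 64-byte parent node (two 32-byte hashes) per internal tree node with no size prefix, and the `External` data location is ignored. The constants and the `outboard_size` and `place` functions are hypothetical names.

```rust
const CHUNK_GROUP: u64 = 16 * 1024; // bytes covered by one leaf of the hash tree
const MAX_DATA_INLINED: u64 = 16 * 1024;
const MAX_OUTBOARD_INLINED: u64 = 16 * 1024;

#[derive(Debug, PartialEq)]
enum DataLocation {
    Inline,
    Owned,
}

#[derive(Debug, PartialEq)]
enum OutboardLocation {
    Inline,
    Owned,
    NotNeeded,
}

/// Outboard size under the stated assumption: a binary tree with
/// `groups` leaves has `groups - 1` parents of 64 bytes each.
fn outboard_size(data_size: u64) -> u64 {
    let groups = data_size.div_ceil(CHUNK_GROUP).max(1);
    (groups - 1) * 64
}

/// Decide where data and outboard of a complete blob would live.
fn place(data_size: u64) -> (DataLocation, OutboardLocation) {
    let data = if data_size <= MAX_DATA_INLINED {
        DataLocation::Inline
    } else {
        DataLocation::Owned
    };
    let outboard_bytes = outboard_size(data_size);
    let outboard = if outboard_bytes == 0 {
        OutboardLocation::NotNeeded
    } else if outboard_bytes <= MAX_OUTBOARD_INLINED {
        OutboardLocation::Inline
    } else {
        OutboardLocation::Owned
    };
    (data, outboard)
}
```

Under these assumptions the outboard is much smaller than the data, so a blob can have its data in a file while its outboard still fits in redb. In the sketch, that holds for blobs from just over 16 KiB up to 257 chunk groups (about 4 MiB).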

### Blob Lifecycle

**Adding a local file (known data, unknown hash)**:
1. Compute the full BLAKE3 hash and outboard
2. Atomically move the file into the store under the hash name
3. Apply inlining rules: small files → redb, large files → filesystem

**Syncing from remote (known hash, unknown data)**:
1. Start with no data and keep state in memory (not in the database)
2. As chunks arrive, write incrementally to partial files
3. Once the size is known to exceed the inline threshold, create the database entry and filesystem files
4. On completion, transition to `Complete` state and apply inlining rules

**Deletion**:
- Tags protect content from GC
- `TempTag` provides ephemeral (process-lifetime) protection
- HashSeq tags protect the root blob AND all referenced child blobs
- GC is mark-and-sweep: mark all reachable content via tags, then sweep (delete) everything else
- Explicit `force` deletion bypasses protection (emergency use only)
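
Ephemeral protection of this kind is commonly built as an RAII guard: the hash stays protected while at least one guard is alive and is released when the last guard is dropped. The sketch below illustrates that pattern only. The reference-counted table, the `protect` and `is_protected` methods, and the `BlobHash` alias are assumptions, not the crate's actual `TempTag` internals.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a 32-byte BLAKE3 hash.
type BlobHash = [u8; 32];

/// Shared table of ephemeral protections: hash -> number of live guards.
#[derive(Clone, Default)]
struct TempTags {
    counts: Arc<Mutex<HashMap<BlobHash, usize>>>,
}

/// Guard that keeps `hash` protected until it is dropped.
struct TempTag {
    hash: BlobHash,
    owner: TempTags,
}

impl TempTags {
    /// Create a guard for `hash`, incrementing its reference count.
    fn protect(&self, hash: BlobHash) -> TempTag {
        *self.counts.lock().unwrap().entry(hash).or_insert(0) += 1;
        TempTag { hash, owner: self.clone() }
    }

    /// True while at least one guard for `hash` is alive.
    fn is_protected(&self, hash: &BlobHash) -> bool {
        self.counts.lock().unwrap().contains_key(hash)
    }
}

impl Drop for TempTag {
    fn drop(&mut self) {
        let mut counts = self.owner.counts.lock().unwrap();
        let remaining = match counts.get_mut(&self.hash) {
            Some(n) => {
                *n -= 1;
                *n
            }
            None => return,
        };
        // The last guard is gone: the hash is no longer protected.
        if remaining == 0 {
            counts.remove(&self.hash);
        }
    }
}
```

A GC mark phase would treat every key in the table as a root, so a blob that was just imported but not yet given a persistent tag survives until the caller drops its guard.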

### FsStore Actor Architecture

```
FsStore (ApiClient)
│
└── MainActor (tokio task)
    ├── TaskContext { config, db_actor_sender }
    ├── EntityMap: HashMap<Hash, ActiveEntityState>    // Currently active entities
    ├── JoinSet<TaskResult>                            // Running tasks
    ├── TempTags                                       // Ephemeral protection
    ├── ProtectedSet                                   // GC protection
    └── idle_waiters
```

The FsStore uses an **entity manager** pattern where each hash gets a `BaoFileHandle` (like MemStore) when active, and entries are cleaned up when tasks complete.

## Garbage Collection

```rust
pub struct GcConfig {
    pub interval: Duration,
    pub add_protected: Option<ProtectCb>, // Optional callback to add more protected hashes
}
```

GC is a two-phase process:
1. **Mark**: Walk all tags (persistent + temp), collect reachable hashes. For HashSeq format, traverse the hash sequence to find all child hashes.
2. **Sweep**: Delete all blobs not in the reachable set, in batches of 100.

GC runs automatically at a configurable interval via `run_gc(store, config)`, or manually via `gc_run_once(store, live)`.
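
The two phases can be sketched over a toy in-memory model. This is an illustration of mark-and-sweep as described above, not the code behind `run_gc`: `ToyStore`, `mark`, and `sweep` are hypothetical, hashes are plain `u32` values, temp tags and the `add_protected` callback are left out, and a HashSeq root simply lists its children.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for a 32-byte BLAKE3 hash.
type BlobHash = u32;

#[derive(Clone, Copy)]
enum BlobFormat {
    Raw,
    HashSeq,
}

struct ToyStore {
    /// blob -> hashes it references (non-empty only for HashSeq roots)
    blobs: HashMap<BlobHash, Vec<BlobHash>>,
    /// tag name -> (root hash, format)
    tags: HashMap<String, (BlobHash, BlobFormat)>,
}

/// Mark: every tagged root is live; HashSeq roots also keep their children live.
fn mark(store: &ToyStore) -> HashSet<BlobHash> {
    let mut live = HashSet::new();
    for (hash, format) in store.tags.values() {
        live.insert(*hash);
        if let BlobFormat::HashSeq = format {
            if let Some(children) = store.blobs.get(hash) {
                live.extend(children.iter().copied());
            }
        }
    }
    live
}

/// Sweep: delete everything not marked, 100 blobs per batch.
/// Returns the number of batches issued.
fn sweep(store: &mut ToyStore, live: &HashSet<BlobHash>) -> usize {
    let dead: Vec<BlobHash> = store
        .blobs
        .keys()
        .filter(|h| !live.contains(*h))
        .copied()
        .collect();
    let mut batches = 0;
    for batch in dead.chunks(100) {
        for hash in batch {
            store.blobs.remove(hash);
        }
        batches += 1;
    }
    batches
}
```

Batching keeps each delete request bounded, so a large sweep in this model turns into several small operations rather than one long one.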