docs(research): add nats-async and nats-server deep-dive references
# Connection and Reconnection

This document covers how connections are established, TLS handling, the server pool, and the reconnection mechanism.

## Connector

**Location**: `connector.rs`

The `Connector` manages the server pool and handles connection establishment and reconnection.

```rust
pub(crate) struct Connector {
    servers: Vec<Server>,            // Server pool with per-server metadata
    options: ConnectorOptions,       // Connection configuration
    connect_stats: Arc<Statistics>,  // Shared statistics
    attempts: usize,                 // Global reconnection attempt counter
    events_tx: mpsc::Sender<Event>,  // Event channel
    state_tx: watch::Sender<State>,  // Connection state watcher
    max_payload: Arc<AtomicUsize>,   // Server's max payload
    last_info: ServerInfo,           // Last known server info
}
```

### Server Pool

Each server in the pool carries metadata:

```rust
#[derive(Debug, Clone)]
pub struct Server {
    pub addr: ServerAddr,
    pub failed_attempts: usize,      // Consecutive failed attempts
    pub did_connect: bool,           // Ever successfully connected?
    pub is_discovered: bool,         // Discovered via INFO, not user-configured
    pub last_error: Option<String>,  // Last connection error
}
```

### ConnectorOptions

```rust
pub(crate) struct ConnectorOptions {
    pub tls_required: bool,
    pub certificates: Vec<PathBuf>,
    pub client_cert: Option<PathBuf>,
    pub client_key: Option<PathBuf>,
    pub tls_client_config: Option<rustls::ClientConfig>,
    pub tls_first: bool,
    pub auth: Auth,
    pub no_echo: bool,
    pub connection_timeout: Duration,  // Default: 5 seconds
    pub name: Option<String>,
    pub ignore_discovered_servers: bool,
    pub retain_servers_order: bool,
    pub read_buffer_capacity: u16,     // Default: 65535
    pub reconnect_delay_callback: Arc<dyn Fn(usize) -> Duration>,
    pub auth_callback: Option<CallbackArg1<Vec<u8>, Result<Auth, AuthError>>>,
    pub max_reconnects: Option<usize>,
    pub local_address: Option<SocketAddr>,
    pub reconnect_to_server_callback: Option<ReconnectToServerCallback>,
}
```

## Connection Establishment Flow

```
Connector::try_connect_to_server(addr)
    │
    ├── 1. DNS resolution
    │       server_addr.socket_addrs()
    │
    ├── 2. For each resolved address:
    │   │
    │   ├── 2a. Connect with timeout
    │   │       tokio::time::timeout(connection_timeout, try_connect_to(socket_addr, ...))
    │   │
    │   └── 2b. try_connect_to():
    │       │
    │       ├── Select transport:
    │       │   ├── "ws" → WebSocket (tokio_websockets)
    │       │   ├── "wss" → WebSocket over TLS
    │       │   └── default → TCP (TcpStream)
    │       │
    │       ├── Optional: bind to local_address
    │       ├── Set TCP_NODELAY
    │       ├── Create Connection with read_buffer_capacity
    │       │
    │       ├── If tls_first: upgrade to TLS before INFO
    │       │
    │       ├── Read INFO from server
    │       │
    │       ├── If TLS required (by option, server, or URL scheme):
    │       │       upgrade to TLS (rustls)
    │       │
    │       ├── Discover servers from INFO.connect_urls
    │       │       (unless ignore_discovered_servers)
    │       │
    │       ├── Build ConnectInfo with auth:
    │       │   ├── username/password (from Auth or URL)
    │       │   ├── token (from Auth)
    │       │   ├── nkey + signed nonce (feature: nkeys)
    │       │   ├── JWT + signature callback (feature: nkeys)
    │       │   └── auth_callback (custom async callback)
    │       │
    │       ├── Send CONNECT + PING
    │       │
    │       └── Wait for response:
    │           ├── -ERR (authorization violation) → error
    │           ├── PONG or +OK → success
    │           └── EOF → error
    │
    └── 3. On success:
        ├── Reset attempt counter
        ├── Increment connects statistic
        ├── Emit Event::Connected
        ├── Update State::Connected
        ├── Store max_payload
        ├── Update per-server metadata (did_connect, failed_attempts)
        └── Return (ServerInfo, Connection)
```

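
The final "wait for response" step of the flow above can be sketched as a small classifier. This is an illustration only, not the crate's parser: `HandshakeOutcome` and `classify_handshake_line` are hypothetical names, the input is assumed to be a single protocol line already read from the socket, and `None` stands in for EOF.

```rust
/// Hypothetical outcome of the CONNECT + PING exchange (not a crate type).
#[derive(Debug, PartialEq)]
pub enum HandshakeOutcome {
    /// The server answered PONG or +OK.
    Connected,
    /// The server answered -ERR; the payload is the reason text.
    ServerError(String),
    /// The server closed the socket before answering.
    UnexpectedEof,
    /// Any other line.
    Unexpected(String),
}

/// Classifies the first line received after CONNECT + PING.
/// `None` models EOF on the connection.
pub fn classify_handshake_line(line: Option<&str>) -> HandshakeOutcome {
    let line = match line {
        Some(line) => line.trim_end_matches(|c: char| c == '\r' || c == '\n'),
        None => return HandshakeOutcome::UnexpectedEof,
    };
    if line == "PONG" || line == "+OK" {
        HandshakeOutcome::Connected
    } else if let Some(reason) = line.strip_prefix("-ERR") {
        // Reasons arrive quoted, e.g. -ERR 'Authorization Violation'.
        HandshakeOutcome::ServerError(reason.trim().trim_matches('\'').to_string())
    } else {
        HandshakeOutcome::Unexpected(line.to_string())
    }
}
```

In this sketch only `Connected` would let the flow continue to step 3; every other variant would be reported as a failed attempt for that server.
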
## TLS Handling

The client supports three TLS modes:

### 1. Standard TLS (INFO → TLS)
Default behavior. The client receives the `INFO` message in plaintext, then upgrades to TLS if:
- `tls_required` option is set
- Server's `INFO.tls_required` is true
- URL scheme is `tls://`

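
The upgrade decision for the standard mode is a plain OR over the three conditions listed above. A minimal sketch, assuming the three inputs have already been extracted; `needs_tls_upgrade` and its parameter names are hypothetical and do not come from the crate:

```rust
/// Returns true when a plaintext connection should be upgraded to TLS
/// after INFO has been read (the "Standard TLS" mode).
///
/// * `option_tls_required`: the client-side `tls_required` option
/// * `info_tls_required`: the `tls_required` flag from the server's INFO
/// * `url_scheme`: scheme of the server URL, without the `://` suffix
pub fn needs_tls_upgrade(
    option_tls_required: bool,
    info_tls_required: bool,
    url_scheme: &str,
) -> bool {
    option_tls_required || info_tls_required || url_scheme.eq_ignore_ascii_case("tls")
}
```

Any single condition is enough, so a server that advertises `tls_required` forces the upgrade even when the client did not ask for TLS.
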
### 2. TLS First (TLS → INFO)
When `ConnectOptions::tls_first()` is enabled, the client establishes TLS before reading INFO. This requires the server to have `handshake_first` enabled. This mode is useful in environments where a plaintext INFO is not acceptable.

### 3. WebSocket TLS
For `wss://` URLs, TLS is handled directly by the WebSocket library (`tokio-websockets`), not by the client's TLS layer.

### TLS Configuration
The client uses `rustls` via `tokio-rustls`. Configuration steps:
1. Load root certificates from system store (`rustls-native-certs`)
2. Optionally add custom root certificates from PEM files
3. Optionally configure client certificate and key for mTLS
4. Optionally pass a custom `rustls::ClientConfig`

The crypto backend is selectable via feature flags:
- `ring` (default)
- `aws-lc-rs`
- `fips` (requires aws-lc-rs)

## Reconnection

### Reconnection Trigger

Reconnection is triggered when:
1. I/O error during read or write (`ExitReason::Disconnected`)
2. Too many pending PINGs (no PONG received)
3. User calls `Client::force_reconnect()` (`ExitReason::ReconnectRequested`)

### Reconnection Flow

```
ConnectionHandler::handle_disconnect()
    │
    ├── Reset pending_pings to 0
    ├── Emit Event::Disconnected
    ├── Update State::Disconnected
    │
    └── handle_reconnect()
        │
        └── Connector::connect()
            │
            └── Loop: try_connect()
                │
                ├── If reconnect_to_server_callback is set:
                │   │   Call callback with (server_pool, server_info)
                │   │   If returns Some(ReconnectToServer):
                │   │       Validate server is in pool
                │   │       Use callback's delay or default backoff
                │   │       Try connecting to selected server
                │   └── If None or invalid: fall through to default
                │
                ├── Default selection:
                │   ├── Shuffle servers (unless retain_servers_order)
                │   ├── Sort by failed_attempts (ascending)
                │   └── Try each server in order
                │
                ├── For each server:
                │   ├── Increment attempts counter
                │   ├── Check max_reconnects limit
                │   ├── Apply reconnect delay (exponential backoff)
                │   └── try_connect_to_server(addr)
                │
                ├── On success:
                │   ├── Reset attempts to 0
                │   ├── Re-subscribe all active subscriptions
                │   │       (filter out closed subscription channels)
                │   ├── Re-subscribe multiplexer wildcard
                │   └── Return (ServerInfo, Connection)
                │
                └── On failure:
                    ├── Update per-server metadata (failed_attempts, last_error)
                    ├── Auth errors → propagate immediately
                    └── Other errors → continue to next server
```

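
The default selection and the success/failure branches of the flow above can be sketched as one synchronous pass over the pool. This is a simplified model, not the crate's code: `PoolEntry`, `ConnectError`, and `try_pool` are hypothetical names, the shuffle, delay, and `max_reconnects` steps are left out, and resetting `failed_attempts` on success is an assumption about what "update per-server metadata" means.

```rust
/// Hypothetical error split: auth failures abort, everything else moves on.
#[derive(Debug, Clone, PartialEq)]
pub enum ConnectError {
    Auth(String),
    Io(String),
}

/// Hypothetical, trimmed-down stand-in for the per-server metadata.
#[derive(Debug, Clone)]
pub struct PoolEntry {
    pub addr: String,
    pub failed_attempts: usize,
    pub last_error: Option<String>,
}

/// One pass over the pool.
/// Returns Ok(Some(addr)) for the first server that connects, Ok(None) if
/// every server failed with a non-auth error, and Err on an auth error.
pub fn try_pool<F>(
    servers: &mut Vec<PoolEntry>,
    mut connect: F,
) -> Result<Option<String>, ConnectError>
where
    F: FnMut(&str) -> Result<(), ConnectError>,
{
    // Stable sort: fewest failures first; ties keep their incoming
    // (possibly shuffled) order.
    servers.sort_by_key(|s| s.failed_attempts);
    for server in servers.iter_mut() {
        match connect(&server.addr) {
            Ok(()) => {
                // Assumption: a successful connect clears the failure state.
                server.failed_attempts = 0;
                server.last_error = None;
                return Ok(Some(server.addr.clone()));
            }
            Err(err) => {
                server.failed_attempts += 1;
                server.last_error = Some(format!("{:?}", err));
                if let ConnectError::Auth(_) = err {
                    // Auth errors propagate immediately.
                    return Err(err);
                }
                // Other errors: continue with the next server.
            }
        }
    }
    Ok(None)
}
```

Because the sort is stable, shuffling before sorting randomizes only the order among servers with the same failure count, which is consistent with the two-step "shuffle, then sort" selection shown in the diagram.
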
### Exponential Backoff

Default reconnect delay function:

```rust
use std::{cmp, time::Duration};

fn reconnect_delay_callback_default(attempts: usize) -> Duration {
    if attempts <= 1 {
        // The first attempt is made immediately.
        Duration::from_millis(0)
    } else {
        // 2^(attempts - 1) milliseconds, capped at 4 seconds.
        let exp: u32 = (attempts - 1).try_into().unwrap_or(u32::MAX);
        let max = Duration::from_secs(4);
        cmp::min(Duration::from_millis(2_u64.saturating_pow(exp)), max)
    }
}
```

| Attempt | Delay |
|---------|-------|
| 1 | 0 ms |
| 2 | 2 ms |
| 3 | 4 ms |
| 4 | 8 ms |
| 5 | 16 ms |
| ... | ... |
| 12 | 2048 ms |
| 13+ | 4000 ms (capped) |

Custom delay functions can be provided via `ConnectOptions::reconnect_delay_callback()`.

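
To make the schedule easy to check, and to show what a custom policy might look like, here is a small standalone sketch. `default_delay` restates the default function quoted in this section; `linear_delay` is a hypothetical alternative (250 ms steps, 10 s cap) invented for illustration and not part of the crate.

```rust
use std::cmp;
use std::convert::TryFrom;
use std::time::Duration;

/// The default policy from this section: no delay for the first attempt,
/// then 2^(attempts - 1) milliseconds, capped at 4 seconds.
pub fn default_delay(attempts: usize) -> Duration {
    if attempts <= 1 {
        Duration::from_millis(0)
    } else {
        let exp = u32::try_from(attempts - 1).unwrap_or(u32::MAX);
        let max = Duration::from_secs(4);
        cmp::min(Duration::from_millis(2_u64.saturating_pow(exp)), max)
    }
}

/// Hypothetical custom policy: grow linearly in 250 ms steps, cap at 10 s.
pub fn linear_delay(attempts: usize) -> Duration {
    let steps = u64::try_from(attempts).unwrap_or(u64::MAX);
    let max = Duration::from_secs(10);
    cmp::min(Duration::from_millis(steps.saturating_mul(250)), max)
}
```

Both functions have the `Fn(usize) -> Duration` shape of the `reconnect_delay_callback` field, so a function like `linear_delay` is the kind of value that would be handed to `ConnectOptions::reconnect_delay_callback()`.
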
### Server Pool Updates

The server pool is dynamic:

1. **Initial pool**: from `connect()` / `ConnectOptions::connect()` URL(s)
2. **Discovered servers**: added from `INFO.connect_urls` on each connection (unless `ignore_discovered_servers` is set)
3. **Runtime updates**: via `Client::set_server_pool()`, which replaces the entire pool while preserving per-server state for servers that appear in both old and new pools
4. **Order**: servers are shuffled by default (random selection), unless `retain_servers_order` is set

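
The "replace the pool but keep state for surviving servers" behavior of runtime updates can be sketched as follows. This models the idea only: `PoolServer` and `replace_pool` are hypothetical names, addresses are compared as plain strings, and the real `Client::set_server_pool()` signature is not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for the per-server metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct PoolServer {
    pub addr: String,
    pub failed_attempts: usize,
    pub did_connect: bool,
}

/// Builds the new pool from `new_addrs`, carrying over the state of every
/// server that was already present in `old`. Servers missing from
/// `new_addrs` are dropped; unseen servers start with fresh state.
pub fn replace_pool(old: &[PoolServer], new_addrs: &[&str]) -> Vec<PoolServer> {
    let known: HashMap<&str, &PoolServer> =
        old.iter().map(|s| (s.addr.as_str(), s)).collect();
    new_addrs
        .iter()
        .map(|addr| match known.get(*addr) {
            Some(existing) => (**existing).clone(),
            None => PoolServer {
                addr: (*addr).to_string(),
                failed_attempts: 0,
                did_connect: false,
            },
        })
        .collect()
}
```

Keeping the old counters means a server that was already failing stays at the back of the default selection order after the update.
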
### Max Reconnects

The `max_reconnects` option limits the number of consecutive reconnection attempts:
- `None` or `0` → unlimited (default)
- `Some(n)` → give up after `n` attempts
- The counter is reset on successful connection and when `set_server_pool()` is called

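
The limit check can be sketched as a single predicate. `reconnects_exhausted` is a hypothetical helper, and the exact comparison is an assumption: it treats `attempts` as the counter value after it has been incremented for the current try, so `n` tries are allowed and the next one is refused.

```rust
/// Returns true when the connector should stop retrying.
/// `attempts` is the counter value after incrementing for the current try.
pub fn reconnects_exhausted(attempts: usize, max_reconnects: Option<usize>) -> bool {
    match max_reconnects {
        // `None` and `Some(0)` both mean "no limit".
        None | Some(0) => false,
        // Assumption: exactly `limit` attempts are permitted.
        Some(limit) => attempts > limit,
    }
}
```

Since the counter is reset on every successful connection, this bounds a single outage rather than the lifetime of the client.
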
## ConnectOptions Defaults

| Option | Default |
|--------|---------|
| `connection_timeout` | 5 seconds |
| `ping_interval` | 60 seconds |
| `sender_capacity` | 2048 |
| `subscription_capacity` | 65536 |
| `inbox_prefix` | `"_INBOX"` |
| `request_timeout` | 10 seconds |
| `retry_on_initial_connect` | false |
| `ignore_discovered_servers` | false |
| `retain_servers_order` | false |
| `read_buffer_capacity` | 65535 |
| `skip_subject_validation` | false |
| `no_echo` | false |
| `tls_required` | false |
| `tls_first` | false |
| `max_reconnects` | None (unlimited) |

## Background Connection

When `ConnectOptions::retry_on_initial_connect()` is enabled, the `connect()` function returns a `Client` immediately, before the connection is established. The connection is established in a background Tokio task. This means:
- `client.server_info()` returns `ServerInfo::default()` until connected
- `client.connection_state()` returns `State::Pending`
- Operations like `publish()` will queue in the command channel
- The `Client` becomes usable once the background task connects