docs(research): add polyglot SQL transpiler deep dive for multi-DB storage evaluation
137
docs/research/references/polyglot/01_overview.md
Normal file
@@ -0,0 +1,137 @@
# Polyglot: Research Overview

**Library**: `polyglot-sql` (Rust crate) / `@polyglot-sql/sdk` (TypeScript/WASM) / `polyglot-sql` (Python)
**Repository**: <https://github.com/tobilg/polyglot>
**Current Version**: 0.4.4 (as of 2026-06-03)
**License**: MIT (+ sqlglot MIT for test fixtures)
**Author**: Tobias G. (tobilg)
**Inspiration**: Python [sqlglot](https://github.com/tobymao/sqlglot) by Toby Mao

---

## 1. What Is Polyglot?

Polyglot is a **SQL transpiler**: it parses SQL from one database dialect into an AST and generates SQL for a different dialect. It is **not** a database driver, ORM, query executor, or connection pool. Its core purpose is **dialect-agnostic SQL manipulation**: parse, transform, validate, format, and transpile SQL across 32+ database dialects.

### Key Capabilities

| Capability | Description |
|---|---|
| **Parse** | Convert SQL string → typed AST with 200+ expression node types |
| **Generate** | Convert AST → SQL string for any supported dialect |
| **Transpile** | Convert SQL from dialect A → dialect B in one call |
| **Format** | Pretty-print SQL with configurable guard rails |
| **Build** | Construct SQL programmatically via a fluent builder API |
| **Validate** | Syntax + semantic validation with error positions |
| **Lineage** | Trace column lineage through queries; generate OpenLineage payloads |
| **Diff** | AST-aware diff between two SQL expressions |
| **Traverse** | DFS/BFS iterators, predicate queries, and transforms on the AST |

### Supported Dialects (32)

Athena, BigQuery, ClickHouse, CockroachDB, Databricks, Doris, Dremio, Drill, Druid, DuckDB, Dune, Exasol, Fabric, Hive, Materialize, MySQL, Oracle, PostgreSQL, Presto, Redshift, RisingWave, SingleStore, Snowflake, Solr, Spark, SQLite, StarRocks, Tableau, Teradata, TiDB, Trino, TSQL

Plus a `Generic` dialect for standard SQL.

### Language Bindings

| Binding | Package | Delivery |
|---|---|---|
| **Rust** | `polyglot-sql` on crates.io | Native Rust crate |
| **TypeScript/WASM** | `@polyglot-sql/sdk` on npm | WASM module + JS wrapper |
| **Python** | `polyglot-sql` on PyPI | PyO3 native extension |
| **Go** | `github.com/tobilg/polyglot/packages/go` | PureGo wrapper over C FFI |
| **C FFI** | Built from `polyglot-sql-ffi` | `.so` / `.dylib` / `.dll` + `.a` / `.lib` + header |

---

## 2. Core Philosophy & Design Principles

1. **Pipeline architecture**: SQL → Tokenize → Parse → AST → Transform → Generate → SQL string. Each stage is independently configurable per dialect.

2. **Ported from Python sqlglot**: The Rust implementation is a faithful port of the Python `sqlglot` library, maintaining compatibility with its test fixtures (10,220+ fixture cases at 100% pass rate). The architecture, expression types, transformation rules, and dialect behaviors mirror the Python original.

3. **No runtime database connection**: Polyglot never connects to a database. It operates purely on SQL strings and ASTs. This makes it safe for sandboxed environments (WASM, serverless) and suitable for build-time / CI-time SQL analysis.

4. **Feature-gated compilation**: Each dialect is behind a Cargo feature flag (`dialect-postgresql`, `dialect-mysql`, etc.), so users compiling for constrained targets (WASM) can include only what they need. The `default` feature set includes everything.

5. **Stack safety**: The `stacker` feature (default-on for native builds) grows the stack on deeply nested inputs, preventing stack overflow from pathological SQL. WASM builds opt out since `stacker` doesn't work there.

6. **Guard rails**: Format/guard options limit input size (16 MiB default), token count (1M), AST node count (1M), and set-operation chain depth (256) to prevent resource exhaustion.

7. **Performance-first**: Built in Rust for speed. The project's benchmarks report a geometric-mean speedup of 8.70× over Python `sqlglot`, ranging from 1.6× (simple) to 19× (complex) for transpilation and 77–101× for generation. The WASM build targets near-native performance in browsers.

---

## 3. How It Differs from Database Abstraction Layers

**Critical distinction**: Polyglot is a **SQL dialect transpiler**, not a database abstraction layer. It does not:

- Connect to databases
- Execute queries
- Manage connection pools
- Handle migrations (no `CREATE TABLE` schema evolution management)
- Map Rust types to database types
- Provide an ORM-like interface
- Handle async I/O

Instead, it focuses purely on **SQL text manipulation**: parsing, analyzing, transforming, and generating SQL strings. This makes it complementary to (not competing with) libraries like Diesel, SQLx, or SeaORM.

---

## 4. Performance Characteristics

From the project's benchmark suite (polyglot-sql v0.1.2 vs sqlglot v28.10.1):

| Operation | Speedup Range |
|---|---|
| Parse (SQL → AST) | 10–13× faster |
| Generate (AST → SQL) | 77–101× faster |
| Roundtrip (parse → generate → re-parse) | 13–15× faster |
| Transpile (full cross-dialect) | 1.6× (simple) to 19× (complex BigQuery→Snowflake) |
| Geometric mean | **8.70×** |

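The table summarises the individual speedups with a geometric mean, which is the usual choice for averaging ratios because one very large ratio (such as the generation numbers) does not dominate it. A minimal sketch of that calculation follows; the function name `geometric_mean` and the sample inputs are illustrative assumptions, not values or code from the project's benchmark suite.

```rust
/// Geometric mean of a slice of positive speedup ratios.
/// Returns `None` for an empty slice or if any ratio is not positive.
/// (Hypothetical helper for illustration only.)
fn geometric_mean(ratios: &[f64]) -> Option<f64> {
    if ratios.is_empty() || ratios.iter().any(|&r| r <= 0.0) {
        return None;
    }
    // Average the natural logs, then exponentiate: exp(mean(ln r_i)).
    let log_sum: f64 = ratios.iter().map(|r| r.ln()).sum();
    Some((log_sum / ratios.len() as f64).exp())
}
```

For example, speedups of 2× and 8× have a geometric mean of 4×, whereas their arithmetic mean would be 5×.
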
Parse benchmarks (v0.4.x, native Rust):

| Query | Mean |
|---|---|
| short (SELECT a, b, c) | 51.28 μs |
| medium (5 cols, JOIN, GROUP BY) | 259.61 μs |
| complex (3 CTEs, subquery) | 268.59 μs – 1.03 ms |

---

## 5. Project Maturity Indicators

| Indicator | Status |
|---|---|
| **Version** | 0.4.4 (pre-1.0, active development) |
| **Test coverage** | 18,745 test cases at 100% pass rate |
| **crates.io downloads** | ~4,738 total (as of mid-2026) |
| **Dependent crates** | 2 (via entdb) |
| **Release cadence** | Frequent patch releases (0.4.2, 0.4.3, 0.4.4 in quick succession) |
| **Source code size** | ~241K lines of Rust in core crate |
| **Fuzzing** | Supported via `cargo +nightly fuzz` |
| **CI** | Full test suite + FFI + Python + WASM |
| **Documentation** | Rust API docs (docs.rs), TypeScript docs, Python docs, playground |
| **Breaking changes** | Possible before 1.0; semver suggests API instability |

---

## 6. License

- **MIT License** for the Polyglot code itself
- **sqlglot MIT License** for the test fixtures derived from the Python project
- Both are permissive, suitable for commercial use

---

## References

- <https://github.com/tobilg/polyglot>: Main repository
- <https://crates.io/crates/polyglot-sql>: Rust crate on crates.io
- <https://www.npmjs.com/package/@polyglot-sql/sdk>: TypeScript SDK on npm
- <https://pypi.org/project/polyglot-sql/>: Python bindings on PyPI
- <https://docs.rs/polyglot-sql/latest/polyglot_sql/>: Rust API documentation
- <https://polyglot-playground.gh.tobilg.com/>: Interactive playground
- <https://github.com/tobymao/sqlglot>: Original Python inspiration

720
docs/research/references/polyglot/02_architecture.md
Normal file
@@ -0,0 +1,720 @@
# Polyglot: Architecture Deep Dive

---

## 1. Workspace Structure

The repository is organized as a Cargo workspace with 5 crates and supporting packages:

```
polyglot/
├── crates/
│   ├── polyglot-sql/                        # Core Rust library (~241K LOC)
│   │   └── src/
│   │       ├── lib.rs                       # Public API, top-level functions
│   │       ├── tokens.rs                    # Tokenizer (lexer)
│   │       ├── parser.rs                    # Recursive-descent parser (~62K LOC)
│   │       ├── expressions.rs               # AST node types (~15K LOC)
│   │       ├── generator.rs                 # SQL code generator (~39K LOC)
│   │       ├── dialects/                    # 33 dialect implementations
│   │       │   ├── mod.rs                   # Dialect trait, Dialect struct, CustomDialectBuilder
│   │       │   ├── generic.rs               # Base/standard SQL dialect
│   │       │   ├── postgres.rs              # PostgreSQL (~1.9K LOC)
│   │       │   ├── mysql.rs                 # MySQL
│   │       │   ├── sqlite.rs                # SQLite
│   │       │   ├── bigquery.rs              # BigQuery
│   │       │   └── ... (32 total)
│   │       ├── builder.rs                   # Fluent query builder API
│   │       ├── transforms.rs                # Cross-dialect transform functions
│   │       ├── validation.rs                # Syntax + semantic validation
│   │       ├── schema.rs                    # Schema representation
│   │       ├── scope.rs                     # Scope analysis
│   │       ├── resolver.rs                  # Column resolution
│   │       ├── lineage.rs                   # Column lineage tracking
│   │       ├── openlineage.rs               # OpenLineage payload generation
│   │       ├── diff.rs                      # AST diff (ChangeDistiller algorithm)
│   │       ├── planner.rs                   # Logical query plan
│   │       ├── optimizer/                   # Query optimizer modules
│   │       │   ├── annotate_types.rs        # Type annotation
│   │       │   ├── qualify_columns.rs       # Column qualification
│   │       │   ├── qualify_tables.rs        # Table qualification
│   │       │   ├── pushdown_predicates.rs
│   │       │   ├── pushdown_projections.rs
│   │       │   ├── eliminate_joins.rs
│   │       │   ├── eliminate_ctes.rs
│   │       │   ├── simplify.rs
│   │       │   └── ...
│   │       ├── traversal.rs                 # DFS/BFS visitors, AST predicates
│   │       ├── ast_transforms.rs            # AST manipulation utilities
│   │       ├── error.rs                     # Error types
│   │       └── time.rs                      # Time format conversion
│   ├── polyglot-sql-function-catalogs/      # Optional dialect function catalogs
│   ├── polyglot-sql-wasm/                   # WASM bindings (wasm-pack)
│   ├── polyglot-sql-ffi/                    # C FFI bindings (cbindgen)
│   └── polyglot-sql-python/                 # Python bindings (PyO3 + maturin)
├── packages/
│   ├── sdk/                                 # TypeScript SDK (@polyglot-sql/sdk)
│   ├── go/                                  # Go SDK (PureGo wrapper over FFI)
│   ├── documentation/                       # TypeScript API docs site
│   ├── playground/                          # Browser playground (React 19, Vite)
│   └── python-docs/                         # Python API docs
├── examples/
│   ├── rust/                                # Rust usage example
│   ├── typescript/                          # TypeScript SDK example
│   └── c/                                   # C FFI usage example
└── tools/
    ├── sqlglot-compare/                     # Fixture extraction & comparison
    └── bench-compare/                       # Performance benchmarks
```

---

## 2. Data Flow Pipeline

```
┌──────────────────────────────────────────────────────────────────────┐
│  SQL String (source dialect)                                         │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Tokenizer (tokens.rs)                                               │
│  • Dialect-specific lexing rules (quotes, comments, keywords)        │
│  • Configurable via TokenizerConfig per dialect                      │
│  • Produces Vec<Token> with type, text, and Span (line/col/offset)   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Parser (parser.rs, ~62K LOC)                                        │
│  • Recursive-descent with precedence climbing                        │
│  • Dialect-aware parsing (custom keywords, syntax rules)             │
│  • Produces Expression AST tree                                      │
│  • Stack safety via `stacker` feature (default-on)                   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Expression AST (expressions.rs)                                     │
│  • Single tagged enum with 150+ variants                             │
│  • Each variant has its own struct (Select, Insert, Function, etc.)  │
│  • Box<Variant> keeps enum size to 2 words (tag + pointer)           │
│  • Serializable via serde (derive Serialize/Deserialize)             │
│  • Optional TypeScript type generation via `ts-rs` feature flag      │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                      ┌────┴────┐
                      │         │
            ┌─────────┘         └──────────────┐
            │                                  │
            ▼                                  ▼
┌────────────────────────┐   ┌────────────────────────────────────┐
│  Transform Pipeline    │   │  Semantic / Analysis Modules       │
│  (transpile path)      │   │  • validation.rs → syntax checks   │
│                        │   │  • schema.rs → column/type lookup  │
│  1. preprocess()       │   │  • scope.rs → scope analysis       │
│    (whole-tree rewrites│   │  • resolver.rs → column resolution │
│    like eliminate_     │   │  • lineage.rs → column lineage     │
│    qualify)            │   │  • openlineage.rs → OL payloads    │
│                        │   │  • optimizer/ → query optimization │
│  2. transform_expr()   │   │  • diff.rs → AST diff              │
│    (per-node rewrites  │   │  • planner.rs → logical plan DAG   │
│    per dialect)        │   │  • traversal.rs → DFS/BFS visitors │
│                        │   │                                    │
│  3. Generator          │   │                                    │
│    (AST → SQL string)  │   │                                    │
└───────────┬────────────┘   └────────────────────────────────────┘
            │
            ▼
┌──────────────────────────────────────────────────────────────────────┐
│  SQL String (target dialect)                                         │
└──────────────────────────────────────────────────────────────────────┘
```

---

## 3. Core Abstractions

### 3.1 Expression AST

The central type is `Expression`, a large tagged enum with one variant per SQL construct:

```rust
pub enum Expression {
    // Literals
    Literal(Box<Literal>),
    Boolean(BooleanLiteral),
    Null(Null),

    // Identifiers
    Identifier(Identifier),
    Column(Box<Column>),
    Table(Box<TableRef>),
    Star(Star),

    // Queries
    Select(Box<Select>),
    Union(Box<Union>),
    Intersect(Box<Intersect>),
    Except(Box<Except>),
    Subquery(Box<Subquery>),

    // DML
    Insert(Box<Insert>),
    Update(Box<Update>),
    Delete(Box<Delete>),
    Copy(Box<CopyStmt>),

    // Binary/Unary operators
    And(Box<BinaryOp>),
    Or(Box<BinaryOp>),
    Add(Box<BinaryOp>),
    Eq(Box<BinaryOp>),
    // ... 30+ operator variants

    // Functions
    Function(Box<Function>),
    AggregateFunction(Box<AggregateFunction>),
    WindowFunction(Box<WindowFunction>),

    // Clauses
    From(Box<From>),
    Join(Box<Join>),
    Where(Box<Where>),
    OrderBy(Box<OrderBy>),
    // ...

    // ~150 total variants
}
```

Key design choices:
- **Boxed variants**: Most variants wrap their payload in `Box` to keep `size_of::<Expression>()` at 2 words (16 bytes on 64-bit).
- **Serde support**: `#[derive(Serialize, Deserialize)]` for JSON serialization across FFI/WASM boundaries.
- **TypeScript types**: Optional `ts-rs` feature generates TypeScript interfaces.
- **Convenience methods**: `Expression::column()`, `Expression::number()`, `Expression::sql()`, `Expression::sql_for()`.

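The effect of boxing variant payloads can be checked with `std::mem::size_of`. The sketch below uses two toy enums (`Inline` and `Boxed`, both hypothetical and much smaller than the real `Expression`) to show that storing a large struct inline makes every enum value as large as its biggest variant, while boxing keeps the enum at a tag plus a pointer.

```rust
use std::mem::size_of;

// A deliberately large payload, standing in for a node like `Select`.
#[allow(dead_code)]
struct ToySelect {
    columns: Vec<String>,
    from: Option<String>,
    limit: Option<u64>,
}

// Payload stored inline: the enum is at least as large as `ToySelect`.
#[allow(dead_code)]
enum Inline {
    Select(ToySelect),
    Null,
}

// Payloads behind `Box`: each variant holds only a pointer.
#[allow(dead_code)]
enum Boxed {
    Select(Box<ToySelect>),
    Literal(Box<String>),
    Null,
}

/// Returns `(size of Inline, size of Boxed)` in bytes.
fn enum_sizes() -> (usize, usize) {
    (size_of::<Inline>(), size_of::<Boxed>())
}
```

The trade-off is one extra heap allocation and pointer hop per node in exchange for cheap moves of enum values. Note that any variant that stores a multi-word payload without `Box` would raise the size of the whole enum.
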
### 3.2 DialectType Enum

```rust
pub enum DialectType {
    Generic, PostgreSQL, MySQL, BigQuery, Snowflake, DuckDB, SQLite,
    Hive, Spark, Trino, Presto, Redshift, TSQL, Oracle, ClickHouse,
    Databricks, Athena, Teradata, Doris, StarRocks, Materialize,
    RisingWave, SingleStore, CockroachDB, TiDB, Druid, Solr, Tableau,
    Dune, Fabric, Drill, Dremio, Exasol, DataFusion,
}
```

- Implements `FromStr` with aliases (e.g., `"mssql"` → `TSQL`, `"cockroach"` → `CockroachDB`)
- Each variant maps to a feature-gated dialect module
- Custom dialects can be registered at runtime via `CustomDialectBuilder`

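Alias handling through `FromStr` can be sketched with a small stand-in enum. `ToyDialect` below is hypothetical; only the `"mssql"` and `"cockroach"` aliases come from the text above, while the `"postgres"` alias, the case-insensitive matching, and the `String` error type are assumptions made for illustration.

```rust
use std::str::FromStr;

#[allow(dead_code, clippy::upper_case_acronyms)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyDialect {
    Generic,
    PostgreSQL,
    TSQL,
    CockroachDB,
}

impl FromStr for ToyDialect {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Normalise case and surrounding whitespace before matching (assumption).
        match s.trim().to_ascii_lowercase().as_str() {
            "generic" => Ok(Self::Generic),
            // "postgres" is an assumed alias, not confirmed by the source.
            "postgresql" | "postgres" => Ok(Self::PostgreSQL),
            "tsql" | "mssql" => Ok(Self::TSQL),
            "cockroachdb" | "cockroach" => Ok(Self::CockroachDB),
            other => Err(format!("unknown dialect: {other}")),
        }
    }
}
```

With this in place, `"mssql".parse::<ToyDialect>()` yields `Ok(ToyDialect::TSQL)`, so configuration files and CLI flags can accept whichever name users know a database by.
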
### 3.3 DialectImpl Trait

```rust
pub trait DialectImpl {
    fn dialect_type(&self) -> DialectType;
    fn tokenizer_config(&self) -> TokenizerConfig { /* default */ }
    fn generator_config(&self) -> GeneratorConfig { /* default */ }
    fn generator_config_for_expr(&self, _expr: &Expression) -> GeneratorConfig { /* default */ }
    fn transform_expr(&self, expr: Expression) -> Result<Expression> { Ok(expr) }
    fn preprocess(&self, expr: Expression) -> Result<Expression> { Ok(expr) }
}
```

Each dialect implements this trait to provide:
1. **Tokenizer config**: Identifier quoting characters, string delimiters, keyword overrides, comment styles, hex number support
2. **Generator config**: 30+ flags controlling SQL output (identifier quote style, function casing, `LIMIT` vs `TOP` vs `FETCH FIRST`, etc.)
3. **Per-node transform**: Dialect-specific expression rewrites (e.g., PostgreSQL transforms `IFNULL` → `COALESCE`, SQLite transforms `TRY_CAST` → `CAST`)
4. **Whole-tree preprocess**: Structural rewrites that need full-tree context (e.g., eliminating `QUALIFY` for dialects that don't support it)

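The per-node transform hook can be illustrated with a reduced model of the trait. Everything below (`Expr`, `ToyDialectImpl`, `ToyPostgres`, `transform_tree`) is a hypothetical simplification: the real `transform_expr` returns `Result<Expression>` and works on the full AST, whereas this sketch returns the node directly and knows only columns and function calls.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Func { name: String, args: Vec<Expr> },
}

trait ToyDialectImpl {
    /// Per-node rewrite; the default leaves the node unchanged.
    fn transform_expr(&self, expr: Expr) -> Expr {
        expr
    }
}

struct ToyPostgres;

impl ToyDialectImpl for ToyPostgres {
    fn transform_expr(&self, expr: Expr) -> Expr {
        match expr {
            // Rewrite IFNULL(a, b) to COALESCE(a, b); arguments are kept as-is.
            Expr::Func { name, args } if name.eq_ignore_ascii_case("IFNULL") => {
                Expr::Func { name: "COALESCE".to_string(), args }
            }
            other => other,
        }
    }
}

/// Apply `transform_expr` bottom-up: children first, then the node itself.
fn transform_tree(dialect: &dyn ToyDialectImpl, expr: Expr) -> Expr {
    let rebuilt = match expr {
        Expr::Func { name, args } => Expr::Func {
            name,
            args: args.into_iter().map(|a| transform_tree(dialect, a)).collect(),
        },
        leaf => leaf,
    };
    dialect.transform_expr(rebuilt)
}
```

Because the default method is the identity, a dialect only overrides the hooks it needs, which matches the shape of the trait shown above.
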
### 3.4 Dialect Struct (High-Level API)

```rust
pub struct Dialect {
    dialect_type: DialectType,
    tokenizer: Tokenizer,
    generator_config: Arc<GeneratorConfig>,
    transformer: Box<dyn Fn(Expression) -> Result<Expression> + Send + Sync>,
    generator_config_for_expr: Option<Box<dyn Fn(&Expression) -> GeneratorConfig + Send + Sync>>,
    custom_preprocess: Option<Box<dyn Fn(Expression) -> Result<Expression> + Send + Sync>>,
}
```

The `Dialect` struct bundles all dialect-specific state and provides the primary API:

```rust
// `dialect` is a `Dialect` instance for the source dialect.

// Parse SQL
let ast = dialect.parse("SELECT 1")?;

// Generate SQL from AST
let sql = dialect.generate(&ast[0])?;

// Transpile between dialects
let results = dialect.transpile("SELECT IFNULL(a,b) FROM t", DialectType::PostgreSQL)?;

// Tokenize
let tokens = dialect.tokenize("SELECT 1")?;
```

### 3.5 CustomDialectBuilder

For runtime-extensible dialect support:

```rust
use polyglot_sql::dialects::{CustomDialectBuilder, Dialect, DialectType};
use polyglot_sql::generator::NormalizeFunctions;

// Register a custom dialect inheriting from PostgreSQL
CustomDialectBuilder::new("my_postgres")
    .based_on(DialectType::PostgreSQL)
    .generator_config_modifier(|gc| {
        gc.normalize_functions = NormalizeFunctions::Lower;
    })
    .register()?;

let d = Dialect::get_by_name("my_postgres").unwrap();
// Use like any built-in dialect
```

---

## 4. Dialect Implementation Details

### 4.1 PostgreSQL (`postgres.rs`, ~1,879 LOC)

**Tokenizer:**
- `$$` string literals (dollar-quoting)
- Double-quote identifier quoting
- Nested block comments
- `EXEC` treated as generic command

**Generator config highlights:**
- `identifier_quote: '"'` (double quotes)
- `single_string_interval: true` (`INTERVAL '1 day'`)
- `parameter_token: "$"` (`$1`, `$2` placeholders)
- `supports_select_into: true`
- `supports_window_exclude: true`
- `can_implement_array_any: true`

**Transform examples:**
- `IFNULL(a, b)` → `COALESCE(a, b)`
- `RAND()` → `RANDOM()`
- `DATEDIFF(day, a, b)` → `CAST(b - a AS INT)` (date subtraction)
- `JSON_EXTRACT(a, '$.x')` → `a #> '{x}'` (path operator)
- `JSON_EXTRACT_SCALAR(a, '$.x')` → `a #>> '{x}'`
- `DATE_ADD` / `DATE_SUB` → `+` / `-` interval arithmetic
- Type mappings: `TINYINT` → `SMALLINT`, `FLOAT` → `REAL`, `DOUBLE` → `DOUBLE PRECISION`
- `ILIKE` preserved (native PostgreSQL)
- `RegexpLike` → `~` operator, `RegexpILike` → `~*` operator

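The type mappings listed above amount to a lookup from source type name to PostgreSQL type name. A minimal sketch, assuming a free function named `map_type_for_postgres` (hypothetical, not the crate's API) that works on type names as strings and leaves unknown types untouched:

```rust
/// Map a source type name to its PostgreSQL equivalent.
/// Only the three mappings named in the text are covered; anything else
/// is returned unchanged.
fn map_type_for_postgres(source_type: &str) -> String {
    match source_type.to_ascii_uppercase().as_str() {
        "TINYINT" => "SMALLINT".to_string(),
        "FLOAT" => "REAL".to_string(),
        "DOUBLE" => "DOUBLE PRECISION".to_string(),
        _ => source_type.to_string(),
    }
}
```

In the real library the mapping operates on typed AST nodes rather than strings, so parameters such as precision and length travel with the type.
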
### 4.2 SQLite (`sqlite.rs`, ~750 LOC)

**Tokenizer:**
- Supports `"`, `[`, `` ` `` as identifier quote characters
- No nested comments
- Hex number literals (`0xCC`)

**Generator config:**
- `identifier_quote: '"'` (double quotes)
- `supports_table_alias_columns: false`
- `json_key_value_pair_sep: ","` (comma-style `JSON_OBJECT`)

**Transform examples:**
- `NVL(a, b)` → `IFNULL(a, b)`
- `TRY_CAST(x AS t)` → `CAST(x AS t)` (no try-cast)
- `RANDOM()` → function
- `ILIKE` → `LOWER(left) LIKE LOWER(right)` (no native ILIKE)
- `CountIf(cond)` → `SUM(IIF(cond, 1, 0))`
- `CEIL(x)` → function form
- `DATE_TRUNC(unit, col)` → various strftime patterns
- `DATE_DIFF` → `julianday()` difference patterns

### 4.3 MySQL (`mysql.rs`)

**Tokenizer:** Backtick identifiers, `#` comments
**Generator:** Backtick quoting, `LIMIT` syntax, `CONCAT()` instead of `||`
**Transforms:** `COALESCE(a,b)` ← `IFNULL(a,b)`, `||` → `CONCAT()` (string concat), etc.

### 4.4 BigQuery (`bigquery.rs`)

**Tokenizer:** Backtick identifiers, `QUALIFY` keyword
**Generator:** Backtick quoting, `STRUCT` types, `QUALIFY` clause, `DATE_DIFF` syntax
**Transforms:** Complex date/timestamp function mappings, `UNNEST` handling, `APPROX_COUNT_DISTINCT` preserved (native BigQuery)

### 4.5 How Transpilation Works

The full transpilation pipeline:

```
Input SQL (source dialect)
        │
        ▼
Source Dialect Tokenizer
        │
        ▼
Parser (dialect-aware)
        │
        ▼
Expression AST
        │
        ▼
Source Dialect::preprocess()        ← whole-tree rewrites
        │
        ▼
Source Dialect::transform_expr()    ← per-node rewrites (recursive, bottom-up)
        │
        ▼
Normalized AST
        │
        ▼
Target Dialect Generator
        │
        ▼
Output SQL (target dialect)
```

The transform pipeline uses an explicit task stack (not recursive calls) for the hot paths to avoid stack overflow. The `stacker` crate provides additional stack-growth protection.

Key cross-dialect transforms include:
- Function renaming: `IFNULL` ↔ `COALESCE` ↔ `NVL`, `DATEDIFF` ↔ date arithmetic, `STRING_AGG` ↔ `GROUP_CONCAT`
- Type mapping: `TINYINT` ↔ `SMALLINT`, `FLOAT` ↔ `REAL`, `JSON` ↔ `JSONB`
- Syntax conversion: `LIMIT` ↔ `TOP` ↔ `FETCH FIRST`, `||` (concat) ↔ `CONCAT()`, `SELECT INTO` ↔ `CREATE TABLE AS`
- Boolean handling: `BOOL_AND`/`BOOL_OR` ↔ `MIN`/`MAX`-over-`CASE`
- JSON operators: `JSON_EXTRACT` ↔ `#>`/`#>>` ↔ `->`/`->>` (PostgreSQL arrow syntax)

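The idea of replacing recursion with an explicit task stack can be sketched on a toy tree. The `Node` type and `rename_function` below are hypothetical; the sketch performs a simple in-place function rename (one of the transforms listed above) and is not the crate's actual traversal code, which the source does not show.

```rust
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Node {
    Column(String),
    Func { name: String, args: Vec<Node> },
}

/// Rename every function called `from` to `to` without recursive calls.
/// Pending nodes live in a heap-allocated `Vec`, so nesting depth does not
/// consume call-stack frames. Returns the number of renamed nodes.
fn rename_function(root: &mut Node, from: &str, to: &str) -> usize {
    let mut renamed = 0;
    let mut stack: Vec<&mut Node> = vec![root];
    while let Some(node) = stack.pop() {
        if let Node::Func { name, args } = node {
            if name.eq_ignore_ascii_case(from) {
                *name = to.to_string();
                renamed += 1;
            }
            // Schedule the children instead of recursing into them.
            stack.extend(args.iter_mut());
        }
    }
    renamed
}
```

A rename does not depend on visit order, so a plain stack is enough here. Rewrites that must see already-transformed children (bottom-up) need a two-phase stack that records whether a node is being entered or exited.
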
---

## 5. Fluent Builder API

The builder module (`builder.rs`, ~3.3K LOC) provides a type-safe, ergonomic way to construct SQL expressions without string interpolation:

```rust
use polyglot_sql::builder::*;

// SELECT id, name FROM users WHERE age > 18 ORDER BY name LIMIT 10
let expr = select(["id", "name"])
    .from("users")
    .where_(col("age").gt(lit(18)))
    .order_by(["name"])
    .limit(10)
    .build();

// INSERT
let ins = insert_into("users")
    .columns(["id", "name"])
    .values([lit(1), lit("Alice")])
    .build();

// CASE expression
let expr = case()
    .when(col("x").gt(lit(0)), lit("positive"))
    .else_(lit("non-positive"))
    .build();

// Set operations
let expr = union_all(
    select(["id"]).from("a"),
    select(["id"]).from("b"),
).order_by(["id"]).limit(5).build();
```

Expression helpers:
- `col("users.id")`: column reference (splits on last `.`)
- `lit(42)`, `lit("hello")`, `lit(3.14)`, `lit(true)`: literals
- `func("COALESCE", [col("a"), col("b")])`: function calls
- Operator chain: `col("age").gte(lit(18)).and(col("status").eq(lit("active")))`

The builder generates an `Expression` AST that can then be serialized to any dialect via `generate()`.

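Why one AST can serve several dialects is easiest to see with identifier quoting. The sketch below is a toy model: `ToySelect`, `ToyGeneratorConfig`, `quote_ident`, and `generate` are hypothetical names, and only the `identifier_quote` idea (double quotes for PostgreSQL, backticks for MySQL) is taken from the dialect notes in section 4.

```rust
/// A tiny stand-in for a SELECT node in the AST.
struct ToySelect {
    columns: Vec<String>,
    table: String,
    limit: Option<u64>,
}

/// Stand-in for the dialect's generator configuration.
#[derive(Clone, Copy)]
struct ToyGeneratorConfig {
    identifier_quote: char,
}

fn quote_ident(name: &str, q: char) -> String {
    // Double any embedded quote character, the usual SQL escaping rule.
    let escaped = name.replace(q, &format!("{q}{q}"));
    format!("{q}{escaped}{q}")
}

/// Render the same tree with dialect-specific identifier quoting.
fn generate(select: &ToySelect, cfg: ToyGeneratorConfig) -> String {
    let q = cfg.identifier_quote;
    let cols: Vec<String> = select.columns.iter().map(|c| quote_ident(c, q)).collect();
    let mut sql = format!("SELECT {} FROM {}", cols.join(", "), quote_ident(&select.table, q));
    if let Some(n) = select.limit {
        sql.push_str(&format!(" LIMIT {n}"));
    }
    sql
}
```

The tree is built once; only the configuration passed to the generator changes between targets, which is the separation the builder and `generate()` rely on.
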
---

## 6. Validation and Schema-Aware Analysis

### 6.1 Syntax Validation

```rust
use polyglot_sql::{validate, DialectType};

let result = validate("SELECT * FORM users", DialectType::Generic);
// result.valid == false
// result.errors contain line/column/message/error codes
```

Error codes:
- `E001`: Syntax error
- `E002`: Tokenization error
- `E003`: Parse error
- `E004`: Invalid expression (not a valid statement)
- `E005`: Trailing comma in strict mode

### 6.2 Schema-Aware Validation

```rust
use polyglot_sql::{
    validate_with_schema, DialectType, SchemaColumn, SchemaTable,
    SchemaValidationOptions, ValidationSchema,
};

let schema = ValidationSchema {
    strict: Some(true),
    tables: vec![
        SchemaTable {
            name: "users".into(),
            columns: vec![
                SchemaColumn { name: "id".into(), data_type: "integer".into(), nullable: Some(false), primary_key: true, unique: false, references: None },
                SchemaColumn { name: "email".into(), data_type: "varchar".into(), nullable: Some(false), primary_key: false, unique: true, references: None },
            ],
            // ...
        },
    ],
};

let opts = SchemaValidationOptions { check_types: true, check_references: true, strict: None, semantic: true };
let result = validate_with_schema("SELECT id FROM users WHERE email = 1", DialectType::Generic, &schema, &opts);
// result.valid == false (type mismatch: email is varchar, compared to integer)
```

Schema-aware error codes:
- `E200`/`E201`: Unknown table/column
- `E210`–`E217`, `W210`–`W216`: Type checks
- `E220`, `E221`, `W220`, `W221`, `W222`: Reference/FK checks

### 6.3 Function Catalogs

Optional feature-gated function catalogs (currently ClickHouse and DuckDB) provide known function signatures for semantic type checking:

```toml
polyglot-sql = { version = "0.4", features = ["function-catalog-clickhouse"] }
```

---

## 7. Column Lineage & OpenLineage

### 7.1 Column Lineage

Trace how columns flow through a query:

```rust
use polyglot_sql::{parse, DialectType};
use polyglot_sql::lineage::get_column_lineage;

let ast = parse("SELECT a + b AS total FROM t", DialectType::Generic).unwrap();
let lineage = get_column_lineage(&ast[0], /* schema */ None, DialectType::Generic);
// lineage tells you that "total" depends on columns "a" and "b" from table "t"
```

### 7.2 OpenLineage Payload Generation

```rust
use polyglot_sql::DialectType;
use polyglot_sql::openlineage::{generate_run_event, OpenLineageOptions, OpenLineageDatasetId};

let opts = OpenLineageOptions {
    dialect: DialectType::PostgreSQL,
    producer: "my-app".into(),
    dataset_namespace: Some("mydb".into()),
    // ...
};
let event = generate_run_event("SELECT * FROM users", &opts)?;
// event is a JSON-serializable OpenLineage RunEvent with columnLineage facets
```

---

## 8. Error Handling

### 8.1 Error Types

```rust
pub enum Error {
    Tokenize { message: String, line: usize, column: usize, start: usize, end: usize },
    Parse { message: String, line: usize, column: usize, start: usize, end: usize },
    Generate(String),
    Unsupported { feature: String, dialect: String },
    Syntax { message: String, line: usize, column: usize, start: usize, end: usize },
    Internal(String),
}
```

All position-bearing errors include:
- `line`: 1-based line number
- `column`: 1-based column number
- `start` / `end`: byte offsets (0-based, end exclusive)

```rust
let err = Error::parse("Unexpected token", 3, 15, 42, 44);
assert_eq!(err.line(), Some(3));
assert_eq!(err.column(), Some(15));
assert_eq!(err.start(), Some(42));
```

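The relationship between the 0-based byte offsets and the 1-based line/column pair can be made concrete with a small helper. `line_col` below is a hypothetical function, not part of the crate, and it assumes columns are counted in characters; whether the library counts characters or bytes is not stated in the source.

```rust
/// Convert a 0-based byte offset into a 1-based `(line, column)` pair.
/// Returns `None` if the offset is out of range or not on a char boundary.
fn line_col(src: &str, offset: usize) -> Option<(usize, usize)> {
    // `is_char_boundary` is false for offsets past the end of the string.
    if !src.is_char_boundary(offset) {
        return None;
    }
    let mut line = 1;
    let mut column = 1;
    for ch in src[..offset].chars() {
        if ch == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    Some((line, column))
}
```

For the input `"SELECT *\nFORM users"`, byte offset 9 (the start of the misspelled `FORM`) maps to line 2, column 1, which is the kind of position an editor integration would highlight.
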
### 8.2 Validation Errors
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub struct ValidationError {
|
||||||
|
pub message: String,
|
||||||
|
pub line: Option<usize>,
|
||||||
|
pub column: Option<usize>,
|
||||||
|
pub severity: ValidationSeverity, // Error or Warning
|
||||||
|
pub code: String, // e.g., "E001", "E200"
|
||||||
|
pub start: Option<usize>,
|
||||||
|
pub end: Option<usize>,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub struct ValidationResult {
|
||||||
|
pub valid: bool,
|
||||||
|
pub errors: Vec<ValidationError>,
|
||||||
|
}
|
||||||
|
```
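
One way a caller might consume such a result is to split findings by severity before deciding whether to proceed. The sketch below re-declares reduced local mirrors of the structs above so that it compiles without the crate; `triage`, `describe`, `sample_result`, the `W001` code, and the sample messages are all hypothetical. How `valid` relates to warnings is not stated in the source, so the sketch does not rely on it.

```rust
// Local mirrors of the structs shown above, reduced to the fields used here.
#[derive(Debug, Clone)]
enum ValidationSeverity {
    Error,
    Warning,
}

#[derive(Debug, Clone)]
struct ValidationError {
    message: String,
    line: Option<usize>,
    column: Option<usize>,
    severity: ValidationSeverity,
    code: String,
}

struct ValidationResult {
    valid: bool,
    errors: Vec<ValidationError>,
}

/// Format one finding as "CODE line:col message", with "?" for unknown positions.
fn describe(e: &ValidationError) -> String {
    let pos = match (e.line, e.column) {
        (Some(l), Some(c)) => format!("{}:{}", l, c),
        _ => "?".to_string(),
    };
    format!("{} {} {}", e.code, pos, e.message)
}

/// Split findings into (blocking errors, warnings).
fn triage(result: &ValidationResult) -> (Vec<String>, Vec<String>) {
    let mut errors = Vec::new();
    let mut warnings = Vec::new();
    for e in &result.errors {
        match e.severity {
            ValidationSeverity::Error => errors.push(describe(e)),
            ValidationSeverity::Warning => warnings.push(describe(e)),
        }
    }
    (errors, warnings)
}

/// Hand-written sample data standing in for a real validation run.
fn sample_result() -> ValidationResult {
    ValidationResult {
        valid: false,
        errors: vec![
            ValidationError {
                message: "Unknown column 'nmae'".into(),
                line: Some(1),
                column: Some(8),
                severity: ValidationSeverity::Error,
                code: "E200".into(),
            },
            ValidationError {
                message: "SELECT * is discouraged".into(),
                line: None,
                column: None,
                severity: ValidationSeverity::Warning,
                code: "W001".into(),
            },
        ],
    }
}

fn main() {
    let result = sample_result();
    let (errors, warnings) = triage(&result);
    println!("valid={} errors={:?} warnings={:?}", result.valid, errors, warnings);
}
```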

### 8.3 Guard Rail Errors

Format operations enforce configurable guard limits. Exceeding a limit returns a structured error carrying one of these codes:

- `E_GUARD_INPUT_TOO_LARGE`: input exceeds `max_input_bytes`
- `E_GUARD_TOKEN_BUDGET_EXCEEDED`: token count exceeds `max_tokens`
- `E_GUARD_AST_BUDGET_EXCEEDED`: AST node count exceeds `max_ast_nodes`
- `E_GUARD_SET_OP_CHAIN_EXCEEDED`: UNION/INTERSECT/EXCEPT chain exceeds `max_set_op_chain`
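
A caller will usually want to react differently to these codes than to ordinary syntax errors. The sketch below maps the four codes to a hypothetical handling policy; `GuardAction` and `classify_guard_code` are illustrative names, and the sketch assumes the code is available to the caller as a plain string (how the crate surfaces it is not shown here).

```rust
/// What a caller might do when a guard rail trips (hypothetical policy).
#[derive(Debug, PartialEq)]
enum GuardAction {
    /// The input itself is too large; reject it before doing any work.
    RejectInput,
    /// Formatting is too expensive; fall back to the unformatted SQL.
    SkipFormatting,
    /// Not a guard-rail code; handle through the normal error path.
    NotAGuardError,
}

fn classify_guard_code(code: &str) -> GuardAction {
    match code {
        "E_GUARD_INPUT_TOO_LARGE" => GuardAction::RejectInput,
        "E_GUARD_TOKEN_BUDGET_EXCEEDED"
        | "E_GUARD_AST_BUDGET_EXCEEDED"
        | "E_GUARD_SET_OP_CHAIN_EXCEEDED" => GuardAction::SkipFormatting,
        _ => GuardAction::NotAGuardError,
    }
}

fn main() {
    for code in ["E_GUARD_INPUT_TOO_LARGE", "E_GUARD_AST_BUDGET_EXCEEDED", "E001"] {
        println!("{} -> {:?}", code, classify_guard_code(code));
    }
}
```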

---

## 9. AST Traversal & Analysis

### 9.1 Traversal

```rust
use polyglot_sql::{parse, DialectType};
use polyglot_sql::traversal::*;

let ast = parse("SELECT a, b FROM t WHERE x > 1", DialectType::Generic).unwrap();
let columns = get_columns(&ast[0]); // ["a", "b", "x"]
let tables = get_tables(&ast[0]);   // ["t"]
```

Available predicates (70+), including:

- `is_select`, `is_insert`, `is_update`, `is_delete`, `is_ddl`
- `is_join`, `is_where`, `is_group_by`, `is_order_by`, `is_limit`
- `is_function`, `is_aggregate`, `is_subquery`, `is_cte`
- `is_comparison`, `is_logical`, `is_arithmetic`
- `contains_subquery`, `contains_aggregate`, `contains_window_function`

Iterators: `DfsIter`, `BfsIter` for depth-first and breadth-first traversal.
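
The difference between the two traversal orders is easiest to see on a small tree. The sketch below uses a toy `Node` type as a stand-in for the real AST (it is not the crate's `Expression`, and `dfs`/`bfs` are not `DfsIter`/`BfsIter`); it only illustrates the visiting order, assuming a pre-order depth-first walk.

```rust
use std::collections::VecDeque;

/// Toy expression tree; a stand-in for the real AST.
struct Node {
    label: &'static str,
    children: Vec<Node>,
}

fn node(label: &'static str, children: Vec<Node>) -> Node {
    Node { label, children }
}

/// Pre-order depth-first: visit a node, then each child subtree left to right.
fn dfs(root: &Node) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut stack = vec![root];
    while let Some(n) = stack.pop() {
        out.push(n.label);
        // Push children in reverse so the leftmost child is popped first.
        stack.extend(n.children.iter().rev());
    }
    out
}

/// Breadth-first: visit the tree level by level.
fn bfs(root: &Node) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut queue = VecDeque::from([root]);
    while let Some(n) = queue.pop_front() {
        out.push(n.label);
        queue.extend(n.children.iter());
    }
    out
}

/// Roughly the shape of: SELECT a, b FROM t WHERE x > 1
fn sample_tree() -> Node {
    node("select", vec![
        node("projection", vec![node("a", vec![]), node("b", vec![])]),
        node("from", vec![node("t", vec![])]),
        node("where", vec![node(">", vec![node("x", vec![]), node("1", vec![])])]),
    ])
}

fn main() {
    let tree = sample_tree();
    println!("dfs: {:?}", dfs(&tree));
    println!("bfs: {:?}", bfs(&tree));
}
```

Depth-first order keeps each clause's nodes together, which suits rewrites that work clause by clause; breadth-first order reaches the top-level clauses before descending, which suits shallow structural checks.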

### 9.2 AST Transforms

```rust
use polyglot_sql::ast_transforms::*;

// Each call is shown in isolation; `expr` is an already-parsed expression.

// Rename tables
let renamed = rename_tables(expr, &[("old_name", "new_name")]);

// Add WHERE condition
let filtered = add_where(expr, col("active").eq(lit(true)));

// Remove LIMIT/OFFSET
let unlimited = remove_limit_offset(expr);
```

### 9.3 AST Diff

```rust
use polyglot_sql::diff::{diff, diff_with_config, DiffConfig};

let edits = diff(&source_expr, &target_expr, true);
for edit in &edits {
    if edit.is_change() {
        println!("{:?}", edit);
    }
}
```

Uses the ChangeDistiller algorithm with Dice coefficient matching for structural comparison.
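
For intuition about the matching step, the Dice coefficient of two sets A and B is 2·|A ∩ B| / (|A| + |B|). The sketch below computes it over character bigrams of two strings, which is one common way to apply it to text. This is an assumption for illustration: the unit the crate actually compares (bigrams, tokens, or node labels) is not shown here, and `bigrams` and `dice_coefficient` are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Count the character bigrams of `s` as a multiset.
fn bigrams(s: &str) -> HashMap<(char, char), usize> {
    let chars: Vec<char> = s.chars().collect();
    let mut counts = HashMap::new();
    for pair in chars.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

/// Dice coefficient over bigram multisets: 2 * |A ∩ B| / (|A| + |B|).
fn dice_coefficient(a: &str, b: &str) -> f64 {
    let (ba, bb) = (bigrams(a), bigrams(b));
    let total: usize = ba.values().sum::<usize>() + bb.values().sum::<usize>();
    if total == 0 {
        // Neither string has a bigram; treat identical strings as a full match.
        return if a == b { 1.0 } else { 0.0 };
    }
    // Multiset intersection: take the smaller count for each shared bigram.
    let shared: usize = ba
        .iter()
        .map(|(k, &n)| n.min(*bb.get(k).unwrap_or(&0)))
        .sum();
    2.0 * shared as f64 / total as f64
}

fn main() {
    // "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4) = 0.25
    println!("{}", dice_coefficient("night", "nacht"));
    println!("{}", dice_coefficient("SELECT a", "SELECT b"));
}
```

A diff algorithm can then pair up nodes whose similarity exceeds a threshold and report the rest as insertions or removals.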

### 9.4 Logical Planner

```rust
use polyglot_sql::planner::Plan;

let plan = Plan::from_expression(&expr);
// plan.root is a Step DAG
// plan.leaves() returns leaf steps
// plan.dag() returns the dependency graph
```

Step kinds: Scan, Filter, Project, Aggregate, Join, Sort, Limit, etc.

---

## 10. Optimizer Modules

The optimizer is available behind the `semantic` feature flag:

| Module | Purpose |
|---|---|
| `qualify_columns.rs` | Resolve unqualified column references to table.column |
| `qualify_tables.rs` | Expand table names with schema/catalog |
| `annotate_types.rs` | Infer and annotate expression types |
| `pushdown_predicates.rs` | Push WHERE conditions into JOINs |
| `pushdown_projections.rs` | Reduce columns to only what's needed |
| `eliminate_joins.rs` | Remove unnecessary JOINs |
| `eliminate_ctes.rs` | Inline single-use CTEs |
| `simplify.rs` | Simplify boolean expressions, constant folding |
| `normalize.rs` | Expression normalization |
| `canonicalize.rs` | Query canonicalization |
| `subquery.rs` | Subquery analysis |

---

## 11. Async Support

**Polyglot does not use async I/O.** It is a pure computational library, and all of its operations, including `parse()`, `generate()`, `transpile()`, `validate()`, and `format()`, are synchronous and CPU-bound.

This is by design: Polyglot operates on SQL strings in memory, with no network or filesystem I/O. For use in async contexts (Tokio, async-std), callers should use `tokio::task::spawn_blocking()` or similar to offload CPU-heavy parsing/transpilation to a blocking thread pool.
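
The offloading idea can be sketched with the standard library alone. In the example below, `std::thread::spawn` plus a channel plays the role that `tokio::task::spawn_blocking` would play in a Tokio application; `cpu_bound_rewrite` is a hypothetical stand-in for a transpile call, not the crate's API.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a CPU-bound call such as transpilation (hypothetical).
fn cpu_bound_rewrite(sql: &str) -> String {
    sql.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_uppercase()
}

/// Run the CPU-bound work on a worker thread and hand back a receiver,
/// so the calling thread stays free until it actually needs the result.
fn offload(sql: String) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A send error only means the caller dropped the receiver.
        let _ = tx.send(cpu_bound_rewrite(&sql));
    });
    rx
}

fn main() {
    let pending = offload("select  a\nfrom t".to_string());
    // ... the caller can do other work here ...
    let result = pending.recv().expect("worker thread terminated early");
    println!("{}", result);
}
```

In an async runtime the same shape applies, except that the handle returned by the runtime's blocking-task API is awaited instead of calling `recv()`.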

---

## 12. Feature Flags

| Flag | Description | Default |
|---|---|---|
| `all-dialects` | Enable all 32 dialect parsers | ✅ |
| `generate` | SQL generation from AST | ✅ |
| `transpile` | Cross-dialect transpilation (implies `generate`) | ✅ |
| `builder` | Fluent query builder API (implies `generate`) | ✅ |
| `ast-tools` | AST inspection & transform utilities | ✅ |
| `semantic` | Schema, resolver, lineage, optimizer, validation | ✅ |
| `openlineage` | OpenLineage payload generation (implies `semantic`) | ✅ |
| `diff` | AST diff support (implies `generate`) | ✅ |
| `planner` | Logical planning helpers | ✅ |
| `time` | Time-format conversion helpers | ✅ |
| `stacker` | Stack-growth protection for native builds | ✅ |
| `bindings` | TypeScript type generation via `ts-rs` | ❌ |
| `dialect-postgresql` | PostgreSQL dialect only | n/a |
| `dialect-mysql` | MySQL dialect only | n/a |
| ... (one per dialect) | Individual dialect selector | n/a |
| `function-catalog-clickhouse` | ClickHouse function catalog | ❌ |
| `function-catalog-duckdb` | DuckDB function catalog | ❌ |
| `function-catalog-all-dialects` | All function catalogs | ❌ |

Minimal WASM build (for constrained targets):

```toml
polyglot-sql = { version = "0.4", default-features = false, features = ["generate", "transpile", "dialect-postgresql", "dialect-mysql"] }
```

---

## References

- Source code examined: `/workspace/polyglot/crates/polyglot-sql/src/` (~241K LOC)
- Architecture documentation: `/workspace/polyglot/docs/sqlglot-architecture.md`
- Benchmark results: `/workspace/polyglot/docs/benchmark.md`
- README: `/workspace/polyglot/README.md`, `/workspace/polyglot/crates/polyglot-sql/README.md`
- CHANGELOG: `/workspace/polyglot/CHANGELOG.md`

294
docs/research/references/polyglot/03_analysis.md
Normal file
@@ -0,0 +1,294 @@
# Polyglot: Suitability Analysis & Comparisons

---

## 1. What Polyglot Is NOT

Before evaluating suitability, it's essential to understand what Polyglot **does not** do:

| NOT a... | Because |
|---|---|
| **Database driver** | No connection management, no query execution, no result set handling |
| **ORM** | No object-relational mapping, no model definitions, no active record pattern |
| **Migration tool** | No `CREATE TABLE` evolution management, no up/down migrations framework |
| **Type mapper** | No Rust type → SQL type mapping, no `FromRow` derives |
| **Connection pool** | No async I/O, no TCP connections, no TLS |
| **Query executor** | Never connects to a database; operates purely on SQL text |

**Polyglot is a SQL dialect transpiler.** It converts SQL strings between database dialects. Period.

---

## 2. Suitability Assessment for Multi-Database Storage Layer

### 2.1 What Polyglot CAN Do for a Multi-DB Project

| Use Case | Polyglot Support | Maturity |
|---|---|---|
| **SQL dialect translation** | ✅ Core purpose; 32 dialects, with a 100% pass rate on the sqlglot fixture cases | Mature |
| **SQL pretty-printing** | ✅ Built-in format with guard rails | Mature |
| **SQL syntax validation** | ✅ Line/column error positions, error codes | Mature |
| **Schema-aware validation** | ✅ Table/column/type checking with `ValidationSchema` | Moderate |
| **Column lineage tracing** | ✅ `get_column_lineage()` for data lineage | Moderate |
| **OpenLineage payloads** | ✅ `RunEvent` and `DatasetFacet` generation | Early but functional |
| **Query builder** | ✅ Fluent API for SELECT/INSERT/UPDATE/DELETE | Usable but not as rich as query-builder-first libraries |
| **AST diff** | ✅ ChangeDistiller-based structural diff | Functional |
| **Logical planning** | ✅ Basic DAG plan extraction | Early stage |
| **Query optimization** | ✅ Column qualification, predicate pushdown, join elimination | Moderate |
| **Custom dialect registration** | ✅ `CustomDialectBuilder` for runtime extension | Functional |

### 2.2 What Polyglot CANNOT Do for a Multi-DB Project

| Need | Polyglot Support | Alternative |
|---|---|---|
| **Execute queries** | ❌ No | Use sqlx, diesel, or sea-orm |
| **Connection pooling** | ❌ No | Use deadpool, bb8, or sqlx built-in |
| **Async I/O** | ❌ Synchronous only | Wrap in `spawn_blocking()` |
| **Type-safe query building** | ⚠️ Partial (builder API returns an untyped `Expression` AST) | Use diesel or sea-orm for compile-time checks |
| **Schema migration management** | ❌ No | Use diesel migrations, sqlx migrations, or refinery |
| **Row mapping / deserialization** | ❌ No | Use sqlx `FromRow`, diesel `Queryable` |
| **Runtime type mapping** | ⚠️ Limited (DataType enum, no Rust type bridge) | Build your own layer |
| **Database-specific DDL generation** | ⚠️ Parses/generates DDL but no migration framework | Use as a building block |
| **Transaction management** | ❌ No | Use sqlx or diesel |

### 2.3 Integration Pattern: Polyglot as a SQL Dialect Layer

The most natural integration pattern for a multi-database storage layer:

```
┌──────────────────────────────────────────────┐
│              Application Logic               │
├──────────────────────────────────────────────┤
│          Query Builder / ORM Layer           │
│         (diesel / sea-orm / custom)          │
├──────────────────────┬───────────────────────┤
│                      │                       │
│   Polyglot Layer     │   Direct SQL          │
│   (transpile,        │   (no translation     │
│    validate,         │    needed)            │
│    format)           │                       │
├──────────────────────┴───────────────────────┤
│            Database Driver Layer             │
│               (sqlx / diesel)                │
├──────────────────────────────────────────────┤
│  PostgreSQL   │     MySQL     │    SQLite    │
└──────────────────────────────────────────────┘
```

In this pattern, Polyglot sits **above** the database drivers, translating SQL from a canonical dialect to the target database's dialect before execution. It does **not** replace the drivers.
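
The layering can be sketched as two small traits and a struct that wires them together. Everything below is hypothetical scaffolding: `Backend`, `Transpiler`, `Executor`, `StorageLayer`, and the stub implementations are illustrative names, with a real implementation expected to delegate to Polyglot for `transpile` and to a driver such as sqlx for `execute`. The stub transpiler only appends a comment so that the flow is visible.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Backend {
    Postgres,
    MySql,
    Sqlite,
}

/// Translates canonical SQL into the target backend's dialect.
trait Transpiler {
    fn transpile(&self, canonical_sql: &str, target: Backend) -> Result<String, String>;
}

/// Executes already-translated SQL; a real implementation would wrap a driver.
trait Executor {
    fn execute(&mut self, sql: &str) -> Result<String, String>;
}

struct StorageLayer<T: Transpiler> {
    transpiler: T,
    executors: HashMap<Backend, Box<dyn Executor>>,
}

impl<T: Transpiler> StorageLayer<T> {
    /// Translate first, then hand the result to the backend's executor.
    fn run(&mut self, canonical_sql: &str, target: Backend) -> Result<String, String> {
        let sql = self.transpiler.transpile(canonical_sql, target)?;
        let executor = self
            .executors
            .get_mut(&target)
            .ok_or_else(|| format!("no executor registered for {:?}", target))?;
        executor.execute(&sql)
    }
}

/// Stub: tags the SQL with the target instead of really translating it.
struct TaggingTranspiler;

impl Transpiler for TaggingTranspiler {
    fn transpile(&self, canonical_sql: &str, target: Backend) -> Result<String, String> {
        Ok(format!("{} /* {:?} */", canonical_sql, target))
    }
}

/// Stub: echoes the SQL it was asked to run instead of touching a database.
struct EchoExecutor(&'static str);

impl Executor for EchoExecutor {
    fn execute(&mut self, sql: &str) -> Result<String, String> {
        Ok(format!("[{}] {}", self.0, sql))
    }
}

fn demo_layer() -> StorageLayer<TaggingTranspiler> {
    let mut executors: HashMap<Backend, Box<dyn Executor>> = HashMap::new();
    executors.insert(Backend::Postgres, Box::new(EchoExecutor("postgres")));
    executors.insert(Backend::MySql, Box::new(EchoExecutor("mysql")));
    StorageLayer { transpiler: TaggingTranspiler, executors }
}

fn main() {
    let mut layer = demo_layer();
    println!("{:?}", layer.run("SELECT 1", Backend::MySql));
    println!("{:?}", layer.run("SELECT 1", Backend::Sqlite));
}
```

Keeping translation and execution behind separate traits means either side can be swapped or mocked in tests without touching the other.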

---

## 3. Comparison with Other Rust SQL Libraries

### 3.1 Feature Comparison Matrix

| Feature | **Polyglot** | **Diesel** | **SQLx** | **SeaORM** | **sqlparser-rs** |
|---|---|---|---|---|---|
| **Primary Purpose** | SQL transpilation | ORM / query builder | Async DB driver | Async ORM | SQL parsing |
| **SQL Parsing** | ✅ Full AST (200+ node types) | ❌ No (builds SQL from a DSL) | ❌ No | ❌ No | ✅ Full AST |
| **SQL Generation** | ✅ Multi-dialect | ✅ Via DSL | ❌ No | ✅ Via SeaQuery | ⚠️ Limited |
| **Cross-dialect Transpilation** | ✅ 32 dialects | ❌ No | ❌ No | ❌ No | ❌ No |
| **Query Builder** | ⚠️ Fluent, untyped (`Expression` AST) | ✅ Type-safe DSL | ❌ No | ✅ Type-safe | ❌ No |
| **Async I/O** | ❌ No (sync only) | ❌ Sync (async via the separate `diesel-async` crate) | ✅ Native async | ✅ Native async | ❌ No |
| **Type-safe Queries** | ❌ No (runtime) | ✅ Compile-time | ✅ Compile-time checked via `query!()` | ✅ Compile-time | ❌ No |
| **Connection Pool** | ❌ No | ⚠️ Optional, via `r2d2` | ✅ Built-in | ✅ Built-in | ❌ No |
| **Migration Support** | ❌ No | ✅ Built-in | ✅ Built-in (`sqlx migrate`) | ✅ Built-in | ❌ No |
| **Database Execution** | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| **Schema Validation** | ✅ Via ValidationSchema | ✅ Compile-time | ⚠️ Via `query!()` against a development database | ⚠️ Limited | ❌ No |
| **Column Lineage** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **AST Diff** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **Dialects Supported** | 32 | 3 (PostgreSQL, MySQL, SQLite) | N/A | N/A | Several (generic/ANSI, PostgreSQL, MySQL, MSSQL, Snowflake, BigQuery, and others) |
| **License** | MIT | MIT/Apache-2.0 | MIT/Apache-2.0 | MIT/Apache-2.0 | Apache-2.0 |
| **Maturity** | v0.4.4 (pre-1.0) | v2.2 (stable) | v0.8 (stable) | v1.1 (stable) | v0.49 (mature) |

### 3.2 Polyglot vs Diesel

| Aspect | Polyglot | Diesel |
|---|---|---|
| **Philosophy** | Parse any SQL → AST → generate any dialect | Type-safe DSL → SQL for specific databases |
| **Type Safety** | Runtime (untyped AST) | Compile-time (macro-based) |
| **Query Building** | `select(["col"]).from("t").where_(...)` → `Expression` AST | `schema::table::dsl::col.filter(...)` → SQL |
| **Dialect Breadth** | 32 dialects | 3 (PostgreSQL, MySQL, SQLite) |
| **Database Execution** | None (SQL text only) | Full CRUD with connection management |
| **Migrations** | None | Built-in migration framework |
| **When to use** | You need cross-dialect SQL translation, validation, lineage | You need type-safe queries with database execution |

**Verdict**: Polyglot and Diesel are **complementary**, not competing. Use Diesel for type-safe database interaction; use Polyglot when you need to translate SQL between dialects or analyze SQL without executing it.

### 3.3 Polyglot vs SQLx

| Aspect | Polyglot | SQLx |
|---|---|---|
| **Philosophy** | SQL manipulation without execution | Async database driver with compile-time query checking |
| **Async** | Synchronous only | Fully async |
| **Query Checking** | Runtime validation against schema | Compile-time `query!()` macro |
| **Database Support** | 32 dialects (parsing) | PostgreSQL, MySQL, SQLite (execution) |
| **When to use** | SQL transformation/analysis | Database interaction with async Rust |

**Verdict**: SQLx is for executing queries against databases. Polyglot is for transforming SQL text. They solve entirely different problems.

### 3.4 Polyglot vs SeaORM

| Aspect | Polyglot | SeaORM |
|---|---|---|
| **Philosophy** | SQL transpilation | Async ORM built on SQLx |
| **Async** | No | Yes |
| **Model Definition** | None | Entity models via macros |
| **Relationships** | None | Has-one, has-many, many-to-many |
| **When to use** | SQL dialect conversion | Database CRUD with relationships |

**Verdict**: Same as SQLx: complementary, not competing.

### 3.5 Polyglot vs sqlparser-rs

| Aspect | Polyglot | sqlparser-rs |
|---|---|---|
| **Parsing** | ✅ Full (200+ node types) | ✅ Full (generic/ANSI SQL plus several named dialects) |
| **Generation** | ✅ Multi-dialect generation | ⚠️ Limited round-trip |
| **Transpilation** | ✅ Cross-dialect transforms | ❌ No |
| **Dialects** | 32 | Several (generic/ANSI, PostgreSQL, MySQL, MSSQL, Snowflake, BigQuery, and others) |
| **Validation** | ✅ With error positions | ❌ Parse errors only |
| **Builder** | ✅ Fluent API | ❌ No |
| **Lineage** | ✅ Built-in | ❌ No |
| **Diff** | ✅ Built-in | ❌ No |
| **Maturity** | v0.4.4 | v0.49 (more established) |

**Verdict**: sqlparser-rs is a mature, widely used parser that understands several dialects but does not translate between them. Polyglot adds transpilation across 32 dialects, validation, lineage, diff, and a builder API. If you need dialect translation, Polyglot is the clear choice. If you only need to parse SQL and don't need generation or transpilation, sqlparser-rs may suffice with less overhead.

### 3.6 Polyglot vs Python sqlglot

| Aspect | Polyglot (Rust) | sqlglot (Python) |
|---|---|---|
| **Performance** | 8–19× faster (transpile), ~86× faster (generate), per the project's benchmark results | Baseline |
| **Language** | Rust | Python |
| **Feature Parity** | ~95% of sqlglot's transpilation | Full feature set |
| **Optimizer** | Column qualification, predicate pushdown (moderate) | Full optimizer (column pruning, join elimination, etc.) |
| **Execution** | ❌ No | ⚠️ Limited (built-in Python executor over in-memory tables) |
| **Test Compatibility** | 10,220+ sqlglot fixture cases at 100% | Original test suite |
| **Deployment** | Native binary / WASM / Python / Go | Python package |

**Verdict**: Polyglot is the performance-oriented port of sqlglot. It covers the core transpilation use case at near-full feature parity. The Python sqlglot has a more mature optimizer and some execution capabilities, but Polyglot is catching up rapidly (0.4.x adds lineage, OpenLineage, schema validation, and more).

---

## 4. Limitations and Gotchas

### 4.1 Current Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| **Pre-1.0 API** | Breaking changes possible between minor versions | Pin an exact version in `Cargo.toml` (e.g. `=0.4.4`) |
| **No query execution** | Cannot run SQL against databases | Use alongside sqlx/diesel |
| **No async** | Blocking in async contexts | Wrap in `spawn_blocking()` |
| **No migration framework** | Cannot manage schema evolution | Use diesel migrations or refinery |
| **No Rust type mapping** | `DataType` enum doesn't map to Rust types | Build your own type bridge |
| **Builder returns Expression** | Builder doesn't produce type-safe queries | Accept runtime nature; pair with runtime validation |
| **Optimizer is early** | Limited optimization passes vs Python sqlglot | Most useful passes exist (qualify_columns, pushdown_predicates) |
| **WASM lacks `stacker`** | Deeply nested SQL may overflow stack in browser | Set format guard limits; consider web workers |
| **Custom dialects are global** | `CustomDialectBuilder` uses a global `RwLock` registry | Fine for most apps; not ideal for per-request isolation |
| **No prepared statement support** | Cannot generate `?` placeholders for parameterized queries | Build queries as strings; use sqlx for parameterization |

### 4.2 Gotchas

1. **`Dialect::get()` creates a new instance each call**: The `Dialect` struct bundles tokenizer, generator config, and transformer. For hot loops, cache the `Dialect` instance rather than calling `Dialect::get()` repeatedly. (The overhead is minimal but non-zero.)

2. **Transpilation is not always invertible**: Some dialects have features that don't exist in others (e.g., BigQuery's `QUALIFY`, PostgreSQL's `ILIKE`, TSQL's `TOP`). Transpiling `A → B` and then `B → A` may lose information.

3. **Function transformation depth**: The transform pipeline processes per-node bottom-up. Some transformations require multi-pass processing (handled by `preprocess()`), but edge cases may require manual intervention.

4. **AST is not a stable serialization format**: The `Expression` enum and its inner structs may change between versions. If you serialize ASTs to JSON, expect breaking changes across minor versions.

5. **Feature flags are cumulative**: `transpile` implies `generate`, `openlineage` implies `semantic`, etc. For minimal builds, use `default-features = false` and select only what you need.

6. **Global custom dialect registry**: Custom dialects registered via `CustomDialectBuilder::register()` are stored in a global `RwLock<HashMap>`. This means they persist for the lifetime of the process and are visible across threads. Call `unregister_custom_dialect()` to remove them.

7. **Parser is permissive**: The parser accepts many SQL constructs that some databases reject. Validation (via `validate()` or `validate_with_schema()`) can catch some issues, but it's not a substitute for database-level error checking.

8. **No `?` placeholder generation**: Polyglot doesn't generate parameterized query placeholders. For prepared statements, you'll need to handle parameter binding yourself with your database driver.

9. **Schema validation requires manual schema definition**: The `ValidationSchema` struct must be populated manually; there's no automatic schema introspection from a live database.

---

## 5. Production-Readiness Assessment

### 5.1 Strengths

| Area | Rating | Notes |
|---|---|---|
| **Transpilation accuracy** | ⭐⭐⭐⭐⭐ | 10,220+ fixture cases at 100% pass rate |
| **Performance** | ⭐⭐⭐⭐⭐ | 8–19× faster than Python sqlglot |
| **Dialect coverage** | ⭐⭐⭐⭐⭐ | 32 dialects covering all major databases |
| **API ergonomics** | ⭐⭐⭐⭐ | Clean public API; builder is pleasant |
| **Error reporting** | ⭐⭐⭐⭐ | Line/column/byte-offset positions |
| **WASM support** | ⭐⭐⭐⭐ | Full feature set in browser |
| **Multi-language bindings** | ⭐⭐⭐⭐⭐ | Rust, TypeScript, Python, Go, C FFI |
| **Documentation** | ⭐⭐⭐ | Rust API docs exist; could use more guides |
| **Test coverage** | ⭐⭐⭐⭐⭐ | 18,745 test cases |
| **Fuzzing** | ⭐⭐⭐⭐ | Supported via `cargo fuzz` |

### 5.2 Risks

| Risk | Severity | Mitigation |
|---|---|---|
| **Pre-1.0 breaking changes** | Medium | Pin version; monitor CHANGELOG |
| **Single maintainer** | Medium | Code is well-structured; community could fork |
| **Limited optimizer** | Low | Core passes exist; Python sqlglot is reference |
| **No query execution** | Low (by design) | Combine with sqlx/diesel |
| **WASM stack limits** | Low | Set guard rails; use web workers |

### 5.3 Overall Assessment

**Polyglot is production-viable for SQL transpilation and analysis tasks**, with caveats:

- ✅ **Use for**: SQL dialect translation, SQL linting/validation, column lineage, pretty-printing, AST analysis, cross-database query migration
- ⚠️ **Use with caution for**: Query building (no type safety), optimization (partial coverage)
- ❌ **Don't use for**: Database execution, connection management, migrations, type-safe queries

For a multi-database storage layer, the recommended pattern is:

```
Application → Polyglot (transpile SQL to target dialect) → sqlx/diesel (execute)
```

---

## 6. Recommendation

### When to Adopt Polyglot

1. **You need to support multiple database backends with different SQL dialects** and want to write queries once in a canonical dialect, then transpile to the target at runtime.
2. **You need SQL validation or analysis** (lineage, schema checking) without executing queries.
3. **You need SQL pretty-printing or formatting** with configurable guard rails.
4. **You need column lineage tracking** for data governance or OpenLineage integration.
5. **You need to parse and analyze SQL** in a Rust/WASM/Python/Go context without connecting to a database.

### When NOT to Adopt Polyglot

1. **You need type-safe query building**: use Diesel or SeaORM instead.
2. **You need async database execution**: use SQLx or SeaORM instead.
3. **You need schema migrations**: use Diesel migrations, sqlx migrations, or Refinery instead.
4. **You only need PostgreSQL** (or a single dialect): a simpler parser may suffice.
5. **You need Rust type → SQL type mapping**: Polyglot doesn't provide this.

### Suggested Adoption Strategy

For a multi-database storage layer:

1. **Use Polyglot for SQL transpilation**: Write queries in a canonical dialect (e.g., PostgreSQL-compatible), transpile to the target dialect at runtime.
2. **Use SQLx for database execution**: Handle connections, pooling, and async I/O.
3. **Use Polyglot for validation**: Validate user-provided SQL before execution.
4. **Use Polyglot for lineage**: Trace column flow for data governance.
5. **Build a thin integration layer** that combines Polyglot's transpilation with SQLx's execution.

---

## References

- <https://github.com/tobilg/polyglot>: Main repository
- <https://crates.io/crates/polyglot-sql>: Rust crate (v0.4.4)
- <https://docs.rs/polyglot-sql/latest/polyglot_sql/>: Rust API docs
- <https://github.com/tobymao/sqlglot>: Python inspiration
- <https://lib.rs/crates/polyglot-sql>: Package metadata
- Local source: `/workspace/polyglot/`