diff --git a/docs/research/references/polyglot/01_overview.md b/docs/research/references/polyglot/01_overview.md
new file mode 100644
index 0000000..831fa9d
--- /dev/null
+++ b/docs/research/references/polyglot/01_overview.md
@@ -0,0 +1,137 @@
+# Polyglot: Research Overview
+
+**Library**: `polyglot-sql` (Rust crate) / `@polyglot-sql/sdk` (TypeScript/WASM) / `polyglot-sql` (Python)
+**Repository**:
+**Current Version**: 0.4.4 (as of 2026-06-03)
+**License**: MIT (+ sqlglot MIT for test fixtures)
+**Author**: Tobias G. (tobilg)
+**Inspiration**: Python [sqlglot](https://github.com/tobymao/sqlglot) by Toby Mao
+
+---
+
+## 1. What Is Polyglot?
+
+Polyglot is a **SQL transpiler** — it parses SQL from one database dialect into an AST, and generates SQL for a different dialect. It is **not** a database driver, ORM, query executor, or connection pool. Its core purpose is **dialect-agnostic SQL manipulation**: parse, transform, validate, format, and transpile SQL across 32+ database dialects.
+
+### Key Capabilities
+
+| Capability | Description |
+|---|---|
+| **Parse** | Convert SQL string → typed AST with 200+ expression node types |
+| **Generate** | Convert AST → SQL string for any supported dialect |
+| **Transpile** | Convert SQL from dialect A → dialect B in one call |
+| **Format** | Pretty-print SQL with configurable guard rails |
+| **Build** | Construct SQL programmatically via a fluent builder API |
+| **Validate** | Syntax + semantic validation with error positions |
+| **Lineage** | Trace column lineage through queries; generate OpenLineage payloads |
+| **Diff** | AST-aware diff between two SQL expressions |
+| **Traverse** | DFS/BFS iterators, predicate queries, and transforms on the AST |
+
+### Supported Dialects (32)
+
+Athena, BigQuery, ClickHouse, CockroachDB, Databricks, Doris, Dremio, Drill, Druid, DuckDB, Dune, Exasol, Fabric, Hive, Materialize, MySQL, Oracle, PostgreSQL, Presto, Redshift, RisingWave, SingleStore, Snowflake, Solr, Spark, SQLite, StarRocks, Tableau, Teradata, TiDB, Trino, TSQL
+
+Plus a `Generic` dialect for standard SQL.
+
+### Language Bindings
+
+| Binding | Package | Delivery |
+|---|---|---|
+| **Rust** | `polyglot-sql` on crates.io | Native Rust crate |
+| **TypeScript/WASM** | `@polyglot-sql/sdk` on npm | WASM module + JS wrapper |
+| **Python** | `polyglot-sql` on PyPI | PyO3 native extension |
+| **Go** | `github.com/tobilg/polyglot/packages/go` | PureGo wrapper over C FFI |
+| **C FFI** | Built from `polyglot-sql-ffi` | `.so` / `.dylib` / `.dll` + `.a` / `.lib` + header |
+
+---
+
+## 2. Core Philosophy & Design Principles
+
+1. **Pipeline architecture**: SQL → Tokenize → Parse → AST → Transform → Generate → SQL string. Each stage is independently configurable per dialect.
+
+2. **Ported from Python sqlglot**: The Rust implementation is a faithful port of the Python `sqlglot` library, maintaining compatibility with its test fixtures (10,220+ fixture cases at 100% pass rate). The architecture, expression types, transformation rules, and dialect behaviors mirror the Python original.
+
+3. **No runtime database connection**: Polyglot never connects to a database. It operates purely on SQL strings and ASTs. This makes it safe for sandboxed environments (WASM, serverless) and suitable for build-time / CI-time SQL analysis.
+
+4. **Feature-gated compilation**: Each dialect is behind a Cargo feature flag (`dialect-postgresql`, `dialect-mysql`, etc.), so users compiling for constrained targets (WASM) can include only what they need. The `default` feature set includes everything.
+
+5. **Stack safety**: The `stacker` feature (default-on for native builds) grows the stack on deeply nested inputs, preventing stack overflow from pathological SQL. WASM builds opt out since `stacker` doesn't work there.
+
+6. **Guard rails**: Format/guard options limit input size (16 MiB default), token count (1M), AST node count (1M), and set-operation chain depth (256) to prevent resource exhaustion.
+
+7. **Performance-first**: Built in Rust for speed. Benchmarks show up to 19× speedup over the Python `sqlglot` for transpilation, with generation roughly 86× faster (Section 4 gives the full ranges). The WASM build enables near-native performance in browsers.
+
+---
+
+## 3. How It Differs from Database Abstraction Layers
+
+**Critical distinction**: Polyglot is a **SQL dialect transpiler**, not a database abstraction layer. It does not:
+
+- Connect to databases
+- Execute queries
+- Manage connection pools
+- Handle migrations (no `CREATE TABLE` schema evolution management)
+- Map Rust types to database types
+- Provide an ORM-like interface
+- Handle async I/O
+
+Instead, it focuses purely on **SQL text manipulation**: parsing, analyzing, transforming, and generating SQL strings. This makes it complementary to (not competing with) libraries like Diesel, SQLx, or SeaORM.
+
+---
+
+## 4. Performance Characteristics
+
+From the project's benchmark suite (polyglot-sql v0.1.2 vs sqlglot v28.10.1):
+
+| Operation | Speedup Range |
+|---|---|
+| Parse (SQL → AST) | 10–13× faster |
+| Generate (AST → SQL) | 77–101× faster |
+| Roundtrip (parse → generate → re-parse) | 13–15× faster |
+| Transpile (full cross-dialect) | 1.6× (simple) to 19× (complex BigQuery→Snowflake) |
+| Geometric mean | **8.70×** |
+
+Parse benchmarks (v0.4.x, native Rust):
+
+| Query | Mean |
+|---|---|
+| short (SELECT a, b, c) | 51.28 μs |
+| medium (5 cols, JOIN, GROUP BY) | 259.61 μs |
+| complex (3 CTEs, subquery) | 268.59 μs – 1.03 ms |
+
+---
+
+## 5. Project Maturity Indicators
+
+| Indicator | Status |
+|---|---|
+| **Version** | 0.4.4 (pre-1.0, active development) |
+| **Test coverage** | 18,745 test cases at 100% pass rate |
+| **crates.io downloads** | ~4,738 total (as of mid-2026) |
+| **Dependent crates** | 2 (via entdb) |
+| **Release cadence** | Frequent patch releases (0.4.2, 0.4.3, 0.4.4 in quick succession) |
+| **Source code size** | ~241K lines of Rust in core crate |
+| **Fuzzing** | Supported via `cargo +nightly fuzz` |
+| **CI** | Full test suite + FFI + Python + WASM |
+| **Documentation** | Rust API docs (docs.rs), TypeScript docs, Python docs, playground |
+| **Breaking changes** | Possible before 1.0; under semver, 0.x releases carry no API stability guarantee |
+
+---
+
+## 6. License
+
+- **MIT License** for the Polyglot code itself
+- **sqlglot MIT License** for the test fixtures derived from the Python project
+- Both are permissive, suitable for commercial use
+
+---
+
+## References
+
+- — Main repository
+- — Rust crate on crates.io
+- — TypeScript SDK on npm
+- — Python bindings on PyPI
+- — Rust API documentation
+- — Interactive playground
+- — Original Python inspiration
\ No newline at end of file
diff --git a/docs/research/references/polyglot/02_architecture.md b/docs/research/references/polyglot/02_architecture.md
new file mode 100644
index 0000000..281f6cc
--- /dev/null
+++ b/docs/research/references/polyglot/02_architecture.md
@@ -0,0 +1,720 @@
+# Polyglot: Architecture Deep Dive
+
+---
+
+## 1. Workspace Structure
+
+The repository is organized as a Cargo workspace with 5 crates and supporting packages:
+
+```
+polyglot/
+├── crates/
+│   ├── polyglot-sql/                   # Core Rust library (~241K LOC)
+│   │   └── src/
+│   │       ├── lib.rs                  # Public API, top-level functions
+│   │       ├── tokens.rs               # Tokenizer (lexer)
+│   │       ├── parser.rs               # Recursive-descent parser (~62K LOC)
+│   │       ├── expressions.rs          # AST node types (~15K LOC)
+│   │       ├── generator.rs            # SQL code generator (~39K LOC)
+│   │       ├── dialects/               # 33 dialect implementations
+│   │       │   ├── mod.rs              # Dialect trait, Dialect struct, CustomDialectBuilder
+│   │       │   ├── generic.rs          # Base/standard SQL dialect
+│   │       │   ├── postgres.rs         # PostgreSQL (~1.9K LOC)
+│   │       │   ├── mysql.rs            # MySQL
+│   │       │   ├── sqlite.rs           # SQLite
+│   │       │   ├── bigquery.rs         # BigQuery
+│   │       │   ├── ... (32 total)
+│   │       ├── builder.rs              # Fluent query builder API
+│   │       ├── transforms.rs           # Cross-dialect transform functions
+│   │       ├── validation.rs           # Syntax + semantic validation
+│   │       ├── schema.rs               # Schema representation
+│   │       ├── scope.rs                # Scope analysis
+│   │       ├── resolver.rs             # Column resolution
+│   │       ├── lineage.rs              # Column lineage tracking
+│   │       ├── openlineage.rs          # OpenLineage payload generation
+│   │       ├── diff.rs                 # AST diff (ChangeDistiller algorithm)
+│   │       ├── planner.rs              # Logical query plan
+│   │       ├── optimizer/              # Query optimizer modules
+│   │       │   ├── annotate_types.rs   # Type annotation
+│   │       │   ├── qualify_columns.rs  # Column qualification
+│   │       │   ├── qualify_tables.rs   # Table qualification
+│   │       │   ├── pushdown_predicates.rs
+│   │       │   ├── pushdown_projections.rs
+│   │       │   ├── eliminate_joins.rs
+│   │       │   ├── eliminate_ctes.rs
+│   │       │   ├── simplify.rs
+│   │       │   └── ...
+│   │       ├── traversal.rs            # DFS/BFS visitors, AST predicates
+│   │       ├── ast_transforms.rs       # AST manipulation utilities
+│   │       ├── error.rs                # Error types
+│   │       └── time.rs                 # Time format conversion
+│   ├── polyglot-sql-function-catalogs/ # Optional dialect function catalogs
+│   ├── polyglot-sql-wasm/              # WASM bindings (wasm-pack)
+│   ├── polyglot-sql-ffi/               # C FFI bindings (cbindgen)
+│   └── polyglot-sql-python/            # Python bindings (PyO3 + maturin)
+├── packages/
+│   ├── sdk/                            # TypeScript SDK (@polyglot-sql/sdk)
+│   ├── go/                             # Go SDK (PureGo wrapper over FFI)
+│   ├── documentation/                  # TypeScript API docs site
+│   ├── playground/                     # Browser playground (React 19, Vite)
+│   └── python-docs/                    # Python API docs
+├── examples/
+│   ├── rust/                           # Rust usage example
+│   ├── typescript/                     # TypeScript SDK example
+│   └── c/                              # C FFI usage example
+└── tools/
+    ├── sqlglot-compare/                # Fixture extraction & comparison
+    └── bench-compare/                  # Performance benchmarks
+```
+
+---
+
+## 2. Data Flow Pipeline
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  SQL String (source dialect)                                         │
+└──────────────────────────┬──────────────────────────────────────────┘
+                           │
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  Tokenizer (tokens.rs)                                               │
+│  • Dialect-specific lexing rules (quotes, comments, keywords)        │
+│  • Configurable via TokenizerConfig per dialect                      │
+│  • Produces Vec with type, text, and Span (line/col/offset)          │
+└──────────────────────────┬──────────────────────────────────────────┘
+                           │
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  Parser (parser.rs, ~62K LOC)                                        │
+│  • Recursive-descent with precedence climbing                        │
+│  • Dialect-aware parsing (custom keywords, syntax rules)             │
+│  • Produces Expression AST tree                                      │
+│  • Stack safety via `stacker` feature (default-on)                   │
+└──────────────────────────┬──────────────────────────────────────────┘
+                           │
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  Expression AST (expressions.rs)                                     │
+│  • Single tagged enum with 150+ variants                             │
+│  • Each variant has its own struct (Select, Insert, Function, etc.)  │
+│  • Box keeps enum size to 2 words (tag + pointer)                    │
+│  • Serializable via serde (derive Serialize/Deserialize)             │
+│  • Optional TypeScript type generation via `ts-rs` feature flag      │
+└──────────────────────────┬──────────────────────────────────────────┘
+                           │
+                      ┌────┴────┐
+                      │         │
+            ┌─────────┘         └──────────┐
+            │                              │
+            ▼                              ▼
+┌────────────────────────┐    ┌────────────────────────────────────┐
+│ Transform Pipeline     │    │ Semantic / Analysis Modules        │
+│ (transpile path)       │    │ • validation.rs → syntax checks    │
+│                        │    │ • schema.rs → column/type lookup   │
+│ 1. preprocess()        │    │ • scope.rs → scope analysis        │
+│    (whole-tree rewrites│    │ • resolver.rs → column resolution  │
+│    like eliminate_     │    │ • lineage.rs → column lineage      │
+│    qualify)            │    │ • openlineage.rs → OL payloads     │
+│                        │    │ • optimizer/ → query optimization  │
+│ 2. transform_expr()    │    │ • diff.rs → AST diff               │
+│    (per-node rewrites  │    │ • planner.rs → logical plan DAG    │
+│    per dialect)        │    │ • traversal.rs → DFS/BFS visitors  │
+│                        │    │                                    │
+│ 3. Generator           │    │                                    │
+│    (AST → SQL string)  │    │                                    │
+└───────────┬────────────┘    └────────────────────────────────────┘
+            │
+            ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  SQL String (target dialect)                                         │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 3. Core Abstractions
+
+### 3.1 Expression AST
+
+The central type is `Expression`, a large tagged enum with one variant per SQL construct:
+
+```rust
+pub enum Expression {
+    // Literals
+    Literal(Box),
+    Boolean(BooleanLiteral),
+    Null(Null),
+
+    // Identifiers
+    Identifier(Identifier),
+    Column(Box),
+    Table(Box),
+    Star(Star),
+
+    // Queries
+    Select(Box