docs(research): add polyglot SQL transpiler deep dive for multi-DB storage evaluation

2026-06-10 10:04:30 +00:00
parent bdb0b604e9
commit 6e71d1f306
3 changed files with 1151 additions and 0 deletions


@@ -0,0 +1,294 @@
# Polyglot: Suitability Analysis & Comparisons
---
## 1. What Polyglot Is NOT
Before evaluating suitability, it's essential to understand what Polyglot **does not** do:
| NOT a... | Because |
|---|---|
| **Database driver** | No connection management, no query execution, no result set handling |
| **ORM** | No object-relational mapping, no model definitions, no active record pattern |
| **Migration tool** | No `CREATE TABLE` evolution management, no up/down migrations framework |
| **Type mapper** | No Rust type → SQL type mapping, no `FromRow` derives |
| **Connection pool** | No async I/O, no TCP connections, no TLS |
| **Query executor** | Never connects to a database; operates purely on SQL text |
**Polyglot is a SQL dialect transpiler.** It converts SQL strings between database dialects. Period.
---
## 2. Suitability Assessment for Multi-Database Storage Layer
### 2.1 What Polyglot CAN Do for a Multi-DB Project
| Use Case | Polyglot Support | Maturity |
|---|---|---|
| **SQL dialect translation** | ✅ Core purpose; 32 dialects with 100% test pass rate | Mature |
| **SQL pretty-printing** | ✅ Built-in format with guard rails | Mature |
| **SQL syntax validation** | ✅ Line/column error positions, error codes | Mature |
| **Schema-aware validation** | ✅ Table/column/type checking with `ValidationSchema` | Moderate |
| **Column lineage tracing** | ✅ `get_column_lineage()` for data lineage | Moderate |
| **OpenLineage payloads** | ✅ `RunEvent` and `DatasetFacet` generation | Early but functional |
| **Query builder** | ✅ Fluent API for SELECT/INSERT/UPDATE/DELETE | Usable but not as rich as query-builder-first libraries |
| **AST diff** | ✅ ChangeDistiller-based structural diff | Functional |
| **Logical planning** | ✅ Basic DAG plan extraction | Early stage |
| **Query optimization** | ✅ Column qualification, predicate pushdown, join elimination | Moderate |
| **Custom dialect registration** | ✅ `CustomDialectBuilder` for runtime extension | Functional |
### 2.2 What Polyglot CANNOT Do for a Multi-DB Project
| Need | Polyglot Support | Alternative |
|---|---|---|
| **Execute queries** | ❌ No | Use sqlx, diesel, or sea-orm |
| **Connection pooling** | ❌ No | Use deadpool, bb8, or sqlx built-in |
| **Async I/O** | ❌ Synchronous only | Wrap in `spawn_blocking()` |
| **Type-safe query building** | ⚠️ Partial (builder API yields an untyped `Expression`; errors surface at runtime) | Use diesel or sea-orm for compile-time checks |
| **Schema migration management** | ❌ No | Use diesel migrations, sqlx migrations, or refinery |
| **Row mapping / deserialization** | ❌ No | Use sqlx `FromRow`, diesel `Queryable` |
| **Runtime type mapping** | ⚠️ Limited (DataType enum, no Rust type bridge) | Build your own layer |
| **Database-specific DDL generation** | ⚠️ Parses/generates DDL but no migration framework | Use as a building block |
| **Transaction management** | ❌ No | Use sqlx or diesel |
### 2.3 Integration Pattern: Polyglot as a SQL Dialect Layer
The most natural integration pattern for a multi-database storage layer:
```
┌──────────────────────────────────────────────┐
│              Application Logic               │
├──────────────────────────────────────────────┤
│          Query Builder / ORM Layer           │
│         (diesel / sea-orm / custom)          │
├──────────────────────┬───────────────────────┤
│                      │                       │
│   Polyglot Layer     │   Direct SQL          │
│   (transpile,        │   (no translation     │
│    validate,         │    needed)            │
│    format)           │                       │
├──────────────────────┴───────────────────────┤
│            Database Driver Layer             │
│          (sqlx / diesel / rusqlite)          │
├───────────────┬──────────────┬───────────────┤
│  PostgreSQL   │    MySQL     │    SQLite     │
└───────────────┴──────────────┴───────────────┘
```
In this pattern, Polyglot sits **above** the database drivers, translating SQL from a canonical dialect to the target database's dialect before execution. It does **not** replace the drivers.
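The routing in the diagram can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not Polyglot's API: `TargetDialect`, `SqlTranspiler`, `SqlExecutor`, `StorageLayer`, `ToyTranspiler`, and `RecordingExecutor` are all hypothetical names. In a real integration the `SqlTranspiler` implementation would be a thin adapter around Polyglot and the `SqlExecutor` implementation would wrap a driver such as sqlx.

```rust
use std::cell::RefCell;

/// Dialects this sketch knows about (hypothetical; not Polyglot's own enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TargetDialect {
    Postgres,
    MySql,
    Sqlite,
}

/// Hypothetical seam for the transpiler; a real adapter would call Polyglot here.
pub trait SqlTranspiler {
    fn transpile(&self, sql: &str, from: TargetDialect, to: TargetDialect)
        -> Result<String, String>;
}

/// Hypothetical seam for the driver layer (sqlx, diesel, ...).
pub trait SqlExecutor {
    fn execute(&self, sql: &str) -> Result<u64, String>;
}

pub struct StorageLayer<T, E> {
    pub canonical: TargetDialect,
    pub target: TargetDialect,
    pub transpiler: T,
    pub executor: E,
}

impl<T: SqlTranspiler, E: SqlExecutor> StorageLayer<T, E> {
    /// "Direct SQL" path when the target already speaks the canonical dialect,
    /// "Polyglot layer" path otherwise. Execution always goes to the driver.
    pub fn run(&self, canonical_sql: &str) -> Result<u64, String> {
        let sql = if self.target == self.canonical {
            canonical_sql.to_owned()
        } else {
            self.transpiler
                .transpile(canonical_sql, self.canonical, self.target)?
        };
        self.executor.execute(&sql)
    }
}

/// Toy stand-in: only swaps identifier quoting for MySQL. It ignores string
/// literals and everything else a real transpiler handles.
pub struct ToyTranspiler;

impl SqlTranspiler for ToyTranspiler {
    fn transpile(&self, sql: &str, _from: TargetDialect, to: TargetDialect)
        -> Result<String, String> {
        Ok(match to {
            TargetDialect::MySql => sql.replace('"', "`"),
            _ => sql.to_owned(),
        })
    }
}

/// Test double that records what would have been sent to the database.
#[derive(Default)]
pub struct RecordingExecutor {
    pub seen: RefCell<Vec<String>>,
}

impl SqlExecutor for RecordingExecutor {
    fn execute(&self, sql: &str) -> Result<u64, String> {
        self.seen.borrow_mut().push(sql.to_owned());
        Ok(0)
    }
}
```

Keeping both seams behind traits means the storage layer can be unit-tested without a database, and the transpiler can be swapped or bypassed per backend.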
---
## 3. Comparison with Other Rust SQL Libraries
### 3.1 Feature Comparison Matrix
| Feature | **Polyglot** | **Diesel** | **SQLx** | **SeaORM** | **sqlparser-rs** |
|---|---|---|---|---|---|
| **Primary Purpose** | SQL transpilation | ORM / query builder | Async DB driver | Async ORM | SQL parsing |
| **SQL Parsing** | ✅ Full AST (200+ node types) | ❌ No (the DSL builds SQL; nothing is parsed) | ❌ No | ❌ No | ✅ Full AST |
| **SQL Generation** | ✅ Multi-dialect | ✅ Via DSL | ❌ No | ❌ No | ⚠️ Limited |
| **Cross-dialect Transpilation** | ✅ 32 dialects | ❌ No | ❌ No | ❌ No | ❌ No |
| **Query Builder** | ⚠️ Fluent, string-based | ✅ Type-safe DSL | ❌ No | ✅ Type-safe | ❌ No |
| **Async I/O** | ❌ No (sync only) | ❌ Sync (async via the separate `diesel-async` crate) | ✅ Native async | ✅ Native async | ❌ No |
| **Type-safe Queries** | ❌ No (runtime) | ✅ Compile-time | ⚠️ Compile-time checked via `query!()` | ✅ Compile-time | ❌ No |
| **Connection Pool** | ❌ No | ⚠️ Via the optional `r2d2` feature | ✅ Built-in | ✅ Built-in | ❌ No |
| **Migration Support** | ❌ No | ✅ Built-in | ✅ Built-in (`sqlx::migrate!`) | ✅ Built-in | ❌ No |
| **Database Execution** | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| **Schema Validation** | ✅ Via ValidationSchema | ✅ Compile-time | ❌ No | ⚠️ Limited | ❌ No |
| **Column Lineage** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **AST Diff** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **Dialects Supported** | 32 | 3 (PostgreSQL, MySQL, SQLite) | N/A | N/A | Several parse-only dialects (generic, ANSI, PostgreSQL, MySQL, and others) |
| **License** | MIT | MIT/Apache-2.0 | MIT/Apache-2.0 | MIT/Apache-2.0 | Apache-2.0 |
| **Maturity** | v0.4.4 (pre-1.0) | v2.2 (stable) | v0.8 (stable) | v1.1 (stable) | v0.49 (mature) |
### 3.2 Polyglot vs Diesel
| Aspect | Polyglot | Diesel |
|---|---|---|
| **Philosophy** | Parse any SQL → AST → generate any dialect | Type-safe DSL → SQL for specific databases |
| **Type Safety** | Runtime (string-based) | Compile-time (macro-based) |
| **Query Building** | `select(["col"]).from("t").where_(...)` → `Expression` AST | `table.filter(col.eq(...))` → SQL |
| **Dialect Breadth** | 32 dialects | 3 (PostgreSQL, MySQL, SQLite) |
| **Database Execution** | None (SQL text only) | Full CRUD with connection management |
| **Migrations** | None | Built-in migration framework |
| **When to use** | You need cross-dialect SQL translation, validation, lineage | You need type-safe queries with database execution |
**Verdict**: Polyglot and Diesel are **complementary**, not competing. Use Diesel for type-safe database interaction; use Polyglot when you need to translate SQL between dialects or analyze SQL without executing it.
### 3.3 Polyglot vs SQLx
| Aspect | Polyglot | SQLx |
|---|---|---|
| **Philosophy** | SQL manipulation without execution | Async database driver with compile-time query checking |
| **Async** | Synchronous only | Fully async |
| **Query Checking** | Runtime validation against schema | Compile-time `query!()` macro |
| **Database Support** | 32 dialects (parsing) | PostgreSQL, MySQL, SQLite (execution) |
| **When to use** | SQL transformation/analysis | Database interaction with async Rust |
**Verdict**: SQLx is for executing queries against databases. Polyglot is for transforming SQL text. They solve entirely different problems.
### 3.4 Polyglot vs SeaORM
| Aspect | Polyglot | SeaORM |
|---|---|---|
| **Philosophy** | SQL transpilation | Async ORM built on SQLx |
| **Async** | No | Yes |
| **Model Definition** | None | Entity models via macros |
| **Relationships** | None | Has-one, has-many, many-to-many |
| **When to use** | SQL dialect conversion | Database CRUD with relationships |
**Verdict**: Same as SQLx — complementary, not competing.
### 3.5 Polyglot vs sqlparser-rs
| Aspect | Polyglot | sqlparser-rs |
|---|---|---|
| **Parsing** | ✅ Full (200+ node types) | ✅ Full (ANSI SQL + some dialects) |
| **Generation** | ✅ Multi-dialect generation | ⚠️ Limited round-trip |
| **Transpilation** | ✅ Cross-dialect transforms | ❌ No |
| **Dialects** | 32 | ANSI SQL plus several parse-only dialect modes |
| **Validation** | ✅ With error positions | ❌ Parse errors only |
| **Builder** | ✅ Fluent API | ❌ No |
| **Lineage** | ✅ Built-in | ❌ No |
| **Diff** | ✅ Built-in | ❌ No |
| **Maturity** | v0.4.4 | v0.49 (more established) |
**Verdict**: sqlparser-rs is a mature parser for ANSI SQL and a number of vendor dialects, but it stops at parsing. Polyglot adds transpilation across 32 dialects, validation, lineage, diff, and a builder API. If you need dialect translation, Polyglot is the clear choice. If you only need to parse SQL and don't need generation or transpilation, sqlparser-rs may suffice with less overhead.
### 3.6 Polyglot vs Python sqlglot
| Aspect | Polyglot (Rust) | sqlglot (Python) |
|---|---|---|
| **Performance** | 819× faster (transpile), ~86× faster (generate) | Baseline |
| **Language** | Rust | Python |
| **Feature Parity** | ~95% of sqlglot's transpilation | Full feature set |
| **Optimizer** | Column qualification, predicate pushdown (moderate) | Full optimizer (column pruning, join elimination, etc.) |
| **Execution** | ❌ No | ⚠️ Limited (can execute against some engines) |
| **Test Compatibility** | 10,220+ sqlglot fixture cases at 100% | Original test suite |
| **Deployment** | Native binary / WASM / Python / Go | Python package |
**Verdict**: Polyglot is the performance-oriented port of sqlglot. It covers the core transpilation use case at near-full feature parity. The Python sqlglot has a more mature optimizer and some execution capabilities, but Polyglot is catching up rapidly (0.4.x adds lineage, OpenLineage, schema validation, and more).
---
## 4. Limitations and Gotchas
### 4.1 Current Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| **Pre-1.0 API** | Breaking changes possible between minor versions | Pin exact version in Cargo.toml |
| **No query execution** | Cannot run SQL against databases | Use alongside sqlx/diesel |
| **No async** | Blocking in async contexts | Wrap in `spawn_blocking()` |
| **No migration framework** | Cannot manage schema evolution | Use diesel migrations or refinery |
| **No Rust type mapping** | `DataType` enum doesn't map to Rust types | Build your own type bridge |
| **Builder returns Expression** | Builder doesn't produce type-safe queries | Accept runtime nature; pair with runtime validation |
| **Optimizer is early** | Limited optimization passes vs Python sqlglot | Most useful passes exist (qualify_columns, pushdown_predicates) |
| **WASM lacks `stacker`** | Deeply nested SQL may overflow stack in browser | Set format guard limits; consider web workers |
| **Custom dialects are global** | `CustomDialectBuilder` uses a global `RwLock` registry | Fine for most apps; not ideal for per-request isolation |
| **No prepared statement support** | Cannot generate `?` placeholders for parameterized queries | Build queries as strings; use sqlx for parameterization |
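The "pin exact version" mitigation can be expressed directly in `Cargo.toml`. The fragment below is a sketch: the crate name and version come from the references in this document, and the `transpile` feature name is taken from the gotchas in section 4.2, so confirm both against the crate's own manifest before relying on them.

```toml
[dependencies]
# "=" requests exactly this release, so `cargo update` will not move to a
# newer pre-1.0 version that may carry breaking changes.
polyglot-sql = { version = "=0.4.4", default-features = false, features = ["transpile"] }
```

A plain `"0.4.4"` requirement would behave like `^0.4.4` and still accept later `0.4.x` releases, which is why the explicit `=` matters for a pre-1.0 dependency.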
### 4.2 Gotchas
1. **`Dialect::get()` creates a new instance each call**: The `Dialect` struct bundles tokenizer + generator config + transformer. For hot loops, cache the `Dialect` instance rather than calling `Dialect::get()` repeatedly. (The overhead is minimal but non-zero.)
2. **Transpilation is not always invertible**: Some dialects have features that don't exist in others (e.g., BigQuery's `QUALIFY`, PostgreSQL's `ILIKE`, TSQL's `TOP`). Transpiling `A → B` and then `B → A` may lose information.
3. **Function transformation depth**: The transform pipeline processes per-node bottom-up. Some transformations require multi-pass processing (handled by `preprocess()`), but edge cases may require manual intervention.
4. **AST is not a stable serialization format**: The `Expression` enum and its inner structs may change between versions. If you serialize ASTs to JSON, expect breaking changes across minor versions.
5. **Feature flags are cumulative**: `transpile` implies `generate`, `openlineage` implies `semantic`, etc. For minimal builds, use `default-features = false` and select only what you need.
6. **Global custom dialect registry**: Custom dialects registered via `CustomDialectBuilder::register()` are stored in a global `RwLock<HashMap>`. This means they persist for the lifetime of the process and are visible across threads. Call `unregister_custom_dialect()` to remove them.
7. **Parser is permissive**: The parser accepts many SQL constructs that some databases reject. Validation (via `validate()` or `validate_with_schema()`) can catch some issues, but it's not a substitute for database-level error checking.
8. **No `?` placeholder generation**: Polyglot doesn't generate parameterized query placeholders. For prepared statements, you'll need to handle parameter binding yourself with your database driver.
9. **Schema validation requires manual schema definition**: The `ValidationSchema` struct must be populated manually — there's no automatic schema introspection from a live database.
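Gotcha 1 suggests caching dialect instances in hot loops. A minimal sketch of such a cache follows, assuming the dialect bundle can be shared behind an `Arc`. `DialectBundle` and `DialectCache` are hypothetical stand-ins, not Polyglot types; in real code `DialectBundle::build` would be replaced by whatever constructor the library exposes.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a dialect bundle
/// (tokenizer + generator config + transformer).
pub struct DialectBundle {
    pub name: String,
}

impl DialectBundle {
    /// Stand-in for an uncached constructor that does real work on each call.
    pub fn build(name: &str) -> Self {
        DialectBundle { name: name.to_owned() }
    }
}

/// Cache keyed by dialect name; hands out shared handles.
#[derive(Default)]
pub struct DialectCache {
    inner: Mutex<HashMap<String, Arc<DialectBundle>>>,
    builds: AtomicUsize,
}

impl DialectCache {
    /// Returns the cached bundle, constructing it only on the first request.
    pub fn get(&self, name: &str) -> Arc<DialectBundle> {
        let mut map = self.inner.lock().expect("dialect cache mutex poisoned");
        if let Some(hit) = map.get(name) {
            return Arc::clone(hit);
        }
        self.builds.fetch_add(1, Ordering::SeqCst);
        let built = Arc::new(DialectBundle::build(name));
        map.insert(name.to_owned(), Arc::clone(&built));
        built
    }

    /// Number of bundles actually constructed; useful for verifying cache hits.
    pub fn builds(&self) -> usize {
        self.builds.load(Ordering::SeqCst)
    }
}
```

Because custom dialects live in a process-wide registry (gotcha 6), a cache like this should be invalidated for any name that is unregistered or re-registered at runtime.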
---
## 5. Production-Readiness Assessment
### 5.1 Strengths
| Area | Rating | Notes |
|---|---|---|
| **Transpilation accuracy** | ⭐⭐⭐⭐⭐ | 10,220+ fixture cases at 100% pass rate |
| **Performance** | ⭐⭐⭐⭐⭐ | 819× faster than Python sqlglot |
| **Dialect coverage** | ⭐⭐⭐⭐⭐ | 32 dialects covering all major databases |
| **API ergonomics** | ⭐⭐⭐⭐ | Clean public API; builder is pleasant |
| **Error reporting** | ⭐⭐⭐⭐ | Line/column/byte-offset positions |
| **WASM support** | ⭐⭐⭐⭐ | Full feature set in browser |
| **Multi-language bindings** | ⭐⭐⭐⭐⭐ | Rust, TypeScript, Python, Go, C FFI |
| **Documentation** | ⭐⭐⭐ | Rust API docs exist; could use more guides |
| **Test coverage** | ⭐⭐⭐⭐⭐ | 18,745 test cases |
| **Fuzzing** | ⭐⭐⭐⭐ | Supported via `cargo fuzz` |
### 5.2 Risks
| Risk | Severity | Mitigation |
|---|---|---|
| **Pre-1.0 breaking changes** | Medium | Pin version; monitor CHANGELOG |
| **Single maintainer** | Medium | Code is well-structured; community could fork |
| **Limited optimizer** | Low | Core passes exist; Python sqlglot is reference |
| **No query execution** | Low (by design) | Combine with sqlx/diesel |
| **WASM stack limits** | Low | Set guard rails; use web workers |
### 5.3 Overall Assessment
**Polyglot is production-viable for SQL transpilation and analysis tasks**, with caveats:
- ✅ **Use for**: SQL dialect translation, SQL linting/validation, column lineage, pretty-printing, AST analysis, cross-database query migration
- ⚠️ **Use with caution for**: Query building (no type safety), optimization (partial coverage)
- ❌ **Don't use for**: Database execution, connection management, migrations, type-safe queries
For a multi-database storage layer, the recommended pattern is:
```
Application → Polyglot (transpile SQL to target dialect) → sqlx/diesel (execute)
```
---
## 6. Recommendation
### When to Adopt Polyglot
1. **You need to support multiple database backends with different SQL dialects** and want to write queries once in a canonical dialect, then transpile to the target at runtime.
2. **You need SQL validation or analysis** (lineage, schema checking) without executing queries.
3. **You need SQL pretty-printing or formatting** with configurable guard rails.
4. **You need column lineage tracking** for data governance or OpenLineage integration.
5. **You need to parse and analyze SQL** in a Rust/WASM/Python/Go context without connecting to a database.
### When NOT to Adopt Polyglot
1. **You need type-safe query building** — use Diesel or SeaORM instead.
2. **You need async database execution** — use SQLx or SeaORM instead.
3. **You need schema migrations** — use Diesel migrations, sqlx migrations, or Refinery instead.
4. **You only need PostgreSQL** (or a single dialect) — a simpler parser may suffice.
5. **You need Rust type → SQL type mapping** — Polyglot doesn't provide this.
### Suggested Adoption Strategy
For a multi-database storage layer:
1. **Use Polyglot for SQL transpilation**: Write queries in a canonical dialect (e.g., PostgreSQL-compatible), transpile to the target dialect at runtime.
2. **Use SQLx for database execution**: Handle connections, pooling, and async I/O.
3. **Use Polyglot for validation**: Validate user-provided SQL before execution.
4. **Use Polyglot for lineage**: Trace column flow for data governance.
5. **Build a thin integration layer** that combines Polyglot's transpilation with SQLx's execution.
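The ordering of steps 1 to 3 matters: user-provided SQL should be validated before it is transpiled, and nothing should reach the driver if either stage fails. The sketch below shows that gate with plain closures. `SqlIssue`, `PipelineError`, `run_user_sql`, and `check_parens` are hypothetical names, and `check_parens` is a deliberately tiny toy validator (it ignores string literals), standing in for a real call into Polyglot's validation.

```rust
/// Position-carrying validation failure (hypothetical shape, modeled on the
/// line/column error positions described earlier in this document).
#[derive(Debug, PartialEq)]
pub struct SqlIssue {
    pub line: usize,
    pub column: usize,
    pub message: String,
}

/// Which stage of the pipeline rejected the statement.
#[derive(Debug, PartialEq)]
pub enum PipelineError {
    Invalid(Vec<SqlIssue>),
    Transpile(String),
    Execute(String),
}

/// Validate, then transpile, then execute. Each stage runs only if the
/// previous one succeeded, so invalid SQL never reaches the driver.
pub fn run_user_sql<V, T, X>(
    sql: &str,
    validate: V,
    transpile: T,
    execute: X,
) -> Result<u64, PipelineError>
where
    V: Fn(&str) -> Vec<SqlIssue>,
    T: Fn(&str) -> Result<String, String>,
    X: FnOnce(&str) -> Result<u64, String>,
{
    let issues = validate(sql);
    if !issues.is_empty() {
        return Err(PipelineError::Invalid(issues));
    }
    let target_sql = transpile(sql).map_err(PipelineError::Transpile)?;
    execute(&target_sql).map_err(PipelineError::Execute)
}

/// Toy validator: reports unbalanced parentheses with 1-based positions.
pub fn check_parens(sql: &str) -> Vec<SqlIssue> {
    let mut open: Vec<(usize, usize)> = Vec::new();
    let mut issues = Vec::new();
    for (line_idx, line) in sql.lines().enumerate() {
        for (col_idx, ch) in line.chars().enumerate() {
            match ch {
                '(' => open.push((line_idx + 1, col_idx + 1)),
                ')' => {
                    if open.pop().is_none() {
                        issues.push(SqlIssue {
                            line: line_idx + 1,
                            column: col_idx + 1,
                            message: "unmatched ')'".to_owned(),
                        });
                    }
                }
                _ => {}
            }
        }
    }
    // Anything still on the stack was opened but never closed.
    for (line, column) in open {
        issues.push(SqlIssue { line, column, message: "unclosed '('".to_owned() });
    }
    issues
}
```

Returning a stage-tagged error lets the caller show validation positions to the user while logging transpile and driver failures separately.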
---
## References
- <https://github.com/tobilg/polyglot> — Main repository
- <https://crates.io/crates/polyglot-sql> — Rust crate (v0.4.4)
- <https://docs.rs/polyglot-sql/latest/polyglot_sql/> — Rust API docs
- <https://github.com/tobymao/sqlglot> — Python inspiration
- <https://lib.rs/crates/polyglot-sql> — Package metadata
- Local source: `/workspace/polyglot/`