# Polyglot: Suitability Analysis & Comparisons
---
## 1. What Polyglot Is NOT
Before evaluating suitability, it's essential to understand what Polyglot **does not** do:
| NOT a... | Because |
|---|---|
| **Database driver** | No connection management, no query execution, no result set handling |
| **ORM** | No object-relational mapping, no model definitions, no active record pattern |
| **Migration tool** | No schema evolution management, no up/down migration framework |
| **Type mapper** | No Rust type → SQL type mapping, no `FromRow` derives |
| **Connection pool** | No async I/O, no TCP connections, no TLS |
| **Query executor** | Never connects to a database; operates purely on SQL text |
**Polyglot is a SQL dialect transpiler.** It converts SQL strings between database dialects. Period.
---
## 2. Suitability Assessment for Multi-Database Storage Layer
### 2.1 What Polyglot CAN Do for a Multi-DB Project
| Use Case | Polyglot Support | Maturity |
|---|---|---|
| **SQL dialect translation** | ✅ Core purpose; 32 dialects with 100% test pass rate | Mature |
| **SQL pretty-printing** | ✅ Built-in format with guard rails | Mature |
| **SQL syntax validation** | ✅ Line/column error positions, error codes | Mature |
| **Schema-aware validation** | ✅ Table/column/type checking with `ValidationSchema` | Moderate |
| **Column lineage tracing** | ✅ `get_column_lineage()` for data lineage | Moderate |
| **OpenLineage payloads** | ✅ `RunEvent` and `DatasetFacet` generation | Early but functional |
| **Query builder** | ✅ Fluent API for SELECT/INSERT/UPDATE/DELETE | Usable but not as rich as query-builder-first libraries |
| **AST diff** | ✅ ChangeDistiller-based structural diff | Functional |
| **Logical planning** | ✅ Basic DAG plan extraction | Early stage |
| **Query optimization** | ✅ Column qualification, predicate pushdown, join elimination | Moderate |
| **Custom dialect registration** | ✅ `CustomDialectBuilder` for runtime extension | Functional |
### 2.2 What Polyglot CANNOT Do for a Multi-DB Project
| Need | Polyglot Support | Alternative |
|---|---|---|
| **Execute queries** | ❌ No | Use sqlx, diesel, or sea-orm |
| **Connection pooling** | ❌ No | Use deadpool, bb8, or sqlx built-in |
| **Async I/O** | ❌ Synchronous only | Wrap in `spawn_blocking()` |
| **Type-safe query building** | ⚠️ Partial (fluent builder produces an `Expression` AST that is checked only at runtime) | Use diesel or sea-orm for compile-time checks |
| **Schema migration management** | ❌ No | Use diesel migrations, sqlx migrations, or refinery |
| **Row mapping / deserialization** | ❌ No | Use sqlx `FromRow`, diesel `Queryable` |
| **Runtime type mapping** | ⚠️ Limited (DataType enum, no Rust type bridge) | Build your own layer |
| **Database-specific DDL generation** | ⚠️ Parses/generates DDL but no migration framework | Use as a building block |
| **Transaction management** | ❌ No | Use sqlx or diesel |
### 2.3 Integration Pattern: Polyglot as a SQL Dialect Layer
The most natural integration pattern for a multi-database storage layer:
```
┌──────────────────────────────────────────────┐
│              Application Logic               │
├──────────────────────────────────────────────┤
│          Query Builder / ORM Layer           │
│         (diesel / sea-orm / custom)          │
├──────────────────────┬───────────────────────┤
│                      │                       │
│  Polyglot Layer      │  Direct SQL           │
│  (transpile,         │  (no translation      │
│   validate,          │   needed)             │
│   format)            │                       │
├──────────────────────┴───────────────────────┤
│            Database Driver Layer             │
│               (sqlx / diesel)                │
├──────────────────────────────────────────────┤
│   PostgreSQL   │    MySQL    │    SQLite     │
└──────────────────────────────────────────────┘
```
In this pattern, Polyglot sits **above** the database drivers, translating SQL from a canonical dialect to the target database's dialect before execution. It does **not** replace the drivers.
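
The split between the "Polyglot Layer" and "Direct SQL" boxes can be sketched as a small routing function. This is a minimal illustration, not Polyglot's API: the `Backend` enum, the `Transpile` trait, `prepare_sql`, and `TaggingTranspiler` are all hypothetical names introduced here, and a real implementation of the trait would wrap whatever transpile call the crate exposes.

```rust
/// Target backends from the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Postgres,
    MySql,
    Sqlite,
}

/// Hypothetical seam for the translation step. The method name and
/// signature are assumptions for this sketch, not Polyglot's API.
pub trait Transpile {
    fn transpile(&self, sql: &str, from: Backend, to: Backend) -> Result<String, String>;
}

/// Returns the SQL text to hand to the driver layer for `target`.
/// Queries already written in the target's dialect take the "Direct SQL"
/// path; everything else goes through the transpiler.
pub fn prepare_sql<T: Transpile>(
    transpiler: &T,
    canonical: Backend,
    target: Backend,
    sql: &str,
) -> Result<String, String> {
    if canonical == target {
        return Ok(sql.to_owned());
    }
    transpiler.transpile(sql, canonical, target)
}

/// Toy stand-in used only to exercise the routing: it tags the SQL with
/// the target name instead of translating it.
pub struct TaggingTranspiler;

impl Transpile for TaggingTranspiler {
    fn transpile(&self, sql: &str, _from: Backend, to: Backend) -> Result<String, String> {
        Ok(format!("/* for {:?} */ {}", to, sql))
    }
}
```

Keeping the translation step behind a trait means the driver layer never depends on Polyglot directly, which limits the blast radius of pre-1.0 API changes.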
---
## 3. Comparison with Other Rust SQL Libraries
### 3.1 Feature Comparison Matrix
| Feature | **Polyglot** | **Diesel** | **SQLx** | **SeaORM** | **sqlparser-rs** |
|---|---|---|---|---|---|
| **Primary Purpose** | SQL transpilation | ORM / query builder | Async DB driver | Async ORM | SQL parsing |
| **SQL Parsing** | ✅ Full AST (200+ node types) | ❌ No (SQL is built from a DSL, not parsed) | ❌ No | ❌ No | ✅ Full AST |
| **SQL Generation** | ✅ Multi-dialect | ✅ Via DSL | ❌ No | ❌ No | ⚠️ Limited |
| **Cross-dialect Transpilation** | ✅ 32 dialects | ❌ No | ❌ No | ❌ No | ❌ No |
| **Query Builder** | ⚠️ Fluent, string-based | ✅ Type-safe DSL | ❌ No | ✅ Type-safe | ❌ No |
| **Async I/O** | ❌ No (sync only) | ❌ Sync (async via the separate `diesel-async` crate) | ✅ Native async | ✅ Native async | ❌ No |
| **Type-safe Queries** | ❌ No (runtime) | ✅ Compile-time | ✅ Compile-time (`query!()` macro) | ✅ Compile-time | ❌ No |
| **Connection Pool** | ❌ No | ⚠️ Via optional `r2d2` integration | ✅ Built-in | ✅ Built-in | ❌ No |
| **Migration Support** | ❌ No | ✅ Built-in | ✅ Built-in (`sqlx migrate`) | ✅ Built-in | ❌ No |
| **Database Execution** | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| **Schema Validation** | ✅ Via ValidationSchema | ✅ Compile-time | ❌ No | ⚠️ Limited | ❌ No |
| **Column Lineage** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **AST Diff** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No |
| **Dialects Supported** | 32 | 3 (PostgreSQL, MySQL, SQLite) | N/A | N/A | ANSI SQL plus several dialects (parsing only) |
| **License** | MIT | MIT/Apache-2.0 | MIT/Apache-2.0 | MIT/Apache-2.0 | Apache-2.0 |
| **Maturity** | v0.4.4 (pre-1.0) | v2.2 (stable) | v0.8 (stable) | v1.1 (stable) | v0.49 (mature) |
### 3.2 Polyglot vs Diesel
| Aspect | Polyglot | Diesel |
|---|---|---|
| **Philosophy** | Parse any SQL → AST → generate any dialect | Type-safe DSL → SQL for specific databases |
| **Type Safety** | Runtime (string-based) | Compile-time (macro-based) |
| **Query Building** | `select(["col"]).from("t").where_(...)` → `Expression` AST | `schema::table::dsl::col.filter(...)` → SQL |
| **Dialect Breadth** | 32 dialects | 3 (PostgreSQL, MySQL, SQLite) |
| **Database Execution** | None (SQL text only) | Full CRUD with connection management |
| **Migrations** | None | Built-in migration framework |
| **When to use** | You need cross-dialect SQL translation, validation, lineage | You need type-safe queries with database execution |
**Verdict**: Polyglot and Diesel are **complementary**, not competing. Use Diesel for type-safe database interaction; use Polyglot when you need to translate SQL between dialects or analyze SQL without executing it.
### 3.3 Polyglot vs SQLx
| Aspect | Polyglot | SQLx |
|---|---|---|
| **Philosophy** | SQL manipulation without execution | Async database driver with compile-time query checking |
| **Async** | Synchronous only | Fully async |
| **Query Checking** | Runtime validation against schema | Compile-time `query!()` macro |
| **Database Support** | 32 dialects (parsing) | PostgreSQL, MySQL, SQLite (execution) |
| **When to use** | SQL transformation/analysis | Database interaction with async Rust |
**Verdict**: SQLx is for executing queries against databases. Polyglot is for transforming SQL text. They solve entirely different problems.
### 3.4 Polyglot vs SeaORM
| Aspect | Polyglot | SeaORM |
|---|---|---|
| **Philosophy** | SQL transpilation | Async ORM built on SQLx |
| **Async** | No | Yes |
| **Model Definition** | None | Entity models via macros |
| **Relationships** | None | Has-one, has-many, many-to-many |
| **When to use** | SQL dialect conversion | Database CRUD with relationships |
**Verdict**: Same as Diesel and SQLx: complementary, not competing.
### 3.5 Polyglot vs sqlparser-rs
| Aspect | Polyglot | sqlparser-rs |
|---|---|---|
| **Parsing** | ✅ Full (200+ node types) | ✅ Full (ANSI SQL + some dialects) |
| **Generation** | ✅ Multi-dialect generation | ⚠️ Limited round-trip |
| **Transpilation** | ✅ Cross-dialect transforms | ❌ No |
| **Dialects** | 32 | Primarily ANSI SQL |
| **Validation** | ✅ With error positions | ❌ Parse errors only |
| **Builder** | ✅ Fluent API | ❌ No |
| **Lineage** | ✅ Built-in | ❌ No |
| **Diff** | ✅ Built-in | ❌ No |
| **Maturity** | v0.4.4 | v0.49 (more established) |
**Verdict**: sqlparser-rs is a mature parser for ANSI SQL. Polyglot offers significantly more: transpilation, 32 dialects, validation, lineage, diff, and a builder API. If you need dialect translation, Polyglot is the clear choice. If you only need ANSI SQL parsing and don't need generation/transpilation, sqlparser-rs may suffice with less overhead.
### 3.6 Polyglot vs Python sqlglot
| Aspect | Polyglot (Rust) | sqlglot (Python) |
|---|---|---|
| **Performance** | 819× faster (transpile), ~86× faster (generate) | Baseline |
| **Language** | Rust | Python |
| **Feature Parity** | ~95% of sqlglot's transpilation | Full feature set |
| **Optimizer** | Column qualification, predicate pushdown (moderate) | Full optimizer (column pruning, join elimination, etc.) |
| **Execution** | ❌ No | ⚠️ Limited (can execute against some engines) |
| **Test Compatibility** | 10,220+ sqlglot fixture cases at 100% | Original test suite |
| **Deployment** | Native binary / WASM / Python / Go | Python package |
**Verdict**: Polyglot is the performance-oriented port of sqlglot. It covers the core transpilation use case at near-full feature parity. The Python sqlglot has a more mature optimizer and some execution capabilities, but Polyglot is catching up rapidly (0.4.x adds lineage, OpenLineage, schema validation, and more).
---
## 4. Limitations and Gotchas
### 4.1 Current Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| **Pre-1.0 API** | Breaking changes possible between minor versions | Pin exact version in Cargo.toml |
| **No query execution** | Cannot run SQL against databases | Use alongside sqlx/diesel |
| **No async** | Blocking in async contexts | Wrap in `spawn_blocking()` |
| **No migration framework** | Cannot manage schema evolution | Use diesel migrations or refinery |
| **No Rust type mapping** | `DataType` enum doesn't map to Rust types | Build your own type bridge |
| **Builder returns Expression** | Builder doesn't produce type-safe queries | Accept runtime nature; pair with runtime validation |
| **Optimizer is early** | Limited optimization passes vs Python sqlglot | Most useful passes exist (qualify_columns, pushdown_predicates) |
| **WASM lacks `stacker`** | Deeply nested SQL may overflow stack in browser | Set format guard limits; consider web workers |
| **Custom dialects are global** | `CustomDialectBuilder` uses a global `RwLock` registry | Fine for most apps; not ideal for per-request isolation |
| **No prepared statement support** | Cannot generate `?` placeholders for parameterized queries | Build queries as strings; use sqlx for parameterization |
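
The "No async" mitigation amounts to moving the synchronous call off the thread that drives async tasks. The sketch below shows that shape using only `std::thread`, so it stays self-contained; inside a Tokio application the same closure would typically be passed to `tokio::task::spawn_blocking` instead. `run_off_thread` is a hypothetical helper, and the `transform` closure stands in for a Polyglot call whose real signature is not assumed here.

```rust
use std::thread;

/// Runs a synchronous SQL transformation on a separate thread and returns
/// its result. `transform` is a placeholder for a blocking Polyglot call.
pub fn run_off_thread<F>(sql: String, transform: F) -> Result<String, String>
where
    F: FnOnce(&str) -> Result<String, String> + Send + 'static,
{
    // The closure owns `sql`, so nothing is borrowed across threads.
    let handle = thread::spawn(move || transform(&sql));

    // `join` returns Err if the worker panicked; map that to a plain error
    // and otherwise pass through the transformation's own Result.
    handle
        .join()
        .map_err(|_| "transform panicked".to_string())?
}
```

Spawning a thread per call is only reasonable for a sketch. A runtime's blocking pool reuses threads, which matters if transpilation sits on a hot path.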
### 4.2 Gotchas
1. **`Dialect::get()` creates a new instance each call**: The `Dialect` struct bundles tokenizer + generator config + transformer. For hot loops, cache the `Dialect` instance rather than calling `Dialect::get()` repeatedly. (The overhead is minimal but non-zero.)
2. **Transpilation is not always invertible**: Some dialects have features that don't exist in others (e.g., BigQuery's `QUALIFY`, PostgreSQL's `ILIKE`, TSQL's `TOP`). Transpiling `A → B` and then `B → A` may lose information.
3. **Function transformation depth**: The transform pipeline processes per-node bottom-up. Some transformations require multi-pass processing (handled by `preprocess()`), but edge cases may require manual intervention.
4. **AST is not a stable serialization format**: The `Expression` enum and its inner structs may change between versions. If you serialize ASTs to JSON, expect breaking changes across minor versions.
5. **Feature flags are cumulative**: `transpile` implies `generate`, `openlineage` implies `semantic`, etc. For minimal builds, use `default-features = false` and select only what you need.
6. **Global custom dialect registry**: Custom dialects registered via `CustomDialectBuilder::register()` are stored in a global `RwLock<HashMap>`. This means they persist for the lifetime of the process and are visible across threads. Call `unregister_custom_dialect()` to remove them.
7. **Parser is permissive**: The parser accepts many SQL constructs that some databases reject. Validation (via `validate()` or `validate_with_schema()`) can catch some issues, but it's not a substitute for database-level error checking.
8. **No `?` placeholder generation**: Polyglot doesn't generate parameterized query placeholders. For prepared statements, you'll need to handle parameter binding yourself with your database driver.
9. **Schema validation requires manual schema definition**: The `ValidationSchema` struct must be populated manually — there's no automatic schema introspection from a live database.
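
Gotcha 1 suggests caching the `Dialect` instance in hot loops. One way to do that is a small by-name cache, sketched below. The type parameter `D` stands in for whatever `Dialect::get()` returns and the `build` closure stands in for the call itself; both are placeholders, and `DialectCache` is a hypothetical helper rather than part of Polyglot.

```rust
use std::collections::HashMap;

/// Caches one handle per dialect name so hot loops do not rebuild it.
/// `D` and `build` are placeholders for the real dialect type and lookup.
pub struct DialectCache<D, F>
where
    F: Fn(&str) -> D,
{
    build: F,
    handles: HashMap<String, D>,
}

impl<D, F> DialectCache<D, F>
where
    F: Fn(&str) -> D,
{
    pub fn new(build: F) -> Self {
        Self {
            build,
            handles: HashMap::new(),
        }
    }

    /// Returns the cached handle, building it on first use only.
    pub fn get(&mut self, name: &str) -> &D {
        if !self.handles.contains_key(name) {
            let handle = (self.build)(name);
            self.handles.insert(name.to_owned(), handle);
        }
        &self.handles[name]
    }
}
```

Because `get` takes `&mut self`, sharing this cache across threads would need a lock or a per-thread instance; which one fits depends on whether the real dialect type is `Send` and `Sync`, which this sketch does not assume.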
---
## 5. Production-Readiness Assessment
### 5.1 Strengths
| Area | Rating | Notes |
|---|---|---|
| **Transpilation accuracy** | ⭐⭐⭐⭐⭐ | 10,220+ fixture cases at 100% pass rate |
| **Performance** | ⭐⭐⭐⭐⭐ | 819× faster than Python sqlglot |
| **Dialect coverage** | ⭐⭐⭐⭐⭐ | 32 dialects covering all major databases |
| **API ergonomics** | ⭐⭐⭐⭐ | Clean public API; builder is pleasant |
| **Error reporting** | ⭐⭐⭐⭐ | Line/column/byte-offset positions |
| **WASM support** | ⭐⭐⭐⭐ | Full feature set in browser |
| **Multi-language bindings** | ⭐⭐⭐⭐⭐ | Rust, TypeScript, Python, Go, C FFI |
| **Documentation** | ⭐⭐⭐ | Rust API docs exist; could use more guides |
| **Test coverage** | ⭐⭐⭐⭐⭐ | 18,745 test cases |
| **Fuzzing** | ⭐⭐⭐⭐ | Supported via `cargo fuzz` |
### 5.2 Risks
| Risk | Severity | Mitigation |
|---|---|---|
| **Pre-1.0 breaking changes** | Medium | Pin version; monitor CHANGELOG |
| **Single maintainer** | Medium | Code is well-structured; community could fork |
| **Limited optimizer** | Low | Core passes exist; Python sqlglot is reference |
| **No query execution** | Low (by design) | Combine with sqlx/diesel |
| **WASM stack limits** | Low | Set guard rails; use web workers |
### 5.3 Overall Assessment
**Polyglot is production-viable for SQL transpilation and analysis tasks**, with caveats:
- ✅ **Use for**: SQL dialect translation, SQL linting/validation, column lineage, pretty-printing, AST analysis, cross-database query migration
- ⚠️ **Use with caution for**: Query building (no type safety), optimization (partial coverage)
- ❌ **Don't use for**: Database execution, connection management, migrations, type-safe queries
For a multi-database storage layer, the recommended pattern is:
```
Application → Polyglot (transpile SQL to target dialect) → sqlx/diesel (execute)
```
---
## 6. Recommendation
### When to Adopt Polyglot
1. **You need to support multiple database backends with different SQL dialects** and want to write queries once in a canonical dialect, then transpile to the target at runtime.
2. **You need SQL validation or analysis** (lineage, schema checking) without executing queries.
3. **You need SQL pretty-printing or formatting** with configurable guard rails.
4. **You need column lineage tracking** for data governance or OpenLineage integration.
5. **You need to parse and analyze SQL** in a Rust/WASM/Python/Go context without connecting to a database.
### When NOT to Adopt Polyglot
1. **You need type-safe query building** — use Diesel or SeaORM instead.
2. **You need async database execution** — use SQLx or SeaORM instead.
3. **You need schema migrations** — use Diesel migrations, sqlx migrations, or Refinery instead.
4. **You only need PostgreSQL** (or a single dialect) — a simpler parser may suffice.
5. **You need Rust type → SQL type mapping** — Polyglot doesn't provide this.
### Suggested Adoption Strategy
For a multi-database storage layer:
1. **Use Polyglot for SQL transpilation**: Write queries in a canonical dialect (e.g., PostgreSQL-compatible), transpile to the target dialect at runtime.
2. **Use SQLx for database execution**: Handle connections, pooling, and async I/O.
3. **Use Polyglot for validation**: Validate user-provided SQL before execution.
4. **Use Polyglot for lineage**: Trace column flow for data governance.
5. **Build a thin integration layer** that combines Polyglot's transpilation with SQLx's execution.
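
The thin integration layer from step 5 could be shaped roughly as follows: validate, then transpile, then execute, with an error type that records which stage failed. This is a sketch under stated assumptions. `SqlToolkit`, `Executor`, `PipelineError`, `run_query`, and the two toy implementations are hypothetical names, and the trait methods do not mirror Polyglot's or SQLx's real signatures (in particular, a real executor would be async).

```rust
/// Identifies the stage that rejected a query.
#[derive(Debug, PartialEq)]
pub enum PipelineError {
    Validation(String),
    Transpile(String),
    Execution(String),
}

/// Hypothetical seam over Polyglot: validation plus dialect translation.
pub trait SqlToolkit {
    fn validate(&self, sql: &str) -> Result<(), String>;
    fn transpile(&self, sql: &str, target: &str) -> Result<String, String>;
}

/// Hypothetical seam over the database driver.
pub trait Executor {
    fn execute(&mut self, sql: &str) -> Result<u64, String>;
}

/// Validates canonical SQL, translates it for `target`, then executes it.
/// Nothing reaches the executor unless both earlier stages succeed.
pub fn run_query<T: SqlToolkit, E: Executor>(
    toolkit: &T,
    executor: &mut E,
    target: &str,
    sql: &str,
) -> Result<u64, PipelineError> {
    toolkit.validate(sql).map_err(PipelineError::Validation)?;
    let translated = toolkit
        .transpile(sql, target)
        .map_err(PipelineError::Transpile)?;
    executor
        .execute(&translated)
        .map_err(PipelineError::Execution)
}

/// Toy toolkit: rejects empty input and prefixes a comment instead of translating.
pub struct ToyToolkit;

impl SqlToolkit for ToyToolkit {
    fn validate(&self, sql: &str) -> Result<(), String> {
        if sql.trim().is_empty() {
            Err("empty statement".to_string())
        } else {
            Ok(())
        }
    }

    fn transpile(&self, sql: &str, target: &str) -> Result<String, String> {
        Ok(format!("/* {} */ {}", target, sql))
    }
}

/// Toy executor: records each statement and reports zero rows affected.
#[derive(Default)]
pub struct RecordingExecutor {
    pub seen: Vec<String>,
}

impl Executor for RecordingExecutor {
    fn execute(&mut self, sql: &str) -> Result<u64, String> {
        self.seen.push(sql.to_owned());
        Ok(0)
    }
}
```

Separating the error variants by stage makes it clear in logs whether a failure came from Polyglot's analysis or from the database itself, which matters because the parser is more permissive than most databases.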
---
## References
- <https://github.com/tobilg/polyglot> — Main repository
- <https://crates.io/crates/polyglot-sql> — Rust crate (v0.4.4)
- <https://docs.rs/polyglot-sql/latest/polyglot_sql/> — Rust API docs
- <https://github.com/tobymao/sqlglot> — Python inspiration
- <https://lib.rs/crates/polyglot-sql> — Package metadata
- Local source: `/workspace/polyglot/`