docs(research): add polyglot SQL transpiler deep dive for multi-DB storage evaluation
720
docs/research/references/polyglot/02_architecture.md
Normal file
@@ -0,0 +1,720 @@
# Polyglot: Architecture Deep Dive

---

## 1. Workspace Structure

The repository is organized as a Cargo workspace with 5 crates and supporting packages:

```
polyglot/
├── crates/
│   ├── polyglot-sql/  # Core Rust library (~241K LOC)
│   │   └── src/
│   │       ├── lib.rs  # Public API, top-level functions
│   │       ├── tokens.rs  # Tokenizer (lexer)
│   │       ├── parser.rs  # Recursive-descent parser (~62K LOC)
│   │       ├── expressions.rs  # AST node types (~15K LOC)
│   │       ├── generator.rs  # SQL code generator (~39K LOC)
│   │       ├── dialects/  # 33 dialect implementations
│   │       │   ├── mod.rs  # Dialect trait, Dialect struct, CustomDialectBuilder
│   │       │   ├── generic.rs  # Base/standard SQL dialect
│   │       │   ├── postgres.rs  # PostgreSQL (~1.9K LOC)
│   │       │   ├── mysql.rs  # MySQL
│   │       │   ├── sqlite.rs  # SQLite
│   │       │   ├── bigquery.rs  # BigQuery
│   │       │   └── ... (32 total)
│   │       ├── builder.rs  # Fluent query builder API
│   │       ├── transforms.rs  # Cross-dialect transform functions
│   │       ├── validation.rs  # Syntax + semantic validation
│   │       ├── schema.rs  # Schema representation
│   │       ├── scope.rs  # Scope analysis
│   │       ├── resolver.rs  # Column resolution
│   │       ├── lineage.rs  # Column lineage tracking
│   │       ├── openlineage.rs  # OpenLineage payload generation
│   │       ├── diff.rs  # AST diff (ChangeDistiller algorithm)
│   │       ├── planner.rs  # Logical query plan
│   │       ├── optimizer/  # Query optimizer modules
│   │       │   ├── annotate_types.rs  # Type annotation
│   │       │   ├── qualify_columns.rs  # Column qualification
│   │       │   ├── qualify_tables.rs  # Table qualification
│   │       │   ├── pushdown_predicates.rs
│   │       │   ├── pushdown_projections.rs
│   │       │   ├── eliminate_joins.rs
│   │       │   ├── eliminate_ctes.rs
│   │       │   ├── simplify.rs
│   │       │   └── ...
│   │       ├── traversal.rs  # DFS/BFS visitors, AST predicates
│   │       ├── ast_transforms.rs  # AST manipulation utilities
│   │       ├── error.rs  # Error types
│   │       └── time.rs  # Time format conversion
│   ├── polyglot-sql-function-catalogs/  # Optional dialect function catalogs
│   ├── polyglot-sql-wasm/  # WASM bindings (wasm-pack)
│   ├── polyglot-sql-ffi/  # C FFI bindings (cbindgen)
│   └── polyglot-sql-python/  # Python bindings (PyO3 + maturin)
├── packages/
│   ├── sdk/  # TypeScript SDK (@polyglot-sql/sdk)
│   ├── go/  # Go SDK (PureGo wrapper over FFI)
│   ├── documentation/  # TypeScript API docs site
│   ├── playground/  # Browser playground (React 19, Vite)
│   └── python-docs/  # Python API docs
├── examples/
│   ├── rust/  # Rust usage example
│   ├── typescript/  # TypeScript SDK example
│   └── c/  # C FFI usage example
└── tools/
    ├── sqlglot-compare/  # Fixture extraction & comparison
    └── bench-compare/  # Performance benchmarks
```

---

## 2. Data Flow Pipeline

```
┌──────────────────────────────────────────────────────────────────────┐
│ SQL String (source dialect)                                          │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│ Tokenizer (tokens.rs)                                                │
│ • Dialect-specific lexing rules (quotes, comments, keywords)         │
│ • Configurable via TokenizerConfig per dialect                       │
│ • Produces Vec<Token> with type, text, and Span (line/col/offset)    │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│ Parser (parser.rs, ~62K LOC)                                         │
│ • Recursive-descent with precedence climbing                         │
│ • Dialect-aware parsing (custom keywords, syntax rules)              │
│ • Produces Expression AST tree                                       │
│ • Stack safety via `stacker` feature (default-on)                    │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│ Expression AST (expressions.rs)                                      │
│ • Single tagged enum with 150+ variants                              │
│ • Each variant has its own struct (Select, Insert, Function, etc.)   │
│ • Box<Variant> keeps enum size to 2 words (tag + pointer)            │
│ • Serializable via serde (derive Serialize/Deserialize)              │
│ • Optional TypeScript type generation via `ts-rs` feature flag       │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                      ┌────┴────┐
                      │         │
            ┌─────────┘         └──────────┐
            │                              │
            ▼                              ▼
┌────────────────────────┐  ┌────────────────────────────────────┐
│ Transform Pipeline     │  │ Semantic / Analysis Modules        │
│ (transpile path)       │  │ • validation.rs → syntax checks    │
│                        │  │ • schema.rs → column/type lookup   │
│ 1. preprocess()        │  │ • scope.rs → scope analysis        │
│    (whole-tree rewrites│  │ • resolver.rs → column resolution  │
│    like eliminate_     │  │ • lineage.rs → column lineage      │
│    qualify)            │  │ • openlineage.rs → OL payloads     │
│                        │  │ • optimizer/ → query optimization  │
│ 2. transform_expr()    │  │ • diff.rs → AST diff               │
│    (per-node rewrites  │  │ • planner.rs → logical plan DAG    │
│    per dialect)        │  │ • traversal.rs → DFS/BFS visitors  │
│                        │  │                                    │
│ 3. Generator           │  │                                    │
│    (AST → SQL string)  │  │                                    │
└───────────┬────────────┘  └────────────────────────────────────┘
            │
            ▼
┌──────────────────────────────────────────────────────────────────────┐
│ SQL String (target dialect)                                          │
└──────────────────────────────────────────────────────────────────────┘
```

---

## 3. Core Abstractions

### 3.1 Expression AST

The central type is `Expression`, a large tagged enum with one variant per SQL construct:

```rust
pub enum Expression {
    // Literals
    Literal(Box<Literal>),
    Boolean(BooleanLiteral),
    Null(Null),

    // Identifiers
    Identifier(Identifier),
    Column(Box<Column>),
    Table(Box<TableRef>),
    Star(Star),

    // Queries
    Select(Box<Select>),
    Union(Box<Union>),
    Intersect(Box<Intersect>),
    Except(Box<Except>),
    Subquery(Box<Subquery>),

    // DML
    Insert(Box<Insert>),
    Update(Box<Update>),
    Delete(Box<Delete>),
    Copy(Box<CopyStmt>),

    // Binary/Unary operators
    And(Box<BinaryOp>),
    Or(Box<BinaryOp>),
    Add(Box<BinaryOp>),
    Eq(Box<BinaryOp>),
    // ... 30+ operator variants

    // Functions
    Function(Box<Function>),
    AggregateFunction(Box<AggregateFunction>),
    WindowFunction(Box<WindowFunction>),

    // Clauses
    From(Box<From>),
    Join(Box<Join>),
    Where(Box<Where>),
    OrderBy(Box<OrderBy>),
    // ...

    // ~150 total variants
}
```

Key design choices:
- **Boxed variants**: Most variants wrap their payload in `Box`, so each boxed variant costs 2 words (16 bytes on 64-bit: tag + pointer). `size_of::<Expression>()` stays at that size only as long as the unboxed payloads (`BooleanLiteral`, `Null`, `Identifier`, `Star`) are no larger than a pointer.
- **Serde support**: `#[derive(Serialize, Deserialize)]` for JSON serialization across FFI/WASM boundaries.
- **TypeScript types**: Optional `ts-rs` feature generates TypeScript interfaces.
- **Convenience methods**: `Expression::column()`, `Expression::number()`, `Expression::sql()`, `Expression::sql_for()`.

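The effect of boxing on enum size can be checked with `std::mem::size_of`. The sketch below is not Polyglot code: `BigNode`, `InlineExpr`, `BoxedExpr`, and `enum_sizes` are hypothetical names used only to compare an inline-payload enum with a boxed one.

```rust
use std::mem::size_of;

// Hypothetical payload standing in for a large AST node such as `Select`.
pub struct BigNode {
    pub name: String,
    pub args: Vec<String>,
    pub alias: Option<String>,
}

// Payloads stored inline: the enum is at least as large as its largest variant.
pub enum InlineExpr {
    Big(BigNode),
    Other(BigNode),
    Null,
}

// Payloads behind `Box`: each variant is a discriminant plus one pointer.
pub enum BoxedExpr {
    Big(Box<BigNode>),
    Other(Box<BigNode>),
    Null,
}

/// Returns (inline size, boxed size) in bytes for the two toy enums.
pub fn enum_sizes() -> (usize, usize) {
    (size_of::<InlineExpr>(), size_of::<BoxedExpr>())
}
```

The trade-off is one heap allocation per boxed node in exchange for cheap moves of `Expression` values and a compact `Vec<Expression>`.
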
### 3.2 DialectType Enum

```rust
pub enum DialectType {
    Generic, PostgreSQL, MySQL, BigQuery, Snowflake, DuckDB, SQLite,
    Hive, Spark, Trino, Presto, Redshift, TSQL, Oracle, ClickHouse,
    Databricks, Athena, Teradata, Doris, StarRocks, Materialize,
    RisingWave, SingleStore, CockroachDB, TiDB, Druid, Solr, Tableau,
    Dune, Fabric, Drill, Dremio, Exasol, DataFusion,
}
```

- Implements `FromStr` with aliases (e.g., `"mssql"` → `TSQL`, `"cockroach"` → `CockroachDB`)
- Each variant maps to a feature-gated dialect module
- Custom dialects can be registered at runtime via `CustomDialectBuilder`

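Alias handling of this kind is usually a single `match` inside `FromStr`. The following is a minimal sketch on a hypothetical four-variant `ToyDialect`; only the `mssql` and `cockroach` aliases come from the text above, while the `postgres` alias, the case-insensitive matching, and the `String` error type are assumptions.

```rust
use std::str::FromStr;

// Hypothetical stand-in for `DialectType`, reduced to four variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyDialect {
    Generic,
    PostgreSQL,
    TSQL,
    CockroachDB,
}

impl FromStr for ToyDialect {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Normalizing case and whitespace is an assumption of this sketch.
        match s.trim().to_ascii_lowercase().as_str() {
            "generic" => Ok(Self::Generic),
            "postgresql" | "postgres" => Ok(Self::PostgreSQL),
            "tsql" | "mssql" => Ok(Self::TSQL),
            "cockroachdb" | "cockroach" => Ok(Self::CockroachDB),
            other => Err(format!("unknown dialect: {other}")),
        }
    }
}
```

With this in place, `"mssql".parse::<ToyDialect>()` resolves to the `TSQL` variant, and an unknown name yields an `Err` rather than a panic.
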
### 3.3 DialectImpl Trait

```rust
pub trait DialectImpl {
    fn dialect_type(&self) -> DialectType;
    fn tokenizer_config(&self) -> TokenizerConfig { /* default */ }
    fn generator_config(&self) -> GeneratorConfig { /* default */ }
    fn generator_config_for_expr(&self, _expr: &Expression) -> GeneratorConfig { /* default */ }
    fn transform_expr(&self, expr: Expression) -> Result<Expression> { Ok(expr) }
    fn preprocess(&self, expr: Expression) -> Result<Expression> { Ok(expr) }
}
```

Each dialect implements this trait to provide:
1. **Tokenizer config**: Identifier quoting characters, string delimiters, keyword overrides, comment styles, hex number support
2. **Generator config**: 30+ flags controlling SQL output (identifier quote style, function casing, `LIMIT` vs `TOP` vs `FETCH FIRST`, etc.)
3. **Per-node transform**: Dialect-specific expression rewrites (e.g., PostgreSQL transforms `IFNULL` → `COALESCE`, SQLite transforms `TRY_CAST` → `CAST`)
4. **Whole-tree preprocess**: Structural rewrites that need full-tree context (e.g., eliminating `QUALIFY` for dialects that don't support it)

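The default-method pattern means a dialect only overrides the hooks it needs. The sketch below illustrates this on a toy AST; `ToyExpr`, `ToyDialectImpl`, `ToyGeneric`, and `ToyPostgres` are hypothetical names, and the real trait operates on `Expression` with the crate's own `Result` type.

```rust
// Hypothetical two-variant AST, standing in for `Expression`.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyExpr {
    Column(String),
    Function { name: String, args: Vec<ToyExpr> },
}

pub trait ToyDialectImpl {
    fn name(&self) -> &'static str;

    /// Per-node rewrite hook; the default leaves the node unchanged.
    fn transform_expr(&self, expr: ToyExpr) -> Result<ToyExpr, String> {
        Ok(expr)
    }

    /// Whole-tree hook; the default leaves the tree unchanged.
    fn preprocess(&self, expr: ToyExpr) -> Result<ToyExpr, String> {
        Ok(expr)
    }
}

// A dialect that relies entirely on the defaults.
pub struct ToyGeneric;

impl ToyDialectImpl for ToyGeneric {
    fn name(&self) -> &'static str {
        "generic"
    }
}

// A dialect that overrides only the per-node hook.
pub struct ToyPostgres;

impl ToyDialectImpl for ToyPostgres {
    fn name(&self) -> &'static str {
        "postgres"
    }

    fn transform_expr(&self, expr: ToyExpr) -> Result<ToyExpr, String> {
        Ok(match expr {
            // IFNULL(a, b) becomes COALESCE(a, b); arguments are kept as-is.
            ToyExpr::Function { name, args } if name.eq_ignore_ascii_case("IFNULL") => {
                ToyExpr::Function { name: "COALESCE".to_string(), args }
            }
            other => other,
        })
    }
}
```

Keeping the hooks as identity functions by default lets a new dialect start as a thin configuration change and grow rewrites incrementally.
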
### 3.4 Dialect Struct (High-Level API)

```rust
pub struct Dialect {
    dialect_type: DialectType,
    tokenizer: Tokenizer,
    generator_config: Arc<GeneratorConfig>,
    transformer: Box<dyn Fn(Expression) -> Result<Expression> + Send + Sync>,
    generator_config_for_expr: Option<Box<dyn Fn(&Expression) -> GeneratorConfig + Send + Sync>>,
    custom_preprocess: Option<Box<dyn Fn(Expression) -> Result<Expression> + Send + Sync>>,
}
```

The `Dialect` struct bundles all dialect-specific state and provides the primary API:

```rust
// Parse SQL
let ast = dialect.parse("SELECT 1")?;

// Generate SQL from AST
let sql = dialect.generate(&ast[0])?;

// Transpile between dialects
let results = dialect.transpile("SELECT IFNULL(a,b) FROM t", DialectType::PostgreSQL)?;

// Tokenize
let tokens = dialect.tokenize("SELECT 1")?;
```

### 3.5 CustomDialectBuilder

For runtime-extensible dialect support:

```rust
use polyglot_sql::dialects::{CustomDialectBuilder, Dialect, DialectType};
use polyglot_sql::generator::NormalizeFunctions;

// Register a custom dialect inheriting from PostgreSQL
CustomDialectBuilder::new("my_postgres")
    .based_on(DialectType::PostgreSQL)
    .generator_config_modifier(|gc| {
        gc.normalize_functions = NormalizeFunctions::Lower;
    })
    .register()?;

let d = Dialect::get_by_name("my_postgres").unwrap();
// Use like any built-in dialect
```

---

## 4. Dialect Implementation Details

### 4.1 PostgreSQL (`postgres.rs`, ~1,879 LOC)

**Tokenizer:**
- `$$` string literals (dollar-quoting)
- Double-quote identifier quoting
- Nested block comments
- `EXEC` treated as generic command

**Generator config highlights:**
- `identifier_quote: '"'` (double quotes)
- `single_string_interval: true` (`INTERVAL '1 day'`)
- `parameter_token: "$"` (`$1`, `$2` placeholders)
- `supports_select_into: true`
- `supports_window_exclude: true`
- `can_implement_array_any: true`

**Transform examples:**
- `IFNULL(a, b)` → `COALESCE(a, b)`
- `RAND()` → `RANDOM()`
- `DATEDIFF(day, a, b)` → `CAST(b - a AS INT)` (date subtraction)
- `JSON_EXTRACT(a, '$.x')` → `a #> '{x}'` (path operator)
- `JSON_EXTRACT_SCALAR(a, '$.x')` → `a #>> '{x}'`
- `DATE_ADD` / `DATE_SUB` → `+` / `-` interval arithmetic
- Type mappings: `TINYINT` → `SMALLINT`, `FLOAT` → `REAL`, `DOUBLE` → `DOUBLE PRECISION`
- `ILIKE` preserved (native PostgreSQL)
- `RegexpLike` → `~` operator, `RegexpILike` → `~*` operator

### 4.2 SQLite (`sqlite.rs`, ~750 LOC)

**Tokenizer:**
- Supports `"`, `[`, `` ` `` as identifier quote characters
- No nested comments
- Hex number literals (`0xCC`)

**Generator config:**
- `identifier_quote: '"'` (double quotes)
- `supports_table_alias_columns: false`
- `json_key_value_pair_sep: ","` (comma-style `JSON_OBJECT`)

**Transform examples:**
- `NVL(a, b)` → `IFNULL(a, b)`
- `TRY_CAST(x AS t)` → `CAST(x AS t)` (no try-cast)
- `RANDOM()` → function form
- `ILIKE` → `LOWER(left) LIKE LOWER(right)` (no native ILIKE)
- `CountIf(cond)` → `SUM(IIF(cond, 1, 0))`
- `CEIL(x)` → function form
- `DATE_TRUNC(unit, col)` → various `strftime` patterns
- `DATE_DIFF` → `julianday()` difference patterns

### 4.3 MySQL (`mysql.rs`)

- **Tokenizer:** Backtick identifiers, `#` comments
- **Generator:** Backtick quoting, `LIMIT` syntax, `CONCAT()` instead of `||`
- **Transforms:** `IFNULL(a,b)` → `COALESCE(a,b)`, `||` → `CONCAT()` (string concat), etc.

### 4.4 BigQuery (`bigquery.rs`)

- **Tokenizer:** Backtick identifiers, `QUALIFY` keyword
- **Generator:** Backtick quoting, `STRUCT` types, `QUALIFY` clause, `DATE_DIFF` syntax
- **Transforms:** Complex date/timestamp function mappings, `UNNEST` handling, `APPROX_COUNT_DISTINCT` preserved (native BigQuery)

### 4.5 How Transpilation Works

The full transpilation pipeline:

```
Input SQL (source dialect)
    │
    ▼
Source Dialect Tokenizer
    │
    ▼
Parser (dialect-aware)
    │
    ▼
Expression AST
    │
    ▼
Source Dialect::preprocess()       ← whole-tree rewrites
    │
    ▼
Source Dialect::transform_expr()   ← per-node rewrites (recursive, bottom-up)
    │
    ▼
Normalized AST
    │
    ▼
Target Dialect Generator
    │
    ▼
Output SQL (target dialect)
```

The transform pipeline uses an explicit task stack (not recursive calls) for the hot paths to avoid stack overflow. The `stacker` crate provides additional stack-growth protection.

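An explicit task stack replaces call-stack recursion with a heap-allocated work list. The sketch below shows one common way to do a bottom-up rewrite this way on a toy tree; `Node`, `Task`, and `rewrite_bottom_up` are hypothetical names, and Polyglot's actual implementation may be organized differently.

```rust
// Hypothetical toy AST: a leaf or a call with ordered arguments.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Leaf(String),
    Call { name: String, args: Vec<Node> },
}

enum Task {
    // Descend into a node that has not been rewritten yet.
    Visit(Node),
    // Rebuild a call once its `argc` rewritten children are on the output stack.
    Build { name: String, argc: usize },
}

/// Applies `rule` to every node, children first, without recursive calls.
pub fn rewrite_bottom_up(root: Node, rule: &dyn Fn(Node) -> Node) -> Node {
    let mut tasks = vec![Task::Visit(root)];
    let mut done: Vec<Node> = Vec::new();

    while let Some(task) = tasks.pop() {
        match task {
            Task::Visit(Node::Leaf(text)) => done.push(rule(Node::Leaf(text))),
            Task::Visit(Node::Call { name, args }) => {
                tasks.push(Task::Build { name, argc: args.len() });
                // Push children in reverse so they are visited left to right.
                for arg in args.into_iter().rev() {
                    tasks.push(Task::Visit(arg));
                }
            }
            Task::Build { name, argc } => {
                // The last `argc` finished nodes are this call's children, in order.
                let args = done.split_off(done.len() - argc);
                done.push(rule(Node::Call { name, args }));
            }
        }
    }
    done.pop().expect("exactly one rewritten root remains")
}
```

Tree depth now costs heap memory in `tasks` rather than stack frames, which is why deeply nested SQL no longer risks overflowing the thread stack during the rewrite itself.
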
Key cross-dialect transforms include:
- Function renaming: `IFNULL` ↔ `COALESCE` ↔ `NVL`, `DATEDIFF` ↔ date arithmetic, `STRING_AGG` ↔ `GROUP_CONCAT`
- Type mapping: `TINYINT` ↔ `SMALLINT`, `FLOAT` ↔ `REAL`, `JSON` ↔ `JSONB`
- Syntax conversion: `LIMIT` ↔ `TOP` ↔ `FETCH FIRST`, `||` (concat) ↔ `CONCAT()`, `SELECT INTO` ↔ `CREATE TABLE AS`
- Boolean handling: `BOOL_AND`/`BOOL_OR` ↔ `MIN`/`MAX`-over-`CASE`
- JSON operators: `JSON_EXTRACT` ↔ `#>`/`#>>` ↔ `->`/`->>` (PostgreSQL arrow syntax)

---

## 5. Fluent Builder API

The builder module (`builder.rs`, ~3.3K LOC) provides a type-safe, ergonomic way to construct SQL expressions without string interpolation:

```rust
use polyglot_sql::builder::*;

// SELECT id, name FROM users WHERE age > 18 ORDER BY name LIMIT 10
let expr = select(["id", "name"])
    .from("users")
    .where_(col("age").gt(lit(18)))
    .order_by(["name"])
    .limit(10)
    .build();

// INSERT
let ins = insert_into("users")
    .columns(["id", "name"])
    .values([lit(1), lit("Alice")])
    .build();

// CASE expression
let expr = case()
    .when(col("x").gt(lit(0)), lit("positive"))
    .else_(lit("non-positive"))
    .build();

// Set operations
let expr = union_all(
    select(["id"]).from("a"),
    select(["id"]).from("b"),
).order_by(["id"]).limit(5).build();
```

Expression helpers:
- `col("users.id")`: column reference (splits on last `.`)
- `lit(42)`, `lit("hello")`, `lit(3.14)`, `lit(true)`: literals
- `func("COALESCE", [col("a"), col("b")])`: function calls
- Operator chain: `col("age").gte(lit(18)).and(col("status").eq(lit("active")))`

The builder generates an `Expression` AST that can then be serialized to any dialect via `generate()`.

---

## 6. Validation and Schema-Aware Analysis

### 6.1 Syntax Validation

```rust
use polyglot_sql::{validate, DialectType};

let result = validate("SELECT * FORM users", DialectType::Generic);
// result.valid == false
// result.errors contain line/column/message/error codes
```

Error codes:
- `E001`: Syntax error
- `E002`: Tokenization error
- `E003`: Parse error
- `E004`: Invalid expression (not a valid statement)
- `E005`: Trailing comma in strict mode

### 6.2 Schema-Aware Validation

```rust
use polyglot_sql::{
    validate_with_schema, DialectType, SchemaColumn, SchemaTable,
    SchemaValidationOptions, ValidationSchema,
};

let schema = ValidationSchema {
    strict: Some(true),
    tables: vec![
        SchemaTable {
            name: "users".into(),
            columns: vec![
                SchemaColumn { name: "id".into(), data_type: "integer".into(), nullable: Some(false), primary_key: true, unique: false, references: None },
                SchemaColumn { name: "email".into(), data_type: "varchar".into(), nullable: Some(false), primary_key: false, unique: true, references: None },
            ],
            // ...
        },
    ],
};

let opts = SchemaValidationOptions { check_types: true, check_references: true, strict: None, semantic: true };
let result = validate_with_schema("SELECT id FROM users WHERE email = 1", DialectType::Generic, &schema, &opts);
// result.valid == false (type mismatch: email is varchar, compared to integer)
```

Schema-aware error codes:
- `E200`/`E201`: Unknown table/column
- `E210`–`E217`, `W210`–`W216`: Type checks
- `E220`, `E221`, `W220`, `W221`, `W222`: Reference/FK checks

### 6.3 Function Catalogs

Optional feature-gated function catalogs (currently ClickHouse and DuckDB) provide known function signatures for semantic type checking:

```toml
polyglot-sql = { version = "0.4", features = ["function-catalog-clickhouse"] }
```

---

## 7. Column Lineage & OpenLineage

### 7.1 Column Lineage

Trace how columns flow through a query:

```rust
use polyglot_sql::{parse, DialectType};
use polyglot_sql::lineage::get_column_lineage;

let ast = parse("SELECT a + b AS total FROM t", DialectType::Generic).unwrap();
let lineage = get_column_lineage(&ast[0], /* schema */ None, DialectType::Generic);
// lineage tells you that "total" depends on columns "a" and "b" from table "t"
```

### 7.2 OpenLineage Payload Generation

```rust
use polyglot_sql::DialectType;
use polyglot_sql::openlineage::{generate_run_event, OpenLineageOptions, OpenLineageDatasetId};

let opts = OpenLineageOptions {
    dialect: DialectType::PostgreSQL,
    producer: "my-app".into(),
    dataset_namespace: Some("mydb".into()),
    // ...
};
let event = generate_run_event("SELECT * FROM users", &opts)?;
// event is a JSON-serializable OpenLineage RunEvent with columnLineage facets
```

---

## 8. Error Handling

### 8.1 Error Types

```rust
pub enum Error {
    Tokenize { message: String, line: usize, column: usize, start: usize, end: usize },
    Parse { message: String, line: usize, column: usize, start: usize, end: usize },
    Generate(String),
    Unsupported { feature: String, dialect: String },
    Syntax { message: String, line: usize, column: usize, start: usize, end: usize },
    Internal(String),
}
```

All position-bearing errors include:
- `line`: 1-based line number
- `column`: 1-based column number
- `start` / `end`: byte offsets (0-based, end exclusive)

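The relationship between the 0-based byte offset and the 1-based line/column pair can be made concrete with a small helper. This is a sketch, not Polyglot's implementation: `line_col` is a hypothetical function, and counting columns in characters (rather than bytes) is an assumption.

```rust
/// Returns the 1-based (line, column) for a 0-based byte offset into `src`.
/// Yields `None` when the offset is out of range or splits a UTF-8 character.
pub fn line_col(src: &str, offset: usize) -> Option<(usize, usize)> {
    if offset > src.len() || !src.is_char_boundary(offset) {
        return None;
    }
    let before = &src[..offset];
    // Every newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line begins just after the last newline, or at byte 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    // Column counts characters on the current line (an assumption of this sketch).
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}
```

For the input `"SELECT *\nFORM users"`, byte offset 9 is the `F` of the misspelled keyword, which this helper maps to line 2, column 1.
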
```rust
let err = Error::parse("Unexpected token", 3, 15, 42, 44);
assert_eq!(err.line(), Some(3));
assert_eq!(err.column(), Some(15));
assert_eq!(err.start(), Some(42));
```

### 8.2 Validation Errors

```rust
pub struct ValidationError {
    pub message: String,
    pub line: Option<usize>,
    pub column: Option<usize>,
    pub severity: ValidationSeverity, // Error or Warning
    pub code: String,                 // e.g., "E001", "E200"
    pub start: Option<usize>,
    pub end: Option<usize>,
}

pub struct ValidationResult {
    pub valid: bool,
    pub errors: Vec<ValidationError>,
}
```

### 8.3 Guard Rail Errors

Format operations have configurable guard limits that return structured errors:

- `E_GUARD_INPUT_TOO_LARGE`: input exceeds `max_input_bytes`
- `E_GUARD_TOKEN_BUDGET_EXCEEDED`: token count exceeds `max_tokens`
- `E_GUARD_AST_BUDGET_EXCEEDED`: AST node count exceeds `max_ast_nodes`
- `E_GUARD_SET_OP_CHAIN_EXCEEDED`: UNION/INTERSECT/EXCEPT chain exceeds `max_set_op_chain`

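Guard limits of this kind amount to cheap checks performed before the expensive work. The sketch below covers only the first two limits; `ToyGuard`, `GuardError`, and `check_guards` are hypothetical names, and whitespace splitting stands in for the real tokenizer.

```rust
// Hypothetical guard configuration with two of the four limits.
#[derive(Debug, Clone, Copy)]
pub struct ToyGuard {
    pub max_input_bytes: usize,
    pub max_tokens: usize,
}

// Structured error: a stable code plus the limit and the observed value.
#[derive(Debug, PartialEq, Eq)]
pub struct GuardError {
    pub code: &'static str,
    pub limit: usize,
    pub actual: usize,
}

pub fn check_guards(sql: &str, guard: ToyGuard) -> Result<(), GuardError> {
    // The byte-length check comes first because it needs no tokenization.
    if sql.len() > guard.max_input_bytes {
        return Err(GuardError {
            code: "E_GUARD_INPUT_TOO_LARGE",
            limit: guard.max_input_bytes,
            actual: sql.len(),
        });
    }
    // Whitespace splitting is a stand-in for a real token count.
    let tokens = sql.split_whitespace().count();
    if tokens > guard.max_tokens {
        return Err(GuardError {
            code: "E_GUARD_TOKEN_BUDGET_EXCEEDED",
            limit: guard.max_tokens,
            actual: tokens,
        });
    }
    Ok(())
}
```

Ordering the checks from cheapest to most expensive lets oversized input be rejected before any parsing cost is paid.
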
---

## 9. AST Traversal & Analysis

### 9.1 Traversal

```rust
use polyglot_sql::{parse, DialectType};
use polyglot_sql::traversal::*;

let ast = parse("SELECT a, b FROM t WHERE x > 1", DialectType::Generic).unwrap();
let columns = get_columns(&ast[0]); // ["a", "b", "x"]
let tables = get_tables(&ast[0]);   // ["t"]
```

Available predicates (70+):
- `is_select`, `is_insert`, `is_update`, `is_delete`, `is_ddl`
- `is_join`, `is_where`, `is_group_by`, `is_order_by`, `is_limit`
- `is_function`, `is_aggregate`, `is_subquery`, `is_cte`
- `is_comparison`, `is_logical`, `is_arithmetic`
- `contains_subquery`, `contains_aggregate`, `contains_window_function`

Iterators: `DfsIter`, `BfsIter` for depth-first and breadth-first traversal.

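The two traversal orders differ only in the container that holds pending nodes: a stack for depth-first, a queue for breadth-first. The sketch below shows this on a toy tree; `Tree`, `Dfs`, and `Bfs` are hypothetical names and do not reflect the signatures of `DfsIter` or `BfsIter`.

```rust
use std::collections::VecDeque;

// Hypothetical tree node standing in for an AST expression.
#[derive(Debug)]
pub struct Tree {
    pub label: &'static str,
    pub children: Vec<Tree>,
}

// Depth-first (pre-order) iterator backed by a stack.
pub struct Dfs<'a> {
    stack: Vec<&'a Tree>,
}

// Breadth-first (level-order) iterator backed by a queue.
pub struct Bfs<'a> {
    queue: VecDeque<&'a Tree>,
}

impl<'a> Dfs<'a> {
    pub fn new(root: &'a Tree) -> Self {
        Self { stack: vec![root] }
    }
}

impl<'a> Bfs<'a> {
    pub fn new(root: &'a Tree) -> Self {
        Self { queue: VecDeque::from([root]) }
    }
}

impl<'a> Iterator for Dfs<'a> {
    type Item = &'a Tree;

    fn next(&mut self) -> Option<Self::Item> {
        let node = self.stack.pop()?;
        // Reverse so the leftmost child is popped first.
        self.stack.extend(node.children.iter().rev());
        Some(node)
    }
}

impl<'a> Iterator for Bfs<'a> {
    type Item = &'a Tree;

    fn next(&mut self) -> Option<Self::Item> {
        let node = self.queue.pop_front()?;
        self.queue.extend(node.children.iter());
        Some(node)
    }
}
```

Because both are ordinary iterators, predicates compose with standard adapters, for example `Dfs::new(&root).filter(|n| n.label == "where")`.
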
### 9.2 AST Transforms

```rust
use polyglot_sql::ast_transforms::*;
use polyglot_sql::builder::{col, lit}; // helpers used to build the WHERE condition

// Rename tables
let renamed = rename_tables(expr, &[("old_name", "new_name")]);

// Add WHERE condition
let filtered = add_where(expr, col("active").eq(lit(true)));

// Remove LIMIT/OFFSET
let unlimited = remove_limit_offset(expr);
```

### 9.3 AST Diff

```rust
use polyglot_sql::diff::{diff, diff_with_config, DiffConfig};

let edits = diff(&source_expr, &target_expr, true);
for edit in &edits {
    if edit.is_change() {
        println!("{:?}", edit);
    }
}
```

Uses the ChangeDistiller algorithm with Dice coefficient matching for structural comparison.

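The Dice coefficient scores the overlap of two sets as 2|A ∩ B| / (|A| + |B|). A minimal sketch over character bigrams follows; `bigrams` and `dice` are hypothetical helpers, and the choice of lowercase character bigrams as the compared features is an assumption rather than a description of `diff.rs`.

```rust
use std::collections::HashSet;

// Set of adjacent character pairs in the lowercased input.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Dice coefficient 2|A ∩ B| / (|A| + |B|) over character-bigram sets.
/// Returns a value in [0.0, 1.0]; 1.0 means identical bigram sets.
pub fn dice(a: &str, b: &str) -> f64 {
    let (x, y) = (bigrams(a), bigrams(b));
    if x.is_empty() && y.is_empty() {
        // Inputs shorter than two characters have no bigrams; compare directly.
        return if a.to_lowercase() == b.to_lowercase() { 1.0 } else { 0.0 };
    }
    let shared = x.intersection(&y).count();
    2.0 * shared as f64 / (x.len() + y.len()) as f64
}
```

A matcher would typically pair two nodes when this score exceeds a configured threshold, then classify unmatched nodes as inserts or removals.
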
### 9.4 Logical Planner

```rust
use polyglot_sql::planner::Plan;

let plan = Plan::from_expression(&expr);
// plan.root is a Step DAG
// plan.leaves() returns leaf steps
// plan.dag() returns the dependency graph
```

Step kinds: Scan, Filter, Project, Aggregate, Join, Sort, Limit, etc.

---

## 10. Optimizer Modules

The optimizer is available behind the `semantic` feature flag:

| Module | Purpose |
|---|---|
| `qualify_columns.rs` | Resolve unqualified column references to table.column |
| `qualify_tables.rs` | Expand table names with schema/catalog |
| `annotate_types.rs` | Infer and annotate expression types |
| `pushdown_predicates.rs` | Push WHERE conditions into JOINs |
| `pushdown_projections.rs` | Reduce columns to only what's needed |
| `eliminate_joins.rs` | Remove unnecessary JOINs |
| `eliminate_ctes.rs` | Inline single-use CTEs |
| `simplify.rs` | Simplify boolean expressions, constant folding |
| `normalize.rs` | Expression normalization |
| `canonicalize.rs` | Query canonicalization |
| `subquery.rs` | Subquery analysis |

---

## 11. Async Support

**Polyglot does not use async I/O**: it is a pure computational library. All operations are synchronous and CPU-bound:

- `parse()`: synchronous
- `generate()`: synchronous
- `transpile()`: synchronous
- `validate()`: synchronous
- `format()`: synchronous

This is by design: Polyglot operates on SQL strings in memory, with no network or filesystem I/O. For use in async contexts (Tokio, async-std), callers should use `tokio::task::spawn_blocking()` or similar to offload CPU-heavy parsing/transpilation to a blocking thread pool.

---

## 12. Feature Flags

| Flag | Description | Default |
|---|---|---|
| `all-dialects` | Enable all 32 dialect parsers | ✅ |
| `generate` | SQL generation from AST | ✅ |
| `transpile` | Cross-dialect transpilation (implies `generate`) | ✅ |
| `builder` | Fluent query builder API (implies `generate`) | ✅ |
| `ast-tools` | AST inspection & transform utilities | ✅ |
| `semantic` | Schema, resolver, lineage, optimizer, validation | ✅ |
| `openlineage` | OpenLineage payload generation (implies `semantic`) | ✅ |
| `diff` | AST diff support (implies `generate`) | ✅ |
| `planner` | Logical planning helpers | ✅ |
| `time` | Time-format conversion helpers | ✅ |
| `stacker` | Stack-growth protection for native builds | ✅ |
| `bindings` | TypeScript type generation via `ts-rs` | ❌ |
| `dialect-postgresql` | PostgreSQL dialect only | — |
| `dialect-mysql` | MySQL dialect only | — |
| ... (one per dialect) | Individual dialect selector | — |
| `function-catalog-clickhouse` | ClickHouse function catalog | ❌ |
| `function-catalog-duckdb` | DuckDB function catalog | ❌ |
| `function-catalog-all-dialects` | All function catalogs | ❌ |

Minimal WASM build (for constrained targets):
```toml
polyglot-sql = { version = "0.4", default-features = false, features = ["generate", "transpile", "dialect-postgresql", "dialect-mysql"] }
```

---

## References

- Source code examined: `/workspace/polyglot/crates/polyglot-sql/src/` (~241K LOC)
- Architecture documentation: `/workspace/polyglot/docs/sqlglot-architecture.md`
- Benchmark results: `/workspace/polyglot/docs/benchmark.md`
- README: `/workspace/polyglot/README.md`, `/workspace/polyglot/crates/polyglot-sql/README.md`
- CHANGELOG: `/workspace/polyglot/CHANGELOG.md`