Python Engines' Current OneLake Catalog Integration

Same 22 TPC-H queries. Same Delta Lake data on OneLake (SF10). Same single Fabric node. Five Python SQL engines: DuckDB (Delta Classic), DuckDB (Iceberg REST), LakeSail, Polars, DataFusion. You can download the notebook here.

Unfortunately, neither Daft nor chDB supports reading from OneLake over ABFSS.

DuckDB's Iceberg read support is not new, but it used to be very slow. The next version, 1.5, made massive improvements, and it is now slightly faster than Delta.

They all run the same SQL now

All five engines executed the exact same SQL. No dialect tweaks, no rewrites. The one exception: Polars failed on Query 11 with

`SQLSyntaxError: subquery comparisons with '>' are not supported`
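That error comes from Q11's HAVING clause, which compares an aggregate against a scalar subquery; this is the construct Polars' SQL interface rejects. A simplified reproduction of the pattern, with tiny stand-in data rather than the real SF10 table:

import polars as pl

# Tiny stand-in for TPC-H partsupp, just to show the construct
partsupp = pl.DataFrame({
    "ps_partkey": [1, 1, 2],
    "ps_supplycost": [10.0, 12.0, 8.0],
    "ps_availqty": [100, 50, 200],
})

# Q11-style HAVING: an aggregate compared with '>' against a scalar subquery
pl.sql("""
    SELECT ps_partkey, SUM(ps_supplycost * ps_availqty) AS value
    FROM partsupp
    GROUP BY ps_partkey
    HAVING SUM(ps_supplycost * ps_availqty) >
           (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001 FROM partsupp)
""").collect()   # raises the SQLSyntaxError above on current Polars releases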

Everything else just worked. SQL compatibility across Python engines is basically solved in 2026. The differentiators are elsewhere.

Freshness vs. performance is a trade-off you should be making

With the Iceberg REST catalog, `MAX_TABLE_STALENESS '5 minutes'` means the engine caches the catalog metadata and skips the round-trip for 5 minutes.
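A minimal sketch of what that attach looks like with DuckDB's iceberg extension; the OneLake Iceberg REST endpoint and the authentication setup are placeholders here, not the exact strings:

import duckdb
conn = duckdb.connect()
conn.sql(f"""INSTALL iceberg;
attach '{ws}' AS db (
    TYPE iceberg,
    ENDPOINT '<onelake-iceberg-rest-endpoint>',   -- placeholder REST catalog endpoint
    MAX_TABLE_STALENESS '5 minutes'               -- reuse the cached snapshot for 5 minutes
); USE db
""")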

DuckDB’s Delta Classic does the same with `PIN_SNAPSHOT`.

import duckdb
conn = duckdb.connect()
# PIN_SNAPSHOT caches the Delta log and file list at attach time
conn.sql(f"""INSTALL delta_classic FROM community;
attach 'abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}'
AS db (TYPE delta_classic, PIN_SNAPSHOT); USE db
""")

Your dashboard doesn’t need sub-second freshness. Your reporting query doesn’t care about the last 30 seconds of ingestion. Declaring a staleness budget upfront – predictable, explicit – is not a compromise. It’s the right default for analytics.

Object store calls are the real bottleneck

Every engine reads from OneLake over ABFSS. Every Parquet file is a network call. It doesn’t matter how fast your engine scans columnar data in memory if it makes hundreds of HTTP calls to list files and read metadata before it starts.

DuckDB Delta Classic (PIN_SNAPSHOT): caches the Delta log and file list at attach time. Subsequent queries skip the metadata round-trips.

DuckDB Iceberg (MAX_TABLE_STALENESS): caches the Iceberg snapshot from the catalog API. Within the staleness window, no catalog calls.

LakeSail: has native OneLake catalog integration (SAIL_CATALOG__LIST). You point it at the lakehouse, and it discovers tables and schemas through the catalog. Metadata resolution is handled by the catalog layer, not by scanning storage paths. However, it has no concept of a cache; every query calls the OneLake Catalog API.

Polars, DataFusion: resolve the Delta log on every query. Every query pays the metadata tax (see the sketch below).
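For contrast, here is roughly what the Polars path looks like: every scan_delta call re-reads the Delta log from OneLake. The storage_options keys and the token audience for ABFSS authentication are assumptions for illustration, not taken from the benchmark notebook:

import polars as pl
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token

# No attach, no pinned snapshot: each scan resolves the Delta log again
lineitem = pl.scan_delta(
    f"abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}/lineitem",
    storage_options={"bearer_token": token, "use_fabric_endpoint": "true"},
)
print(lineitem.select(pl.len()).collect())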

An engine that caches metadata will beat a “faster” engine that doesn’t. Every time, especially at scale.

How about writes?

You can write to OneLake today using Python deltalake or pyiceberg; that works fine. But native SQL writes (CREATE TABLE AS, INSERT INTO) through the engine catalog integration itself? That's still the gap. LakeSail can write Delta just fine, but only via a path.
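A sketch of the path-based write that does work today; the storage_options keys and token audience are assumptions, and ws, lh, and schema are the same workspace, lakehouse, and schema variables used in the attach snippets above:

import pandas as pd
from azure.identity import DefaultAzureCredential
from deltalake import write_deltalake

token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token

df = pd.DataFrame({"id": [1, 2, 3]})
write_deltalake(
    f"abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}/my_table",
    df,
    mode="append",
    storage_options={"bearer_token": token, "use_fabric_endpoint": "true"},
)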

LakeSail and DuckDB Iceberg both depend on OneLake’s catalog adding write support. The read path works through the catalog API, but there’s no write path yet. When it lands, both engines get writes for free.

DuckDB Delta Classic has a different bottleneck: DuckDB’s Delta extension itself. Write support exists but is experimental and not usable for production workloads yet.

The bottom line

Raw execution speed will converge. These are all open source projects; developers read each other's code, and there's no magical trick one has that others can't adopt. The gap narrows with every release.

Catalog integration and caching are the real differentiators. And I'd argue that even *reading* from OneLake is nearly solved now.

Full disclosure: I authored the DuckDB Delta Classic extension and the LakeSail OneLake integration (both with the help of AI), so take my enthusiasm for catalog integration with a grain of bias.

Querying a OneLake Table with RLS and CLS Using DuckDB's MSSQL Extension

OneLake supports Row Level Security and Column Level Security. These protections work when you use trusted engines such as Power BI or Spark running inside Microsoft Fabric. In those environments, the compute engine operates within a controlled boundary, so security rules can be enforced properly.

However, if you try to access the storage directly from a Python notebook or a local engine running on your laptop (including open source Spark), access is blocked. Otherwise, Row Level Security and Column Level Security would be meaningless. Security only works when the engine itself is trusted and governed.

This blog shows a workaround by leveraging the SQL endpoint. These policies are enforced at the SQL endpoint, meaning any external tool that connects through it, including DuckDB, automatically respects them.

Let’s walk through a quick example.

The Data

We have a power.duid table containing 626 rows of Australian power generation facilities. Columns include DUID, Region, FuelSourceDescriptor, Participant, State, latitude, and longitude.

Configuring Row-Level Security

In the Lakehouse role readsometables, we add an RLS rule that restricts visibility to a single region:

SELECT * FROM power.duid WHERE Region='WA1'

Members of this role will only see rows where Region = 'WA1'.

Configuring Column-Level Security

On the same role, we enable CLS and grant Read visibility only to specific columns: DUID, Region, FuelSourceDescriptor, State, latitude, and longitude. The Participant column is excluded.

Querying with DuckDB’s MSSQL Extension

From any Python environment, we can connect to the SQL endpoint using DuckDB’s community MSSQL extension and Azure authentication:

import duckdb
from azure.identity import DefaultAzureCredential

conn = duckdb.connect()
token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token

conn.sql(f"""
    ATTACH IF NOT EXISTS
    'Server=<your-sql-endpoint>;Database=data'
    AS data (TYPE mssql, ACCESS_TOKEN '{token}')
""")

If you are running it inside a Fabric notebook, you first need to upgrade DuckDB:

!pip install duckdb --upgrade
import sys
sys.exit(0)   # stop the Python session so the upgraded package is loaded on the next run

then run this code:

import duckdb
conn = duckdb.connect()
token = notebookutils.credentials.getToken('sql')
conn.sql(f"""
INSTALL mssql FROM community;
ATTACH IF NOT EXISTS
'Server=SQL_endpoint;Database=data'
AS data (TYPE mssql, ACCESS_TOKEN '{token}')
""")



Now when we query, RLS and CLS are enforced server-side:

conn.sql("SELECT DISTINCT(Region) FROM data.power.duid").show()

Only WA1 — the RLS filter is working. And if we select all columns:

conn.sql("SELECT * FROM data.power.duid LIMIT 4").show()

you get an error saying that you cannot select Participant.

No access to the Participant column: CLS is doing its job. If you remove Participant from the query, everything works fine.

First Look at Geometry Types in Parquet

Getting different parties in the software industry to agree on a common standard is rare. Most of the time, a dominant player sets the rules. Occasionally, however, collaboration happens organically and multiple teams align on a shared approach. Geometry types in Parquet are a good example of that.

In short: there is now a defined way to store GIS data in Parquet. Both Delta Lake and Apache Iceberg have adopted the standard (at least in the spec). The challenge is that actual implementation across engines and libraries is uneven.

  • Iceberg: no geometry support yet in Java or Python; see the spec
  • Delta: it's unclear whether it's supported in the open source implementation (I need to try Sedona and report back); nothing in the spec, though
  • DuckDB: recently added support (you need a nightly build or to wait for 1.4)
  • PyArrow: has included support for a few months; just use the latest release
  • Arrow Rust: no support, which means no Delta Python support 😦

The important point is that agreeing on a specification does not guarantee broad implementation. And even when there is a standard spec, that does not mean the initial implementation will be open source; it is hard to believe we still have this situation in 2025!

Let's run it in a Python Notebook

To test things out, I built a Python notebook that downloads public geospatial data, merges it with population data, writes it to Parquet, and renders a map using GeoPandas. Make sure to install the latest versions of DuckDB, PyArrow, and GeoPandas:

!pip install -q duckdb  --pre --upgrade
!pip install -q pyarrow --upgrade
!pip install geopandas  --upgrade
import sys
sys.exit(0)   # stop the Python session so the upgraded packages are loaded on the next run
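The core of the notebook boils down to a few lines. A simplified sketch with hypothetical file names and column names (the actual notebook downloads public data):

import pandas as pd
import geopandas as gpd

# Hypothetical inputs: country boundaries plus a population CSV keyed by ISO code
gdf = gpd.read_file("countries.geojson")
pop = pd.read_csv("population.csv")          # columns assumed: iso_a3, population
gdf = gdf.merge(pop, on="iso_a3")

# The geometry column is written with the new Parquet geometry metadata
gdf.to_parquet("/lakehouse/default/Files/countries.parquet")

# Quick map render
gdf.plot(column="population", legend=True)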

At first glance, that may not seem groundbreaking. After all, the same visualization could be done with GeoJSON. The real advantage comes from how geometry types in Parquet store bounding box coordinates. With this metadata, spatial filters can be applied directly during reads, avoiding the need to scan entire datasets.

That capability is what makes the feature truly valuable: efficient filtering and querying at scale. Note that DuckDB does not currently push those filters down; you will probably need to wait until early 2026 (it is hard to believe 2025 is nearly gone).
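Here is what such a filter looks like in DuckDB today. The engine still evaluates it after reading, but once pushdown lands, the per-row-group bounding boxes would let it skip data entirely (path and geometry column are the ones from the workaround below; the envelope roughly covers Australia):

import duckdb

duckdb.sql("INSTALL spatial; LOAD spatial")
duckdb.sql("""
    SELECT *
    FROM '/lakehouse/default/Files/countries.parquet'
    WHERE ST_Intersects(geom, ST_MakeEnvelope(110, -45, 155, -10))
""").show()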

👉 Workaround if your favorite engine doesn't support it.

A practical workaround is to read the Parquet file with DuckDB (or any library that supports geometry types) and export the geometry column back as WKT text. This allows Fabric to handle the data, albeit without the benefits of native geometry support. For example, Power BI can read WKT just fine:

duckdb.sql("select geom, ST_AsText(geom) as wkt  from '/lakehouse/default/Files/countries.parquet' ")

For Power BI support for WKT, I have written some blogs before. Some people may argue that you need a specialized tool for spatial data; personally, I think BI tools are the natural place to display map data 🙂

Using gpt-oss 20B for Text to SQL 

TL;DR : 

As a quick first impression, I tested generating SQL queries based on a YAML-based semantic model; all the files are stored here. Considering I have only 4 GB of VRAM, it is not bad at all!

To be clear, this is not a very rigorous benchmark; I just used the result of the last runs, and different runs will give you slightly different results. It is just to get a feeling for the model, but it is a workload I do care about, which is the only thing that really matters.

The Experimental Setup

The experiment uses a SQL generation test based on the TPC-DS dataset (scale factor 0.1), featuring a star schema with two fact tables (store_sales, store_returns) and multiple dimension tables (date_dim, store, customer, item). The models were challenged with 20 questions ranging from simple aggregations to complex analytical queries requiring proper dimensional modeling techniques.

The Models Under Test

  1. O3-Mini (Reference Model): Cloud-based Azure model serving as the “ground truth”
  2. Qwen3-30B-A3B-2507: Local model via LM Studio
  3. GPT-OSS-20B : Local model via LM Studio

Usually I use Ollama, but I moved to LM Studio because of its MCP tools support. For Qwen3, strangely, code did not perform well at all, and the thinking mode is simply too slow.
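For reference, LM Studio exposes an OpenAI-compatible local endpoint, so driving a local model looks roughly like this (the port, model identifier, and file name are assumptions for illustration, not the author's exact harness):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

semantic_model = open("semantic_model.yaml").read()   # the YAML semantic model

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",                       # model name as loaded in LM Studio
    messages=[
        {"role": "system", "content": semantic_model},
        {"role": "user", "content": "Total store sales and returns by year"},
    ],
)
print(resp.choices[0].message.content)                # the generated SQL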

Key Technical Constraints

  • Memory Limit: Both local models run on 4 GB of VRAM; my laptop has 32 GB of RAM
  • Timeout: 180 seconds per query
  • Retry Logic: Up to 1 attempt for syntax error correction
  • Validation: Results compared using value-based matching (exact, superset, subset)

The Testing Framework

The experiment employs a robust testing framework with several features:

Semantic Model Enforcement

The models were provided with a detailed semantic model that explicitly defines:

  • Proper dimensional modeling principles
  • Forbidden patterns (direct fact-to-fact joins)
  • Required CTE patterns for combining fact tables
  • Specific measure definitions and business rules

Multi-Level Result Validation

Results are categorized into five match types (a sketch of the matching logic follows the list):

  • Exact Match: Identical results including order
  • Superset: Model returns additional valid data
  • Subset: Model returns partial but correct data
  • Mismatch: Different results
  • Error: Execution or generation failures
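As a rough illustration of the value-based matching, a sketch of the idea (not the author's exact framework; execution errors are caught separately before comparison):

def classify_match(expected: list[tuple], actual: list[tuple]) -> str:
    # Compare query results by value
    if actual == expected:               # identical rows, same order
        return "exact"
    if set(expected) <= set(actual):     # extra rows on top of the expected ones
        return "superset"
    if set(actual) <= set(expected):     # partial but correct rows
        return "subset"
    return "mismatch"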

The Results: Not bad at all

Overall Performance Summary

Both local models achieved 75-85% accuracy, which is remarkable considering they're running on consumer-grade hardware with just 4 GB of VRAM. The GPT-OSS-20B model slightly outperformed Qwen3, with 85% accuracy versus 75%, although it is way slower.

I guess we are not there yet for interactive use cases; it is simply too slow for a local setup, especially for complex queries.

Tool calling

A more practical use case is tool calling: you can basically use it to interact with a database or Power BI through an MCP server, and because it is totally local, you can go ahead and read the data and do whatever you want, as it is fully isolated to your own computer.

The Future is bright

I don't want to sound negative: just 6 months ago, I could not make it work at all, and now I have a choice between multiple vendors, all open source. I am very confident that these models will get even more efficient with time.