Python engines and the current OneLake Catalog integration

Same 22 TPC-H queries. Same Delta Lake data on OneLake (SF10). Same single Fabric node. Five Python SQL engines: DuckDB (Delta Classic), DuckDB (Iceberg REST), LakeSail, Polars, and DataFusion. You can download the notebook here.

Unfortunately, neither Daft nor chDB supports reading from OneLake over ABFSS.

DuckDB's Iceberg read support is not new, but it used to be very slow. The next version, 1.5, makes massive improvements, and it is now slightly faster than Delta.

They all run the same SQL now

All five engines executed the exact same SQL. No dialect tweaks, no rewrites. The one exception: Polars failed on Query 11 with

`SQLSyntaxError: subquery comparisons with '>' are not supported`

Everything else just worked. SQL compatibility across Python engines is basically solved in 2026. The differentiators are elsewhere.

Freshness vs. performance is a trade-off you should be making

With DuckDB Iceberg, `MAX_TABLE_STALENESS '5 minutes'` means the engine caches the catalog metadata and skips the round-trip for 5 minutes.
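
A minimal sketch of what that could look like, assuming an Iceberg secret like the one shown in the Iceberg REST section at the end of this post; treating the workspace name as the warehouse identifier and placing the option inside the ATTACH are my assumptions, not official syntax:

import duckdb

conn = duckdb.connect()
conn.sql(f"""
    INSTALL iceberg; LOAD iceberg;
    -- assumed: 'onelake_identity_iceberg' is an Iceberg secret pointing at the
    -- OneLake Iceberg REST endpoint, and the workspace name (ws) acts as the warehouse
    ATTACH '{ws}' AS db (
        TYPE iceberg,
        SECRET onelake_identity_iceberg,
        MAX_TABLE_STALENESS '5 minutes'  -- reuse the cached snapshot for 5 minutes
    );
    USE db
""")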

DuckDB’s Delta Classic does the same with `PIN_SNAPSHOT`.

import duckdb

conn = duckdb.connect()
conn.sql(f"""
    INSTALL delta_classic FROM community;
    -- ws / lh / schema hold your workspace, lakehouse and schema names
    ATTACH 'abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}'
    AS db (TYPE delta_classic, PIN_SNAPSHOT);
    USE db
""")

Your dashboard doesn’t need sub-second freshness. Your reporting query doesn’t care about the last 30 seconds of ingestion. Declaring a staleness budget upfront – predictable, explicit – is not a compromise. It’s the right default for analytics.

Object store calls are the real bottleneck

Every engine reads from OneLake over ABFSS. Every Parquet file is a network call. It doesn’t matter how fast your engine scans columnar data in memory if it makes hundreds of HTTP calls to list files and read metadata before it starts.

DuckDB Delta Classic (PIN_SNAPSHOT): caches the Delta log and file list at attach time. Subsequent queries skip the metadata round-trips.

DuckDB Iceberg (MAX_TABLE_STALENESS): caches the Iceberg snapshot from the catalog API. Within the staleness window, no catalog calls.

LakeSail: has native OneLake catalog integration (SAIL_CATALOG__LIST). You point it at the lakehouse and it discovers tables and schemas through the catalog. Metadata resolution is handled by the catalog layer, not by scanning storage paths. It has no concept of a cache, though: every query calls the OneLake catalog API.

Polars, DataFusion: resolve the Delta log on every query. Every query pays the metadata tax.

An engine that caches metadata will beat a “faster” engine that doesn’t. Every time, especially at scale.

How about writes?

You can write to OneLake today using Python deltalake or pyiceberg – that works fine (see the sketch below). But native SQL writes (CREATE TABLE AS, INSERT INTO) through the engine's catalog integration itself? That's still the gap. LakeSail can write Delta just fine, but only through a path, not through the catalog.
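
For reference, a minimal sketch of the path-based deltalake write that works today; the table path, the token scope and the storage options reflect my setup and are assumptions you may need to adjust:

import pandas as pd
from azure.identity import DefaultAzureCredential
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3]})  # any small dataframe

# token for OneLake storage access
token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token

# path-based write: no catalog involved, just the abfss table path
write_deltalake(
    f"abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}/my_table",
    df,
    mode="overwrite",
    storage_options={"bearer_token": token, "use_fabric_endpoint": "true"},
)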

LakeSail and DuckDB Iceberg both depend on OneLake’s catalog adding write support. The read path works through the catalog API, but there’s no write path yet. When it lands, both engines get writes for free.

DuckDB Delta Classic has a different bottleneck: DuckDB’s Delta extension itself. Write support exists but is experimental and not usable for production workloads yet.

The bottom line

Raw execution speed will converge. These are all open source projects, developers read each other’s code, there’s no magical trick one has that others can’t adopt. The gap narrows with every release.

Catalog integration and caching are the real differentiators. And I'd argue that even *reading* from OneLake is nearly solved now.

Full disclosure: I authored the DuckDB Delta Classic extension and the LakeSail OneLake integration (both with the help of AI), so take my enthusiasm for catalog integration with a grain of bias.

Querying a OneLake Table with RLS and CLS Using DuckDB's MSSQL Extension

OneLake supports Row Level Security and Column Level Security. These protections work when you use trusted engines such as Power BI or Spark running inside Microsoft Fabric. In those environments, the compute engine operates within a controlled boundary, so security rules can be enforced properly.

However, if you try to access the storage directly from a Python notebook or a local engine running on your laptop, including open source Spark, direct access is blocked. Otherwise, Row Level Security and Column Level Security would be meaningless. Security only works when the engine itself is trusted and governed.

This blog shows a workaround by leveraging the SQL endpoint. These policies are enforced at the SQL endpoint, meaning any external tool that connects through it — including DuckDB — automatically respects them.

Let’s walk through a quick example.

The Data

We have a power.duid table containing 626 rows of Australian power generation facilities. Columns include DUID, Region, FuelSourceDescriptor, Participant, State, latitude, and longitude.

Configuring Row-Level Security

In the Lakehouse role readsometables, we add an RLS rule that restricts visibility to a single region:

SELECT * FROM power.duid WHERE Region='WA1'

Members of this role will only see rows where Region = 'WA1'.

Configuring Column-Level Security

On the same role, we enable CLS and grant Read visibility only to specific columns: DUID, Region, FuelSourceDescriptor, State, latitude, and longitude. The Participant column is excluded.

Querying with DuckDB’s MSSQL Extension

From any Python environment, we can connect to the SQL endpoint using DuckDB’s community MSSQL extension and Azure authentication:

import duckdb
from azure.identity import DefaultAzureCredential

conn = duckdb.connect()
token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token

conn.sql(f"""
    INSTALL mssql FROM community;
    ATTACH IF NOT EXISTS
    'Server=<your-sql-endpoint>;Database=data'
    AS data (TYPE mssql, ACCESS_TOKEN '{token}')
""")

If you are running it inside a Fabric notebook, first you need to upgrade DuckDB:

!pip install duckdb --upgrade
import sys
sys.exit(0)

then run this code:

import duckdb

conn = duckdb.connect()
token = notebookutils.credentials.getToken('sql')
conn.sql(f"""
    INSTALL mssql FROM community;
    ATTACH IF NOT EXISTS
    'Server=SQL_endpoint;Database=data'
    AS data (TYPE mssql, ACCESS_TOKEN '{token}')
""")



Now when we query, RLS and CLS are enforced server-side:

conn.sql("SELECT DISTINCT(Region) FROM data.power.duid").show()

Only WA1 — the RLS filter is working. And if we select all columns:

conn.sql("SELECT * FROM data.power.duid LIMIT 4").show()

you get an error saying that you cannot select Participant.

No Participant column — CLS is doing its job. If you remove it from the select list, everything works fine.
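
For example, selecting only the visible columns (the column names come from the table description above) goes through without an error:

# Participant is excluded by the CLS rule, so leave it out of the select list
conn.sql("""
    SELECT DUID, Region, FuelSourceDescriptor, State, latitude, longitude
    FROM data.power.duid
    LIMIT 4
""").show()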

Writing to SQL Server using DuckDB

I don't know much about SQL Server. The closest I ever got to it was having read-only access to a database. I remember 10 years ago we had a use case for a database, and IT decided for some reason that we were not allowed to install SQL Server Express, even though it was free and a Microsoft product. To this day, it is still a mystery to me. Anyway, at that time I was introduced to PowerPivot and PowerQuery, and the rest was history.

Although I knew very little about SQL Server, I knew that SQL Server users are in love with the product. I worked with a smart data engineer who had a very clear world view:

I used SQL Server for years. It is rock solid. I am not interested in any new tech.

At the time, I thought he lacked imagination. Now I think I see his point.

When SQL Server was added to Fabric, I was like, oh, that’s interesting. But I don’t really do operational workloads anyway, so I kind of ignored it.

Initially I tried to make it fit my workflow, which is basically developing Python notebooks using DuckDB or Polars (depending on my mood) inside VSCode with GitHub Copilot, and deploying them later into Fabric. Of course you can insert a dataframe into SQL Server, but it did not really click for me at first. To be clear, I am not saying it is not possible. It just did not feel natural in my workflow (messing with pyodbc is not fun).

By the way, the SQL extension inside VSCode is awesome.

A week ago I was browsing the DuckDB community extensions and I came across the mssql extension. And boy, that was an emotional rollercoaster (the last time I had this experience was when I first used Tabular Editor, a very long time ago).

You just attach a SQL Server database using either username and password or just a token. That's it. The rest is managed by the extension. Suddenly everything made sense to me!

import time
import duckdb
from azure.identity import DefaultAzureCredential

# assumes the mssql community extension is already installed (INSTALL mssql FROM community;)
conn = duckdb.connect()

if PLATFORM == 'fabric':
    # inside a Fabric notebook you can use notebookutils.credentials.getToken("sql") instead
    token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token

    # the serverless endpoint may be asleep, so retry the attach a few times
    for attempt in range(3):
        try:
            conn.sql(f"""
                ATTACH IF NOT EXISTS
                'Server={host};Database={db}'
                AS db (TYPE mssql, ACCESS_TOKEN '{token}')
            """)
            break
        except Exception as e:
            if attempt < 2:
                print(f"Attempt {attempt+1} failed, waiting 60s for serverless wake-up...")
                time.sleep(60)
            else:
                raise e
else:
    # local connection with username/password ({pr} is the port)
    conn.sql(f"""
        ATTACH OR REPLACE
        'Server={host},{pr};Database={db};User Id={user};Password={pw};Encrypt=yes'
        AS db (TYPE mssql)
    """)

conn.sql("SET mssql_query_timeout = 6000; SET mssql_ctas_drop_on_failure = true;")
print(f"Connected to SQL Server via {PLATFORM}")

Again, I know there are other, more efficient ways to load data, but if I have a small CSV that I processed using Python, nothing compares to the simplicity of a dataframe. Here are some things I learned in that week. I know it is obvious for someone who has used it, but for me, it is like I was living under a rock all these years 🙂

If you run SHOW ALL TABLES in DuckDB, the attached SQL Server tables show up alongside everything else.

TDS and bulk insertion

You don’t need ODBC. You can talk to SQL Server directly using TDS, which is the native protocol it understands. There is also something called BCP, which basically lets you batch load data efficiently instead of pushing rows one by one. Under the hood it streams the data in chunks, and the performance is actually quite decent. It is not some hacky workaround. It feels like you are speaking SQL Server’s own language, and that changes the whole experience.
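
As an illustration, here is a small sketch of the kind of dataframe load I mean; the file name and the table name dbo.facilities are made up, and it assumes the db attach and the mssql_ctas_drop_on_failure setting from the snippet above:

import pandas as pd

# any small dataframe you processed in Python
df = pd.read_csv("facilities.csv")

# DuckDB sees Python dataframes by name, so a plain CTAS pushes the data
# into SQL Server through the extension (TDS / bulk insert under the hood)
conn.sql("CREATE TABLE db.dbo.facilities AS SELECT * FROM df")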

SQL Server is not only for OLTP

Turns out people use SQL Server for analytics too, with a columnar table format (clustered columnstore indexes).

CREATE CLUSTERED COLUMNSTORE INDEX cci_{table}
ON {schema}.{table}
ORDER ({order_col});

I tested a typical analytical benchmark and more or less it performs like a modern single node data warehouse.

Accelerating analytics for rowstore tables

Basically, there is a batch mode where the engine processes row-based tables in batches instead of strictly row by row. The engine can apply vectorized operations, better CPU cache usage, and smarter memory management even on traditional rowstore tables. It is something DuckDB added with great fanfare to accelerate PostgreSQL heap tables. I was a bit surprised that SQL Server already had it for years.

RLS/CLS for untrusted engines

If you have a CLS or RLS Lakehouse table and you want to query it from an untrusted engine, say DuckDB running on your laptop, today you can't, for a good reason: direct storage access is blocked. This extension solves that, because you query the SQL endpoint itself.

Most of the fancy things were already invented

Basically, many of the things people think are next-generation technologies were already implemented decades ago. SQL control flow, temp tables, complex transactions, fine-grained security, workload isolation: it was all already there.

I think the real takeaway for me is that user experience is as important – if not more important – than the SQL engine itself, and when a group of very smart people like something, there is probably a very good reason for it.

Query the OneLake Iceberg REST catalog using DuckDB SQL

This is a quick post on how to query the OneLake Iceberg REST catalog using pure SQL with DuckDB. And yes, you need a service principal that has access to the lakehouse.

CREATE OR REPLACE PERSISTENT SECRET onelake_identity_iceberg (
    TYPE iceberg,
    CLIENT_ID 'xxxxxxxxxxxxxx',
    CLIENT_SECRET 'yyyyyyyyyyyyyyyyyyy',
    OAUTH2_SCOPE 'https://storage.azure.com/.default',
    OAUTH2_SERVER_URI 'https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token',
    ENDPOINT 'https://onelake.table.fabric.microsoft.com/iceberg'
);
CREATE OR REPLACE PERSISTENT SECRET azure_spn (
    TYPE azure,
    PROVIDER service_principal,
    TENANT_ID 'ccccccc',
    CLIENT_ID 'iiiiiiiiiiiiii',
    CLIENT_SECRET 'xbndlfrewi',
    ACCOUNT_NAME 'onelake'
);
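
With those secrets in place, something along these lines should let you attach and query. Treating the workspace name as the warehouse identifier, and my_lakehouse / my_table as placeholders, is my assumption, so adjust for your own naming:

INSTALL iceberg; LOAD iceberg;

-- the catalog resolves namespaces and tables through the OneLake Iceberg REST
-- endpoint configured in the secret above
ATTACH 'WORKSPACE_NAME' AS onelake (TYPE iceberg, SECRET onelake_identity_iceberg);

SHOW ALL TABLES;
SELECT count(*) FROM onelake.my_lakehouse.my_table;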

It works reasonably well assuming your region is not far from your laptop, or even better, if you run it inside Fabric there are no network shenanigans. I recorded a video showing my experience.

Why read operations do not always need full consistency checks

I hope DuckDB eventually adds an option that allows turning off table state checks for purely read scenarios. The current behaviour is correct because you always need the latest state when writing in order to guarantee consistency. However, for read queries it feels unnecessary and hurts the overall user experience. Power BI solved this problem very well with its concept of framing, and something similar in DuckDB would make a big difference. Notice that the DuckDB Delta reader already supports pinning a version.