iceberg – Small Data And self service

Power BI with DuckDB, 4 years later

Four years ago I wrote a blog about using DuckDB with Power BI in DirectQuery. It got a fair number of likes on LinkedIn 🙂 along with the one comment I didn’t want to hear: how does this work in production? (Craig, if you’re reading this, you were right.)

Back then I thought the technology was the hard part and the rest would sort itself out. It didn’t.

The ODBC driver never really worked in any non-trivial setup. Filters didn’t push down, decimal precision was buggy. It has gotten better since, but two show stoppers remained:

DuckDB is in-process, so the driver is the database. There’s no warm, long-running session. Every query starts from scratch.
I don’t think those drivers can realistically be certified (personal opinion). And Power BI Service, or any hosted BI service for that matter, is not going to host an in-process engine for free. An on-prem data gateway is not really a good option either.

In 2026 things are way better. MotherDuck (DuckDB’s SaaS) shipped a PostgreSQL endpoint. Problem solved: Power BI speaks Postgres, and it works out of the box.

Then last week DuckDB released Quack. For my own sanity I’ll just call it “DuckDB Server.” It is just an extension; a single function call and you have a server !!

My first reaction was annoyance. Four years of waiting, and they shipped a proprietary wire protocol. I was hoping for pg wire. I want my driver to work. I don’t really care about a 2x improvement if nothing interoperates.

Luckily I was partially wrong. Within two days there was an ADBC driver from gizmodata/adbc-driver-quack, and, to my surprise, a Power BI custom connector from Curt Hagenlocher (think of him as the Linus of Power Query). my understanding it is a side project, not official Microsoft.

And somehow, the whole thing worked. It was beautiful.

But lesson learned from last time: this is experimental, with no guarantee the connector will ever be certified.

The main change from the 2022 post is that instead of pointing at parquet files, I’m pointing at a catalog and getting tables back, like an actual database instead of a pile of files and duckdb got way better.

High level architecture

OneLake Iceberg Catalog — OneLake exposes data as tables. You need three things:
- Endpoint: https://onelake.table.fabric.microsoft.com/iceberg
- An Entra ID auth token
- Path to the Lakehouse/Warehouse: workspace_name/Lakehouse_name.Lakehouse

DuckDB + iceberg extension — reads the catalog and the underlying parquet over HTTPS.

Entra ID — az account get-access-token --resource https://storage.azure.com/ mints a short-lived bearer token. No service principal, no app registration. I have a script that grabs the token, and I opened duckdb-azure#170 hoping to make this much simpler.

DuckDB Endpoint — turns the engine into a TCP server on 127.0.0.1:9494, speaking DuckDB’s native wire protocol (whatever that means).

The ADBC Driver — Python client and Power BI share the same DLL, you need to manually install it from curt github page

You can download all the files here

Power BI

Let’s just share a video. Yes, 600M rows, warm run in my laptop

Python Notebook

TPC-H SF=10 (10 GB), 22 queries, run twice in the same session via client.ipynb. Numbers are seconds, copied straight from the notebook output.

	Cold	Warm
Total	~5 min 29 s	~30 s

Cold time is dominated by parquet I/O over HTTPS from OneLake. Bandwidth and seek count, not CPU. Warm runs hit DuckDB’s in-process buffer cache, Onelake endpoint is in another continent and my internet provider is horrible 🙂

Optimization on this stack should target bytes read and seeks (codec, row-group size, predicate pushdown, range prefetch), not query plans.

This is exactly why server mode make sense, as the warm cache is shared by all client (notebook, Power BI, AI Agent)

Not production ready

The Entra token has a ~1h TTL. As far as I can tell, DuckDB has no way to auto-refresh tokens.
The driver is not certified, so it can’t be used in the service, if you want it added to PowerBI, create an idea in Fabric forum and vote
DuckDB Server is new. Don’t expect SQL Server maturity yet 🙂
DuckDB’s remote file cache is RAM only. When you restart DuckDB, you lose it and have to deal with the cold-run pain again and egress fees 😦
The DuckDB Azure extension is still pretty rough in places. To be fair, they’ve openly said they don’t have the bandwidth.

Hopefully it won’t take another four years to make this production ready.

Still, seeing DuckDB as a single binary serving a 600M row table to Power BI was genuinely fun. and The Iceberg catalog is awesome !!!

The Boring Reason Iceberg Matters

TL;DR: Iceberg’s value is sociological, not technical. And if you care about lightweight, single-process engines like datafusion and duckdb, it’s probably your best shot at first-class lakehouse support with Wide interoperability.

The first real data engineering work I did was an ingestion pipeline built on pandas and Parquet with Hive-style partitioning — an environment where 512 MB of memory was a genuine architectural constraint, not a rounding error. That experience shaped how I think about data tooling: the engine matters, but so does the ability to swap it out. Engine independence is something I care about more than most people I know, which is probably why I find myself paying close attention to Iceberg. Not for the reasons most people cite, though. It’s not the spec. It’s where the engineering hours are landing.

Getting query engines and catalogs to talk to each other is genuinely hard work. Most of it is unglamorous: error envelope parsing, metadata round-tripping, commit response shapes, partition spec edge cases, auth token quirks between vendors. None of it ships a feature anyone demos. None of it makes a good blog post. It’s the maintenance work that quietly determines whether your stack actually functions.

This is the part that’s easy to miss. Standards don’t converge because the spec is good. They converge because enough people, at enough companies, decide to put sustained hours into the interop bugs — year after year, across release cycles, through personnel changes and shifting priorities.

Look at the Iceberg committer list: Netflix, Apple, Databricks, Snowflake, AWS, Dremio, Microsoft. No single employer controls what gets merged. The incentive to fix cross-vendor interoperability bugs is distributed across the committer base itself. The governance isn’t just a formality — it’s what makes it possible for engineers from genuinely different setups to find, reproduce, and fix the same bug together.

There is one specific layer worth watching: the Iceberg REST catalog specification. It has become the canonical standard for how engines and catalogs communicate. Adoption is real: Polaris, lakekeeper, Gravitino, and a growing list of vendor-managed catalogs implement it.

But adoption and interoperability are not the same thing.

In practice, vendors still interpret parts of the specification differently. Engines end up handling quirks like slightly different response shapes, undocumented authentication flows, or inconsistent error handling. The nearest analogy is ODBC — a real standard, widely implemented, and still years of painful work before the “connect to anything” promise actually held up in practice.

The Iceberg REST catalog ecosystem feels earlier in that curve. The gap between specification and implementation is exactly where a lot of the maintenance work is happening right now. And closing that gap is precisely the kind of work Iceberg’s governance model is designed to support, because the people hitting the bugs are often the same people with commit access to fix them.

This is where the stakes become concrete, especially for lightweight engines.

For cloud warehouses and large JVM-based systems, the maintenance burden is manageable. There are full-time teams paid to absorb it. For the newer generation of small, single-process engines, the situation is very different. These are compact teams building engines with a specific focus: query latency, memory efficiency, embedded analytics, local execution.

Every hour spent chasing interoperability edge cases is an hour not spent improving the engine itself.

Several of these engines already support Iceberg in some form. But broad, reliable lakehouse support depends on the ecosystem doing its part: stable specifications, faithful implementations, bugs surfaced and fixed upstream.

A well-maintained standard is not just a convenience for these projects. It’s what makes serious lakehouse support achievable without hollowing out the team building the engine.

There is also a broader cost to fragmentation that rarely gets discussed directly. Every hour the ecosystem spends maintaining incompatible metadata layers is an hour not spent making lakehouse systems actually better. That cost doesn’t show up clearly in any individual issue tracker, but it accumulates across the entire ecosystem.

That’s the real argument for Iceberg.

Not that it’s a particularly clever format. Formats are mostly boring by design.

The real advantage is that Iceberg has assembled the right kind of maintenance coalition: enough companies with genuinely different incentives, governance that distributes merge authority, and enough independent implementations that bugs surface from the edges instead of only the center.

Whether that coalition survives long term as the market consolidates is still an open question. But right now, Iceberg is the ecosystem where the boring interoperability work is most likely to get done by someone other than you.

And in infrastructure, that’s close to everything.

That’s also why this feels personal to me.

The 512 MB pipeline I started with wrote Parquet files and hoped for the best — no transactions, no snapshot isolation, just partitions and careful scheduling to avoid stepping on yourself.

What I actually wanted, and couldn’t realistically have at the time, was proper ACID semantics with snapshot isloation end to end from something small and cheap. A cloud function. A tiny process with almost no memory to spare.

Iceberg is the closest thing to a realistic path toward that today. Not because the specification is especially elegant, but because it’s where the maintenance work is happening.

And eventually, ecosystems catch up to where the maintenance happens.

Special thanks to Raki Rahman for a few conversations that genuinely reshaped how I think about this space.

Ideas are mine; writing assisted by AI.

Python Engines current Onelake Catalog integration

Same 22 TPC-H queries. Same Delta Lake data on OneLake (SF10). Same single Fabric node. Five Python SQL engines: DuckDB (Delta Classic), DuckDB (Iceberg REST), LakeSail, Polars, DataFusion , you can download the notebook here

unfortunately both daft and chdb did not support reading from Onelake abfss

DuckDB iceberg read support is not new, but it is very slow, but the next version 1.5 made a massive improvements and now it is slightly faster than Delta

They all run the same SQL now

All five engines executed the exact same SQL. No dialect tweaks, no rewrites. The one exception: Polars failed on Query 11 with

`SQLSyntaxError: subquery comparisons with '>' are not supported`

Everything else just worked,SQL compatibility across Python engines is basically solved in 2026. The differentiators are elsewhere.

Freshness vs. performance is a trade-off you should be making

			
import duckdb
conn = duckdb.connect()
conn.sql(f"""  install delta_classic FROM community  ;
attach 'abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}'
AS db (TYPE delta_classic, PIN_SNAPSHOT); USE db
            """)

		

`MAX_TABLE_STALENESS ‘5 minutes’` means the engine caches the catalog metadata and skips the round-trip for 5 minutes.

DuckDB’s Delta Classic does the same with `PIN_SNAPSHOT`.

			
import duckdb
conn = duckdb.connect()
conn.sql(f"""  install delta_classic FROM community  ;
attach 'abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema}'
AS db (TYPE delta_classic, PIN_SNAPSHOT); USE db
            """)

		

Your dashboard doesn’t need sub-second freshness. Your reporting query doesn’t care about the last 30 seconds of ingestion. Declaring a staleness budget upfront – predictable, explicit – is not a compromise. It’s the right default for analytics.

Object store calls are the real bottleneck

Every engine reads from OneLake over ABFSS. Every Parquet file is a network call. It doesn’t matter how fast your engine scans columnar data in memory if it makes hundreds of HTTP calls to list files and read metadata before it starts.

– DuckDB Delta Classic (PIN_SNAPSHOT): caches the Delta log and file list at attach time. Subsequent queries skip the metadata round-trips.

– DuckDB Iceberg (MAX_TABLE_STALENESS): caches the Iceberg snapshot from the catalog API. Within the staleness window, no catalog calls.

– LakeSail: has native OneLake catalog integration (SAIL_CATALOG__LIST). You point it at the lakehouse, it discovers tables and schemas through the catalog. Metadata resolution is handled by the catalog layer, not by scanning storage paths, but it has no concept of cache, every query will call Onelake Catalog API

– Polars, DataFusion: resolve the Delta log on every query. Every query pays the metadata tax.

An engine that caches metadata will beat a “faster” engine that doesn’t. Every time, especially at scale.

How about writes?

You can write to OneLake today using Python deltalake or pyiceberg – that works fine. But native SQL writes (CREATE TABLE AS INSERT INTO ) through the engine catalog integration itself? That’s still the gap, lakesail can write delta just fine but using a path.

– LakeSail and DuckDB Iceberg both depend on OneLake’s catalog adding write support. The read path works through the catalog API, but there’s no write path yet. When it lands, both engines get writes for free.

– DuckDB Delta Classic has a different bottleneck: DuckDB’s Delta extension itself. Write support exists but is experimental and not usable for production workloads yet.

The bottom line

Raw execution speed will converge. These are all open source projects, developers read each other’s code, there’s no magical trick one has that others can’t adopt. The gap narrows with every release.

Catalog Integration and cache are the real differentiator. And I’d argue that even *reading* from OneLake is nearly solved now.

—

Full disclosure: I authored the DuckDB Delta Classic extension and the LakeSail OneLake integration (both with the help of AI), so take my enthusiasm for catalog integration with a grain of bias

Query Onelake Iceberg REST catalog using Duckdb SQL

it is a quick post on how to query Onelake Iceberg REST Catalog using pure SQL with DuckDB, and yes you need a service principal that has access to the lakehouse

CREATE or replace PERSISTENT secret onelake_identity_iceberg (
    TYPE ICEBERG,
    CLIENT_ID 'xxxxxxxxxxxxxx',
    CLIENT_SECRET 'yyyyyyyyyyyyyyyyyyy' ,
    OAUTH2_SCOPE   'https://storage.azure.com/.default' ,
    OAUTH2_SERVER_URI 'https://login.microsoftonline.com/TENANT_ID /oauth2/v2.0/token' ,
    ENDPOINT 'https://onelake.table.fabric.microsoft.com/iceberg'
);
CREATE or replace PERSISTENT secret azure_spn (
    TYPE azure,
    PROVIDER service_principal,
    TENANT_ID 'ccccccc',
    CLIENT_ID 'iiiiiiiiiiiiii',
    CLIENT_SECRET 'xbndlfrewi' ,
    ACCOUNT_NAME 'onelake'
);

it works reasonably well assuming your region is not far from your laptop, or even better , if you run it inside Fabric then there is no network shenanigans, I recorded a video showing my experience

Why read operations do not always need full consistency checks

I hope DuckDB eventually adds an option that allows turning off table state checks for purely read scenarios. The current behaviour is correct because you always need the latest state when writing in order to guarantee consistency. However, for read queries it feels unnecessary and hurts the overall user experience. PowerBI solved this problem very well with its concept of framing, and something similar in DuckDB would make a big difference, notice duckdb delta reader already support pin version.

	Power BI with DuckDB… on Using DuckDB with PowerBI
	Querying a Fabric La… on Writing to SQL Server using…
	Benjamin on Running DuckDB at 10 TB s…
	mim on Running DuckDB at 10 TB s…
	Benjamin on Running DuckDB at 10 TB s…

High level architecture

Power BI

Python Notebook

Not production ready

Share this:

Share this:

They all run the same SQL now

Freshness vs. performance is a trade-off you should be making

Object store calls are the real bottleneck

How about writes?

The bottom line

Share this:

Why read operations do not always need full consistency checks

Share this: