Power BI with DuckDB, 4 years later

Four years ago I wrote a blog about using DuckDB with Power BI in DirectQuery. It got a fair number of likes on LinkedIn 🙂 along with the one comment I didn’t want to hear: how does this work in production? (Craig, if you’re reading this, you were right.)

Back then I thought the technology was the hard part and the rest would sort itself out. It didn’t.

The ODBC driver never really worked in any non-trivial setup. Filters didn’t push down, decimal precision was buggy. It has gotten better since, but two show stoppers remained:

  • DuckDB is in-process, so the driver is the database. There’s no warm, long-running session. Every query starts from scratch.
  • I don’t think those drivers can realistically be certified (personal opinion). And Power BI Service, or any hosted BI service for that matter, is not going to host an in-process engine for free. An on-prem data gateway is not really a good option either.

In 2026 things are way better. MotherDuck (DuckDB’s SaaS) shipped a PostgreSQL endpoint. Problem solved: Power BI speaks Postgres, and it works out of the box.

Then last week DuckDB released Quack. For my own sanity I’ll just call it “DuckDB Server.” It is just an extension; a single function call and you have a server !!

My first reaction was annoyance. Four years of waiting, and they shipped a proprietary wire protocol. I was hoping for pg wire. I want my driver to work. I don’t really care about a 2x improvement if nothing interoperates.

Luckily I was partially wrong. Within two days there was an ADBC driver from gizmodata/adbc-driver-quack, and, to my surprise, a Power BI custom connector from Curt Hagenlocher (think of him as the Linus of Power Query). my understanding it is a side project, not official Microsoft.

And somehow, the whole thing worked. It was beautiful.

But lesson learned from last time: this is experimental, with no guarantee the connector will ever be certified.

The main change from the 2022 post is that instead of pointing at parquet files, I’m pointing at a catalog and getting tables back, like an actual database instead of a pile of files and duckdb got way better.

High level architecture

  1. OneLake Iceberg Catalog — OneLake exposes data as tables. You need three things:
    • Endpoint: https://onelake.table.fabric.microsoft.com/iceberg
    • An Entra ID auth token
    • Path to the Lakehouse/Warehouse: workspace_name/Lakehouse_name.Lakehouse

  1. DuckDB + iceberg extension — reads the catalog and the underlying parquet over HTTPS.

  1. Entra IDaz account get-access-token --resource https://storage.azure.com/ mints a short-lived bearer token. No service principal, no app registration. I have a script that grabs the token, and I opened duckdb-azure#170 hoping to make this much simpler.

  1. DuckDB Endpoint — turns the engine into a TCP server on 127.0.0.1:9494, speaking DuckDB’s native wire protocol (whatever that means).

  1. The ADBC Driver — Python client and Power BI share the same DLL, you need to manually install it from curt github page

You can download all the files here

Power BI

Let’s just share a video. Yes, 600M rows, warm run in my laptop

Python Notebook

TPC-H SF=10 (10 GB), 22 queries, run twice in the same session via client.ipynb. Numbers are seconds, copied straight from the notebook output.

ColdWarm
Total~5 min 29 s~30 s

Cold time is dominated by parquet I/O over HTTPS from OneLake. Bandwidth and seek count, not CPU. Warm runs hit DuckDB’s in-process buffer cache, Onelake endpoint is in another continent and my internet provider is horrible 🙂

Optimization on this stack should target bytes read and seeks (codec, row-group size, predicate pushdown, range prefetch), not query plans.

This is exactly why server mode make sense, as the warm cache is shared by all client (notebook, Power BI, AI Agent)

Not production ready

  • The Entra token has a ~1h TTL. As far as I can tell, DuckDB has no way to auto-refresh tokens.
  • The driver is not certified, so it can’t be used in the service, if you want it added to PowerBI, create an idea in Fabric forum and vote
  • DuckDB Server is new. Don’t expect SQL Server maturity yet 🙂
  • DuckDB’s remote file cache is RAM only. When you restart DuckDB, you lose it and have to deal with the cold-run pain again and egress fees 😦
  • The DuckDB Azure extension is still pretty rough in places. To be fair, they’ve openly said they don’t have the bandwidth.

Hopefully it won’t take another four years to make this production ready.

Still, seeing DuckDB as a single binary serving a 600M row table to Power BI was genuinely fun. and The Iceberg catalog is awesome !!!

Aggregate 100 GB using PowerQuery and DuckDB

Maybe not known enough, but you can run a python script inside PowerQuery in the desktop, and deploy it to PowerBI service using a personal gateway.

it is just a POC that showcase you can do a lot; just using a decent laptop and python.

let’s say, you have a folder with 100 GB of parquet files, and you want to aggregate some results for further analysis in PowerBI, Obviously importing 100 GB maybe not be a great idea, in this scenario, from what I can see, people recommend stuff Like PySpark or Dask , which I guess are great option, but if you want a lightweight option with a simple pip install then DuckDB is amazing !!

For testing, first write your SQL Query using only 1 parquet file, when you are happy with the results, change the path from “folder/x.parquet” to “folder/*.parquet”

In this example I have a folder of 100 GB of Parquet files.

in PowerQuery you need just to insert a python script, you don’t need to know anything about python, it is just a standard SQL Query, PowerQuery is smart enough to know that df is a dataframe, and it will show the result in the next step.

import duckdb 
con = duckdb.connect()
df =con.execute('''
select  L_RETURNFLAG,L_LINESTATUS,count(*) from
 'C:/xxxxx/parquet/lineitem/*.parquet' 
        group by 1,2
''').df()

And here is the final results.

The duration will depend mainly on your CPU and SSD speed, DuckDB consume little memory, in that example less than 2 GB, the secret sauce is parallel scan of Parquet files.

Traditionally Big Data means; Data that don’t fit into RAM, nowadays, Big Data is how much Data can fit in your SSD, and that’s a welcome change.

Loading 1 Billion New York Taxi Dataset into Datamart

Was chatting with David Eldersveld and he suggested that he wants to run a competition using the famous New York Taxi Dataset with Datamart, long story short, I did endup publishing my attempt before he had the time to start the competition, my sincere apology.

The report is using my personal PPU instance, the data is located here , personally I wanted to exclusively use tools available in PowerBI out of the box, no Synapse nor Azure stuff or BigQuery, Just pure self service tools.

Initially I didn’t really believed that PowerQuery can download such a big volume of data, my first set back , PowerQuery can not read parquet files available in public url, but as usual Chris Webb has an excellent blog explaining the reason, and gave a workaround.

I added the code here, it does work with any PowerQuery in Dataflow, Datamart and PowerBI desktop, unfortunately Excel is not supported yet, when prompted for authentication use anonymous

The only section of the code that you should pay attention to is selecting the number of files to download by default it is 2, but you can increase it

Only when I was writing this blog, I noticed the files for 2022 are using a slightly different URL, will update the code later. Code Updated.

To reduce the database size, I had to split Datetime to date and time, low cardinality is good for performance too.

Loading into Datamart

I don’t know how much it took datamart to load the data, currently Query refresh history is broken, but I think it is more than 6 hours, I maybe wrong, but Datamart take a bit of time to generate the tables with Clustered Columnstore Index

Initially I loaded only 2 then 30 files just to see how Datamart behave and finally I went for 100 files, and it did work again to my surprise.

and the Lineage View of the report

Performance

The Performance is not bad at all considering, the data was loaded as it is and it not sorted, although the parquet files are organized by month, unfortunately there are some outlier in every file see for example, so you get overlapping segments.

You can check the database size on disk by running this Query

EXEC sp_spaceused

Optimisation

Pretty much the only optimisation you can do in Datamart is to pre sort the data before loading it, but when you have 1 billion rows saved in parquet files, sorting is a very expensive operation, but there are options I think.

Create another Datamart and load it from the “raw” Datamart and define incremental refresh which will create partitions, yes partitions should improve the performance.

Hybrid table in PowerBI Dataset where only the recent data is cached in Vertipaq and the history kept in Datamart as a Direct Query Mode.

Final thoughts

The Publish to web report is here, a very big missing piece is the option to append data to an existing Datamart, this will make adding new data without a full refresh extremely trivial, I know about incremental refresh, and I am sure a hack like this may work, but we want the real deal, Dataflow people hurry up 🙂

I notice something interesting because the price of PPU is fixed, I felt I can experiment without the fear of getting a massive bill, maybe reserved pricing is not a bad thing after all.

My first reaction when I saw datamart was, that it will be a big validation for PowerQuery, and it is, as Alex said, PowerQuery everything !!!

Poor man’s snapshot using PowerBI Dataflow

It is a quick Hack and maybe useful, let’s say you want to keep a snapshot of a previous data, for example you have a dimension table and you want to see how the data has changed between multiple load.

This Scenario was famously hard to do in PowerBI, because when you load a table a into PowerBI Database, by default it will overwrite the existing data, there is an option for incremental refresh, but it assume your data change daily or at least regularly.

The canonical solution for this scenario nowadays is to attach a workspace to an Azure Storage, unfortunately, it is very problematic for a couple of of reasons

– Business users don’t usually have access to Azure Resources.

– As of today, a PowerBI admin need to opt for that Option, which apply for all Workspace Admins, it would have being easier if that option can be granted to only some Workspace Admins.

XMLA End Point

The Idea is very simple.

  • Getting already Loaded Data from a PowerBI using XMLA Endpoint ( See this blog for how to use it).
  • Load the new data using a Dataflow.
  • In the Same Dataflow you can append only the new Data, or any operation you want ( like merging new Data), it is very flexible, it can use any attribute not only Date
  • Just make sure there is a lag between Dataflow refresh and PowerBI Dataset refresh

Dataflow Should be Better

The Obvious Question ? because Dataflow keep snapshots of recent loads, why we need the extra step of attaching an Azure Storage, why not exposing this data using some extra option in the UI.

I think after all this years, it is time for Dataflow to tackle more use cases, can we have an option to append Data instead of overwrite ? or why not an easy to use upsert.

PowerBI made Data Modeling nearly trivial, Hopefully Dataflow can make Enterprise ETL available for everyone.