How far Python alone can take you on Delta

1. delta-rs is an ACID Delta writer

delta-rs implements the Delta Lake protocol natively. mergeupdate, and delete go through optimistic concurrency control on every commit. No external coordinator, no catalog service. Two writers race for the same version of the log, one wins, the other retries.

All you need is a path. No metastore to provision, no catalog endpoint, no JDBC connection, no warehouse to wake up. A folder on disk (or on ADLS / S3 / GCS) is the whole interface.

Setup: B is a Delta table being fed a series of CSV batches (batch_001.csvbatch_002.csv, …). Each merge should ingest only files B hasn’t seen yet.

A naming note: the project is delta-rs but the Python package is deltalake (pip install deltalake). On Fabric, stick with what’s preinstalled — Python notebooks already ship with deltalake and OneLake access configured.

From the notebook:

# Bootstrap target B with batch_001 already ingested
write_deltalake(Target_PATH, pa.table({...}), mode="overwrite")
vB = DeltaTable(Target_PATH).version() # v0
# Compute the rows to ingest from the target's current state
con.sql(f"ATTACH '{Target_PATH}' AS tgt (TYPE delta, VERSION {vB});")
our_rows = con.sql("""
SELECT s.id, s.value, parse_filename(s.filename) AS filename
FROM read_csv_auto('source_csv/*.csv', filename=true) s
WHERE parse_filename(s.filename) NOT IN (SELECT DISTINCT filename FROM tgt)
""").arrow()
# → 80 new rows from batch_002..005
# First merge: 80 inserts, commits cleanly
DeltaTable(Target_PATH).merge(
source=our_rows,
predicate="t.filename = s.filename",
source_alias="s", target_alias="t",
).when_not_matched_insert_all().execute()
# Same merge re-run: 0 inserts. The predicate is idempotent.
DeltaTable(Target_PATH).merge(...).when_not_matched_insert_all().execute()

Two commits, both correct. The second run does nothing because the predicate already sees the rows. The transaction model travels with the table itself: move the folder, open it from another machine, and the next writer continues from the last commit.

write_deltalake(mode="append") and write_deltalake(mode="overwrite") are blind on purpose. Blind append means N concurrent appenders all succeed and the result is the union of their rows — exactly what you want for event streams or log ingestion. Blind overwrite means the new data wins and whatever was there is gone — what you want when the writer is the authoritative source for the table. OCC only kicks in for operations that actually read the target (mergeupdatedelete), since those are the only ones where a concurrent change can invalidate what you just computed.

2. I want the full read-to-write transaction, Python API is fine

A common pattern: DuckDB or Polars reads, transforms, and hands an Arrow table to delta-rs to commit. The notebook above is exactly that shape — DuckDB computes “filenames not yet in B” and delta-rs merges the result.

Inside delta-rs, OCC still works. What it cannot see is the read on the other side of the engine boundary. delta-rs knows about the merge it is about to commit; it does not know that DuckDB read B at version vB thirty seconds ago.

Carry the snapshot across the boundary by pinning both sides to the same version:

vB = DeltaTable(Target_PATH).version()
import duckdb
con = duckdb.connect()
con.sql(f"ATTACH '{Target_PATH}' AS tgt (TYPE delta, VERSION {vB});")
our_rows = con.sql("SELECT ...").arrow()
DeltaTable(Target_PATH, version=vB).merge( # ← pinned
source=our_rows,
predicate="t.filename = s.filename",
source_alias="s", target_alias="t",
).when_not_matched_insert_all().execute()

The OCC check now compares against vB instead of HEAD. If another process touched B in the meantime — say a parallel job deleted batch_001.csv — the pinned merge raises:

Failed to commit transaction: Commit failed: a concurrent transaction deleted data this operation read.

Catch it, recompute the diff against fresh state, retry. On the Polars side, pl.read_delta(path, version=vB) accepts the same pin, so the pattern works for any reader that exposes versioned reads.

The pin is just a number. No new infrastructure, no shared coordinator, still path-based.

3. I don’t want the Python API, I want SQL only

If you would rather write SQL — say, drive the pipeline from dbt — your options on Delta today are Spark and Fabric Data Warehouse. Both have supported dbt adapters and work great in production. I have to admit, I was hoping DuckDB would fill that gap, since it is a database and SQL-level transactions are what you expect from a database. The market went the other way: investment is going into catalog-based lakehouse formats (DuckLake, Iceberg), and the DuckDB Delta writer that does exist is tied to Unity Catalog and limited to blind appends. I don’t see them investing in a file-based conflict resolver any time soon 🙂 Lakesail seems interested in this use case, but it is still too early to call.

Takeaway

I personally use delta-rs for CSV ingestion, appends, and recording results from high-concurrency performance tests — it is fast, cheap, and bullet-proof in those scenarios. The open source maintainers are very helpful and care deeply about the product, as they use it themselves in production. But it is not the right tool for every case; Data Warehouse and Spark are more appropriate for complex workloads. With time you intuitively pick the tool that makes sense for a particular job and how much compute you can spend. None of that has to be an either/or: at the end of the day it is a lakehouse, and the whole concept of a lakehouse is having the option to choose the engine. That option matters — if we say only one engine (open source or not) is blessed for writes, then there is no point in the concept of a lakehouse.


Notebook: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/TableFormat/delta/occ.ipynb

Thanks Raki for keeping me honest:)

Thanks to Ion for explaining how version worked when doing merge: https://www.linkedin.com/in/ionkoutsouris/

Edit : how about Spark

Thanks to Frithjof for explaining Spark behaviour : The merge fixes one snapshot at transaction start (current HEAD = post-delete) and uses it for both its scan and its conflict check. Internally consistent — but bound to HEAD-at-merge-start, which Spark chose, not to the state our read saw, same behaviour when using delta_rs with a lazy dataframe : https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/TableFormat/delta/occ_spark.ipynb

Running DuckDB at 10 TB scale

This started as just a fun experiment. I was curious to see what happens when you push DuckDB really hard — like, absurdly hard. So I went straight for the biggest Python single-node compute we have in Microsoft Fabric: 64 cores and 512 GB of RAM. Because why not?


Setting Things Up

I generated data using tpchgen and registered it with delta_rs. Both are Rust-based tools, but I used their Python APIs (as it should be, of course). I created datasets at three different scales: 1 TB, 3 TB, and 10 TB.

From previous tests, I know that Ducklake works better, but I used Delta so it is readable by other Fabric Engines ( as of this writing , Ducklake does not supporting exporting Iceberg metadata, which is unfortunate)

You can grab the notebook if you want to play with it yourself .


What I Actually Wanted to Know

The goal wasn’t really about performance . I wanted to see if it would work at all. DuckDB has a reputation for being great with smallish data, but wanted to see when the data is substantially bigger than the available Memory.

And yes, it turns out DuckDB can handle quite a bit more than most people assume.


The Big Lesson: Local Disk Matters

Here’s where things got interesting.

If you ever try this yourself, don’t use a Lakehouse folder for data spilling. It’s painfully slow(as the data is first written to disk then uploaded to remote storage)

Instead, point DuckDB to the local disk that Fabric uses for AzureFuse caching. That disk is about 2 TB. or any writable folder

You can tell DuckDB to use it like this:

SET temp_directory = '/mnt/notebookfusetmp';

Once I did that, I could actually see the data spilling happening in real time which felt oddly satisfying, it works but slow , it is better to just have more RAM 🙂


Python notebook is fundamentally just a Linux VM, and you can see the storage layout using this command 

!df -hT

Here is the layout for 2 cores

Which is different when running it for 64 cores ( container vs VM, something like that), I notice the local disk increased with more cores, which make sense

The Results

Most queries went through without too much trouble. except Query 17 at 10 TB scale? That one It ran for more than an hour before my authentication token expired. So technically, it didn’t fail 🙂

DuckDB does not have a way to refresh Azure token mid query. as far as I know

Edit : according to Claude, I need at least 1-2 TB of RAM (10-20% of database size) to avoid disk thrashing


Observations: DuckDB’s Buffer Pool

Something I hadn’t noticed before is how the buffer pool behaves when you work with data way bigger than your RAM. It tends to evict data that was just read from remote storage — which feels wasteful. I can’t help but think it might be better to spill that to disk instead.

I’m now testing an alternative cache manager called duck-read-cache-fs to see if it handles that better. We’ll see, i still think it is too low level to be handled by an extension, I know MotherDuck rewrote their own buffer manager, but not sure if it is for the same reason.


Why not test other Engines

I did, actually , and the best result I got was with Lakesail at around 100 GB. Beyond that, no single-node open-source engine can really handle this scale. Polars, for instance, doesn’t support spilling to disk at all and implements fewer than 10 of the 22 standard SQL queries.

Wrapping Up

So, what did I learn? DuckDB is tougher than it looks. With proper disk spilling and some patience, it can handle multi-terabyte datasets just fine, and sometimes the right solution is just to add more RAM

personally , I never had a need for TB of data ( my sweet spot is 100 GB) and distributed system (Like Fabric DWH, Spark etc) will handle this use case way better, after all they were designed for this scale.

But it’s amazing to see how far an in-process database has come 🙂 just a couple of years ago, I was thrilled when DcukDB could handle 10 GB!

First Look at Incremental Framing in Power BI

TL;DR: Incremental framing is like CDC to RAM 🙂 It significantly improves cold-run performance of Direct Lake mode in some scenarios, there is an excellent documentation that explain everything in details

What Is Incremental Framing?

One of the most important improvements to Direct Lake mode in Power BI is incremental framing.

Power BI’s OLAP engine, VertiPaq (probably the most widely deployed OLAP engine, though many outside the Power BI world may not know it) relies heavily on dictionaries. This works well because it is a read-only database. another core trick is its ability to do calculation directly on encoded data. This makes it extremely efficient and embarrassingly fast  ( I just like this expression for some reason ).


Direct Lake Breakthrough

Direct Lake’s breakthrough is that dictionary building is fast enough to be done at runtime.

Typical workflow:

  1. A user opens a report.
  2. The report generates DAX queries.
  3. These queries trigger scans against the Delta table.
  4. VertiPaq scans only the required columns.
  5. It builds a global dictionary per column, loads the data from Parquet into memory, and executes queries.

The encoding step happens once at the start, and since BI data doesn’t usually change more that much, this model works well.


The Problem with Continuous Appends

In scenarios where data is appended frequently (e.g., every few minutes), the initial approach does not works very well. Each update requires rebuilding dictionaries and reloading all the data into RAM, effectively paying the cost of a cold run every time ( reading from remote storage will be always slower).


How Incremental Framing Fixes This

Incremental framing solves the problem by:

  • Incrementally loading new data into RAM.
  • Encoding only what’s necessary.
  • Removing obsolete Parquet data when not needed.

This substantially improves cold-run performance. Hot-run performance remains largely unchanged.


Benchmark: Australian Electricity Market

To test this feature, I used my go-to workload: the Australian electricity market, where data is appended every 5 minutes—an ideal test case.

  • Incremental framing is on by default, I turn it off using this bog
  • For benchmarking, I adapted an existing tool , Direct Lake load testing( I just changed writing the results to Delta instead of CSV), I used 8 concurrent users, the main fact Table is around 120 M records, the queries reflect a typical user session , this is a real life use case, not some theoretical benchmark.

Results

P99

P99 (the 99th percentile latency, often used to show worst-case performance):

  • Improvement of 9x–10x, again, your results may varied depending on workload, Parquet layout, and data distribution.

P90

P90 (90th percentile latency):

  • Less dramatic but still strong.
  • Improved from 500 ms → 200 ms.
  • Faster queries also reduce capacity unit usage.

Geomean

just for fun and to show how fast Vertipaq is, let’s see the geomean, alright went from 11 ms to 8 ms, general purpose OLAP engines are cool, but specialized Engines are just at another level !!!

This does not solve Bad Table layout problem

This feature improves support for Delta tables with frequent appends and deletes. However, performance still degrades if you have too many small Parquet row groups.

VertiPaq does not rewrite data layouts—it reads data as-is. To maintain good performance:

  • Compact your tables regularly.
  • In my case, I backfill data nightly. The small Parquets added during the day don’t cause major issues, but I still compact every 100 files as a precaution.

If your data is produced inside Fabric, VOrder helps manage this. For external engines (Snowflake, Databricks, Delta Lake with Python), you’ll need to actively manage table layout yourself.

Vibe Coding a Simple SQL orchestrator in a Fabric Python Notebook

Note: The blog and especially the code were written with the assistance of an LLM.

TL;DR

I built a simple Fabric Python notebook to orchestrate sequential SQL transformation tasks in OneLake using DuckDB and delta-rs. It handles task order, stops on failure, fetches SQL from external sources (like GitHub or a Onelake folder), manages Delta Lake writes, and uses Arrow recordbacth for efficient data transfer, even for large datasets. This approach helps separate SQL logic from Python code and simulates external table behavior in DuckDB. Check out the code on GitHub: https://github.com/djouallah/duckrun

pip install duckrun

Introduction

Inspired by tools like dbt and sqlmesh, I started thinking about building a simple SQL orchestrator directly within a Python notebook. I was showing a colleague a Fabric notebook doing a non-trivial transformation, and although it worked perfectly, I noticed that the SQL logic and Python code were mixed together – clear to me, but spaghetti code to anyone else. With Fabric’s release of the user data function, I saw the perfect opportunity to restructure my workflow:

  • Data ingestion using a User-Defined Function (UDF), which runs in a separate workspace.
  • Data transformation in another workspace, reading data from the ingestion workspace as read-only.
  • All transformations are done in pure SQL, there 8 tables, every table has a sql file, I used DuckDB, but feel free to use anything else that understands SQL and output arrow (datafusion, chdb, etc).
  • Built Python code to orchestrate the transformation steps.
  • PowerBI reports are in another workspace

I think this is much easier to present 🙂

I did try yato, which is a very interesting orchestrator, but it does not support parquet materialization

How It Works

The logic is pretty simple, inspired by the need for reliable steps:

  1. Your Task List: You provide the function with a list (tasks_list). Each item has table_name (same SQL filename, table_name.sql) and how to materilize the data in OneLake (‘append’ , ‘overwrite’,ignore and None)
  2. Going Down the List: The function loops through your tasks_list, taking one task at a time.
  3. Checking Progress: It keeps track of whether the last task worked out using a flag (like previous_task_successful). This flag starts optimistically as True.
  4. Do or Don’t: Before tackling the current task, it checks that flag.
  • If the flag is True, it retrieves the table_name and mode from the current task entry and passes them to another function, likely called run_sql. This function performs the actual work of running your transformation SQL and writing to OneLake.
  • If the flag is False, it knows something went wrong earlier, prints a quick “skipping” message, and importantly, uses a break statement to exit the loop immediately. No more tasks are run after a failure.
  1. Updating the Status: After run_sql finishes, run_sql_sequence checks if run_sql returned 1 (our signal for success). If it returns 1, the previous_task_successful flag stays True. If not, the flag flips to False.
  2. Wrap Up: When the loop is done (either having completed all tasks or broken early), it prints a final message letting you know if everything went smoothly or if there was a hiccup.

The run_sql function is the workhorse called by run_sql_sequence. It’s responsible for fetching your actual transformation SQL (that SELECT … FROM raw_table). A neat part here is that your SQL files don’t have to live right next to your notebook; they can be stored anywhere accessible, like a GitHub repository, and the run_sql function can fetch them. It then sends the SQL to your DuckDB connection and handles the writing part to your target OneLake table using write_deltalake for those specific modes. It also includes basic error checks built in for file reading, network stuff, and database errors, returning 1 if it succeeds and something else if it doesn’t.

You’ll notice the line con.sql(f””” CREATE or replace SECRET onelake … “””) inside run_sql; this is intentionally placed there to ensure a fresh access token for OneLake is obtained with every call, as these tokens typically have a limited validity period (around 1 hour), keeping your connection authorized throughout the sequence.

When using the overwrite mode, you might notice a line that drops DuckDB view (con.sql(f’drop VIEW if exists {table_name}’)). This is done because while DuckDB can query the latest state of the Delta Lake files, the view definition in the current session needs to be refreshed after the underlying data is completely replaced by write_deltalake in overwrite mode. Dropping and recreating the view ensures that subsequent queries against this view name correctly point to the newly overwritten data.

The reason we do this kind of hacks is, duckdb does not support external table yet, so we are just simulating the same behavior by combining duckdb and delta rs, spark obviousely has native support

Handling Materialization in Python

One design choice here is handling the materialization strategy (whether to overwrite or append data) within the Python code (run_sql function) rather than embedding that logic directly into the SQL scripts.

Why do it this way?

Consider a table like summary. You might have a nightly job that completely recalculates and overwrites the summary table, but an intraday job that just appends the latest data. If the overwrite or append command was inside the SQL script itself, you’d need two separate SQL files for the exact same transformation logic – one with CREATE OR REPLACE TABLE … AS SELECT … and another with INSERT INTO … SELECT ….

By keeping the materialization mode in the Python run_sql function and passing it to write_deltalake, you can use the same core SQL transformation script for the summary table in both your nightly and intraday pipelines. The Python code dictates how the results of that SQL query are written to the Delta Lake table in OneLake. This keeps your SQL scripts cleaner, more focused on the transformation logic itself, and allows for greater flexibility in how you materialize the results depending on the context of your pipeline run.

Efficient Data Transfer with Arrow Record batch

A key efficiency point is how data moves from DuckDB to Delta Lake. When DuckDB executes the transformation SQL, it returns the results as an Apache Arrow RecordBatch. Arrow’s columnar format is highly efficient for analytical processing. Since both DuckDB and the deltalake library understand Arrow, data transfers with minimal overhead. This “zero-copy” capability is especially powerful for handling datasets larger than your notebook’s available RAM, allowing write_deltalake to process and write data efficiently without loading everything into memory at once.

Example:

you pass Onelake location, schema and the number of files before doing any compaction

first it will load all the existing Delta table

Here’s an example showing how you might define and run different task lists for different scenarios:

sql_tasks_to_run_nightly = [
    ['price', 'append'],
    ['scada', 'append'],
    ['duid', 'ignore'],
    ['summary', 'overwrite'], # Overwrite summary nightly
    ['calendar', 'ignore'],
    ['mstdatetime', 'ignore'],
]

sql_tasks_to_intraday = [
    ['price_today', 'append'],
    ['scada_today', 'append'],
    ['duid', 'ignore'],
    ['summary', 'append'] # Append to summary intraday using the *same* SQL script
]



You can then use Python logic to decide which pipeline to run based on conditions, like the time of day:

start = time(4, 0)
end = time(5, 30)

if start <= now_brisbane <= end:
    run_sql_sequence(sql_tasks_to_run_nightly)

Here’s an example of an error I encountered during a run, it will automatically stop the remaining tasks:

Attempting to run SQL for table: price_today with mode: append
Running in mode: append for table: price_today
Error writing to delta table price_today in mode append: Parser Error: read_csv cannot take NULL list as parameter
Error updating data or creating view in append mode for price_today: Parser Error: read_csv cannot take NULL list as parameter
Failed to run SQL for table: price_today. Stopping sequence.
One or more SQL tasks failed.

here is some screenshots from actual runs

as it is a delta table, I can use SQL endpoints to get some stats

For example the table scada has nearly 300 Million rows, the raw data is around 1 billion of gz.csv

It took nearly 50 minutes to process using 2 cpu and 16 GB of RAM, notice although arrow is supposed to be zero copy, writing parquet directly from Duckdb is substantially faster !!! but anyway, the fact it works at all is a miracle 🙂

in the summary table we remove empty rows and other business logic, which reduce the total size to 119 Million rows.

here is an example report using PowerBI direct lake mode, basically reading delta directly from storage

In this run, it did detect that the the night batch table has changed

Conclusion

To be clear, I am not suggesting that I did anything novel, it is a very naive orchestrator, but the point is I could not have done it before, somehow the combination of open table table format, robust query engines and an easy to use platform to run it make it possible and for that’s progress !!!

I am very bad at remembering python libraries syntax but with those coding assistants, I can just focus on the business logic and let the machine do the coding. I think that’s good news for business users.