How far Python alone can take you on Delta

1. delta-rs is an ACID Delta writer

delta-rs implements the Delta Lake protocol natively. Merge, update, and delete go through optimistic concurrency control on every commit. No external coordinator, no catalog service. Two writers race for the same version of the log, one wins, the other retries.
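To make the "one wins, the other retries" part concrete, here is a toy model of the commit race using only the standard library. It is not how delta-rs is implemented internally; it only mimics the idea that a commit is a new zero-padded JSON file in the log folder and that creating that file must be exclusive. The function names (try_commit, latest_version, commit_with_retry) are made up for this sketch, and a real writer also checks whether the winning commit conflicts with what the loser read before retrying.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to claim `version` by creating its log file exclusively."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so only one writer can own a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def latest_version(log_dir: str) -> int:
    """Highest committed version, or -1 for an empty log."""
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Commit at HEAD + 1, re-reading HEAD whenever another writer got there first."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("gave up after repeated commit conflicts")


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": "part-a.parquet"}])   # writer A claims version 0
second = try_commit(log_dir, 0, [{"add": "part-b.parquet"}])  # writer B loses the race
retried = commit_with_retry(log_dir, [{"add": "part-b.parquet"}])  # B lands on the next version
```

The only shared state is the folder itself, which is why a path is enough.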

All you need is a path. No metastore to provision, no catalog endpoint, no JDBC connection, no warehouse to wake up. A folder on disk (or on ADLS / S3 / GCS) is the whole interface.

Setup: B is a Delta table being fed a series of CSV batches (batch_001.csv, batch_002.csv, …). Each merge should ingest only files B hasn’t seen yet.

A naming note: the project is delta-rs but the Python package is deltalake (pip install deltalake). On Fabric, stick with what’s preinstalled — Python notebooks already ship with deltalake and OneLake access configured.

From the notebook:

import duckdb
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

con = duckdb.connect()

# Bootstrap target B with batch_001 already ingested
write_deltalake(Target_PATH, pa.table({...}), mode="overwrite")
vB = DeltaTable(Target_PATH).version()  # v0

# Compute the rows to ingest from the target's current state
con.sql(f"ATTACH '{Target_PATH}' AS tgt (TYPE delta, VERSION {vB});")
our_rows = con.sql("""
    SELECT s.id, s.value, parse_filename(s.filename) AS filename
    FROM read_csv_auto('source_csv/*.csv', filename=true) s
    WHERE parse_filename(s.filename) NOT IN (SELECT DISTINCT filename FROM tgt)
""").arrow()
# → 80 new rows from batch_002..005

# First merge: 80 inserts, commits cleanly
DeltaTable(Target_PATH).merge(
    source=our_rows,
    predicate="t.filename = s.filename",
    source_alias="s", target_alias="t",
).when_not_matched_insert_all().execute()

# Same merge re-run: 0 inserts. The predicate is idempotent.
DeltaTable(Target_PATH).merge(...).when_not_matched_insert_all().execute()

Two commits, both correct. The second run does nothing because the predicate already sees the rows. The transaction model travels with the table itself: move the folder, open it from another machine, and the next writer continues from the last commit.

write_deltalake(mode="append") and write_deltalake(mode="overwrite") are blind on purpose. Blind append means N concurrent appenders all succeed and the result is the union of their rows, which is exactly what you want for event streams or log ingestion. Blind overwrite means the new data wins and whatever was there is gone, which is what you want when the writer is the authoritative source for the table. OCC only kicks in for operations that actually read the target (merge, update, delete), since those are the only ones where a concurrent change can invalidate what you just computed.

2. I want the full read-to-write transaction, Python API is fine

A common pattern: DuckDB or Polars reads, transforms, and hands an Arrow table to delta-rs to commit. The notebook above is exactly that shape — DuckDB computes “filenames not yet in B” and delta-rs merges the result.

Inside delta-rs, OCC still works. What it cannot see is the read on the other side of the engine boundary. delta-rs knows about the merge it is about to commit; it does not know that DuckDB read B at version vB thirty seconds ago.

Carry the snapshot across the boundary by pinning both sides to the same version:

import duckdb
from deltalake import DeltaTable

vB = DeltaTable(Target_PATH).version()

con = duckdb.connect()
con.sql(f"ATTACH '{Target_PATH}' AS tgt (TYPE delta, VERSION {vB});")
our_rows = con.sql("SELECT ...").arrow()

DeltaTable(Target_PATH, version=vB).merge(  # ← pinned
    source=our_rows,
    predicate="t.filename = s.filename",
    source_alias="s", target_alias="t",
).when_not_matched_insert_all().execute()

The OCC check now compares against vB instead of HEAD. If another process touched B in the meantime — say a parallel job deleted batch_001.csv — the pinned merge raises:

Failed to commit transaction: Commit failed: a concurrent transaction deleted data this operation read.

Catch it, recompute the diff against fresh state, retry. On the Polars side, pl.read_delta(path, version=vB) accepts the same pin, so the pattern works for any reader that exposes versioned reads.
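The "catch it, recompute, retry" loop can be sketched as a small helper. This is a rough shape rather than the notebook's code: merge_with_retry and its parameters are names invented here, and the conflict exception is passed in as an argument. With delta-rs I would expect that to be the commit failure exception from deltalake.exceptions, but check the class name against your installed version. A fake conflict stands in for the real thing so the sketch runs on its own.

```python
import time


def merge_with_retry(compute_rows, commit, conflict_error, max_attempts=3, backoff_s=0.0):
    """Recompute the diff and retry the commit whenever a conflict is raised.

    compute_rows()        -> (version, rows): reads the target, returns the pin and the diff
    commit(version, rows) -> performs the merge pinned to `version`
    conflict_error        -> exception type that signals a lost race
    """
    for attempt in range(1, max_attempts + 1):
        version, rows = compute_rows()  # always read fresh state before committing
        try:
            commit(version, rows)
            return attempt
        except conflict_error:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s)


# --- stand-ins used only to exercise the helper ---
class FakeConflict(Exception):
    pass


state = {"head": 3, "commits": 0}
seen_versions = []


def compute_rows():
    seen_versions.append(state["head"])
    return state["head"], ["batch_002.csv"]


def commit(version, rows):
    state["commits"] += 1
    if state["commits"] == 1:
        state["head"] += 1  # another writer advanced the table first
        raise FakeConflict("concurrent transaction")


attempts = merge_with_retry(compute_rows, commit, FakeConflict)
```

The important detail is that the read and the pin are recomputed inside the loop; retrying the commit with stale rows would defeat the purpose.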

The pin is just a number. No new infrastructure, no shared coordinator, still path-based.

3. I don’t want the Python API, I want SQL only

If you would rather write SQL — say, drive the pipeline from dbt — your options on Delta today are Spark and Fabric Data Warehouse. Both have supported dbt adapters and work great in production. I have to admit, I was hoping DuckDB would fill that gap, since it is a database and SQL-level transactions are what you expect from a database. The market went the other way: investment is going into catalog-based lakehouse formats (DuckLake, Iceberg), and the DuckDB Delta writer that does exist is tied to Unity Catalog and limited to blind appends. I don’t see them investing in a file-based conflict resolver any time soon 🙂 Lakesail seems interested in this use case, but it is still too early to call.

Takeaway

I personally use delta-rs for CSV ingestion, appends, and recording results from high-concurrency performance tests — it is fast, cheap, and bullet-proof in those scenarios. The open source maintainers are very helpful and care deeply about the product, as they use it themselves in production. But it is not the right tool for every case; Data Warehouse and Spark are more appropriate for complex workloads. With time you intuitively pick the tool that makes sense for a particular job and how much compute you can spend. None of that has to be an either/or: at the end of the day it is a lakehouse, and the whole concept of a lakehouse is having the option to choose the engine. That option matters — if we say only one engine (open source or not) is blessed for writes, then there is no point in the concept of a lakehouse.


Notebook: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/TableFormat/delta/occ.ipynb

Thanks Raki for keeping me honest:)

Thanks to Ion for explaining how version worked when doing merge: https://www.linkedin.com/in/ionkoutsouris/

Edit: how about Spark?

Thanks to Frithjof for explaining Spark behaviour: the merge fixes one snapshot at transaction start (current HEAD, i.e. post-delete) and uses it for both its scan and its conflict check. That is internally consistent, but it is bound to HEAD-at-merge-start, which Spark chose, not to the state our read saw. delta-rs behaves the same way when it is given a lazy dataframe: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/TableFormat/delta/occ_spark.ipynb

Ensuring safe single-writer for DuckLake on OneLake using file lease

DuckLake supports multi-writer just fine — but only if your catalog is a real database, like Postgres (there’s some interest in SQL Server support too). But if all you have is object storage and a SQLite or DuckDB file as the catalog, you’re stuck with single-writer: object stores aren’t real filesystems, so the DB file can’t be locked. Nothing stops two processes from writing to it at the same time and corrupting it.

If single-writer is enough for you (one notebook, one pipeline, one user), you don’t need to stand up a database server. You just need accidental concurrent runs to fail fast.

The trick: take a blob lease

OneLake speaks the ADLS API, so you can take a lease on a blob — a mutex for free (it seems S3 needs DynamoDB and GCS needs a homemade lock object). Each run does:

  1. Acquire a lease on metadata.db in abfss://.
  2. Download it to local disk of the notebook.
  3. Point DuckLake at the local copy and do the work.
  4. Upload the modified file under the lease.
  5. Release the lease.
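The five steps above can be sketched as a context manager. This is not the code from the repo: leased_catalog is a name made up here, and the blob argument is assumed to behave like azure.storage.blob.BlobClient (acquire_lease, download_blob, upload_blob). An in-memory FakeBlob stands in for OneLake so the sketch runs without any Azure dependency.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def leased_catalog(blob, local_path):
    """Hold an exclusive lease on `blob` while the caller works on a local copy."""
    lease = blob.acquire_lease(lease_duration=-1)       # step 1 (-1 = infinite lease)
    try:
        with open(local_path, "wb") as f:               # step 2: download
            f.write(blob.download_blob(lease=lease).readall())
        yield local_path                                # step 3: caller does the work
        with open(local_path, "rb") as f:               # step 4: upload under the lease
            blob.upload_blob(f.read(), overwrite=True, lease=lease)
    finally:
        lease.release()                                 # step 5: always release


# --- in-memory stand-ins, only to exercise the sketch ---
class _FakeDownload:
    def __init__(self, data):
        self._data = data

    def readall(self):
        return self._data


class _FakeLease:
    def __init__(self, blob):
        self._blob = blob

    def release(self):
        self._blob.leased = False


class FakeBlob:
    def __init__(self, data):
        self.data = data
        self.leased = False

    def acquire_lease(self, lease_duration=-1):
        if self.leased:
            raise RuntimeError("there is already a lease present")
        self.leased = True
        return _FakeLease(self)

    def download_blob(self, lease=None):
        return _FakeDownload(self.data)

    def upload_blob(self, data, overwrite=False, lease=None):
        self.data = data


blob = FakeBlob(b"v1")
path = os.path.join(tempfile.mkdtemp(), "metadata.db")
with leased_catalog(blob, path) as local:
    try:
        blob.acquire_lease()            # a second run starting now
        second_run_blocked = False
    except RuntimeError:
        second_run_blocked = True
    with open(local, "ab") as f:        # the "work": modify the local copy
        f.write(b"+v2")
```

If the work raises, the upload is skipped and the lease is still released, so a failed run never publishes a half-written catalog.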

A second notebook that starts while the lease is held fails immediately on acquire_lease. It can’t even read a stale copy. You also can’t delete the file from the UI while the lease is held; I can already see some use cases here :)

What about crashed runs?

ADLS leases are either 15–60 seconds fixed, or infinite. Fixed leases need a heartbeat — annoying inside a notebook. Infinite leases work until something crashes — then the file is stuck.

The fix: take an infinite lease, but stamp acquired_at = <utc iso> into the blob’s own metadata when you acquire. When the next run hits a lease conflict, read that timestamp. Older than 12 hours? Call break_lease and re-acquire. A crashed run self-heals within 12 hours. You can shorten that window, or break the lease manually with a one-line script if you can’t wait — there’s a snippet in the README.
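The staleness check itself is a few lines of date arithmetic. A minimal sketch, assuming the timestamp was written with datetime.now(timezone.utc).isoformat(); the helper name lease_is_stale is mine, and treating a missing stamp as "not stale" is a conservative choice I made for the example, not something the README prescribes.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=12)


def lease_is_stale(metadata: dict, now: datetime, stale_after: timedelta = STALE_AFTER) -> bool:
    """True when the blob's acquired_at stamp is older than the allowed window."""
    stamp = metadata.get("acquired_at")
    if stamp is None:
        return False  # no stamp: leave the lease alone rather than guess
    acquired = datetime.fromisoformat(stamp)
    if acquired.tzinfo is None:
        acquired = acquired.replace(tzinfo=timezone.utc)  # treat naive stamps as UTC
    return now - acquired > stale_after


now = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
running = {"acquired_at": "2026-05-02T11:00:00+00:00"}   # 1 hour old
crashed = {"acquired_at": "2026-05-01T20:00:00+00:00"}   # 16 hours old

running_is_stale = lease_is_stale(running, now)
crashed_is_stale = lease_is_stale(crashed, now)
```

Only when this returns True would the next run call break_lease and re-acquire.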

Code is here.

The Boring Reason Iceberg Matters

TL;DR: Iceberg’s value is sociological, not technical. And if you care about lightweight, single-process engines like DataFusion and DuckDB, it’s probably your best shot at first-class lakehouse support with wide interoperability.

The first real data engineering work I did was an ingestion pipeline built on pandas and Parquet with Hive-style partitioning — an environment where 512 MB of memory was a genuine architectural constraint, not a rounding error. That experience shaped how I think about data tooling: the engine matters, but so does the ability to swap it out. Engine independence is something I care about more than most people I know, which is probably why I find myself paying close attention to Iceberg. Not for the reasons most people cite, though. It’s not the spec. It’s where the engineering hours are landing.

Getting query engines and catalogs to talk to each other is genuinely hard work. Most of it is unglamorous: error envelope parsing, metadata round-tripping, commit response shapes, partition spec edge cases, auth token quirks between vendors. None of it ships a feature anyone demos. None of it makes a good blog post. It’s the maintenance work that quietly determines whether your stack actually functions.

This is the part that’s easy to miss. Standards don’t converge because the spec is good. They converge because enough people, at enough companies, decide to put sustained hours into the interop bugs — year after year, across release cycles, through personnel changes and shifting priorities.

Look at the Iceberg committer list: Netflix, Apple, Databricks, Snowflake, AWS, Dremio, Microsoft. No single employer controls what gets merged. The incentive to fix cross-vendor interoperability bugs is distributed across the committer base itself. The governance isn’t just a formality — it’s what makes it possible for engineers from genuinely different setups to find, reproduce, and fix the same bug together.

There is one specific layer worth watching: the Iceberg REST catalog specification. It has become the canonical standard for how engines and catalogs communicate. Adoption is real: Polaris, Lakekeeper, Gravitino, and a growing list of vendor-managed catalogs implement it.

But adoption and interoperability are not the same thing.

In practice, vendors still interpret parts of the specification differently. Engines end up handling quirks like slightly different response shapes, undocumented authentication flows, or inconsistent error handling. The nearest analogy is ODBC — a real standard, widely implemented, and still years of painful work before the “connect to anything” promise actually held up in practice.

The Iceberg REST catalog ecosystem feels earlier in that curve. The gap between specification and implementation is exactly where a lot of the maintenance work is happening right now. And closing that gap is precisely the kind of work Iceberg’s governance model is designed to support, because the people hitting the bugs are often the same people with commit access to fix them.

This is where the stakes become concrete, especially for lightweight engines.

For cloud warehouses and large JVM-based systems, the maintenance burden is manageable. There are full-time teams paid to absorb it. For the newer generation of small, single-process engines, the situation is very different. These are compact teams building engines with a specific focus: query latency, memory efficiency, embedded analytics, local execution.

Every hour spent chasing interoperability edge cases is an hour not spent improving the engine itself.

Several of these engines already support Iceberg in some form. But broad, reliable lakehouse support depends on the ecosystem doing its part: stable specifications, faithful implementations, bugs surfaced and fixed upstream.

A well-maintained standard is not just a convenience for these projects. It’s what makes serious lakehouse support achievable without hollowing out the team building the engine.

There is also a broader cost to fragmentation that rarely gets discussed directly. Every hour the ecosystem spends maintaining incompatible metadata layers is an hour not spent making lakehouse systems actually better. That cost doesn’t show up clearly in any individual issue tracker, but it accumulates across the entire ecosystem.

That’s the real argument for Iceberg.

Not that it’s a particularly clever format. Formats are mostly boring by design.

The real advantage is that Iceberg has assembled the right kind of maintenance coalition: enough companies with genuinely different incentives, governance that distributes merge authority, and enough independent implementations that bugs surface from the edges instead of only the center.

Whether that coalition survives long term as the market consolidates is still an open question. But right now, Iceberg is the ecosystem where the boring interoperability work is most likely to get done by someone other than you.

And in infrastructure, that’s close to everything.

That’s also why this feels personal to me.

The 512 MB pipeline I started with wrote Parquet files and hoped for the best — no transactions, no snapshot isolation, just partitions and careful scheduling to avoid stepping on yourself.

What I actually wanted, and couldn’t realistically have at the time, was proper ACID semantics with snapshot isolation end to end from something small and cheap. A cloud function. A tiny process with almost no memory to spare.

Iceberg is the closest thing to a realistic path toward that today. Not because the specification is especially elegant, but because it’s where the maintenance work is happening.

And eventually, ecosystems catch up to where the maintenance happens.


Special thanks to Raki Rahman for a few conversations that genuinely reshaped how I think about this space.

Ideas are mine; writing assisted by AI.

Building a Data Pipeline Using VSCode and Claude Out of Thin Air

A complete data pipeline running on Microsoft Fabric that downloads public data, transforms it into a star schema, exports it as Delta Lake tables, and serves it through a Power BI semantic model with Direct Lake, all from a single Python notebook and using pure SQL.

All the code is available on GitHub, along with an interactive DAG.

The entire stack:

  • One Fabric notebook (2 cells)
  • DuckDB as the compute engine — could have been Polars or Lakesail, just a personal preference to be honest
  • dbt as the transformation framework
  • A Python script to deploy everything via Fabric REST API
  • GitHub for source control, documentation, and testing

Note: DuckDB is not officially supported by Microsoft Fabric. Every effort is made to ensure compatibility with OneLake.

Overall Architecture

Why DuckDB + Delta Export

Microsoft Fabric’s lakehouse uses Delta Lake or Apache Iceberg as its table format. Power BI’s Direct Lake mode reads the data directly from OneLake. So whatever engine you use, you need to produce Delta Lake files on OneLake.

DuckDB’s native Delta Lake write support is still experimental, so it is not an option here. DuckDB has its own table format via the DuckLake extension, but DuckLake writes Parquet files with a DuckDB/SQLite/PostgreSQL metadata catalog.

OneLake catalog has only Iceberg read support, so that’s not an option for now.

The solution: delta_export, a community DuckDB extension that exports DuckLake tables as Delta Lake. The pipeline works like this:

  1. dbt transforms data into DuckLake tables (Parquet + metadata)
  2. ducklake_rewrite_data_files and ducklake_merge_adjacent_files compact the Parquet files
  3. CALL delta_export() converts every DuckLake table into a proper Delta Lake table on OneLake

Without delta_export, DuckLake is not useful in this context. DuckLake manages tables internally, but Fabric has no idea what a SQLite metadata catalog is. It needs Delta transaction logs.

From dbt_project.yml:

on-run-end:
  - "CALL ducklake_rewrite_data_files('ducklake')"
  - "CALL ducklake_merge_adjacent_files('ducklake')"
  - "CALL delta_export()"

DuckLake: How It Works and Its Limitations

DuckLake stores table metadata in a database and writes data as Parquet files to any storage backend (local, S3, Azure). The DuckDB connection looks like this:

# profiles.yml (prod target)
attach:
  - path: "ducklake:sqlite:{{ env_var('METADATA_LOCAL_PATH') }}"
    alias: ducklake
    options:
      data_path: "{{ env_var('ROOT_PATH') }}/Tables"
      data_inlining_row_limit: 0

METADATA_LOCAL_PATH points to /lakehouse/default/Files/metadata.db — the Files section of the OneLake lakehouse. In a Fabric notebook, /lakehouse/default/ is a local mount of the lakehouse storage. The SQLite file lives right there on OneLake, persisting across notebook runs without any special sync logic. data_path points to the Tables section on OneLake (abfss://...). DuckDB computes in memory, DuckLake tracks what’s in each table via SQLite, and Parquet files land on OneLake.

The single-writer limitation. DuckLake when used with a file-based DB is basically a single-writer architecture. Only one process can write to a DuckLake database at a time. This means:

  • No parallel pipeline runs
  • No concurrent notebooks writing to the same tables
  • The Fabric pipeline is set to concurrency: 1 specifically because of this

For this use case, it’s fine — one notebook runs every hour, processes new files, and exits. But if you need concurrent writers, DuckLake is not the right choice.

Edit – May 2026: Updated the pipeline to enforce a single writer at any time — accidental duplicate runs exit cleanly instead of racing. Thanks to OneLake exposing primitives like leases and mutexes, DuckLake’s single-writer model becomes a much more viable option in practice.

Obviously you can use PostgreSQL as a catalog, but that makes the architecture more complex.

dbt as the Orchestrator

dbt does everything here — not just transformations. The on-run-start hook downloads data from the web, archives it to OneLake, and tracks state in a parquet log. The on-run-end hook compacts files and exports Delta.

on-run-start:
  - "CALL ducklake.set_option('rewrite_delete_threshold', 0)"
  - "CALL ducklake.set_option('target_file_size', '128MB')"
  - "{{ download() }}"
on-run-end:
  - "CALL ducklake_rewrite_data_files('ducklake')"
  - "CALL ducklake_merge_adjacent_files('ducklake')"
  - "CALL delta_export()"

The download() macro (371 lines) handles:

  • Fetching daily SCADA and price reports from AEMO’s website
  • Fetching intraday 5-minute dispatch data
  • Downloading generator reference data
  • Archiving everything as partitioned ZIPs on OneLake
  • Maintaining a csv_archive_log.parquet file for deduplication

The 8 dbt models then process this data:

  • stg_csv_archive_log — view over the archive log
  • dim_calendar — date dimension (one-time load)
  • dim_duid — generator unit reference (smart refresh: only rebuilds when new generators appear)
  • fct_scada, fct_price — daily historical data, incremental by file
  • fct_scada_today, fct_price_today — intraday data, incremental by file
  • fct_summary — combined fact table exposed to Power BI

Every fact model uses file-based incremental processing. Pre-hooks query the archive log, filter out already-processed files, and set DuckDB VARIABLEs with the remaining ZIP paths. The model’s SQL reads from those paths. Next run, those files are skipped.
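Stripped of dbt and SQL, the incremental logic is a set difference between the archive log and what has already been loaded. The sketch below is only an illustration in Python: the real pipeline does this in pre-hooks with DuckDB variables, and the field names source and zip_path are hypothetical, not the actual columns of csv_archive_log.parquet.

```python
def pending_files(archive_log, loaded_paths, source):
    """Archived files of one source type that no previous run has loaded yet."""
    loaded = set(loaded_paths)
    return sorted(
        row["zip_path"]
        for row in archive_log
        if row["source"] == source and row["zip_path"] not in loaded
    )


# Hypothetical archive log entries
archive_log = [
    {"source": "scada", "zip_path": "scada/2026-01-01.zip"},
    {"source": "scada", "zip_path": "scada/2026-01-02.zip"},
    {"source": "price", "zip_path": "price/2026-01-01.zip"},
]

first_run = pending_files(archive_log, [], "scada")          # nothing loaded yet
second_run = pending_files(archive_log, first_run, "scada")  # same files are now skipped
```

Because the decision is driven by file names rather than timestamps, re-running the pipeline after a failure is safe.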

The Semantic Model: AI-Generated from Thin Air

This is the part that surprises me the most. The model.bim file — the Power BI semantic model definition — was generated entirely by AI (Claude). No Power BI Desktop. No click-through wizards. No SSDT.

The model.bim is a JSON file in TMSL (Tabular Model Scripting Language) format. It defines:

  • 3 tables exposed to Power BI: dim_calendar, dim_duid, fct_summary
  • 5 hidden tables (raw layer, not needed for reporting)
  • 2 relationships (fact → dimension)
  • 5 DAX measures (Total MW, Total MWh, Avg Price, Generator Count, Latest Update)
  • Direct Lake partitions pointing to Delta tables on OneLake

Notice I am using pure Direct Lake mode that does not fall back to SQL:

{
  "name": "PBI_ProTooling",
  "value": "[\"RemoteModeling\", \"DirectLakeOnOneLakeCreatedInDesktop\"]"
}

The M expression for the data source:

let
    Source = AzureStorage.DataLake("{{ONELAKE_URL}}", [HierarchicalNavigation=true])
in
    Source

{{ONELAKE_URL}} is a placeholder. The deploy script substitutes it with the actual OneLake URL at deploy time.
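The substitution is plain string replacement on the BIM text before it is sent to the API. A minimal sketch of that step, not the deploy script itself: render_bim is a name I made up, the tiny template only mimics the shape of a model.bim expression, and the OneLake URL is an invented example value.

```python
import json

PLACEHOLDER = "{{ONELAKE_URL}}"


def render_bim(template_text: str, onelake_url: str) -> dict:
    """Fill in the OneLake URL and make sure the result is still valid JSON."""
    rendered = template_text.replace(PLACEHOLDER, onelake_url)
    return json.loads(rendered)  # raises if the substitution broke the document


# Tiny stand-in for model.bim; the real file is much larger
template_text = json.dumps({
    "expressions": [{
        "name": "DirectLake",
        "expression": 'let Source = AzureStorage.DataLake("{{ONELAKE_URL}}", '
                      "[HierarchicalNavigation=true]) in Source",
    }]
})

# Hypothetical URL, for illustration only
url = "https://onelake.dfs.fabric.microsoft.com/my_workspace/my_lakehouse.Lakehouse"
model = render_bim(template_text, url)
expression = model["expressions"][0]["expression"]
```

Keeping the placeholder in the versioned file means the same model.bim can be deployed to any workspace.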

Each table partition maps to a Delta table on OneLake:

{
  "mode": "directLake",
  "source": {
    "type": "entity",
    "entityName": "fct_summary",
    "expressionSource": "DirectLake",
    "schemaName": "aemo"
  }
}

This maps to Tables/aemo/fct_summary/ — exactly where DuckLake + delta_export writes the Delta files.

AI generated all of this by reading the dbt schema definitions (column names, types, descriptions) and understanding the Direct Lake requirements. No manual TMSL authoring. No reverse engineering from Power BI Desktop. The entire semantic model is version-controlled, diffable, and deployable via API.

Poor Man’s CI/CD, No Service Principal

deploy_to_fabric.py is a single Python script that deploys everything to Fabric using the REST API. It has 6 steps:

  1. lakehouse — Create the OneLake lakehouse (with schema support)
  2. files — Upload all dbt project files to Files/dbt/
  3. notebook — Create a 2-cell notebook (install deps + run dbt)
  4. pipeline — Create a pipeline that runs the notebook
  5. schedule — Set up hourly cron schedule
  6. semantic_model — Deploy model.bim with Direct Lake config + refresh

You can run any subset: python deploy_to_fabric.py semantic_model deploys just the BIM.

Authentication uses az login — your browser opens, you sign in, done. The script reads from the production git branch (clones it into a temp directory) so what you deploy is always what’s been merged to production.

python deploy_to_fabric.py # deploy everything
python deploy_to_fabric.py semantic_model # just the semantic model
python deploy_to_fabric.py files notebook # just files + notebook
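One way to get that "run any subset" behaviour is a small argument parser over the six step names. This is a guess at the shape, not the actual deploy_to_fabric.py: select_steps is a hypothetical helper, and running the selected steps in their canonical order regardless of how they were typed is my own assumption.

```python
import argparse

STEPS = ["lakehouse", "files", "notebook", "pipeline", "schedule", "semantic_model"]


def select_steps(argv):
    """Return the steps to run, in canonical order; no arguments means all of them."""
    parser = argparse.ArgumentParser(description="Deploy the project to Fabric")
    parser.add_argument("steps", nargs="*", help="subset of steps to run (default: all)")
    requested = parser.parse_args(argv).steps

    unknown = [s for s in requested if s not in STEPS]
    if unknown:
        parser.error(f"unknown step(s): {', '.join(unknown)}")

    if not requested:
        return list(STEPS)
    # Keep the canonical order so dependencies (lakehouse before files, etc.) hold
    return [s for s in STEPS if s in requested]


everything = select_steps([])
just_model = select_steps(["semantic_model"])
two_steps = select_steps(["notebook", "files"])
```

In a real script, each name would map to a function and the returned list would drive a simple loop.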

and here is the script in action

CI/CD

Assuming you got past the app registration in Azure, GitHub Actions handles CI on every push and pull request to production:

Q&A

Why deploy to Fabric from local instead of from GitHub Actions?

CI (testing, docs, DAG) runs in GitHub Actions — no cloud credentials needed, just Azurite. But Fabric deployment requires authenticating to the Fabric REST API, which means a service principal.

This is just my personal experience working in different companies. As a business user, there is almost zero chance IT will give permission to register an app. And even if a miracle happens, you still need to convince a Fabric admin. This is not a technical limitation, it is human behaviour.

Instead, deploy_to_fabric.py uses AzureCliCredential — you run az login, your browser opens, you sign in, done. The script picks up your existing identity. You already have the Fabric permissions. No secrets to store, no service principal to manage.

The tradeoff is that deployment requires a human at a keyboard. For a single-person or small-team project, that’s fine — you deploy when you’re ready, not on every push.

Why not just use Datawarehouse/Spark/Dataflow etc? It’s built into Fabric.

All those tools in Fabric are awesome, but it is a lakehouse and the whole point of a lakehouse is to use whatever you want as long as it produces Parquet and Delta/Iceberg metadata, ideally sorted with a decent row group size from 2M to 16M.

Why DuckLake instead of Delta or Iceberg?

  • DuckDB Delta write support is still experimental.
  • OneLake Catalog supports Iceberg read only.

If we had Iceberg write, that would be my first preference.

Why is the semantic model AI-generated?

Because it is cool 🙂 and it is unbelievable that AI managed to write it out of thin air and did cool stuff like generating descriptions so Power BI AI behaves better.

What happens if the pipeline fails mid-run?

The DuckLake metadata DB lives on OneLake (Files section). If the run fails mid-way:

  • Downloaded source files are already archived on OneLake (no re-download needed)
  • DuckLake metadata reflects whatever was committed before the failure
  • Next run picks up where it left off using the archive log

The pipeline has a 1-hour timeout. If it hangs, Fabric kills it and the next hourly run starts fresh.

Can this scale?

Python notebooks scale to half a TB of RAM. If you need more, then you are reading the wrong blog 🙂

Where is TMDL?

I could not deploy using TMDL, even after feeding AI all of Kurt Buhler’s articles 🙂 BIM seems to be better understood, at least for now.

Why use SQLite instead of DuckDB to store the metadata DB?

The Files section of OneLake is not a real POSIX filesystem. It is not like your local disk — it basically uses FUSE. All Python engines think it is a real filesystem, but I noticed SQLite works better than DuckDB for this. It flushes data more reliably.

What is a skill?

In this case, a skill is simply a way to capture what was learned during the work so the AI can reuse that knowledge later.

I wrote the skill after finishing the task, then asked the AI to summarize the key learnings and steps. The idea is that next time the AI runs a similar task, it will be better informed and produce better results.

This is not specific to Claude. The same approach works with Copilot as well. The format is different, but the idea is exactly the same: capture the knowledge once so the AI can reuse it later.

Parting Thoughts

Everything you have heard about AI is pretty much true. The only wrong part was the timing. We all knew about AI’s potential, but in my experience something changed around December 2025. Suddenly AI became genuinely useful — less hallucination, and it just works well enough. Especially when you can test the outcome. And that is the key insight: data engineering is, in a sense, just software engineering. AI writes the code, AI does everything. Your job as a user is to make sure the tests are comprehensive. Contrary to what you hear from professional software engineers, you don’t need to care about the general case. If it is solid enough and it works for your use case, that is all that matters. Nothing more.

There is another aspect worth mentioning. There is a real market for business users who are not programmers. There is enormous value in using your laptop as your main dev and test environment. You open VSCode, you talk to your favorite AI agent, you run dbt run, and you see results in seconds. That feedback loop changes everything. Data platforms like Fabric become a hosting environment with security boundaries, governance, and all that.

And if you are still reading: dbt tests are just awesome!!!