microsoft-fabric – Small Data And self service

duckrun

duckrun is a package I built using AI exclusively, to solve pain points I hit when using Fabric Python notebooks. I like DuckDB very much, but I was tired of manually discovering table names every time and writing long Python deltalake code just to write a Delta table, so I combined those two packages under one helper package to make my workflow smoother. Recently I discovered how awesome dbt is, so why not add a dbt adapter too. Then someone trolled me about the lack of snapshot isolation when doing read-modify-write on the same table. Luckily DuckDB now exposes the read version in delta_scan (before, you had to do the weird attach thing), so I have something that might actually be useful. The rest of the blog is written by AI. It wrote my code, so I don’t see the issue; it’ll even write my blog.

I also added a web page with some projects I built: dbt projects I ported, and some non-trivial SQL statements with proper snapshot isolation.

https://djouallah.github.io/duckrun

So yes, it’s a hack, until Iceberg matures or we get a better Delta write story in DuckDB.

At its core, duckrun is a four-part split: DuckDB executes the SQL, Arrow streams the result, delta-rs commits it to Delta, and dbt (optionally) orchestrates the whole DAG.

It’s storage-agnostic, running anywhere DuckDB and delta-rs can reach: local filesystem, S3, GCS, ADLS, OneLake. In practice I test it on the local filesystem and Microsoft Fabric OneLake. S3 and GCS use the same code path, but I barely touch them, so treat them as untested.

The gap it fills

DuckDB reads Delta well (delta_scan). It does not write it well: its Delta support is blind INSERT only (no UPDATE, DELETE, or MERGE), and its trajectory points at writing through Unity Catalog, which defeats the point of filesystem-native Delta. So if you want DuckDB’s engine but need upserts on Delta tables, there is no single tool that does both today.

Approach	Reads Delta	Writes Delta (merge/update/delete)	DuckDB SQL engine
DuckDB alone	✅	❌ (blind INSERT only)	✅
delta-rs alone	✅	✅	❌
duckrun	✅	✅	✅

duckrun’s answer is a split:

DuckDB runs all SQL and model logic, and reads Delta through delta_scan views.
delta-rs handles every write: overwrite, append, merge, delete, update.
Arrow bridges the two: a DuckDB relation is streamed to delta-rs over the C-stream interface.
Snapshot isolation ties it together: each read pins a Delta version, and each read-modify-write commits against the version it read, so a concurrent commit errors instead of silently clobbering.

That’s the whole architecture. The README calls it glue, and that’s accurate: each layer does only the one thing it’s set up to do.

Two costs come with this. Two engines split one RAM budget with no shared allocator, and the Arrow handoff isn’t zero-copy: DuckDB’s native vector format isn’t Arrow, so each batch is decoded and re-encoded as it streams across the Arrow C Data Interface (a batch-at-a-time ArrowArrayStream, so even a large write never fully materializes in memory). duckrun manages this with a cgroup-aware memory split, sampled per job so it doesn’t get OOM-killed on Fabric/k8s, where DuckDB otherwise sees the whole node. The fractions are of the effective limit: on a merge, 0.3 to DuckDB and 0.6 to delta-rs spill, leaving 0.1 slack; on a plain write, 0.85 to DuckDB. The split leans toward delta-rs because that’s where the memory goes: profiling a merge attributes ~99% of resident memory to delta-rs and only ~15 MB to DuckDB. Keeping every write behind delta-rs also means the bridge (and the memory juggling) can be deleted the day DuckDB ships a real Delta writer, without touching the read or state model.
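To make the fractions concrete, here is a minimal sketch of a cgroup-aware split. It is not duckrun's code: the function names, the fallback value, and the choice to read the cgroup v2 file `/sys/fs/cgroup/memory.max` are assumptions for illustration. Only the 0.3 / 0.6 / 0.85 fractions come from the text above.

```python
from pathlib import Path

# Fractions quoted in the text, as integer percentages of the effective limit.
SPLITS = {
    "merge": {"duckdb": 30, "delta_rs": 60},  # the remaining 10% is slack
    "write": {"duckdb": 85},                  # plain append / overwrite
}

def effective_limit(cgroup_file="/sys/fs/cgroup/memory.max", fallback=8 * 1024**3):
    """Return the cgroup v2 memory limit in bytes, or `fallback` (hypothetical) if none is set."""
    try:
        raw = Path(cgroup_file).read_text().strip()
    except OSError:
        return fallback
    return int(raw) if raw.isdigit() else fallback  # the file contains "max" when unlimited

def split_budget(limit_bytes, operation):
    """Split one byte budget between the engines for a single job."""
    return {engine: limit_bytes * pct // 100 for engine, pct in SPLITS[operation].items()}

budget = split_budget(10 * 1024**3, "merge")
print(budget)
```

Sampling `effective_limit()` per job rather than once at import is what lets the split follow a container whose limit differs from the node's.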

connect: read first

connect() is read-only by default, so you can point it at a lakehouse and explore with no chance of an accidental write. Tables are discovered for you, with no manual name bookkeeping:

			
import duckrun
conn = duckrun.connect("abfss://<ws>@onelake.dfs.fabric.microsoft.com/<lh>/Tables/dbo")
conn.sql("SHOW TABLES").show()
conn.sql("select status, count(*) from orders group by status").show()
df = conn.table("orders").toPandas()          # or .toArrow() for a streaming reader
# time travel
from duckrun import DeltaTable
DeltaTable.forName(conn, "orders").history()  # newest-first: version, timestamp, operation
conn.read.format("delta").option("versionAsOf", 0).load(".../Tables/dbo/orders").show()

		

Multiple catalogs: attach

Attach more lakehouses and query across them by three-part name. This is where the data warehouse case shows up: in Fabric a Warehouse is just a write-locked Lakehouse, so you attach it read_only=True next to a writable Lakehouse and join the two:

			
conn.attach("abfss://…/warehouse.Warehouse/Tables", name="warehouse", read_only=True)
conn.attach("/data/reference", name="local")
conn.sql("""
  select *
  from warehouse.mart.facts f
  join local.dbo.lookup l on l.id = f.id
""").show()

		

Same code against a local path, s3://, gs://, or az://.

Writing: DML and merge

Opt into writes with read_only=False. Then just write SQL: plain DML routes straight to delta-rs, with no Python deltalake boilerplate:

			
conn = duckrun.connect("abfss://…/Tables/dbo", read_only=False)
conn.sql("create or replace table clean_orders as select * from orders where amount > 0")
conn.sql("insert into clean_orders select * from late_orders")
conn.sql("update clean_orders set status = 'shipped' where status = 'packed'")
conn.sql("delete from clean_orders where amount = 0")

		

MERGE works the same way; reference the literal target / source aliases:

			
conn.sql("""
  merge into clean_orders as target
  using updates       as source
    on target.id = source.id
  when matched then update set *
  when not matched then insert *
""")

		

…or, if you’d rather build it, the DeltaTable API mirrors Delta’s:

			
from duckrun import DeltaTable
src = conn.sql("select * from updates")
DeltaTable.forName(conn, "clean_orders").merge(src, "target.id = source.id") \
    .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

conn.sql accepts CREATE [OR REPLACE] TABLE AS, INSERT, UPDATE, DELETE, ALTER … ADD COLUMN, MERGE, and DROP (a soft tombstone: delta-rs has no drop, so data files persist until you purge them). CREATE TEMP TABLE and CREATE VIEW stay native DuckDB. Things that can’t be honored cleanly, like multi-statement strings or UPDATE … FROM, are rejected rather than silently mishandled.

The surface is small. It mirrors the Delta/DeltaTable API so notebook code reads familiarly, but there is no fluent transform builder and no second SQL engine. Transforms are SQL, run by DuckDB.

Snapshot isolation

This started as someone trolling the project over read-modify-write on the same table, and it turned into the guarantee I rely on most. Honestly, I come from a Dataflow Gen1 background (basically a single writer), so all this concurrency stuff never made much sense to me. In a Lakehouse, where anyone can write to a table, even accidentally, it suddenly becomes a real factor. (Thanks Raki for the harsh feedback, lol.) A lakehouse has no transaction manager and no single-writer guarantee: two pipelines, a double-fired job, or a notebook racing a scheduled run can all commit to the same table. The dangerous shape is read → compute → write: if someone commits in between, a naïve write at HEAD silently overwrites them, a lost update with no error. duckrun’s job is to turn that into a CommitFailedError.

What’s fenced, and what isn’t. delete, update, and merge are read-modify-writes (they read the current rows, compute a change, then commit), so duckrun pins each to the version it read and delta-rs’s OCC validates the commit over (read, HEAD]. A conflicting concurrent commit makes them fail. Plain append and overwrite are not fenced, by design: they match Spark’s SaveMode. An append rebases onto HEAD (appends don’t conflict), and an overwrite is last-writer-wins. So the fence is automatic exactly where a lost update is possible, and absent where it isn’t.

How the OCC works. There’s no lock manager anywhere. A Delta commit is the atomic creation of the next log entry, _delta_log/…{N+1}.json, so if a racing writer already wrote it, your put-if-absent loses and delta-rs raises CommitFailedError. Optimistic, filesystem-native, no coordinator. The gap OCC alone leaves: it only checks the commit instant, not the version you read.
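The put-if-absent idea can be sketched on a local filesystem with an exclusive create. This is a toy, not delta-rs: the `commit` helper and the local `CommitFailedError` class are stand-ins defined here, and the payload is a single JSON object rather than real Delta actions. Object stores get the same effect through conditional writes.

```python
import json
import os
import tempfile

class CommitFailedError(Exception):
    """Stand-in for the delta-rs error, defined locally so the sketch is self-contained."""

def commit(log_dir, version, actions):
    """Atomically create log entry `version`; fail if a racing writer already created it."""
    path = os.path.join(log_dir, f"{version:020d}.json")  # Delta log names are zero-padded to 20 digits
    try:
        # O_EXCL makes the create fail if the file already exists: put-if-absent.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        raise CommitFailedError(f"version {version} already committed") from None
    with os.fdopen(fd, "w") as f:
        f.write(json.dumps(actions))
    return path

log_dir = tempfile.mkdtemp()
commit(log_dir, 1, {"writer": "A"})        # first writer wins version 1
try:
    commit(log_dir, 1, {"writer": "B"})    # racing writer targets the same version
    outcome = "committed"
except CommitFailedError:
    outcome = "rejected"
print(outcome)
```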

Pinning the version you read, the same whether you write SQL or a DataFrame. A DeltaTable handle captures the version at forName() (call it vB), and every merge() / delete() / update() through it commits against vB, so OCC validates the whole (vB, HEAD] window, not just the instant of the commit. conn.sql("delete …" / "update …" / "merge …") funnels into the exact same engine path, pinned the same way: there is no second code path for SQL. And DuckDB exposing the read version in delta_scan(…, version => vB) (the reason for the duckdb >= 1.5.4 floor) lets the read sit on vB too, so a read-modify-write split across statements still lands on one snapshot. Spark/Delta fences only the commit instant; this fences the version you actually read.

append_if_unchanged / overwrite_if_unchanged are the fenced siblings of plain append/overwrite. I had to coin the terms, because Delta/Spark has no built-in fenced append (you’d hand-write a MERGE for it). They’re a version compare-and-swap: load the table at the version you read and pass max_commit_retries=0 so delta-rs won’t rebase. If anything committed since, the target version is already taken and the commit fails. For the watermark/idempotent-append case this is cheaper than a merge, with no target scan and no key join. (safeappend is the deprecated alias.)
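The difference between a plain append and its fenced sibling comes down to whether the writer is allowed to rebase. A toy in-memory model, assuming nothing about duckrun's internals (`ToyLog` and its `append` signature are hypothetical; only the `max_commit_retries=0` idea comes from the text):

```python
class CommitFailedError(Exception):
    """Stand-in for the delta-rs error."""

class ToyLog:
    """In-memory model of a Delta log: index i holds the rows committed as version i."""

    def __init__(self):
        self.commits = [[]]  # version 0: empty table

    @property
    def head(self):
        return len(self.commits) - 1

    def append(self, rows, read_version, max_commit_retries):
        target = read_version + 1
        retries = 0
        while target <= self.head:          # target version already taken
            if retries >= max_commit_retries:
                raise CommitFailedError(f"version {target} already exists")
            retries += 1
            target = self.head + 1          # rebase onto HEAD and try again
        self.commits.append(rows)
        return target

log = ToyLog()
v_read = log.head                                             # we read version 0
log.append(["other"], read_version=0, max_commit_retries=0)   # someone else commits version 1
plain = log.append(["mine"], read_version=v_read, max_commit_retries=5)  # rebases to version 2
try:
    log.append(["mine again"], read_version=v_read, max_commit_retries=0)
    fenced = "committed"
except CommitFailedError:
    fenced = "rejected"
print(plain, fenced)
```

With retries allowed the stale writer quietly lands on HEAD; with zero retries the same stale read becomes an error, which is the compare-and-swap behavior described above.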

The dbt adapter

A thin wrapper over dbt-duckdb that adds Delta-backed table and incremental materializations; everything else (views, seeds, sources, tests, plugins) is inherited. Point a profile at a lakehouse and dbt run:

			
my_project:
  outputs:
    dev:
      type: duckrun
      root_path: "abfss://<ws>@onelake.dfs.fabric.microsoft.com/<lh>/Tables"

		

dbt with no catalog. Normally dbt leans on a metastore to know what exists and to resolve {{ this }}, ref(), and is_incremental(); duckrun has none, and state still survives across separate dbt build processes. At run start the adapter discovers Delta tables on disk (glob for local/az/s3/gs; the OneLake DFS REST API for abfss, since DuckDB can’t glob it reliably) and registers each as a delta_scan view named to match dbt’s database.schema.identifier. That view is what makes those references resolve against real Delta tables. Each materialization pre-registers its own {{ this }} view before running, then recreates the view after the delta-rs write, since that write lands a new Delta version and the old view would otherwise point at stale files. The namespace is rebuilt from storage on every run instead of read from a catalog.
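For the local-filesystem case, discovery plus view registration can be sketched as below. This only generates the SQL; it is not the adapter's code. The `<root>/<schema>/<table>` layout, the function names, and the `lakehouse` database name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def discover_delta_tables(root):
    """Yield (schema, table, path) for every <root>/<schema>/<table>/_delta_log directory."""
    for log_dir in sorted(Path(root).glob("*/*/_delta_log")):
        table_dir = log_dir.parent
        yield table_dir.parent.name, table_dir.name, table_dir

def view_sql(database, schema, table, path):
    """Build the statement that exposes one Delta table as a delta_scan view."""
    location = Path(path).as_posix().replace("'", "''")  # escape quotes in the path literal
    return (f'CREATE OR REPLACE VIEW "{database}"."{schema}"."{table}" '
            f"AS SELECT * FROM delta_scan('{location}')")

root = Path(tempfile.mkdtemp())
(root / "dbo" / "orders" / "_delta_log").mkdir(parents=True)
(root / "dbo" / "not_a_table").mkdir()  # no _delta_log, so it is skipped
statements = [view_sql("lakehouse", s, t, p) for s, t, p in discover_delta_tables(root)]
print(statements[0])
```

Re-running the same statement after a write is what keeps the view pointing at the new Delta version.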

Incremental strategies:

Strategy	Behavior
`merge` (default with `unique_key`)	upsert
`insert`	insert new keys only
`append` (default without `unique_key`)	blind append
`append_if_unchanged` / `safeappend`	append, commit only if version unchanged: cheap, no target scan, errors on conflict
`microbatch`	delete+insert per `event_time` window

Compaction and 7-day vacuum run automatically (every run for overwrites; past a file-count threshold for incrementals).

Two limits worth knowing up front. First, writes are single-threaded within a run (in-process delta-rs isn’t thread-safe; cross-process concurrency is fully supported). Second, constraints are enforced at the write boundary, not stored in the table: a contract with not_null columns is checked by a guard query before the write, so a null fails with NOT NULL constraint failed and the prior Delta version is left untouched. Two caveats there: delta-rs can’t persist column constraints into Delta metadata, and timestampNtz columns can’t be written yet.
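A guard query of this kind is easy to picture. The sketch below uses the standard library's sqlite3 as a stand-in engine so it runs anywhere; duckrun would run the equivalent in DuckDB, and the helper names and exact error text here are assumptions.

```python
import sqlite3

def not_null_guard(relation, columns):
    """Build a query that counts rows violating NOT NULL on any of `columns`."""
    predicate = " OR ".join(f'"{c}" IS NULL' for c in columns)
    return f"SELECT count(*) FROM {relation} WHERE {predicate}"

def check_contract(conn, relation, columns):
    """Raise before any write if the staged rows contain a null in a not_null column."""
    (violations,) = conn.execute(not_null_guard(relation, columns)).fetchone()
    if violations:
        raise ValueError(f"NOT NULL constraint failed: {violations} row(s) in {relation}")

conn = sqlite3.connect(":memory:")  # stand-in for the DuckDB connection
conn.execute("CREATE TABLE staged (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO staged VALUES (?, ?)", [(1, "ok"), (2, None)])
try:
    check_contract(conn, "staged", ["id", "status"])
    result = "written"
except ValueError as exc:
    result = str(exc)  # the write never starts, so the prior version is untouched
print(result)
```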

On versions, duckrun is deliberately conservative because the underlying libraries move fast and break: duckdb >= 1.5.4 (first stable with delta_scan(version => N)) and deltalake == 1.5.0, the first release with MERGE max_spill_size. On Microsoft Fabric, pip install --upgrade and restart the kernel, since the bundled DuckDB is older than the floor.

Testing: the only way to trust AI code

If the AI writes the code, what makes it trustworthy? Not that it compiles, and not that the unit tests are green: an AI will happily write a test that passes for the wrong reason. The only signal I actually trust is integration testing: run a real dbt project end to end and check the tables it lands on real storage.

So that’s where most of the test weight sits. duckrun runs a few hundred tests across 26 files, but the ones that matter most are the 8 integration projects that build for real, most against live Microsoft Fabric OneLake (abfss://), not a mock. To keep them honest, with real models rather than toys I wrote to flatter the adapter, I ported existing dbt projects from the web: other people’s models kept as close to original as I could, with attribution:

sde_dbt_tutorial, a port of josephmachado/simple_dbt_project: raw tables → bronze typing → a Delta-backed SCD2 customer snapshot → a merge-incremental clickstream fact → an orders_obt gold mart.
coffee, ported from JosueBogran/coffeeshopdatageneratorv2: CSV ingest over https, a deduped SCD2 product dim, a region-partitioned fact, a revenue mart.
aemo, my own dbt_fabric_python_delta, built against live OneLake. The full run is published as browsable dbt docs: fct_scada is a 360M-row Delta table you can inspect yourself, not a screenshot.
snapshot_pin, a concurrent-writer test that asserts the guarantee above end to end: one writer reads a version, a second commits underneath it, and the first writer’s stale commit is rejected on real storage rather than silently overwriting.
plus a TPCH merge/append/overwrite spill benchmark, a connection-API demo on live NYC TLC taxi data, and a multi-catalog lakehouse + warehouse + local join.

The rendered catalogs for these (real Delta stats, row counts, last-modified) are on the project page.

On top of the projects, the adapter runs the official dbt adapter test suite (dbt-tests-adapter, the same conformance suite every dbt adapter is measured against) at 126/135 passing (93%), regenerated on every push to main. The documented failures are deliberate choices: no persistent views in open Delta, and rejecting merge configs that would silently diverge.

The evidence that matters is tables on real storage, hit the same way a user would, not a passing test count and not “the AI said it works.”

What does it mean to build a package you don’t understand?

I should be honest: I don’t understand the code in detail. But that was always true. duckrun is glue over DuckDB and delta-rs (written in C++ and Rust), and I don’t have the slightest idea how those work internally either. Almost nobody who builds on a library understands its guts. So what does “writing a package” actually mean?

For me, it starts with two things: expressing the problem, which means knowing the pain well enough to say exactly what should happen, and making the design decisions that follow: delta-rs for every write, delta_scan views for reads, snapshot isolation as the contract. The AI writes the code; I own the problem and the shape of the solution.

The third thing is what makes it real: tests, and a lot of them. duckrun runs an extensive unit and integration suite, but I only actually trust it when I see tables land in OneLake. Code that passes locally and code that materializes correctly on real storage are not the same claim.

One trick I’ve learned: use a second agent to verify the first one’s work, not the agent that wrote it. The catch is that AI still cheats to make a test pass: it’ll weaken or game the check even when it plainly knows that isn’t the right thing. I hope that improves. Until it does: don’t trust anything it produces. Verify it against reality.

Power BI with DuckDB, 4 years later

Four years ago I wrote a blog about using DuckDB with Power BI in DirectQuery. It got a fair number of likes on LinkedIn 🙂 along with the one comment I didn’t want to hear: how does this work in production? (Craig, if you’re reading this, you were right.)

Back then I thought the technology was the hard part and the rest would sort itself out. It didn’t.

The ODBC driver never really worked in any non-trivial setup: filters didn’t push down, and decimal precision was buggy. It has gotten better since, but two show stoppers remained:

DuckDB is in-process, so the driver is the database. There’s no warm, long-running session. Every query starts from scratch.
I don’t think those drivers can realistically be certified (personal opinion). And Power BI Service, or any hosted BI service for that matter, is not going to host an in-process engine for free. An on-prem data gateway is not really a good option either.

In 2026 things are way better. MotherDuck (DuckDB’s SaaS) shipped a PostgreSQL endpoint. Problem solved: Power BI speaks Postgres, and it works out of the box.

Then last week DuckDB released Quack. For my own sanity I’ll just call it “DuckDB Server.” It is just an extension; a single function call and you have a server !!

My first reaction was annoyance. Four years of waiting, and they shipped a proprietary wire protocol. I was hoping for pg wire. I want my driver to work. I don’t really care about a 2x improvement if nothing interoperates.

Luckily I was partially wrong. Within two days there was an ADBC driver from gizmodata/adbc-driver-quack, and, to my surprise, a Power BI custom connector from Curt Hagenlocher (think of him as the Linus of Power Query). My understanding is that it is a side project, not an official Microsoft one.

And somehow, the whole thing worked. It was beautiful.

But lesson learned from last time: this is experimental, with no guarantee the connector will ever be certified.

The main change from the 2022 post is that instead of pointing at parquet files, I’m pointing at a catalog and getting tables back, like an actual database instead of a pile of files. And DuckDB got way better.

High level architecture

OneLake Iceberg Catalog — OneLake exposes data as tables. You need three things:
- Endpoint: https://onelake.table.fabric.microsoft.com/iceberg
- An Entra ID auth token
- Path to the Lakehouse/Warehouse: workspace_name/Lakehouse_name.Lakehouse

DuckDB + iceberg extension — reads the catalog and the underlying parquet over HTTPS.

Entra ID — az account get-access-token --resource https://storage.azure.com/ mints a short-lived bearer token. No service principal, no app registration. I have a script that grabs the token, and I opened duckdb-azure#170 hoping to make this much simpler.
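A token-grabbing script along these lines could look like the sketch below. It is not the script mentioned above. It assumes the Azure CLI is installed and already logged in, and it uses the CLI's global `--query` and `-o tsv` options to get the raw token. The function names are hypothetical.

```python
import subprocess

RESOURCE = "https://storage.azure.com/"

def token_command(resource=RESOURCE):
    """Build the az CLI call; --query / -o tsv reduce the JSON response to the bare token."""
    return ["az", "account", "get-access-token", "--resource", resource,
            "--query", "accessToken", "-o", "tsv"]

def get_token(run=subprocess.run):
    """Return the bearer token string. `run` is injectable so this can be tested offline."""
    result = run(token_command(), capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

On Windows the executable is `az.cmd`, so the first element of the command may need adjusting there.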

DuckDB Endpoint — turns the engine into a TCP server on 127.0.0.1:9494, speaking DuckDB’s native wire protocol (whatever that means).

The ADBC Driver: the Python client and Power BI share the same DLL. You need to install it manually from Curt’s GitHub page.

You can download all the files here

Power BI

Let’s just share a video. Yes, 600M rows, warm run on my laptop.

Python Notebook

TPC-H SF=10 (10 GB), 22 queries, run twice in the same session via client.ipynb. Numbers are seconds, copied straight from the notebook output.

	Cold	Warm
Total	~5 min 29 s	~30 s

Cold time is dominated by parquet I/O over HTTPS from OneLake: bandwidth and seek count, not CPU. The OneLake endpoint is on another continent and my internet provider is horrible 🙂 Warm runs hit DuckDB’s in-process buffer cache.

Optimization on this stack should target bytes read and seeks (codec, row-group size, predicate pushdown, range prefetch), not query plans.

This is exactly why server mode makes sense: the warm cache is shared by all clients (notebook, Power BI, AI agent).

Not production ready

The Entra token has a ~1h TTL. As far as I can tell, DuckDB has no way to auto-refresh tokens.
The driver is not certified, so it can’t be used in the service. If you want it added to Power BI, create an idea in the Fabric forum and vote.
DuckDB Server is new. Don’t expect SQL Server maturity yet 🙂
DuckDB’s remote file cache is RAM only. When you restart DuckDB, you lose it and have to deal with the cold-run pain again and egress fees 😦
The DuckDB Azure extension is still pretty rough in places. To be fair, they’ve openly said they don’t have the bandwidth.
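Until tokens refresh on their own, the expiry can at least be tracked on the client side. A minimal sketch, assuming a one-hour TTL and a hypothetical `fetch` callable that returns a fresh token; handing the new token to DuckDB is not shown.

```python
import time

class TokenCache:
    """Re-fetch a token shortly before its assumed TTL runs out."""

    def __init__(self, fetch, ttl_seconds=3600, margin_seconds=300, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl - self._margin
        if stale:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token
```

Calling `cache.get()` before each batch of queries keeps the token younger than the TTL minus the margin.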

Hopefully it won’t take another four years to make this production ready.

Still, seeing DuckDB as a single binary serving a 600M row table to Power BI was genuinely fun. And the Iceberg catalog is awesome !!!

Deploying to Microsoft Fabric with the Fabric CLI: First Impression

Microsoft Fabric now has a proper CLI deploy, and it works. I built a fully automated CI/CD pipeline that deploys a Python notebook, Lakehouse, Semantic Model, and Data Pipeline to Fabric using nothing but the fab CLI and GitHub Actions. Here’s what I learned along the way: what works great, what to watch out for, and where a few small additions could make the experience even better.

The full source code is available on GitHub: djouallah/dbt_fabric_python_notebook.

The blog and the code were written by AI. To be clear, Fabric has always had an excellent API, and I personally used ad hoc Python scripts to deploy, but this time it feels more natural.

Maybe the main takeaway when working with an agent and writing Python code: log everything, including API responses, especially at the beginning. AI is very good at autocorrecting !!!

The Goal

Push to main or production, and everything deploys automatically:

A Lakehouse gets created (with schemas enabled)
A Python Notebook gets deployed and attached to the Lakehouse (dbt needs a local path)
The notebook’s supporting files get copied to OneLake
The notebook runs — transforming data and creating Delta tables
A Direct Lake Semantic Model gets deployed (pointing at those Delta tables)
A Data Pipeline gets deployed and scheduled on a cron

No portal clicks. No manual steps. Just git push.

Project Structure

			
├── deploy.py                    # Orchestrates the entire deploy
├── deploy_config.yml            # Per-environment config (workspace IDs, schedules)
├── fabric_items/
│   ├── data.Lakehouse/          # Lakehouse definition
│   ├── run.Notebook/            # Python notebook (.ipynb)
│   ├── aemo_electricity.SemanticModel/  # Direct Lake model
│   └── run_pipeline.DataPipeline/       # Scheduled pipeline
├── dbt/                         # Data transformation project
└── .github/workflows/
    ├── ci.yml                   # Tests on every push
    └── deploy.yml               # Deploys to Fabric

		

Each Fabric item lives in a folder named {displayName}.{ItemType} under fabric_items/. The deploy script discovers them dynamically — no hardcoded item names.
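Discovery by folder name is a few lines of pathlib. A sketch of the idea, not the actual deploy.py (the function name is hypothetical):

```python
import tempfile
from pathlib import Path

def discover_items(items_dir):
    """Return sorted (display_name, item_type) pairs from folders named {displayName}.{ItemType}."""
    items = []
    for folder in Path(items_dir).iterdir():
        if folder.is_dir() and "." in folder.name:
            display_name, item_type = folder.name.rsplit(".", 1)
            items.append((display_name, item_type))
    return sorted(items)

# Build a throwaway fabric_items/ folder to show the result.
items_dir = Path(tempfile.mkdtemp())
for name in ("data.Lakehouse", "run.Notebook"):
    (items_dir / name).mkdir()
found = discover_items(items_dir)
print(found)
```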

What Works Well

The fab deploy command is brand new — v1.5.0, March 12, 2026. For a tool that just shipped, two things stood out.

Native `.ipynb` Support for Notebooks

Fabric’s default Git format for notebooks is notebook-content.py — a custom FabricGitSource format that flattens your notebook into a single .py file with metadata comments. It’s fine for Git diffs, but you lose the cell structure, can’t preview outputs, and can’t use standard Jupyter tooling to edit it.

As of Fabric CLI v1.4.0 (February 2026), you can now deploy notebooks as standard .ipynb files. Before v1.4.0, the CLI only supported the .py format.

With .ipynb support, what you see in VS Code or Jupyter is exactly what gets deployed:

			
fabric_items/
  run.Notebook/
    .platform
    notebook-content.ipynb    # standard Jupyter format, deployed as-is

You can edit notebooks locally with proper cell boundaries, use Jupyter tooling, and the deploy just works. Notebooks are finally first-class citizens in the deployment story.

`model.bim` Is Beautifully Simple

Fabric supports two formats for Semantic Models: TMDL (a folder of .tmdl files, one per table — the default) and TMSL (a single model.bim JSON file). TMDL is better for Git diffs on large models. But for my use case, model.bim is perfect.

One file. Everything in it — tables, columns, measures, relationships, and the Direct Lake connection. The entire environment-specific configuration boils down to a single OneLake URL:

https://onelake.dfs.fabric.microsoft.com/{workspace_id}/{lakehouse_id}

Two GUIDs. That’s it. Swapping environments is a two-line string replacement:

			
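bim_text = bim_path.read_text()                   # model.bim as exported from dev
bim_path.write_text(
    bim_text.replace(source_ws_id, WS_ID)         # dev workspace GUID -> target
            .replace(source_lh_id, target_lh_id)  # dev lakehouse GUID -> target
)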

Compare this to the pipeline, where you’re hunting through deeply nested JSON paths with fab set. The BIM format is refreshingly straightforward.

The deploy works perfectly with just Python string replacement — three lines of code and a git checkout to restore.

TMSL (`model.bim`) vs TMDL: Which Format for CI/CD?

Fabric supports two formats for Semantic Models, and this choice matters more than it might seem.

TMDL is the default. It splits your model into a folder of .tmdl files — one per table, plus separate files for relationships, the model definition, and the database config:

			
definition/
├── tables/
│   ├── dim_calendar.tmdl
│   ├── dim_duid.tmdl
│   └── fct_summary.tmdl
├── relationships.tmdl
├── model.tmdl
└── database.tmdl

		

TMSL is a single model.bim JSON file with everything in it.

For CI/CD pipelines, TMSL wins hands down. Here’s why:

One file to manage. Your deploy script reads one file, replaces two GUIDs, deploys, and runs git checkout to restore. With TMDL, you’d need to find which .tmdl file contains the OneLake URL and handle multiple files.
Two .replace() calls. The entire environment swap is two string replacements on one file. With TMDL, the connection expression lives in model.tmdl, but table definitions reference it indirectly — more files to reason about during deployment.
Easier to grep and debug. When something goes wrong with your Direct Lake connection, you open one file, search for the OneLake URL, and see everything. No jumping between files.

When TMDL makes more sense:

Large models with dozens of tables where multiple people edit measures and columns — per-file Git diffs are cleaner and merge conflicts are smaller
Teams using Tabular Editor who need reviewable PRs on individual table changes
Models that change frequently at the table level

But if your semantic model is authored once and deployed across environments — which is the typical CI/CD pattern — you’re not reviewing table-level diffs. You’re swapping two GUIDs and pushing. TMSL keeps it simple.

I chose model.bim and haven’t looked back.

Things to Know Before You Start

Lesson 1: Deploy Order Matters — A Lot

This was my biggest source of failed deployments. Fabric items have implicit dependencies, and deploying them out of order causes cryptic failures.

The correct sequence:

Lakehouse → Notebook → (run notebook) → Semantic Model → Data Pipeline

Why this specific order:

The Notebook needs a Lakehouse to attach to. If the Lakehouse doesn’t exist yet, the attachment step fails.
The Semantic Model uses Direct Lake mode, which validates that the Delta tables it references actually exist. If you deploy the model before running the notebook that creates those tables, validation fails.
The Data Pipeline references the Notebook by ID. You need the Notebook deployed first to get its ID in the target workspace.
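The dependency order above can be encoded once and applied to whatever items are discovered. A sketch with hypothetical names; it covers only the four item types used in this project and fails loudly on anything else.

```python
DEPLOY_ORDER = ["Lakehouse", "Notebook", "SemanticModel", "DataPipeline"]

def order_items(items):
    """Sort (display_name, item_type) pairs so that dependencies deploy first."""
    unknown = {item_type for _, item_type in items if item_type not in DEPLOY_ORDER}
    if unknown:
        raise ValueError(f"no deploy position defined for: {sorted(unknown)}")
    return sorted(items, key=lambda item: (DEPLOY_ORDER.index(item[1]), item[0]))

items = [
    ("run_pipeline", "DataPipeline"),
    ("aemo_electricity", "SemanticModel"),
    ("run", "Notebook"),
    ("data", "Lakehouse"),
]
ordered = order_items(items)
print(ordered)
```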

I ended up with a strict 7-phase deploy script:

			
# 1. Create/verify Lakehouse (with schemas enabled)
# 2a. Deploy Lakehouse
# 2b. Deploy Notebook
# 2c. Attach Lakehouse to Notebook via fab set
# 3. Copy supporting files to OneLake
# 4. Run the Notebook (blocks until complete)
# 5. Deploy Semantic Model (Delta tables now exist)
# 6. Refresh Semantic Model via Power BI API
# 7. Deploy + schedule Data Pipeline

		

Lesson 2: `fab job run` Does Nothing for Notebooks Without `-i '{}'`

This one cost me hours of debugging. Running a notebook via the CLI:

			
# Does NOTHING — silently succeeds but notebook never executes
fab job run prod.Workspace/run.Notebook
# Actually runs the notebook
fab job run prod.Workspace/run.Notebook -i '{}'

Notebooks require the -i '{}' flag (empty JSON input). Without it, the command returns success but the notebook never fires. There’s no error, no warning — it just silently does nothing.
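When the call is made from Python, wrapping it in a helper makes the flag hard to forget. A sketch with hypothetical function names; passing the arguments as a list means the `{}` needs no shell quoting.

```python
import subprocess

def notebook_run_command(workspace, notebook):
    """Build `fab job run` for a notebook, always passing the empty JSON input."""
    return ["fab", "job", "run", f"{workspace}.Workspace/{notebook}.Notebook", "-i", "{}"]

def run_notebook(workspace, notebook, run=subprocess.run):
    """Run the notebook and raise if the CLI exits non-zero."""
    return run(notebook_run_command(workspace, notebook), check=True)
```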

Lesson 3: `parameter.yml` Token Replacement Is Surprisingly Limited

Fabric CLI has a parameter.yml mechanism for replacing GUIDs across environments. The idea is great — use tokens like $workspace.id and $items.Lakehouse.data.$id that get resolved at deploy time.

In practice, the rules are strict and poorly documented:

Tokens only resolve if the entire value starts with `$`

			
# WRONG — token is embedded in a URL, never resolves
replace_value:
  _ALL_: "https://onelake.dfs.fabric.microsoft.com/$workspace.id/$items.Lakehouse.data.$id/"
# CORRECT — each token must be its own replacement entry
- find_value: "e446a5e7-..."
  replace_value:
    _ALL_: "$workspace.id"

		

The `$items` token format is strict

			
$items.Lakehouse.data.$id    # correct: $items.{type}.{name}.$attribute
$items.data.$id              # WRONG: "Invalid $items variable syntax"

`is_regex` must be a string, not a boolean

			
is_regex: "true"   # correct
is_regex: true     # WRONG — Fabric CLI rejects with "not of type string"

My solution: skip `parameter.yml` entirely

I found it simpler and more transparent to do GUID replacement directly in Python:

			
# Read the source file, find dev GUIDs, replace with target GUIDs
bim_text = bim_path.read_text()
bim_path.write_text(
    bim_text.replace(source_ws_id, WS_ID)
            .replace(source_lh_id, target_lh_id)
)
# Deploy with the modified file
fab_deploy(["SemanticModel"])
# Restore original for clean git state; fail loudly if the restore does not work
subprocess.run(["git", "checkout", "--", str(bim_path)], check=True)

		

The pattern: modify → deploy → git restore. No token resolution needed.

Lesson 4: `item_types_in_scope` Must Be Plural

The deploy config YAML key is item_types_in_scope (plural). Use the singular item_type_in_scope and Fabric CLI silently ignores it — deploying everything in your repository directory instead of just the types you specified.

			
# CORRECT
item_types_in_scope:
  - Notebook
  - Lakehouse
# WRONG — silently deploys ALL item types
item_type_in_scope:
  - Notebook

		

This is the kind of bug that only shows up in production when your Semantic Model gets deployed before your Delta tables exist.
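A cheap pre-flight check can catch this class of typo before the CLI ever sees the file. The sketch below flags keys that are almost, but not exactly, an expected key. The expected-key list holds only the key discussed here, and the config is passed in as a plain dict (loading the YAML is left out).

```python
import difflib

EXPECTED_KEYS = ["item_types_in_scope"]  # extend with the other keys your config uses

def find_misspelled_keys(config):
    """Return {wrong_key: expected_key} for keys that nearly match an expected key."""
    suspects = {}
    for key in config:
        if key in EXPECTED_KEYS:
            continue
        close = difflib.get_close_matches(key, EXPECTED_KEYS, n=1, cutoff=0.8)
        if close:
            suspects[key] = close[0]
    return suspects

suspects = find_misspelled_keys({"item_type_in_scope": ["Notebook"]})
print(suspects)
```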

Lesson 5: New Lakehouses Need a Provisioning Wait

Creating a Lakehouse returns immediately, but the underlying infrastructure isn’t ready yet:

			
import subprocess
import time

# LAKEHOUSE is the item path defined elsewhere in the deploy script
result = subprocess.run(["fab", "create", LAKEHOUSE, "-P", "enableSchemas=true"])
if result.returncode == 0:
    # Brand new lakehouse — need to wait for provisioning
    print("Waiting 60s for provisioning...")
    time.sleep(60)

		

On first deploy to a new workspace, this 60-second wait is essential. Without it, subsequent operations (deploying items, copying files) fail with opaque errors.
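An alternative to a fixed sleep is to retry the first dependent operation until it succeeds. This is a different technique from the one in the script above and has not been tested against Fabric. The helper name, attempt count, and delay are assumptions.

```python
import time

def wait_until_ready(operation, attempts=6, delay_seconds=10, sleep=time.sleep):
    """Call `operation` until it stops raising, sleeping between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)
```

The trade-off: a fast provisioning finishes sooner than 60 seconds, but a genuine failure takes the full retry budget to surface.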

Lesson 6: Attaching a Lakehouse to a Notebook Requires `fab set`

Deploying a notebook doesn’t automatically connect it to a Lakehouse. You need a separate fab set call:

			
lakehouse_payload = json.dumps({
    "known_lakehouses": [{"id": target_lh_id}],
    "default_lakehouse": target_lh_id,
    "default_lakehouse_name": "data",
    "default_lakehouse_workspace_id": WS_ID,
})
fab(["set", NOTEBOOK, "-q",
     "definition.parts[0].payload.metadata.dependencies.lakehouse",
     "-i", lakehouse_payload, "-f"])

		

The JSON path is deeply nested and not well documented. I had to inspect the API responses to find the correct path: definition.parts[0].payload.metadata.dependencies.lakehouse.
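To see what such a path addresses, it helps to walk it over a plain nested dict. The sketch below is illustrative only and says nothing about how fab set is implemented. The helper names and the trimmed-down document are hypothetical.

```python
import re

def parse_path(path):
    """Split 'a.b[0].c' into ['a', 'b', 0, 'c']."""
    tokens = []
    for name, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        tokens.append(name if name else int(index))
    return tokens

def set_by_path(document, path, value):
    """Assign `value` at `path` inside nested dicts/lists; the parent must already exist."""
    *parents, last = parse_path(path)
    node = document
    for token in parents:
        node = node[token]
    node[last] = value

doc = {"definition": {"parts": [{"payload": {"metadata": {"dependencies": {}}}}]}}
set_by_path(doc, "definition.parts[0].payload.metadata.dependencies.lakehouse",
            {"default_lakehouse": "<target_lh_id>"})
print(doc)
```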

Lesson 7: Semantic Model Refresh Uses the Power BI API, Not the Fabric API

After deploying a Direct Lake semantic model, you need to trigger a refresh. But this isn’t a Fabric API call — it’s a Power BI API call:

			
# Note the -A powerbi flag — this targets the Power BI API endpoint
fab api -A powerbi -X post "groups/{workspace_id}/datasets/{model_id}/refreshes"

Without the -A powerbi flag, you’ll get 404s because the Fabric API doesn’t have a refresh endpoint for semantic models.

Lesson 8: Pipeline References Are Hardcoded GUIDs

A Data Pipeline that runs a notebook stores the notebook’s ID and workspace ID as hardcoded GUIDs in its definition:

			
{
  "typeProperties": {
    "notebookId": "da888b35-a17c-49ac-a8cf-1a5ffae91e20",
    "workspaceId": "e446a5e7-6666-42ad-a331-0bfef3187fbf"
  }
}

		

These are your dev GUIDs. After deploying to a different workspace, you need to update them:

			
target_nb_id = get_target_item_id("Notebook", "run")
fab(["set", PIPELINE, "-q",
     "definition.parts[0].payload.properties.activities[0].typeProperties.notebookId",
     "-i", target_nb_id, "-f"])
fab(["set", PIPELINE, "-q",
     "definition.parts[0].payload.properties.activities[0].typeProperties.workspaceId",
     "-i", WS_ID, "-f"])

		

Again, the JSON paths are deeply nested. The fab set command is your best friend for post-deploy configuration.
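After patching, it is worth checking that no dev GUID is left anywhere in the definition. Here is a sketch of such a check, assuming the pipeline definition is available as a Python dict. The `foreign_guids` helper and the target GUIDs are hypothetical; the dev GUIDs are the ones shown above.

```python
import json
import re

GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def foreign_guids(definition: dict, allowed: set) -> set:
    """Return every GUID found in the definition that is not in `allowed`."""
    found = {g.lower() for g in GUID.findall(json.dumps(definition))}
    return found - {g.lower() for g in allowed}


pipeline = {"typeProperties": {
    "notebookId": "da888b35-a17c-49ac-a8cf-1a5ffae91e20",   # dev notebook
    "workspaceId": "e446a5e7-6666-42ad-a331-0bfef3187fbf",  # dev workspace
}}
# Hypothetical IDs of the target notebook and workspace.
target_ids = {"11111111-1111-1111-1111-111111111111",
              "22222222-2222-2222-2222-222222222222"}

leftovers = foreign_guids(pipeline, target_ids)
print(sorted(leftovers))  # non-empty means dev references are still present
```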

Lesson 9: GitHub Actions Authentication via OIDC

No stored secrets for the Fabric service principal. GitHub’s OIDC provider exchanges a federated token directly:

			
- name: Login to Fabric CLI
  run: |
    FED_TOKEN=$(curl -sH "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=api://AzureADTokenExchange" | jq -r '.value')
    fab auth login -t ${{ secrets.AZURE_TENANT_ID }} \
                   -u ${{ secrets.AZURE_CLIENT_ID }} \
                   --federated-token "$FED_TOKEN"

		

This means no client secrets to rotate — just configure the Azure AD app registration to trust your GitHub repo’s OIDC issuer. It works well, but you still need to set up an Azure AD app registration, configure federated credentials, and grant it Fabric permissions. It would be nice if Fabric supported direct service-to-service authentication — something like a Fabric API key or a native GitHub integration — without needing Azure as the intermediary.

Lesson 10: Use Variable Libraries for Runtime Config

Instead of baking config values into your notebook or using parameter.yml, Fabric has Variable Libraries:

			
# In your notebook at runtime:
import notebookutils
vl = notebookutils.variableLibrary.getLibrary("deploy_config")
download_limit = vl.download_limit

The deploy script creates/updates the variable library via the API:

			
fab(["api", "-X", "post", f"workspaces/{WS_ID}/variableLibraries",
     "-i", json.dumps({"displayName": "deploy_config", "definition": vl_definition})])

This gives you environment-specific configuration without redeploying the notebook. Change a variable, next pipeline run picks it up.

Lesson 11: Use `abfss://` Paths for OneLake — It Makes Your Notebook Portable

When reading or writing to OneLake, use the abfss:// protocol with workspace and lakehouse IDs:

			
workspace_id = notebookutils.runtime.context.get('currentWorkspaceId')
lakehouse_id = notebookutils.lakehouse.get('data').get('id')
root_path = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}"

This makes your notebook fully portable — the same code runs everywhere:

Local dev: swap to a local path or Azurite connection
Deployed to staging: notebookutils resolves to the staging workspace/lakehouse IDs
Deployed to production: same code, different IDs at runtime

The alternative — hardcoding workspace names or using /lakehouse/default/ mount paths — ties your notebook to a specific workspace. With abfss://, the notebook doesn’t care where it’s running. The IDs come from the runtime context, and the deploy script handles attaching the right Lakehouse. Zero code changes between environments.
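One way to keep the local-versus-Fabric switch in a single place is a small path helper. This is a sketch, not the notebook's code: the function names, the `local_root` parameter and the `Tables/<schema>/<table>` layout (for a schema-enabled Lakehouse) are assumptions.

```python
from typing import Optional


def onelake_root(workspace_id: str, lakehouse_id: str,
                 local_root: Optional[str] = None) -> str:
    """Return the storage root: a local folder if given, else the OneLake abfss URI."""
    if local_root:
        return local_root.rstrip("/")
    return f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}"


def table_path(root: str, schema: str, table: str) -> str:
    """Build the path of a Delta table under the root (assumed layout)."""
    return f"{root}/Tables/{schema}/{table}"


# In Fabric the two IDs would come from notebookutils, as shown above.
print(table_path(onelake_root("ws-id", "lh-id"), "dbo", "sales"))
print(table_path(onelake_root("ws-id", "lh-id", local_root="/tmp/lake/"), "dbo", "sales"))
```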

Lesson 12: Copying Files to OneLake Is Parallel but Slow

The notebook needs supporting files (SQL models, configs) available in OneLake. The fab cp command handles this, but it’s one file at a time. I parallelized with 8 workers:

			
from concurrent.futures import ThreadPoolExecutor

def copy_file(f):
    rel = f.relative_to(root)
    fab(["cp", rel.as_posix(), f"{LAKEHOUSE}/Files/{rel.parent.as_posix()}/", "-f"])

with ThreadPoolExecutor(max_workers=8) as executor:
    # list() consumes the results, so an exception raised in any copy surfaces here
    list(executor.map(copy_file, files))

		

Before copying files, you need to create the directory structure with fab mkdir. OneLake doesn’t auto-create parent directories.
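The list of directories to create can be derived from the file list itself. Here is a sketch, assuming the files are given as paths relative to the repo root. The `dirs_to_create` helper is hypothetical; each returned directory would then be passed to fab mkdir before the parallel copy starts.

```python
from pathlib import PurePosixPath


def dirs_to_create(rel_files):
    """Return every parent directory of the given relative paths, parents first."""
    dirs = set()
    for f in rel_files:
        for parent in PurePosixPath(f).parents:
            if parent != PurePosixPath("."):  # skip the root itself
                dirs.add(parent)
    ordered = sorted(dirs, key=lambda p: (len(p.parts), p.as_posix()))
    return [p.as_posix() for p in ordered]


files = ["models/staging/a.sql", "models/marts/b.sql", "config.yml"]
print(dirs_to_create(files))  # ['models', 'models/marts', 'models/staging']
```

Sorting by depth guarantees a parent is created before its children, so the mkdir calls can run in that order.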

Lesson 13: Schedule Idempotently

Don’t recreate the pipeline schedule every deploy — check first:

			
result = subprocess.run(["fab", "job", "run-list", PIPELINE, "--schedule"],
                        capture_output=True, text=True)
if "True" not in result.stdout:
    fab(["job", "run-sch", PIPELINE,
         "--type", "cron",
         "--interval", cfg["schedule_interval"],
         "--start", cfg["schedule_start"],
         "--end", cfg["schedule_end"],
         "--enable"])

		

This prevents duplicate schedules stacking up across deploys.

The Big Picture

Here’s the overall architecture in one diagram:

			
GitHub Push
    │
    ▼
GitHub Actions (OIDC → fab auth login)
    │
    ▼
deploy.py
    ├── fab create    → Lakehouse (with schemas)
    ├── fab deploy    → Notebook
    ├── fab set       → Attach Lakehouse to Notebook
    ├── fab cp        → Copy data files to OneLake (8 parallel workers)
    ├── fab job run   → Execute Notebook (creates Delta tables)
    ├── fab deploy    → Semantic Model (with GUID replacement + git restore)
    ├── fab api       → Refresh Semantic Model (Power BI API)
    ├── fab deploy    → Data Pipeline
    ├── fab set       → Update Pipeline notebook/workspace refs
    └── fab job run-sch → Schedule Pipeline (if not already scheduled)

		

Everything is driven by a single deploy_config.yml that maps branch names to workspace IDs:

			
defaults:
  schedule_interval: "30"
  schedule_start: "2025-01-01T00:00:00"
  schedule_end: "2030-12-31T23:59:59"
main:
  ws_id: "e446a5e7-..."
  schedule_interval: "720"    # 12 hours (staging)
production:
  ws_id: "be079b0f-..."
  download_limit: "60"        # full data

		

Push to main → deploy to staging workspace. Push to production → deploy to production workspace.
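The lookup itself can be a dict overlay: start from defaults, then let the branch section override. This is a sketch assuming the YAML has already been parsed into a dict. `resolve_config` is a hypothetical name, and the merge rule is my reading of the file above, not a confirmed detail of the deploy script.

```python
def resolve_config(cfg: dict, branch: str) -> dict:
    """Overlay the branch section on top of defaults; branch values win."""
    if branch == "defaults" or branch not in cfg:
        raise KeyError(f"no deploy target configured for branch {branch!r}")
    return {**cfg.get("defaults", {}), **cfg[branch]}


# Same shape and values as the deploy_config.yml shown above.
cfg = {
    "defaults": {"schedule_interval": "30"},
    "main": {"ws_id": "e446a5e7-...", "schedule_interval": "720"},
    "production": {"ws_id": "be079b0f-...", "download_limit": "60"},
}
print(resolve_config(cfg, "main")["schedule_interval"])        # 720
print(resolve_config(cfg, "production")["schedule_interval"])  # 30
```

Raising on an unknown branch means a push to an unmapped branch fails early instead of deploying somewhere unexpected.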

Lesson 14: Don’t Deploy the Lakehouse Item — Let the Data Define the Schema

I had a data.Lakehouse/ folder in fabric_items/ with a .platform file and a lakehouse.metadata.json that just set defaultSchema: dbo. I was running fab deploy for it. Then I realized: I was already creating the Lakehouse with fab create before the deploy step:

fab create "prod.Workspace/data.Lakehouse" -P enableSchemas=true

The fab create handles everything. The fab deploy of the Lakehouse item was redundant.

But there’s a deeper point here: the Lakehouse schema should be driven by your data, not by CI/CD. Your notebook creates the tables, your data transformation defines the schemas. The Lakehouse is just the container — it doesn’t need a deployment definition. Trying to manage Lakehouse schema through fab deploy is fighting the natural flow. Create the container, let the data populate it.

I deleted the entire data.Lakehouse/ folder from my repo. One less item to deploy, one less thing to break.

What I’d Tell My Past Self

Read every fab CLI error message carefully. Many failures are silent (wrong key name, missing -i flag). Add verbose logging.
Deploy in phases, not all at once. Item dependencies are real and the error messages when you get the order wrong are unhelpful.
Skip parameter.yml for anything non-trivial. Direct GUID replacement in Python with git restore is simpler and fully transparent.
fab set is the power tool. Most post-deploy configuration — attaching lakehouses, updating pipeline references — goes through deeply nested JSON paths in fab set.
Test in a separate workspace mapped to a non-production branch. The deploy_config.yml pattern of mapping branches to workspaces makes this trivial.
The Power BI API and Fabric API are different surfaces. Some operations (like semantic model refresh) only exist on the Power BI side. Use fab api -A powerbi.
Don’t deploy what you don’t need to. If fab create handles it, drop the item definition. Let your data drive the schema.

The Fabric CLI is new — fab deploy landed in v1.5.0 just this month — and it already handles a full end-to-end deployment pipeline. The foundation is solid. Everything you need is already there — it just takes knowing where to look. Hopefully this saves you some of that discovery time.

Acknowledgements

Special thanks to Kevin Chant — Data Platform MVP and Lead BI & Analytics Architect — whose blog has been an invaluable resource on Fabric CI/CD and DevOps practices for the data platform. If you’re working with Fabric deployments, his posts are well worth following.

First Look at Incremental Framing in Power BI

TL;DR: Incremental framing is like CDC to RAM 🙂 It significantly improves cold-run performance of Direct Lake mode in some scenarios. There is excellent documentation that explains everything in detail.

What Is Incremental Framing?

One of the most important improvements to Direct Lake mode in Power BI is incremental framing.

Power BI’s OLAP engine, VertiPaq (probably the most widely deployed OLAP engine, though many outside the Power BI world may not know it), relies heavily on dictionaries. This works well because it is a read-only database. Another core trick is its ability to do calculations directly on encoded data. This makes it extremely efficient and embarrassingly fast (I just like this expression for some reason).

Direct Lake Breakthrough

Direct Lake’s breakthrough is that dictionary building is fast enough to be done at runtime.

Typical workflow:

A user opens a report.
The report generates DAX queries.
These queries trigger scans against the Delta table.
VertiPaq scans only the required columns.
It builds a global dictionary per column, loads the data from Parquet into memory, and executes queries.

The encoding step happens once at the start, and since BI data doesn’t usually change that much, this model works well.

The Problem with Continuous Appends

In scenarios where data is appended frequently (e.g., every few minutes), the initial approach does not work very well. Each update requires rebuilding dictionaries and reloading all the data into RAM, effectively paying the cost of a cold run every time (reading from remote storage will always be slower).

How Incremental Framing Fixes This

Incremental framing solves the problem by:

Incrementally loading new data into RAM.
Encoding only what’s necessary.
Removing obsolete Parquet data when not needed.

This substantially improves cold-run performance. Hot-run performance remains largely unchanged.
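The bookkeeping behind those three steps can be pictured as a set difference between the Parquet files already in memory and the files the table references now. This is only a conceptual toy in Python, not how VertiPaq implements it. The function and file names are made up.

```python
def frame_diff(loaded: set, current: set) -> dict:
    """Compare the Parquet files already in memory with the table's current files."""
    return {
        "load": current - loaded,   # new files: read and encode only these
        "evict": loaded - current,  # files no longer referenced: drop from memory
        "keep": loaded & current,   # unchanged files: nothing to do
    }


loaded = {"part-001.parquet", "part-002.parquet", "part-003.parquet"}
after_append = loaded | {"part-004.parquet"}  # a 5-minute append adds one file
diff = frame_diff(loaded, after_append)
print(len(diff["load"]), len(diff["evict"]), len(diff["keep"]))  # 1 0 3
```

With frequent small appends the "load" set stays tiny, which is why the cold-run cost drops so much.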

Benchmark: Australian Electricity Market

To test this feature, I used my go-to workload: the Australian electricity market, where data is appended every 5 minutes—an ideal test case.

Incremental framing is on by default; I turned it off by following this blog.
For benchmarking, I adapted an existing tool, Direct Lake load testing (the only change was writing the results to Delta instead of CSV). I used 8 concurrent users, and the main fact table has around 120 million records. The queries reflect a typical user session: this is a real-life use case, not a theoretical benchmark.

Results

P99

P99 (the 99th percentile latency, often used to show worst-case performance):

An improvement of 9x–10x. Again, your results may vary depending on workload, Parquet layout, and data distribution.

P90

P90 (90th percentile latency):

Less dramatic but still strong.
Improved from 500 ms → 200 ms.
Faster queries also reduce capacity unit usage.

Geomean

Just for fun, and to show how fast VertiPaq is, let’s look at the geomean: it went from 11 ms to 8 ms. General-purpose OLAP engines are cool, but specialized engines are on another level!
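For anyone who wants to reproduce the three summary numbers from their own run, here is a sketch using only the standard library. It assumes the per-query latencies are already in a list of milliseconds. The `summarize` name and the sample values are made up, not the benchmark's data.

```python
import statistics


def summarize(latencies_ms):
    """Return P90, P99 and the geometric mean of a list of latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points: cuts[k-1] is Pk
    return {
        "p90": cuts[89],
        "p99": cuts[98],
        "geomean": statistics.geometric_mean(latencies_ms),
    }


# Synthetic sample: mostly fast queries with a slow tail.
sample = [8.0] * 90 + [200.0] * 8 + [3000.0] * 2
print(summarize(sample))
```

The geomean is dominated by the many fast queries, while P99 is dominated by the tail, which is why the two move so differently in the results above.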

This Does Not Solve the Bad Table Layout Problem

This feature improves support for Delta tables with frequent appends and deletes. However, performance still degrades if you have too many small Parquet row groups.

VertiPaq does not rewrite data layouts—it reads data as-is. To maintain good performance:

Compact your tables regularly.
In my case, I backfill data nightly. The small Parquet files added during the day don’t cause major issues, but I still compact every 100 files as a precaution.

If your data is produced inside Fabric, V-Order helps manage this. For external engines (Snowflake, Databricks, Delta Lake with Python), you’ll need to actively manage table layout yourself.
