Deploying to Microsoft Fabric with the Fabric CLI: First Impression

Microsoft Fabric now has a proper CLI deploy, and it works. I built a fully automated CI/CD pipeline that deploys a Python notebook, Lakehouse, Semantic Model, and Data Pipeline to Fabric using nothing but the fab CLI and GitHub Actions. Here’s what I learned along the way: what works great, what to watch out for, and where a few small additions could make the experience even better.

The full source code is available on GitHub: djouallah/dbt_fabric_python_notebook.

To be clear, this blog and the code were written by AI. Fabric has always had an excellent API, and I personally used ad hoc Python scripts to deploy, but this time it feels more natural.

Maybe the main takeaway when working with an agent and writing Python code: log everything, including API responses, especially at the beginning. AI is very good at self-correcting!

The Goal

Push to main or production, and everything deploys automatically:

  1. Lakehouse gets created (with schemas enabled)
  2. Python Notebook gets deployed and attached to the Lakehouse (dbt needs a local path)
  3. The notebook’s supporting files get copied to OneLake
  4. The notebook runs — transforming data and creating Delta tables
  5. Direct Lake Semantic Model gets deployed (pointing at those Delta tables)
  6. Data Pipeline gets deployed and scheduled on a cron

No portal clicks. No manual steps. Just git push.


Project Structure

├── deploy.py                             # Orchestrates the entire deploy
├── deploy_config.yml                     # Per-environment config (workspace IDs, schedules)
├── fabric_items/
│   ├── data.Lakehouse/                   # Lakehouse definition
│   ├── run.Notebook/                     # Python notebook (.ipynb)
│   ├── aemo_electricity.SemanticModel/   # Direct Lake model
│   └── run_pipeline.DataPipeline/        # Scheduled pipeline
├── dbt/                                  # Data transformation project
└── .github/workflows/
    ├── ci.yml                            # Tests on every push
    └── deploy.yml                        # Deploys to Fabric

Each Fabric item lives in a folder named {displayName}.{ItemType} under fabric_items/. The deploy script discovers them dynamically — no hardcoded item names.
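For reference, the discovery step is tiny. Here's a minimal sketch of how it might look (the fabric_items path matches the layout above; the function name is illustrative, not from the repo):

from pathlib import Path

def discover_items(root: str = "fabric_items"):
    # Yield (display_name, item_type) for every {displayName}.{ItemType} folder
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir() and "." in folder.name:
            display_name, item_type = folder.name.rsplit(".", 1)
            yield display_name, item_type

# e.g. [('aemo_electricity', 'SemanticModel'), ('data', 'Lakehouse'), ...]
print(list(discover_items()))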


What Works Well

The fab deploy command is brand new — v1.5.0, March 12, 2026. For a tool that just shipped, two things stood out.

Native .ipynb Support for Notebooks

Fabric’s default Git format for notebooks is notebook-content.py — a custom FabricGitSource format that flattens your notebook into a single .py file with metadata comments. It’s fine for Git diffs, but you lose the cell structure, can’t preview outputs, and can’t use standard Jupyter tooling to edit it.

As of Fabric CLI v1.4.0 (February 2026), you can now deploy notebooks as standard .ipynb files. Before v1.4.0, the CLI only supported the .py format.

With .ipynb support, what you see in VS Code or Jupyter is exactly what gets deployed:

fabric_items/
└── run.Notebook/
    ├── .platform
    └── notebook-content.ipynb   # standard Jupyter format, deployed as-is

You can edit notebooks locally with proper cell boundaries, use Jupyter tooling, and the deploy just works. Notebooks are finally first-class citizens in the deployment story.

model.bim Is Beautifully Simple

Fabric supports two formats for Semantic Models: TMDL (a folder of .tmdl files, one per table — the default) and TMSL (a single model.bim JSON file). TMDL is better for Git diffs on large models. But for my use case, model.bim is perfect.

One file. Everything in it — tables, columns, measures, relationships, and the Direct Lake connection. The entire environment-specific configuration boils down to a single OneLake URL:

https://onelake.dfs.fabric.microsoft.com/{workspace_id}/{lakehouse_id}

Two GUIDs. That’s it. Swapping environments is a two-line string replacement:

bim_path.write_text(
    bim_text.replace(source_ws_id, WS_ID)
            .replace(source_lh_id, target_lh_id)
)

Compare this to the pipeline, where you’re hunting through deeply nested JSON paths with fab set. The BIM format is refreshingly straightforward.

The deploy works perfectly with just Python string replacement — three lines of code and a git checkout to restore.


TMSL (model.bim) vs TMDL: Which Format for CI/CD?

Fabric supports two formats for Semantic Models, and this choice matters more than it might seem.

TMDL is the default. It splits your model into a folder of .tmdl files — one per table, plus separate files for relationships, the model definition, and the database config:

definition/
├── tables/
│   ├── dim_calendar.tmdl
│   ├── dim_duid.tmdl
│   └── fct_summary.tmdl
├── relationships.tmdl
├── model.tmdl
└── database.tmdl

TMSL is a single model.bim JSON file with everything in it.

For CI/CD pipelines, TMSL wins hands down. Here’s why:

  1. One file to manage. Your deploy script reads one file, replaces two GUIDs, deploys, and runs git checkout to restore. With TMDL, you’d need to find which .tmdl file contains the OneLake URL and handle multiple files.
  2. Two .replace() calls. The entire environment swap is two string replacements on one file. With TMDL, the connection expression lives in model.tmdl, but table definitions reference it indirectly — more files to reason about during deployment.
  3. Easier to grep and debug. When something goes wrong with your Direct Lake connection, you open one file, search for the OneLake URL, and see everything. No jumping between files.

When TMDL makes more sense:

  • Large models with dozens of tables where multiple people edit measures and columns — per-file Git diffs are cleaner and merge conflicts are smaller
  • Teams using Tabular Editor who need reviewable PRs on individual table changes
  • Models that change frequently at the table level

But if your semantic model is authored once and deployed across environments — which is the typical CI/CD pattern — you’re not reviewing table-level diffs. You’re swapping two GUIDs and pushing. TMSL keeps it simple.

I chose model.bim and haven’t looked back.


Things to Know Before You Start


Lesson 1: Deploy Order Matters — A Lot

This was my biggest source of failed deployments. Fabric items have implicit dependencies, and deploying them out of order causes cryptic failures.

The correct sequence:

Lakehouse → Notebook → (run notebook) → Semantic Model → Data Pipeline

Why this specific order:

  • The Notebook needs a Lakehouse to attach to. If the Lakehouse doesn’t exist yet, the attachment step fails.
  • The Semantic Model uses Direct Lake mode, which validates that the Delta tables it references actually exist. If you deploy the model before running the notebook that creates those tables, validation fails.
  • The Data Pipeline references the Notebook by ID. You need the Notebook deployed first to get its target workspace ID.

I ended up with a strict 7-phase deploy script:

# 1. Create/verify Lakehouse (with schemas enabled)
# 2a. Deploy Lakehouse
# 2b. Deploy Notebook
# 2c. Attach Lakehouse to Notebook via fab set
# 3. Copy supporting files to OneLake
# 4. Run the Notebook (blocks until complete)
# 5. Deploy Semantic Model (Delta tables now exist)
# 6. Refresh Semantic Model via Power BI API
# 7. Deploy + schedule Data Pipeline
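Condensed, the orchestration reads top to bottom. A sketch of the shape of deploy.py (these helper names are illustrative stand-ins for the subprocess wrappers, not fab commands):

def deploy():
    ensure_lakehouse()                     # 1. fab create (+ provisioning wait, Lesson 5)
    fab_deploy(["Lakehouse", "Notebook"])  # 2a/2b. deploy item definitions
    attach_lakehouse_to_notebook()         # 2c. fab set on notebook metadata (Lesson 6)
    copy_files_to_onelake()                # 3. fab cp, parallelized (Lesson 12)
    run_notebook_blocking()                # 4. fab job run ... -i '{}' (Lesson 2)
    fab_deploy(["SemanticModel"])          # 5. safe now: the Delta tables exist
    refresh_semantic_model()               # 6. Power BI API refresh (Lesson 7)
    fab_deploy(["DataPipeline"])           # 7. deploy, patch GUIDs, schedule
    patch_pipeline_references()            #    (Lessons 8 and 13)
    schedule_pipeline_if_needed()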

Lesson 2: fab job run Does Nothing for Notebooks Without -i '{}'

This one cost me hours of debugging. Running a notebook via the CLI:

# Does NOTHING — silently succeeds but notebook never executes
fab job run prod.Workspace/run.Notebook

# Actually runs the notebook
fab job run prod.Workspace/run.Notebook -i '{}'

Notebooks require the -i '{}' flag (empty JSON input). Without it, the command returns success but the notebook never fires. There’s no error, no warning — it just silently does nothing.


Lesson 3: parameter.yml Token Replacement Is Surprisingly Limited

Fabric CLI has a parameter.yml mechanism for replacing GUIDs across environments. The idea is great — use tokens like $workspace.id and $items.Lakehouse.data.$id that get resolved at deploy time.

In practice, the rules are strict and poorly documented:

Tokens only resolve if the entire value starts with $

# WRONG — token is embedded in a URL, never resolves
replace_value:
  _ALL_: "https://onelake.dfs.fabric.microsoft.com/$workspace.id/$items.Lakehouse.data.$id/"

# CORRECT — each token must be its own replacement entry
- find_value: "e446a5e7-..."
  replace_value:
    _ALL_: "$workspace.id"

The $items token format is strict

$items.Lakehouse.data.$id # correct: $items.{type}.{name}.$attribute
$items.data.$id # WRONG: "Invalid $items variable syntax"

is_regex must be a string, not a boolean

is_regex: "true" # correct
is_regex: true # WRONG — Fabric CLI rejects with "not of type string"

My solution: skip parameter.yml entirely

I found it simpler and more transparent to do GUID replacement directly in Python:

# Read the source file, find dev GUIDs, replace with target GUIDs
bim_text = bim_path.read_text()
bim_path.write_text(
    bim_text.replace(source_ws_id, WS_ID)
            .replace(source_lh_id, target_lh_id)
)

# Deploy with the modified file
fab_deploy(["SemanticModel"])

# Restore original for clean git state
subprocess.run(["git", "checkout", str(bim_path)])

The pattern: modify → deploy → git restore. No token resolution needed.


Lesson 4: item_types_in_scope Must Be Plural

The deploy config YAML key is item_types_in_scope (plural). Use the singular item_type_in_scope and Fabric CLI silently ignores it — deploying everything in your repository directory instead of just the types you specified.

# CORRECT
item_types_in_scope:
  - Notebook
  - Lakehouse

# WRONG — silently deploys ALL item types
item_type_in_scope:
  - Notebook

This is the kind of bug that only shows up in production when your Semantic Model gets deployed before your Delta tables exist.


Lesson 5: New Lakehouses Need a Provisioning Wait

Creating a Lakehouse returns immediately, but the underlying infrastructure isn’t ready yet:

result = subprocess.run(["fab", "create", LAKEHOUSE, "-P", "enableSchemas=true"])
if result.returncode == 0:
    # Brand new lakehouse — need to wait for provisioning
    print("Waiting 60s for provisioning...")
    time.sleep(60)

On first deploy to a new workspace, this 60-second wait is essential. Without it, subsequent operations (deploying items, copying files) fail with opaque errors.
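If you'd rather not sleep blindly, polling is an alternative. A sketch, assuming fab ls against the new item is a usable readiness probe (I haven't verified it's the cheapest one):

import subprocess, time

def wait_for_lakehouse(lakehouse_path, timeout=300, interval=15):
    # Poll until the Lakehouse answers instead of a fixed 60s sleep
    deadline = time.time() + timeout
    while time.time() < deadline:
        probe = subprocess.run(["fab", "ls", lakehouse_path],
                               capture_output=True, text=True)
        if probe.returncode == 0:
            return
        time.sleep(interval)
    raise TimeoutError(f"{lakehouse_path} not ready after {timeout}s")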


Lesson 6: Attaching a Lakehouse to a Notebook Requires fab set

Deploying a notebook doesn’t automatically connect it to a Lakehouse. You need a separate fab set call:

lakehouse_payload = json.dumps({
    "known_lakehouses": [{"id": target_lh_id}],
    "default_lakehouse": target_lh_id,
    "default_lakehouse_name": "data",
    "default_lakehouse_workspace_id": WS_ID,
})
fab(["set", NOTEBOOK, "-q",
     "definition.parts[0].payload.metadata.dependencies.lakehouse",
     "-i", lakehouse_payload, "-f"])

The JSON path is deeply nested and not well documented. I had to inspect the API responses to find the correct path: definition.parts[0].payload.metadata.dependencies.lakehouse.


Lesson 7: Semantic Model Refresh Uses the Power BI API, Not the Fabric API

After deploying a Direct Lake semantic model, you need to trigger a refresh. But this isn’t a Fabric API call — it’s a Power BI API call:

# Note the -A powerbi flag — this targets the Power BI API endpoint
fab api -A powerbi -X post "groups/{workspace_id}/datasets/{model_id}/refreshes"

Without the -A powerbi flag, you’ll get 404s because the Fabric API doesn’t have a refresh endpoint for semantic models.
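If the deploy needs to block until the refresh finishes, the same Power BI surface exposes refresh history. A sketch, assuming a hypothetical fab_capture() helper that runs fab and returns the response body:

import json, time

# Trigger the refresh, then poll the refresh history.
# Power BI reports an in-progress refresh with status "Unknown".
fab(["api", "-A", "powerbi", "-X", "post",
     f"groups/{WS_ID}/datasets/{model_id}/refreshes"])

while True:
    body = fab_capture(["api", "-A", "powerbi",
                        f"groups/{WS_ID}/datasets/{model_id}/refreshes?$top=1"])
    status = json.loads(body)["value"][0]["status"]
    if status in ("Completed", "Failed", "Disabled"):
        break
    time.sleep(15)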


Lesson 8: Pipeline References Are Hardcoded GUIDs

A Data Pipeline that runs a notebook stores the notebook’s ID and workspace ID as hardcoded GUIDs in its definition:

{
  "typeProperties": {
    "notebookId": "da888b35-a17c-49ac-a8cf-1a5ffae91e20",
    "workspaceId": "e446a5e7-6666-42ad-a331-0bfef3187fbf"
  }
}

These are your dev GUIDs. After deploying to a different workspace, you need to update them:

target_nb_id = get_target_item_id("Notebook", "run")
fab(["set", PIPELINE, "-q",
     "definition.parts[0].payload.properties.activities[0].typeProperties.notebookId",
     "-i", target_nb_id, "-f"])
fab(["set", PIPELINE, "-q",
     "definition.parts[0].payload.properties.activities[0].typeProperties.workspaceId",
     "-i", WS_ID, "-f"])

Again, the JSON paths are deeply nested. The fab set command is your best friend for post-deploy configuration.


Lesson 9: GitHub Actions Authentication via OIDC

No stored secrets for the Fabric service principal. GitHub’s OIDC provider exchanges a federated token directly:

- name: Login to Fabric CLI
  run: |
    FED_TOKEN=$(curl -sH "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=api://AzureADTokenExchange" | jq -r '.value')
    fab auth login -t ${{ secrets.AZURE_TENANT_ID }} \
      -u ${{ secrets.AZURE_CLIENT_ID }} \
      --federated-token "$FED_TOKEN"

This means no client secrets to rotate — just configure the Azure AD app registration to trust your GitHub repo’s OIDC issuer. It works well, but you still need to set up an Azure AD app registration, configure federated credentials, and grant it Fabric permissions. It would be nice if Fabric supported direct service-to-service authentication — something like a Fabric API key or a native GitHub integration — without needing Azure as the intermediary.


Lesson 10: Use Variable Libraries for Runtime Config

Instead of baking config values into your notebook or using parameter.yml, Fabric has Variable Libraries:

# In your notebook at runtime:
import notebookutils
vl = notebookutils.variableLibrary.getLibrary("deploy_config")
download_limit = vl.download_limit

The deploy script creates/updates the variable library via the API:

fab(["api", "-X", "post", f"workspaces/{WS_ID}/variableLibraries",
"-i", json.dumps({"displayName": "deploy_config", "definition": vl_definition})])

This gives you environment-specific configuration without redeploying the notebook. Change a variable, next pipeline run picks it up.


Lesson 11: Use abfss:// Paths for OneLake — It Makes Your Notebook Portable

When reading or writing to OneLake, use the abfss:// protocol with workspace and lakehouse IDs:

workspace_id = notebookutils.runtime.context.get('currentWorkspaceId')
lakehouse_id = notebookutils.lakehouse.get('data').get('id')
root_path = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}"

This makes your notebook fully portable — the same code runs everywhere:

  • Local dev: swap to a local path or Azurite connection
  • Deployed to staging: notebookutils resolves to the staging workspace/lakehouse IDs
  • Deployed to production: same code, different IDs at runtime

The alternative — hardcoding workspace names or using /lakehouse/default/ mount paths — ties your notebook to a specific workspace. With abfss://, the notebook doesn’t care where it’s running. The IDs come from the runtime context, and the deploy script handles attaching the right Lakehouse. Zero code changes between environments.
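The same idea extends to local development with a small guard. A sketch (the local fallback path is illustrative):

try:
    import notebookutils  # only available inside Fabric
    workspace_id = notebookutils.runtime.context.get('currentWorkspaceId')
    lakehouse_id = notebookutils.lakehouse.get('data').get('id')
    root_path = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}"
except ImportError:
    root_path = "./local_lakehouse"  # local dev: point at a local folder instead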


Lesson 12: Copying Files to OneLake Is Parallel but Slow

The notebook needs supporting files (SQL models, configs) available in OneLake. The fab cp command handles this, but it’s one file at a time. I parallelized with 8 workers:

from concurrent.futures import ThreadPoolExecutor

def copy_file(f):
    rel = f.relative_to(root)
    fab(["cp", rel.as_posix(), f"{LAKEHOUSE}/Files/{rel.parent.as_posix()}/", "-f"])

with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(copy_file, files)

Before copying files, you need to create the directory structure with fab mkdir. OneLake doesn’t auto-create parent directories.
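A sketch of that pre-step, reusing the same fab() wrapper and creating directories shallowest-first so nesting works (you may need to tolerate already-exists errors):

# Collect every distinct parent directory, then create them in order
parents = {f.relative_to(root).parent for f in files}
for d in sorted(parents, key=lambda p: len(p.parts)):
    if d.as_posix() != ".":
        fab(["mkdir", f"{LAKEHOUSE}/Files/{d.as_posix()}"])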


Lesson 13: Schedule Idempotently

Don’t recreate the pipeline schedule every deploy — check first:

result = subprocess.run(["fab", "job", "run-list", PIPELINE, "--schedule"],
                        capture_output=True, text=True)
if "True" not in result.stdout:
    fab(["job", "run-sch", PIPELINE,
         "--type", "cron",
         "--interval", cfg["schedule_interval"],
         "--start", cfg["schedule_start"],
         "--end", cfg["schedule_end"],
         "--enable"])

This prevents duplicate schedules stacking up across deploys.


The Big Picture

Here’s the overall architecture in one diagram:

GitHub Push
GitHub Actions (OIDC → fab auth login)
deploy.py
├── fab create → Lakehouse (with schemas)
├── fab deploy → Notebook
├── fab set → Attach Lakehouse to Notebook
├── fab cp → Copy data files to OneLake (8 parallel workers)
├── fab job run → Execute Notebook (creates Delta tables)
├── fab deploy → Semantic Model (with GUID replacement + git restore)
├── fab api → Refresh Semantic Model (Power BI API)
├── fab deploy → Data Pipeline
├── fab set → Update Pipeline notebook/workspace refs
└── fab job run-sch → Schedule Pipeline (if not already scheduled)

Everything is driven by a single deploy_config.yml that maps branch names to workspace IDs:

defaults:
  schedule_interval: "30"
  schedule_start: "2025-01-01T00:00:00"
  schedule_end: "2030-12-31T23:59:59"

main:
  ws_id: "e446a5e7-..."
  schedule_interval: "720"   # 12 hours (staging)

production:
  ws_id: "be079b0f-..."
  download_limit: "60"       # full data

Push to main → deploy to staging workspace. Push to production → deploy to production workspace.


Lesson 14: Don’t Deploy the Lakehouse Item — Let the Data Define the Schema

I had a data.Lakehouse/ folder in fabric_items/ with a .platform file and a lakehouse.metadata.json that just set defaultSchema: dbo. I was running fab deploy for it. Then I realized: I was already creating the Lakehouse with fab create before the deploy step:

fab create "prod.Workspace/data.Lakehouse" -P enableSchemas=true

The fab create handles everything. The fab deploy of the Lakehouse item was redundant.

But there’s a deeper point here: the Lakehouse schema should be driven by your data, not by CI/CD. Your notebook creates the tables, your data transformation defines the schemas. The Lakehouse is just the container — it doesn’t need a deployment definition. Trying to manage Lakehouse schema through fab deploy is fighting the natural flow. Create the container, let the data populate it.

I deleted the entire data.Lakehouse/ folder from my repo. One less item to deploy, one less thing to break.

What I’d Tell My Past Self

  1. Read every fab CLI error message carefully. Many failures are silent (wrong key name, missing -i flag). Add verbose logging.
  2. Deploy in phases, not all at once. Item dependencies are real and the error messages when you get the order wrong are unhelpful.
  3. Skip parameter.yml for anything non-trivial. Direct GUID replacement in Python with git restore is simpler and fully transparent.
  4. fab set is the power tool. Most post-deploy configuration — attaching lakehouses, updating pipeline references — goes through deeply nested JSON paths in fab set.
  5. Test in a separate workspace mapped to a non-production branch. The deploy_config.yml pattern of mapping branches to workspaces makes this trivial.
  6. The Power BI API and Fabric API are different surfaces. Some operations (like semantic model refresh) only exist on the Power BI side. Use fab api -A powerbi.
  7. Don’t deploy what you don’t need to. If fab create handles it, drop the item definition. Let your data drive the schema.

The Fabric CLI is new — fab deploy landed in v1.5.0 just this month — and it already handles a full end-to-end deployment pipeline. The foundation is solid. Everything you need is already there — it just takes knowing where to look. Hopefully this saves you some of that discovery time.


Acknowledgements

Special thanks to Kevin Chant — Data Platform MVP and Lead BI & Analytics Architect — whose blog has been an invaluable resource on Fabric CI/CD and DevOps practices for the data platform. If you’re working with Fabric deployments, his posts are well worth following.

First Look at Incremental Framing in Power BI

TL;DR: Incremental framing is like CDC to RAM 🙂 It significantly improves the cold-run performance of Direct Lake mode in some scenarios, and there is excellent documentation that explains everything in detail.

What Is Incremental Framing?

One of the most important improvements to Direct Lake mode in Power BI is incremental framing.

Power BI’s OLAP engine, VertiPaq (probably the most widely deployed OLAP engine, though many outside the Power BI world may not know it), relies heavily on dictionaries. This works well because it is a read-only database. Another core trick is its ability to do calculations directly on encoded data. This makes it extremely efficient and embarrassingly fast (I just like this expression for some reason).


Direct Lake Breakthrough

Direct Lake’s breakthrough is that dictionary building is fast enough to be done at runtime.

Typical workflow:

  1. A user opens a report.
  2. The report generates DAX queries.
  3. These queries trigger scans against the Delta table.
  4. VertiPaq scans only the required columns.
  5. It builds a global dictionary per column, loads the data from Parquet into memory, and executes queries.

The encoding step happens once at the start, and since BI data doesn’t usually change that much, this model works well.


The Problem with Continuous Appends

In scenarios where data is appended frequently (e.g., every few minutes), the initial approach does not work very well. Each update requires rebuilding dictionaries and reloading all the data into RAM, effectively paying the cost of a cold run every time (reading from remote storage will always be slower).


How Incremental Framing Fixes This

Incremental framing solves the problem by:

  • Incrementally loading new data into RAM.
  • Encoding only what’s necessary.
  • Removing obsolete Parquet data when not needed.

This substantially improves cold-run performance. Hot-run performance remains largely unchanged.


Benchmark: Australian Electricity Market

To test this feature, I used my go-to workload: the Australian electricity market, where data is appended every 5 minutes—an ideal test case.

  • Incremental framing is on by default; I turned it off following this blog.
  • For benchmarking, I adapted an existing tool, Direct Lake load testing (I just changed it to write the results to Delta instead of CSV). I used 8 concurrent users, the main fact table is around 120 M records, and the queries reflect a typical user session. This is a real-life use case, not some theoretical benchmark.

Results

P99

P99 (the 99th percentile latency, often used to show worst-case performance):

  • Improvement of 9x–10x. Again, your results may vary depending on workload, Parquet layout, and data distribution.

P90

P90 (90th percentile latency):

  • Less dramatic but still strong.
  • Improved from 500 ms → 200 ms.
  • Faster queries also reduce capacity unit usage.

Geomean

Just for fun, and to show how fast VertiPaq is, let’s look at the geomean: it went from 11 ms to 8 ms. General-purpose OLAP engines are cool, but specialized engines are just at another level!

This Does Not Solve the Bad Table Layout Problem

This feature improves support for Delta tables with frequent appends and deletes. However, performance still degrades if you have too many small Parquet row groups.

VertiPaq does not rewrite data layouts—it reads data as-is. To maintain good performance:

  • Compact your tables regularly.
  • In my case, I backfill data nightly. The small Parquets added during the day don’t cause major issues, but I still compact every 100 files as a precaution.

If your data is produced inside Fabric, VOrder helps manage this. For external engines (Snowflake, Databricks, Delta Lake with Python), you’ll need to actively manage table layout yourself.
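If you write with plain Python, compaction is a couple of lines with the deltalake (delta-rs) package. A sketch; the table path is illustrative and auth/storage options are omitted:

from deltalake import DeltaTable

dt = DeltaTable("abfss://<workspace_id>@onelake.dfs.fabric.microsoft.com/<lakehouse_id>/Tables/dbo/fct_summary")
dt.optimize.compact()                          # merge small Parquet files / row groups
dt.vacuum(retention_hours=168, dry_run=False)  # then drop the unreferenced files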

Some Observations on Running TPCH 1 TB on Microsoft Fabric

This is not an official Microsoft benchmark, just my personal experience.

Last week, I came across a new TPCH generator written in Rust. Luckily, someone ported it to Python, which makes generating large datasets possible even with a small amount of RAM. For example, it took 2 hours and 30 minutes to generate a 1 TB scale dataset using the smallest Fabric Python notebook (2 cores and 16 GB of RAM).

Having the data handy, I tested Fabric DWH and SQL Endpoint. I also tested DuckDB as a sanity check. To be honest, I wasn’t sure what to expect.

I shared all the notebooks used and results here

I ran the test 30 times over three days, so I think I have enough data to say something useful. In this blog, I will focus only on the results for the cold and warm runs, along with some observations.

For readers unfamiliar with Fabric, DWH and SQL Endpoint refer to the same distributed SQL engine. With DWH, you ingest data that is stored as a Delta table (which can be read by any Delta reader). With SQL Endpoint, you query external Delta tables written by Spark and other writers (this is called a Lakehouse table). Both use Delta tables.

Notes:

  • All the runs are using a Python notebook
  • To send queries to DWH/SQL Endpoint, all you need is:
    conn = notebookutils.data.connect_to_artifact("data")
    conn.query("select 42")
  • I did not include the cost of ingestion for the DWH
  • The cost includes compute and storage transactions and assumes a pay-as-you-go rate of $0.18/CU(hour)
  • For extracting Capacity usage, I used this excellent blog
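For context, a timed run from the notebook is just a few lines. A sketch; the query is illustrative, and conn.query() hands back a DataFrame you can inspect:

import time
import notebookutils

conn = notebookutils.data.connect_to_artifact("data")

start = time.time()
df = conn.query("SELECT l_returnflag, COUNT(*) AS cnt FROM lineitem GROUP BY l_returnflag")
print(f"{time.time() - start:.2f}s, {len(df)} rows")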

Cold Run

  • The first-ever run on SQL Endpoint incurs an overhead; apparently the system builds statistics. This overhead happened only once across all tests.
  • Point 2 is an outlier but an interesting one 🙂
  • The number of dots displayed is less than the number of test runs, as some tests match perfectly, which is a good sign that the system is predictable!
  • vorder improves performance for both SQL Endpoint and DuckDB. The data was generated by Rust and rewritten using Spark; it seems to be worth the effort.
  • Costs are roughly the same for DWH and SQL Endpoint when the Delta is optimized by vorder, but DWH is still faster.
  • DuckDB, running in a Python notebook with 64 cores, is the cheapest (but the slowest). Query 17 did not run, so that result is moot. Still, it’s a testament to the OneLake architecture: third-party engines can perform well without any additional Microsoft integration. Lakehouse for the win.

Warm Run

  • vorder is better than vanilla Parquet.
  • DWH is faster and a bit cheaper than SQL Endpoint.
  • DuckDB behavior is a bit surprising; I was expecting better performance, considering the data is already loaded into RAM.

Impact on the Parquet Writer

I added a chart showing the impact of using different writers on read performance. I used only warm runs to remove the impact of the first-ever run, as it does not happen in the DWH (the data was ingested there).

  • Given the same table layout, DWH and SQL Endpoint perform the same, which is expected as it is the same engine.
  • Surprisingly, using the initial raw Delta table vs Spark optimized write gave more or less the same performance, at least for this particular workload.

Final Thoughts

Running the test was a very enjoyable experience, and for me, that’s the most important thing. I particularly enjoyed using Python notebooks to interact with Fabric DWH. It makes a lot of sense to combine a client-server distributed system with a lightweight client that costs very little.

There are new features coming that will make the experience working with DWH even more streamlined.

Edit :

  • Updated the figures for DuckDB, as Query 17 does run, but you need to limit the memory manually: set memory_limit='500GB'
  • Added a graph on the impact of the Parquet layout.

An Excel User’s Perspective on Lakehouse Architecture

This is more or less the industry consensus on how a Lakehouse architecture should look in 2025.

By now, it’s become clear that Parquet is the de facto standard for storing data, and using an object store to separate storage from compute makes a lot of sense.

Another interesting development is how vendors want to package this offering. Storage vendors saw an opportunity to do more—after all, there’s no law that says the metastore belongs to the data warehouse! So you get things like S3 Table and Cloudflare R2, which I think is a good thing, especially if you’re a smaller analytics vendor. Life becomes much easier when table maintenance is done upstream, allowing you to focus solely on making the query engine faster.

Encouraging things are also happening in the table format space. I know a bit about Iceberg and Delta, but not much about the others. One very interesting development is Iceberg adopting deletion vectors from Delta in the V3 spec, while Delta will require a catalog for read and write (at least for catalog-managed tables). I like to call it the “Icebergification” of Delta.

Another trend is the Delta Java writer making it easier to auto-generate Iceberg metadata, and XTable is doing the same regardless of the Delta writer. At this stage, one could argue: why do we need two table formats that are becoming virtually identical?

Data Analyst—How About Me?

These improvements mostly impact the write path, which is primarily managed by data engineers. But what about data analysts and end users?

If you have Fabric OneLake, you can use Direct Lake in OneLake mode. Marco has a great article about it. It’s a fantastic improvement compared to the initial version of Direct Lake. However, it doesn’t solve the problem if your data is hosted in an S3 table or BigQuery Iceberg table. Yes, you can create a shortcut to OneLake and read it from there, but that still depends on a data engineer setting it up.

Now imagine a world where an Excel, Tableau, or Power BI Desktop user (or any arbitrary client tool) can just point to a Lakehouse using a standard API, discover tables, read data, and build reports. Honestly, this isn’t a big ask; we already have this when connecting to databases using ODBC, and I don’t see any technical reason why we can’t have the same experience with Lakehouses.

We Already Have This API

For me, the most promising development in the Lakehouse ecosystem is the Iceberg Catalog REST API, and I genuinely hope it becomes a standard—just like ODBC is today (and hopefully ADBC in the future, but that’s another topic).

Again, speaking as a data analyst, I want my tools to support the read part of the API—just the ability to list tables and scan a table. That’s all. I have zero interest in how the data is stored or which table format is used. The catalog should be smart enough to generate metadata on the fly.

The Good News

We’re getting there—at least if you’re using a Python notebook. Here’s an example where I use the same Iceberg REST API to query a table from four different Lakehouse implementations using Daft.

from pyiceberg.catalog import load_catalog
from pyiceberg.catalog.rest import RestCatalog

def connect_catalog(cat):
  match cat:
    case 'polaris':
      catalog = load_catalog(
              'default',
              uri= polaris_endpoint,
              warehouse='dwh',
              scope = 'PRINCIPAL_ROLE:data_engineer' ,
              credential= polaris_key
            )
    case 's3':
      catalog = load_catalog(
              'default',
              **{
                "type": "rest",
                "warehouse": s3_warehouse ,
                "uri": "https://s3tables.us-east-2.amazonaws.com/iceberg",
                "rest.sigv4-enabled": "true",
                "rest.signing-name": "s3tables",
                "rest.signing-region": "us-east-2"
              }
            )
    case 'uc':
      catalog = load_catalog(
               'default',
              token = token ,
              uri = endpoint,
              warehouse = 'ne'
              )
    case 'r2':
      catalog = RestCatalog(
              name = 'default',
              token = token_r2 ,
              uri = endpoint_r2,
              warehouse = r2_warehouse
              )
  return catalog

Then, I run a standard SQL query using Daft SQL.
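Roughly like this (a sketch: the namespace and table names are illustrative, and daft.sql can reference an in-scope DataFrame by its variable name):

import daft

catalog = connect_catalog('r2')
tbl = catalog.load_table("aemo.scada")  # illustrative namespace.table
df = daft.read_iceberg(tbl)

result = daft.sql("SELECT duid, AVG(scadavalue) AS avg_mw FROM df GROUP BY duid")
result.show()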

Final Thoughts

It took Parquet a decade to become a standard. We may or may not have a single standard table format—and maybe we don’t need one. But if we want this Lakehouse vision to become mainstream, then everyone should support the Iceberg Catalog REST API, at least for read operations.