What is Fabric DWH Bursting 

TL;DR: trying to simulate Fabric DWH bursting using a Fabric Notebook and DuckDB.

Setup

I am using TPCH_SF100 (600M rows) as an example; the dataset itself does not matter, you can use any dataset as long as you keep the same queries.

Fabric compute is essentially serverless and you can't customize it, except for Notebooks, where you can choose the number of CPUs.

The test is very simple: run the same queries in a Fabric Notebook with different compute sizes (4, 8, 16 and 32 Compute Units) and save the results to OneLake.

To be very clear, the Fabric SQL Engine is totally different from DuckDB: it scales by adding more nodes, while DuckDB is single-node and scales by adding more CPUs to the same node. Still, I think the principle is the same.
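For reference, here is a minimal sketch of what one run of the test looks like with DuckDB; the path and the query are placeholders, and the thread count stands in for the notebook compute size:

import time
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8")  # stand-in for the notebook "compute size"

# Placeholder path: point it at the TPC-H parquet files in the mounted Lakehouse
con.execute("""
    CREATE VIEW lineitem AS
    SELECT * FROM parquet_scan('/lakehouse/default/Files/100/lineitem/*.parquet')
""")

# Placeholder query: in the real test the full TPC-H query set is used
query = "SELECT l_returnflag, l_linestatus, sum(l_quantity) FROM lineitem GROUP BY 1, 2"

start = time.time()
con.execute(query).fetchall()
print(f"duration: {time.time() - start:.1f} s")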

Scaling is not linear.

  • When increasing CU from 4 to 8, the duration decreased by 50 %
  • From 8 CU to 16, the duration decreased by 24 %
  • From 16 to 32, we reached a plateau; the duration stayed the same (maybe OneLake throughput became the bottleneck)

I think it is fair to say that, regardless of the DB used, at a certain point adding more CPUs gives diminishing returns.

Let’s now show the same graph but with $ values
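As a rough sketch of how those $ values fall out (assuming a baseline duration of 1.0 at 4 CU and the reductions listed above), cost is simply CU × duration:

# Relative duration per compute size, derived from the reductions above
durations = {4: 1.0, 8: 0.5, 16: 0.5 * (1 - 0.24), 32: 0.5 * (1 - 0.24)}

for cu, duration in durations.items():
    # Relative cost in "CU-hours": more CUs only pay off if duration drops proportionally
    print(f"{cu:>2} CU -> duration {duration:.2f}, cost {cu * duration:.2f}")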

Personally, I would pick CU = 8 as the optimal solution: same cost as CU = 4 and half the duration. That's what I call manual bursting 🙂

Bursting in Fabric DWH is an automatic algorithm to *predict* the optimum number of compute units for a particular workload.

I don't know how Fabric DWH increases resource allocation, but I suspect it just adds (and removes) nodes.

What's not clear, though, is how the system decides what the best solution is. Using the previous example, one can argue that for interactive queries CU = 16 is the best option: yes, you consume way more resources, but it is faster. For batch ETL jobs, on the other hand, the only thing that matters is the lowest cost.

Smaller SKUs can do very big things

One of the most interesting implications of bursting is that a smaller SKU like F2 can handle very big workloads. F2 gives you 2 CU, but using the previous dataset as an example, it can scale just fine to 16 CU. There is no free lunch though: the total compute you are entitled to per day is still 2 CU × 24 hours = 48 CU(h).
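A quick back-of-the-envelope sketch of that budget (the 30-minute burst is a made-up example): bursting does not change what you are entitled to, only how fast you spend it.

# F2 daily entitlement vs a hypothetical bursted run
entitlement_cu_h = 2 * 24            # 2 CU around the clock = 48 CU(h) per day

burst_cu = 16                        # the capacity bursts to 16 CU ...
burst_duration_h = 0.5               # ... for a made-up 30-minute batch of queries
consumed_cu_h = burst_cu * burst_duration_h

print(f"daily entitlement: {entitlement_cu_h} CU(h)")
print(f"bursted run      : {consumed_cu_h} CU(h) ({consumed_cu_h / entitlement_cu_h:.0%} of the daily budget)")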

The challenge 

Cost-based optimization in databases is a very complex problem, and adding dynamic resource allocation makes it even harder, otherwise everyone would be doing it 🙂

For me personally, the SQL Engine so far seems to behave as expected, and I did not see anything particularly extreme: some queries used as little as 2 CU for smaller workloads, and for a TB-scale dataset bursting went up to 72 CU.

What matters is continuous improvement.

I think what matters the most is that efficiency keeps improving. Today a Fabric DWH TPCH_SF100 cold run finishes in around 2 minutes for 6 cents (pretty much the same performance as Cloud DWH X), which is great, but my hope is that in less than 5 years the same workload will finish in less than 60 seconds for 3 cents.

And that's the next big thing for Fabric: just efficiency improvements and bug fixes, nothing fancy.

Loading Delta Table to Fabric OneLake using Delta Rust

Edit: writing to Delta is now substantially easier, and you don't need to pass credentials:

from deltalake.writer import write_deltalake

write_deltalake("abfss://xxxxx@onelake.dfs.xx", df, mode='overwrite')

TL;DR: you can load data into the Fabric managed table area using the Python package delta-rs (internally written in Rust). It is very handy when Spark is not the appropriate solution, for example in cloud functions where resources are very limited.

Load Files to Delta Tables using abfss

In this code, we use a PyArrow Dataset to list multiple folders and load them as Delta Tables using Delta Rust. Because we are using a URL, we can save the data anywhere in any workspace we have write access to. Note that as of today PyArrow does not support OneLake URLs yet, so you need to read the source files from a mounted Lakehouse.

import pyarrow.dataset as ds
from deltalake.writer import write_deltalake

# Token from the Fabric notebook session, passed to delta-rs as storage options
aadToken = notebookutils.credentials.getToken('storage')
storage_options = {"bearer_token": aadToken, "use_fabric_endpoint": "true"}

sf = 100
rowgroup = 2_000_000               # rows per parquet row group
nbr_rowgroup_file = 8 * rowgroup   # rows per parquet file (8 row groups)

for tbl in ['lineitem', 'nation', 'region', 'customer', 'supplier', 'orders', 'part', 'partsupp']:
    print(tbl)
    # Read the source parquet files from the mounted Lakehouse
    dataset = ds.dataset(f'/lakehouse/default/Files/{sf}/{tbl}', format="parquet")
    # Write to the managed Tables area of any Lakehouse via its abfss URL
    write_deltalake(
        f"abfss://xxxxx@onelake.dfs.fabric.microsoft.com/yy.Lakehouse/Tables/{tbl}",
        dataset,
        mode='overwrite',
        overwrite_schema=True,
        max_rows_per_file=nbr_rowgroup_file,
        min_rows_per_group=rowgroup,
        max_rows_per_group=rowgroup,
        storage_options=storage_options,
    )

Load Files to Delta Tables using mounted Lakehouse

Alternatively, if you mount a Lakehouse, you can use something like this:

write_deltalake(f"/lakehouse/default/Tables/name",df,engine='rust',mode="append",  storage_options={"allow_unsafe_rename":"true"})

Note: the row group size is just for information; that's a whole different topic for another day 🙂

I used it to load 38 GB of compressed Parquet files and it worked fine; both the SQL Engine and Vertipaq ran queries without problems.

There are still some Bugs

Because Fabric's in-house engines (Vertipaq, SQL, etc.) are tested using Delta Tables generated by first-party tools (Spark, Dataflow Gen2), you may find some incompatibility issues when using Delta Rust. Personally, I found that Vertipaq does not support RLE_DICTIONARY encoding, and I hit some issues with the SQL Engine that I can't easily reproduce, although the same tables work fine with Fabric Spark (I did report the bug).
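If you want to check which encodings delta-rs actually produced, you can inspect the Parquet metadata with PyArrow; a small sketch (the file path is a placeholder):

import pyarrow.parquet as pq

# Placeholder path: pick any parquet file written by delta-rs
pf = pq.ParquetFile("/lakehouse/default/Tables/lineitem/part-00001.parquet")
rg = pf.metadata.row_group(0)

# Print the encodings used for each column, e.g. ('PLAIN', 'RLE_DICTIONARY', ...)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings)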

How about V-Order?

V-Order is a Microsoft proprietary technology, so it can't be used here. From my testing, V-Order produces better-compressed Parquet files by altering the sort order of columns, so unfortunately if you use Delta Rust you lose that.

Final Thoughts

One of the first questions about Fabric I got from a skeptical audience was: how open is it? To be honest I had my doubts too, but it turns out it is just Azure Storage with a custom URL, and it is just Delta Tables under the hood.

I would like to thank Josh and Matt for adding this feature to Delta Rust.

Fabric: Engines Resource Consumption

TL;DR: although all Fabric engines can use the same storage, the usage cost of every engine is different, and contrary to common wisdom, the Power BI engine (Vertipaq) is not the most expensive.

Getting average resource consumption

In the Fabric Capacity Metrics app, go to the hidden tab "TT Operation Data".

Copy it to a new tab, and you will get something like this

The table Operationsnames is hidden in the model, so I can't build a customized report; anyway, just for the purposes of this blog, I will export to Excel (God bless Excel).

The compute unit is simply CU(s)/Duration, and to get a feeling for the numbers, let's convert CU, which is basically a virtual unit, into dollar figures (for demonstration purposes only). P1 is a committed resource: regardless of usage, you pay a fixed $5,000/month and you get 64 CU.

So 1 CU = $5,000 / 30 days / 24 hours / 64 CU ≈ $0.11/hour.
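As a small sketch of that conversion (the 1,200 CU(s) operation is a made-up example):

# Back-of-the-envelope conversion, assuming P1 pricing of $5,000/month for 64 CU
p1_per_month = 5000
cu_hour = p1_per_month / 30 / 24 / 64      # ≈ $0.11 per CU-hour
cu_second = cu_hour / 3600

print(f"1 CU-hour ≈ ${cu_hour:.2f}")

# Example: an operation reporting 1,200 CU(s) in the capacity metrics app
print(f"1,200 CU(s) ≈ ${1200 * cu_second:.4f}")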

An F SKU, which is a Fabric capacity bought from Azure, is more expensive, but it has zero commitment and you can suspend it anytime.

Just to show that the resource assignment is not fixed, here is an example from our production instance (I have no idea yet what "High Scale Dataflow Compute" is).

Can we configure resource assignments?

No, you can't; Fabric assigns the necessary resources automatically. The only exception is Spark.

Do all engines have the same efficiency?

That's a tricky question and, to be honest, it does not matter much in practice: most of the engines don't really overlap, so it is not a useful comparison. But we have an interesting exception: Fabric Notebooks and Dataflow Gen2 can both be used for an ETL job. I think that will be an interesting hot topic 🙂

Still, you did not answer which engine is cheaper?

I don't know; build a proof of concept with a realistic workload and see for yourself. Personally, after using Fabric for the last 5 months, I can say the following:

  • Notebooks can be very cheap; just assign a single-node compute.
  • The SQL Engine is very competitive.
  • Dataflow Gen2 is probably the least mature offering in Fabric and can act in some unexpected ways.
  • Orchestration using Pipelines is virtually free 🙂

GlareDB Storage Format

I was messing around with GlareDB, which is part of the new wave of OLAP DB systems (along with DuckDB, DataFusion and Databend); it is open source and based on DataFusion. This blog is not a review, just a small demo of an interesting design decision by the GlareDB developers: instead of building a new storage system from scratch, they just used Delta Tables. Basically, they assembled a database from existing components, apparently all glued together using Apache Arrow!

Write Data using GlareDB

I am using a Fabric Notebook here. Currently writing to the Lakehouse does not work (I opened a bug report, hopefully it is an easy fix), so instead I will just use the local folder.

Let’s create a small dataframe using Pandas 

GlareDB can query Pandas DF directly
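Roughly, the pattern looks like this; treat it as a sketch only, since the local database path and the exact GlareDB Python calls here are my assumptions:

import glaredb
import pandas as pd

# Assumption: connecting with a path creates/opens a local GlareDB database folder
con = glaredb.connect("./glaredb_demo")

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# GlareDB can reference the Pandas dataframe by its variable name
print(con.sql("SELECT count(*) FROM df").to_pandas())

# Persist it as a table; under the hood GlareDB stores it as a Delta table
con.execute("CREATE TABLE demo AS SELECT * FROM df")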

Then I can see the list of files 

Read Data using Polars

Now the interesting part: because it is an open table format, I can use another engine to read the data. Let's try Polars for a change 🙂
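For example, something along these lines (the path is illustrative; it would be whatever table folder GlareDB created locally, such as the tables/20001 one mentioned below):

import polars as pl

# Illustrative path: point it at the Delta table folder GlareDB created
df = pl.read_delta("./glaredb_demo/tables/20001")
print(df)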

It would be nice, though, to have a query that returns all tables with their paths, as tables/20001 does not mean much.

So should DB vendors stop innovating in storage formats?

I have to admit I changed my mind on this subject. I used to think query engine developers should design the best format to serve their engine, but after using Fabric for a couple of months, an open table format is just too convenient. My current thinking: cold storage makes a lot of sense as a standard table format (Delta, Iceberg, Hudi, etc.), and the optimization can be done downstream, for example table statistics or in-memory representations of the data. There are plenty of areas where DB vendors can differentiate their offerings, but cold storage is really the common denominator.

One thing I do like about Delta is the relative paths: you can move the folder around and the data just keeps working. In the current example, I moved the folder to my desktop and it still works. Iceberg is a bit trickier, as it does not support relative paths yet.