Process 1 Billion rows of raw csv in Fabric Notebook for less than 20 Cents 

The Use case

The data source is around 2,200 weird CSV files (each file has more columns than the header) with a total of 897 million rows. This is real-world data, not some synthetic dataset, and it is relatively small: around 100 GB uncompressed.

The pipeline reads those files, extracts clean data using non-trivial transformations, and saves the result as a Delta Table.

We used the smallest compute available in Fabric Notebook, which is 4 cores with 32 GB of RAM. To be clear, this is a real single node (not 1 driver and 1 executor). Although the runtime is using Spark, all the engines interact directly with the operating system; as far as I can tell, Spark adds very little overhead when it is not used directly by the Python code.

You need to pick the Engine

Nowadays we have plenty of high-quality calculation engines, but two seem to be gaining traction (Polars and DuckDB), at least judging by package downloads and the religious wars that erupt occasionally on Twitter 🙂

For a change I tried to use Polars, as I was accused of having a bias toward DuckDB. Long story short, I hit a bug with Polars. I tried DataFusion too but didn't manage to get working code; there is not enough documentation on the web. After that I tested ClickHouse chDB but found a bug there as well. Anyway, the code is public, so feel free to test your own engine.

So I ended up using DuckDB. The code is published here; it uses only 60 files, as that subset is publicly available. The whole archive is saved in my tenant (happy to share it if you are interested).

The result is rather surprising (God bless OneLake throughput). I am using the excellent Python package deltalake (delta-rs) to write to OneLake.

26 minutes. That's freaking fast. Using Fabric F2, the total cost will be:

$0.36/hour × (26/60) hours ≈ $0.156, roughly 16 cents

You need to add a couple of cents for OneLake storage transactions.

As far as I can tell, this is probably one of the cheapest options on the market.

$0.36/hour is the pay-as-you-go rate; if you have a reservation, it is substantially cheaper.

Because the output is a Delta Table, any Fabric engine (SQL, Power BI, Spark) can read it.

What's the catch?

Today DuckDB cannot write directly to a Delta Table (it is coming eventually). Instead, it hands the data to the Delta Lake writer as an Arrow Table. This is supposed to be zero-copy, but as far as I can tell it is the biggest bottleneck and will generate out-of-memory errors. The solution is easy: process the files in chunks, not all at once.

#############################################
import os, glob
from deltalake import write_deltalake   # deltalake (delta-rs) Python package

# get_scada(chunk) runs the DuckDB cleaning query over a list of CSV paths
# and returns the result as an Arrow table
list_files = [os.path.basename(x) for x in glob.glob(Source + '*.CSV')]
files_to_upload_full_Path = [Source + i for i in list_files]

if len(files_to_upload_full_Path) > 0:
    # process the files in chunks of chunk_len to keep memory usage bounded
    for i in range(0, len(files_to_upload_full_Path), chunk_len):
        chunk = files_to_upload_full_Path[i:i + chunk_len]
        df = get_scada(chunk)
        write_deltalake("/lakehouse/default/Tables/scada_duckdb", df, mode="append",
                        engine='rust', partition_by=['YEAR'],
                        storage_options={"allow_unsafe_rename": "true"})
        del df   # free the Arrow table before the next chunk

By experimentation, I noticed that 100 files per chunk works fine with 16 GB of RAM, 200 files with 32 GB, and so on.
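
If you don't want to hard-code the chunk size, a rough heuristic based on that observation (roughly 100 files per 16 GB of RAM) could look like the sketch below; psutil and the ratio are my own assumptions, not part of the original pipeline.

import psutil

# rough heuristic, assuming ~100 files can be processed per 16 GB of RAM
# (derived from the experiments above; adjust the ratio for your own data)
total_ram_gb = psutil.virtual_memory().total / 2**30
chunk_len = max(1, int(total_ram_gb / 16 * 100))
print(f"Detected {total_ram_gb:.0f} GB of RAM, using chunks of {chunk_len} files")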

When exporting to Parquet instead, DuckDB manages the memory natively, and it is faster too.
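
For comparison, here is a minimal sketch of the Parquet route, where DuckDB streams the result straight to disk and handles its own memory budget; the paths, memory limit and the placeholder SELECT (which stands in for the real cleaning query in get_scada) are illustrative, not the exact code from the repo.

import duckdb

con = duckdb.connect()
# optional: cap DuckDB's memory so it spills to disk instead of failing
con.execute("SET memory_limit = '24GB'")

# COPY streams the result straight to Parquet; DuckDB never needs to hand the
# whole result to Python, so it manages its own memory
con.execute("""
    COPY (SELECT * FROM read_csv_auto('/lakehouse/default/Files/scada/*.CSV'))
    TO '/lakehouse/default/Files/scada_raw.parquet' (FORMAT PARQUET)
""")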

Native Lakehouse Is the future of Data Engineering

The combination of open table formats like Delta and Iceberg with ultra-efficient open-source engines like DuckDB, Polars, Velox and DataFusion, all written in C++/Rust, gives data engineers extremely powerful tools to build more flexible and far cheaper data solutions.

If I had to give one piece of advice to young data engineers/analysts: learn Python and SQL.

I would like to thank Pedro Holanda for fixing some very hard to reproduce bugs in the DuckDB CSV reader.

And Ion Koutsouris for answering my silly questions about the Delta Lake writer.

Reading Snowflake Iceberg Table using Fabric OneLake

TL;DR: today you can use a Notebook; for Direct Lake mode you have to wait a bit 🙂

After a long private preview, Snowflake Iceberg support is finally available in public preview. I have a personal instance in Azure Australia East and wanted to see how easy it is to read the data using Fabric OneLake. These are just my personal observations and do not necessarily represent my employer's views.

Create Iceberg Table In Snowflake

Creating an Iceberg table in Snowflake is rather straightforward, but I particularly like the option BASE_LOCATION: because it is a human-readable name and not an auto-generated random GUID (looking at you, Lakehouse Vendor X), locating the table later in Fabric OneLake is very natural.
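
For reference, this is roughly what it looks like; the account, volume, database and table names below are placeholders I made up, and I'm running the SQL through the Python connector just to keep everything in one language.

import snowflake.connector

# placeholder credentials; use your own account and auth method
con = snowflake.connector.connect(account="myaccount", user="me", password="***")

# create a Snowflake-managed Iceberg table with a human-readable BASE_LOCATION,
# so the folder is easy to find later from the Fabric OneLake shortcut
con.cursor().execute("""
    CREATE ICEBERG TABLE lineitem_iceberg
      CATALOG         = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'azure_iceberg_volume'
      BASE_LOCATION   = 'lineitem_iceberg'
    AS SELECT * FROM snowflake_sample_data.tpch_sf10.lineitem
""")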

Use Read Only privilege for extra safety.

The Iceberg table is read-only for third-party engines, so to be extra safe, make sure the account you use to read the data from Azure Storage is read-only too.

The role selected is Storage Blob Data Reader.

Fabric Lakehouse View

I created a new Lakehouse and built a shortcut to my Azure storage account, and just like that, I can see the data.

How to read Iceberg in Fabric (Officially) 

Officially, Delta Table is the only table format supported in Fabric, but Microsoft recently announced that they joined the OneTable project, which is a kind of universal table format translator: for example, if you want to read Iceberg but your engine supports only Delta, OneTable will generate new Delta Table metadata on the fly and keep syncing new changes (the data itself is not rewritten). I tried to install it, but it involves Java, which is usually something I try to avoid. Looking forward to the official service (I don't think there is any public ETA).

Read Iceberg from a Notebook (Unofficially)

Fabric has a notebook, so literally you can read anything. Snowflake recommends using Spark with their catalog SDK, but I am not a Spark person; I am more into Python engines. Unfortunately it was not a smooth ride: I tried PyIceberg, Hyper and Databend, and they all failed for different reasons. Luckily, DuckDB works out of the box.
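
A minimal sketch of what reading the table with DuckDB can look like, using the iceberg extension against the folder exposed by the shortcut; the path and table folder name are assumptions for illustration.

import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")

# the shortcut makes the Iceberg table folder visible under the mounted Files path
# (folder name is illustrative; allow_moved_paths helps when the metadata stores
# absolute paths from the original storage location)
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('/lakehouse/default/Files/lineitem_iceberg',
                      allow_moved_paths = true)
""").show()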

Snowflake Parquet files are peculiar 

From a couple of big tables I created, Snowflake seems to generate Parquet files with a fixed size of 16 MB.

Every file has only one row group. I had a look at the Parquet metadata and it seems distinct-count statistics are generated (a lot of engines don't do this, as it is a very expensive operation).
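
You can check both observations (one row group per file, distinct-count stats) by poking at the footer with pyarrow; the file path below is just an example.

import pyarrow.parquet as pq

# path to one of the Snowflake-generated data files (illustrative)
md = pq.ParquetFile("/lakehouse/default/Files/lineitem_iceberg/data/part-00000.parquet").metadata

print(md.num_row_groups)               # 1 row group for these files
col = md.row_group(0).column(0)
print(col.statistics)                  # min/max, null_count, ...
print(col.statistics.distinct_count)   # populated here, unlike most writers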

Final Thoughts

That was just a quick look. Today you can read Iceberg using a Fabric notebook, but it will be interesting to see how Power BI Direct Lake mode behaves with such a small file size when OneTable is available (hopefully not too far in the future). I suspect a lot of users of both Power BI and Snowflake will be interested to learn more about this integration.

What is Fabric DWH Bursting 

TL;DR: trying to simulate Fabric DWH bursting using a Fabric Notebook and DuckDB.

Setup

I am using TPCH_SF100 (600 million rows) as an example; the dataset itself does not matter, you can use any dataset as long as you keep the same queries.

Fabric compute is rather serverless; you can't customize it, except for the Notebook, where we can choose the number of CPUs.

The test is very simple: run the same queries using a Fabric Notebook with different compute sizes (4, 8, 16 and 32 compute units) and save the results in OneLake.

To be very clear, the Fabric SQL engine is totally different from DuckDB: it scales by adding more nodes, while DuckDB is single-node and scales by adding more CPUs to the same node, but I think the principle is the same.
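
To make the test concrete, here is a rough sketch of the loop I have in mind: the same queries run with DuckDB capped at a different thread count, standing in for the different notebook sizes (in reality each size was a separate notebook session; the query text and path are placeholders).

import duckdb, time

# placeholder query set: in the real test these are the TPCH queries
queries = {"q01": "SELECT l_returnflag, count(*) FROM lineitem GROUP BY 1"}

results = []
for threads in (4, 8, 16, 32):           # stand-ins for the 4/8/16/32 CU notebooks
    con = duckdb.connect()
    con.execute(f"SET threads = {threads}")
    # illustrative path to the TPCH_SF100 parquet files on OneLake
    con.execute("""CREATE VIEW lineitem AS
                   SELECT * FROM read_parquet('/lakehouse/default/Files/tpch_sf100/lineitem/*.parquet')""")
    for name, sql in queries.items():
        start = time.time()
        con.execute(sql).fetchall()
        results.append({"threads": threads, "query": name, "seconds": round(time.time() - start, 1)})
    con.close()

print(results)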

Scale is not linear.

  • When increasing from 4 CU to 8 CU, the duration decreased by 50%
  • From 8 CU to 16 CU, the duration decreased by 24%
  • From 16 CU to 32 CU we reached a plateau; the duration stayed the same (maybe OneLake throughput became the bottleneck)

I think it is fair to say that regardless of the DB used, at a certain point adding more CPUs has diminishing returns.

Let’s now show the same graph but with $ values

Personally I would pick CU = 8 as the optimal configuration: the same cost as CU = 4 and half the duration. That's what I call manual bursting 🙂

Bursting in Fabric DWH is an automatic algorithm  to *predict* the optimum compute unit for a particular workload.

I don't know how Fabric DWH increases resource allocation, but I suspect it just adds (and removes) nodes.

What's not clear, though, is how the system decides what the best option is. Using the previous example, one could argue that for interactive queries CU = 16 is the best option: yes, you consume far more resources, but it is faster. For batch ETL jobs, on the other hand, the only thing that matters is the lowest cost.

A smaller SKU can do very big things

One of the most interesting implications of bursting is that a smaller SKU like F2 can handle very big workloads. F2 gives you 2 CU, but using the previous dataset as an example, it can scale just fine to 16 CU. There is no free lunch, though: the total compute you are entitled to is still 2 CU × 24 hours = 48 CU(h).

The challenge 

Cost-based optimization in databases is a very complex problem, and adding dynamic resource allocation makes it even harder; otherwise everyone would be doing it 🙂

For me personally, the SQL engine so far seems to behave as expected; I did not see anything particularly extreme. Some queries used as little as 2 CU for smaller workloads, and for a TB-scale dataset bursting went up to 72 CU.

What matters is continuous improvements.

I think what matters most is that the efficiency keeps improving. Today a Fabric DWH TPCH_SF100 cold run finishes in around 2 minutes for 6 cents (pretty much the same performance as Cloud DWH X), which is great, but my hope is that in less than 5 years the same workload will finish in less than 60 seconds for 3 cents.

And that's the next big thing for Fabric: just efficiency improvements and bug fixes, nothing fancy.

Fabric Notebook Virtual filesystem

TL;DR: Fabric notebook uses a very nice piece of functionality that mounts a remote Azure container as a "fake" local filesystem. This works at the OS level, so for a lot of programs it is just another folder. You can learn more details here.

How it works 

When you mount a Fabric Lakehouse, you get something like this:

In the Files section, you can store any files. The Tables section is special: if you create a Delta Table there, Fabric will automatically expose it to all Fabric engines (SQL, Vertipaq, etc). The data is stored in an Azure storage container.

When you run the command !df -h in the notebook, this is what you get:

When a program sends a command to delete, read, write, etc., the system automatically translates it to an Azure Storage API call. It is not perfect, and it is still not a real local filesystem, but so far it works well, especially for reads. I tested with 5 different SQL engines and they all work fine (DuckDB, DataFusion, Databend, GlareDB and Hyper), although the performance varied; I noticed that Hyper has some serious performance issues, but I understand it may not be a priority for them to fix 🙂
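
Because the mount is exposed at the OS level, the engines don't need any special connector; the Lakehouse is just a path. A tiny illustration (file names are made up):

import os
import duckdb

# the mounted Lakehouse looks like an ordinary local folder
print(os.listdir("/lakehouse/default/Files"))

# so any engine that can read local files works unchanged
duckdb.sql("SELECT count(*) FROM read_parquet('/lakehouse/default/Files/scada/*.parquet')").show()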

Writing DuckDB native file format

This one was very confusing to me when I used it the first time. My understanding was that files in remote storage are immutable: you can create new files or delete them, but not modify an existing one. Yet somehow it works.

Opening a remote DuckDB file in write mode (the file is 25 GB):

35 seconds is still slow; I thought it was supposed to read only some metadata!! I think that's a DuckDB limitation.

Then I deleted some random rows from a table: it took 9 seconds to delete 85 million records (that is amazing performance).

Then I ran a checkpoint and it worked fine.
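
For reference, the whole experiment boils down to something like this sketch; the file name, table and delete predicate are placeholders for whatever I actually deleted.

import duckdb

# open a DuckDB database file that lives on the mounted OneLake path, in write mode
con = duckdb.connect("/lakehouse/default/Files/scada.duckdb")   # illustrative path

# delete a chunk of rows (in my test: ~85 million records in about 9 seconds)
con.execute("DELETE FROM scada WHERE unit_id = 42")             # placeholder predicate

# force the WAL to be merged back into the database file on OneLake
con.execute("CHECKPOINT")
con.close()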

BlobFuse2 will be a game changer 

Currently the Fabric notebook runtime uses BlobFuse v1 for OneLake, which as far as I can tell does not support any caching. Although in my experience the throughput is rather good, it is still an object store and will never reach the speed of an SSD. BlobFuse2 may well be the best thing to happen to Fabric Notebook: it has a native disk cache and it works at the OS level, so every program gets a free cache. I hope the product team upgrades soon.