What is the Fastest Engine to Sort Small Data in a Fabric Notebook?

TL;DR: using a Fabric Python notebook to sort and save Parquet files up to 100 GB shows that DuckDB is very competitive compared to Spark, even when using only half the resources available in the compute pool.

Introduction:

In Fabric, the minimum Spark compute you can provision is 2 nodes: 1 driver and 1 executor. My understanding, and I am not an expert by any means, is that the driver plans the work and the executor does the actual work. But if you run any non-Spark code, it runs on the driver only; DuckDB basically uses just the driver, while the executor sits there idle and you still pay for it.
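To see this for yourself, here is a minimal sketch (not from the original notebook), assuming the built-in spark session that Fabric notebooks expose: plain Python, and therefore DuckDB, only sees the driver's cores, while Spark can also schedule across the executor.

```python
import os

# Cores visible to plain Python code, i.e. what DuckDB gets: the driver only.
print("Driver cores available to DuckDB:", os.cpu_count())

# Spark can additionally use the executor node.
print("Spark default parallelism:", spark.sparkContext.defaultParallelism)
```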

The experiment is basically: generate the Lineitem table from the TPC-H dataset as a folder of Parquet files, sort it on a date field, then save it. Pre-sorting the data on a field used for filtering is a very well known technique.
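The generation code isn't shown here, but as a rough sketch, DuckDB's tpch extension can produce Lineitem and export it to a folder of Parquet files; the scale factor and output path below are placeholders, and /lakehouse/default assumes a default Lakehouse is attached to the notebook.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL tpch")
con.sql("LOAD tpch")
con.sql("CALL dbgen(sf=10)")  # placeholder scale factor; a larger sf means more Lineitem rows

# Export lineitem as a folder of Parquet files (one file per thread).
con.sql("""
    COPY lineitem TO '/lakehouse/default/Files/lineitem'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE)
""")
```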

Create a Workspace

When doing a POC, it is always better to start in a new workspace; at the end you can delete it and that removes all the artifacts inside it. Use any name you want.

Create a Lakehouse

Click New, then Lakehouse, and choose any name.

You will get an empty lakehouse (it is just a storage bucket with two folders, Files and Tables).

Load the Python Code

The notebook is straightforward: install DuckDB, create the data files if they don't exist already, then sort the data and save it as a Delta table using both DuckDB and Spark.
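I won't reproduce the notebook cells here, but a minimal sketch of the two code paths could look like the following; paths and table names are placeholders, and since DuckDB does not write Delta itself, its sorted result is handed to the delta-rs writer.

```python
import duckdb
from deltalake import write_deltalake

# DuckDB: read the Parquet folder, sort on the date field, hand the result to delta-rs.
sorted_tbl = duckdb.sql("""
    SELECT * FROM '/lakehouse/default/Files/lineitem/*.parquet'
    ORDER BY l_shipdate
""").arrow()
write_deltalake("/lakehouse/default/Tables/lineitem_duckdb", sorted_tbl, mode="overwrite")

# Spark: same sort-and-save using the built-in session.
(spark.read.parquet("Files/lineitem")
      .sort("l_shipdate")
      .write.format("delta")
      .mode("overwrite")
      .saveAsTable("lineitem_spark"))
```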

Define Spark Pool Size

By default the notebook comes with a starter pool that is warm and ready to be used; the startup in my experience is always less than 10 seconds. But it is a managed service and I can't control the number of nodes, so instead we will use a custom pool where you can choose the size of the compute and the number of nodes, in our case 1 driver and 1 executor. The startup is not bad at all; it is consistently less than 3 minutes.

Schedule the Notebook

I don't know how to pass a parameter to change the initial value in the pipeline, so I run it using a random number generator. I am sure there is a better way, but anyway, it works, and every run inserts the results.
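For what it's worth, the trick can be as simple as the sketch below; the list of sizes is made up for illustration and is not necessarily the set used in the actual runs.

```python
import random

# Pick a Lineitem size at random for this scheduled run, since I could not
# find how to pass it as a pipeline parameter.
rows_in_millions = random.choice([60, 120, 180, 240, 300, 360, 600])  # illustrative values only
print(f"This run will generate and sort {rows_in_millions} million rows")
```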

The Results

The charts show the resource usage by data size, where CPU(s) = duration × number of cores × 2 (the factor of 2 accounts for the two nodes, since both are billed).
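A trivial helper makes the metric concrete; the numbers in the example are illustrative only, not the measured results.

```python
def cpu_seconds(duration_s: float, cores_per_node: int, nodes: int = 2) -> float:
    """Billed CPU seconds: wall-clock duration x cores per node x number of nodes."""
    return duration_s * cores_per_node * nodes

print(cpu_seconds(120, 32))  # a hypothetical Spark run on 2 x 32-core nodes
print(cpu_seconds(150, 32))  # a hypothetical DuckDB run, still billed for both nodes
```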

Up to 300 million rows, DuckDB is more efficient even though it is using only half the resources.

To make it clearer, I built another chart that shows the engine combination with the lowest resource utilization by Lineitem size.

From 360 million rows, Spark becomes more economical (with the caveat that DuckDB is using just half the resources), or maybe DuckDB is not using the whole 32 cores?

Let's filter to only DuckDB

DuckDB using 64 cores is not very efficient for data of this size.

Parting Thoughts

  • Adding more resources to a problem does not necessarily make it the optimal solution; you get a shorter duration but it costs way more.
  • DuckDB's performance even when using half the compute is very intriguing!!!
  • Fabric custom pools are a very fine solution; waiting around 2 minutes is worth it.
  • I am no Spark expert, but it would be handy to be able to configure a smaller executor compute at runtime; in that case, DuckDB would be the cheaper option for all sizes up to 100 GB and maybe more.

ACID Transactions in Fabric?

TL;DR: the Fabric Lakehouse doesn't support ACID transactions, as the bucket is not locked, and that is a very good thing; if you want ACID, then use the DWH offering.

Introduction:

When you read the decision tree comparing Fabric DWH and Lakehouse, you may get the impression that, in terms of ACID support, the main difference is the support for multi-table transactions in the DWH.

The reality is that the Lakehouse in Fabric is totally open and there is no consistency guarantee at all, even for a single table. All you need is the proper access to the workspace, which gives you access to the storage layer; then you can simply delete the folder and mess with any Delta table. That's not a bad thing: it gives you maximum freedom to do things like writing data using different engines or uploading directly to the "managed" Tables area.

Managed vs unmanaged Tables

In Fabric, the only managed tables are the ones maintained by the DWH. I tried to delete a file using Microsoft Azure Storage Explorer and, to my delight, it was denied.

I have to say, it still feels weird to look at a DWH Storage, it is like watching something we are not supposed to see 🙂

Ok what’s this OneSecurity thing

It is not available yet, so no idea, but surely it will be some sort of catalog; I just hope it will not be Java-based, and I am pretty sure it is read-only.

How About the Hive Metastore in the Lakehouse

It seems the Hive metastore acts like a background service: when you open the Lakehouse interface, it scans the "/Tables" section of the Azure cloud storage and detects whether there is a valid Delta table. I don't think there is a public HMS endpoint.
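You can mimic that check yourself from a notebook: a folder under Tables counts as valid only when it contains a readable _delta_log. A minimal sketch, assuming the default Lakehouse is mounted at /lakehouse/default:

```python
import os
from deltalake import DeltaTable

tables_root = "/lakehouse/default/Tables"
for name in os.listdir(tables_root):
    path = os.path.join(tables_root, name)
    try:
        dt = DeltaTable(path)  # only succeeds when a valid _delta_log is present
        print(f"{name}: valid Delta table, version {dt.version()}")
    except Exception:
        print(f"{name}: not a valid Delta table")
```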

What’s the Implication

So you have a few options:

  • Ingest data using the Fabric DWH: it is ACID and very fast, as it uses dedicated compute.

I have done a test and imported the data for TPCH_SF100, which is around 100 GB uncompressed, in 2 minutes!!! That's very fast.

It also seems table stats are created at ingestion time, but you pay for the DWH usage. Don't forget the data is still in the Delta table format, so it can be read by any compatible engine (when reading directly from the storage).

  • Use Lakehouse tables: Fabric has great support for reading LH tables (stats are computed at runtime though), but you have to maintain the data consistency yourself: run VACUUM (see the sketch after this list), make sure no multiple writers run concurrently, and that no one is messing with your Azure Storage bucket. Still, it can be extremely cheap, as any compatible Delta table writer is accepted.
  • Use shortcuts from Azure Storage: you can literally build all your data pipelines outside of Fabric and just link them, see an example here.
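For the maintenance part of the Lakehouse option, the housekeeping is just standard Delta calls; a hedged sketch using the Spark Delta API, where the table name and retention window are placeholders:

```python
from delta.tables import DeltaTable

# Remove files no longer referenced by the Delta log.
# 168 hours (7 days) is the default retention; going lower requires disabling
# spark.databricks.delta.retentionDurationCheck.enabled.
dt = DeltaTable.forName(spark, "lineitem")
dt.vacuum(168)

# Optionally compact small files while you are at it.
spark.sql("OPTIMIZE lineitem")
```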

It is up to you; it is a classic managed vs unmanaged situation. I suspect the cost of the Fabric DWH will push it either way; personally I don't think maintaining DB tables yourself is a very good idea. But remember, either way it is still an open table format.

Edit: 16 June 2023

Assuming you don't modify the Delta table files directly in OneLake and use only Spark code, multiple optimistic writers to the same table is a supported scenario. Previously I had an issue with that setup, but it was in S3, and it seems Azure Storage is different. Anyway, this is from the horse's mouth 🙂

Create Delta Table in Azure Storage using Python and serve it with Direct Lake Mode in Fabric.

TL;DR: although the optimal file size and row group size specs are not published by Microsoft, PowerBI Direct Lake mode works just fine with Delta tables generated by non-Microsoft tools, and that's the whole point of a Lakehouse.

Edit: added an example of how to write directly to OneLake.

Quick How to 

You can download the Python script here; it uses the open-source Delta Lake writer written in Rust with a Python binding (it does not require Spark).
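I won't copy the full script, but the core of it with the delta-rs Python binding looks roughly like this; the storage account, container, credentials and paths are placeholders, and the storage_options keys shown are one accepted form in delta-rs (check the docs for your version).

```python
import pyarrow.parquet as pq
from deltalake import write_deltalake

# Any pyarrow Table or pandas DataFrame works as input.
df = pq.read_table("lineitem.parquet")

# Placeholder credentials: replace with your own storage account details.
storage_options = {
    "account_name": "mystorageaccount",
    "account_key": "<key>",
}

write_deltalake(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/delta/lineitem",
    df,
    mode="overwrite",
    # max_rows_per_group=1_000_000,  # row-group size can be tuned here; the "optimal" value isn't published
    storage_options=storage_options,
)
```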

Currently, writing directly to OneLake using the Python writer is not supported; there is an open bug which you can upvote if it is something useful to you.

The interesting part of the code is this line

You can append to or overwrite a Delta table; deleting a specific partition is supported too, and merge and row-level deletes are planned.
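Concretely, switching modes and rewriting a single partition look like this with the same writer; again just a sketch, with a dummy table, dummy rows and a placeholder partition column.

```python
import pyarrow as pa
from deltalake import write_deltalake

table_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/delta/sales"  # placeholder
storage_options = {"account_name": "mystorageaccount", "account_key": "<key>"}         # placeholder
new_rows = pa.table({"year": ["2023"], "amount": [42.0]})                              # dummy data

# Append new rows to an existing table.
write_deltalake(table_path, new_rows, mode="append", storage_options=storage_options)

# Rewrite only the 2023 partition instead of the whole table.
write_deltalake(
    table_path,
    new_rows,
    mode="overwrite",
    partition_by=["year"],
    partition_filters=[("year", "=", "2023")],
    storage_options=storage_options,
)
```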

The Magic of OneLake Shortcut

The idea is very simple: run a Python script which creates a Delta table in Azure Storage, then make it visible to Fabric using a shortcut, which is like an external table; no data is copied.

The whole experience was really joyful 🙂 the Lakehouse discovered the table and showed a preview.

Just nitpicking, but it would be really nice if the tables coming from a shortcut had a different color; the icon is not very obvious.

If you prefer SQL, then it is just there

Once it is in the lake, it is visible to PowerBI automatically. I just had to add a relationship. I did not use the default dataset because, for some reason, it did not refresh correctly; that's a missed opportunity, as this is a case where the default dataset makes perfect sense.

But Why ?

You are probably asking why: in Fabric we already have Spark and Dataflow Gen2 for this kind of scenario, which I used already 🙂 you can download the PowerQuery and PySpark scripts here.

So what's the best tool? I think how much compute each solution consumes from your capacity would be a good factor in deciding which one is more appropriate; today we can't see the usage for Dataflow Gen2, so we can't tell.

Actually, I was just messing around; what people will use is the simplest solution with the least friction, and a good candidate is Dataflow Gen2. That's the whole point of Fabric: you pay for convenience. Still, I would love to have a fully managed single-node Python experience.

Importing Delta table from OneLake to PowerBI

A very quick blog on how to import a Delta table into PowerBI Desktop. Probably your first question is why? Doesn't the new Direct Lake mode solve that?

  • Direct Lake assumes the compute is running all the time, which does not make a lot of sense if you want to reduce your Fabric bill and you are using the pay-as-you-go model.
  • Import mode has been battle tested since 2009; Direct Lake is still very early tech and will take time to mature.
  • That's a side effect of using Delta tables as the tech for the OneLake lakehouse: it does not require a running catalog. The One Security model will require a running compute of some sort; we don't know the details, but that's a discussion for another day.

Get the URL

Sandeep has a great blog, go and read it

Import Dataflow to your PowerBI Workspace

I am using the code from this blog (I know, I just copied other people's ideas, but at least I give credit). To make it easier, I created a simple Dataflow where all you need to do is import the json into your PowerBI workspace and change the parameter for the OneLake URL.

WARNING: don't run this from your desktop, you will be charged egress fees.

You can download the json here, create a new dataflow, and choose this option.

Ignore the authentication error; first change the OneLake URL to something like this:
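For reference, and purely as an illustration since your workspace and lakehouse names will differ, a OneLake table URL generally follows this pattern:

```
https://onelake.dfs.fabric.microsoft.com/<WorkspaceName>/<LakehouseName>.Lakehouse/Tables/<TableName>
```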

That’s all, now go and turn off that F2 instance, you are welcome 🙂