Building a cost-effective solution using Fabric

I have been playing with Electricity Market Data recently, and I now have a #Fabric solution that I think is easy enough to use but, more importantly, cost effective: it works just fine on an F2 SKU, which costs $156/month.

https://github.com/djouallah/aemo_fabric

A quick video

How-to

0- Create a Fabric Workspace

1- Create a lakehouse

2- Download the notebooks from GitHub and import them into the Fabric workspace

3- Open a notebook and attach it to the lakehouse

4- Run the notebooks in sequence to do the initial load

5- Build your semantic model, either using Direct Lake for better “latency”, or using the attached template for import mode if you are happy with 8 refreshes per day (for people with a Pro license); all you need is to enter the lakehouse SQL endpoint. My initial plan was to read the Delta tables directly from OneLake, but currently filter pushdown works only at the partition level (apparently more is coming)

6- Use a scheduler to run the jobs, every 5 minutes and every 24 hours

Show me the Money

For development use the Starter Pool, but for scheduling use a Small single node; it is good enough. Medium is faster but will consume more capacity units, still within F2 limits.

This is the actual capacity unit usage for this scenario; yesterday I was messing around trying to optimize further, which ended up consuming more 🙂

The Spark DataFrame API is amazing

When reading the CSV files, I need to convert a lot of columns to double. Using SQL, you have to manually type all the field names; in Spark, you can just do it in a loop. That was the exact moment I *understood* why people like the Spark API!

from pyspark.sql import functions as f

# cast every column to double, except the known non-numeric ones
df_cols = list(set(df.columns) - {'SETTLEMENTDATE', 'DUID', 'file', 'UNIT', 'transactionId', 'PRIORITY'})
for col_name in df_cols:
    df = df.withColumn(col_name, f.col(col_name).cast('double'))

The source data is around 1 billion rows of dirty CSV; data is added every 5 minutes and backfilled every 24 hours, and everything is saved in OneLake as Spark Delta tables.
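For context, here is a minimal sketch of the 5-minute ingestion step, assuming the raw files land in the lakehouse Files area (the path, options and table name are illustrative, not the exact ones from the repo):

# read the latest batch of raw CSV files from the lakehouse Files area
df = (spark.read
          .option("header", "true")
          .csv("Files/aemo/current/*.CSV"))

# append the new rows to the Delta table in OneLake
df.write.mode("append").format("delta").saveAsTable("scada")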

Not everything is CSV though; there is some data in JSON and Excel, of course 🙂

Although the main code uses PySpark, I used Pandas and DuckDB too, for no philosophical reason; basically I use whatever Stack Overflow gives me first 🙂 and then I just copy the result to a Spark DataFrame and save it as a Delta table, as you don’t want to miss out on V-Order.

For example, this is the code to generate a calendar table using DuckDB SQL; the unnest syntax is very elegant!

import duckdb

# generate one row per day between 2018-04-01 and 2024-12-31
df = duckdb.sql("""
    SELECT cast(unnest(generate_series(cast('2018-04-01' as date), cast('2024-12-31' as date), interval 1 day)) as date) as date,
           EXTRACT(year from date) as year,
           EXTRACT(month from date) as month
""").df()

# copy the Pandas result into Spark and save it as a Delta table (to get V-Order)
x = spark.createDataFrame(df)
x.write.mode("overwrite").format("delta").saveAsTable("Calendar")

The Semantic Model

The model is not too complex but not trivial: 2 fact tables, 4 dimensions, with the biggest table at 240 million rows.

Stuff that needs improvement

1- Have an option to download the semantic model from the service as a pbit; it has to be one file for ease of distribution.

2- More options for the scheduler, something like: run this job every 5 minutes, but only between 8 AM and 6 PM.

Parting Thoughts

Notebook + OneLake + Direct Lake is emerging as a very powerful pattern, especially when the data is big and freshness is important. But users need to learn how to write Python code, and I am afraid that’s not obvious at all for the average data analyst; maybe this AI thing can do something about it? That’s the big question.

Does that mean a notebook is the best solution for all use cases? I don’t know, and I don’t think there is such a thing as a universal best practice in ETL. As far as I am concerned, there is only one hard rule:

Save raw data when the source system is volatile.

Anything else is a matter of taste 🙂

Use the Fabric Notebook code-based orchestration tool to avoid concurrent write conflicts

I had a simple data ingestion use case: Notebook A inserts data into a Delta table every 5 minutes, and Notebook B backfills the same table with new fields, but only at 4 AM.

Initially I just scheduled Notebook A to run every 5 minutes and Notebook B to run at 4 AM. That did not work: I got a write conflict. Basically, Notebook B takes longer to process the data, and by the time it is ready to update the table, it is a bit too late, as the table has already been modified by Notebook A, and you get this error

Workarounds

Solution 1: Schedule Notebook A to run every 5 minutes except from 4 AM to 4:15 AM. Today this is not supported in the Fabric scheduler (although it works fine in Azure Data Factory).

Solution 2: Partition by date to avoid Spark writing to the same file at the same time (see the sketch below). This is fine for my table, as it is big enough, around 230 million rows spread over 6 years; generating 2,000 files is not the end of the world. But the same approach does not work for another table, which is substantially smaller, around 3 million rows.
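For reference, a minimal sketch of the Solution 2 write, assuming a SETTLEMENTDATE column and a table called scada (both names are illustrative):

from pyspark.sql import functions as f

# derive a DATE partition column so the 5-minute append and the 4 AM backfill
# end up writing to different folders of the same Delta table
(df.withColumn("DATE", f.to_date("SETTLEMENTDATE"))
   .write
   .mode("append")
   .format("delta")
   .partitionBy("DATE")
   .saveAsTable("scada"))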

Solution 3: It turns out there is a code-based orchestration tool in Fabric Notebooks.

I knew about MSSparkUtils, mainly because Sandeep Pawar can’t stop talking about it 🙂 but I did not know that it does orchestration too; in my case the solution was trivial.

Add a check in Notebook A: if there is a new file to backfill, call Notebook B.

# in Notebook A: if new files arrived, run the backfill notebook in the same session
if len(files_to_upload_full_Path) > 0:
    mssparkutils.notebook.run("Transform_Backfill_Previous_Day")

And it worked beautifully (I know the feeling, it is easy when you know it).

Notice that the second notebook runs using the same runtime, so it is faster and maybe even cheaper.

OK, there is more

Conditionally running a notebook based on new file arrival is a simple use case, but you can do more; for example, you can run multiple notebooks in parallel, or even define complex relationships between notebooks as a DAG, with just Python code!
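As a rough sketch (the notebook names are made up), mssparkutils.notebook.runMultiple accepts either a plain list of notebooks to run in parallel or a DAG definition with dependencies:

# run two independent notebooks in parallel
mssparkutils.notebook.runMultiple(["Transform_Scada", "Transform_Prices"])

# or describe dependencies as a DAG: the backfill only runs after both transforms
DAG = {
    "activities": [
        {"name": "Transform_Scada", "path": "Transform_Scada"},
        {"name": "Transform_Prices", "path": "Transform_Prices"},
        {
            "name": "Transform_Backfill_Previous_Day",
            "path": "Transform_Backfill_Previous_Day",
            "dependencies": ["Transform_Scada", "Transform_Prices"],
        },
    ]
}
mssparkutils.notebook.runMultiple(DAG)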

Take Away

This is just a personal observation: because Fabric was released with all the engines at the same time, a lot of very powerful features and patterns did not get a chance to be fully exposed and appreciated, and based on some anecdotal evidence on Twitter, it seems I am not the only one who had never heard about Fabric Notebook code orchestration.

For Power BI people starting with Fabric, Python is just too powerful. Yes, we did fine without it all these years, but if you have any complex data transformation scenarios, Python is just too important to ignore.

Thanks Jene Zhang for answering my silly questions.

Using Arrow and Delta Rust to transfer Data from BigQuery to Fabric OneLake

This is just a POC of how using Arrow with Delta Rust can give you a very good experience when importing data from BigQuery to OneLake.

For a serious implementation, you need to use Azure Key Vault and read the secrets from the Fabric notebook; again, this is just a POC.
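If you do wire up Key Vault, reading a secret from a Fabric notebook is roughly a one-liner, something along these lines (the vault URL and secret name are placeholders):

# fetch the BigQuery service account key from Azure Key Vault at runtime
sa_key = mssparkutils.credentials.getSecret("https://<your-vault>.vault.azure.net/", "bigquery-sa-key")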

The core idea is that Delta Rust accepts an Arrow table as input, without the need for a conversion to Pandas.
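A minimal sketch of the idea (the query, path and table name are illustrative, and credentials are assumed to be already configured):

from google.cloud import bigquery
from deltalake import write_deltalake

client = bigquery.Client()

# to_arrow() streams the result via the Storage API when the extras are installed
arrow_table = client.query("SELECT * FROM `bigquery-public-data.some_dataset.some_table`").to_arrow()

# Delta Rust writes the Arrow table straight to the lakehouse, no Pandas in between;
# /lakehouse/default/Tables/ is the default lakehouse mount inside a Fabric notebook
write_deltalake("/lakehouse/default/Tables/staging_table", arrow_table, mode="overwrite")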

The data is public; the query scans nearly 19 GB of uncompressed data.

It took less than 2 minutes to run the query and transfer the data! That’s GCP Tokyo region to Azure Melbourne, and nearly a minute and 25 seconds to write the data to the Delta table using a small single node (4 vCores and 32 GB of RAM).

Show me the Code.

You can download the notebook here. Although the package is written in Rust, it has great Python bindings, which I am using.

Make sure you install google-cloud-bigquery['all'] to have the Storage API active, otherwise it will be extremely slow.

Notice though that using the Storage API will incur egress costs from Google Cloud.

And use Fabric Runtime 1.1, not 1.2, as there is a bug with the Delta Rust package.
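Something like this in the first notebook cell should cover the dependencies (deltalake is the Python binding for Delta Rust):

# BigQuery client with the Storage API extras, plus the delta-rs Python binding
%pip install 'google-cloud-bigquery[all]' deltalake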

Nice try, how about V-Order?

Because the data is loaded into a staging area, the lack of V-Order should not be a problem, as ultimately it will be further transformed into the DWH (it is a very wide table); as a matter of fact, one could load the data as just Parquet files.

Obviously it works with Spark too, but trying to understand why datetime64 and NA values did not work well with Spark DataFrames was a nightmare.

I am sure it is trivial for a Spark ninja, but watching a wall of Java errors was scary. Honestly, I wonder why Spark can’t just read Arrow without Pandas in the middle.

With Delta Rust it just worked: datetime works fine. The time type is not supported, but it gave me a very clear error message (for now I cast it as a string and will figure out later what to do with it). It was an enjoyable experience.
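The cast can live in the BigQuery query itself; a hypothetical example (the column name is made up):

# keep every column but replace the unsupported TIME column with a string version
sql = """
SELECT * EXCEPT (trading_time),
       CAST(trading_time AS STRING) AS trading_time
FROM `bigquery-public-data.some_dataset.some_table`
"""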

As it is just code, you can implement more complex scenarios like incremental refresh, or merge, and all those fancy data engineering things easily, using Spark, a stored procedure, or any modern Python library.
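For example, a merge from the staging table into a curated table using the Spark Delta API might look like this sketch (the table names and join key are assumptions):

from delta.tables import DeltaTable

# upsert the freshly landed rows into the curated table
target = DeltaTable.forName(spark, "curated_table")
(target.alias("t")
   .merge(spark.table("staging_table").alias("s"), "t.id = s.id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())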

Running a simple Query to make sure it is working

Take Away

The notebook experience in Fabric is awesome, I hope we get some form of secret management soon, and Delta Rust is awesome!

Using Direct Query mode with Fabric DWH

TL;DR: Direct Query with Fabric SQL is considered a background operation, which means the usage is smoothed over a 24-hour period. This blog is definitely not a recommendation nor a good practice or anything of that nature, but I was just curious: what if this makes Direct Query an attractive proposition in some scenarios?

Use case: small data with frequent refreshes

We assume a small company (10 users); we will test using a Fabric F2 SKU and Power BI Pro licenses as the front end (free Power BI readers start from F64).

Monthly cost = Fabric $156 + Power BI Pro 10 x $10 = $256 US/month

In this case, the business requirement is a freshness of 5 minutes; the user needs to see the latest data, which makes import not an option, as it is limited to 8 refreshes per day.

The Data Model

The data model is relatively small: 5 tables, 3 dimensions and two fact tables, the biggest one at 9 million rows. The fact tables are ingested with new data every 5 minutes; the State and Settlement Date tables are fixed, and DUID changes very slowly, maybe once every couple of months.

Fabric Notebook as an ingestion tool

To reduce compute usage, we used a Spark notebook with the smallest compute: 4 cores and 32 GB of RAM.

How to simulate concurrency 

This one is tricky: 10 users does not mean they will all open the same report at the same time and continuously click refresh. I know there are tools to load-test Power BI, but you need to install PowerShell and stuff; I just simulated it by using a dynamic slicer and running multiple copies of the report concurrently.

Two Workspaces

We will try to keep it as simple as possible; no medallion architecture here, just two workspaces:

Backend workspace: using an F2 capacity

Front-end workspace: an old-school Pro license workspace

Direct Lake Mode vs Direct Query vs Import in Fabric

As an oversimplification, and especially for people not familiar with the Microsoft BI stack: the Power BI engine is called Analysis Services, and it does basically two things

Formula engine: translates DAX, using the semantic model, into SQL

Storage engine: gets the data from storage using SQL

Direct Query mode: the data is served by a database like Synapse, BigQuery, SQL Server, etc.

Direct Lake mode: the data is served by Vertipaq; the data format is open source

Import mode: the data is served by Vertipaq; the data format is proprietary

Note that the difference between Import and Direct Lake is in the storage format; the in-memory format is the same (that’s a very clever design decision).

Vertipaq will always be the Fastest Engine 

Vertipaq is designed for one thing, pure speed, so I don’t have any expectation that other engines can compete with it; we are talking milliseconds even with joins and multiple tables. I am more interested in resource usage, though.

Resource Usage Profile

Direct Lake (interactive tasks are smoothed over a short period of time)

Direct Query with Fabric SQL (background operations are smoothed over 24 hours)

To be fair, both modes worked as expected. Direct Lake is definitely faster, which is expected, but what got my attention is that the DWH did well and drained the capacity at a rate of only 2 CU/s; there is no bursting, it is the baseline performance here. That’s extremely encouraging, as one of the biggest complaints about cloud DWHs is that they don’t scale down very well.

Keep in mind that in both cases, the total capacity you can consume in 24 hours is still limited to:

24 hours x 3,600 s x 2 CU = 172,800 CU(s).

Have a look at this documentation, as it is important to understand how to properly size your capacity.

PowerBI is still very chatty

Power BI generates a lot of SQL queries in Direct Query mode; most of them took between 500 ms and 1 second. That’s not bad for a SQL engine that costs $0.36/hour.

OK, what does this all mean?

I reserve the right to change my view after further testing, but my hypothesis is this: given that the DWH has very good performance, and more importantly a very efficient engine at lower scale, and given that it is considered a background operation, Direct Query may be an interesting option if you need more than 8 refreshes per day and you are using a Power BI Pro license with a small F SKU.

But as always, test using your own data.