First impression of Microsoft Fabric

TL;DR : The hype is real. Microsoft has a new full data stack offering based on a lakehouse architecture, with Delta tables as the core foundation. It is not a simple marketing exercise but a complete rewrite of the storage layer for key components (DWH, Power BI, etc.). The performance of the new shared-disk SQL engine is competitive with Snowflake, although pricing details were not announced, so some caution is warranted. Vertipaq performance on remote storage is spectacular.

Core architecture 

OneLake is a lakehouse architecture based on the open table format Delta. I suspect Microsoft chose Delta tables in Azure because of Databricks' existing user base.

An open storage format is cool, but there was a problem: of the four main compute engines (BI, DWH, Kusto and Spark), only Spark supported Delta tables. Apparently they have been very busy at work in the last few years rebuilding the storage layer, and now with the public preview we can have a look at the results.

Self Service Experience

I know it may sound silly, but for me personally (as a self-service enthusiast) these are the best features released so far.

  • OneLake Windows client, which does not require admin rights to install
  • Append or replace in Dataflow Gen2
  • Attention to detail in the Lakehouse experience: one click to preview data and another option to see the files

Usually you would think of a lakehouse as a hardcore engineering thing; Microsoft made it so simple that I was very pleased loading files directly from the UI.

Edit : we got Fabric activated in our tenant. I quickly created some Delta tables using Python in Fabric and they started showing up in my OneLake drive; it is just a surreal experience.

First Impression of the Synapse DWH

Probably you are reading this blog to understand how good the new Synapse DWH is. The short answer: it is very good. As far as I am concerned, it has nothing to do with the previous generation, and maybe they should not have kept the word Synapse at all. (I never liked Dedicated Pool.)

It is still in preview, so some functionality is not available yet. For example, there is no query history, and sometimes it gets stuck on some queries. But let me tell you this: I have quickly tested maybe more than 10 DWH services, and it is the easiest by far (once you get past the initial confusion between Lakehouse and DWH).

In other DWHs, I had to load the data to cloud storage first and then run the queries. With Fabric, the Python notebook is there by default; there is nothing to provision, you just write code.

I used DuckDB to generate the Parquet files for TPCH-SF100
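A minimal sketch of that generation step, assuming DuckDB's tpch extension; the chunked dbgen loop and the output file names are my choices, not the original notebook:

import duckdb

con = duckdb.connect("tpch.duckdb")
con.sql("INSTALL tpch; LOAD tpch;")

# generate SF100 in chunks so it fits in laptop memory
for step in range(10):
    con.sql(f"CALL dbgen(sf=100, children=10, step={step})")

# export every generated table to Parquet
for (table,) in con.sql("SHOW TABLES").fetchall():
    con.sql(f"COPY {table} TO '{table}.parquet' (FORMAT PARQUET)")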

Some observations : 

  • I was a bit confused as the notebook just worked without me doing any provisioning.

I am not a huge Spark fan, but with Fabric it is nearly hidden. All you have to focus on is writing SQL or PySpark; I like this very much.

  • The data was written to OneLake so fast that I thought the Lakehouse was using something like local block storage (EBS) 🙂 it is not; I presume because it is a Microsoft product, they are getting some spectacular throughput from OneLake.
  • When you have multiple Lakehouses in the same workspace, it is not very obvious which one is active, and using the path '/lakehouse/default/Files' you may end up writing to the wrong Lakehouse.
  • I did not know how to use COPY from files to create tables in the DWH, so instead I used PySpark to write Delta tables into the Lakehouse, as sketched below.
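A minimal sketch of that PySpark load, assuming the Parquet files already sit under the Lakehouse Files area; the 'Files/tpch' layout is mine:

# 'spark' is the session preconfigured in a Fabric notebook
tables = ["lineitem", "orders", "customer", "part",
          "partsupp", "supplier", "nation", "region"]
for t in tables:
    df = spark.read.parquet(f"Files/tpch/{t}")
    # saveAsTable writes a managed Delta table into the active Lakehouse
    df.write.format("delta").mode("overwrite").saveAsTable(t)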

As I said previously, there is no query history, so I don't know the duration of every query; all I can show is a screenshot of the total duration, 2 minutes.

Your mileage may vary; for example, in another run I got 5 minutes.

On the last run I got less than 2 minutes.

Snowflake more or less gets around 2 minutes at a cost of 2 $/hour for the basic edition. I am not sure how much compute the SQL engine is consuming, so I can't compare for now, but keep in mind the same workload took Synapse Serverless 10 minutes !!!

Notice that I did not have to understand anything about table distribution nor provision any resources; conceptually, I think the Synapse DWH is more like BigQuery than Snowflake or Databricks SQL.

Vertipaq Direct Lake Mode

Vertipaq, the query engine of Power BI, got the lakehouse treatment too. Traditionally, Vertipaq was a pure in-memory DB: you had to import the data into its proprietary format before doing any analysis. That is all well and good until your data is bigger than the available memory. This limitation is no more: it can now read directly from remote storage. To be honest, I was suspicious, and to make it worse, the first public release of Fabric had a bug that made the performance not that great. I did try it today with the new update, though, and all I can say is this is the fastest thing I have seen in my life: 2.6 seconds for a group by on 600 million rows.

Some observations : 

  • This is all new, so we don't have best practices yet, but my gut feeling is that if the data is small enough to fit in memory, import mode is unbeatable. However, because Direct Lake mode separates storage and compute, it opens up all kinds of scenarios for pure logical models (that's a book by itself).
  • Being fast does not mean you can do anything: I noticed a penalty when using very high cardinality columns in relationships; I would have been surprised if there wasn't one.
  • In Direct Lake mode, Vertipaq does column pruning, which addresses one of the biggest complaints about import mode: now only the used columns are loaded from OneLake into memory.
  • I have not tested it yet, but it seems Vertipaq will push down filters to the Delta table and load only the required data; I presume it will leverage the Delta log.
  • Lakehouse + Vertipaq seems like such an attractive pattern that I feel we may not even need a DWH at all (just trolling the DWH team, or maybe not).

What I did not like 

  • I do have access to Premium, so this does not impact me personally, but I have not forgotten where I came from: Pro is still the only way for a lot of people to use Power BI, and as far as I can tell, Fabric does not mean much for them.
  • The DWH and Lakehouse offerings got all the attention, for obvious reasons: Microsoft wanted to tell the market they are back driving innovation in the data space. But I did not hear anything about Datamart's future, and I believe it is a critical workload to address.
  • Feature management in Fabric is problematic: only the tenant admin can turn things on and off, so even if you paid for a capacity, you are at the mercy of the admin. I think with Fabric, capacity admins and workspace admins need more granular control. Edit: I was corrected by Chris; a capacity admin can override the tenant admin, which is a fantastic decision.
  • Capacity management: I know it is a free trial now and Microsoft is paying for the compute resources, but we need more documentation explaining in very simple language how resource consumption works; all it takes is one user doing something silly for people to get the wrong impression.
  • IMHO, they should have included Logic Apps with Fabric; a lot of users are not comfortable using Data Factory for scheduling.

What I really think as a data Enthusiast

Asking a Power BI dev what they think about Fabric is the most boring question ever; of course they will be excited. We were just offered a fully managed enterprise data stack, how could we not be excited? You can bet your career on the Microsoft data stack till AGI gets us replaced 😂

But in my free time I have used BigQuery and Looker Studio since 2019, so I am reasonably familiar with the major DWH offerings. I have absolutely no problem with the Fabric DWH being the exclusive writer of its Delta Lake format; what matters is read compatibility with other engines. It is extremely hard to support multiple concurrent writers from different engines, especially when you need to add more complex workloads like heavy writes, multi-table transactions, etc.

What I don't understand, though, is why Spark needs exclusivity for writing to the Lakehouse. For a lot of scenarios, I would like to be able to write using a Python API and not necessarily pay for the Spark JVM's inefficiency. I presume the whole point of OneLake is one storage, multiple readers; maybe it just has not been released yet and they have a plan, but today it is not very clear.
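To illustrate the kind of JVM-free write path I have in mind: outside Fabric, the deltalake (delta-rs) package already does this from plain Python. A minimal sketch, writing to a local folder, since pointing it at OneLake is exactly the part that is not supported today:

import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "region": ["NSW", "VIC", "QLD"]})
# one function call, no Spark cluster, produces a regular Delta table
write_deltalake("./Tables/demo", df, mode="overwrite")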

Just a thought: maybe instead of two products, DWH and LH, having just one DWH with two table formats (both Delta tables), something like native vs LH, would be an easier option in the long term.

Final Thoughts

Although I have known about Microsoft's plan since a year ago, I was very skeptical; I thought it was just too ambitious to be true. Even a week ago, I did not believe the DWH was using Delta tables as its storage format; it seemed too ambitious. Obviously I was wrong. This is an unbelievable technical achievement by the product team, and I think Microsoft just made the concept of the lakehouse ubiquitous.

Databend and the rise of data warehouse as code

TL;DR : Databend is a new data warehouse vendor. They have a paid cloud service for enterprise workloads, but they also offer a full DWH engine as a Python package; you can download a notebook that uses Cloudflare R2 as remote storage.

Introduction

Note : this blog is a quick introduction to the Python library and will not review the commercial offering.

DWH as code was first coined by the famous TJ. The idea is very simple: you just write a Python script to process data interactively from remote storage. In the simplest use case you don't necessarily need a middleman, sorry, I meant a DWH; you run an engine like DuckDB or Polars and you get your results. Databend pushes this concept even further with native support for a custom-built lakehouse table format, a disk cache, and even a result cache. Yes, in a notebook, who would have thought !!!!
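In its simplest form, DWH as code looks something like this with DuckDB (the bucket path is illustrative, and the httpfs extension assumes your S3/R2 credentials are configured):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # enables reading object storage

# point the engine straight at remote Parquet; no server anywhere
con.sql("""
SELECT l_returnflag, count(*) AS orders
FROM read_parquet('s3://my-bucket/tpch/lineitem/*.parquet')
GROUP BY 1
""").show()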

Databend Architecture

Databend's architecture is a typical modern cloud data warehouse; contrary to something like Snowflake, everything is open source.

In a notebook environment some layers don't make much sense; you could install a Docker image, but that's not something that interests me.

Setup

I built a Python notebook that generates some data (using TPC-H) and loads it into Cloudflare R2. It obviously works with S3 too, but R2 has no egress fees.

  • First run : directly from remote storage
  • Disk cache : if the data was already downloaded to the notebook VM, the query runs locally
  • Result cache : if the exact same query is run again and the base table did not change, the results are returned in under 20 ms !!!

Databend uses a native table format called strawboat that looks like Apache Iceberg; you can read the details here. I know it is probably much more complicated than that, but in my mind, a snapshot is an Iceberg thing and a transaction log is a Delta table concept.

Load Data into R2

I think Databend can't use the "create or replace" syntax, so I had to drop the table if it exists and then recreate it. The syntax is pretty standard; you only need your R2 credentials.
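A rough sketch of that load step, assuming the databend package's SessionContext API; the stage name, bucket and credential placeholders are mine, so double-check the exact syntax against the current docs:

from databend import SessionContext

ctx = SessionContext()

# external stage pointing at the R2 bucket (S3-compatible endpoint)
ctx.sql("""
CREATE STAGE IF NOT EXISTS r2_stage URL = 's3://tpch'
CONNECTION = (
  ENDPOINT_URL = 'https://<account_id>.r2.cloudflarestorage.com',
  ACCESS_KEY_ID = '<key>',
  SECRET_ACCESS_KEY = '<secret>')
""")

# no 'create or replace', so drop then recreate from the staged Parquet
ctx.sql("DROP TABLE IF EXISTS lineitem")
ctx.sql("CREATE TABLE lineitem AS SELECT * FROM @r2_stage/lineitem/ (file_format => 'parquet')")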

Run TPCH Queries

I modified the first query of TPC-H to scan the whole "lineitem" table, just to test the disk cache. Notice that in Python the disk cache is not on by default; you have to turn it on by passing an environment variable.
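Roughly, the modified Q1 looks like this; the l_shipdate filter is dropped so nothing can be pruned and the whole table is scanned (I am leaving out the cache environment variables, as the exact names may change while the package is this young):

# same SessionContext as in the load step
q1 = """
SELECT l_returnflag, l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
"""
ctx.sql(q1).show()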

The same goes for the result cache.

And here are the results. Please note the first run depends on R2 throughput and says nothing about Databend's performance; you will notice the result cache takes virtually no time.

Cache Invalidation

Just to mess with it, I deleted some rows from the tables "lineitem" and "customer". That invalidates the result cache and the disk cache and triggers a remote scan; I can't believe I am writing this about a SQL engine running from a notebook.
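The deletes were along these lines; the predicates are arbitrary, just enough to produce new table snapshots:

# any write creates a new snapshot, so cached results no longer match
ctx.sql("DELETE FROM lineitem WHERE l_orderkey % 1000 = 0")
ctx.sql("DELETE FROM customer WHERE c_custkey % 1000 = 0")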

Take away 

That was just a quick introduction. The Python API is still at an early stage; for example, the metadata layer is saved locally in the notebook, which I think is not very convenient (the devs said they can do something about it). But it is really refreshing to see a new startup building new products with new ideas.

Optimize BigQuery cost by creating an independent sort order for the same table

TL;DR

Using a materialized view to store a different sort order of the base table can reduce cost substantially; in this example, up to 200x for some queries.

The problem : you can't have multiple sort orders.

One of the most effective techniques in a columnar database to improve speed is to sort the table, which helps the engine scan less data. But what if some of your queries work better with a different sort? Obviously you can't have independent columns sorted within the same table. It turns out BigQuery has a very nice capability: materialized views, which are mainly used to aggregate data (see an example here), also work fine without aggregation, and you can even change the sort order of the base table.

Example using PowerBI

In this report, which analyzes electricity production for the Australian market, the base table is sorted by day. That makes perfect sense, as most users will want to see only a specific time period and then aggregate by some attribute like region, technology, etc.

The model is very simple: a fact table in DirectQuery mode and two dimensions in import mode, a very simple star schema.

The main report is a couple of charts using some aggregation, nothing particularly interesting; the queries use the materialized view, as we are aggregating per day while the base table is by minute.

Drill Down to details

The trouble starts here: let's say a user wants to see all the data for one particular station name, down to the lowest level of detail.

The engine will have to scan the full table, as the table is sorted by date, not by station name.

As you can see, 4.61 GB is a lot of data scanned just to see one station name.
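For context, the drill-down boils down to a query like the following, sketched with the BigQuery Python client; the table name is a placeholder and DUID is the station identifier in this model:

from google.cloud import bigquery

client = bigquery.Client()
job = client.query("""
SELECT *
FROM BASE_Table
WHERE DUID = 'BAYSW1'  -- one station, lowest level of detail
""")
job.result()  # wait for completion
print(f"{job.total_bytes_processed / 1e9:.2f} GB scanned")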

Non Aggregation Materialized View

Let's create a new MV with a different sort order. Notice there is no group by: it has the same number of rows as the base table, and you can't have a different partitioning; only the sort can be changed.

create materialized view `XXX_MV_Sort`
cluster by DUID, DAY as
SELECT
  *
FROM
  BASE_Table

The same query is substantially cheaper 🙂 from 4.6 GB down to 19 MB; that's around 230 times cheaper.

You can see the two query plans: one scanning 101 million rows vs 404,000 rows.

Take away

Because BigQuery already knows about all the queries, it may be useful to have some kind of service that gives suggestions, like changing the sort order or adding a materialized view. In the meantime, modeling still matters and can save you a lot of money.

The Unreasonable Effectiveness of Snowflake SQL Engine

TL;DR :

In a previous blog, I benchmarked some databases using TPCH-SF100. One of the main complaints was that the data originally came from Snowflake; although I rewrote the Parquet files using DuckDB, it did not feel right, as Snowflake's performance on the lowest tier was exceptionally good. In this blog I am using data generated independently, and yes, Snowflake is very fast.

Reproducibility

Since that blog, DuckDB has released a new update where you can generate any TPC-H dataset on a low-resource computer (like my laptop). I thought it was a good opportunity to validate the previous results; this time, though, I published the full pipeline so it can be reproduced independently.

I used the following steps

  • Generate the dataset on my laptop using this code
  • Upload the Parquet files to Azure Storage; the total size of the Parquet files is 38 GB
  • Create external tables in Snowflake using this SQL script
  • Import the data from Azure Storage into Snowflake native tables using create table as select from the external tables; the import took around 10 minutes (rather slow, to be honest), and the data is imported as-is: no sorting and no table distribution shenanigans (see the sketch after this list)
  • I did notice Snowflake has a better compression ratio than Parquet; for example, the table "lineitem" is 25 GB in Parquet but 13.8 GB in Snowflake's file format
  • Run the TPCH benchmark 2 times, then do it again 1 hour later (I wanted to be sure the disk cache had expired), obviously keeping the result cache off
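A rough sketch of the import and the result cache setting, using the Snowflake Python connector; the connection arguments and the external table name lineitem_ext are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<wh>", database="TPCH", schema="PUBLIC")
cur = conn.cursor()

# keep the result cache out of the benchmark numbers
cur.execute("ALTER SESSION SET USE_CACHED_RESULT = FALSE")

# import as-is: no sorting, no distribution shenanigans
# (EXCLUDE drops the raw VARIANT column external tables expose)
cur.execute("CREATE OR REPLACE TABLE lineitem AS SELECT * EXCLUDE (value) FROM lineitem_ext")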

The Results

Runs 1 and 3 query the data directly from Azure Storage (you can see that from the local disk cache in query 1); subsequent queries use the data cached on the local SSD. Notice the cache is temporary: if you suspend your compute you may lose it, although Snowflake tries to give you back the same VM if it is available.

I plotted the results in Power BI using the query history table. The numbers are impressive; I think it is the combination of excellent features working together (great compression, extreme pruning, great query plans, exact statistics) that makes for an extremely efficient engine.
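If you want to pull the same numbers, the query history is exposed as a table function; a minimal sketch reusing the cursor from above (TOTAL_ELAPSED_TIME is in milliseconds):

cur.execute("""
SELECT query_text, total_elapsed_time / 1000 AS seconds
FROM TABLE(information_schema.query_history(result_limit => 100))
WHERE query_type = 'SELECT'
ORDER BY start_time
""")
for query_text, seconds in cur.fetchall():
    print(f"{seconds:8.2f}s  {query_text[:60]}")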

Take away

Database performance is a very complex subject and varies by workload, but I wanted a simple heuristic, a number I can reference, and I think I have that magic number 🙂 if your database can run TPCH-SF100 in around 2 minutes at a cost of around 2 $, you can say it is fast.