PowerBI Hybrid Tables: you can have your cake and eat it too

Hybrid tables are a clever technical solution to a fundamental problem in data analytics: how to keep data both fresh and fast. PowerBI, Tableau, and Qlik solved this problem by importing data into a local cache. This works for most use cases, but as with any solution, it has limitations.

  • If the data source is too big, you can't simply keep importing it into the local cache.
  • If the data source changes very frequently, say every couple of minutes or seconds, importing becomes impractical or very hard.

The PowerBI engine team came up with a very simple idea: you can have both modes in the same table. Historical data that doesn't change is cached, and today's data is queried live, since it changes very quickly. Patrick from Guy in a Cube has a great video, and Andy has another video specific to Synapse Serverless.

This functionality was released in the December 2021 edition of PowerBI, but unfortunately, when I tested it with BigQuery, it did not work. I reported the issue and, I have to say, I was really impressed by the product team (kudos to Christian Wade and Krystian Sakowski): yesterday they released an updated version that fixed the issue. (It works with Snowflake, Databricks, etc.)

Setup

It is literally just one extra box to tick compared to the previous incremental refresh user interface.

Yes, just like that. The engine generates the table partitions behind the scenes. If you want to know why PowerBI is so successful, it is because of stuff like this: take a very hard problem and make it extremely easy for non-technical people to use.
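To give a rough idea of what those partitions do, here is an illustrative sketch, not the SQL the engine actually generates (fact_table and event_date are made-up names):

-- Each import partition fetches one closed date range, refreshed on a schedule:
SELECT * FROM fact_table
WHERE event_date >= DATE '2021-11-01' AND event_date < DATE '2021-12-01';

-- The hybrid DirectQuery partition covers the current period, queried live:
SELECT * FROM fact_table
WHERE event_date >= CURRENT_DATE;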

The data model is very simple: one fact table with data that changes every couple of minutes, and a date dimension in mixed mode (watch Patrick's video, he explains why).

Premium Only

Yes, it is a Premium-only feature, and obviously it works with Premium Per User. I am not going to complain; someone needs to pay for those R&D costs. But it would be really nice if they released it for the Pro license too: it just feels odd that a core feature of the engine is tied to a particular license. We had this situation before with incremental refresh, and they eventually released it even for the free license; I hope the same will happen with hybrid tables.

Mixed Partitions

I published the report to the service and used Tabular Editor to see what's going on behind the scenes (make sure you download the latest version of Tabular Editor; it works with the free version too).

[Image: the generated partitions viewed in Tabular Editor]

As expected, the last partition is in DirectQuery mode, and everything else is cached in PowerBI.

How it Works

I used DAX Studio to capture what the engine does when you run a query:

[Image: the query trace captured in DAX Studio]

The PowerBI formula engine sends two queries: one to the remote DB (in my case BigQuery) and one to the local storage. You can clearly see the difference in speed:

1 day using DirectQuery: 2 seconds (the query takes 400 ms at the endpoint, but BigQuery has a very substandard ODBC driver)

13 months' worth of cached data: 47 ms

The point is: if you can just import, do it; you will get the best performance and user experience.

(At work we have a sub-5-minute pipeline end to end, from the source DB to PowerBI.)

The Devil is in the details

As far as I can tell, the formula engine keeps sending both queries every time, even when the required data is already cached; the query to the external DB simply returns an empty result. In theory, that should not be a big deal: modern data warehouses are fast, especially with partition pruning.

Unfortunately, no: only some databases can return a sub-second empty result set to PowerBI (yes, the quality of the driver is as important as the DB engine itself).
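To make that concrete, here is roughly the shape of that wasted round trip (an illustrative sketch, with made-up table and column names):

-- The remote query is restricted to the DirectQuery partition's range, so when
-- the visual filters only on cached history the predicates exclude each other:
SELECT event_date, SUM(amount) AS total
FROM fact_table
WHERE event_date >= CURRENT_DATE       -- DirectQuery partition boundary
  AND event_date < DATE '2021-01-01'   -- the visual's filter (history only)
GROUP BY event_date;
-- Zero rows come back, yet the driver and network round trip is paid in full.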

Take Away

It is a very interesting solution, worth testing for specific scenarios, but if you can get away with importing data only, that is still the best way. Yes, hybrid tables reduce the workload on the remote database, but you still need a solid database: getting a sub-second query end to end is still a hard problem, even for one day's worth of data (just test it, and don't forget concurrency).

I heard about a different use case, which I find very intriguing: some users want it the other way around, with recent data as import and historical data as DirectQuery. I guess it is useful if you have a really big fact table.

A surprising side effect of PowerBI hybrid tables (maybe it was planned, who knows): Synapse Serverless in DirectQuery mode now looks like a very good candidate to use, as scanning one day of data is faster and an order of magnitude cheaper!

I still hope the Vertipaq engine team surprises us in a future update and somehow lets the formula engine generate only one query when all the data needed is in the local cache.

Benchmarking Snowflake, BigQuery, SingleStore and Databricks using TPC-H SF10

Another blog on my favorite topic: interactive live BI workloads with low latency and high concurrency, but this time, hopefully, with numbers to compare.

I tested only the databases that I am familiar with: BigQuery, Snowflake, Databricks and SingleStore.

Note: I decided to skip Synapse dedicated pool; I don't know enough to load the data in an optimized way.

Edit: I added Synapse Serverless and DuckDB.

How About OLAP Cube

A lot of vendors, particularly Microsoft, think you don't need a very fast query engine: just load your data into a cube, Vertipaq (the in-memory engine), and go from there. That's a very sensible choice, and it is what I use personally.

But for customers using Looker, Superset, Mode, etc., which do not have an internal query engine, fast SQL query times are very important.

My own speculation is that DWHs are getting fast enough that maybe we will not need an OLAP cube in the middle.

TPC-H

The most widely used benchmarks to test BI workloads are TPC-DS and TPC-H, produced by the independent organization TPC. Unfortunately, most of the published benchmarks are for big datasets, starting from 1 TB. As I said before, I am more interested in smaller workloads, for a simple reason: after nearly 5 years of doing business intelligence for different companies, most of the data models I see are really small (my biggest was 70 million rows with 4 small dimension tables).

Benchmarking is a very complex process, and I am not claiming that my results are correct. All I wanted to know, as a user, is an order of magnitude; a benchmark can give you a high-level impression of a database's performance.

Schema

I like TPC-H as it has a simpler schema (8 tables) and only 22 queries, compared to TPC-DS, which requires 99 queries.

[Image: the TPC-H schema]
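To give a flavor of the workload, here is Query 1, the pricing summary report; it scans nearly the whole lineitem table, which makes it a good first smoke test of raw scan speed (the date arithmetic syntax varies slightly between engines):

-- TPC-H Query 1: aggregate almost all of lineitem by return flag and status.
SELECT
  l_returnflag,
  l_linestatus,
  SUM(l_quantity) AS sum_qty,
  SUM(l_extendedprice) AS sum_base_price,
  SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
  SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
  AVG(l_quantity) AS avg_qty,
  AVG(l_extendedprice) AS avg_price,
  AVG(l_discount) AS avg_disc,
  COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;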

Some Considerations

  • The result cache is not counted.
  • The results use a warm cache, with at least one cold run; I ran the 22 queries multiple times.
  • Databricks by default provides a sample database, TPC-H SF05, where the main table, lineitem, is 30 million rows. I didn't know enough to import the data and apply the proper sorting, etc., so I preferred to use the smaller dataset. (I later created a local copy using CREATE TABLE AS SELECT and loaded the SF10 data.)
  • Snowflake and SingleStore provide SF10 and other scale factors by default.
  • For BigQuery, I imported the data from Snowflake and sorted (clustered) the tables for better performance, as sketched below; it is a bit odd that BigQuery doesn't provide such an important public dataset by default.
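For reference, "sorting" in BigQuery means clustering; a load along these lines is what I have in mind (project and dataset names are placeholders, and the exact cluster keys are my guess):

-- Cluster lineitem on its join/filter keys so BigQuery can prune blocks.
CREATE TABLE `myproject.tpch.lineitem`
CLUSTER BY l_orderkey, l_shipdate AS
SELECT * FROM `myproject.staging.lineitem_raw`;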

No Results Cache

Most DWHs support a result cache: basically, if you run the same query and the base tables did not change, the engine returns the same results very quickly. Obviously, in any benchmark, you need to filter out those queries.

  • In Snowflake you can use this statement to turn the result cache off:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
  • In Databricks:
SET use_cached_result = false
  • In BigQuery, just disable cached results in the query settings in the UI.
  • SingleStore does not have a result cache per se; the engine just keeps a copy of the query plan, but it scans the data every time.

Warm Cache

Snowflake, SingleStore and Databricks leverage a local SSD cache: when you run a query the first time, it scans the data from cloud storage, which is a slow operation; when you run it again, the query will try to use the data already copied to the local disk, which is substantially faster. Especially with Snowflake, if you want to keep the local warm cache, it makes sense to keep your cluster running a bit longer.
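In Snowflake you can actually see how much of a scan was served from that local cache, using the standard account usage view (the exact filters are up to you; a minimal sketch):

-- How much of each recent query was read from the local SSD cache?
SELECT query_text,
       total_elapsed_time,              -- in milliseconds
       percentage_scanned_from_cache
FROM snowflake.account_usage.query_history
ORDER BY start_time DESC
LIMIT 20;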

BigQuery is a different beast: there is no VM, and the data is read straight from Google Cloud Storage. Yes, the Google Cloud network is famous for being very fast, but I guess it cannot compete with a local SSD. Anyway, that's why we have BI Engine, which basically caches the data in RAM; but not all queries are supported, and actually only 6 of the 22 are fully accelerated as of this writing (see the limitations).

Query History

Getting query timings is very straightforward using INFORMATION_SCHEMA, except for Databricks, where it seems to be supported only through an API; I just copied one warm run, pasted it into Excel, and loaded it from there.
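In Snowflake, for example, one warm run can be pulled straight from the table function (a minimal sketch):

-- Timings of the most recent queries, newest first.
SELECT query_text, total_elapsed_time
FROM TABLE(information_schema.query_history())
ORDER BY start_time DESC;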

Engine Used

  • Snowflake: X-Small (lowest tier)
  • Databricks: 2X-Small (lowest tier)
  • SingleStore: S-0
  • BigQuery: on demand + 1 GB reservation of BI Engine

Results

The 22 queries are saved in this repo; I am using PowerBI to combine all the results.

Let's start with:

Snowflake VS BigQuery

Snowflake VS SingleStore

Snowflake VS Databricks

Notice that Databricks is using the smaller dataset, SF05 (30 million rows), and Snowflake still shows better performance.

Overall

Edit: due to feedback, I am adding the sum of all queries. You can download the results here.

Synapse Serverless seems to hit a bottleneck; I will update when I find the issue, or maybe it is by design.

Edit: 26-Jan-2022, I updated the results for Databricks SF10; I uploaded the same data used for BigQuery, then created a Delta table and applied OPTIMIZE ZORDER.

Take away

  • Snowflake is very fast, with consistent results across all 22 queries; only Query 13 is a bit odd.

  • SingleStore is remarkable, but Query 13 is not good at all and skews the overall result.

  • BigQuery is fantastic when BI Engine works (which is not that often, I would say).

  • Databricks performance on TPC-H SF05 is problematic; I just hope they release a proper TPC-H SF10 dataset and an INFORMATION_SCHEMA like the other DWHs.

I think this workload is very relevant, and I hope vendors will start publishing their own results; maybe in the near future all 22 queries will render under a second.

PowerBI Vertipaq Vs Snowflake Vs BigQuery BI Engine

Another short blog on my favorite topic: high-concurrency, low-latency, interactive BI workloads. I can't test everything, but so far I am impressed by BigQuery BI Engine and Snowflake on Azure. A reader may think I have an anti-Microsoft bias, which is nonsense; I think Microsoft simply has a different approach: the data warehouse is for storage and transformation, but as a serving layer, you load your reporting tables into a Tabular model (PowerBI, Analysis Services).

The PowerBI Tabular model, at a very high level, consists of an analytical database and a semantic model. For the purpose of this blog, I am only interested in the order-of-magnitude performance of the Vertipaq database; this is not a benchmark, I am just asking a lot of questions and hopefully starting a conversation.

Just a note for readers not familiar with PowerBI: when PowerBI uses DirectQuery mode, the Tabular model, using the existing semantic model, sends the queries directly to the source system, and Vertipaq is not used.

What's the Difference Between the Three Engines?

Although all three are columnar databases, they use different technology and make different assumptions. It is a very technical subject, and I am afraid there is not enough literature, but from my perspective as a user, I have noticed the following:

  • BigQuery BI Engine: the data is loaded from BigQuery storage into memory; if the base tables change, the data is reloaded in near real time. Only the columns and partitions needed for the query are loaded. It implements in-memory compression, but, according to the devs, the engine may leave some columns uncompressed for performance reasons; I think it is a balance between compression overhead and loading data from storage as soon as possible (this is my own speculation, I am sure it is much more complex). If the query is not supported or the data scanned is too big, it falls back to BigQuery. Mosha has a nice technical video about the engine.

  • Snowflake: I noticed that once the data is cached on the local SSD drive, performance becomes substantially faster. Although it is not an in-memory database, I was genuinely impressed by the performance. (To be honest, I don't know much about Snowflake.)

  • Vertipaq: when the data is loaded from disk storage into memory, the whole model is loaded, meaning all tables and all columns. Unlike the previous two engines, you can't load only a subset of the data or only the columns needed for the queries; you have to make that selection manually, by importing only the columns you think you will need and, likewise, by filtering down to a subset of the rows.

Basically, it is another independent copy of the data, and you need an ETL process to make sure the data stays synchronized between the source data storage and the Vertipaq storage. Marco Russo has a detailed video about the engine.

Note regarding relationships

Since I started using PowerBI, I have always been mystified by the performance of Vertipaq joins. It is well known that joins are expensive in all databases; Vertipaq seems to be the exception. Initially I started with this model:

[Image: the initial normalized model (customer → orders → lineitem)]

I asked a simple question: what's the total quantity by country? To get the answer, the filter has to traverse some non-trivial relationships (customer to orders, 1.5 million distinct values, and then orders to lineitem, 15 million distinct values).
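For reference, the equivalent SQL the analytical databases have to run joins across the standard TPC-H tables (a sketch):

-- Total quantity by country: the joins that Vertipaq resolves through its
-- materialized relationships have to be computed at query time here.
SELECT n.n_name AS country,
       SUM(l.l_quantity) AS total_quantity
FROM lineitem l
JOIN orders o ON o.o_orderkey = l.l_orderkey
JOIN customer c ON c.c_custkey = o.o_custkey
JOIN nation n ON n.n_nationkey = c.c_nationkey
GROUP BY n.n_name;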

  • Vertipaq gave the results in 536 ms!!!
  • Snowflake: 2.9 seconds.
  • BI Engine fell back to BigQuery (4 seconds); BI Engine has a 5-million-row limit on dimension tables.

Vertipaq materializes the joins; I don't know how an analytical database can compete with this implementation, but I read that Firebolt has something called a join index (to be verified, I have not tested it myself). Anyway, here is the duration using DAX Studio; it is very fast.

Star Schema Model

As BI Engine does not yet support a join with more than 5 million rows, I changed the semantic model to a star schema: I joined the orders and lineitem tables to create one wide fact table. Notice I am using the sample data provided by Snowflake, TPC-H with a scale factor of 10: the main fact table is 60 million rows, supplier is 100K rows, and customer is 1.5 million rows.
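Building the wide fact table is a simple CREATE TABLE AS SELECT; something like this (the name lineitem_orders matches the DAX further down; the column list is abbreviated):

-- Denormalize orders into lineitem so the model becomes a star schema.
CREATE TABLE lineitem_orders AS
SELECT l.*,         -- all lineitem columns
       o.o_custkey  -- plus the customer key (and any order columns you need)
FROM lineitem l
JOIN orders o ON o.o_orderkey = l.l_orderkey;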

Snowflake is using the X-Small, which costs $2.75/hour.

BigQuery: 1 GB of BI Engine reservation costs 5 cents/hour.

Vertipaq: using my PowerBI Desktop.

Joins Performance

Using the star schema model and the same question, total quantity sold by country (I know the metric is not very useful, but I am only interested in performance):

BI Engine: around 1.8 s (I had to cluster the table by custkey).

Snowflake: 1.6 s.

Vertipaq, using materialized relationships: average 52 ms.

Vertipaq, using runtime joins: average 8 s.

I used TREATAS to simulate a join without a physical relationship (I hope my DAX is not wrong):

sum_Quantity_Virtual =
CALCULATE (
    // Aggregate the fact table...
    SUM ( lineitem_orders[L_QUANTITY] ),
    // ...while pushing the customer keys from the dimension onto the fact
    // table, simulating the relationship at run time instead of relying on
    // a materialized one.
    TREATAS (
        VALUES ( customer_nation_region[C_CUSTKEY] ),
        lineitem_orders[O_CUSTKEY]
    )
)

And here are the results using DAX Studio:

Group By on the Same Table

A simple query: total quantity by line status.
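In SQL terms this is a single-table aggregation, with no join at all (sketch):

-- Group the wide fact table by one of its own columns, isolating pure
-- scan-and-aggregate speed.
SELECT l_linestatus, SUM(l_quantity) AS total_quantity
FROM lineitem_orders
GROUP BY l_linestatus;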

BI Engine: 200 ms

Vertipaq: 6.8 ms; yeah, that is not a mistake, and it is not a result cache. Honestly, I have no idea how it is even possible to get this result.

Snowflake: 366 ms

Take Away

  • It is no surprise: Vertipaq is extremely fast, at least for a small dataset, and analytical databases can't really compete. Materialized relationships are awesome, but you need to make a copy of your data; usually that is not an issue, until your data is too big or too volatile. Microsoft is aware of this, and features like hybrid tables will help.

  • Another interesting question: at what data size does an analytical DWH become faster than PowerBI Vertipaq?

  • This one is my own speculation: maybe analytical DWHs don't need to be as fast as Vertipaq; maybe not having to copy the data is a bigger advantage than a response time under 100 ms.
  • I write only about tech that I have used personally, but it seems Firebolt and SingleStore serve the same workload too. My experience with Databricks SQL was mixed; I don't think it is fast enough for this workload.

For the kind of work I do, Vertipaq is a godsend: it is very fast, it tolerates wrong design choices, and for the majority of my workloads it is the best solution on the market, battle-tested for 10 years with a lot of real-life optimization.

Having said that, it is just a database competing indirectly with other technologies for the same workload, and some competitors are getting dangerously good.

BigQuery Materialized View optimization

A quick post on a nice piece of functionality in BigQuery: if you create a materialized view based on table X, and you run a query on that table, the query optimizer is smart enough to reroute the query plan to use the materialized view instead of the base table; it is faster and substantially cheaper.

For example, let's take this table, which has a granularity of 5 minutes (75 million records).

Let's create a materialized view that aggregates at the day level:

CREATE MATERIALIZED VIEW
  XXXXX.ReportingDataset.UNITARCHIVE_Summary AS
SELECT
  StationName,
  DUID,
  DAY,
  SUM(Mwh) AS Mwh
FROM
  `XXXXXX.ReportingDataset.UNITARCHIVE`
GROUP BY
  1,
  2,
  3

The resulting MV is 16 MB and 350K rows. Notice that once you create an MV, you can forget about it: BigQuery makes sure it is kept up to date, and if for whatever reason the MV is stale, the query planner will fall back to the base table; please see the documentation for further details.

Now here is the interesting part: let's run a query that targets the base table.

Because the fields used in the query already exist in the MV, the query optimizer changes the plan to use the MV: instead of scanning GBs of data, it ends up scanning 10 MB.
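A query along these lines, using only grouping columns that exist in the MV definition above, is enough to trigger the rewrite:

-- This targets the 75M-row base table, but since StationName, DAY and Mwh
-- all exist in the MV, BigQuery transparently answers it from the MV.
SELECT
  StationName,
  DAY,
  SUM(Mwh) AS Mwh
FROM
  `XXXXXX.ReportingDataset.UNITARCHIVE`
GROUP BY
  1,
  2;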

How About BI Engine

When using BigQuery BI Engine, performance-wise both queries, whether against the base table or the MV, return results in milliseconds; but I suspect using the materialized view is still beneficial, as it uses fewer resources, which potentially means even more concurrency.
