PowerBI Hybrid Tables: you can have your cake and eat it too

Hybrid tables are a clever technical solution to a very fundamental problem in data analytics: how to keep data fresh and fast at the same time. PowerBI, Tableau and Qlik solve this problem by importing data into a local cache. This works for most use cases, but as with any solution it has limitations:

  • If the data source is too big, you can't simply keep importing it into the local cache.
  • If the data source changes very frequently, every couple of minutes or even seconds, importing becomes impractical or very hard.

The PowerBI engine team came up with a very simple idea: you can have both modes in the same table. Historical data that doesn't change is cached, and today's data, which changes very quickly, is queried live. Patrick from Guy in a Cube has a great video, and Andy has another video specific to Synapse Serverless.

This functionality was released in the December 2021 version of PowerBI, but unfortunately when I tested it with BigQuery, it did not work. I reported the issue, and I have to say I was really impressed by the product team (kudos to Christian Wade and Krystian Sakowski): yesterday they released an updated version that fixed the issue. (It works with Snowflake, Databricks, etc.)

Setup

It is literally just one extra box to tick compared to the previous incremental refresh user interface.

Yes, just like that. The engine will generate the table partitions behind the scenes. If you want to know why PowerBI is so successful, it is because of stuff like this: take a very hard problem and make it extremely easy for non-technical people to use.

The data model is very simple: one fact table with data that changes every couple of minutes, and a Date dimension in mixed mode (watch Patrick's video, he explains why).

Premium Only

Yes, it is a Premium-only feature, and obviously it works with Premium Per User. I am not going to complain, someone needs to pay for those R&D costs, but it would be really nice if they released it for the Pro license too; it just feels odd that a core feature of the engine is tied to a particular license. We had this situation before with incremental refresh, and they ended up releasing it even for the free license. I hope the same happens with hybrid tables.

Mixed Partitions

I published the report to the service and used Tabular Editor to see what's going on behind the scenes (make sure you download the latest version of Tabular Editor; it works with the free version too).

[Image: the hybrid table partitions shown in Tabular Editor]

As expected, the last partition is in DirectQuery mode, and everything else is cached in PowerBI.

How it Works

I used DAX Studio to capture what the engine does when you run a query.

[Image: the query trace captured in DAX Studio]

The PowerBI formula engine sends two queries: one to the remote DB (in my case BigQuery) and one to the local storage. You can clearly see the difference in speed:

1 day using DirectQuery: 2 seconds (the query takes 400 ms at the endpoint, but BigQuery has a very substandard ODBC driver)

13 months' worth of cached data: 47 ms
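To make the mechanics concrete, here is a hypothetical sketch of the shape of the query the DirectQuery partition sends to BigQuery; the actual SQL is generated by the engine, and the project, dataset, table and column names below are made up for illustration only:

SELECT
    sales_date,
    SUM(sales_amount) AS total_sales          -- aggregation pushed down to BigQuery
FROM `myproject.mydataset.fact_sales`
WHERE sales_date >= DATE '2021-12-15'         -- only the live (today) partition
GROUP BY sales_date;

The matching query against the local cache covers everything before that date, which is why the historical part stays so fast.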

The point is: if you can just import, do it, you will get the best performance and user experience.

(At work we have a sub-5-minute pipeline end to end, from the source DB to PowerBI.)

The Devil is in the details

As far as I can tell, the formula engine keeps sending two queries every time, even when the required data is already cached; obviously the query to the external DB will return an empty result. In theory it should not be a big deal: modern data warehouses are fast, especially with partition pruning.

Unfortunately no: only some databases can return a sub-second empty result set to PowerBI (yes, the quality of the driver is as important as the DB engine itself).

Take Away

It is a very interesting solution, worth testing for specific scenarios, but if you can get away with importing data only, that is still the best way. Yes, hybrid tables reduce the workload on the remote database, but you still need a solid database: getting a sub-second query end to end is a hard problem even for one day's worth of data (just test it, and don't forget concurrency).

I heard about a different use case, which I find very intriguing: some users want it the other way around, recent data as import and historical data as DirectQuery. I guess it is useful if you have a really big fact table.

A surprising side effect of PowerBI hybrid tables (maybe it was planned, who knows): Synapse Serverless in DirectQuery mode now looks like a very good candidate, as scanning one day of data is faster and an order of magnitude cheaper!

I still hope the Vertipaq engine team surprises us in a future update and somehow lets the formula engine generate only one query when all the data needed is in the local cache.

Benchmarking Synapse Serverless using TPC-H-SF10

In a previous blog, I benchmarked a couple of database engines. Although it was not a rigorous test, the results were pretty much in the expected range, except for Synapse Serverless: I got some weird results, and I am not sure if that is by design or I am doing something very wrong, so I thought it worth showing the steps I took, hoping to find out what exactly is going on.

First Check: Same Region

I am using Azure Storage in Southeast Asia.

My Synapse instance is in the same region.

OK, both are in the same region; that's the first best practice covered.

Loading Data into Azure Storage

The 8 Parquet files are saved in this Google Drive, so anyone can download them.

Define Schema

In Synapse, you can start querying a file directly without defining anything, using OPENROWSET. I thought I would test TPC-H Query 1, as it uses only one table, but it did not work: some kind of case-sensitivity issue. While writing this blog I ran the same query again and it worked just fine (no idea what changed).
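For reference, a minimal OPENROWSET sketch over one of the Parquet files (the storage URL is a placeholder):

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://xxx.dfs.core.windows.net/tpch/lineitem.parquet',
    FORMAT = 'PARQUET'
) AS [result];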

One minute on a second run? Hmm, not good. Let's try a proper external table; the data source and file format were already defined, so no need to recreate them.

CREATE EXTERNAL TABLE lineitem_temp (
	[L_ORDERKEY] bigint,
	[L_PARTKEY] bigint,
	[L_SUPPKEY] bigint,
	[L_LINENUMBER] bigint,
	[L_QUANTITY] float,
	[L_EXTENDEDPRICE] float,
	[L_DISCOUNT] float,
	[L_TAX] float,
	[L_RETURNFLAG] nvarchar(1),
	[L_LINESTATUS] nvarchar(1),
	[L_SHIPINSTRUCT] nvarchar(25),
	[L_SHIPMODE] nvarchar(10),
	[L_COMMENT] nvarchar(44),
	[l_shipdate] datetime2(7),
	[l_commitdate] datetime2(7),
	[l_receiptdate] datetime2(7)
	)
	WITH (
	LOCATION = 'lineitem.parquet',
	DATA_SOURCE = [xxx_core_windows_net],
	FILE_FORMAT = [SynapseParquetFormat]
	)
GO


SELECT COUNT(*) FROM dbo.lineitem_temp
GO

A proper table, with data types and all.

let’s try again the same Query 1

OK, 2 minutes for the first run. Let's try another run, which will use statistics, so it should be faster: 56 seconds (btw, you pay for those statistics too).
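Serverless creates statistics on external tables automatically as queries run, but you can also create them yourself; a sketch, with my own choice of column:

CREATE STATISTICS lineitem_shipdate_stats
ON dbo.lineitem_temp (l_shipdate)
WITH FULLSCAN, NORECOMPUTE;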

Not happy with the results, I asked Andy (our Synapse expert), and he was kind enough to download and test it. He suggested that splitting the file gives better performance; he got 16 seconds.

CETAS to the rescue

CREATE EXTERNAL TABLE AS SELECT is a very powerful piece of functionality in Serverless, and the code is straightforward:

CREATE EXTERNAL TABLE lineitem 
	WITH (
	LOCATION = '/lineitem',
	DATA_SOURCE = [xxxx_core_windows_net],
	FILE_FORMAT = [SynapseParquetFormat]
	)
as
SELECT * FROM dbo.lineitem_temp

Synapse will create a new table, lineitem, with the same data types, and a folder that contains multiple Parquet files.

That’s all what you can do, you can’t partition the table, you can’t sort the table, but what’s really annoying you can’t delete the table, you have first to delete the table from the database then delete the folder

But at least it is well documented.

Anyway, let’s see the result now

Not bad at all: 10 seconds and only 587 MB scanned, compared to 50 seconds and 1.2 GB.

Now that I know CETAS gives better performance, I have done the same for the remaining 7 tables.

Define all the tables

First create an external table to define the types, then a CETAS. Synapse did a great job guessing the types, it is Parquet after all, but varchar is annoying: by default it is 4000, and you have to adjust it to the correct length manually.

The TPC-H document contains the exact schema.
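For example, the small region table can be declared with tight lengths taken from the spec instead of the nvarchar(4000) default (the data source and file format names follow the earlier definitions):

CREATE EXTERNAL TABLE region_temp (
	[R_REGIONKEY] int,
	[R_NAME] nvarchar(25),
	[R_COMMENT] nvarchar(152)
	)
	WITH (
	LOCATION = 'region.parquet',
	DATA_SOURCE = [xxx_core_windows_net],
	FILE_FORMAT = [SynapseParquetFormat]
	)
GO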

Running the Test

The 22 queries are saved here. I had to make some changes to the SQL, changing LIMIT to TOP and extract(year from x) to YEAR(x). Query 15 did not run; I asked the question on Stack Overflow and wBob kindly answered it very quickly.
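An illustrative before/after of those rewrites (the query itself is made up; only the two constructs matter):

-- Original, PostgreSQL-flavoured:
--   SELECT extract(year FROM l_shipdate) AS l_year, count(*) FROM lineitem GROUP BY 1 LIMIT 10;
-- T-SQL version accepted by Synapse:
SELECT TOP 10 YEAR(l_shipdate) AS l_year, COUNT(*) AS cnt
FROM dbo.lineitem
GROUP BY YEAR(l_shipdate);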

On the first run, I found some unexpected results.

I thought I was doing something terribly wrong: the query duration seemed to increase substantially. After that I started messing around, and what I found is that if you run just one query at a time, or even 4, the results are fine; more than that, and performance deteriorates quickly.

A Microsoft employee was very helpful and provided this script to query the database history.
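At its core the script reads the serverless DMV sys.dm_exec_requests_history; a trimmed-down version (the full script returns much more detail):

SELECT TOP 100 *
FROM sys.dm_exec_requests_history
ORDER BY start_time DESC;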

I imported the query history into PowerBI, and here are the results.

There is no clear indication in the documentation that there is a very strict concurrency limitation. I tried running the script in SSMS and got the same behavior; it seems to me the engine is adding the queries to a queue, and there is a bottleneck somewhere.

Take Away

The good news: the product team made it very clear that Synapse Serverless is not an interactive query engine.

Realistically speaking, reading from Azure Storage will always be slower than reading from local SSD storage, so no, I am not comparing it to other DWH offerings. Having said that, even for exploring files on Azure Storage, the performance is very problematic.

Benchmark Snowflake, BigQuery, SingleStore and Databricks using TPC-H-SF10

Another blog on my favorite topic, interactive live BI workloads with low latency and high concurrency, but this time, hopefully, with numbers to compare.

I tested only the databases I am familiar with: BigQuery, Snowflake, Databricks and SingleStore.

Note: I decided to skip Synapse dedicated pool; I don't know enough to load the data in an optimized way.

Edit: I added Synapse Serverless and DuckDB.

How About OLAP Cubes

A lot of vendors, particularly Microsoft, think you don't need a very fast query engine: just load your data into a Vertipaq cube (an in-memory engine) and go from there. That's a very sensible choice, and it is what I use personally.

But for customers using Looker, Superset, Mode, etc., which do not have an internal query engine, fast SQL query times are very important.

My own speculation is that DWHs are getting fast enough that maybe we will not need an OLAP cube in the middle.

TPC-H

The most widely used benchmarks to test BI workloads are TPC-DS and TPC-H, produced by the independent organization TPC. Unfortunately, most of the available benchmarks are for big datasets, starting from 1 TB. As I said before, I am more interested in smaller workloads, for a simple reason: after nearly 5 years of doing business intelligence for different companies, most of the data models I see are really small (my biggest was 70 million rows with 4 small dimension tables).

Benchmarking is a very complex process, and I am not claiming that my results are correct. All I wanted to know, as a user, is an order of magnitude; a benchmark can give you a high-level impression of a database's performance.

Schema

I like TPC-H as it has a simpler schema (8 tables) and only 22 queries, compared to TPC-DS, which requires 99 queries.

[Image: TPC-H schema diagram]

Some Considerations

  • Results cache is not counted.
  • The results use a warm cache, with at least one cold run; I ran the 22 queries multiple times.
  • Databricks by default provides a sample database, TPC-SF05, whose main table lineitem is 30 million rows. I don't know enough to import the data and apply the proper sorting etc., so I preferred to use the smaller dataset. I did create a local copy using CREATE TABLE AS SELECT.
  • Snowflake and SingleStore provide SF10 and other scales by default.
  • BigQuery: I imported the data from Snowflake and sorted the tables for better performance (see the sketch after this list). It is a bit odd that BigQuery doesn't provide such an important public dataset by default.
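In BigQuery, "sorting" effectively means clustering; a hypothetical sketch of the copy (project and dataset names are placeholders):

CREATE TABLE `myproject.tpch.lineitem`
CLUSTER BY l_shipdate
AS
SELECT * FROM `myproject.tpch_staging.lineitem`;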

No Results Cache

Most DWHs support a results cache: basically, if you run the same query and the base tables did not change, the engine returns the same results very quickly. Obviously, in any benchmark, you need to filter those queries out.

  • In Snowflake you can use this statement to turn the results cache off:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
  • In Databricks:
SET use_cached_result = false
  • In BigQuery, you just untick an option in the UI.
  • SingleStore does not have a results cache per se; the engine keeps a copy of the query plan, but it scans the data every time.

Warm Cache

Snowflake, SingleStore and Databricks leverage the local SSD cache: when you run a query for the first time, it scans the data from cloud storage, which is a slow operation; when you run it again, the query will try to use the data already copied to the local disk, which is substantially faster. Especially with Snowflake, if you want to keep the local cache warm, it makes sense to keep your cluster running a bit longer.

BigQuery is a different beast: there is no VM, and the data is read straight from Google Cloud Storage. Yes, the Google Cloud network is famous for being very fast, but I guess it cannot compete with a local SSD disk. Anyway, that's why we have BI Engine, which basically caches the data in RAM; but not all queries are supported, and actually only 6 are fully accelerated as of this writing (see Limitations).

Query History

Getting the query history is very straightforward using information_schema, except for Databricks, where it seems to be supported only through an API; I just copied one warm run, pasted it into Excel and loaded it from there.
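In Snowflake, for example, a minimal sketch (the warehouse name is a placeholder):

SELECT query_text, total_elapsed_time
FROM TABLE(information_schema.query_history())
WHERE warehouse_name = 'MY_WH'
ORDER BY start_time DESC;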

Engines Used

  • Snowflake: X-Small (lowest tier)
  • Databricks: 2X-Small (lowest tier)
  • SingleStore: S-0
  • BigQuery: on-demand + 1 GB reservation of BI Engine

Results

The 22 queries are saved in this repo; I am using PowerBI to combine all the results.

Let's start with:

Snowflake VS BigQuery

Snowflake VS SingleStore

Snowflake VS Databricks

Notice that Databricks is using the smaller dataset, SF05 (30 million rows), and Snowflake still shows better performance.

Overall

Edit: due to feedback, I am adding the sum of all queries. You can download the results here.

Synapse Serverless seems to hit a bottleneck; I will update this when I find the issue, or maybe it is by design.

Take Away

  • Snowflake is very fast and has consistent results across all 22 queries; only Query 13 is a bit odd.

  • SingleStore is remarkable, but Query 13 is not good at all and skews the overall performance.

  • BigQuery is fantastic when BI Engine works (not a lot, I would say).

  • Databricks' performance on TPC-H-SF05 is problematic; I just hope they release a proper TPC-H-SF10 dataset and an information schema like the other DWHs.

I think this workload is very relevant, and I hope vendors start publishing their own results; in the near future, maybe all 22 queries will render under a second.

First Look at SingleStore

I could have written a nice paragraph about why I got interested in SingleStore, but to be honest the reason is very simple and has nothing to do with the tech: Jordan Tigani, one of the founding engineers of BigQuery, is now their Chief Product Officer, so I became very curious 🙂

Again, I am only interested in small interactive BI workloads. Contrary to the usual suspects (BigQuery, Snowflake, etc.), SingleStore is not a pure data warehouse but rather a multi-purpose database: it handles OLTP workloads but has excellent support for OLAP workloads too, and the analytical side is all I am testing here.

Setup

There is a free trial with $500 of credit, and the setup was very intuitive; I really liked the way you create a new cluster. Notice I don't have an account with AWS, but it is a software-as-a-service experience: SingleStore manages everything on behalf of the user. I chose AWS as they support the Sydney region.

The smallest tier starts at 0.25 credits/hour, which costs $0.65/hour. Unlike Snowflake and Databricks, there is no auto-suspend and auto-resume; you have to do it manually.

For some reason, suspending a cluster is not available on Google Cloud!

The console has the bare minimum but is functional; there are no multiple tabs, so if you run a query, you need to wait until it is done before running another one.

There is an odd choice in the UI: when you want to monitor the cluster, you need to open a separate page called SingleStore Studio.

It is not the end of the world, but a bit annoying when you are new to the product

Loading Data

There is sample data you can quickly load to start running queries, but I wanted to test only my own dataset, TPC-H-SF10 (nice surprise: it was added this week).

Although my cluster is in AWS, loading files from Google Cloud was trivial; all I had to do was set up a new pipeline.

First, define the table; notice the clustered columnstore key:

CREATE TABLE `orders` (
`o_orderkey` bigint(11) NOT NULL,
`o_custkey` int(11) NOT NULL,
`o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_totalprice` decimal(15,2) NOT NULL,
`o_orderdate` date NOT NULL,
`o_orderpriority` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_clerk` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_shippriority` int(11) NOT NULL,
`o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
SHARD KEY (`o_orderkey`) USING CLUSTERED COLUMNSTORE
);

Define a new pipeline

CREATE OR REPLACE PIPELINE `LoadGCPorders`
AS LOAD DATA GCS 'xxxxxxxxxxxxx'
CREDENTIALS '{"access_id": "xxx", "secret_key": "xxxxxx"}'
INTO TABLE tpch.orders
(`O_ORDERKEY` <- `O_ORDERKEY`,
 `O_CUSTKEY` <- `O_CUSTKEY`,
 `O_ORDERSTATUS` <- `O_ORDERSTATUS`,
 `O_TOTALPRICE` <- `O_TOTALPRICE`,
 `O_ORDERDATE` <- `O_ORDERDATE`,
 `O_ORDERPRIORITY` <- `O_ORDERPRIORITY`,
 `O_CLERK` <- `O_CLERK`,
 `O_SHIPPRIORITY` <- `O_SHIPPRIORITY`,
 `O_COMMENT` <- `O_COMMENT`)
FORMAT PARQUET
Then run the pipeline, and the data is loaded automatically. Very nice:

START PIPELINE LoadGCPorders FOREGROUND;
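To check what the pipeline did, SingleStore exposes pipeline metadata in information_schema; a minimal sketch (treat the exact column set as an assumption):

SELECT * FROM information_schema.PIPELINES
WHERE PIPELINE_NAME = 'LoadGCPorders';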

Testing using TPC-H SF10 Benchmark

To test all 22 queries of the benchmark, I used the same script as for BI Engine. Here are the results after 10 runs, using the dataset provided by SingleStore (on an S-0 cluster; see pricing here).

A lot of queries are already under a second, even on a lower tier! Query 13's result is a bit odd.

If I understood correctly, SingleStore does not have a results cache: when you run the same query again, SingleStore reuses the stored query plan but scans the data again. Although the data is stored on disk, metadata about the tables is kept in-memory (tables for OLTP workloads are always in-memory).

The previous chart was built using Google Data Studio. As of this writing, PowerBI does not have a native connector; you need to download a custom connector, which means you need a gateway, and I am not sure if DirectQuery is supported at all. I quickly used the MySQL connector, which works fine, but in import mode only (SingleStore is compatible with MySQL tools).

Take Away

I was really impressed by the product. We all hear about operational analytics, and it seems SingleStore has a good solution. There are missing functionalities though: auto-suspend and resume are not available yet, and the lack of a native PowerBI connector is very problematic; but it is really fast and handles write workloads too.
