How PowerBI Composite Model Works

In a previous blog, I showed that Dual mode is a very good pattern when building a PowerBI Model that uses Direct Query, but for it to work, both tables need to use the same data source: you can't physically join a table from SQL Server with a table from Excel. Still, the PowerBI engine manages to do exactly that using a clever trick. To explain how it works, I built two Models, one using Dual mode and another using a Composite Model, and then compared the behavior.

Note: Kasper has a great video explaining how everything works behind the scenes.

Composite Model

The diagram already gives an indication that the two dimension tables are imported into the local cache, and that the relationship is a bit different from a "normal" relationship; I think the official term is a weak relationship.

To understand how this special join works, let's try a simple query: show me the total MWh of coal production.

And here is the SQL query generated by the PowerBI engine; at first sight it seems very weird!

select `DUID`,
    `C1`
from 
(
    select `DUID`,
        sum(`Mwh`) as `C1`
    from 
    (
        select `DUID`,
            `DAY`,
            `time`,
            `Mwh`
        from `test-187010`.`ReportingDataset`.`UNITARCHIVE`
        where `DUID` in ('APPIN', 'BRAEMAR2', 'BRAEMAR5', 'BW02', 'OAKY2', 'TARONG#1', 'LD03', 'MP1', 'BW01', 'MORANBAH', 'LYA3', 'MP2', 'KPP_1', 'TAHMOOR1', 'TARONG#3', 'LYA2', 'CPP_3', 'BW04', 'TNPS1', 'TARONG#4', 'LYA1', 'BW03', 'OAKYCREK', 'GROSV1', 'TARONG#2', 'LYA4', 'CPP_4', 'GROSV2', 'VP6', 'CALL_B_1', 'WILGAPK', 'GSTONE5', 'VP5', 'LOYYB2', 'CALL_B_2', 'WILGB01', 'GSTONE3', 'STAN-1', 'LOYYB1', 'CALL_A_4', 'DAANDINE', 'GSTONE6', 'YWPS1', 'ER01', 'GERMCRK', 'GSTONE2', 'STAN-2', 'YWPS2', 'ER03', 'MBAHNTH', 'STAN-3', 'YWPS3', 'TERALBA', 'GSTONE4', 'STAN-4', 'YWPS4', 'ER02', 'GSTONE1', 'LD02', 'ER04', 'TOWER', 'BRAEMAR3', 'LD01', 'BRAEMAR6', 'MPP_1', 'GLENNCRK', 'BRAEMAR1', 'LD04', 'BRAEMAR7', 'MPP_2')
    ) as `ITBL`
    group by `DUID`
) as `ITBL`
where not `C1` is null
LIMIT 1000001 OFFSET 0

The fact table in Direct Query mode contains only DUID, which is the code for the station name (coal power plant, solar farm, wind farm, etc.). The remote source here is BigQuery, which has no idea what "Coal" means, as it is not a field defined in the table.

The PowerBI engine is smart enough to know which DUIDs belong to Coal, as that is defined in the dimension table; it gets those values, injects them as a filter into the SQL query, sends the query to the source system, and gets the results back.

To be honest, I like this approach very much: with the usage-based databases I use, Synapse Serverless and BigQuery, you pay a minimum of 10 MB per table, so if you can avoid joins and pass everything as filters, you save a bit of money.

Does it scale, though?

To test it, I built two identical visuals, one using the Composite Model and the other using Dual mode.

So far, so good, nearly the same performance (hard to believe it is 80 million rows, and the region is Tokyo).

Now let's add a date dimension: show me MWh per state per year.

That's not good, 12 seconds is definitely not interactive. My first gut feeling: BigQuery slowed down because of all those filter values. Let's check.

Composite Model: 2.1 seconds, and notice it billed only 10 MB (I am using a materialized view on the base table).

And now Dual mode, which does the joins at the source; that's why I am billed 30 MB (Synapse Serverless does the same).
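I did not capture the exact SQL that Dual mode generated, but it has roughly this shape, with the joins pushed down to BigQuery; the dimension table and column names below are placeholders, not the real ones:

select `D`.`State`,
    `C`.`Year`,
    sum(`F`.`Mwh`) as `Mwh`
from `test-187010`.`ReportingDataset`.`UNITARCHIVE` as `F`
inner join `test-187010`.`ReportingDataset`.`DUID_DIM` as `D`
    on `F`.`DUID` = `D`.`DUID`
inner join `test-187010`.`ReportingDataset`.`CALENDAR_DIM` as `C`
    on `F`.`DAY` = `C`.`DAY`
group by `D`.`State`,
    `C`.`Year`

Because the grouping and the joins happen at the source, only the handful of rows needed by the visual come back to PowerBI.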

Data Transfer is the bottleneck

OK, so doing the join at the source is faster, but that still does not explain the big difference observed in PowerBI. Now let's check the result set returned by each query.

Dual Mode

35 rows, the same level of granularity as the visual

Composite Model

375K rows returned. Yes, that is correct: in composite mode, PowerBI knows nothing about Year and Region at the source, so it has to get everything at the DUID and Day level and then group it locally using the special join.
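The composite query, by contrast, can only group by the columns that physically exist in the remote fact table, so it comes back at DUID and Day granularity, roughly like this (a simplified sketch, not the verbatim generated SQL):

select `DUID`,
    `DAY`,
    sum(`Mwh`) as `C1`
from `test-187010`.`ReportingDataset`.`UNITARCHIVE`
group by `DUID`,
    `DAY`

That DUID-by-Day result set is what PowerBI then has to download and aggregate locally against the station and date dimensions.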

Downloading 370K rows will be slow and not very efficient for everyone involved. Yes, I know, you can add the year and region columns to the fact table, but in that case we may as well use a flat table and call it a day. (I am joking, you still need dimensions, especially if you want to join another fact table.)

So is the Composite Model bad? Absolutely not, but there is no free lunch: if you use it with dimensions that generate a small number of rows, it is fine; otherwise it can be slow. Data warehouses are fast, but data transfer is always a problem.

How about Direct Query for PowerBI Datasets?

It works the same way: two remote PowerBI Datasets are completely isolated from one another, PowerBI just sees them as separate servers, and the join works by passing filter values around. Vertipaq is very fast, though, and since all the datasets live in the same service, I suspect it is less of a problem. But if you are not careful with high-cardinality dimensions, it may slow down the experience.

This is an example of a composite model between two very small tables from two PowerBI Datasets; the DAX query is passing day filters around. It is still fast, but the more you add, the slower it gets.

We don't use the Composite Model at work, as it currently requires Build permission for every user, and I find that sometimes rebuilding a model from scratch is more practical than trying to decipher someone else's disconnected-table measure shenanigans. I think we currently use it only for one special model that shows a summary of all KPIs from all existing Models, grouped at a very high level.

The perfect use case for the Composite Model is when you have a mature enterprise Model and you only need to add a special dimension, like a different hierarchy; then it is just perfect. For anything else you need to be rather careful, or you may end up with spaghetti Models all over the place.

What if?

But I have to admit, the concept is very tempting and makes you wonder: what if we could somehow join two arbitrary datasets using a real join? Vertipaq engineers are clever, and they could figure it out. What if the PowerBI service accepted a DAX query and loaded not the whole Model but just the columns used by the query, maybe even only the partitions needed? What if, in the PowerBI service, you had datasets just for storing data by department, and a lot of lightweight logical Models in Direct Query mode?

Total separation of storage, compute and the semantic Model, all using the same tables; can you imagine what Vertipaq will look like in 2030?

Benchmarking Synapse Serverless using TPC-H-SF10

In a previous blog, I benchmarked a couple of database engines. Although it was not a rigorous test, the results were pretty much in the expected range, except for Synapse Serverless: I got some weird results, and I am not sure whether that is by design or I am doing something very wrong, so I thought it was worth showing the steps I took, hoping to find out what exactly is going on.

First Check: Same Region

I am using an Azure Storage account in Southeast Asia.

My Synapse instance is in the same region.

Ok both are in the same region, first best practice.

Loading Data into Azure Data Store

The 8 Parquet files are saved in this Google Drive, so anyone can download them.

Define Schema

In Synapse, you can start querying a file directly without defining anything, using OPENROWSET. I thought I could test TPC-H Query 1, as it uses only one table, but it did not work, some kind of case-sensitivity issue. While writing this blog I ran the same query again and it worked just fine (no idea what changed).
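For reference, the ad hoc pattern looks like this; the storage URL below is a placeholder for my container path:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/tpch/lineitem.parquet',
    FORMAT = 'PARQUET'
) AS rows;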

One minute on a second run, hmm, not good. Let's try a proper external table; the data source and file format were already defined, so no need to recreate them.

CREATE EXTERNAL TABLE lineitem_temp (
	[L_ORDERKEY] bigint,
	[L_PARTKEY] bigint,
	[L_SUPPKEY] bigint,
	[L_LINENUMBER] bigint,
	[L_QUANTITY] float,
	[L_EXTENDEDPRICE] float,
	[L_DISCOUNT] float,
	[L_TAX] float,
	[L_RETURNFLAG] nvarchar(1),
	[L_LINESTATUS] nvarchar(1),
	[L_SHIPINSTRUCT] nvarchar(25),
	[L_SHIPMODE] nvarchar(10),
	[L_COMMENT] nvarchar(44),
	[l_shipdate] datetime2(7),
	[l_commitdate] datetime2(7),
	[l_receiptdate] datetime2(7)
	)
	WITH (
	LOCATION = 'lineitem.parquet',
	DATA_SOURCE = [xxx_core_windows_net],
	FILE_FORMAT = [SynapseParquetFormat]
	)
GO


SELECT count (*) FROM dbo.lineitem_temp
GO

A proper table, with data types and all.

Let's run the same Query 1 again against this table.
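For reference, TPC-H Query 1 adapted to T-SQL looks roughly like this; the only real change from the reference SQL is the date interval arithmetic, and this is a sketch rather than character-for-character what I ran:

SELECT L_RETURNFLAG,
    L_LINESTATUS,
    SUM(L_QUANTITY) AS sum_qty,
    SUM(L_EXTENDEDPRICE) AS sum_base_price,
    SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS sum_disc_price,
    SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)) AS sum_charge,
    AVG(L_QUANTITY) AS avg_qty,
    AVG(L_EXTENDEDPRICE) AS avg_price,
    AVG(L_DISCOUNT) AS avg_disc,
    COUNT(*) AS count_order
FROM dbo.lineitem_temp
WHERE l_shipdate <= DATEADD(day, -90, CAST('1998-12-01' AS date))
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS;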

OK, 2 minutes for the first run. Let's try another run, which will use statistics, so it should be faster: 56 seconds (by the way, you pay for those statistics too).

Not happy with the results, I asked Andy (our Synapse expert), and he was kind enough to download the data and test it. He suggested that splitting the file gives better performance; he got 16 seconds.

CETAS to the rescue

Create External Table As Select (CETAS) is a very powerful feature in Serverless, and the code is straightforward.

CREATE EXTERNAL TABLE lineitem 
	WITH (
	LOCATION = '/lineitem',
	DATA_SOURCE = [xxxx_core_windows_net],
	FILE_FORMAT = [SynapseParquetFormat]
	)
as
SELECT * FROM dbo.lineitem_temp

Synapse will create a new table, lineitem, with the same data types, and a folder that contains multiple Parquet files.

That's about all you can do: you can't partition the table, you can't sort it, and, what's really annoying, you can't simply delete it; you first have to drop the table from the database and then delete the folder yourself.
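Concretely, cleaning up a CETAS table is a two-step affair:

-- step 1: drop the table metadata from the database
DROP EXTERNAL TABLE dbo.lineitem;
-- step 2: the Parquet files are NOT removed by the DROP;
-- delete the /lineitem folder in the storage account yourself
-- (portal, Storage Explorer, az cli, ...)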

But at least it is well documented.

Anyway, let’s see the result now

Not bad at all: 10 seconds and only 587 MB scanned, compared to 50 seconds and 1.2 GB.

Now that I know CETAS gives better performance, I did the same for the remaining 7 tables.

Define all the tables

First create an external table to define the types, then a CETAS. Synapse does a great job guessing the types (I know, it is Parquet after all), but varchar is annoying: by default it is 4000, so you have to adjust it manually to the correct length.

The TPC-H specification contains the exact schema.

Running the Test

The 22 queries are saved here. I had to make some changes to the SQL, changing LIMIT to TOP and extract(year from x) to YEAR(x). Query 15 did not run; I asked the question on Stack Overflow and wBob kindly answered it very quickly.
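The changes were mechanical, along these lines (dbo.orders is the external table I assume was created for the orders data):

-- reference syntax:
--   select extract(year from o_orderdate) as o_year from orders limit 10;
-- T-SQL rewrite:
SELECT TOP 10 YEAR(o_orderdate) AS o_year
FROM dbo.orders;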

On the first run, I got some unexpected results.

I thought I was doing something terribly wrong; the query durations seemed to increase substantially. After that I started messing around, and what I found is this: if you run just one query at a time, or even 4, the results are fine; more than that, and the performance deteriorates quickly.

A Microsoft employee was very helpful and provided this script to query the database history.
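I won't reproduce the exact script here, but a minimal sketch against the Serverless query-history DMV looks roughly like this (check the documentation for the exact columns available in your workspace; among other things it exposes the duration and the data processed per query):

SELECT TOP 1000 *
FROM sys.dm_exec_requests_history
ORDER BY start_time DESC;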

I imported the query history into PowerBI, and here are the results.

There is no clear indication in the documentation that there is such a strict concurrency limitation. I tried running the script in SSMS and saw the same behavior; it seems to me the engine is queueing the queries, so there is a bottleneck somewhere.

Takeaway

The good news: the product team has made it very clear that Synapse Serverless is not an interactive query engine.

Realistically speaking, reading from Azure Storage will always be slower than reading from local SSD storage, so no, I am not comparing it to other DWH offerings. Having said that, even for exploring files on Azure Storage, the performance is very problematic.

First Look at Synapse Dedicated Pool

Edit, March 2022: I am keeping this blog post as it was my first impression at the time, but I have to admit I was wrong. Dedicated Pool is designed for big-data workloads, starting from terabytes; it does not work well at all for small datasets. Basically, I was testing it with a load it was not designed for.

For data sizes around 100 GB, SQL Server with a columnstore index is probably a better option.

In a previous blog, I showed that Synapse Serverless is not great for interactive BI workloads, which is rather obvious, as even Microsoft doesn't recommend it for that as a best practice.

The logical next step is to have a look at Dedicated Pool (formerly SQL DW), where you don't pay by data scanned but rather for reserved resources; this model makes more sense for a steady interactive workload.

To be very clear, this is a genuine first impression. I did not try any optimization (to be honest, I don't even know how to); all I wanted was to get a feel for the product.

Creating a new cluster is trivial. Notice that I already have a Synapse Analytics workspace; Microsoft is very good at building integrated solutions.

You can directly start running queries against data stored in Azure Storage, see the example here; in this case I am selecting Serverless.

Now for Dedicated Pool: I first had to start the cluster, as it was suspended, and you need to wait a couple of minutes.

And for some reason OPENROWSET did not work (I believe it is not supported in Dedicated Pool); instead I had to create an external table, which is still very easy.

The GUI generates the SQL script, very nice.

DUID is a short string, but I will leave the default of nvarchar(4000). For some reason the column SETTLEMENTDATE is recognized as a number when it should be a datetime (maybe the issue is with the Parquet file itself), but that's not a big deal for this blog; the point is you can query your data lake files using either Dedicated Pool or Serverless.

Native Storage

Dedicated Pool Gen2 does not support a real separation between storage and compute: when you load a table, it has to be allocated to a specific DWH, and cross-querying between DWHs is not supported (it has been planned for 5 years).

Loading data is very easy, same approach: a GUI generates a script.

And here is the generated script; I changed the text column to varchar(20).
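The pattern is essentially a round-robin columnstore table plus a COPY INTO statement, roughly like this (table, column and storage names here are placeholders, not the exact generated code):

CREATE TABLE dbo.unitarchive
(
    DUID varchar(20),
    SETTLEMENTDATE datetime2,
    Mwh float
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

COPY INTO dbo.unitarchive
FROM 'https://<storageaccount>.dfs.core.windows.net/data/unitarchive/*.parquet'
WITH (FILE_TYPE = 'PARQUET');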

Now, instead of loading only 1 file, I loaded 536 files; it took 8 minutes, which seems very slow.

Performance

Obviously this is not a benchmark, just a way to get a general feel. When I tested it with a very small table, I liked the results in PowerBI (Direct Query took 940 ms and DAX literally 5 ms).


A bit later, I tried it again with another table, and first I got an error.


It was my fault: when a cluster is suspended, it will not wake up automatically when it receives queries; you need to start it manually.

The PowerBI report is a simple Direct Query report with a few tables, just to generate multiple SQL queries.

Initially it felt alright, but when I added another instance of the report, things started to feel slow. I had a look at the service, and the SQL pool had already started adding queries to the queue!

Takeaway

The Good :

  • Integration with Azure Storage

The Bad :

  • $1.181/hour is far from cheap for an entry-level tier, especially since auto-suspend does not work.
  • Dedicated Pool is the only major DWH that does not separate storage from compute; you can't simply spin up a compute and cross-query another dataset.

The Ugly :

  • Performance and concurrency seem particularly poor out of the box.

I am sure there are optimizations that can be done, for example switching on result set caching (why is it not on by default?), maybe partitioning, etc., but I just wanted to give a first impression, and honestly, color me unimpressed. Gen3 had better be good.
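Result set caching, for example, is a database-level switch that has to be turned on explicitly; run this from the master database (the pool name is a placeholder):

-- once enabled, repeated identical queries can be served from the cache
ALTER DATABASE [YourDedicatedPool] SET RESULT_SET_CACHING ON;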

Edit: I got feedback from Andy that DW100c is limited to 4 concurrent queries.

Synapse Serverless vs BigQuery BI Engine using a dataset under 10 GB

Disclaimer : this is not a scientific reproducible benchmark.

Edit : TL;DR, don’t use Synapse Serverless with Direct Query mode in PowerBI, it is an extremely bad idea.

One aspect that bothers me about technical blogs nowadays is that cost is rarely considered. When the subject is a fixed-cost product like a PowerBI Pro license ($10/user/month), that's fine, we know what to expect; but when we talk about usage-based pricing, the cost structure is extremely important: a solution may be great for a certain usage load but become simply exorbitant when the load increases.

One particular architecture that some people have started promoting as a kind of magical solution is the use of Synapse Serverless as a logical data warehouse, and somehow also as a live query layer for PowerBI. I will argue in this blog that this setup is simply too expensive.

Testing Synapse Serverless Indirectly

My thinking is very simple: I was not very excited by the prospect of paying 200 dollars just to test Synapse Serverless, so instead I will test it indirectly. BigQuery BI Engine has a nice feature that shows how much data was scanned; it is for information only, as you don't pay by data scanned but rather for the memory reserved ($1/GB compressed/day, minimum 1 hour).

The approach here is to take the volume of data scanned and multiply it by $5/TB (the Synapse Serverless price). I appreciate it is not 100% accurate, but I hope it shows the general pattern.

Load test using a PowerBI report

The fact table is 12 GB, 72 million rows, with new data added every 5 minutes. The Model is a simple star schema, and I am using Dual mode for the dimension tables.

The test consists of using Play Axis to loop over some dimension values every 5 seconds; I launched multiple copies of the same report to generate more SQL queries.

The Results

This table summarizes the test results; more details are in the report. Please keep the filter between 22 Nov and 27 Nov 2021, as testing was done in that period.

BigQuery BI Engine is very fast, but that's not the subject of this blog. What's interesting here is the volume of data scanned, 16.75 TB; that's a lot of data, and it does not account for the cache.

For simplification purposes, we estimate the cached volume using the same ratio as the scanned TB: (16.46 * 44.89% / 54.42%) = 13.58 TB.

Synapse Serverless comes out 25X more expensive than BigQuery BI Engine, and even if a result set cache is added, it would still be 14X more expensive (the same as BigQuery without a reservation).
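As rough arithmetic, assuming the $5/TB Serverless price applies to both the scanned volume and the estimated cached volume:

(roughly 16.5 TB scanned + 13.58 TB estimated cache) * 5 $/TB ≈ 150 $ for the test window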

Key findings

Interactive BI reports generate a massive number of SQL queries, 76K in our example, which simply makes SQL engines that charge by data scanned far too expensive (Synapse Serverless and BigQuery's default mode).

This scenario would be better served by a dedicated capacity, but as of this writing Synapse does not support auto-suspend and auto-resume, which makes it too expensive; and in any case, Synapse Dedicated Pool does not scale down well for small data (hopefully Gen3 will fix that).

BigQuery BI Engine makes Direct Query in PowerBI a viable solution, which is a great achievement, and still at very competitive pricing.

Synapse Serverless is an interesting SQL query engine, but it is not designed for heavy interactive BI loads; I just hope people stop suggesting otherwise.

I think next year the battle for sub-second interactive BI on 100 GB workloads will be an interesting space to watch; let's see what Dedicated Pool Gen3, Databricks, Snowflake and Firebolt bring to the table 🙂
