Synapse Serverless vs BigQuery BI Engine using a dataset under 10 GB

Disclaimer : this is not a scientific reproducible benchmark.

Edit : TL;DR, don’t use Synapse Serverless with Direct Query mode in PowerBI, it is an extremely bad idea.

One aspect that bother me about technical blogs nowadays, it seems cost is rarely considered, when the subject is about a fixed cost Product like PowerBI Pro license ( 1/user/10$/Month) then it is fine, we know what to expect, but when we talk about usage based Pricing, the cost structure is extremely important, a solution may be great for certain usage load, but it became just exorbitant when the load increases.

One particular architecture that some people start promoting as some kind of magical solution is the use of Synapse Serverless as a logical Data-warehouse, and somehow it can be used too as a live Query layer to PowerBI, I will argue in this blog that this setup is simple too expensive.

Testing Synapse Serverless Indirectly

My thinking is very simple, I was not very excited by the prospect of paying 200 dollars just to test Synapse Serverless, Instead I will test it indirectly, BigQuery BI Engine has a nice functionality that show how much data was scanned, it is for information only, we don’t pay by data scanned but instead in-memory reserved ( 1$/GB compressed/Day, minimum 1 hour).

The approach here is to get the volume of Data scanned and multiply it by 5$/TB (Synapse serverless Pricing) , I appreciate it is not 100 % accurate, but I hope will show a general pattern.

Load test using a PowerBI report

The Fact table is 12 GB, 72 Millions rows that get data add every 5 minutes, the Model is a simple Star Schema, I am using dual mode for the dimension Tables

The test consist of using play Axis to loop on some dimension values every 5 second, I launched multiple copy of the same reports to generate more SQL Queries.

The Results

This table summarize the test results , more details in the report , please keep the filter between 22 Nov and 27 Nov 2021, as testing was done in that period.

BigQuery BI engine is very fast, but that’s not the subject of the current blog, what’s interesting here is the volume of data scanned 16,75 TB, that’s a lot of data, which does not count for the cache.

For simplification purpose we estimate the cache to be the same ratio as scanned TB (16.46 * 44.89 %/54.42 %) = 13.58 TB

Synapse Serverless is 25 X more expensive than BigQuery BI Engine, and even if they add the result set cache it will be still 14 X more expensive ( Same as BigQuery without reservation)

Key findings

Interactive BI reports generate a massive number of SQL Queries, in our example, it was 76K Queries, which simply make SQL Engine that are cost based on data scanned too Much expensive.( Synapase Serverless and BigQuery default mode)

This scenario will be better served by a dedicated capacity, but as of this writing Synapse does not support auto suspense and auto resume which make it too expensive , and in any case, Synapse dedicated pool does not scale down well for small data ( hopefully Gen3 will fix that)

BigQuery BI Engine make Direct Query on PowerBI a viable solution, which is a great achievement and still with very competitive pricing.

Synapse Serverless is an interesting SQL Query Engine, but it not designed for heavy interactive BI load, I just hope people stop suggesting otherwise.

I think next year the battle for 100 GB interactive sub second BI workload will be an interesting space to watch, let’s see what Dedicated Pool Gen 3 , Databricks, Snowflake and Firebolt will bring to the table 🙂

Analyse Big Spatial Dataset using hex bin maps in PowerBI

Come across this blog post in Twitter and I thought this approach can be very useful in PowerBI to deal with Big GIS dataset.

The idea is simple, let’s say you have a big spatial dataset with million of coordinates, so instead of plotting all the individual points which is unfeasible for performance reason, you simply do this,

– First generate a grid for the area you want to analyse using something like QGIS.

– Then count how my point belong to every cell in the grid and color the results accordingly, you need a spatial Join for that, either using QGIS , SQL Server etc

– Drill down to individual points using WKT multi point

Publish to web here and pbix

Generate a Grid System

After a lot of googling as I did not even know what to look for, turn out QGIS can very easily generate a Grid, here is a good tutorial I found

and here is a results for Chicago area as I am using it as an example for this blog, Please notice, the grid should have less than 30K polygons which is the maximum number of rows that PowerBI can plot.

Spatial Join

I could have used QGIS to do the spatial join, but my data is from BigQuery public dataset (bigquery-public-data.chicago_crime.crime) and for convenience, it is easier to do the join there, the previous link show how to do the join in QGIS.

After loading the grid geometry to BigQuery, I am using this Query to join the two dataset, Grid and my current example, Crime data in Chicago

CREATE OR REPLACE TABLE
  XXXXXX.GIS.crime AS
WITH
  crime AS (
  SELECT
    fbi_code,
    ST_GEOGPOINT(longitude,
      latitude) AS geo,
    longitude,
    latitude
  FROM
    `bigquery-public-data.chicago_crime.crime`)
SELECT
  fbi_code,
  geo,polygon,
  longitude,
  latitude,
  ID
FROM
  crime
JOIN
  XXXX.GIS.Grid AS Grid
ON
  ST_INTERSECTS(geo,
    polygon)

The Query is checking every coordinates and looking up which polygons it belong to, so the Grid table is something like this

And the Query results is like this, we go the ID from the Grid Dataset, Any DB that support geometry can be used SQL Server, PostgreSQL etc

PowerBI

Let’s import those two table to PowerBI, the two tables are joined by the field ID

Luckily PowerBI support WKT format using Icon Map, show the grid with the count of number of coordinates is straightforward

Ok, it is great we have an aggregation of 4 Millions points and we can see the areas with the highest Number of recorded crime, you can filter by FBI code which classify the crime gravity (I have zero knowledge about this dataset, I just used as an example)

Drill Down to individual Cell

because we have the individual points, all I need to do is to generate Multi point geometry using DAX , that’s why I love Icon Map so flexible , The measure will simply detect if the cell ID is filtered or not and then switch between Polygons and Multi point

Geometry =
VAR polygon =
    MIN ( Grid_View[polygon] )
VAR concat =
    CONCATENATEX ( VALUES ( crime[geo] ), crime[geo], "," )
VAR Multi_point = "MULTIPOINT (" & concat & ")"
RETURN
    IF ( ISFILTERED ( dummy_id[id] ), Multi_point, polygon )

Notice Here, as a PowerBI Visual can not filter itseelf, I had to add a dummy dimension

and here is the final result

Take Away

PowerBI Icon map and QGIS is a very powerful combination, and the fact that DAX can generate a geometry on the fly is very interesting Pattern.

Building Complex Data Model using Nested Data in BigQuery

was talking to someone I know on twitter, and got this cryptic reply

I have a vague idea of how to nest two tables, specially Header/Details, it is very easy to conceptualize but I never really understood Json format or how to get multiple levels.

Building a Nested Table

Just for fun I tried to convert this Data Model, a typical Multi fact Star Schema ( apparently it is called galaxy Schema)

I am not going to share the SQL Query that converted those 3 facts Table to 1 Nested Table ( because it is probably horrible), it took me some time to get a mental image how to stack the common dimensions first, and started nesting the common values at lower level, anyway I end up with this schema

And here is a Preview

Querying Nested Table

Front end tools don’t support nested Data by default, you have to flattened it first using SQL , Exception Data Studio but you have to be careful of this bug

A recent interesting development, Looker out of nowhere introduced a New Open Source Analytics Language called “Malloy” , and it has a first class support for nested Data, see example here

It is just a fun exercise, BigQuery support Star Schema just fine, but according to people who work with Big Data, let’s say 2 billion rows and such, not doing join is extremely Faster and cheaper, and apparently it is easier to transfer Data, after all moving one Table is easier than moving three.

Now even if all what you do is Star Schema, it is fruitful to learn other approaches, turn out Json format is useful aft all 🙂

Model Header/Details Schema using Nested Data in BigQuery

I was watching a Video by Alberto regarding the best way to model Header/Details Schema and how Star Schema is always the best option at least in PowerBI , even if you have never used PowerBI I think it is a great video to watch.

An interesting information I learned from the Video, although PowerBI columnar database is extremely fast in joins, still when the join column has a massive number of distinct values, performance will degrade, in his case, one of the fact has 1.4 B rows and 94 million distinct values on the join !!!, the solution was basically to flatten the two fact Tables.

ok, what this has to do with BigQuery !! for some reason, an idea occurred to me, this scenario will be a perfect scenario for Nested Data, but as I never used nested data before, in this blog, I am just messing around and see how it goes !!!

The Dataset

As usual, I will use the Australia electricity market as an example, it has 72 Million records with a size of 10 GB uncompressed, here is a sample f the Data, the dataset is public using this adreess

“test-187010.ReportingDataset.UNITARCHIVE”

As you can see the same attribute is repeated for every 5 minutes, what change is the date and the production in Mw

let’s see how it looks as a nested format

in a pure denormalized form they are two tables in reality, the dimension which contains the generator attribute and the production every 5 minutes

How to Nest data in BigQuery

I literally copied from this blog by Lak, just go and read it, I will just show the SQL Query here

drop table if exists XXXXX.nested.UNITARCHIVE_nested;
CREATE OR REPLACE TABLE
  XXXXX.nested.UNITARCHIVE_nested
PARTITION BY
 RANGE_BUCKET(DUID_ID, GENERATE_ARRAY(1,4000, 1))
CLUSTER BY StationName,DAY
AS
SELECT
  UNIT,
  tt.DUID,
  StationName,
  DUID_ID,
  Region,
  FuelSourceDescriptor,
  Technology,
  latitude,
  longitude,
  Tech,
  DAY,
  cast(SUM(mwh) as numeric) AS Mwh,
  ARRAY_AGG(STRUCT( time,
      INITIALMW)
  ORDER BY
    time ASC) AS obs
FROM
  test-187010.ReportingDataset.UNITARCHIVE tt
  left join (select DUID,row_number() over() as DUID_ID
  
FROM
  test-187010.ReportingDataset.UNITARCHIVE
  group by 1 ) as xx
  on tt.DUID=XX.duid
WHERE
  unit="DUNIT"
GROUP BY
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11

Compare Performance

For the same level of details, the size went from 10.1 GB to 1.13 GB !!! the cost of storage is rather insignificant, but if you use Query on demand, you pay by size scanned, the less the better.

Let’s check the total sum of Mwh , the same Query went from 1.1 GB to 5.1 MB !!! that nearly 100 X Cheaper ( in BigQuery you pay a minimum of 10 MB)

let’s check the 5 top days by production, same went from 1.6 GB to 7.7 MB that’s 160 X Cheaper , and faster too from 0.9 Second to 0.6 Second, it is funny both return subsecond results and it is not cached 🙂

What’s the catch !!!

There is a catch though, when you Query inside the Nested Data, the performance is nearly the same, went from 1.6 GB to 1.1 GB still it is a saving of 0.5 GB, but what surprised me, unnesting is a bit faster than reading from flat data, I was expecting the opposite

Visualise the Results in your Favorite BI Tools

I am using PowerBI to show the final results, using Direct Query Mode, the Dimensions are Imported to improve the slicer performance, Obviously it works with any BI tools that support SQL.

Looker has a nice support for nested data, unfortunately there is no free version to test it.

As of this Writing BI Engine does not support Unnest Data,Hopefully they fix the issue very soon., Nice, it is supported under BI Engine SQL Interface !!!!

And a sample report using nested data (5 minute Dimension)

Take Away

let’s say you have a Header and details tables, there are some options

  • Join at run time in the BI tools, can be slow for high cardinality joins
  • Merge the two Fact at the ETL Stage, very fast but potentially generate multiple grains and you need to using distinct count to remove duplicates
  • Nested Data seems to be very elegant solution, no joins and you keep the granularity and it is cheaper to Query