First Look at BigQuery BI Engine with PowerBI

Google has made BigQuery BI Engine available in a public preview (you need to enroll first here). For the last two years it was available only for Google Data Studio, and I had used it extensively for this project, so I was really curious how it would work with PowerBI.

I don’t think I know enough to even try to produce a benchmark. Instead I am interested in only one question: how much value can I get from the lowest tier of BI Engine, and can PowerBI work smoothly enough to make Direct Query a realistic option?

The BigQuery team was nice enough to offer a 100 GB reservation free of charge for the preview period; just to keep it realistic, I kept the reservation at 1 GB, which costs $30 per month. I built a couple of reports in PowerBI and tried to observe how BI Engine behaves and to watch the query statistics. The report is located here; the data uses Direct Query mode, and the query statistics update every hour.

How BI Engine Works

It is extremely easy to set up: just select how much memory you want to reserve per project, and that’s all. You pay per GB reserved per hour.
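For reference, the reservation can also be managed with SQL; here is a minimal sketch assuming the ALTER BI_CAPACITY DDL statement (the project name is a placeholder, and this DDL may not have been available during the preview, where I simply used the console):

-- Assumed DDL: sets a 1 GB BI Engine reservation for the project in the Tokyo region
ALTER BI_CAPACITY `my-project.region-asia-northeast1.default`
SET OPTIONS (size_gb = 1);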

Keep in mind that the project used for the reservation can query any other project as long as it is in the same region; in PowerBI, you can define which project is used for the query.

After you wait a couple of minutes for BI Engine to start, this is more or less how it works:

1- A query is received by BigQuery; based on the columns used in the query, BI Engine loads only those columns and partitions into memory. The first query will always be slower as it has to scan the data from BigQuery storage and compress it in memory, in my case usually between 1 and 2 seconds.

2- A second query arrives; the data is already in memory, so it is very fast, 100 ms to 500 ms.

3- The same query as in 2 arrives; BigQuery just hits the cache. That’s the sweet spot: less than 100 ms.

4- A new query arrives that targets a different table. That’s the interesting part: based on the size of the scanned columns and the available reservation, BI Engine either evicts the old table from memory or decides there is not enough reservation, in which case it falls back to the default engine, where you pay for how much data is scanned.

5- A query arrives that contains a feature not supported by BI Engine; it falls back to the default engine.

6- Data is appended to the base table or the table changes; BI Engine invalidates the cache and loads the delta into memory, or loads everything again if the table was truncated.

Obviously it is much more complex behind the scenes, but I find it fascinating that BI Engine decides in a fraction of a second what the best way to serve the query is (cache, memory or storage).
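If you want to see which of these paths a given query took, one way (a sketch, assuming the bi_engine_statistics field exposed in INFORMATION_SCHEMA.JOBS_BY_PROJECT, with my Tokyo region as an example) is to look at the job history:

-- bi_engine_mode is FULL when the query was served from memory,
-- PARTIAL or DISABLED when it fell back to the default engine
SELECT
  creation_time,
  job_id,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
  total_bytes_processed,
  bi_engine_statistics.bi_engine_mode AS bi_engine_mode
FROM
  `region-asia-northeast1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  creation_time DESC;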

Personally I am very interested in case 4: obviously, if I reserve memory I want to minimize scanned storage as much as possible. Here is the result for the last 10 days, and I think that’s a great result; my big table is 6 GB and 50 million rows, and the other tables are smaller. The dev team said they are working on further improving how the BI Engine algorithm deals with smaller tables, and so far I am happy with that. (It is fixed now; the memory consumption is extremely low.)

I appreciate that users on flat-rate pricing would not care that much about data scanned; for users on usage-based pricing, it is a very important factor.

Query Performance

Again, the results are based on my usage, and the only way to know is to try it yourself. Nearly 72% of queries render in less than 100 ms; I think it is fair to say we are looking at a different kind of data warehouse architecture.
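As a rough idea of how such a number can be reproduced from the job history (a sketch, not the exact query behind my statistics report):

-- Share of query jobs in the last 10 days that finished in under 100 ms
SELECT
  COUNTIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) < 100) / COUNT(*) AS pct_under_100ms
FROM
  `region-asia-northeast1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);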

PowerBI Performance

Query performance is only a portion of the whole story; you need to add network latency and PowerBI overhead (DAX calculation, viz rendering, etc). My data is in the Tokyo region and the PowerBI service is located in Melbourne, a bit of a distance I would say, and using Publish to web adds extra latency.

The good thing is that using Direct Query on a 51 million row fact table with 5 dimensions is an achievement in itself, and I feel I can use that in production. At the same time, in another report, it seems I am hitting a bug in the ODBC driver and the performance is not good.

To be totally honest, the PowerBI driver for BigQuery seems far from optimized; it appears to use the Simba ODBC driver, while other BI tools use the native API, which is substantially faster. But I have reason to believe the PowerBI team will invest more in better integration (PowerBI parameters in SQL queries are coming, for example).

I think it is an extremely interesting new development, especially if you have big fact tables or data that changes very frequently. Direct Query mode has a lot of advantages: it is very simple to set up, the data is always fresh and there is no data movement. And BI Engine is fast, extremely fast, and cheap!!! I am using a state-of-the-art data warehouse for $30 per month; now it is up to the PowerBI team to take advantage of that.

Building a Modern Data Stack using BigQuery, Dataform and PowerBI

Google Cloud recently bought Dataform and made it available for free. Although I had played with it before, I thought now was a good time to use it more seriously. This is not a review but my own experience as a data analyst who is more comfortable with Microsoft self-service data tools and does not use SQL in day-to-day work.

I have an existing data pipeline in BigQuery: the data is loaded using Python, and there are scheduled queries using Python and BigQuery’s native scheduler. Although the whole thing worked very well for the last 15 months, I would not say working with multiple views and tables was a pleasant experience; to be honest, because I was afraid to break something, I had not touched it much. Everything changed once I started using Dataform to manage it.

What Dataform does (and I imagine dbt too) is implement some very simple functionality that makes the whole workflow extremely easy to manage: you write your SQL code in Dataform, dependencies between tables are generated automatically, and when you click run, it builds those tables and views in BigQuery.

I think showing a general overview of what I did will hopefully give you a sense of the big picture.

1- Define your Source Tables

Here is the representation in Dataform

For example, for the table “DREGION”, you write this code:

config {
  type: "declaration",
  schema: "aemodataset",
  name: "DREGION",
  description: "Price every 5 minutes, history"
}

You repeat the same for all the source tables.

And here is the view in the dependency tree.

2- Remove hard-coded references to tables in SQL queries

Let’s say you have an existing view:

SELECT
   *
 FROM
   xxxxxx.aemodataset.rooftoptoday

Instead of hard-coding the table, you change it to this:

config {
   type: "view",
   schema: "PowerBI",
   tags: ["PowerBI"]
 }
 SELECT
   *
 FROM
   ${ref("rooftoptoday")}

This format is called SQLX; as you can see it is still SQL, but with some new functionality added. In the config, you define whether it is a table or a view, in which dataset it will be located, and a tag (which will be useful later for scheduled refresh).
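For example, a summary table materialized in the same PowerBI dataset would look something like this (a sketch; the column names are made up for illustration):

config {
  type: "table",
  schema: "PowerBI",
  tags: ["PowerBI"]
}
SELECT
  DATE(SETTLEMENTDATE) AS day,
  AVG(RRP) AS avg_price
FROM
  ${ref("DREGION")}
GROUP BY
  day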

Now, repeat this for all your tables and you get this beautiful dependency tree.

3- Schedule Queries

And that’s where the magic is: when you schedule a query, you have an option to schedule all dependent tables. For example, I set up a daily refresh for the table “UNITARCHIVE”, and the two tables “archive_view” and “revenue” are run in sequence without me writing any extra code.

The Dataform project is published here on GitHub; it is really nice to see the history of all the changes, made so easy by the integration with version control.

4- Here are the final views in BigQuery

I think it is good practice to always expose only views to PowerBI, as you can change the logic later without breaking the connection to PowerBI.

5- Connect PowerBI to BigQuery

PowerBI connects to BigQuery using incremental refresh to reduce the time required to update; it is pretty trivial to set up.

Although the data changes every 5 minutes, I am using a PowerBI Pro license, which is limited to 8 refreshes per day; if Premium per user turns out to have a reasonable price, I will upgrade 🙂

Hopefully in 2021 we will have the option to serve PowerBI using BI Engine, as currently using DirectQuery from BigQuery can get expensive very quickly if you have a lot of usage. Obviously, if you are on a flat rate, it is not a problem.

To clarify, BigQuery is very cost effective; the current pipeline actually costs me less than $2/month. You just have to be careful with PowerBI and use only import mode: PowerBI is very chatty when used in live mode, and it simply generates too many SQL queries.
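A quick way to keep an eye on that (a sketch using the INFORMATION_SCHEMA job history, with my Tokyo region as an example):

-- Bytes billed per day for the project; a chatty DirectQuery report shows up here very quickly
SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS queries,
  ROUND(SUM(total_bytes_billed) / POW(2, 30), 2) AS gb_billed
FROM
  `region-asia-northeast1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  day
ORDER BY
  day DESC;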

6- Semantic Model in PowerBI

Dataform data models are not meant to replace a semantic model; all Dataform does is take raw tables and generate reporting tables that can be consumed by a BI tool (to clarify, BigQuery generates the views and tables, Dataform just manages the SQL code and the scheduled refreshes, but the compute is done by the database).

For a simple scenario, some flat tables are all you need (in fact I am using Google Data Studio too for this example), but for anything slightly complex you need a semantic model on top, and here I am using PowerBI to host the semantic model.

I would have loved to test Looker’s semantic model, but currently you need to call a sales department to schedule a demo, which I am not really interested in doing.

7- Final Reports

Here are the final reports; as the data is public, I am using Publish to web.

What I really like about the dependency tree is that it gives visual clues to redundant logic; it gave me the confidence to simplify my workflow, and when I delete or change a table name, it automatically raises an error that a dependency will be out of sync.

I keep reading about how ELT will be the next big thing, and to be honest I never bought the concept, but with Dataform I can see myself writing very complex workflows using SQL.

Dynamic Weekend using Parameter in PowerBI

I have had a perfectly working PowerBI report for the last 7 months. The data is refreshed daily, but we need to see the weekly quantity on a rolling basis of the last 7 days.

When the data refreshes, a calculated column generates a new weekly grouping based on the latest update from the fact table “Quantity Installed”; see the example here.


Everything worked well until someone asked if he could filter previous days: let’s say the report is updated on Monday, but he wants to see the data only up to the previous Thursday, which is the official cut-off for client reporting.

Obviously it is easy to just filter the data, but visually it is annoying, as the week-end date is still Monday; calculated columns change only when you refresh the report, then they are fixed.

To show an example, this report shows the total fatalities due to COVID-19 per week. The cut-off is 03/11/2020, which is a Tuesday, so the weekly calendar is based on Tuesday as the end of the week.

Now, if you filter the data to the previous Friday, the week end does not change, hence you get the impression that the total number for the current week is going down, which is not correct.

So I asked in the PowerBI forum how to do that, and Brian Maher’s solution is that, since the columns do not change, we should change the DAX instead; see the link to the original solution.

“The measure below checks to see if our date on the X-axis is the same day of the week as your slicer. If not it returns BLANK( ), if it is the same day of week, it sums the qty for the previous 7 days”

 Qty at 'week end' =
 VAR SelectedSlicerDate =
     SELECTEDVALUE ( 'Slicer Dates'[Date] )
 VAR ThisDateOnAxis =
     SELECTEDVALUE ( 'Calendar'[Date] )
 RETURN
     IF (
         WEEKDAY ( SelectedSlicerDate ) <> WEEKDAY ( ThisDateOnAxis ),
         BLANK (),
         CALCULATE (
             SUM ( 'Data Table'[qty] ),
             DATESBETWEEN ( 'Calendar'[Date], ThisDateOnAxis - 6, ThisDateOnAxis )
         )
     )

It is basically using a well-known technique in DAX: show all the dimension values, then anything with a BLANK() result will be filtered out.

The DAX works perfectly and solves the user request, but in my real case I had a lot of measures; changing all of them and risking introducing regressions in a perfectly working data model just for this small request was not worth it.

All I wanted was to somehow make the weekend column dynamic at runtime, which was not possible until… October 2020. We now have a new option: we can finally change the dimension dynamically and keep the DAX simple!

If you have not heard about the new parameters, a good place to start is here.

My first attempt was to get the day of the week in Power Query, but unfortunately query folding stopped working, and to be honest there is no clear way to know what is supported or not, as it is data-source specific; here I am using BigQuery.

Instead, I created a simple calendar table with a Day column and 7 columns with all the possible week ends (Saturday, Sunday, etc).
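To give an idea, here is a sketch of how such a calendar could be built on the BigQuery side (the project, view and column names are made up for illustration, and this is not necessarily how the original table was created):

-- In BigQuery, DAYOFWEEK returns 1 = Sunday ... 7 = Saturday, so each column
-- gives, for every date, the next occurrence of that possible week-end day
CREATE OR REPLACE VIEW `my-project.PowerBI.calendar_weekends` AS
SELECT
  d AS Date,
  FORMAT_DATE('%A', d) AS Day,
  DATE_ADD(d, INTERVAL MOD(7 - EXTRACT(DAYOFWEEK FROM d), 7) DAY) AS weekend_saturday,
  DATE_ADD(d, INTERVAL MOD(1 - EXTRACT(DAYOFWEEK FROM d) + 7, 7) DAY) AS weekend_sunday,
  DATE_ADD(d, INTERVAL MOD(5 - EXTRACT(DAYOFWEEK FROM d) + 7, 7) DAY) AS weekend_thursday
  -- ...and so on for the remaining weekdays
FROM
  UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2021-12-31')) AS d;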

Then I created a new parameter, with a default day of Thursday.

Then I created a new column “DynamicWeek” using this formula.

I think we can agree the formula is trivial, a simple if-then-else.

In order to make the parameter dynamic, we create a new disconnected table (not connected to any other table in the model).

Then we bind the value of the column Day to the parameter.

When the user changes the value of the disconnected table using a slicer,

the value gets passed to the parameter, which changes the value of the “DynamicWeek” column.

And here is the result.

Update

A couple of hours after I published this blog, I got an idea that actually did work.

Query folding does not work with the day-of-the-week function, but Text.Middle did fold, so all I had to do was create a new disconnected table like this.

It is the same approach: add a calculated column in the calendar table.

Just to be fancy, in the parameter I used a query that always returns the latest date as the default value.

And here is a view of the data model.

I created a new measure that uses the disconnected table as a filter cut-off, so the dates get filtered:

 Death_cutoff =
 VAR Sum_ = SUM ( covid19new[daily_deaths] )
 RETURN
     IF ( MIN ( MstDates[Date] ) > SELECTEDVALUE ( MstDates_Table_Cutoff[date] ), BLANK (), Sum_ )

I imagine you could parse the date text to see if it is the end of the month and generate a monthly calendar, for example; there is a lot of flexibility.

Notice that because the date table is DirectQuery (the fact table and the parameter selection table are import), at least when using BigQuery, the response time for the first selection is around 3 seconds; but when you make the same selection again, it becomes instantaneous as PowerBI caches the query result.

Now the exciting part, which we don’t have yet but apparently is coming next year, at least for SnowflakeDB and Redshift: we will be able to write custom SQL queries with parameters in DirectQuery mode, and then we will finally have “dynamic calculated columns”. OK, if we get them in import mode it will be even better, but let’s not get too excited 🙂

DAX is extremely versatile; combine that with dynamic columns and we have a very powerful new tool. The future is exciting.

Dynamic Date Granularity using Parameter in Google Data Studio

I created this dynamic bar chart which changes the granularity based on the date range selected: if the selected range is greater than a year, it groups the data per year and you can then drill down to month; the same if it is a month, you can drill down to week, and so on. See the example here.

It was just for fun, but someone asked me how it works, so I thought it was worth a quick blog.

As of 27 October 2020, the date parameters are only available if you connect to BigQuery.

OK, basically you just need to create a custom query that calculates the duration between the start date parameter and the end date parameter, which are passed by the calendar control.

This example uses average Australian wholesale electricity market prices; obviously it works with any dataset that contains a date.

WITH
   xx AS (
   SELECT
     * EXCEPT (YEAR),
     CAST(DATE_TRUNC(DATE(SETTLEMENTDATE), YEAR) AS timestamp) AS YEAR,
     CAST(DATE_TRUNC(DATE(SETTLEMENTDATE), MONTH) AS timestamp) AS MONTH,
     CAST(DATE_TRUNC(DATE(SETTLEMENTDATE), WEEK) AS timestamp) AS WEEK,
     CAST(DATE(SETTLEMENTDATE) AS timestamp) AS date,
     DATE_DIFF(PARSE_DATE('%Y%m%d',
         @DS_END_DATE),PARSE_DATE('%Y%m%d',
         @DS_START_DATE),DAY) + 1 AS nbrdays
   FROM
     xxxxxxx.PRICEARCHIVE
   WHERE
     DATE(SETTLEMENTDATE) >= PARSE_DATE('%Y%m%d',
       @DS_START_DATE)
     AND DATE(SETTLEMENTDATE) <= PARSE_DATE('%Y%m%d',
       @DS_END_DATE)
     AND UNIT="DUNIT")
 SELECT
   REGIONID,
   RRP,
   nbrdays,
   CASE
     WHEN nbrdays <= 1 THEN SETTLEMENTDATE
     WHEN nbrdays <= 7 THEN date
     WHEN nbrdays <= 31 THEN WEEK
     WHEN nbrdays <= 365 THEN MONTH
   ELSE
   YEAR
 END
   AS newdate,
   CASE
     WHEN nbrdays <= 1 THEN SETTLEMENTDATE
     WHEN nbrdays <= 7 THEN SETTLEMENTDATE
     WHEN nbrdays <= 31 THEN date
     WHEN nbrdays <= 365 THEN WEEK
   ELSE
   MONTH
 END
   AS newdate2
 FROM
   xx