Building a Modern Data Stack using BigQuery, Dataform and PowerBI

Google Cloud recently bought Dataform and made it available for free. Although I had played with it before, I thought now was a good time to use it more seriously. This is not a review but my own experience as a data analyst who is more comfortable with Microsoft self-service data tools and does not use SQL in day-to-day work.

I have an existing data pipeline in BigQuery. The data is loaded using Python, and there are scheduled queries run using Python and the BigQuery native scheduler. Although the whole thing has worked very well for the last 15 months, I would not say that working with multiple views and tables was a pleasant experience. To be honest, because I was afraid of breaking something, I had not touched it much. Everything changed once I started using Dataform to manage it.

What Dataform did (and I imagine dbt too) is implement some very simple functionality that makes the whole workflow extremely easy to manage: you write your SQL code in Dataform, dependencies between tables are generated automatically, and when you click Run, it builds those tables and views in BigQuery.

I think a general overview of what I did will give you a sense of the big picture.

1- Define your Source Tables

Here is the representation in Dataform

For example, for the table “DREGION”, you write this code:

config {
  type: "declaration",
  schema: "aemodataset",
  name: "DREGION",
  description: "Price every 5 minutes, history"
}

You repeat the same for all the source tables.

And here is the view in the dependency tree.
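If there are many source tables, the declaration files can be generated rather than typed by hand. The following is a minimal Python sketch, not part of Dataform itself: `render_declaration` and `write_declarations` are hypothetical helper names, and the description for `rooftoptoday` is left empty because none is given here.

```python
import json
from pathlib import Path


def render_declaration(schema: str, name: str, description: str = "") -> str:
    """Render a Dataform declaration config block as text (hypothetical helper)."""
    fields = [("type", "declaration"), ("schema", schema), ("name", name)]
    if description:
        fields.append(("description", description))
    # json.dumps gives double-quoted, escaped strings, which are also valid
    # inside the JavaScript-style config block.
    body = ",\n".join(f"  {key}: {json.dumps(value)}" for key, value in fields)
    return "config {\n" + body + "\n}\n"


def write_declarations(out_dir: Path, schema: str, tables: dict[str, str]) -> list[Path]:
    """Write one <name>.sqlx file per source table and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, description in tables.items():
        path = out_dir / f"{name}.sqlx"
        path.write_text(render_declaration(schema, name, description), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Prints the same block as the DREGION example above.
    print(render_declaration("aemodataset", "DREGION", "Price every 5 minutes, history"))
```

Calling `write_declarations(Path("definitions/sources"), "aemodataset", {"DREGION": "Price every 5 minutes, history", "rooftoptoday": ""})` would write the two files into an assumed `definitions/sources` folder; adjust the path to match your own project layout.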

2- Remove hard-coded references to Tables in SQL Queries

Let’s say you have an existing view:

SELECT
  *
FROM
  xxxxxx.aemodataset.rooftoptoday

Instead of hard-coding the table, you change it to this:

config {
  type: "view",
  schema: "PowerBI",
  tags: ["PowerBI"]
}

SELECT
  *
FROM
  ${ref("rooftoptoday")}

This format is called SQLX. As you can see, it is still SQL, but with some new functionality added: in the config, you define whether it is a table or a view, which dataset it will be located in, and its tags (which will be useful later for scheduled refresh).

Now repeat this for all your tables and you get this beautiful dependency tree.
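When migrating many existing views, the replacement of hard-coded names can be scripted. This is a rough Python sketch under stated assumptions: `to_ref` is a hypothetical helper, the regular expression only handles plain `project.dataset.table` names (optionally in backticks), and only tables you list as declared are rewritten. It is a convenience for a first pass, not a SQL parser, so review the output by hand.

```python
import re

# Matches a three-part name such as project.dataset.table, optionally wrapped
# in backticks. Simplification: the project part may contain hyphens, while
# dataset and table are assumed to be made of word characters only.
QUALIFIED_NAME = re.compile(r"`?\b([\w-]+)\.(\w+)\.(\w+)\b`?")


def to_ref(sql: str, declared: set[str]) -> str:
    """Replace project.dataset.table with ${ref("table")} for declared tables."""

    def swap(match: re.Match) -> str:
        table = match.group(3)
        if table in declared:
            return '${ref("' + table + '")}'
        # Leave anything we do not recognise exactly as it was.
        return match.group(0)

    return QUALIFIED_NAME.sub(swap, sql)


if __name__ == "__main__":
    old_sql = "SELECT\n  *\nFROM\n  xxxxxx.aemodataset.rooftoptoday"
    print(to_ref(old_sql, {"rooftoptoday"}))
```

Passing the set of declared table names explicitly keeps the script from touching three-part expressions that are not tables, such as nested field references.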

3- Schedule Queries

And that’s where the magic is: when you schedule a query, you have the option to include all dependent tables. For example, I set up a daily refresh for the table “UNITARCHIVE”, and the two tables “archive_view” and “revenue” are run in sequence without me writing any extra code.
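To make the idea of "run the dependents in sequence" concrete, here is a small Python sketch of the underlying technique, a topological sort over the dependency graph. It is only an illustration and not how Dataform is implemented. The dependency map is an assumption: it supposes that `archive_view` reads from `UNITARCHIVE` and that `revenue` reads from `archive_view`. It uses `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Assumed dependency map: each table lists the tables it reads from.
DEPENDS_ON = {
    "UNITARCHIVE": set(),
    "archive_view": {"UNITARCHIVE"},
    "revenue": {"archive_view"},
}


def downstream(start: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return start plus every table that directly or indirectly reads from it."""
    selected = {start}
    changed = True
    while changed:
        changed = False
        for table, parents in depends_on.items():
            if table not in selected and parents & selected:
                selected.add(table)
                changed = True
    return selected


def run_order(start: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Order the selected tables so that each one runs after the tables it reads."""
    selected = downstream(start, depends_on)
    # Keep only the edges inside the selected subgraph before sorting.
    subgraph = {table: depends_on[table] & selected for table in selected}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(run_order("UNITARCHIVE", DEPENDS_ON))
```

With the assumed map, scheduling `UNITARCHIVE` with its dependents yields the three tables in the order they have to be built, which is the bookkeeping you no longer have to write yourself.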

The Dataform project is published on GitHub. It is really nice to see the history of all the changes, which the version control integration makes very easy.

4- Here are the final Views in BigQuery

I think it is good practice to always expose only views to PowerBI, as you can change the logic later without breaking the connection to PowerBI.

5- Connect PowerBI to BigQuery

PowerBI connects to BigQuery using incremental refresh to reduce the time required to update; it is pretty trivial to set up.

Although the data changes every 5 minutes, I am using a PowerBI Pro license, which is limited to 8 refreshes per day. If Premium Per User turns out to have a reasonable price, I will upgrade 🙂

Hopefully in 2021 we will have the option to serve PowerBI using BI Engine, as currently using DirectQuery against BigQuery can get expensive very quickly if you have a lot of usage. Obviously, if you are on a flat rate, it is not a problem.

To clarify, BigQuery is very cost-effective; the current pipeline actually costs me less than $2 per month. You just have to be careful with PowerBI and use only import mode. PowerBI is very chatty when used in live mode; it simply generates too many SQL queries.
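A back-of-the-envelope calculation shows why the number of queries matters under on-demand pricing. The Python sketch below assumes that each query is billed on the bytes it scans, with a minimum billed amount per query. The function name `monthly_cost_usd`, the 10 MB default minimum, and every number in the example are illustrative assumptions, not measurements from this pipeline; take the real rate and minimum from the current BigQuery pricing page.

```python
MB = 10**6
TB = 10**12


def monthly_cost_usd(
    queries_per_day: int,
    bytes_per_query: int,
    usd_per_tb: float,
    min_billed_bytes: int = 10 * MB,  # assumed per-query minimum; check the pricing page
    days: int = 30,
) -> float:
    """Estimate the monthly on-demand cost for a steady stream of similar queries."""
    # Small queries are rounded up to the assumed minimum billed amount.
    billed = max(bytes_per_query, min_billed_bytes)
    return queries_per_day * days * billed / TB * usd_per_tb


if __name__ == "__main__":
    rate = 5.0  # hypothetical USD per TB scanned
    # Import mode: a handful of scheduled refreshes per day.
    print("import:", monthly_cost_usd(8, 1_000 * MB, rate))
    # Live mode: every visual on every page view issues its own query.
    print("live:  ", monthly_cost_usd(20_000, 50 * MB, rate))
```

The point of the sketch is the shape of the formula: with import mode the query count is capped by the refresh limit, while in live mode it grows with the number of users and visuals.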

6- Semantic Model in PowerBI

Dataform data models are not meant to replace a semantic model. All Dataform does is take raw tables and generate reporting tables that can be consumed by a BI tool (to clarify, BigQuery generates the views and tables; Dataform just manages the SQL code and schedules the refresh, but the compute is done by the database).

For a simple scenario, some flat tables are all you need (in fact, I am using Google Data Studio too for this example), but for anything slightly complex you need a semantic model on top of it. Here I am using PowerBI to host the semantic model.

I would have loved to test the Looker semantic model, but currently you need to call a sales department to schedule a demo, which I am not really interested in doing.

7- Final Reports

Here are the final reports. As the data is public, I am using Publish to web.

What I really like about the dependency tree is that it gives visual clues to redundant logic. It gave me the confidence to simplify my workflow, and when I delete or rename a table, it automatically raises an error saying that a dependency will be out of sync.
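The out-of-sync check can be pictured as comparing every `ref()` target against the set of names the project defines. The Python sketch below illustrates that idea only; it is not Dataform's own validation. `missing_refs` and the file name `powerbi_rooftop.sqlx` are hypothetical, and the pattern only recognises the single-argument, double-quoted form `${ref("name")}` used earlier.

```python
import re

# Matches ${ref("name")} and captures the name.
REF = re.compile(r'\$\{\s*ref\(\s*"([^"]+)"\s*\)\s*\}')


def missing_refs(files: dict[str, str], known: set[str]) -> dict[str, list[str]]:
    """Map each file to the ref() targets that are not among the known names."""
    problems = {}
    for filename, text in files.items():
        missing = sorted(set(REF.findall(text)) - known)
        if missing:
            problems[filename] = missing
    return problems


if __name__ == "__main__":
    files = {"powerbi_rooftop.sqlx": 'SELECT * FROM ${ref("rooftoptoday")}'}
    # Simulate renaming the source table without updating the view.
    print(missing_refs(files, {"rooftop_today"}))
```

Because the names are resolved before anything runs, a rename shows up as a reported problem instead of a broken view discovered later in a report.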

I keep reading about how ELT will be the next big thing, and to be honest I never bought into the concept, but with Dataform I can see myself writing very complex workflows using SQL.
