Custom SQL in Google Data Studio

In the last 12 months, Google Data Studio has added many interesting new features, especially the integration with the BigQuery BI Engine and custom SQL queries.

Obviously, I am a huge Power BI fan, and I think it is the best thing that has happened to analytics since Excel, but if you want to share a secure report without requiring a license for every user, Data Studio is becoming a valid option.

I have already blogged about building a near real-time dashboard using BigQuery and Data Studio, but in this quick post I will try to showcase how, using SQL, one can build reports with more complex business logic.

I am using a typical dataset from our industry: a lot of fact tables with different granularities. The fact tables don't all update at the same time; planned values change only when there is a program revision, while actuals change every day.

Instead of writing the steps here, please view the report, which includes the how-to and the results.

The approach is pretty simple. All modern BI software works more or less the same way (at least Power BI and Qlik; Tableau is coming soon): you load data into different tables, you model the data by creating relationships between the tables, then you create measures. When you click on a filter, for example, or add a dimension to a chart, the software generates a SQL query against the data source based on the relationships defined in the data model. It is really amazing; even without knowing any SQL you can do very complicated analysis.

Data Studio is no different from the other tools; its data modeling is called Blending. It links all the tables together using left joins, which is a big limitation: if some values exist in one table but not in the others, you will miss data.

The idea is to bypass the modeling layer and write some SQL code, and to make it dynamic, use parameters. It is not an ideal solution for the average business user (we don't particularly like code), but it is a workaround until Data Studio improves its offering.
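To make the idea concrete, here is a minimal sketch of what such a custom query could look like. The table and column names (my_project.my_dataset.planned, actuals, wbs, report_date) are placeholders of my own, and @cutoff_date is a Data Studio parameter passed to the query; the full outer join keeps rows that exist in only one of the two tables, which is exactly what the left-join blending cannot do.

-- Hypothetical tables; replace with your own project and dataset names
WITH planned AS (
    SELECT wbs, SUM(planned_hours) AS planned_hours
    FROM `my_project.my_dataset.planned`
    WHERE report_date <= @cutoff_date      -- Data Studio parameter
    GROUP BY wbs
),
actuals AS (
    SELECT wbs, SUM(actual_hours) AS actual_hours
    FROM `my_project.my_dataset.actuals`
    WHERE report_date <= @cutoff_date
    GROUP BY wbs
)
-- FULL OUTER JOIN so a WBS present in only one table is not lost
SELECT
    COALESCE(p.wbs, a.wbs) AS wbs,
    p.planned_hours,
    a.actual_hours
FROM planned AS p
FULL OUTER JOIN actuals AS a
    ON p.wbs = a.wbs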

How to Build a Near Real-Time Dashboard Using Data Studio and BigQuery

TL;DR: the report is here. Please note that my experience with BigQuery and the Google stack is rather limited; this is just my own perspective as a business user.

Edit, 20 Sept 2019: Data Studio now uses BI Engine by default when connecting to BigQuery, and the report now contains the historical data too.

I have already built a dashboard that tracks AEMO data using Power BI, and it is nearly perfect, except that the maximum number of refreshes per day is 8, which is quite OK (DirectQuery is not an option, as it is not supported when you publish to the web; actually it is supported, but rather slow). Still, for some reason I wondered how hard it would be to build a dashboard that always shows the latest data.

Edit, 23 Sept 2019: actually, my go-to solution for near real-time reporting is now Google Data Studio; once you get used to real time, you can't go back.

The requirements are

  1. Very minimum cost, it is just a hobby
  2. Near Real time (the data is published every 5 minutes)
  3. Export to csv
  4. Free to share.
  5. Ideally not too technical; I don't want to build something from scratch.

I got some advice from a friend who works on this kind of scenario, and it seems the best option is to build a web app with a database like PostgreSQL, a front end in the likes of Apache Superset or RStudio Shiny, and host it on a cheap VM from DigitalOcean, which I may eventually do. But I thought I would give BigQuery a try: the free tier is very generous (1 TB of free queries per month is more than enough), and Data Studio is totally free and uses a live connection by default.

Unlike Power BI, which is a whole self-service BI solution in one package, Google's offering is split into three separate streams: ETL, the data warehouse (BigQuery) and the reporting tool (Data Studio). The pricing is pay per usage.

For the ETL, Dataprep would be the natural choice for me (the service is provided by Trifacta), but to my surprise, apparently you can't import data from a URL. I think I was a bit unfair to Trifacta: the data has to be in Google Cloud Storage first, which is fine, but the lack of support for zip is hard to understand; at least in the type of business I work in, everyone uses zip.

I tried to use Data Fusion, but it involves spinning up a new Spark cluster, and the price is around $3,000 per month!

I think I will stick with Python for the moment.

  • The first thing to do after creating a new project in BigQuery is to set up cost control.

The minimum quota I could set for BigQuery is 0.5 TB per day.

  • The source files are located here; they are very simple CSV files, compressed with zip. I care only about three fields:

SETTLEMENTDATE: timestamp
DUID: generator ID (power station, solar, wind farm, etc.)
SCADAVALUE: electricity produced in MW

  • Add a table partitioned per day and clustered by the field DUID (the DDL is sketched after this list).
  • Write a Python script that loads the data to BigQuery; you can have a look at the code used here, and hopefully I will blog about it separately.
  • Schedule the script to run every 5 minutes. I am a huge fan of Azure WebJobs; to be honest, I first tried Google Cloud Functions, but you can't write anything to the local folder by default. The container has to be stateless, and I just find it easier when I can write temporary data to the local folder (I have a limited understanding of Cloud Functions; that was my first impression anyway). Now I am using Cloud Functions and Cloud Scheduler; Cloud Functions provide a /tmp directory you can write to, although it uses some memory resources.
  • I added a dimension table that shows the full description for the generator id, region, etc. I have the coordinates too, but strangely, the Data Studio map visual does not support tiles!
  • Create a view that joins the two tables, removes any duplicates, and filters out the rows where there is no production (SCADAVALUE = 0); if there is no full description yet for the generator id, use the id instead (see the SQL sketch after this list).
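As a rough sketch of the table and the view described above: the dataset, table and column names (my_project.aemo.scada, duid_dim, STATIONNAME) are placeholders of my own, not the ones used in the actual report.

-- Fact table: one partition per day, clustered by generator id
CREATE TABLE IF NOT EXISTS `my_project.aemo.scada` (
    SETTLEMENTDATE TIMESTAMP,
    DUID STRING,
    SCADAVALUE FLOAT64
)
PARTITION BY DATE(SETTLEMENTDATE)
CLUSTER BY DUID;

-- View: remove duplicates, drop zero production, and fall back to the id
-- when the dimension table has no description yet
CREATE OR REPLACE VIEW `my_project.aemo.v_scada` AS
SELECT
    s.SETTLEMENTDATE,
    COALESCE(d.STATIONNAME, s.DUID) AS STATIONNAME,
    s.SCADAVALUE
FROM (
    SELECT DISTINCT SETTLEMENTDATE, DUID, SCADAVALUE
    FROM `my_project.aemo.scada`
    WHERE SCADAVALUE <> 0
) AS s
LEFT JOIN `my_project.aemo.duid_dim` AS d
    ON s.DUID = d.DUID;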

Notice that, although it is a view, partition pruning still works. There is a minimum of 10 MB billed per table regardless of the data scanned, and for billing BigQuery uses the uncompressed size!

One very good thing though: query results are cached for a day, so if you run the same query again, it is free!

  • Create the Data Studio report. I will create two connections:
  • Live connection: it pulls only today's data, so every query costs 20 MB as it scans only one date partition (for the 2 tables). The speed is satisfactory; make sure to deactivate the cache (a sketch of the query follows the next paragraph).

But to confuse everyone, there are two types of caches (see the documentation here). The implication is that you sometimes get different updates depending on whether your selection hits the cache or not. As the editor of the report it is not an issue, since I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works; sometimes when I test it in incognito mode I get the latest data, sometimes not.
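The live query itself can stay very small; a sketch, assuming the placeholder view and columns from the DDL above, would be something like:

SELECT
    SETTLEMENTDATE,
    STATIONNAME,
    SCADAVALUE
FROM `my_project.aemo.v_scada`
-- Filtering on the partitioning column keeps the scan to today's partition;
-- in practice you would pass the market time zone, e.g. CURRENT_DATE("Australia/Brisbane")
WHERE DATE(SETTLEMENTDATE) = CURRENT_DATE()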

  • Import connection: it is called an extract. It loads the data into Data Studio's in-memory database (it uses BI Engine, created by one of the original authors of multidimensional). Just be careful, as the maximum that can be imported is 100 MB (uncompressed), which is rather small (OK, it is free, so I can't really complain). Once I was very confused why the data did not match; it turned out Data Studio truncates the import without warning. Anyway, to optimise this 100 MB, I extract a summary of the data, remove the time dimension and filter to only the last 14 days, and I schedule the extract to run every day at 12:30 AM; notice that today's data is not included (a sketch of the summary query follows the note below).

Note: because both datasets use the same data source, cross-filtering works by default; if you use two different sources (let's say CSV and Google Search), you need some awkward workarounds to make it work.
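For reference, the daily extract boils down to a summary query along these lines (again using the placeholder view from above); dropping the timestamp and keeping only 14 days of daily totals is what keeps it well under the 100 MB limit.

SELECT
    DATE(SETTLEMENTDATE) AS settlement_day,   -- drop the 5-minute granularity
    STATIONNAME,
    SUM(SCADAVALUE) / 12 AS mwh_produced      -- 5-minute MW readings converted to MWh
FROM `my_project.aemo.v_scada`
WHERE DATE(SETTLEMENTDATE) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
                               AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)   -- exclude today
GROUP BY settlement_day, STATIONNAME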

  • Voilà, the live report 😊. A nice feature shown here (sorry for the GIF quality) is the export to Sheets.
  • Another nice feature: scheduled email delivery.

Although the report is very simple, I must admit I find it very satisfying; there is some little pleasure in watching real-time data. Some missing features I would love to have:

  • An option to deactivate all the caches, or bring back the option to let the viewer manually refresh the report.
  • An option to trigger email delivery based on an alert (for example, when a measure reaches a maximum value), or at least to schedule email delivery multiple times per day.
  • Make the Data Studio web site mobile friendly; it is hard to select a report from the list of available reports.
  • Google Data Studio support for maps is nearly non-existent; that's a showstopper for a lot of business scenarios.

Build a Dashboard using Data Studio

Google Data Studio is a new reporting service from Google. It is still in beta, has rudimentary capabilities compared to established players (Power BI, Tableau and Qlik), and can be very slow, yet it has two killer features: it is free, and it has amazing sharing options.

Google's business model for this product is very simple: the visualisation layer is free, and while the preferred data source is BigQuery, which is a paid product, other sources are free to use.

In the last two months I have been experimenting with the product, and I never managed to make it produce any meaningful charts. That was my mistake, as I was approaching it from a semantic-model background, which is not valid here.

Data Studio does not support multiple datasets in the same chart; you can use multiple datasets in the same report, and you can filter them using a common dimension. One solution is to create a master dataset that combines all the tables; it is the same idea as the concatenated table used in Qlik.

Now, if you have a data warehouse, you could simply write a SQL query to combine all the dimensions you want to report on (a sketch follows below). I don't, so I will use what I have: Power BI Desktop! You could use RStudio too, or any data management solution you are familiar with.
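If I did have a warehouse, the master dataset would just be a UNION ALL with a type column, something like this sketch (earned, planned and spent are hypothetical tables already summarised at the WBS and cut-off level, mirroring what the DAX query below does):

SELECT wbs_l3, cutoff_date, 'Earned' AS type, earned_hours AS period_hours
FROM `my_project.dw.earned`
UNION ALL
SELECT wbs_l3, cutoff_date, 'Planned' AS type, planned_hours AS period_hours
FROM `my_project.dw.planned`
UNION ALL
SELECT wbs_l3, cutoff_date, 'Spent' AS type, spent_hours AS period_hours
FROM `my_project.dw.spent`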

So the approach is: build a data model in Power BI Desktop, query the model using DAX and export the result to CSV, then load the CSV into Data Studio and share the report. Data Studio is only used as a reporting service.

1- I will use this data model.

Data_Studio_1

2- And use this query to extract the dimensions and measures I want to report on:


EVALUATE
VAR Earned_Table =
    // Earned hours by commodity, WBS and cut-off period
    FILTER (
        ADDCOLUMNS (
            SUMMARIZECOLUMNS (
                TBPAYCODE[Commodity_ID],
                TBCommodity[Commodity],
                TBWBS[WBS_L3],
                TBDate[Project Cut Off],
                "Budget", tb_Measures[budget_hours],
                "Period", tb_Measures[Earned_Hours_Period],
                "Cumulative", tb_Measures[Earned_Hours_Cumulative]
            ),
            "TYPE", "Earned"
        ),
        MIN ( TBDate[date] ) <= [Cut off]
            && MIN ( TBDate[date] ) >= DATE ( 2012, 1, 1 )
    )
VAR Plan_Table =
    // Planned hours, same granularity as earned
    FILTER (
        ADDCOLUMNS (
            SUMMARIZECOLUMNS (
                TBPAYCODE[Commodity_ID],
                TBCommodity[Commodity],
                TBWBS[WBS_L3],
                TBDate[Project Cut Off],
                "Budget Hours", tb_Measures[budget_hours],
                "Planned Hours", tb_Measures[planned_Hours],
                "planned Hours Cumulative", tb_Measures[planned_Hours_Cum]
            ),
            "TYPE", "Planned"
        ),
        MIN ( TBDate[date] ) >= DATE ( 2012, 1, 1 )
    )
VAR Spent_Table =
    // Spent hours exist only at WBS level, so the commodity columns are blanked
    SELECTCOLUMNS (
        FILTER (
            SUMMARIZECOLUMNS (
                TBWBS[WBS_L3],
                TBDate[Project Cut Off],
                "Budget", tb_Measures[budget_hours],
                "Period", tb_Measures[spent_Hours_Period],
                "Cumulative", tb_Measures[spent_Hours_Cum]
            ),
            MIN ( TBDate[date] ) <= [Cut off]
                && MIN ( TBDate[date] ) >= DATE ( 2012, 1, 1 )
        ),
        "Commodity_ID", BLANK (),
        "Commodity", BLANK (),
        "wbs_l3", [WBS_L3],
        "TBDate", [Project Cut Off],
        "Budget Hours", [Budget],
        "Planned Hours", [Period],
        "planned Hours Cumulative", [Cumulative],
        "Table", "Spent"
    )
RETURN
    // UNION matches columns by position, so all three tables share the same column order
    ADDCOLUMNS (
        UNION ( Earned_Table, Plan_Table, Spent_Table ),
        "Data_Date", IF ( [Project Cut Off] = [Cut off], 1, 0 )
    )

Basically, I am using UNION to concatenate the values for Earned Hours, Planned Hours and Spent Hours. I added a column TYPE to distinguish between the different tables, and in the case of Spent, I assign a blank value to Commodity_ID and Commodity, as the measures don't have the same granularity.

I added a column "Data_Date" so I can filter only the dates where it is equal to the cut-off.

3-Load the CSV

Data_Studio_2

If a field's aggregation is None, then it is a dimension; otherwise it is a metric (measure). There are some functions to do calculations, like sum and average, but it is not DAX; it is really limited.

As an example, the measure Planned Hours = SUM(CASE WHEN REGEXP_MATCH(TYPE, '^Planned.*') THEN Period ELSE 0 END)

4-build the report

Data_Studio_3

You can view the report here, and the datasets here. (Sorry, it originally did not support Internet Explorer, only Chrome, Firefox and Safari; Internet Explorer works now, although not officially supported.)

Personally, the most interesting aspect of Data Studio is the integration with Google Drive: the reports you build show up in your Google Drive and you can organize them in different folders, for example grouping reports by department, shop, etc.

Data_Studio_4

I think it has potential. They got the sharing right; now they need to add some basic features (pivot tables, controlled cache, ...) and more options in the calculated-field formulas (hint: returning null, for example, is not supported; OK, there is a workaround: just create Metric*Metric/Metric, and if it is 0 then it is an error, and GDS returns null in that case), and you will have data visualisation for the masses.