TL;DR: the report is here, and I would appreciate a vote on this bug report.
First, don’t get too excited: it is a silly workaround and introduces its own problems, but if you, like me, need to deliver a nice-looking pivot table in Google Data Studio, it may be worth the hassle.
The requirement: show spent and budget by campaign and country. Spent is tracked at the country level, while budget is at the campaign level; here is a sample data set.
The Solution: First Try
You are probably saying: this is too easy, why write a blog post about it, GDS supports pivot tables! Let’s see the result.
We have three problems already (1 bug, 1 limitation and 1 by design):
Bug: you cannot return a null in the spent metric.
By design: GDS does not understand hierarchies; a null country is perfectly acceptable as far as it is concerned.
Limitation: the famous Excel compact view is not supported.
Here is the deal: contrary to what you may read on the internet, the pivot table is the most used visualization in reporting (OK, maybe second after the plain table), and users want their pivot tables to look exactly like their beloved Excel. In my experience, if you show a user a map, for example, and they ask for a feature that is not possible, you can say "I can’t do it" and people will tolerate that. But for their Excel-looking pivot table there is zero tolerance; if you can’t reproduce it, they will think either:
Your BI is not good
You don’t know the tool
The Solution: SQL!
Write a SQL query that returns a column showing the campaign and country in the same field, using a union. Assuming your data is in a Google Sheet:
1- Link the Google Sheet to an external table in BigQuery
2- Write the query
Connect to that table using a custom query
SELECT project, SUM(budget) AS budget, SUM(spent) AS spent
FROM `test-187010.work.factraw`
GROUP BY 1
UNION ALL
SELECT CONCAT("\U0001f680", country), budget, spent
FROM `test-187010.work.factraw`
WHERE country IS NOT NULL
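One refinement, sketched here as my own assumption rather than something taken from the report: carry the campaign along as a hidden sort column so each country row lands directly under its campaign. Whether the final order holds also depends on the table's sort settings in GDS.

```sql
-- Sketch: "sort_key" is a hypothetical helper column, used only for ordering
-- and hidden in the report. The campaign row sorts before its countries
-- because the emoji-prefixed country labels compare greater than plain text.
SELECT project AS line, project AS sort_key, SUM(budget) AS budget, SUM(spent) AS spent
FROM `test-187010.work.factraw`
GROUP BY 1, 2
UNION ALL
SELECT CONCAT("\U0001f680", country), project, budget, spent
FROM `test-187010.work.factraw`
WHERE country IS NOT NULL
ORDER BY sort_key, line
```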
3- BI Engine does not support external tables
Every time you open the report, GDS will issue a new query, which costs a minimum of 10 MB even if the data is only 1 KB (it is a big data thing, after all); to avoid that, we extract the data.
We use conditional formatting to highlight the campaign rows.
Needless to say, you should not use this unless you have to; cross-filtering will be a mess. Hopefully GDS will improve pivot table formatting in the near future.
I love Google Data Studio; I am using it for the project nemtracker.github.io, and it is perfect for that use case.
For other dashboards at work I use Power BI. For no particular reason, I tried a new experiment: can GDS be used instead? It is an academic exercise only. The use case is typical business reporting: a lot of small datasets with different granularities, something like budget vs. items sold, etc.
I am just recording the pain points in GDS. The good news is that most of them are under development, and some of them overlap; for example, if we had parameter controls, there would probably be less need for blending (which is very limited at the moment).
This is not a critique; GDS has some killer features. I particularly like custom visuals, as there is no limit on the number of data points plotted, which is a pain in other software.
The assumption is that all the data sources are already loaded, cleaned, and ready to be analysed in BigQuery.
TL;DR: the pain points are at the calculation level. Obviously, if all your data is at the same granularity, then everything is easy. My conclusion: nearly there! I will revise this when parameter controls are supported.
Instead of writing a full blog post, I thought showing a report would be a more practical approach.
In the last 12 months, Google Data Studio has added many interesting new features, especially the integration with BigQuery BI Engine and custom SQL queries.
Obviously, I am a huge Power BI fan, and I think it is the best thing that has happened to analytics since Excel, but if you want to share a secure report without requiring a license for every user, Data Studio is becoming a valid option.
I have already blogged about building a near real-time dashboard using BigQuery and Data Studio, but in this quick post I will try to show that, using SQL, one can build reports with more complex business logic.
I am using a typical dataset from our industry: a lot of fact tables with different granularities. The fact tables don’t all update at the same time; planned values change only when there is a program revision, while actuals change every day.
Instead of writing the steps here, please view the report, which includes the how-to and the results.
The approach is pretty simple, and all modern BI software works more or less the same way (at least Power BI and Qlik; Tableau is coming soon): you load data into different tables, you model the data by creating relationships between the tables, then you create measures. When you click on a filter, for example, or add a dimension to a chart, the software generates a SQL query to the data source based on the relationships defined in the data model. It is really amazing; even without knowing any SQL you can do very complicated analysis.
Data Studio is no different from the other tools; its data modeling is called blending. Blending links all the tables together using a left join, which is a big limitation: if some values exist in one table and not in the others, you will miss data.
The idea is to bypass the modeling layer and write some SQL code, and to make it dynamic, use parameters. It is not an ideal solution for the average business user (we don’t particularly like code), but it is a workaround until Data Studio improves its offering.
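As a sketch of what such a SQL workaround can look like (the table and column names below are hypothetical, not from the report): a full outer join keeps rows that exist in only one of the fact tables, which is exactly what blending's left join would lose.

```sql
-- Hypothetical fact tables at different granularities:
-- budget is stored per month, actuals per day.
SELECT
  COALESCE(b.month, DATE_TRUNC(a.day, MONTH)) AS month,
  ANY_VALUE(b.budget) AS budget,   -- one budget row per month
  SUM(a.amount)       AS actual
FROM `project.dataset.budget` b
FULL OUTER JOIN `project.dataset.actuals` a
  ON b.month = DATE_TRUNC(a.day, MONTH)
GROUP BY 1
```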
TL;DR: the report is https://nemtracker.github.io/. Please note that my experience with BigQuery and the Google stack is rather limited; this is just my own perspective as a business user.
Edit, 20 Sept 2019: Data Studio now uses BI Engine by default when connecting to BigQuery, and the report now contains the historical data too.
I had already built a dashboard that tracks AEMO data using Power BI, and it is nearly perfect, except that the maximum number of refreshes per day is 8, which is quite OK (DirectQuery is not an option, as it is not supported when you publish to web; actually it is supported, but rather slow). But for some reason, I wondered how hard it would be to build a dashboard that always shows the latest data.
Edit, 23 Sept 2019: actually, my go-to solution for near real-time reporting is now Google Data Studio; once you get used to real time, you can’t go back.
The requirements are:
Very minimal cost; it is just a hobby.
Near real time (the data is published every 5 minutes).
Export to CSV.
Free to share.
Ideally not too technical; I don’t want something I have to build from scratch.
I got some advice from a friend who works on this kind of scenario, and it seems the best option is to build a web app with a database like PostgreSQL, with a front end in the likes of Apache Superset or RStudio Shiny, hosted in a cheap VM on DigitalOcean, which I may eventually do. But I thought, let’s give BigQuery a try: the free tier is very generous, 1 TB of free queries per month is more than enough, and Data Studio is totally free and uses a live connection by default.
Unlike Power BI, which is a whole self-service BI solution in one package, Google’s offering is split into three separate streams: ETL, the data warehouse (BigQuery), and the reporting tool (Data Studio). The pricing is pay per usage.
For the ETL, Dataprep would be the natural choice for me (the service is provided by Trifacta), but to my surprise, apparently you can’t import data from a URL. I think I was a bit unfair to Trifacta: the data has to be in Google Cloud Storage first, which is fine, but the lack of support for zip is hard to understand; at least in the type of business I work in, everyone uses zip.
I tried to use Data Fusion, but it involves spinning up a new Spark cluster, and the price is around $3,000 per month!
I think I will stick with Python for the moment.
The first thing to do after creating a new project in BigQuery is to set up cost control.
The minimum I could get for BigQuery is 0.5 TB per day.
The source files are located here: very simple CSV files, compressed with zip. I care only about three fields:
SETTLEMENTDATE: timestamp
DUID: generator ID (power station, solar or wind farm, etc.)
SCADAVALUE: electricity produced in MW
Add a table partitioned by day and clustered by the field DUID.
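The step above translates to a table definition along these lines (the dataset and table names are my own placeholders):

```sql
-- Partitioning by day keeps each query to only the partitions it needs;
-- clustering by DUID speeds up filtering on individual generators.
CREATE TABLE `work.scada` (
  SETTLEMENTDATE TIMESTAMP,
  DUID STRING,
  SCADAVALUE FLOAT64
)
PARTITION BY DATE(SETTLEMENTDATE)
CLUSTER BY DUID
```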
Write a Python script that loads the data to BigQuery; you can have a look at the code used here. Hopefully I will blog about it separately.
Schedule the script to run every 5 minutes. I am a huge fan of Azure WebJobs; to be honest, I first tried Google Cloud Functions, but you can’t write anything to the local folder by default. It seems the container has to be stateless, and I just find it easy when I can write temporary data to the local folder (I have a limited understanding of Cloud Functions; that was my first impression anyway). Now I am using Cloud Functions and Cloud Scheduler: Cloud Functions provide a /tmp folder you can write to, although it uses some memory resources.
I added a dimension table that shows the full description for each generator id, the region, etc. I have the coordinates too, but strangely, the Data Studio map visual does not support tiles!
Create a view that joins the two tables, removes any duplicates, and filters out the rows where there is no production (SCADAVALUE = 0); if there is no full description yet for a generator id, use the id instead.
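A sketch of what such a view could look like, assuming the fact table is `work.scada` and the dimension table is `work.duid_dim` (both names, the STATION_NAME column, and the dedup rule are my assumptions):

```sql
CREATE OR REPLACE VIEW `work.production_vw` AS
SELECT
  s.SETTLEMENTDATE,
  s.SCADAVALUE,
  COALESCE(d.STATION_NAME, s.DUID) AS generator  -- fall back to the raw id
FROM (
  -- keep one row per (DUID, SETTLEMENTDATE) to remove duplicates
  SELECT *, ROW_NUMBER() OVER (PARTITION BY DUID, SETTLEMENTDATE) AS rn
  FROM `work.scada`
) s
LEFT JOIN `work.duid_dim` d USING (DUID)
WHERE s.rn = 1 AND s.SCADAVALUE <> 0
```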
Notice that although it is a view, partition pruning still works. There is, however, a minimum of 10 MB billed per table regardless of the data scanned, and for billing BigQuery uses the uncompressed size! One very good thing, though: query results are cached for 1 day, so if you run the same query again, it is free!
Create the Data Studio report: I will create two connections.
Live connection: pull only today’s data. Every query costs 20 MB, as it uses only one date partition (2 tables); the speed is satisfactory. Make sure to deactivate the cache.
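The live connection boils down to a filter on the partitioning column, something like the query below (the view name is my placeholder, and the time zone argument is an assumption you would adjust):

```sql
-- Filtering on the partition column means only today's partition is scanned.
SELECT generator, SETTLEMENTDATE, SCADAVALUE
FROM `work.production_vw`
WHERE DATE(SETTLEMENTDATE) = CURRENT_DATE("Australia/Brisbane")
```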
But to confuse everyone, there are two types of caches; the implication is that sometimes you get different results depending on whether your selection hits the cache or not. As the editor of the report it is not an issue, since I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works; sometimes when I test it in incognito mode I get the latest data, sometimes not.
Import connection: this is called an extract; it loads the data into Data Studio’s in-memory database (it uses BI Engine, created by one of the original authors of multidimensional). Just be careful, as the maximum that can be imported is 100 MB uncompressed, which is rather small (OK, it is free, so I can’t really complain). Once, I was very confused about why the data did not match; it turned out Data Studio truncates the import without warning. Anyway, to optimise this 100 MB, I extract a summary of the data, remove the time dimension, and filter to the last 14 days only, and I schedule the extract to run every day at 12:30 AM; note that today’s data is not included.
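The extract query can be sketched like this (again with my placeholder names): aggregate away the time-of-day dimension and keep only the last 14 full days, which keeps the result well under the 100 MB cap.

```sql
SELECT
  generator,
  DATE(SETTLEMENTDATE) AS day,
  AVG(SCADAVALUE)      AS avg_mw   -- or whichever aggregate suits the report
FROM `work.production_vw`
WHERE DATE(SETTLEMENTDATE) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
                               AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY generator, day
```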
Note: because both datasets use the same data source, cross-filtering works by default; if you use two different sources (say, CSV and Google Search), you need some awkward workarounds to make it work.
Voilà, the live report 😊. A nice feature shown here (sorry for the gif quality) is the export to Sheets.
Schedule email delivery
Although the report is very simple, I must admit I find it very satisfying; there is some little pleasure in watching real-time data. Some missing features I would love to have:
An option to deactivate all the caches, or bring back the option to let the viewer manually refresh the report.
An option to trigger email delivery based on an alert (for example, when a measure reaches a maximum value), or at least to schedule email delivery multiple times per day.
Make the Data Studio web site mobile friendly; it is hard to select a report from the list of available reports.
Google Data Studio support for maps is nearly non-existent; that’s a showstopper for a lot of business scenarios.