How to Build a near real time Dashboard using Data Studio and BigQuery

TLDR, the report is https://nemtracker.github.io/, please note, my experience with BigQuery and Google stack is rather limited, this is just my own perspective as a business user .

Edit : 20 Sept 2019, Data Studio use now BI engine by default for connecting to BigQuery, now the report contains the historical data too.

I built already a dashboard that track AEMO Data using PowerBI,  and it is nearly perfect except , the maximum update per day is 8 time, which is quite ok ( direct Query is not an option as it is not supported when you publish to web, actually it is support but rather slow) , but for some reason, I thought how hard would it be to build a dashboard that show always the latest Data.

Edit : 23 Sept 2019, actually now, my go to solution for near real time reporting is Google Data Studio, once you get used to real time time, you can’t go back.

The requirements are

  1. Very minimum cost, it is just a hobby
  2. Near Real time (the data is published every 5 minutes)
  3. Export to csv
  4. Free to share.
  5. Ideally not too much technical, I don’t want something to build from scratch.

I got some advices from a friend who works in this kind of scenario and it seems the best option is to build a web app with a database like Postgresql,  with a front end in the likes of apache superset or Rstudio Shiny and host it  in a cheap VM by digitalocean , which I may eventually do, but I thought let’s give BigQuery a try, the free tier is very generous, 1 TB of free Queries per month is more than enough, and Data Studio is totally free and by default use live connection.

Unlike PowerBI which is a whole self service BI solution in one package, Google offering is split to three separate streams, ETL, the data warehouse (Biguery) and the reporting tool (Data Studio), the pricing is pay per usage

For the ETL, Dataprep would be the natural choice for me,( the service is provided by Trifacta), but to my surprise, apparently you can’t import data from an URL, I think I was a bit unfair to Trifacta, the data has to be in google storage first, which is fine, but the lack of support for zip is hard to understand, at least in the type of business I work for, everyone is using zip

I tried to use Data fusion, but it involve spinning a new spark cluster !!!! , and their price is around 3000 $ per month !!!!!

I think I will stick with Python for the moment.

  • The first thing you do after creating a new project in BigQuery is to setup cost control.

The minimum I could get for BigQeury is 0.5 TB per day

  • The source files are located here, very simple csv file, compressed by zip, I care only about three fields

SETTLEMENT DATE  : timestamp

DUID                            : Generator ID , ( power station, solar, wind farm etc)

SCADAVALUE             : Electricity produced in Mw

  • Add a table with partition per day and clustered by the field DUID
  • Write a python script that load data to Bigquery,you can have a look at the code used here, hopefully I will blog about it separately
  • Schedule the script to run every 5 minutes: I am huge fan of azure WebJob, to be honest I tried to use Google function but you can’t write anything in the local folder by default, it seems the container has to be stateless but I just find it easy when I can write temporary data in the local folder (I have a limited understanding of Google function, that was my first impression anyway) , now, I am using google functions and cloud Scheduler, Google functions provide a /tmp that you can write to it, it will use some memory resources.
  • I added a dimension table that show a full Description for the generator id, region etc, I have the coordinates too, but strangely, Data Studio map visual does not support tiles!!!
  • Create a view that join the two tables and remove any duplicate, and filter out the rows where there is no production (SCADAVALUE =0), if there is no full Description yet for the generator id, use the id instead

Notice here, although it is a view, the filter per partition still works, and there is a minimum of 10 MB per table regardless of the memory scanned, for billing BigQuery used the uncompressed size !!

One very good thing though, the queries results are cached for 1 day, if you do the same query again, it is free!

  • Create the Data Studio report : I will create two connections :
  • live connection: pull only today data, every query cost 20 MB, as it is using only one date partition, (2 Tables), the speed is satisfactory, make sure to disactivate the cache

But to confuse everyone there two types of caches, see documentation here, the implication is sometimes you get different updated depending if your selection hit the cache or not, as the editor of the report, it is not an issue, I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works, sometimes, when I test it with incognito mode, I get the latest data sometimes not.

  • Import connection : it is called extract, it load the data to Data Studio in-memory database (it uses BI engine created by one of the original authors of multidimensional) , just be careful as the maximum that can be imported is 100 MB (non compressed), which is rather very small (ok it is free so I can’t complain really), once I was very confused why the data did not match, it turn out Data Studio truncate the import without warning, anyway to optimise this 100 MB, I extract a summary of the data and removed the time dimension and filtered only to the last 14 days, and I schedule the extract to run every day at 12:30 AM, notice today data is not included.

Note : Because both datasets use the same data source, cross filtering works by default, if using two different sources (let’s say, csv and google search, you need some awkward workaround to make it works)

  • Voila the live report, 😊 a nice feature shown here (sorry for the gif quality) is the export to Sheet
  1. Schedule email delivery

  although the report is very simple, I must admit, I find it very satisfying, there is some little pleasure in watching real time data, some missing features, I would love to have

  • An option to disactivate all the caches or bring back the option to let the viewer manually refresh the report.
  • An option to trigger email delivery based on alert, (for example when a measure reaches a maximum value), or at least schedule email delivery multiple time per day.
  • Make data Studio web site mobile friendly, it is hard to select the report from the list of available reports.
  • Google Data Studio support for maps is nearly non existent, that’s a showstopper for a lot of business scenarios

Using WebJobs to scrap public website and copy data to Azure blob storage.

When I start that AEMO Dashboard , I had a hard time dealing with PowerBI gateway, it is just setting there, my laptop has to be  online whenever I need to schedule a refresh, it just annoyed me, and I could not understand how cloud based data needs on-premises gateway anyway,   obviously later I learned that strictly speaking it was not required, there was just undocumented feature to get away of it ( the trick is in the first blog, thanks @Rad_Reza ).

but before I was aware of that, I went to some rabbit holes dealing with new tools that are out of my comfort zone, and I think they are worth sharing.

my first thought was instead of accessing the data directly from the website, let’s instead copy the data to a cloud storage then read it from there, I have already a google storage account, it is very generous with a free 5GB storage, my data is not big around 2 GB of zipped csv..

first setback, there is no native connector to Google storage and even if there was ,we have something called egress fees, in a nutshell, cloud storage is really low cost, loading data is free, but getting your data out is not free, unless it is for the same provider and the same region, most of the cloud vendors use the same model,  as my data will be processed in PowerBI, the clear choice is azure blob storage

Azure Blob Storage

the setup is very simple I used the following options :

  • the same region as my PowerBI region ( otherwise your pay the egress fees)
  • for replication I used LRS 

as PowerBI don’t support data lake V2, I used the classical Blob Storage.

Let’s move some data 

anyway, now I have my storage, I need a tool to copy the files from here http://nemweb.com.au/Reports/Current/Daily_Reports/ to my storage account.

Azure data factory

 When you want to copy data, the official tool in azure is data factory, I tried to play around with copy activities, it is straightforward, my first attempt did work and it was fast , actually too fast 😊, no zip was transferred but rather an  HTML 

probably copy data just handle this case just fine, but when you use your own credit card on a cloud tool and you don’t know what you are doing, better stay back and take the time to understand how it works, I deleted the new created resources and went to the second option, Python !!!

 PYTHON

Normally I go with R but blob storage has no API for R, I have very limited experience with Python , just using it for the excellent package altair , let’s try something new.

I was very pleasantly surprised, the amount of documentation for Python is just amazing, actually once I asked a question on stackoverflow and got a very succinct answer in less than a minute, no one was judgemental or downvoted my question ( the question was very basic). the only drawback is that sometimes the code works well for python 2, but I am using Python 3 anyway enough talking let’s show some pseudo code.

step 1 : get a list of files name from the web site

url = “web address where the files are saved”

result = urlopen(url).read().decode(‘utf-8’)

pattern = re.compile(r'[\w.]*.zip’)

here is a snapshot of the results, the full list is 60 items.

[‘PUBLIC_DAILY_201904260000_20190427040503.zip’, ‘PUBLIC_DAILY_201904260000_20190427040503.zip’, ‘PUBLIC_DAILY_201904270000_20190428040502.zip’]

step 2 : get a list of files name from the blob storage

 in the first run, the list is empty as we did not load anything yet, I load a couple of files manually just to test if it is working, the API for blob storage are very simple, you only need to provide your storage account name and key and  I love that.

block_blob_service = BlockBlobService(account_name=’’xxxxxx’,                                                          account_key=’xxxxxx’’)

generator = block_blob_service.list_blobs(container_name,prefix=”current/”)

the same you get a list of names.

step 3 : get a list of files that exist in web  and don’t exist in the storage

the code in Python is very simple, it is simply substraction of two sets, and then you converted to a list using function list ( i get why people like Python)

files_to_upload = list(set(List_website)-set(list_azure))

step 4 : Upload the new files to Azure Blob Storage

the same here, the Azure API are very simple and clear, I had only when issues, when the script upload in a loop, it does not wait until the transfer is completed before jumping to the next file, my workaround was just to use sleep ( sync is supported but not in this scenario where the input is from an url), anyone i got the answer in stackoverflow

for x in files_to_upload:

    block_blob_service.copy_blob(container_name,x,url+x)

    copy_status = block_blob_service.get_blob_properties(container_name,x)
    #use code below to check the copy status, if it’s completed or not.

    while(str(copy_status.properties.copy.status) != “success”):

        copy_status = block_blob_service.get_blob_properties(container_name,x)

basically wait till the status of the copy is success before moving to the next item, ( did I say I love Python syntax)

the full script is here 

 WebJobs; a Free Job Schedule  

ok, so we do have a script that works, now we need to run it on a schedule, once per day at 5 AM,  keep in mind the whole purpose of this workflow is not to use on-premise software, I just need to find the service that runs a script on the cloud on a schedule, as I am already on azure, let’s stick in that ecosystem.

and it is a personal project, I prefer a free solution,  my script runs only every 24 hours, for a couple of minutes,  a quick google search and i find this little treasure, I will not repeat here the steps, WebJobs is a service that just do that.

note that the package azure-blob-storage is not a base package in Python you need to install first in WebJobs, the schedule functionality is very flexible as it is using CRON, I wish we had something like that in PowerBI Dataflows.

End results 

Every day at exactly 5 AM, a new file show up in the azure storage, although I don’t need those files, I am using now another approach to load the files directly in PowerBI, it is important to build a data lake ( yes, I just said that, I am just joke, data lake is folder in the cloud where you save the raw files, nothing more), storage is cheap but most importantly the requirement may change, I may need to report on another dimension and it is crucial to keep the raw unprocessed data.

Take away

  • Python is awesome
  • Azure API for python are straightforward
  • Azure is awesome.
  • Be careful of Egress fees
  • CRON is awesome wish it was supported in PowerBI dataflows.
  • Wish PowerBI dataflows could save a raw file, Powerquery is amazing but it does not copy raw files.
  • Wish WebJobs add support to R