Today was playing with a report in PowerBI and I got this idea of exporting data to BigQuery from PowerQuery, let me tell you something, it is very easy and it works rather well, PowerQuery is an amazing technology ( and it is free).
in PowerBI,you can export from R or Python visuals but there are a limitation of 150K rows, but if you use PowerQuery, there is no limitation ( I tried with a table of 23 Millions records and it works)
interesting after the step in Python we get a table, simply expand it
here is the total rows of the table in PowerBI
the results in BigQuery
ok, PowerQuery flow can execute many times, it is a black magic knowledge that’s only a handful of people knows, but in this cases, it does not matter, the BigQuery job truncate the tables every time, so there is no risk of data duplication.
probably you may ask why do that if there are a lot of data preparation tools that natively support BigQuery, based on my own experience, most of my data sources are Excel files and PowerQuery is just very powerful and versatile specially if you deal with “dirty” format.
TLDR, the report is https://nemtracker.github.io/, please note, my experience with BigQuery and Google stack is rather limited, this is just my own perspective as a business user .
Edit : 20 Sept 2019, Data Studio use now BI engine by default for connecting to BigQuery, now the report contains the historical data too.
I built already a dashboard that track AEMO Data using PowerBI, and it is nearly perfect except , the maximum update per day is 8 time, which is quite ok ( direct Query is not an option as it is not supported when you publish to web, actually it is support but rather slow) , but for some reason, I thought how hard would it be to build a dashboard that show always the latest Data.
Edit : 23 Sept 2019, actually now, my go to solution for near real time reporting is Google Data Studio, once you get used to real time time, you can’t go back.
The requirements are
Very minimum cost, it is just a hobby
Near Real time (the data is published every 5
Export to csv
Free to share.
Ideally not too much technical, I don’t want
something to build from scratch.
I got some advices from a friend who works in this kind of scenario and it seems the best option is to build a web app with a database like Postgresql, with a front end in the likes of apache superset or Rstudio Shiny and host it in a cheap VM by digitalocean , which I may eventually do, but I thought let’s give BigQuery a try, the free tier is very generous, 1 TB of free Queries per month is more than enough, and Data Studio is totally free and by default use live connection.
Unlike PowerBI which is a whole self service BI solution in one package, Google offering is split to three separate streams, ETL, the data warehouse (Biguery) and the reporting tool (Data Studio), the pricing is pay per usage
For the ETL, Dataprep would be the natural choice for me,( the service is provided by Trifacta), but to my surprise, apparently you can’t import data from an URL, I think I was a bit unfair to Trifacta, the data has to be in google storage first, which is fine, but the lack of support for zip is hard to understand, at least in the type of business I work for, everyone is using zip
I tried to use Data fusion, but it involve spinning a new spark cluster !!!! , and their price is around 3000 $ per month !!!!!
I think I will stick with Python for the moment.
The first thing you do after creating a new project in BigQuery is to setup cost control.
The minimum I could get for BigQeury is 0.5 TB per day
The source files are located here, very simple csv file, compressed by zip, I care only about three fields
SETTLEMENT DATE : timestamp
DUID : Generator ID , ( power station, solar, wind farm etc)
SCADAVALUE : Electricity produced in Mw
Add a table with partition per day and clustered
by the field DUID
Write a python script that load data to Bigquery,you can have a look at the code used here, hopefully I will blog about it separately
Schedule the script to run every 5 minutes: I am huge fan of azure WebJob, to be honest I tried to use Google function but you can’t write anything in the local folder by default, it seems the container has to be stateless but I just find it easy when I can write temporary data in the local folder (I have a limited understanding of Google function, that was my first impression anyway) , now, I am using google functions and cloud Scheduler, Google functions provide a /tmp that you can write to it, it will use some memory resources.
I added a dimension table that show a full Description for the generator id, region etc, I have the coordinates too, but strangely, Data Studio map visual does not support tiles!!!
Create a view that join the two tables and
remove any duplicate, and filter out the rows where there is no production
(SCADAVALUE =0), if there is no full Description yet for the generator id, use
the id instead
Notice here, although it is a view, the filter per
partition still works, and there is a minimum of 10 MB per table regardless of
the memory scanned, for billing BigQuery used the uncompressed size !!
One very good thing though, the queries results are cached for 1 day, if you do the same query again, it is free!
Create the Data Studio report : I will create two connections :
live connection: pull only today data, every query cost 20 MB, as it is using only one date partition, (2 Tables), the speed is satisfactory, make sure to disactivate the cache
But to confuse everyone there two types of caches, see
the implication is sometimes you get different updated depending if your
selection hit the cache or not, as the editor of the report, it is not an issue,
I can manually click refresh, but for the viewer, to be honest, I am not even
sure how it works, sometimes, when I test it with incognito mode, I get the
latest data sometimes not.
Import connection : it is called extract, it load the data to Data Studio in-memory database (it uses BI engine created by one of the original authors of multidimensional) , just be careful as the maximum that can be imported is 100 MB (non compressed), which is rather very small (ok it is free so I can’t complain really), once I was very confused why the data did not match, it turn out Data Studio truncate the import without warning, anyway to optimise this 100 MB, I extract a summary of the data and removed the time dimension and filtered only to the last 14 days, and I schedule the extract to run every day at 12:30 AM, notice today data is not included.
Note : Because both datasets use the same data source,
cross filtering works by default, if using two different sources (let’s say,
csv and google search, you need some awkward workaround to make it works)
Voila the live report, 😊 a nice feature shown here (sorry for the gif quality) is the export to Sheet
Schedule email delivery
although the report
is very simple, I must admit, I find it very satisfying, there is some little pleasure
in watching real time data, some missing features, I would love to have
An option to disactivate all the caches or bring back the option to let the viewer manually refresh the report.
An option to trigger email delivery based on alert, (for example when a measure reaches a maximum value), or at least schedule email delivery multiple time per day.
Make data Studio web site mobile friendly, it is hard to select the report from the list of available reports.
Google Data Studio support for maps is nearly non existent, that’s a showstopper for a lot of business scenarios
When I start that AEMO Dashboard , I had a hard time dealing with PowerBI gateway, it is just setting there, my laptop has to be online whenever I need to schedule a refresh, it just annoyed me, and I could not understand how cloud based data needs on-premises gateway anyway, obviously later I learned that strictly speaking it was not required, there was just undocumented feature to get away of it ( the trick is in the first blog, thanks @Rad_Reza).
but before I was aware of that, I went to some rabbit holes dealing with new tools that are out of my comfort zone, and I think they are worth sharing.
my first thought was instead of accessing the data directly from the website, let’s instead copy the data to a cloud storage then read it from there, I have already a google storage account, it is very generous with a free 5GB storage, my data is not big around 2 GB of zipped csv..
first setback, there is no native connector to Google storage and even if there was ,we have something called egress fees, in a nutshell, cloud storage is really low cost, loading data is free, but getting your data out is not free, unless it is for the same provider and the same region, most of the cloud vendors use the same model, as my data will be processed in PowerBI, the clear choice is azure blob storage
Azure Blob Storage
the setup is very simple I used the following options :
the same region as my PowerBI region ( otherwise your pay the egress fees)
for replication I used LRS
as PowerBI don’t support data lake V2, I used the classical Blob Storage.
When you want to copy data, the official tool in azure is data factory, I tried to play around with copy activities, it is straightforward, my first attempt did work and it was fast , actually too fast 😊, no zip was transferred but rather an HTML
probably copy data just handle this case just fine, but when you use your own credit card on a cloud tool and you don’t know what you are doing, better stay back and take the time to understand how it works, I deleted the new created resources and went to the second option, Python !!!
Normally I go with R but blob storage has no API for R, I have very limited experience with Python , just using it for the excellent package altair , let’s try something new.
I was very pleasantly surprised, the amount of documentation for Python is just amazing, actually once I asked a question on stackoverflow and got a very succinct answer in less than a minute, no one was judgemental or downvoted my question ( the question was very basic). the only drawback is that sometimes the code works well for python 2, but I am using Python 3 anyway enough talking let’s show some pseudo code.
step 1 : get a list of files name from the web site
url = “web address where the files are saved”
result = urlopen(url).read().decode(‘utf-8’)
pattern = re.compile(r'[\w.]*.zip’)
here is a snapshot of the results, the full list is 60 items.
step 2 : get a list of files name from the blob storage
in the first run, the list is empty as we did not load anything yet, I load a couple of files manually just to test if it is working, the API for blob storage are very simple, you only need to provide your storage account name and key and I love that.
step 4 : Upload the new files to Azure Blob Storage
the same here, the Azure API are very simple and clear, I had only when issues, when the script upload in a loop, it does not wait until the transfer is completed before jumping to the next file, my workaround was just to use sleep ( sync is supported but not in this scenario where the input is from an url), anyone i got the answer in stackoverflow
ok, so we do have a script that works, now we need to run it on a schedule, once per day at 5 AM, keep in mind the whole purpose of this workflow is not to use on-premise software, I just need to find the service that runs a script on the cloud on a schedule, as I am already on azure, let’s stick in that ecosystem.
and it is a personal project, I prefer a free solution, my script runs only every 24 hours, for a couple of minutes, a quick google search and i find this little treasure, I will not repeat here the steps, WebJobs is a service that just do that.
note that the package azure-blob-storage is not a base package in Python you need to install first in WebJobs, the schedule functionality is very flexible as it is using CRON, I wish we had something like that in PowerBI Dataflows.
Every day at exactly 5 AM, a new file show up in the azure storage, although I don’t need those files, I am using now another approach to load the files directly in PowerBI, it is important to build a data lake ( yes, I just said that, I am just joke, data lake is folder in the cloud where you save the raw files, nothing more), storage is cheap but most importantly the requirement may change, I may need to report on another dimension and it is crucial to keep the raw unprocessed data.
Python is awesome
Azure API for python are straightforward
Azure is awesome.
Be careful of Egress fees
CRON is awesome wish it was supported in PowerBI dataflows.
Wish PowerBI dataflows could save a raw file, Powerquery is amazing but it does not copy raw files.