In the last 12 months, Google Datastudio has added many interesting new features, especially the integration with BigQuery BI Engine and custom SQL queries.
Obviously, I am a huge PowerBI fan, and I think it is the best thing that has happened to analytics since Excel, but if you want to share a secure report without requiring a license for every user, Datastudio is becoming a valid option.
I have already blogged about building a near real-time dashboard using BigQuery and Datastudio, but in this quick blog, I will try to show that using SQL you can create reports with more complex business logic.
I am using a typical dataset from our industry: a lot of fact tables with different granularity. The fact tables don't all update at the same time; planned values change only when there is a program revision, while actuals change every day.
Instead of writing the steps here, please view the report, which includes the how-to and the results.
The approach is pretty simple. All modern BI software works more or less the same way (at least PowerBI and Qlik; Tableau is coming soon): you load data into different tables, you model the data by creating relationships between the tables, then you create measures. When you click on a filter, for example, or when you add a dimension to a chart, the software generates a SQL query against the data source based on the relationships defined in the data model. It is really amazing: even without knowing any SQL you can do very complicated analysis.
DataStudio is no different from other tools. Its data modeling is called Blending: it links all the tables together using a left join, which is a big limitation, because if some values exist in one table and not in the others, you will miss data.
The idea is to bypass the modeling layer and write some SQL code, and to make it dynamic, use parameters. It is not an ideal solution for an average business user (we don't particularly like code), but it is a workaround until DataStudio improves its offering.
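As a sketch of what such a parameterised custom query can look like (the project, dataset, and column names here are invented, not taken from the actual report), Datastudio's BigQuery connector injects its date-range control as the @DS_START_DATE and @DS_END_DATE parameters, passed as YYYYMMDD strings:

```python
# Hypothetical Datastudio custom query; @DS_START_DATE / @DS_END_DATE are the
# connector's built-in date-range parameters (YYYYMMDD strings).
custom_query = """
SELECT DUID,
       SUM(SCADAVALUE) AS total_mw
FROM `my-project.aemo.scada`
WHERE DATE(SETTLEMENTDATE) BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
                               AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
GROUP BY DUID
"""
```

When the viewer changes the date-range control, Datastudio re-runs the query with the new parameter values, so the filtering happens in BigQuery rather than in the blend.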
TL;DR: the report is here. Please note, my experience with BigQuery and the Google stack is rather limited; this is just my own perspective as a business user.
Edit, 20 Sept 2019: DataStudio now uses BI Engine by default for connecting to BigQuery, and the report now contains the historical data too.
I had already built a dashboard that tracks AEMO data using PowerBI, and it is nearly perfect, except that the maximum number of updates per day is 8, which is quite OK (DirectQuery is not an option, as it is not supported when you publish to the web). But for some reason, I wondered how hard it would be to build a dashboard that always shows the latest data.
Edit, 23 Sept 2019: actually, my go-to solution for near real-time reporting is now Google Datastudio; once you get used to real time, you can't go back.
The requirements are:
Very minimal cost, it is just a hobby.
Near real time (the data is published every 5 minutes).
Export to CSV.
Free to share.
Ideally not too technical; I don't want something to build from scratch.
I got some advice from a friend who works on this kind of scenario, and it seems the best option is to build a web app with a database like PostgreSQL, with a front end along the lines of Apache Superset or RStudio Shiny, hosted on a cheap VM at DigitalOcean. I may eventually do that, but I thought, let's give BigQuery a try: the free tier is very generous, 1 TB of free queries per month is more than enough, and Datastudio is totally free and uses a live connection by default.
Unlike PowerBI, which is a whole self-service BI solution in one package, Google's offering is split into three separate streams: ETL, the data warehouse (BigQuery), and the reporting tool (Datastudio). The pricing is pay per usage.
For the ETL, Dataprep would be the natural choice for me (the service is provided by Trifacta), but to my surprise, apparently you can't import data from a URL. I think I was a bit unfair to Trifacta; the data has to be in Google Cloud Storage first, which is fine, but the lack of support for zip is hard to understand. At least in the type of business I work in, everyone is using zip.
I tried to use Data Fusion, but it involves spinning up a new Spark cluster, and the price is around $3,000 per month!
I think I will stick with Python for the moment.
The first thing to do after creating a new project in BigQuery is to set up cost control.
The minimum I could get for BigQuery is 0.5 TB per day.
The source files are located here: very simple CSV files, compressed with zip. I care only about three fields:
SETTLEMENTDATE: timestamp
DUID: generator ID (power station, solar farm, wind farm, etc.)
SCADAVALUE: electricity produced in MW
Add a table partitioned per day and clustered by the field DUID.
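A minimal sketch of that table definition (the project and dataset names are made up; only the three fields above are kept):

```python
# Hypothetical BigQuery DDL: partition per day on the timestamp, cluster by DUID.
create_table_sql = """
CREATE TABLE IF NOT EXISTS `my-project.aemo.scada` (
  SETTLEMENTDATE TIMESTAMP,
  DUID STRING,
  SCADAVALUE FLOAT64
)
PARTITION BY DATE(SETTLEMENTDATE)
CLUSTER BY DUID
"""
```

Partitioning keeps a "today only" query cheap (it scans one partition), and clustering by DUID reduces the data scanned further when filtering on a generator.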
Write a Python script that loads the data to BigQuery. You can have a look at the code used here; hopefully I will blog about it separately.
Schedule the script to run every 5 minutes. I am a huge fan of Azure WebJobs; to be honest, I tried Google Cloud Functions first, but by default you can't write anything to the local folder, it seems the container has to be stateless, and I just find it easy when I can write temporary data to the local folder (I have a limited understanding of Cloud Functions; that was my first impression anyway). Now I am using Cloud Functions and Cloud Scheduler: Cloud Functions provide a /tmp directory that you can write to, although it uses some memory resources.
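A rough sketch of such a function, not the real script linked above (the URL, table name, and CSV layout are placeholders; real AEMO files have extra header rows): download the zip into /tmp, extract the CSV, keep the three fields, and load them to BigQuery.

```python
import csv
import io

def parse_scada_csv(text):
    """Keep only the three fields we care about.
    Assumes a plain header row; the real AEMO layout differs slightly."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"SETTLEMENTDATE": r["SETTLEMENTDATE"],
         "DUID": r["DUID"],
         "SCADAVALUE": float(r["SCADAVALUE"])}
        for r in reader
    ]

def load_latest(event=None, context=None):
    """Cloud Function entry point (illustrative only): /tmp is the one
    writable path inside the function's container."""
    import urllib.request, zipfile
    from google.cloud import bigquery  # external dependency

    urllib.request.urlretrieve("https://example.com/latest.zip", "/tmp/latest.zip")
    with zipfile.ZipFile("/tmp/latest.zip") as z:
        text = z.read(z.namelist()[0]).decode("utf-8")
    rows = parse_scada_csv(text)
    bigquery.Client().insert_rows_json("my-project.aemo.scada", rows)
```

Cloud Scheduler can then invoke this function every 5 minutes via an HTTP or Pub/Sub trigger.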
I added a dimension table that shows the full description for the generator id, region, etc. I have the coordinates too, but strangely, the Datastudio map visual does not support tiles!
Create a view that joins the two tables, removes any duplicates, and filters out the rows where there is no production (SCADAVALUE = 0); if there is no full description yet for the generator id, use the id instead.
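The view above can be sketched as follows (table and column names are illustrative, not the real schema): deduplicate on (timestamp, DUID), drop zero production, and fall back to the DUID when the dimension table has no description yet.

```python
# Hypothetical BigQuery view combining dedupe, zero filter, and name fallback.
create_view_sql = """
CREATE OR REPLACE VIEW `my-project.aemo.v_scada` AS
SELECT
  s.SETTLEMENTDATE,
  COALESCE(d.FULL_NAME, s.DUID) AS STATION,
  s.SCADAVALUE
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY SETTLEMENTDATE, DUID
                            ORDER BY SETTLEMENTDATE) AS rn
  FROM `my-project.aemo.scada`
) s
LEFT JOIN `my-project.aemo.duid_dim` d ON s.DUID = d.DUID
WHERE s.rn = 1 AND s.SCADAVALUE <> 0
"""
```

ROW_NUMBER keeps one row per (timestamp, DUID) pair, and COALESCE implements the "use the id instead" fallback.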
Notice here that although it is a view, the partition filter still works. There is a minimum of 10 MB billed per table regardless of the data scanned, and for billing BigQuery uses the uncompressed size!
One very good thing though: query results are cached for 1 day, so if you run the same query again, it is free!
Create the Datastudio report. I will create two connections. Live connection: pull only today's data; every query costs 20 MB, as it uses only one date partition (2 tables). The speed is satisfactory; make sure to deactivate the cache.
But to confuse everyone, there are two types of caches; see the implication: sometimes you get different updates depending on whether your selection hits the cache or not. As the editor of the report it is not an issue, I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works; sometimes when I test it in incognito mode I get the latest data, sometimes not.
Import connection: this is called an extract. It loads the data into Datastudio's in-memory database (it uses BI Engine, created by one of the original authors of multidimensional). Just be careful, as the maximum that can be imported is 100 MB (uncompressed), which is rather small (OK, it is free, so I can't really complain). Once I was very confused why the data did not match; it turned out Datastudio truncates the import without warning. Anyway, to optimise this 100 MB, I extract a summary of the data, remove the time dimension, and filter to only the last 14 days, and I schedule the extract to run every day at 12:30 AM; notice today's data is not included.
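The summary query feeding the extract might look like this (a sketch with invented names, not the actual query): aggregate per day and generator, dropping the time-of-day detail and keeping only the last 14 full days so the result fits under the 100 MB cap.

```python
# Hypothetical extract query: daily totals for the last 14 full days,
# excluding today (which is served by the live connection instead).
extract_sql = """
SELECT DATE(SETTLEMENTDATE) AS day,
       DUID,
       SUM(SCADAVALUE) AS total_mw
FROM `my-project.aemo.scada`
WHERE DATE(SETTLEMENTDATE) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
                               AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY day, DUID
"""
```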
Note: because both datasets use the same data source, cross filtering works by default. If you use two different sources (let's say CSV and Google Search), you need some awkward workarounds to make it work.
Voila, the live report 😊. A nice feature shown here (sorry for the gif quality) is the export to Sheets.
Schedule email delivery
Although the report is very simple, I must admit I find it very satisfying; there is some little pleasure in watching real-time data. Some missing features I would love to have:
An option to deactivate all the caches, or bring back the option to let the viewer manually refresh the report.
An option to trigger email delivery based on an alert (for example, when a measure reaches a maximum value), or at least to schedule email delivery multiple times per day.
Make the Datastudio web site mobile friendly; it is hard to select a report from the list of available reports.
Google Datastudio support for maps is nearly non-existent; that's a showstopper for a lot of business scenarios.
Kepler.gl is a geospatial data analysis tool by Uber! In this short blog we show how to build a map with a time attribute and then show a time playback slider. It is a very simple and straightforward process; all you need is to format the data as a CSV file.
This data was randomized. As an example I show the main steps to build a solar farm (piling, tracker, module, then electrical works), but obviously you can try it with roads, pipelines, any project really that has geographic data.
Please note the data is rendered on your local computer, so it is totally safe.
The map is here; keep reading if you want to understand how it was built.
I attached a sample dataset here; the fields are self-explanatory.
Just be careful with Excel: for some reason, when you open the CSV file, it changes the date format. Make sure the format is like this:
And because in a medium-sized solar farm the total will be around 50,000 piles, if you want to show just 4 main stages, the total number of rows will be 50,000 × 4 = 200,000 rows. PowerBI will not support that (Tableau shines in this scenario).
The documentation is very clear; please read it to experiment with your own data.
If your dataset has field names like X_lat and X_lng, Kepler.gl will automatically recognise it as a point layer.
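A small sketch of preparing such a file (the column names follow Kepler.gl's prefix_lat/prefix_lng auto-detection convention; the coordinates, status values, and timestamp format here are my own illustrative assumptions): write the coordinates with the shared prefix and the timestamps in an unambiguous format that Excel will not mangle.

```python
import csv
import io
from datetime import datetime

rows = [
    {"pile_lat": -23.70, "pile_lng": 133.87,
     "status": "Piling",
     "time": datetime(2019, 9, 1, 8, 0)},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pile_lat", "pile_lng", "status", "time"])
writer.writeheader()
for r in rows:
    # 'YYYY-MM-DD HH:MM:SS' is parsed reliably as a time field and survives
    # a round trip better than Excel's localized date formats
    r["time"] = r["time"].strftime("%Y-%m-%d %H:%M:%S")
    writer.writerow(r)
csv_text = buf.getvalue()
```

Dropping this CSV onto kepler.gl should then create a point layer automatically and offer "time" as a filter field for the playback slider.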
2-Define the legend
We select the colour based on the field status (select the palette).
3- Select time field
The field should be recognized as a time type; otherwise, the playback slider will not be available.
4- Add satellite as new map Style
For some reason, it is not there by default, but it is trivial to add.
You can play with the slider, either moving the end date or a window; up to you.
6-Now you have multiple options for sharing
You need to have a Mapbox account; it is virtually free for medium usage!
For this example, I will share it using Dropbox, but you can export to HTML and securely share it either by email or hosted on SharePoint.
This is one of the best pieces of software that I have had the privilege to use: nothing to install, free, simple, and straight to the point!
I have been asked to produce a simple construction report: we need to show the last 4 weeks of actual progress data and 6 weeks of forecast, and to make things a little bit complex, the average installation since the start of the project. Nothing special, three measures: average to date, install per week, and forecast per week.
Obviously, it is trivial to do in Excel using named sets. If you don't know what named sets and cube formulas are, you are missing the most powerful reporting paradigm in Excel; a good introduction is here, and there are plenty of resources here.
Unfortunately named sets are not supported yet in PowerBI; you can vote here.
Just for demonstration purposes, if you try to add those three measures to a matrix visual, PowerBI just repeats them for every time period. Obviously that's not good at all: the actual installation makes sense only in the past and the forecast has to be in the future. There is no option to hide a measure when there is no value in a column, and even if it were possible, we need to show the average installation independently of the time period. Anyway, this is the report when you add the three measures.
And because I had already learned a new trick on how to dynamically add measures to a matrix visual in PowerBI, I was tempted to try and see if it works in this scenario.
So, let's see how it can be done using a disconnected table.
Create a disconnected table with two columns: Order and Status.
Add a calculated column.
As the cut-off date changes at least three times a week, the week numbers change accordingly, so we can't simply hard-code the dates. Instead, let's add a new calculated column which just looks up the week date from a master calendar table based on the order; when the order is -4 it will return "Average to date". I added a dummy 0.5 order just to add an empty space between actual and forecast (cosmetics are important).
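The logic of that calculated column can be sketched in plain Python (the real implementation is a DAX calculated column; the exact order-to-week mapping below is my assumption for illustration):

```python
from datetime import date, timedelta

def column_label(order, cutoff):
    """Illustrative version of the calculated column.

    order -4      -> the special "Average to date" bucket
    order 0.5     -> the dummy blank spacer between actual and forecast
    other orders  -> the Monday of the week `order` weeks from the cut-off,
                     as would be looked up in the master calendar table
    """
    if order == -4:
        return "Average to date"
    if order == 0.5:
        return " "  # dummy spacer column
    week_start = cutoff - timedelta(days=cutoff.weekday())  # Monday of cut-off week
    return (week_start + timedelta(weeks=int(order))).strftime("%Y-%m-%d")
```

Because the labels are derived from the cut-off date at refresh time, moving the cut-off automatically relabels every column with no hard-coded dates.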
Primavera has a very comprehensive security mechanism, based on OBS, EPS, and user access; it is very mature and rigorous. But once you have a direct connection to the database backend, all this security disappears: the connection will give you access to everything in the database.
This blog post is not a comprehensive security introduction to Primavera reporting, but rather a gentle introduction to RLS in PowerBI and how you can leverage it in the simplest form possible, and a reminder why security is a very important consideration, especially when you deal with a portfolio and multiple divisions in the same company.
If you don't want to read the whole blog, I think all I want to say is:
Only the Primavera admin or IT should have access to the Primavera database connection.
There are multiple solutions to implement project-level access, particularly if you are dealing with a multi-division portfolio.
What is row level security?
It is the ability to access a table but read only some of its rows. For example, you want to read the table TASK, which shows the activities for all the projects; obviously you need your users to view only the activities that belong to the projects assigned to them.
The Primavera administrator wants to create a company-wide reporting system based only on the Primavera database. RLS is implemented in PowerBI, everyone sees only the data that belongs to them, happy days. Then a planning manager from division X likes the reports but wants to use other data sources too, not only Primavera but Ecosys, Cobra, a progress measurement system, etc., and he still needs to see only his projects. Planner Y doesn't care about PowerBI and wants the raw data to do his own stuff using some obscure VBA Excel. PowerBI, as of this writing, cannot connect to multiple datasets from other reports, and you can't mix a live connection with imports from other sources.
Whatever you do, you need to have only one connection to the Primavera database. Don't give multiple users access to the production database; that's bad (probably your DBA will not do it anyway). I know it is still read-only, but it is bad practice: if you write an expensive query against a reporting server, it is annoying, but slowing down the Primavera server will get you angry planners from everywhere (we are very famous for being grumpy).
Ideal solution: just spin up a cheap SQL instance on Azure and make sure it is in the same region as your PowerBI (no egress fees), and implement RLS there for the people who want to author reports (viewer access will be done in PowerBI). Any IT department can easily create a small pipeline to copy data from the Primavera database (it doesn't matter whether it is Oracle or SQL Server), and you get your data fresh every morning or on whatever schedule you want. (Bonus point: no PowerBI gateway between PowerBI and SQL Server, as both are in Azure.)
If SQL Server is not an option, PowerBI Dataflows seem like a perfect solution here: you connect once to the Primavera database and you can share the results with other users, so PowerBI will be hitting the Dataflows instead of the production server. Perfect? Yes, but unfortunately Dataflows results are just CSV, with no RLS: either you read the whole TASK table or nothing.
Analyse in Excel
Personally, I think it is the most powerful feature of PowerBI. When you have access to a report in PowerBI and you click Analyse in Excel, you get access to all the data behind the model. Not only that: you can create a table that fetches the data and, bingo, it honours RLS, you see only your stuff. So the workflow will be something like this; for how to use Analyse in Excel, please read this excellent blog by Chris Webb.
Let's see how to use RLS in PowerBI. RLS is a very vast subject and has many different implementations and nuances; in real life you need something like hierarchical security, like the Primavera OBS, and you need to read this excellent series of blogs by Reza. Let's say I want to grant view access to a couple of projects to two users (Viewer@projectscontrols and test@projectscontrols); those users are not necessarily planners, and they don't have access to Primavera.
I just used a project user-defined field and typed the full address; you need the email address, as that is the format PowerBI understands. The good thing about UDFs is that you can put multiple values separated by commas. I pulled the following tables from the P6 database (UDFTYPE, UDFVALUE), and just using PowerQuery to split and unpivot, I got this little table.
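The split-and-unpivot step can be sketched in plain Python (the real work is done in PowerQuery; the column names and sample values below are illustrative, not the actual P6 schema):

```python
# Sample UDF rows: one comma-separated list of users per project (made-up data).
udf_rows = [
    {"proj_id": 101, "udf_text": "viewer@projectscontrols,test@projectscontrols"},
    {"proj_id": 102, "udf_text": "viewer@projectscontrols"},
]

def unpivot_access(rows):
    """Split each comma-separated UDF value into one (project, user) row."""
    return [
        {"proj_id": r["proj_id"], "user": email.strip()}
        for r in rows
        for email in r["udf_text"].split(",")
    ]
```

The result is the narrow access table that RLS filters on: one row per project per user.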
Now I have the project ID and the usernames of the users who can access the projects.
Edit the relationship between the access table and the PROJECT table.
Basically, the access table gets filtered by PowerBI; it then filters the PROJECT table, which in turn filters the TASK table.
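That filter chain (access → PROJECT → TASK) can be sketched in plain Python (PowerBI does this through relationships; the field names and sample rows below are assumptions for illustration):

```python
# Made-up sample rows for the three tables in the chain.
access = [{"proj_id": 101, "user": "viewer@projectscontrols"},
          {"proj_id": 102, "user": "test@projectscontrols"}]
projects = [{"proj_id": 101, "name": "Solar Farm A"},
            {"proj_id": 102, "name": "Pipeline B"}]
tasks = [{"task_id": 1, "proj_id": 101, "task_name": "Piling"},
         {"task_id": 2, "proj_id": 102, "task_name": "Welding"}]

def visible_tasks(username):
    """TASK rows the given user is allowed to see under RLS."""
    allowed = {a["proj_id"] for a in access if a["user"] == username}
    visible_projects = {p["proj_id"] for p in projects if p["proj_id"] in allowed}
    return [t for t in tasks if t["proj_id"] in visible_projects]
```

The RLS role on the access table is just `[user] = USERNAME()`; the relationships propagate that single filter all the way down to TASK.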
Publish to the PowerBI service with the owner account, but not with the user Viewer, as he should only have Viewer and Build access (you need Build to use the dataset in other reports); RLS works only with the Viewer role.
Now let’s see what the user viewer will see
Yes, only 7 projects are visible to that user; let's try Analyse in Excel.
Once I select table access, only the viewer gets selected
Now let's query only the table TASK (yes, DAX is a query language).
Voila, we have the TASK table filtered, with only 7 projects.
Now you can add more tables, load them into another PowerBI dataset, and do your own enhanced reports.
Primavera P6 calculates a lot of rich metrics for a schedule. We all know the basics, start date and finish date, but there are all kinds of other values (cost, labor hours, non-labor, float, duration, etc.).
The challenge is that different people want to see different measures, and once you publish your report, the viewer cannot change the visual. Microsoft is working on this feature, but it will take another couple of months to be released.
3- Add the column Measure_Value to the columns in the matrix and Selected_Value in the values area.
I am using bookmarks to toggle the Select Columns button; I think it is freaking cool. Hopefully in the short term PowerBI will add more features so we can build not only dashboards but real applications.
In the first blog post of this series, we showed how to connect to the Primavera SQL Server database; in this blog we build our first report, using only the three tables TASK, PROJECT, and PROJWBS.
You can download the pbix here; an online report is published here.
I will not show the details of every step; you need a basic understanding of PowerQuery and DAX, but I will highlight some aspects of the Primavera database schema that you should be aware of.
Reproduce Project View
This view shows all the projects grouped by WBS, with the measures start date, finish date, and budget labor units.
And here is the equivalent report in PowerBI.
The table PROJWBS stores the EPS/WBS data for all projects. It is represented in the database as parent/child IDs, which can't be used directly by PowerBI; first we need to flatten the data into multiple levels so we can show it in a matrix visual. So basically we move from this format
to this format
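The flattening step can be sketched in plain Python (the report itself does this in DAX; the field names and sample rows below are illustrative, not the exact P6 columns): walk each node up to the root and emit one level column per ancestor.

```python
# Made-up parent/child rows in the PROJWBS style.
rows = [
    {"wbs_id": 1, "parent_id": None, "name": "Project"},
    {"wbs_id": 2, "parent_id": 1, "name": "Civil"},
    {"wbs_id": 3, "parent_id": 2, "name": "Piling"},
]

def flatten(rows):
    """Turn parent/child rows into level1..levelN columns per node."""
    by_id = {r["wbs_id"]: r for r in rows}
    out = []
    for r in rows:
        path, node = [], r
        while node is not None:                     # climb to the root
            path.append(node["name"])
            node = by_id.get(node["parent_id"])
        levels = list(reversed(path))               # root first
        out.append({"wbs_id": r["wbs_id"],
                    **{f"level{i + 1}": n for i, n in enumerate(levels)}})
    return out
```

Once each node carries its level columns, a matrix visual can group rows exactly like the P6 project view.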
You have multiple options, using either SQL, DAX, or PowerQuery. For PowerQuery, here is an excellent resource by Imke Feldmann; for our example I am using DAX, and the canonical reference is by Marco Russo.
Just make sure when you import a table from SQL Server to apply the filter [delete_session_id] = null, because Primavera doesn't directly delete data but instead uses something called soft delete, i.e. the item is no longer shown in the client but is still in the database and will be deleted later. Anyway, for PROJWBS also remove all the template WBS entries (I think they are used by the EPPM web client).
TASK is straightforward: it holds all the tasks of the project; apply the same filter [delete_session_id] = null.
We use the PROJECT table only to filter out the baselines. In Primavera, the current project and its baselines are saved in the database in exactly the same way (a very powerful paradigm), but for our report we want to show the activities only for the current project; too easy, [orig_proj_id] = null. And the best part: we don't have to write any queries, PowerQuery simply generates the SQL for the database (that's awesome).
Simple Data Model
The two tables are connected by the field wbs_id. We added another copy of the table TASK as a dimension table for reporting (just activity ID and activity name), and we have this simple data model; I like to keep measures in a separate dummy table.
As you can see, building a data model is relatively easy; the complexity starts when you want to add more measures. For example, for total float you need to connect to the CALENDAR table; if you want cost, you need to connect to other tables; and if you want spreads it becomes a little trickier (hint: they are not saved in the database).
Hopefully by now, instead of asking how to connect to the Primavera database, the interesting question becomes: in which table is the data saved, and how do I join two tables to get the report I want?
If you are still reading, I would appreciate it if you could vote on this idea; unfortunately you can't deactivate table sorting in PowerBI. In this particular report the sorting is already defined by the WBS; if the user clicks on the header, the order changes. They can still reset the order using the sort measure, but it is still very annoying.