Custom SQL in Google Data Studio

In the last 12 months, Google Data Studio has added many interesting new features, especially the integration with the BigQuery BI Engine and custom SQL queries.

Obviously, I am a huge PowerBI fan, and I think it is the best thing that has happened to analytics since Excel, but if you want to share a secure report without requiring a license for every user, Data Studio is becoming a valid option.

I have already blogged about building a near real-time dashboard using BigQuery and Data Studio, but in this quick post I will try to showcase how, using SQL, one can build reports with more complex business logic.

I am using a typical dataset from our industry: many fact tables with different granularities. The fact tables don't all update at the same time; planned values change only when there is a program revision, while actuals change every day.

Instead of writing the steps here, please view the report, which includes the how-to and the results.

The approach is pretty simple. All modern BI software works more or less the same way (at least PowerBI and Qlik; Tableau is coming soon): you load data into different tables, you model the data by creating relationships between the tables, and then you create measures. When you click on a filter, for example, or add a dimension to a chart, the software generates a SQL query against the data source based on the relationships defined in the data model. It is really amazing; even without knowing any SQL you can do very complicated analysis.
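For example, here is a purely illustrative sketch (the table and field names are made up) of the kind of query a BI tool might generate when you drop a region dimension and a sales measure onto a chart:

SELECT
  s.region,
  SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
INNER JOIN dim_store AS s
  ON s.store_id = f.store_id   -- the join follows the relationship defined in the model
GROUP BY s.region;

The tool writes this for you; that is the whole point of the modeling layer.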

Data Studio is no different from other tools; its data modeling is called Blending. It links all the tables together using left joins, which is a big limitation: if some values exist in one table and not in others, you will miss data.

The idea is to bypass the modeling layer and write some SQL code, and to make it dynamic, use parameters. It is not an ideal solution for average business users (we don't particularly like code), but it is a workaround until Data Studio improves its offering.
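As a minimal sketch of the idea (the project, table, and field names below are hypothetical, not the ones used in the report), a parameterized custom query against BigQuery could look like this, with @cutoff_date exposed as a report parameter:

SELECT
  COALESCE(p.wbs_id, a.wbs_id) AS wbs_id,
  SUM(p.planned_qty) AS planned_qty,
  SUM(a.actual_qty) AS actual_qty
FROM `my-project.progress.planned` AS p
FULL OUTER JOIN (
  -- actuals change every day, so filter them by the report parameter
  SELECT wbs_id, actual_qty
  FROM `my-project.progress.actual`
  WHERE report_date <= @cutoff_date
) AS a
  ON p.wbs_id = a.wbs_id
GROUP BY 1;

The FULL OUTER JOIN plus COALESCE is precisely what Blending's left join cannot give you: rows that exist in only one of the two fact tables are kept.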


How to Build a Near Real-Time Dashboard Using Data Studio and BigQuery

TL;DR: the report is here. Please note, my experience with BigQuery and the Google stack is rather limited; this is just my own perspective as a business user.

Edit, 20 Sept 2019: Data Studio now uses BI Engine by default when connecting to BigQuery, and the report now contains the historical data too.

I had already built a dashboard that tracks AEMO data using PowerBI, and it is nearly perfect except that the maximum number of refreshes per day is 8, which is quite OK (DirectQuery is not an option, as it is not supported when you publish to web). But for some reason I started wondering: how hard would it be to build a dashboard that always shows the latest data?

Edit, 23 Sept 2019: actually, my go-to solution for near real-time reporting is now Google Data Studio; once you get used to real time, you can't go back.

The requirements are:

  1. Minimal cost, it is just a hobby
  2. Near real time (the data is published every 5 minutes)
  3. Export to CSV
  4. Free to share
  5. Ideally not too technical; I don't want to build something from scratch

I got some advice from a friend who works on this kind of scenario, and it seems the best option is to build a web app with a database like PostgreSQL, a front end along the lines of Apache Superset or RStudio Shiny, hosted on a cheap VM from DigitalOcean, which I may eventually do. But I thought I'd give BigQuery a try: the free tier is very generous, 1 TB of free queries per month is more than enough, and Data Studio is totally free and uses a live connection by default.

Unlike PowerBI, which is a whole self-service BI solution in one package, the Google offering is split into three separate streams: ETL, the data warehouse (BigQuery), and the reporting tool (Data Studio). The pricing is pay per usage.

For the ETL, Dataprep would be the natural choice for me (the service is provided by Trifacta), but to my surprise, apparently you can't import data from a URL. Maybe I was a bit unfair to Trifacta: the data has to be in Google Cloud Storage first, which is fine, but the lack of support for zip is hard to understand; at least in the type of business I work in, everyone uses zip.

I tried to use Data Fusion, but it involves spinning up a new Spark cluster, and the price is around $3,000 per month!

I think I will stick with Python for the moment.

  • The first thing you do after creating a new project in BigQuery is to set up cost control.

The minimum I could get for BigQuery is 0.5 TB per day.

  • The source files are located here: a very simple CSV file, compressed with zip. I care only about three fields:

SETTLEMENT DATE : timestamp

DUID : generator ID (power station, solar or wind farm, etc.)

SCADAVALUE : electricity produced in MW

  • Add a table partitioned by day and clustered by the field DUID (see the sketch after this list)
  • Write a Python script that loads the data into BigQuery; you can have a look at the code used here, and hopefully I will blog about it separately
  • Schedule the script to run every 5 minutes. I am a huge fan of Azure WebJobs; to be honest, I first tried Google Cloud Functions, but by default you can't write anything to the local folder, as the container has to be stateless, and I just find it easy when I can write temporary data to a local folder (I have a limited understanding of Cloud Functions, that was my first impression anyway). Now I am using Cloud Functions and Cloud Scheduler; Cloud Functions provides a /tmp folder you can write to, although it uses some memory resources.
  • I added a dimension table that shows the full description for the generator ID, region, etc. I have the coordinates too, but strangely, the Data Studio map visual does not support tiles!
  • Create a view that joins the two tables, removes any duplicates, and filters out the rows where there is no production (SCADAVALUE = 0); if there is no full description yet for a generator ID, use the ID instead. Both steps are sketched below.
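Here is a hedged sketch of the table and the view (the project and dataset names, and the dimension table duid_dim with its station_name field, are illustrative; only the three SCADA fields come from the source file):

CREATE TABLE `my-project.aemo.scada`
(
  SETTLEMENTDATE TIMESTAMP,
  DUID STRING,
  SCADAVALUE FLOAT64
)
PARTITION BY DATE(SETTLEMENTDATE)  -- one partition per day
CLUSTER BY DUID;

CREATE OR REPLACE VIEW `my-project.aemo.scada_view` AS
SELECT DISTINCT                                 -- removes duplicate rows
  s.SETTLEMENTDATE,
  COALESCE(d.station_name, s.DUID) AS station,  -- fall back to the ID when there is no description yet
  s.SCADAVALUE
FROM `my-project.aemo.scada` AS s
LEFT JOIN `my-project.aemo.duid_dim` AS d
  ON d.DUID = s.DUID
WHERE s.SCADAVALUE <> 0;                        -- filter out rows with no production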

Notice that although it is a view, the filter per partition still works. Also, there is a minimum of 10 MB billed per table regardless of the data scanned, and for billing BigQuery uses the uncompressed size!
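For instance, a query like the following (same illustrative names as above) scans only today's partition, even though it goes through the view:

SELECT station, SETTLEMENTDATE, SCADAVALUE
FROM `my-project.aemo.scada_view`
WHERE DATE(SETTLEMENTDATE) = CURRENT_DATE();  -- prunes to a single date partition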

One very good thing, though: query results are cached for 24 hours, so if you run the same query again, it is free!

  • Create the Data Studio report. I will create two connections:
  • Live connection: pulls only today's data. Every query costs 20 MB, as it hits only one date partition (2 tables); the speed is satisfactory. Make sure to deactivate the cache.

But to confuse everyone, there are two types of caches (see the documentation here). The implication is that sometimes you get different updates depending on whether your selection hits the cache or not. As the editor of the report it is not an issue, since I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works; sometimes when I test it in incognito mode I get the latest data, sometimes not.

  • Import connection: this is called an extract; it loads the data into Data Studio's in-memory database (it uses BI Engine, created by one of the original authors of Multidimensional). Just be careful, as the maximum that can be imported is 100 MB (uncompressed), which is rather small (OK, it is free, so I can't really complain). Once I was very confused about why the data did not match; it turned out Data Studio truncates the import without warning. To make the most of this 100 MB, I extract a summary of the data (sketched below): I removed the time dimension and filtered to only the last 14 days, and I schedule the extract to run every day at 12:30 AM; notice that today's data is not included.
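A hedged sketch of that summary extract (same illustrative names as above; the exact aggregation is my guess at "removing the time dimension"):

SELECT
  DATE(SETTLEMENTDATE) AS day,
  station,
  AVG(SCADAVALUE) AS avg_mw                    -- drop the 5-minute time dimension
FROM `my-project.aemo.scada_view`
WHERE DATE(SETTLEMENTDATE) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
  AND DATE(SETTLEMENTDATE) < CURRENT_DATE()    -- today's data is not included
GROUP BY day, station;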

Note: because both connections use the same data source, cross-filtering works by default; if you use two different sources (let's say, CSV and Google Search), you need some awkward workaround to make it work.

  • Voilà, the live report 😊. A nice feature shown here (sorry for the GIF quality) is the export to Sheets.
  • Schedule email delivery

Although the report is very simple, I must admit I find it very satisfying; there is a small pleasure in watching real-time data. Some missing features I would love to have:

  • An option to deactivate all the caches, or bring back the option to let the viewer manually refresh the report.
  • An option to trigger email delivery based on an alert (for example, when a measure reaches a maximum value), or at least to schedule email delivery multiple times per day.
  • Make the Data Studio web site mobile friendly; it is hard to select a report from the list of available reports.
  • Google Data Studio support for maps is nearly non-existent; that's a showstopper for a lot of business scenarios.

Construction Map time playback using Kepler.gl

Kepler.gl is a geospatial data analysis tool by Uber! In this short blog we show how to build a map with a time attribute and a time playback slider. It is a very straightforward process; all you need is to format the data as a list.

This data was randomized. As an example, I show the main steps to build a solar farm (piling, tracker, module, then electrical works), but obviously you can try it with road or pipeline projects, any project really that has geographic data.

Please note the data is rendered on your local computer, so it is totally safe.

The map is here; keep reading if you want to understand how it was built.

I attached a sample dataset here; the fields are self-explanatory.

Just be careful with Excel: for some reason, when you open the CSV file, it changes the date format; make sure the format is like this

And because a medium-sized solar farm will have around 50,000 piles, if you want to show just 4 main stages, the total number of rows will be 50,000 × 4 = 200,000 rows; PowerBI will not support that (Tableau shines in this scenario).

The documentation is very clear; please read it to experiment with your own data.

  1. Add data

If your dataset has field names like X_lat and X_lng, Kepler.gl will automatically recognise them as a point layer.
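For illustration, a few hypothetical rows in that shape (the prefix before _lat/_lng, here point, becomes the layer name; the coordinates and dates are made up):

id,status,date,point_lat,point_lng
P-00001,Piling,2019-03-04 07:30,-23.5812,139.4301
P-00001,Tracker,2019-03-18 09:10,-23.5812,139.4301
P-00001,Module,2019-04-02 14:45,-23.5812,139.4301
P-00001,Electrical,2019-04-20 11:00,-23.5812,139.4301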

2. Define the legend

We select the colour based on the field status (select the colour palette).

3. Select the time field

The field should be recognised as a time type; otherwise it will not work.

4. Add satellite as a new map style

For some reason, it is not there by default, but it is trivial to add:

mapbox://styles/mapbox/satellite-streets-v9

5. And voilà

You can play with the slider, either moving the end date or a time window; up to you.

6. Now you have multiple options for sharing

You need to have a Mapbox account; it is virtually free up to medium usage!

For this example, I will share it using Dropbox, but you can export to HTML and securely share it either by email or hosted on SharePoint.

This is one of the best pieces of software I have had the privilege to use: nothing to install, free, simple, and straight to the point!

Build an asymmetrical Pivot table in PowerBI

I have been asked to produce a simple construction report: we need to show the last 4 weeks of actual progress data and 6 weeks of forecast, and, to make things a little more complex, the average installation since the start of the project. Nothing special: three measures, average to date, install per week, and forecast per week.

Obviously, it is trivial to do in Excel using named sets; if you don't know what named sets and cube formulas are, you are missing the most powerful reporting paradigm in Excel. A good introduction is here, and there are plenty of resources here.

Unfortunately, named sets are not supported yet in PowerBI; you can vote for the feature here.

Just for demonstration purposes: if you try to add those three measures to a matrix visual, PowerBI simply repeats them for every time period. Obviously that's not good at all; the actual installation makes sense only in the past and the forecast has to be in the future. There is no option to hide a measure when there is no value in a column, and even if it were possible, we need to show the average installation independently of the time period. Anyway, this is the report when you add the three measures:

And because I had already learned a new trick for dynamically adding measures to a matrix visual in PowerBI, I was tempted to see if it works in this scenario.

So, let's see how it can be done using a disconnected table.

  1. Create a disconnected table with two columns, Order and Status
  • Add a calculated column

As the cut-off date changes at least three times a week, the week numbers change accordingly, so we can't simply hard-code the dates. Instead, let's add a new calculated column which looks up the week-end date from a master calendar table based on the order; when the order is -4 it returns "Average to Date". I added a dummy 0.5 order just to add an empty space between actual and forecast (cosmetics are important).

Week_Num =
SWITCH (
    [order],
    -4, "Average to Date",
    0.5, BLANK (),
    "WE "
        & FORMAT (
            LOOKUPVALUE ( MstDates[dynamic Week End], MstDates[week_number], [order] ),
            "dd/mm/yy"
        )
)

  • Add a new measure that shows a specific measure (average, install, or forecast) based on the value of the column

dynamic_Pivot =
SWITCH (
    SELECTEDVALUE ( pivot[order], BLANK () ),
    -4, [Install_qty_average_week],
    -3, CALCULATE ( [Installed_qty], MstDates[week_number] = -3 ),
    -2, CALCULATE ( [Installed_qty], MstDates[week_number] = -2 ),
    -1, CALCULATE ( [Installed_qty], MstDates[week_number] = -1 ),
    0, CALCULATE ( [Installed_qty], MstDates[week_number] = 0 ),
    1, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 1 ),
    2, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 2 ),
    3, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 3 ),
    4, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 4 ),
    5, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 5 ),
    6, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 6 ),
    7, CALCULATE ( [Forecast_Qty], MstDates[week_number] = 7 )
)

And voilà, an asymmetrical matrix visual in all its glory 😊

Edit, 3 Sept 2019: Maxim, in the comments, made an excellent suggestion to use a variable to make the measure more manageable.

dynamic_Pivot =
VAR _order =
    SELECTEDVALUE ( pivot[order], BLANK () )
RETURN
    IF (
        _order = -4,
        [Install_qty_average_week],
        IF (
            _order <= 0,
            CALCULATE ( [Installed_qty], MstDates[week_number] = _order ),
            IF ( _order > 0, CALCULATE ( [Forecast_Qty], MstDates[week_number] = _order ) )
        )
    )

Connecting PowerBI to Primavera: Implementing Row-Level Security

Primavera has a very comprehensive security mechanism based on OBS, EPS, and user access; it is very mature and rigorous. But once you have a direct connection to the database backend, all this security disappears: the connection gives you access to everything in the database.

This blog post is not a comprehensive introduction to security in Primavera reporting, but rather a gentle introduction to RLS in PowerBI, how you can leverage it in the simplest form possible, and a reminder of why security is a very important consideration, especially when you deal with a portfolio and multiple divisions in the same company.

If you don't want to read the whole blog, all I want to say is:

  • Connecting to the Primavera database is not hard
  • Only the Primavera admin or IT should have access to the Primavera database connection
  • There are multiple solutions to implement project-level access, particularly if you are dealing with a multi-division portfolio

What is row level security?

It is the ability to access a table but read only some of its rows. For example, you want to read the table TASK, which shows all the activities for all the projects; obviously you need each user to view only the activities that belong to the projects assigned to them.

The Problem

The Primavera administrator wants to create a company-wide reporting system based only on the Primavera database. RLS is implemented in PowerBI, everyone sees only the data that belongs to them, happy days. Then a planning manager from division X likes the reports but wants to use other data sources too, not only Primavera but Ecosys, Cobra, a progress measurement system, etc., and he still needs to see only his projects. Planner Y doesn't care about PowerBI and wants the raw data to do his own stuff using some obscure VBA Excel. PowerBI, as of this writing, cannot connect to multiple datasets from other reports, and you can't mix a live connection with imports from other sources.

Solutions

Whatever you do, you need to have only one connection to the Primavera database. Don't give multiple users access to the production database; that's bad (your DBA probably will not allow it anyway). I know it is still read-only, but it is bad practice: if you write an expensive query against a reporting server, it is annoying, but slowing down the Primavera server will get you angry planners from everywhere (we are famous for being grumpy).

  1. Reporting Database

The ideal solution: just spin up a cheap SQL instance on Azure, make sure it is in the same region as your PowerBI tenant (no egress fees), and implement RLS there for the people who want to author reports (viewer access will be handled in PowerBI). Any IT team can easily create a small pipeline to copy data from the Primavera database (it doesn't matter whether it is Oracle or SQL Server), and you get your data fresh every morning or on whatever schedule you want. (Bonus point: no PowerBI gateway between PowerBI and SQL Server, as both are in Azure.)
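As a sketch of the database-side RLS (the mapping table dbo.project_access and its columns are hypothetical; CREATE SECURITY POLICY requires SQL Server 2016+ or Azure SQL):

-- inline function: does the current login have access to this project?
CREATE FUNCTION dbo.fn_project_access (@proj_id INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    FROM dbo.project_access AS a          -- hypothetical mapping table
    WHERE a.proj_id = @proj_id
      AND a.user_name = SUSER_SNAME();    -- current login
GO

-- filter every read of TASK through the function above
CREATE SECURITY POLICY dbo.ProjectFilter
ADD FILTER PREDICATE dbo.fn_project_access(proj_id) ON dbo.TASK
WITH (STATE = ON);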

  2. PowerBI Dataflows

If SQL Server is not an option, PowerBI Dataflows seems like a perfect solution here: you connect once to the Primavera database and you can share the results with other users, so PowerBI hits the dataflow instead of the production server. Perfect? Unfortunately, dataflow results are just CSV files: no RLS, so either you read the whole TASK table or nothing.

  3. Analyse in Excel

Personally, I think this is the most powerful feature of PowerBI. When you have access to a report in PowerBI and you click Analyse in Excel, you get access to all the data behind the model; not only that, you can create a table that fetches the data, and bingo, it honours RLS, so you see only your stuff. The workflow will be something like this; for how to use Analyse in Excel, please read this excellent blog by Chris Webb.

Let's see how to use RLS in PowerBI. RLS is a vast subject with many different implementations and nuances; in real life you need something like hierarchical security, like the Primavera OBS, and you should read this excellent series of blogs by Reza. Let's say I want to grant view access to a couple of projects to two users (Viewer@projectscontrols and test@projectscontrols); those users are not necessarily planners and don't have access to Primavera.

I just used a project user-defined field and typed the full address; you need the email address, as it is the format that PowerBI understands. The good thing about UDFs is that you can put multiple values separated by a comma. I pulled the following tables from the P6 database (UDFTYPE, UDFVALUE), just used PowerQuery to split and unpivot, and got this little table.
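If you would rather do the split in SQL instead of PowerQuery, something like this works on SQL Server 2016+ (treat it as a sketch; UDFVALUE column names can vary between P6 versions):

SELECT v.proj_id,
       LTRIM(s.value) AS user_email
FROM UDFVALUE AS v
CROSS APPLY STRING_SPLIT(v.udf_text, ',') AS s   -- one row per comma-separated address
WHERE v.udf_type_id = 123;                       -- the id of the access UDF in UDFTYPE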

Now I have the Project ID and the username of the users who can access the projects.

Edit the relationship between the access table and the PROJECT table

Manage role

Basically, the access table gets filtered by PowerBI, then it filters the PROJECT table, which in turn filters the TASK table.

Publish to the PowerBI service with the owner account; the user Viewer should only have View and Build access (you need Build to use the dataset in other reports). RLS works only with the Viewer role.

Now let's see what the user Viewer will see.

Yes, only 7 projects are visible to that user. Let's try Analyse in Excel.

Once I select the access table, only the viewer's row gets selected.

Now let's query only the table TASK (yes, DAX is a query language too).

Voilà, we have the TASK table filtered, with only 7 projects.

Now you can add more tables, load them into another PowerBI dataset, and build your own enhanced reports.

Reproducing Primavera Select Columns in PowerBI; aka Dynamically adding measures to a Matrix

TL;DR: the online report is here.

Primavera P6 calculates a lot of rich metrics for a schedule. We all know the basics, start date and finish date, but there are all kinds of other values (cost, labor hours, nonlabor, float, duration, etc.).

The challenge is that different people want to see different measures, and once you publish your report, the viewer cannot change the visual. Microsoft is working on this feature, but it will take another couple of months to be released.

The trick was suggested by Kasper de Jonge, and in a random chat with @DingbatData and @_Ivan_Bond, it turned out they already use it to solve similar issues. Actually, it is very easy.

Anyway, the purpose is to reproduce something like this from Primavera:

  1. Create a new table that contains all the measures

Instead of copying and pasting all the measures, I just used DAX Studio connected to my PowerBI Desktop to generate a list of all the measures; read this link.
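The DMV route, run from DAX Studio against the Desktop model, is something like this (MDSCHEMA_MEASURES is a standard Analysis Services DMV):

SELECT MEASURE_NAME, MEASUREGROUP_NAME
FROM $SYSTEM.MDSCHEMA_MEASURES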

The category is just to make it easy to choose which measures to select; the index is to keep the same sort order in the slicer, as I don't want to show Actual Finish before Actual Start.

2. Create a master measure that checks whether the value is selected.

Using SWITCH and SELECTEDVALUE gives the result; I only had to add a condition to format date results so they show as dates, not numbers. You can vote on this idea.

Here is a snapshot of the first 4 measures (currently I have 29 in the model):

Selected_Measure =
SWITCH (
    SELECTEDVALUE ( dummy_meaures[MEASURE_Values], BLANK () ),
    "Task_Count", [Task_Count],
    "Budget Labor Units", [Budget Labor Units],
    "Actual Finish", SWITCH (
        [Actual Finish],
        BLANK (), BLANK (),
        FORMAT ( [Actual Finish], "mm-yy-yy" )
    ),
    "%_Labor_units", SWITCH (
        [%_Labor_units],
        BLANK (), BLANK (),
        FORMAT ( [%_Labor_units], "0.00%" )
    )
)

3. Add the column MEASURE_Values to the Columns area of the matrix, and Selected_Measure to the Values area.

4. Voilà

5. Bonus

I am using bookmarks to toggle the Select Columns button; I think it is freaking cool. Hopefully PowerBI will soon add more features so we can build not only dashboards but real application interfaces.

Connecting PowerBI to Primavera Database – Part 2 (WBS report)

In the first blog post of this series, we showed how to connect to the Primavera SQL Server database; in this post we build our first report, using only the three tables TASK, PROJECT, and PROJWBS.

You can download the pbix here; an online report is published here.

I will not show the details of every step (you need a basic understanding of PowerQuery and DAX), but I will highlight some aspects of the Primavera database schema that you should be aware of.

Reproduce Project View

This view shows all the projects grouped by WBS, with the measures start date, finish date, and budget labor units.

And here is the equivalent report in PowerBI:

PROJWBS

The table PROJWBS stores the EPS/WBS data for all projects. It is represented in the database as parent/child IDs, which can't be used directly by PowerBI; first we need to flatten the data into multiple levels so we can show it in a matrix visual. So basically we move from this format

to this format

You have multiple options: SQL, DAX, or PowerQuery. For PowerQuery, here is an excellent resource by Imke Feldmann; for our example I am using DAX, and the canonical reference is by Marco Russo.
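If you take the SQL route instead, a recursive CTE is one way to flatten the hierarchy (a sketch against the P6 schema; check the column names in your version):

WITH wbs AS (
    -- anchor: the root EPS/WBS nodes
    SELECT wbs_id, parent_wbs_id, wbs_name,
           CAST(wbs_name AS VARCHAR(MAX)) AS wbs_path,
           1 AS wbs_level
    FROM PROJWBS
    WHERE parent_wbs_id IS NULL
      AND delete_session_id IS NULL
    UNION ALL
    -- recurse: append each child to its parent's path
    SELECT c.wbs_id, c.parent_wbs_id, c.wbs_name,
           p.wbs_path + ' | ' + c.wbs_name,
           p.wbs_level + 1
    FROM PROJWBS AS c
    JOIN wbs AS p
      ON c.parent_wbs_id = p.wbs_id
    WHERE c.delete_session_id IS NULL
)
SELECT * FROM wbs;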

Just make sure, when you import a table from SQL Server, to apply the filter [delete_session_id] = null, because Primavera doesn't directly delete data but instead uses something called soft delete, i.e. the item is no longer shown in the client but is still in the database and will be deleted later. Also, for PROJWBS, remove all the template WBS rows (I think they are used by the EPPM web client).

TASK

TASK is straightforward; it stores all the activities of the projects. Apply the same filter, [delete_session_id] = null.

PROJECT

We use the PROJECT table only to filter out the baselines. In Primavera, the current project and its baselines are saved in the database in exactly the same way (a very powerful paradigm), but for our report we want to show the activities only for the current project; too easy, [orig_proj_id] = null. And the best part: we don't have to write any queries, as PowerQuery simply generates the SQL for the database (that's awesome).
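For reference, here is the equivalent filter if you were writing the SQL yourself (PowerQuery folds its steps into something similar):

SELECT proj_id, proj_short_name
FROM PROJECT
WHERE orig_proj_id IS NULL         -- keep current projects, drop baselines
  AND delete_session_id IS NULL;   -- drop soft-deleted rows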

Simple Data Model

The two tables are connected by the field wbs_id. We add another copy of the table TASK as a dimension table for reporting (just activity ID and activity name), and we have this simple data model; I like to keep measures in a separate dummy table.

As you can see, building a data model is relatively easy; the complexity starts when you want to add more measures. For example, for total float you need to connect to the CALENDAR table; if you want cost, you need to connect to other tables; and if you want spreads, it becomes a little trickier (hint: they are not saved in the database).

Hopefully by now, instead of asking how to connect to the Primavera database, the interesting questions become: in which table is the data saved, and how do I join separate tables to get the report I want?

If you are still reading, I would appreciate it if you could vote on this idea; unfortunately, you can't deactivate table sorting in PowerBI. In this particular report the sorting is already defined by the WBS; if the user clicks on a header, the order will change. They can still reset the order using the sort measure, but it is still very annoying.