How to Keep Your Primavera P6 Clean?

This article addresses to all the schedulers and project professionals who import schedules into scrubbing P6 databases, remove undesired data, export the cleaned XER, and then import to a production database or share with third-parties such as contractors or sub-contractors.

If it happens to you to go through such a process, then you might want to read this article and see the better way to “clean” a XER file, prevent external data from corrupting your database thus maintain security and keep schedule integrity.

You can achieve this with a simple tool called ScheduleCleaner.

Now, I want to explain how the tool works and how you can benefit from it.

How to get started with ScheduleCleaner?

ScheduleCleaner is a desktop application for Windows operating system. It’s not connected to a database, and does not require internet connection to use it.

The “cleaning” process of an XER file can be achieved in 5 steps as explained below.

  1. Launch the software;
  2. Add an XER File;
  3. Select the output folder;
  4. Click on the categories of data you want to remove;
  5. Click “Clean” button.

As you can see, there is no manual work, no editing of a XER file in Notepad, and no scrubbing databases.

Steps to "clean" XER file with ScheduleCleaner

The software is intuitive, easy-to-use, and works offline as a standalone desktop application.

What’s more important, the software does not modify the original project plan. Instead it creates a copy and modifications are saved in the new file. The original project plan remain untouched.

Now, let’s see what you can accomplish with this tool in more specifics.

Removing POBS

If it takes a lot of time to import XER file intro Primavera P6 database, POBS data might be the reason for that.

Overall, the POBS defect affect the performance of the application and users lose valuable during the import operation. According Oracle, the POBS data is not used yet:

“We do not utilize the POBS table yet we export/import the data from this table when completing XER Export/Import. The XER export/import should be written to exclude this data with XER export/import operations of P6 Professional.”

The removal of POBS data can be done manually, but the process is prone to errors and can be time consuming.

The impact of all these errors when managing global data in an enterprise, will ultimately result in a polluted database and unconscious mistakes on a project level.

So using a tool for removing POBS data is desirable.

You can see a significant difference of the file size before and after cleaning POBS which greatly affects the time needed to import XER file into a Primavera P6 database.

Imagine the time that can be saved for larger XER files.

Remove Units, Rates, Cost, Pricing, Progress

As the purpose of exporting data files in XER format is to transmit project data to another database, in many cases data should be kept private. For example, a general contractor wants to send the project to a sub-contractors, but without the cost of resources.

Another examples is related with the GDPR regulation. Namely project schedulers and managers share files that contain sensitive information such as resource names that can disrupt the guidelines of the GDPR.

To be GDPR compliant, companies need to hide/anonymize confidential information, and ScheduleCleaner is the perfect tool to easily and securely protect sensitive information.

Just by clicking checkboxes, users who want to share the XER schedule can pick certain categories of data that want to be removed from the schedule before sending to third-parties or upload to a Primavera P6 database.

Before “cleaning” prices
After “cleaning” prices

Mask Project Data

Similar as removing certain categories of data, you can also mask project data.

The only difference is that with masking, you can add custom codes, labels or text for the specific categories.

Add Prefix/Suffix

Inserting prefix or suffix to different categories in the project plan, can give additional information to the person who reads the information and acts according them.

To add Prefix/Suffix, you need to select the template that will contain Prefix/Suffix, select the appropriate category, and add the terms that will be words’ prefix or suffix.

Then, you go to “Clean” ribbon and click on the “Batch” button. The end result when adding prefix/suffix are given in the image below.

Converting Data

The software features an option to convert Global and EPS activity codes to Project Activity codes and EPS to Global Activity Codes. The activity codes are important to schedulers and planning engineers when creating different types of work performance reports.

So here are the type of categories that can be converted with ScheduleCleaner:

  • Convert Global/EPS to Project Activity Codes.
  • Convert EPS to Global Activity Codes

Moreover, you can convert Global calendars that are used in the project plan into project and shared resource calendar. In this way, you will avoid errors when importing the XER file into P6 database.

Save time with process automation

Who doesn’t want automation? Automation saves time and gives a sense of comfort and security.

Here, it’s not actually a full automation because you still need to click on a button in order to perform an action or combination of actions. But this is quite useful when you have a set of actions that need to done on a daily basis such as sending a daily progress report to top management or uploading recent progress into a database.

Automation is ScheduleCleaner is viable through creating Templates, save them and apply to imported XER files.

Batch Clean

“Batch Clean” is a feature that works with templates. User must create at least one template and assign it to a file in order to use the batch file cleaning.

“Quick Clean” on the other side is more suitable when user wants to modify very small number of project files, while “Batch Clean” is useful when large number of data files, usually located in different folders, need to be modified.

Final Words

ScheduleCleaner enables you to quickly remove or anonymize confidential data in XER data files exported from Primavera P6, while keeping the schedule integrity.

It replaces the many work when “cleaning” XER file prior to sharing the file or import to a production database.

As the manual process of removing or anonymizing project data is time-consuming and unreliable, performing Batch Clean in combination with Templates can speed up the process.

Organizations can significantly improve their productivity, communication and security by integrating ScheduleCleaner in their working environment.

How to Export data from PowerQuery to BigQuery

Today was playing with a report in PowerBI and I got this idea of exporting data to BigQuery from PowerQuery, let me tell you something, it is very easy and it works rather well, PowerQuery is an amazing technology ( and it is free).

in PowerBI,you can export from R or Python visuals but there are a limitation of 150K rows, but if you use PowerQuery, there is no limitation ( I tried with a table of 23 Millions records and it works)

here is the code using Python, but you can use R

import pandas as pd
import os
from google.cloud import bigquery
dataset['SETTLEMENTDATE']=pd.to_datetime(dataset['SETTLEMENTDATE'])
dataset['INITIALMW']=pd.to_numeric(dataset['INITIALMW'])
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "C:/BigQuery/test-990c2f64d86d.json"
client = bigquery.Client()
dataset_ref = client.dataset('work')
table_ref = dataset_ref.table('test')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.schema = [
bigquery.SchemaField("SETTLEMENTDATE", "TIMESTAMP"),
bigquery.SchemaField("DUID", "STRING"),
bigquery.SchemaField("INITIALMW", "FLOAT"),
bigquery.SchemaField("UNIT", "STRING")]
job = client.load_table_from_dataframe(dataset, table_ref, job_config=job_config)
job.result() # Waits for table load to complete.

interesting after the step in Python we get a table, simply expand it

here is the total rows of the table in PowerBI

the results in BigQuery

ok, PowerQuery flow can execute many times, it is a black magic knowledge that’s only a handful of people knows, but in this cases, it does not matter, the BigQuery job truncate the tables every time, so there is no risk of data duplication.

probably you may ask why do that if there are a lot of data preparation tools that natively support BigQuery, based on my own experience, most of my data sources are Excel files and PowerQuery is just very powerful and versatile specially if you deal with “dirty” format

the second question is probably what’s the added value ? just load the data directly into PowerBI, the answer is very easy, data ubiquity

I want everyone to be able to access the data, regardless of the front end tools.

Construction Progress Report – PowerBI – by Darrin Kinney

A quick and easy construction progress and schedule dashboard.

I have previously outlined an approach that can be used for Engineering Progress.

This post is an extension to that which instead of looking at engineering model development, instead looks at construction development. I don’t want to delve too much into the details about exactly how this was built (again see the post above).

Some big differences is that I have used a resource assignment view. in addition to the date metrics This allows for resources histogram and progress curves to be quickly sorted down to an activity level. This approach also follows a prior post Resource Analysis Dashboard .

Construction02

The data

Construction01

The underlying data is very similar to our engineering progress example. We can use a flat file export direct from P6 with a standard set of columns. As I have mentioned before, you can achieve this in a SQL query as part of a larger data model, although with everything, a delicate balance is needed (balancing database formalism and easy excel solution)

We will also have the resource assignment data

Construction06data.JPG

The WBS Slicer and Area Selection

Construction03_wbs

This design element doesn’t work for project with too many WBS elements. For this example, each major area only has about 10 WBS elements, therefore I could pull this off with no drama. I really prefer this selection as opposed to drop downs where it is often difficult to quickly make  selection.

The Pie and Metrics

Construction04pies

Here we follow much of the look and feel I used with the engineering progress; however instead of just using activity count metrics, I have also inserted hour and percent complete metrics. There is nothing fancy about these.

The Data Table

Construction05table.JPG

I’ll sound like a broken record again, when you have a good design with one aspect of a project, you can likely take that and run with it for many other areas. In a following post I will detail this systems engineering aspect to nearly everything we touch.

Obviously the key inclusion into the table is the budget units and %’s. I still prefer these tables views vs the GANTT views. Having clear visibility into the last month dates, the prior month dates,  and variances is the purpose of this view.

The Future

Again, the extension of this are endless. At this stage, we are starting to see how pre filtered views provide more focused dashboard as compared to a one size fits all. Sitting in an EPCM world, most of the detailed activities and schedules are managed by our contractors. Thus, this construction view is more suited to using an export from a contractor Level 4 schedule.

At some point, we will need to begin to discuss an overarching design where a user can navigate to our various dashboard in a logic way.

Happy data wrangling!

Custom SQL in Google Datastudio

in the last 12 months, Google Datastudio has added many new interesting new features, specially the integration with BigQuery BI engine, and custom SQL Queries.

Obviousely, I am a huge PowerBI fan, and I think it is the best thing that happen to analytics since Excel, but if you want to share a secure report without requiring a license for every user, datastudio is becoming a valid option.

I have already blogged about building a near real time dashboard using Bigquery and Datastudio , but in this quick blog, I will try to show case that using SQL one can create a more complex business logic reports.

I am using a typical dataset we have in our industry, a lot of facts tables, with different granularity, the facts tables don’t all update at the same time, planned values changes only when there is a program revision, actual changes every day.

Instead of writing the steps here, please view the report that include the how to and the results.

The approach is pretty simple, all modern BI software works more or less the same way( at least PowerBI & Qlik, Tableau is coming soon), you load data to different tables then you model the data by creating relationships between the tables, then you create measures, when you click on a filter for example, or when you add dimension to a chart, the software generate a SQL query to the data source based on the existing relationship defined in the data model, it is really amazing , even without knowing any SQL coding you can do very complicated analysis.

DataStudio is no different to other tools, the Data Modeling is called Blending, it link all the tables together using left join, which is a big limitation as if some values exist in one table and not in others, you will miss data.

The idea is let’s bypass the modeling layer and write some SQL code, and to make it dynamic let’s use parameters, it is not an ideal solution for an average Business users ( we don’t particularly like code) but it is a workaround, till DataStudio improve it’s offering.

Engineering Progress Report – PowerBI – by Darrin Kinney

In this article, I will run through all the steps required to produce an elegant Engineering Progress Report.

Eng12

The intent is not to delve into the manner in which the progress or schedule are updated. I have assumed you have a schedule and progress status for each key area. It is quite amazing how easy is to generate this dashboard, and also the extensions available to use this not just for engineering, but for fabrication, material deliveries, major milestones, contractor key activities, etc.

I will outline the format for our 2 key datasets and then follow with the creation of 2 dashboards: An Overall Status Gauge, and the full detail EPR Dashboard

P6 Schedule Data

Below is our data set we want to use. This data set has been specifically tailored to our resulting visual. Thus, instead of linking directly to an XER, importing into a data model, and performing perhaps too much data work, a nice trick is to instead define specific VIEWS inside P6, so that you can easily copy-paste directly into Excel, then import directly into your dashboard. Thus, the below can be quickly generated each schedule update cycle.

Eng_9

A very nice aspect of this data set is the field “TYPE”. It is good practice to tag activities of a specific type (this ties into my belief about using a framework approach to approach controls). Thus, in theory, you can export the entire schedule, and drive many different dashboards by just filtering on different TYPE fields. In this example I have used

  • M090 = 90% Model Review
  • M100 = 100% AFC

Although, consider tagging every concrete pour Activity in your schedule with C010. You can then use that code to drive a similar dashboard for concrete pours: Or you use F100 for Module Fabrication, where we tag the completion activity for each module for use in dashboard. Ultimately you create a catalog of TYPE codes and can go dashboard crazy with how easy this turns out to me.

This data does not have all the fields we will need in our dashboard. Specifically we will want to create a several measures that will allow for a few metrics. We will need to know if an activity is “FINISHED”, “NOT FINISHED”, and “Critically LATE”. Because these fields are dependent on your target audience, its best to leave the generation of these to code (because everyone can code right!). If you wanted to display metrics on “Started”, then your source data would need to include the start date and perhaps the activity status field from P6. Again, its important to understand the relationship between your visual and your data. In this example, I am treating these activities as effectively milestones in which case the concept of “started” doesn’t apply. key conceptual discussions such as this are vital.

Progress Data

The progress data in this example is only overall progress. The intents is to just show an overview for the entire project and quick metrics for model reviews. Ultimately, you would want a “WBS Specific” dashboard that would display more information over the entire lifecycle of that WBS. In that view, you could present the engineering curve and perhaps EVMS metrics.

Strategy – Do not do everything in one place- keep focus

Too often, I see users pushing design features into dashboards, for what appears to just be whimsical value. Dashboards are not meant to answer 100 questions. Its easier to have 100 dashboards each displaying a key metric, as opposed to 1 dashboard displaying 100 metrics. Keep your approach CLEAN and FOCUSED.

Ideally, our progress data will include fields such as Area, WBS. In this example I have pulled data with just 1 data date and only 1 dataseries (Engineering_Overall). Your backend progress data will likely have data from multiple cut off dates and for multiple series.

Our progress data will look like this. The full data set will also contain a series for “Construction_Overall” too. This will be used on our summary page to outline the power in using this approach to progress data.

Eng_02

Linking our Data into PowerBI

In this example, both data files are simply Excel based files with the data converted to table. This allows for the easiest importing (and also allows for quick refresh of data). Housing the data in the excel files can also facilitate a movement to a more digital way of thinking (more on that in another article)

VISUALIZATION 1 – SUMMARY GAUGE

I am a firm believer in Overall Project Flash reports. So, when we think about dashboards we should have a starting point our overall project status. Thus, the elements presented here are only a key subset of metrics and visuals I would expect on a Project Status Report dashboard.

In this example, looking at engineering progress, we want to see what Percent % Complete we are and how that compares against our Planned % Complete.

A Gauge is a good way to provide a quick visual (Bullet charts are other, and really, the skies the limit)

Eng_Gauge

To generate this we need to create 2 measures: Actual % and Planned %. This is where you really need to understand how dashboards work and how databases work. If you feed a computer a data source, it is no innate way of know something as simple as “What is the current %”. Therefore, we need to write some code.

Because of the format of our progress data, we can search for the maximum data date, then find the value of our actual % field on that date. We can follow an identical approach for the Planned %. Depending on your data, you would need to custom build these measures.

Code to generate our measure for Current %

M_Progress_Actual = CALCULATE (
SUM ( data1[Actual] ),
FILTER (
data1,
data1[DataDate] = MAX ( data1[DataDate] ) && data1[Date]=MAX(data1[DataDate]
)
))

Code to generate our measure for Planned % (similarly we could also pull in our Plan late)

M_Progress_Plan = CALCULATE (
SUM ( data1[BL_Early] ),
FILTER (
data1,
data1[DataDate] = MAX ( data1[DataDate] ) && data1[Date]=MAX(data1[DataDate]
)
))

The required fields for the gauge are obviously these 2 measures.

  • Value = M_Progress_Actual
  • Target Value = M_Progress_Plan

We will also need to provide a filter where Series=”Engineering_Overall” (note that this gauge can now be easily reproduced to showcase planned vs actual for all Series inside our data source. Obviously in the image above you can see I created 2 gauges each with a filter for the specific data series. Ultimately if your back end data has multiple data series for progress sliced and diced different ways, all you have to do is adjust your filter and you can display an endless series of graphs. Of, you can fancy with smart slicers too.

VISUALIZATION 2 – Engineering Progress Report

This is perhaps the most easy to read, interactive and intuitive view into engineering I have ever seen. We can immediately filter into what areas are complete, what areas are critical, scroll to see upcoming deliverables and see an overall graph.

Eng12

It might seem we have a lot going on here, but again, this is all driven off 2 quite simple data sources, and for this page, mostly everything here is from 1 schedule driven table.

The Data Table

The Table is just pulling from our Schedule data (although I have inserted a page level filter to only include activities with the TYPE = M100 and M090). Our fields are as

Eng_4

In the above image, you can see I have had to insert a few measures. I don’t want to go into them all. I’ve inserted some conditional formatting into the Actual/Forecast date column. To achieve this, I created a measure Activity_Status_Num

Activity_Status_Num = IF(ISBLANK(Schedule[Float]),1,IF(Schedule[Float]<1,2, 0))

Then, with these values I can select a formatting specific just for that column in the table. This is very nice feature of the tables in PowerBI that can add nice level of polish.

The Donut Charts

Eng_6

A nice feature of the donut chart is the count metric in the middle. It is generated from a nice little bit of code as seen below. We have 2 Donut Charts. One for our 90% activity and another for the 100%. Thus, all we need to do is place a visual level filter on each.

IsFinished = IF(ISBLANK(Schedule[Float]),1,0)
DonutCounts = SUM(Schedule[IsFinished]) &”/”& COUNT(Schedule[ActivityDesc])
In the above, there are 2 measures: “IsFinished” and “DonutCount”. Again if you want anything to display on a dashboard in a digital world, you are going to have to see this type of code

The real power of the Donut Chart is to allow for very quick sorting – after all we want to see the critical late activities right! Just click the red 90% or 100% section.

Eng13

 

Progress Graph

Eng_7

We have a progress graph too. This is effectively a dumb page level graph. It is not linked to a specific progress series for each WBS. So it will not auto update, and our data model does not link these tables. Although, the graph should add context to the overall page. Deviations from the plan curves, should be viewed by a growing number of critically late packages.

Care needs to be made whenever we look at schedule dates and progress graphs. We do not typically create progress graphs at an activity level (although, you can certainty consider it – I would offer caution against going down that route).

EXTENSIONS

This example has show the power simple data sets can have to improve visibility into our projects. This only showcased a few engineering based activities. However, if you read between the lines, you will understand there is nothing “engineering specific” about what I have done. This approach is completely universal. Given this example also included a progress data set for Construction, obviously, the easiest extension will be to link in a few construction activities in the same way.

Planning and Project Management – What is Missing

Can you imagine using Twitter in the world of Construction – I can! Read on..

Recently, it has hit me, just how poor our project management world is. The ability to clearly communicate and understand what is occurring on a project appears completely lost. Planners have absolutely amazing schedules (I do believe this), cost engineers have incredible detail about the cost build ups sliced and diced every which way, our document management systems are full with every manner of communication – BUT we are still left sitting in meetings where everyone is confused with various key information either not available, or buried inside someone own excel file, bounced off 100 emails threads, or a million other permutations.

Issue – Visibility into our Projects Sucks

Why has this happened? In my view, our leadership teams have failed at pushing good practices of project management. Leadership teams have transitioned from “manage the work” to “manage the people” (thank you Edin M for this quote).

Solution – Get Back to Basics

OpenProj_01

Project Management is task management (you can argue, but even all the new people management issues – are still tasks and can be managed in the same way).

We are building something – lets focus on the activities required to build (and engineer, and contract and procure). All the activities that exist in our Primavera schedules! We need these activities front in center and OWNED by project management. Honestly, how can we even have a management discussion and NOT have the schedule front and center. The schedule says what we should be working on, the schedule says when we should finish something, the schedule says what comes next!

I believe that schedules have been forgotten in project management because they are too unwieldy, to abstract, can only be run by P6 jockeys and not the project at large. We need to get our schedules into the hands of those that actually manage the scope.

In the past, our leadership were more in-tuned to schedules and this synergy was easier. However i fear in today’s world, our leadership have lost the tools and dealing with our schedules (no thanks to our reliance on antiquated tools like P6 that perpetuate the need for designated planning teams to operate the software in pure isolation to the real PM teams).

Issue – We need better Project Management tools.

Answer – They exist everywhere!

Commercially built, off the shelf project management software has risen to be one of the dominant fields of software development. The construction world needs to embrace the tools and get back to basics. It is odd in that 20 years ago, the leaders in project management was the construction world. However, when the technology world sprung up, they didn’t have the knowledge we did, so they built their own tools and approaches. I now believe the tide has turned to the point construction project management now severely trails the rest of the business world.

So How Does This Work?

First, the corporate strategy teams need to decide on a platform (hint – USE JIRA).

JIRA Quick Demo

Here is an example of a JIRA typical managemetn page using JIRA.

JIRA_01

This is a short quick example of some substation work. I have populated JIRA with a few activities that might mirror what you currently manage inside your existing schedule. The difference here is that these activities are not updated by a planner, they are activity managed.

This process to push the ownership of schedule tasks to those that actually manage and deliver the scope is where immense value can be gained. Additionally, each activity allows for commentary and discussion. The ability to insert comments and discuss an activity or its relationships with other activities also brings teams together an allows for a focus on touch points to be activity managed.

OpenProject.org Example

However, really the choice of software isn’t that critical, its the work processes you are going to change – the new online PM tools are structurally all the same. What we are pushing is simply “clarity in what we are doing”. We are pushing the management oversight of what we do, into the hands of those that actually manage the scope. Don’t hide your schedule, don’t hide weekly and daily reports inside your document control system – embed everything into what is effectively a social media platform.

In this example, I am using OpenProject.org , however, keep in mind there are a lot of systems that all work similar.

Add activities

The starting point for me would be to add you P6 activities to your tool. This is the natural place to begin. You schedule already has a structure and usually a very good balance of level of detail.

OpenProj_02

In the above, I have added a typical task that will exist in our P6 schedules. Immediately off the bat, we can see we are operating in a distributed web based environment. We have a nice detailed description for this activity and we have the ability to assign this task to a person.

Up to this point, we are a little overlapped with P6. However, what is lacking in P6 is the ability to really discuss and communicate and UPDATE information associated with a task. The ability to pull the task into a proper project management discussion.

OpenProj_04

The above example says more than you can find in any weekly or monthly report. A picture tells a 1000 stories! The picture is also properly assigned to the activity it represents. The activity has a clearly visible Finish date than can be live edited 24/7.

Our new tools are not meant to replace P6. They are meant to force our discussions into properly structured slices of the project. They are meant to clearly communicate the status of activities. They are meant to get everyone onto the same playing fields when discussing something so that 5 different people do not end up with 10 different dates.

An activity only has 1 start and finish date, an activity only has one percent complete. It is maddening when a project manager asks me to insert the contractual dates into a report. Honestly, when you are building something, the contract date is useless in helping you decide “when will this finish”. It was only the starting point. When you get people out of their office view, in into “I need to manage this scope” you quickly understand that the contract date, or even contractors weekly reports are useless. You have to make a determination of when an activity will finish based on what you know at that time – and be proactive in actively editing the dates when required

Empower people to update activities!

OpenProj_05

This is So Simple?

I sit and look at this capability of something I built in 30 minutes on a Sunday morning and really wonder why our Project Management is leading us down what may not be avenues of real improvement to projects. Does our Project Leadership have the vision to accept such simple solutions to improve our communication?

Digital Transformation?

I have discussed this before, digital transformation is all about Keywords – not project management. Real digital transformation is about altering the way we work – not building a dashboard or a database. This is why digital transformation is not working. Provide tools and process to manage the work, enable your staff to manage their own scope, and clearly communicate and update their tasks.

Really think about how you manage your scope and how implementing a more social platform to break down the walls of communication. Understand how this is disruptive to our old ways of working (not updating schedules). Get people talking off just one play-sheet!

Twitter in Construction

I’d like to end this with what I thought was the most amazing application of this new management approach.

Mersey Gateway Twitter Site

I kid you not. During construction, these guys posted nearly daily pictures and updates.

While working on this project from the home office, I got better updates from the project twitter site, then I did from the project manager. Yes, are you finally able to see that solutions exist, creative solutions exist, that can bring construction in the new digital world!

OpenProj_06

 

How to Build a near real time Dashboard using Datastudio and BigQuery

TLDR, the report is here, please note, my experience with BigQuery and Google stack is rather limited, this is just my own perspective as a business user .

Edit : 20 Sept 2019, DataStudio use now BI engine by default for connecting to BigQuery, now the report contains the historical data too.

I built already a dashboard that track AEMO Data using PowerBI,  and it is nearly perfect except , the maximum update per day is 8 time, which is quite ok ( direct Query is not an option as it is not supported when you publish to web, actually it is support but rather slow) , but for some reason, I thought how hard would it be to build a dashboard that show always the latest Data.

Edit : 23 Sept 2019, actually now, my go to solution for near real time reporting is Google Datastudio, once you get used to real time time, you can’t go back.

The requirements are

  1. Very minimum cost, it is just a hobby
  2. Near Real time (the data is published every 5 minutes)
  3. Export to csv
  4. Free to share.
  5. Ideally not too much technical, I don’t want something to build from scratch.

I got some advices from a friend who works in this kind of scenario and it seems the best option is to build a web app with a database like Postgresql,  with a front end in the likes of apache superset or Rstudio Shiny and host it  in a cheap VM by digitalocean , which I may eventually do, but I thought let’s give BigQuery a try, the free tier is very generous, 1 TB of free Queries per month is more than enough, and Datastudio is totally free and by default use live connection.

Unlike PowerBI which is a whole self service BI solution in one package, Google offering is split to three separate streams, ETL, the data warehouse (Biguery) and the reporting tool (Datastudio), the pricing is pay per usage

For the ETL, Dataprep would be the natural choice for me,( the service is provided by Trifacta), but to my surprise, apparently you can’t import data from an URL, I think I was a bit unfair to Trifacta, the data has to be in google storage first, which is fine, but the lack of support for zip is hard to understand, at least in the type of business I work for, everyone is using zip

I tried to use Data fusion, but it involve spinning a new spark cluster !!!! , and their price is around 3000 $ per month !!!!!

I think I will stick with Python for the moment.

  • The first thing you do after creating a new project in BigQuery is to setup cost control.

The minimum I could get for BigQeury is 0.5 TB per day

  • The source files are located here, very simple csv file, compressed by zip, I care only about three fields

SETTLEMENT DATE  : timestamp

DUID                            : Generator ID , ( power station, solar, wind farm etc)

SCADAVALUE             : Electricity produced in Mw

  • Add a table with partition per day and clustered by the field DUID
  • Write a python script that load data to Bigquery,you can have a look at the code used here, hopefully I will blog about it separately
  • Schedule the script to run every 5 minutes: I am huge fan of azure WebJob, to be honest I tried to use Google function but you can’t write anything in the local folder by default, it seems the container has to be stateless but I just find it easy when I can write temporary data in the local folder (I have a limited understanding of Google function, that was my first impression anyway) , now, I am using google functions and cloud Scheduler, Google functions provide a /tmp that you can write to it, it will use some memory resources.
  • I added a dimension table that show a full Description for the generator id, region etc, I have the coordinates too, but strangely, Datastudio map visual does not support tiles!!!
  • Create a view that join the two tables and remove any duplicate, and filter out the rows where there is no production (SCADAVALUE =0), if there is no full Description yet for the generator id, use the id instead

Notice here, although it is a view, the filter per partition still works, and there is a minimum of 10 MB per table regardless of the memory scanned, for billing BigQuery used the uncompressed size !!

One very good thing though, the queries results are cached for 1 day, if you do the same query again, it is free!

  • Create the Datastudio report : I will create two connections :
  • live connection: pull only today data, every query cost 20 MB, as it is using only one date partition, (2 Tables), the speed is satisfactory, make sure to disactivate the cache

But to confuse everyone there two types of caches, see documentation here, the implication is sometimes you get different updated depending if your selection hit the cache or not, as the editor of the report, it is not an issue, I can manually click refresh, but for the viewer, to be honest, I am not even sure how it works, sometimes, when I test it with incognito mode, I get the latest data sometimes not.

  • Import connection : it is called extract, it load the data to Datastudio in-memory database (it uses BI engine created by one of the original authors of multidimensional) , just be careful as the maximum that can be imported is 100 MB (non compressed), which is rather very small (ok it is free so I can’t complain really), once I was very confused why the data did not match, it turn out Datastudio truncate the import without warning, anyway to optimise this 100 MB, I extract a summary of the data and removed the time dimension and filtered only to the last 14 days, and I schedule the extract to run every day at 12:30 AM, notice today data is not included.

Note : Because both datasets use the same data source, cross filtering works by default, if using two different sources (let’s say, csv and google search, you need some awkward workaround to make it works)

  • Voila the live report, 😊 a nice feature shown here (sorry for the gif quality) is the export to Sheet
  1. Schedule email delivery

  although the report is very simple, I must admit, I find it very satisfying, there is some little pleasure in watching real time data, some missing features, I would love to have

  • An option to disactivate all the caches or bring back the option to let the viewer manually refresh the report.
  • An option to trigger email delivery based on alert, (for example when a measure reaches a maximum value), or at least schedule email delivery multiple time per day.
  • Make datastudio web site mobile friendly, it is hard to select the report from the list of available reports.
  • Google Datastudio support for maps is nearly non existent, that’s a showstopper for a lot of business scenarios