whenever I need to join Primavera Activity id to the quantity measurement system, I use this pattern, it did serve me well all those years, recently I started a new project where for the first time, I don’t get an extract using Excel but a proper live connection to SQL server 🙂
To get something quickly running, I started using the same approach, load Primavera export, unpivot the date and normalize it, every activity has a spread from 0 to 100 % then merge it to a Table from SQL server, all working as expected.
Although it works well, it is a bit clunky , specially that the export from Primavera does not change frequently, for the baseline maybe once a year and the forecast once a month, so instead of merging the data using Powerquery, I loaded the Primavera data as a separate table, here what the model looks like
As you have guessed the Activity id is duplicated in both tables
Now the Metric I am looking for is how to spread the budget hours from the table BOQ using the spread ( 0-100 %) from Primavera, let’s say I filter 1 row from the BOQ the result should be something like this
As it is multiple to multiple if you simply multiply the hours X spread you get duplicate values
In a previous blog, I showed how to load a raster tiles into PowerBI data model, in theory that should solved all my issues with doing a detailed maps in PowerBI.
unfortunately, no, even if R and Python visual support up to 150K points, the reality is the implementation of R in the PoweBI service has a massive overhead and you can’t do anything about it, as it is literally a black box, all you can do is try to reduce the data passed to R visual and hope it works.
Actually, in my case, the visual did not even show up and I got an error message that resources are exceeded
I am in a situation where I can’t filter data because the whole point of the visual is to show all the data, at the same time, if the visual does not work in the service then there is no point in the whole exercise.
The trick is using wkt, I will simplify the geometry without losing any visual data, for example:
Instead of showing all the points, I will just group the points in the same order and colour as a line, as you can see from 14 rows of data, it is reduced to 5 rows, and the visual representation is the same, it is like sampling, but we keep the exact shape of the data.
Now in PowerBI, all we need to do is to automatically group those points together, turn out the solution was very easy using Rankx, keep in mind the wkt is dynamic for every update, I get a new geometry
After that I just added some calculated columns to create the wkt format
For a point, POINT (X Y)
For a line, STRINLINE (start_X start_Y,finish_X finish_Y)
Keep in mind you can create polygons too, but the DAX become more complex (maybe for another blog)
you can create the wkt file in QGIS very easily but as my data change daily, it was not practical
And here is the final result
The number or rows were reduced from 3528 to 218
That make a massive difference in PowerBI service, my real data is 58K rows and I can’t tell how much I was happy when finaly it worked in the service,not only that, but the total rows using wkt keep decreasing when I do more updates 🙂
There is a catch though, unfortunately as of Dec 2019, only R and Python script can render wkt geometry, there is a new custom visual by @james dales, but it is in a private beta and has some limitation on colors by category. ( icon map support color per category now)
A typical situation in the construction industry the progress data is sourced from multiple system with different format, generally we get two type of reports.
Time stamp items
My preferred one, the data is tracked at a very low level (cable, spool, pre-commissioning and commissioning tracking), and you get a date when the item is completed something like this
This format is very convenient as you need to maintain only 1 file, the history is recorded in the data itself unfortunately, this kind of report is not always available for multiple reasons, the main one is, in some kind of work to finish one item it will take longer period of time, for example completing 1 drawing will take 3 weeks, no manager will wait 3 weeks to claim a progress.
This format reports the cumulative progress at a time period (daily, weekly, or whenever there is a progress) something like this
This is format is very common, it is very easy to update by the supervisor, and works with any level of details
the challenge of this format is
To get the historical data you need to keep all the previous files.
As it is cumulative data, calculating the progress per time period is a bit harder, and getting something like year to date is very awkward.
we need to normalize those files to be in the same format, one approach I use with PowerQuery is
Load the cumulative files.
Calculate the reverse total cumulative using self-join
Filter only the values where there is a progress
Append to the time stamp file.
Now we have a normalize Actual Table, where quantity per period, year to date and all date calculations are very easy to calculate.
I know it is tempting to just load data and start making visual and do some complex DAX calculation, but it is not sustainable and it will make your life miserable, a simple data model will make further development much easier.
In this blog, I will show how to leverage Python (or R) to implement an incremental refresh in PowerBI using PowerQuery and Python, nothing is really new ( I am sure Imke and Maxim has blogged about it before).
in a previous blog, I showed how to use R & Python integration to load data to a Database
This approach make sense only when you do a lot of heavy
transformation and your data source change based on time.
As an example, in my previous job, we receive a new excel
file every Monday (300K rows), this file gets approved and corrected every Thursday.
the workflow was:
save the files in a folder, do the transformation, which was fine , but after the first year, it was around 52 files, and although technically you need only to do transformation for the last file, and as PowerBI does not support incremental refresh, twice a week we redo everything, after two years, the refresh took nearly 30 Minutes and sometimes we get out of memory errors.
in the big picture,Half an hour was not that bad (we have a desktop just for refresh), the worst was, you refresh the model and once you finish, you get a new revision and you must refresh again.
Now using Python/R script, the idea is every file get transformed only 1
time, regardless of how many times you refresh, just by exporting the
results of the transformation of every file as a csv in a staging folder.
The first run is slow, as it will process all the existing files in Source Data, but the subsequent run, will transform only new files.
Let’s say File 2 was revised, all you need to do,is to delete File2.csv and it will be transformed again, but only that file.
Ok, if you see step 4, the files are reloaded each time, I am not too much worried about that, as the batch loading of csv files from a folder using PowerQuery is relatively fast (yes, a bit slow compared to R), the bottleneck is rather the transformation.
the code for python script is here, as you can see PowerQuery integration is amazing, just add a new step and you get a dataframe, that’s all,
the script split the dataframe by the column filename, and then export each file separately, currently it is saving into a local folder, but you can easily save those files into a cloud storage
to test it, I built a quick workflow using public data, PBIX here, the source data is zip files in a public website, there is a new zip file daily, it is relatively complex transformation as you need to unzip the file split it, delete some columns etc, the first run is slow, as it is processing all the files (62 files), but the next run, will just process 1 file, you can simulate that just by deleting some csv files in the staging folder, when you refresh again, only the files deleted will be processed again.
I think the main take away is, Python and R integration are amazing tools to implement new possibilities that will not be necessary available in PowerBI, and you don’t need to be a programmer to use those integration, a serious search on stackoverflow will get you started quickly.
Today was playing with a report in PowerBI and I got this idea of exporting data to BigQuery from PowerQuery, let me tell you something, it is very easy and it works rather well, PowerQuery is an amazing technology ( and it is free).
in PowerBI,you can export from R or Python visuals but there are a limitation of 150K rows, but if you use PowerQuery, there is no limitation ( I tried with a table of 23 Millions records and it works)
interesting after the step in Python we get a table, simply expand it
here is the total rows of the table in PowerBI
the results in BigQuery
ok, PowerQuery flow can execute many times, it is a black magic knowledge that’s only a handful of people knows, but in this cases, it does not matter, the BigQuery job truncate the tables every time, so there is no risk of data duplication.
probably you may ask why do that if there are a lot of data preparation tools that natively support BigQuery, based on my own experience, most of my data sources are Excel files and PowerQuery is just very powerful and versatile specially if you deal with “dirty” format.
Who doesn’t love the glossy Level 1 reports our project produce. But really, when you look into these beauties, really understand the difficulty that goes into them. What follows is first a description of what a typical Level 1 report is, and how we can structure our excel based data to be a bit smarter.
This is by no means a fully comprehensive guide on this subject. It is instead just a primer to get us thinking about how we feed data into our reports.
Who doesn’t love the glossy Level 1 reports our mega construction projects produce. But really, when you look into these beauties, do you really understand the difficulty that goes into them. What follows is first a description of what a typical Level 1 report is, and how we can structure our excel based data to be a bit smarter (which is the real message to this article).
Interspersed with hopefully be a few key strategy points which can guide your work.
I’ll then showcase how you can take what will now be structured data and upload into a powerBI visual (although the process to capture the data into any database and drive any visualization tool would be the same)
Strategy – Don’t be afraid to use excel (not everything needs to be automated)
Key Elements of a Level 1 Report
Cost and Progress
Here we are presented with:
Overall progress curve
Cost & Commitment curves
Some may argue what to lead with – for me its always %. No bigger value highlights where your are more than what % are we. Not displayed on the image above is a data series reflecting how many people are have and comparison against planned. People achieve progress. Its impossible to talk progress without talking how many people we have. The graphs provide enough enough context to allow for discussions about productivity without having to muddy the waters
The cost sections should include visibility into what our final forecast costs will be (and comparison against baseline). Underneath that key metric are a few sub items such as how much contingency we have, a few cost curves associated with spend profiles and commitment profiles.
Schedule and Narrative
The schedule aspects of a Level 1 report are always tricky. Do we need to only display the final project milestone? For me, on major projects no single DATE has any meaning. Thus even on a Level 1, I still prefer to include 10-15 dates that represent some key aspect of the project. All dates should be compared against what we said last month to highlight current month variances, and dates should be compared against our project baseline (or whatever current approved version thereof).
The narrative section of a Level 1 can nearly always be updated by simply reading the progress, cost and schedule tables. Just put words to the graphs. Key adders here are insights into RISKs. What may come in the future that will alter what we are saying today!
As always, safety metrics are also usually front and center. For me, this has always been a difficult aspect of our jobs. A political correctness that is forced into our reporting. Don’t get me wrong, safety is the most important aspect of a project. So, including a safety table somewhere on the Level 1 is always done. For this article, I want to instead focus on the key project control elements and data integration.
Level 1 Data Structures
So, we all know what a Level 1 report looks like, and I would fathom we can all mostly agree these are the elements included and can be rolled out as a standard for any major construction contractor. Most of our reports likely already report this information in some manner or another. The entire point of this article is that we should really focus on entering the data in a smart data centric way so that if you want to automate anything down the line, you have the foundations to do so.
At this stage, I don’t want to talk about the source data used to generate your summaries. We can leave that for a later discussion.
Key Data Domains
We are aiming towards consistency here and want to actually represent all the data required for our key Level 1 chart to be housed in a database. Therefore we need to have structure.
Strategy – Do not focus on systems, focus on DATA
A critical strategic element in my approach is that I do not care what systems you use. Our reporting is not a function of our systems (at least in this step 1 phase). We instead need a structure from which we can extract data and as easily as possible, move that data into a template or format in which we can drive our level 1.
If you go down the path to seamlessly integrate source systems with a Level 1, you unwittingly constrain yourself.
Typically our (time phased) progress data will be sourced from Primavera. There are other systems where the progress data may live, but again, that isn’t the focus of this article – I don’t care where it lives and neither will any seasoned project controls manager. We just need to know it exists and has a common structure
Here, a few key notes, use a consistent data format. The above structure is how all your progress data should be housed, not just Level 1. All time phased data, all the way down to Level 5 detail items should be managed in a data structure, not a fancy formatted excel file. Trust me, updating a table such as the above will serve you in the long run. Even if your data is fully managed inside a system such as P6 or PRISM or ECOSYS or COBRA, you should be able to at least extract Level 1 into the format defined above.
You guessed, we can capture our Level 1 cost data in exactly the same format
In the graphs we are building, there are only 11 specific datasets. Only 4 of these require update on a period basis. So again, we really boil this down to something simple.
Strategy – Do not over complicate anything in your Level 1 layer
The implementation of the specific data model I have outlined above fits the strategic approach to keep your level 1 simple. Any project can implement this data model for Level 1 with without any integration into source systems. Level 1 can be updated by the project controls team doing a few copy-pastes into excel to capture project wide data. Again, I would assume your teams already do this, but perhaps end up copying this data into various other corporate systems as well.
Again, we are keeping a simple approach and only capture the required information.
Here, we are forced into a different structure. So whereas the cost and progress data can fit the same data model (as seen above), we will need a different template for schedule dates. We will typically be using Primavera, as such this model fits P6, but the idea is universal.
I do not believe this information can ever be fully automated from our scheduling systems. These paths will continually be adjusted. The planning lead will always refine what activities are being tracked to be displayed on the Level 1. Behind the scenes, there are tricks upon tricks to pull the dates, however, again, we are talking about the data layer here, not necessarily HOW you get the data into this format.
It is entirely possible to have the assignments encoded into P6 activity codes. Therefore, it would be possible to integrate your Level 1 data directly into either the source P6 database, or an XER export. In my experience, any automation that is attempted in this arena (for Level 1 data), is futile. We are only talking 10-15 key activities. Let you lead planner sort out how they get the data into this format. Again, our strategy is to not over complicate this. If the data is provided to a digital team in the format about, you are for all intents done.
The model above only captures the finish dates. If added visuals with simplified GANTT charts are needed in your Level 1 (and will be discussed in my next Level 2 article), you would have to edit the above.
The nice value of the above structure is that we have effectively created an interface, an integration layer, between what will be P6 data and our dashboard. The list of what activities can easily be edited by way of a sharepoint list. Then, in your data model, you can link on scheduleID to pull the relevent date data (I suspect many do this).
Too often, narrative comments are shuffled between parties via email, entered into several documents, edited, customized, etc. The project controls team is always struggling sourcing commentary from various sources, and in my experience, we end up entering in something ourselves.
Level 1 data structures have to fit into these complications. In this realm, sharepoint offers a canned solution by way of sharepoint lists.
Strategy – If Technology already exists, use it
Strategy – Technology can be used in innovative ways – use a mashup mindset to use existing technology in a new way
I find that sharepoint lists offer unparalleled capabilities for commentary. However, for lists to be really functional, they need to be embedded into FORMS or some routines that provide export functionality
In this example, I have mocked up a simple INFOPATH form that could represent our sharepoint fields. The sky is the limit when it comes to existing technology that can automate the capture of this type of commentary.
The value adder here is that instead of allowing unstructured comments (via email or manually marking up a word , excel or power point file), we have structured comments that are housed in a database and that database can be updated in a distributed manner using WEB based technologies.
The above would be a web based form which will be updated by the associated responsible parties. However, we can’t quite import a form into our data model. When the above form is filled out, the data will be stored in a data model (which we will have to design first before we can even build the form above). Thus, what we are looking for is something akin to the below
The above is just a table in an excel file, but again, when we house data in this format, it can naturally flow into a database. That is what we need to focus on. Even in our excel reporting world, if you can capture commentary in this tabular data centric way, you can still link to it from your main dashboard tabs to be “smarter” in how information is managed.
Strategy – Focus on the DATA! (I can’t say this enough)
Everything we do can be captured in a data model. Every report we design should be able to pull direct data out of a data structure. Thus, before we add anything to reports, first consider the entire flow of data required.
Putting it all together
At no point in time in the above have I had to rely on a source system. However, I have been able to take a typical Level 1 report and extract everything from it. I have taken this data and outlined a data model (in simple form) that can drive not just 1 project, but an entire corporate endeavor in this space.
As with everything, nothing novel here. Many companies already have systems that capture some of this information. This is more just a thought experiment for those that perhaps do not have a clear data model that supports level 1 reporting. It also highlights the discussion topics of “what are the manual steps” – because there will be manual steps in getting the data into the right format.
For me, everything above has to be manual at some point up or down the food chain. Your projects and portfolios need to have the discussions about where this type of Level 1 data is housed. If all projects already have this data in consistent databases, all you need to do is query that source. Everything discussed here is system independent. You can easily generate these data tables by way of query a source system directly (if you can), but I have not limited or require that approach
Strategy – Whatever you do, allow for flexibility
Even though my data model is entirely excel based, the data structure is very powerful. I can, in quite automated steps, import and convert these datesets into a more database model and thus gain value from dashboards that wouldn’t be custom for your project, but could drive an entire portfolio (and when you see how this scales to Level 2 data and beyond, the worlds your oyster).
If you actually want to proceed with a dashboard, and if you have your data as outlined above, here is what you can do with it. In fact, I would recommend that your source tab in excel that is driving your dashboard looks like the below.
The above data isn’t “immediately” friendly for digital reporting. A few transformations are required. The key steps involved are (the below was done as just an example using PowerQuery)
Unpivot the Timephase date columns
Pivot the the “SeriesName” column to create a unique “Column” for each dataset (this is need to create unique lines on our dashboard graphs)
At this stage, we have a nicely formatted table and we can now import into PowerBI. The intent here is not to showcase a beautiful Level 1 dashboard in PowerBI. My intent is more to showcase the data structures need to drive a dashboard. With the above data, we get pull each data series into graphs, tables, cards, KPI metrics, etc.
Our model has tagged each record with a “As-Of” date. Thus you can utilize this structure to have your dashboard display ALL prior months by way of a slider or select. Given more advanced skills, you can also pull out metrics about current incremental values vs what we said last month. Although, I feel those metrics are best served in Level 2 report where more detail is available.
Apologies for the look and feel below, I just pulled in the data to showcase that indeed you can drive a dashboard with what is effectively just a few lines of data that every project already has. We can bring together cost, schedule, progress, and commentary quite easily and in a very data friendly way.
For me, there is no substitute for an excel based dashboard. The value in this for me is ensuring that when I produce a Level 1 Dashboard (in Excel), I should give consideration to ensuring my data is structured appropriately. This gives us a fighting change to perhaps go down the path of creating a more digital world. It also allows for perhaps more flexibility in dealing with Level 2 data to maybe have some real automation of rolling up of data.
Level 2 obviously. I hope to showcase how the same ideas and concepts here can also help you structure your raw excel based Level 2 data to perhaps be better utilized in a more digital world
I was looking for the Power Production of a particular solar farm, and I couldn’t find any public dashboard that show this level of details, all I could find was high level aggregated data (Later after I built the dashboard I found this excellent resources Nemlog)
Current, last 60 days of data ( current day not included, Updated at 4 AM)
Archive : the last 13 Months of data ( current month not included, Updated Monthly)
Pulling data from a website and building a dashboard in PowerBI is straightforward, it took me a couple of hours on a weekend to do it, the problem is how to maintain it.
Ideally, you build a dashboard and all the refresh is done by the service, which was not the case here
Pulling the data directly from the archive is very slow, it takes nearly 3 hours ( unzip, filters only the data we are interested in), and is not sustainable as the earliest month will be removed from the website, I like to keep the history, and it is really bad practise to download the same data every day
To keep the history, we need to save the archive somewhere else, too easy , just save it on a local laptop
History issues solved, now we created a new problem, on-premise data require a gateway, basically you need to install a software on your laptop, and obviously the laptop must be on when you do the refresh
After playing around of some options, I come up with this workflow
Create a local folder that contains all the archive files.
Create a PowerBI data model on the desktop just to process the archive data
Export to clean tables ( price and Production ) to CSV using DAX studio !!!!!!
Load the CSV to azure blob storage ( to get rid of the gateway)
Load the current zip files from the web site , it does not require the gateway, but you need the following consideration – Use relative path in web.content functions ( see Chris Blog) and @TheBIccountant
Append the data from azure blob storage and the current folder from the web site, the refresh is now very fast, as PowerBI just read the csv without any transformation
Publish to web
Good so far, I manage to get rid of the gateway, the dashboard is refreshed automatically in the service, no maintenance for 60 days.
as the current folders contains data for the current 60 days only, you need to update the initial CSV files.
Download the pbix from the service, export the csv , and upload to blob storage, you need to do that only once every 60 days.
PowerBI Publish to web is an amazing service and it is totally free
Powerful solution without writing any codes
PowerBI free license is free 🙂
Publish to web is not suitable for real time, as it takes nearly 1 hours to propagate the update to the web site, that’s why I can’t publish the current day data, which is updated every 5 minutes.
Publish to web does not include export data from the visual
pricing for azure blob storage can be tricky : storage itself is very cheap, data upload is free, download in the same region is free ( for example blob to PowerBI service), but when you read data from the blob to PowerBI desktop you incurs charges, so just be careful, it is not your Onedrive model, where download is free.
we showed here a simple workflow using PowerBI free license and azure blob storage (Dropbox), it is very easy but with one inconvenient you need manual operation once every two months, that’s a bit annoying.
edit 23-June-2019 :after I published this blog, I got an excellent feedback from Maxim Zelensky, actually using PowerBI dataflows ( require a PRO license), we can fully automated the whole process, as with dataflows we can have a self reference query, I am not going to repeated here, go and read it
edit 24-June-2019: as it is a personal project, and the data is public, I am not really excited about using a paid service to host the CSV files, I moved the two csv files from blob storage to dropbox, it is totally free, so the whole dashboard infrastructure is free, Good work Microsoft
edit 26-June-2019 : a proper solution will be to save the raw data in a data lake, see here
edit 7 -Sept-2019 : I was a bit tired of workaround, now using Google Cl0ud functions and BigQuery , see blog here