Using WebJobs to scrape a public website and copy the data to Azure Blob Storage.

When I started the AEMO dashboard, I had a hard time dealing with the PowerBI gateway: it just sits there, and my laptop has to be online whenever a scheduled refresh runs. That annoyed me, and I could not understand why cloud-based data would need an on-premises gateway anyway. Obviously I later learned that, strictly speaking, it is not required; there is just an undocumented way to get away from it (the trick is in the first blog, thanks @Rad_Reza).

But before I was aware of that, I went down some rabbit holes dealing with new tools outside my comfort zone, and I think they are worth sharing.

My first thought was: instead of accessing the data directly from the website, let's copy the data to cloud storage and read it from there. I already have a Google Cloud Storage account, which is generous with a free 5 GB tier, and my data is not big, around 2 GB of zipped CSV.

First setback: there is no native connector for Google Cloud Storage, and even if there were, we have something called egress fees. In a nutshell, cloud storage is really low cost: loading data in is free, but getting your data out is not, unless it stays with the same provider and in the same region. Most cloud vendors use the same model. As my data will be processed in PowerBI, the clear choice is Azure Blob Storage.

Azure Blob Storage

The setup is very simple; I used the following options:

  • the same region as my PowerBI region (otherwise you pay egress fees)
  • for replication, LRS (locally redundant storage)

As PowerBI does not support Data Lake Storage Gen2, I used the classic Blob Storage.

Let’s move some data 

Anyway, now that I have my storage, I need a tool to copy the files from http://nemweb.com.au/Reports/Current/Daily_Reports/ to my storage account.

Azure Data Factory

When you want to copy data, the official tool in Azure is Data Factory. I played around with the copy activity; it is straightforward, and my first attempt did work and was fast, actually too fast 😊: no zip was transferred, but rather an HTML page.

The copy activity can probably handle this case just fine, but when you use your own credit card on a cloud tool and you don't know what you are doing, it is better to step back and take the time to understand how it works. I deleted the newly created resources and went for the second option: Python!

 PYTHON

Normally I go with R, but Blob Storage has no API for R. I have very limited experience with Python, having used it only for the excellent package Altair, so let's try something new.

I was very pleasantly surprised: the amount of documentation for Python is just amazing. Once I asked a question on Stack Overflow and got a very succinct answer in less than a minute; no one was judgemental or downvoted my question (and it was very basic). The only drawback is that sometimes the example code is written for Python 2 while I am using Python 3. Anyway, enough talking, let's show some pseudo code.

Step 1: get the list of file names from the website

import re
from urllib.request import urlopen

url = "http://nemweb.com.au/Reports/Current/Daily_Reports/"   # the folder listed above

result = urlopen(url).read().decode('utf-8')

pattern = re.compile(r'[\w.]*\.zip')
List_website = pattern.findall(result)

Here is a snapshot of the results; the full list is 60 items.

['PUBLIC_DAILY_201904260000_20190427040503.zip', 'PUBLIC_DAILY_201904260000_20190427040503.zip', 'PUBLIC_DAILY_201904270000_20190428040502.zip']

Step 2: get the list of file names already in the blob storage

In the first run the list is empty, as we have not loaded anything yet; I loaded a couple of files manually just to test that it works. The API for Blob Storage is very simple: you only need to provide your storage account name and key, and I love that.

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='xxxxxx', account_key='xxxxxx')

generator = block_blob_service.list_blobs(container_name, prefix="current/")  # container_name is your blob container
list_azure = [blob.name for blob in generator]

As before, you get a list of names.

Step 3: get the list of files that exist on the website but not in the storage

The code in Python is very simple: it is just the subtraction of two sets, converted back to a list using the list function (I get why people like Python).

files_to_upload = list(set(List_website)-set(list_azure))

Step 4: upload the new files to Azure Blob Storage

The same here: the Azure API is very simple and clear. I had only one issue: when the script uploads in a loop, it does not wait until the transfer is completed before jumping to the next file. My first workaround was just to use sleep (a synchronous copy is supported, but not in this scenario where the source is a URL); anyway, I got the answer on Stack Overflow.

for x in files_to_upload:
    block_blob_service.copy_blob(container_name, x, url + x)
    # check the copy status and wait until it is completed before moving to the next file
    copy_status = block_blob_service.get_blob_properties(container_name, x)
    while str(copy_status.properties.copy.status) != "success":
        copy_status = block_blob_service.get_blob_properties(container_name, x)

Basically, we wait until the copy status is success before moving to the next item (did I say I love Python's syntax?).

the full script is here 

WebJobs: a free job scheduler

OK, so we have a script that works; now we need to run it on a schedule, once per day at 5 AM. Keep in mind the whole purpose of this workflow is to avoid on-premises software, so I just need a service that runs a script in the cloud on a schedule. As I am already on Azure, let's stick with that ecosystem.

And as it is a personal project, I prefer a free solution; my script runs only once every 24 hours, for a couple of minutes. A quick Google search and I found this little treasure. I will not repeat the steps here: WebJobs is a service that does just that.

Note that azure-storage-blob is not part of the Python standard library; you need to install it first in WebJobs. The schedule functionality is very flexible as it uses CRON expressions (for example, the six-field expression 0 0 5 * * * runs every day at 5 AM); I wish we had something like that in PowerBI Dataflows.

End results 

Every day at exactly 5 AM, a new file shows up in the Azure storage. Although I don't need those files right now (I am using another approach to load the files directly into PowerBI), it is important to build a data lake (yes, I just said that; I am only joking, a data lake is a folder in the cloud where you save the raw files, nothing more). Storage is cheap, but most importantly the requirements may change: I may need to report on another dimension, and it is crucial to keep the raw, unprocessed data.

Take away

  • Python is awesome
  • The Azure APIs for Python are straightforward
  • Azure is awesome
  • Be careful of egress fees
  • CRON is awesome; I wish it were supported in PowerBI Dataflows
  • I wish PowerBI Dataflows could save raw files; PowerQuery is amazing, but it does not copy raw files
  • I wish WebJobs added support for R

Tracking AEMO data using PowerBI

I was looking for the power production of a particular solar farm, and I couldn't find any public dashboard that showed this level of detail; all I could find was high-level aggregated data. (Later, after I built the dashboard, I found this excellent resource, Nemlog.)

The dashboard is published at https://djouallah.github.io/AEMO-POWERBI/ and is refreshed every day at 5 AM.


How it works

The Australian Energy Market Operator (AEMO) publishes all kinds of datasets; one of them, I believe, is real time (it requires a subscription), but for my particular use case I am interested in this dataset:

http://www.nemweb.com.au/#daily-reports

There are two folders:

  1. Current: the last 60 days of data (current day not included, updated at 4 AM)
  2. Archive: the last 13 months of data (current month not included, updated monthly)

Pulling data from a website and building a dashboard in PowerBI is straightforward; it took me a couple of hours on a weekend. The problem is how to maintain it.

Ideally, you build a dashboard and the refresh is handled entirely by the service, which was not the case here:

  • Pulling the data directly from the archive is very slow; it takes nearly 3 hours (unzipping, filtering only the data we are interested in). It is also not sustainable, as the earliest month is eventually removed from the website, I like to keep the history, and it is really bad practice to download the same data every day.
  • To keep the history, we need to save the archive somewhere else. Too easy: just save it on a local laptop.
  • History issue solved, but now we have created a new problem: on-premises data requires a gateway. Basically, you need to install software on your laptop, and obviously the laptop must be on whenever the refresh runs.

After playing around with some options, I came up with this workflow:

  • Create a local folder that contains all the archive files.
  • Create a PowerBI data model on the desktop just to process the archive data.
  • Export the two clean tables (price and production) to CSV using DAX Studio!
  • Load the CSV files to Azure Blob Storage (to get rid of the gateway).
  • Load the current zip files from the website; this does not require the gateway, but you need the following considerations:
    – Use a relative path in the Web.Contents function (see Chris Webb's blog and @TheBIccountant):

Web.Contents("http://www.nemweb.com.au/REPORTS/CURRENT/",[RelativePath = "Daily_Reports/" ])

– Don't use the Web.Page function; parse the result using the Xml or Csv functions instead (thanks Reda Rad for the advice),

so you can use something like this:

Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.nemweb.com.au/REPORTS/CURRENT/",[RelativePath = "Daily_Reports/" ]), null, null, 65001)})

  • Append the data from Azure Blob Storage and the current folder from the website; the refresh is now very fast, as PowerBI just reads the CSV files without any transformation.
  • Publish to web.

So far so good: I managed to get rid of the gateway, and the dashboard refreshes automatically in the service, with no maintenance for 60 days.

As the current folder contains data for the last 60 days only, you need to update the initial CSV files:

  • Download the pbix from the service, export the CSV, and upload it to Blob Storage; you need to do that only once every 60 days.

Pros

  • PowerBI Publish to web is an amazing service, and it is totally free
  • It is a powerful solution without writing any code
  • The PowerBI free license is free 🙂

Cons

  • Publish to web is not suitable for real time, as it takes nearly 1 hour to propagate updates to the website; that's why I can't publish the current day's data, which is updated every 5 minutes.
  • Publish to web does not support exporting data from the visuals.
  • Pricing for Azure Blob Storage can be tricky: storage itself is very cheap, uploading data is free, and downloading within the same region is free (for example, blob to the PowerBI service), but when you read data from the blob into PowerBI Desktop you incur charges. Just be careful: it is not your OneDrive model, where download is free.

We showed here a simple workflow using the PowerBI free license and Azure Blob Storage (later Dropbox); it is very easy, but with one inconvenience: you need a manual operation once every two months, which is a bit annoying.

Edit 23-June-2019: after I published this blog, I got excellent feedback from Maxim Zelensky. Using PowerBI Dataflows (which requires a Pro license), we can fully automate the whole process, as with dataflows we can have a self-referencing query; I am not going to repeat it here, go and read his post.

Edit 24-June-2019: as it is a personal project and the data is public, I am not really excited about using a paid service to host the CSV files, so I moved the two CSV files from Blob Storage to Dropbox, which is totally free; the whole dashboard infrastructure is now free. Good work, Microsoft.

Edit 26-June-2019: a proper solution would be to save the raw data in a data lake, see here.

Create a PowerBI Resource Analysis Dashboard – by Darrin Kinney

Feel free to provide comments directly to my LinkedIn Post that references this article and contact me directly – Darrin Kinney

Analyzing resources in P6 is a common responsibility of all planners. However, the data typically stays in the realm of the planner and is never properly given to the specific project managers and project engineers who actually need to execute the work. Thus, there exists a huge opportunity in the digital world to extract data from P6 and present it to the masses. Planners are consummate professionals at this; we live for this. We take the resource assignment data from P6 and work our magic in Excel, creating a suite of reports and graphs (S-curves).

However, all the Excel work is customized: the ability to quickly drill into the data from the data table or the graph is just too difficult for management, and creating custom graphs takes time. Instead, we can create something quite amazing using PowerBI.

Using PowerBI we can create a simple, easy-to-use dashboard that provides nearly unlimited flexibility to display both schedule and resource information to any user on a project. Below is a walkthrough to create a simple view with a simple schedule. However, I have run this against a schedule with 5,000 activities and multiple resources, including both manhour and quantity resources. It is in situations of data overload that PowerBI shines: get the data out from the planner and into the hands of the project management group.

Step 1 – Get resource assignment data

Using the resource assignment tab in P6, remove all the grouping and just display everything. You will want to ensure some key fields are available: Start Date, End Date, Resource Data, Resource ID and Type, Activity Name and ID, plus some WBS and grouping data (e.g. contract).

RA_2_P6_resources

Step 2 – Copy-Paste Resource Assignment Data into Excel

A simple copy-paste from P6 into Excel will suffice; it is really that easy. Sometimes the data from P6 may not contain the right descriptions, so this is an opportunity to use some VLOOKUPs (or INDEX(MATCH())). The data also needs to be presented as a table: select all and use Ctrl-T to convert it to a table.

RA_3_excel_table

Step 3 – Unpivot Data using Power Query

This step can be done directly inside PowerBI Desktop or, perhaps more flexibly, in Excel. I prefer to do as much data handling in Excel as possible to reduce the complications once inside PowerBI; however, many options exist.

Inside Excel (ensure you have the Power Query add-in), select the columns with the dates and click "Unpivot Columns", then close and load.

After running the unpivot, you will see each date column turned into rows. This format is easier to work with in databases, and the trick is universal: not just for PowerBI, but any time you are dealing with databases.
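For reference, here is a minimal Power Query (M) sketch of that unpivot step. The table name (ResourceAssignments) and the fixed column names are assumptions for illustration; use whatever headers your P6 extract actually has.

let
    Source = Excel.CurrentWorkbook(){[Name = "ResourceAssignments"]}[Content],
    // keep the descriptive columns fixed and turn every remaining date column into attribute/value rows
    Unpivoted = Table.UnpivotOtherColumns(
        Source,
        {"Activity ID", "Activity Name", "Resource ID", "Resource Type", "WBS"},
        "Period",
        "Hours"
    ),
    // assuming the date headers parse cleanly, type the new columns for PowerBI
    Typed = Table.TransformColumnTypes(Unpivoted, {{"Period", type date}, {"Hours", type number}})
in
    Typed

A nice side effect of Table.UnpivotOtherColumns is that it only lists the columns to keep, so any new date columns added to the extract later are picked up automatically.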

RA_5_powerq

Step 4 – Import Into PowerBI

PowerBI has very quick and easy import routines; this post is not meant to be a click-by-click guide. Again, as indicated above, just make sure your data is converted into a table. You can create a data source directly using something similar to the below:

=Excel.Workbook(File.Contents("C:\Users\Name\Downloads\P6_resource_Assignments_HP.xlsx"), null, true)
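If it helps, here is a hedged sketch of the full query around that line; the table name "Table1" is a placeholder for whatever name you gave the table in Step 2.

let
    Source = Excel.Workbook(File.Contents("C:\Users\Name\Downloads\P6_resource_Assignments_HP.xlsx"), null, true),
    // pick the unpivoted table by name; "Table1" is an assumed name
    ResourceAssignments = Source{[Item = "Table1", Kind = "Table"]}[Data]
in
    ResourceAssignments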

Step 5 – Create your PowerBI Elements : Gantt, Graph, and Slicers

GANTT

The PowerBI file uses a custom visual you can download for free from the marketplace

GANTT by MAQ Software

The parameters you will want to use are roughly as noted below

GRAPH

The PowerBI file uses an area chart, an easy-to-use built-in visual in PowerBI. A potential extension is to add line graphs to show a % complete curve; however, for basic usage, the area chart provides the easiest visual.

A nice trick in the overall dashboard is to align the date range of the Gantt with the graph. I have found some limitations, but in this example it worked quite nicely.

 

SLICER

The real power of this dashboard comes from slicers, which allow immediate filtering based on a variety of selections. This one uses a range of filters for Contract, Facility, SubFacility and ResourceID. Depending on the structure of the data you extracted from the P6 resource assignments, your options here are unlimited.

 

Conclusion

This example just scratches the surface of what is possible. I found the above to be immediately useful to our project team in clearly visualizing the resources required over time. There is some manual effort to keep the source Excel updated after changes to P6 – this is not a live linked data source. There are numerous possible extensions people can use to customize this.

The biggest critique of the dashboard is the lack of visibility into the baseline dates and plan curves; for this, we can only hope someone builds a more robust custom visual. In the end, happy planning.

RA_1_Final

Integrated Project Control system using PowerBI

One of the most popular discussions in planning forums is how to build an integrated project controls system; every practitioner has a different opinion on how it must be done, and of course you get a lot of marketing from people trying to sell their systems.

In this blog, we share a high-level description of a data warehouse built using PowerBI Desktop.

Data warehouses are not widespread in the construction industry because the reporting specifications differ from project to project and every client has different systems and tools; trying to have a standard system works only if you are the main contractor, but if you are a subcontractor you have to adopt the client's system.

Another reason is that it requires specialised IT skills; we are business users, not programmers. We understand data very well, but we don't necessarily have the skills and tools to manage it. The good news is that, with the rise of self-service business intelligence, we have exactly that: powerful yet accessible data management tools (assuming you want to learn something other than Excel).

So, at a high level, this is how it works.

 Project Controls Data Warehouse

  • The data warehouse was built using PowerBI Desktop. I know it should be called a semantic model (for me, data model and data warehouse are fundamentally the same thing); initially it used Excel PowerPivot, but that did not scale well with the growing volume of data.
  • As the data is not always in the format we want, PowerQuery is very handy here, as it can transform virtually any source of data, for example looking up the subsystem using the tag field (see the sketch after this list); trying to do that over 8 million rows in Excel or Access is not feasible.
  • We maintain master tables to integrate all the different sources of data (tags, WBS, subsystems, etc.).
  • Every week we get new exports from the source systems (Cobra, proprietary database systems, etc.); we load the new data and keep the historical records. It takes 15 minutes to refresh, which is quite impressive: Cobra alone is a folder of 60 Excel files, nearly 2 GB in size.
  • Usually you publish your reports to the PowerBI.com service for end users once you refresh your data model; in our case we can't use the cloud for privacy reasons, so instead we use Excel as a reporting tool that pulls the data from PowerBI Desktop. The advantage of this approach is that we have different reports for different users: skyline, Gantt chart, client reports (in their required format), management reports, etc.
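As an illustration of that tag-to-subsystem lookup, here is a minimal Power Query sketch. The query names (CobraExport, MasterTags) and column names (Tag, Subsystem) are assumptions for the example, not the actual names in the model.

let
    // CobraExport and MasterTags stand for queries already defined elsewhere in the model
    Joined = Table.NestedJoin(CobraExport, {"Tag"}, MasterTags, {"Tag"}, "Master", JoinKind.LeftOuter),
    // bring back only the Subsystem column from the master table
    WithSubsystem = Table.ExpandTableColumn(Joined, "Master", {"Subsystem"})
in
    WithSubsystem

PowerQuery performs the merge in its own engine, which copes with millions of rows far better than VLOOKUP-style formulas in Excel or Access queries.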

Some thoughts.

  • As you can see, Primavera P6 is used only as a forecasting tool and to calculate the critical path; the earned value calculation is done in the data model. Personally, I think P6 should not be used as the centre of your project controls system. I remember the first time I started learning Primavera P3 (a long time ago :), we kept asking the trainer how to track the spent hours at the activity level; the answer was "we don't". Actually, deciding at which level you track your spent hours is the most important decision to make when you start a new project.
  • The basic idea here is that the way to have an integrated project controls system is to stop trying to have one. Data will always be in silos; don't try to change how other departments manage their specific data, it will not work and they will not listen to you anyway. So instead of trying to have one system to rule them all, just use the existing systems and build a data warehouse for reporting and integration with P6.

How to use Excel as a Front End to Power BI Desktop

You can download the data model, the Excel front ends and the data sets from this folder.

The techniques used in the spreadsheet for connecting to PowerBI Desktop are not officially supported by Microsoft (as of 24 May 2018, Microsoft doesn't mind people using this approach), but I think they are useful for understanding how the MS BI stack works, especially if you are coming from an Excel background and installing SQL Server Developer is either intimidating or not allowed.

I would really appreciate it if you vote for this idea to make this scenario officially supported by Microsoft.

As far as I am aware, there are two approaches to accessing PowerBI Desktop, or more correctly the SSAS instance launched by Power BI Desktop:

  • Connecting using a live connection, detailed here
  • Connecting using PowerQuery, detailed here

 

The live connection is very interesting, but it has the drawback that if you close PowerBI Desktop the pivot table stops working, so you cannot share the spreadsheet. One solution is to use cube formulas, as they are persistent; if you don't know what a cube formula is, then you are missing one of the most powerful features of Excel.

But what if you want a pivot chart, or a pivot table that you can keep using even if you lose the connection, or you want to share the results with people who have no access to the data model? It turns out this is possible: welcome to the Excel pivot cache.

 

  • Invoke the function SSAS_QUERY (a sketch of this function is shown after this list).
  • The parameter is optional:
    • Either write a DAX query to retrieve the fields you want; MDX is supported too. Personally I find MDX more suitable for importing measures with different dimensions; I understand DAX supports this scenario too using CROSSJOIN, but I never managed to make it work correctly.
    • Or just click OK, and you can browse the SSAS cube and select any dimensions and measures you want; but note that for a big cube, a query fetches the results faster.
  • Keep the PowerQuery query as a connection only.
  • Insert a pivot table, use an external data source, choose the connection, select the PowerQuery query, and voilà.
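For context, here is a minimal sketch of what such an SSAS_QUERY function can look like in Power Query. The port and database name are placeholders (Power BI Desktop picks a random port and database name each time it starts), so treat this as an illustration of the idea rather than the exact function shipped in the download folder.

(optional Query as nullable text) =>
let
    // point this at the local SSAS instance launched by Power BI Desktop, e.g. "localhost:51542"
    Server = "localhost:51542",
    Database = "xxxxxx",   // placeholder for the database name of your session
    Source = if Query = null
        then AnalysisServices.Database(Server, Database)
        else AnalysisServices.Database(Server, Database, [Query = Query])
in
    Source

A call like SSAS_QUERY("EVALUATE MyTable") (MyTable being a placeholder table name) then returns that table as a regular PowerQuery result, which you keep as a connection only and feed into the pivot table.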


 

 

Excel caches the data in the pivot, and not only that, it is extremely compressed.

Notice that PowerPivot is not used at all here; Excel is acting as a visualisation layer for PowerBI Desktop, leveraging two well-known capabilities: cube formulas and the pivot cache.

Microsoft plans to release SQL Server vNext this year, and then we will be able to deploy a data model built with Power BI/PowerPivot/PowerQuery into a production system; that's what I call a natural growth path from self-service to corporate BI.

Edit Dec-17: I added a new file that uses only VBA, so PowerQuery is not required; it can be useful if you are still on Excel 2007 or cannot install PowerQuery.

 

Thin and Core workbook without Sharepoint or SSAS

Edit 1/1/2017: another option is described here; the thin workbook connects directly to Power BI Desktop, which is very handy if your clients are still on Excel 2010, or you don't want to share the data in the data model but just the results.

 

The concept of thin and core workbooks was popularised by the guys at PowerPivotPro, but they assume that you either have SharePoint installed or a connection to the cloud.

Let's assume the only resources you have are:

  • Your PC.
  • A local server used for shared folders.

Backend and Front End approach

If you have small data sets, then load everything into PowerPivot and do the reports in the same workbook. But if you are dealing with a lot of data sets and you need to provide tailored reports to different people, a typical example is your client wanting a particular format, the PM wanting a different format, the construction manager wanting something different altogether, and we all know what the commissioning guy wants 🙂

Then a sensible approach is to separate the data model from the reports themselves: a backend to store the data, and a front end to report on and visualise it.

Front End:

Excel: natural choice for many obvious reasons.

Tableau: has a killer feature in this scenario.

Backend:

There are many options, depending on the size and the complexity of your data.

1- MS Access: a solid option, but if you have tables with more than 500,000 rows then it is perhaps not really a good solution (I know it depends on field types, number of columns, etc.); the other issue is that there is no ETL for MS Access.

2- MS SQL Server Express: a free and kick-ass database server that will probably hold all your data, but if your data is not well structured it becomes a pain to load; there is no free ETL for SQL Express, and PowerQuery does not natively load to SQL (you can vote for this idea here).

3- PowerBI Desktop: load the data into PowerBI Desktop and access it from Excel using this hack. It is highly experimental but very promising, potentially a game changer. You can vote here if you want Microsoft to officially support it.

4- PowerPivot: it is already there with Excel; PowerQuery can load virtually any data you have and clean and transform it with a few clicks, and it is free. And don't forget PowerPivot is simply an SSAS instance embedded in Excel. OK, there is a problem 😦

We cannot access the PowerPivot model from another Excel workbook.

PowerPivot data can only be consumed from within the same workbook :). I don't know the reason, but it is definitely not technical; all we can do is vote and hope Microsoft will listen.

But fortunately there are some workarounds:

1- Tableau: Tableau can read data from a PowerPivot data model, and it has some amazing visualisations (not free).

2- PowerUpdate: PowerUpdate can load the data directly from the PowerPivot model to a SQL Server and, bonus point, it can refresh any workbook automatically with no need for a manual refresh (not free).

3- Reverse linked tables: instead of using a pivot table to read the data from the PowerPivot model, we just use tables to get the same result; the Italians provide a detailed explanation here. To work around Excel's limit of 1 million rows, you can split the result across new tabs, and you can even create summarised data that feeds your different reports; the only inconvenience is that the size of the workbook grows with a lot of data.

Conclusion 

A simple option to deal with a lot of data is to load all the data sources into one master workbook, then write the results back into the same workbook using reverse linked tables.

Now you can use that Master workbook as your backend for all your reports.

Or use PowerBI Desktop as your backend; officially it is not supported, but Excel can connect perfectly well to a local instance of PowerBI Desktop.

 

 

 

Introduction to Modern Excel

Download the .pbix file, or click here to open it on the web.


Modern Excel is a term first coined to distinguish Excel 2010, which saw the introduction of PowerPivot and PowerQuery, from Excel 2007. Excel has become a first-class enterprise business intelligence solution: millions of rows can be loaded using PowerQuery, data can be linked using PowerPivot, interactive reports are easy to generate using pivot charts and cube formulas, and your audience will have more confidence in your reports.

Basically, Excel becomes a personal data warehouse that stores and analyses millions of records from multiple data sources.

A sample dashboard In Modern Excel

A "traditional" Excel dashboard has no real distinction between the data source and the report itself: everything is mixed together, with the results generated by a mixture of VLOOKUP and SUMPRODUCT, and maybe some VBA code. If the boss asks for a different report, you have to start from scratch; but most importantly, it is error prone: if you move one cell you may change the results without noticing, and there is no obvious way to audit the results.

The dashboard was first built in Excel using PowerPivot and PowerQuery, then imported into PowerBI Desktop and published to the web.

I could have simply distributed the Excel file with the reports, but I wanted to emphasise that this is not just another Excel spreadsheet with a zillion cryptic formulas that no one except the author can understand, but a proper database solution based on columns, not cells.

 

How it works

Anyone who has used MS Access will find Modern Excel very familiar, so let's look at the steps needed to generate the sample dashboard; this is only a very high-level description, not a detailed tutorial.

Data sources: most SQL servers, Excel, text files, the web, MS Access, basically any data source you can think of.

Data preparation: PowerQuery is a powerful ETL tool for cleansing data and preparing it to be loaded into the data model. For example, if you export the spread of hours from P6, the data will be spread horizontally; in a couple of clicks you can unpivot it into a proper table.

Data model: where the relationships between the tables are defined; here is a screenshot from the sample dashboard.

DataModel

DAX: the language used to query the data model and define your measures; it is not that different from SQL queries.

Reports and charts: pivot tables, pivot charts and cube formulas that read from the data model.

Here is an example of a pivot table that queries the data model; notice how all the tables are visible, whereas a classical pivot table shows only one table.

Pivot

PowerPivot vs PowerBI

It can be a little confusing when you hear all those terms, but basically the core technology in the Microsoft business intelligence stack is an in-memory analytical engine called VertiPaq; the same technology is present in different products intended to serve different market segments:

  • PowerBI: Microsoft's cloud offering for enterprise business intelligence, competing mainly with the likes of Tableau and Qlik; PowerBI comes in two editions, free and Pro.
  • SQL Server Analysis Services: the on-premises enterprise offering that comes with MS SQL Server.
  • PowerPivot: in Excel 2010 and later, for self-service business intelligence.

Wrap Up

Self-service business intelligence is changing how we deal with data; Microsoft has managed to bring a powerful technology to end users. Previously, a word like data warehouse implied a heavy IT investment; now, with PowerPivot and PowerQuery, you can do it yourself.