Analyzing GIS data using BigQuery and PowerBI

TL;DR: world data here, pbix file (Publish to web has a limit of 1 GB, so only points are used)

Australia report with polygons, pbix file

Australia report using Data Studio Google Maps

Edit (14 April 2020): updated the report to load all the amenity tags in the world. I am using this formula to dynamically calculate the distance between two points.
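The linked formula is not reproduced in this post. As a rough sketch of one common way to get the distance between two longitude/latitude points, here is the haversine formula in Python. The function name, the mean Earth radius of 6371 km and the second sample point are assumptions for illustration; the report itself may well use a different formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Reference point used in the query in this post (lon, lat).
center = (153.020749, -27.467539)
# Hypothetical second point, one degree of latitude further south.
other = (153.020749, -28.467539)
print(round(haversine_km(*center, *other), 1))
```

A spherical formula like this is usually good enough for filtering by a radius of a few km on a map; it is not a replacement for a proper geodesic library.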

Due to the COVID-19 pandemic, Google has made some public datasets free to query. One of them is OpenStreetMap, and I thought it was an excellent opportunity to play with BigQuery GIS functions.

Using the existing documentation, I came up with this query, which returns all the geometries within a radius of 100 km of an arbitrary point (for some reason I chose the Microsoft office building in Brisbane as a reference) that have an amenity tag.

-- Reference point (longitude, latitude) and search radius in metres
WITH
  params AS (
    SELECT
      ST_GEOGPOINT(153.020749, -27.467539) AS center,
      100000 AS maxdist_m )
SELECT
  ar.key,
  ar.value,
  feature_type,
  osm_id,
  osm_way_id,
  geometry,
  ST_CENTROID(geometry) AS center_location,
  -- distance from the reference point, in km
  ST_DISTANCE(ST_CENTROID(geometry), params.center) / 1000 AS distance
FROM
  `bigquery-public-data.geo_openstreetmap.planet_features`,
  params,
  UNNEST(all_tags) AS ar  -- one output row per tag (key, value)
WHERE
  -- keep only features that carry an amenity tag
  'amenity' IN (SELECT key FROM UNNEST(all_tags))
  AND ST_DWITHIN(ST_CENTROID(geometry), params.center, params.maxdist_m)

The query returns:

WARNING

The query processed 245 GB in 16 seconds!!! It cost $0, at least until 14 Sept 2020; after that it will incur a cost ($5 per TB).
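To put that price in perspective, here is a back-of-the-envelope sketch in Python. It assumes the $5 per TB on-demand price quoted above and 1 TB = 1024 GB; the function name and the unit convention are my assumptions, so check the current pricing page before relying on the number.

```python
PRICE_USD_PER_TB = 5.0  # on-demand price quoted in the post


def query_cost_usd(gb_processed, gb_per_tb=1024):
    """Estimated on-demand cost of a query that scans `gb_processed` GB."""
    return gb_processed / gb_per_tb * PRICE_USD_PER_TB


# The query above scanned 245 GB.
print(f"${query_cost_usd(245):.2f}")  # prints $1.20
```

So every refresh of a query like this one would cost a bit more than a dollar once the free period ends, which is worth keeping in mind before scheduling it.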

You can explore the result using the built-in GeoViz, but you can’t share the data.

PowerBI does not support custom queries when connecting to BigQuery, so I had to save the query results in a view; after that, the connection to PowerBI is straightforward.

The query results are returned as key/value pairs.

Using PowerQuery pivot, it is trivial to denormalize the table (I could not find how to do that in SQL); anyway, the results look much easier to analyze.
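For readers who want to see what the pivot does outside PowerQuery, here is a minimal Python sketch of the same idea: key/value rows become one record per feature. The sample rows and the function name are hypothetical, and this is not the PowerQuery step used in the report.

```python
from collections import defaultdict

# Hypothetical sample rows in the (osm_id, key, value) shape the query returns.
rows = [
    (1, "amenity", "cafe"),
    (1, "name", "Example Cafe"),
    (2, "amenity", "school"),
]


def pivot_tags(rows):
    """Turn key/value rows into one dict of tags per feature, keyed by osm_id."""
    features = defaultdict(dict)
    for osm_id, key, value in rows:
        features[osm_id][key] = value
    return dict(features)


wide = pivot_tags(rows)
print(wide[1])  # all the tags of feature 1 in a single record
```

Features that lack a given key simply do not have it in their dict, which corresponds to the null cells you get after pivoting in PowerQuery.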

By the way, be careful: PowerBI supports a maximum of 32766 characters per text value. There is an easy workaround, though: split the column every 32766 characters and then concatenate the pieces in a calculated column. Yes, it will increase the memory size, but it works.
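The split-then-concatenate idea is easy to check in a few lines of Python. This is only a sketch of the principle, assuming the 32766-character limit quoted above; the function name and the sample string are made up, and in the report the split is done in PowerQuery and the concatenation in a calculated column.

```python
MAX_TEXT_LEN = 32766  # per-value text limit quoted in the post


def split_text(text, width=MAX_TEXT_LEN):
    """Split a long string into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# Hypothetical long WKT-like string, well over the limit.
long_text = "POLYGON((" + ",".join(f"{i} {i}" for i in range(10000)) + "))"
pieces = split_text(long_text)

assert all(len(p) <= MAX_TEXT_LEN for p in pieces)
assert "".join(pieces) == long_text  # concatenating restores the original
```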

And here is the final result using the beta version of Icon Map, for example filtering all the data less than 4 km away. If you want a print-quality map you can always use an R visual; see example here.

The custom visual is still in beta: polygons and multipolygons render perfectly, points work but with a visual discrepancy, and I don’t think linestrings are supported at all.

Icon Map is a very versatile visual. I hope the author will release an official update, fix the rendering bugs and add an option for color per category.

BigQuery GIS is very powerful and easy to use, and the documentation is excellent. I only wish they would release a smaller public GIS dataset to play with.

How to reduce data volume in PowerBI Maps by using WKT

In a previous blog, I showed how to load raster tiles into the PowerBI data model. In theory, that should have solved all my issues with making detailed maps in PowerBI.

Unfortunately, no. Even if R and Python visuals support up to 150K points, the reality is that the implementation of R in the PowerBI service has a massive overhead and you can’t do anything about it, as it is literally a black box. All you can do is try to reduce the data passed to the R visual and hope it works.

Actually, in my case, the visual did not even show up and I got an error message saying that resources were exceeded.

I am in a situation where I can’t filter the data, because the whole point of the visual is to show all of it. At the same time, if the visual does not work in the service, there is no point in the whole exercise.

The trick is to use WKT: I simplify the geometry without losing any visual data. For example:

Instead of showing all the points, I group the points that share the same order and colour into a line. As you can see, the 14 rows of data are reduced to 5 rows, and the visual representation is the same. It is like sampling, but we keep the exact shape of the data.

Now in PowerBI, all we need to do is automatically group those points together. It turned out the solution was very easy using RANKX. Keep in mind the WKT is dynamic: with every update, I get a new geometry.

After that I just added some calculated columns to create the WKT format:

For a point, POINT (X Y)

For a line, LINESTRING (start_X start_Y, finish_X finish_Y)
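To make the grouping concrete, here is a small Python sketch of the idea, not the DAX used in the report. It assumes the points are already sorted in drawing order, that each point carries a group key (colour in this made-up sample), and that the points inside one run lie on a straight line, so only the first and last point are needed. All names and data are hypothetical.

```python
from itertools import groupby


def to_wkt(points):
    """Collapse runs of consecutive points sharing a group key into WKT.

    `points` is an ordered list of (group, x, y). A run with a single point
    becomes a POINT; a longer run becomes a LINESTRING from its first point
    to its last point (assumes the run is a straight line).
    """
    out = []
    for group, run in groupby(points, key=lambda p: p[0]):
        run = list(run)
        (_, x0, y0), (_, x1, y1) = run[0], run[-1]
        if len(run) == 1:
            out.append((group, f"POINT ({x0} {y0})"))
        else:
            out.append((group, f"LINESTRING ({x0} {y0}, {x1} {y1})"))
    return out


# Hypothetical data: (colour, x, y), already sorted in drawing order.
points = [
    ("red", 0, 0), ("red", 1, 0), ("red", 2, 0),
    ("blue", 3, 0),
    ("red", 4, 0), ("red", 5, 0),
]
for group, wkt in to_wkt(points):
    print(group, wkt)
```

Six input rows become three WKT rows here, and the drawn shape is unchanged, which is the whole point of the trick.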

Keep in mind you can create polygons too, but the DAX becomes more complex (maybe for another blog).

You can create the WKT file in QGIS very easily, but as my data changes daily, it was not practical.

And here is the final result

The number of rows was reduced from 3528 to 218.

That makes a massive difference in the PowerBI service. My real data is 58K rows, and I can’t tell you how happy I was when it finally worked in the service. Not only that, but the total number of rows using WKT keeps decreasing as I do more updates 🙂

There is a catch though: unfortunately, as of Dec 2019, only R and Python scripts can render WKT geometry. There is a new custom visual by @james dales, but it is in a private beta and has some limitations on colors by category. (Icon Map supports color per category now.)

You can download the pbix file here

I hope that in 2020 Microsoft invests more in improving the maps offering in PowerBI and optimizes R and Python scripts on the service. I am very optimistic.

With the new Icon Map my use case is fully solved 🙂

Load Raster tiles to PowerBI Data Model using R

In a previous blog, I showed how to use PowerBI to generate high-quality print maps, with the caveat that the R script does not work in the service unless all the packages are supported (the desktop uses your local R install, so there is no limitation there).

I am a huge fan of ceramic. For me it shows the best of the R philosophy: it does one thing and does it very well. You give it coordinates and it gives you back raster tiles.

I spent some time trying to figure out a workaround to make it work in the service, and I found this trick.

  • Generate the raster using ceramic outside PowerBI (using RStudio, for example).
  • Save the raster object using saveRDS with the option ascii = TRUE, so it is a text file. Note that you need to set version = 2, otherwise it will not work in the service.
  • Load the file into PowerBI using PowerQuery.
  • The maximum data you can pass to an R visual is 150K rows, which is not enough for my use case.
  • The trick is to group the data using concatenation: the limit is on the number of rows, not the number of columns :). Please note that the maximum length of a text value is 32766.
  • Merge the table for the raster with the table of data (coordinates, attributes, etc.). Unfortunately, you can pass only one dataframe to the R script. I changed to append so the visual can be sliced.
  • Now you have a dataframe with the coordinates and the raster data, which you pass to the R visual script.
  • Unnest the raster data into a dataframe. Note that the dataframe holding the raster data is 1 million rows.
  • Save the dataframe to “raster.rds”.
  • Load “raster.rds” and, R magic, the raster is alive.
  • Plot the map (it is on the PowerBI service, not the Desktop).
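The steps above are R-specific, but the underlying idea (serialize an object to plain text, pack the text into a few wide rows, then rebuild the object on the other side) can be sketched in Python. Here pickle plus base64 stands in for saveRDS with ascii = TRUE; that substitution, the number of columns per row, the function names and the fake raster are all assumptions for illustration and not the code used in the report.

```python
import base64
import pickle

MAX_TEXT_LEN = 32766  # maximum length of a text value, as noted above
COLS_PER_ROW = 4      # hypothetical number of text columns per row


def to_text_rows(obj, width=MAX_TEXT_LEN, cols=COLS_PER_ROW):
    """Serialize `obj` to ASCII text and pack it into rows of at most `cols` chunks."""
    text = base64.b64encode(pickle.dumps(obj)).decode("ascii")
    chunks = [text[i:i + width] for i in range(0, len(text), width)]
    return [chunks[i:i + cols] for i in range(0, len(chunks), cols)]


def from_text_rows(rows):
    """Concatenate all chunks in row/column order and rebuild the object."""
    text = "".join(chunk for row in rows for chunk in row)
    return pickle.loads(base64.b64decode(text))


# Hypothetical stand-in for a raster: a 500 x 500 grid of pixel values.
raster = [[(r * c) % 256 for c in range(500)] for r in range(500)]
rows = to_text_rows(raster)

assert from_text_rows(rows) == raster  # the object survives the round trip
print(len(rows), "table rows instead of", 500 * 500, "pixel rows")
```

Adding more columns per row shrinks the row count further, which is why the row limit stops being the bottleneck; memory on the service is a different story.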

You can filter the status and hide the tiles if you want. As it is slow to render in the service, please use the query reduction option in the filter.

This workflow works with any R object: not only rasters, but any binary data can be passed to the PowerBI data model.

All the code is saved here.

I think the main takeaway is that you can circumvent the PowerBI limitation of 150,000 rows when you pass data to R or Python. There is a catch, though: the resources available in a PowerBI Pro instance are limited and not documented, so your mileage may vary, but it is worth a try.

Now, you may ask, why bother with this tedious, slow visual? The answer is very easy: in some cases you want to control the exact look of a map, and R gives you just that. You can show multiple layers and text, and it supports more than 30K points in a map. It is worth the pain.

Edit: I just noticed that PowerBI caches the visual output; if you make the exact same selection again, the visual shows instantly!!!

Normalize multiple progress files using PowerQuery

A typical situation in the construction industry: the progress data is sourced from multiple systems with different formats. Generally we get two types of reports.

  • Time stamp items

My preferred one. The data is tracked at a very low level (cable, spool, pre-commissioning and commissioning tracking), and you get a date when the item is completed, something like this:

This format is very convenient, as you need to maintain only one file and the history is recorded in the data itself. Unfortunately, this kind of report is not always available, for multiple reasons. The main one is that in some kinds of work, finishing one item takes a long time; for example, completing one drawing will take 3 weeks, and no manager will wait 3 weeks to claim progress.

  • Cumulative Progress

This format reports the cumulative progress at a point in time (daily, weekly, or whenever there is progress), something like this:

This format is very common; it is very easy for the supervisor to update, and it works with any level of detail.

The challenges of this format are:

  1. To get the historical data you need to keep all the previous files.
  2. As it is cumulative data, calculating the progress per time period is a bit harder, and getting something like year to date is very awkward.

We need to normalize those files to be in the same format. One approach I use with PowerQuery is:

  • Load the cumulative files.
  • Reverse the cumulative total using a self-join (each cumulative value minus the value of the previous period).
  • Keep only the rows where there is progress.
  • Append to the time stamp file.
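The same steps can be sketched in Python to show the logic outside PowerQuery. In this sketch a running lookup of the previous value plays the role of the self-join; the function name, the column layout (date, item, quantity) and the sample data are all hypothetical, and the actual M code lives in the pbix file.

```python
def cumulative_to_increments(snapshots):
    """Convert cumulative snapshots into per-period progress rows.

    `snapshots` is a list of (date, item, cumulative_qty). Each row is compared
    with the previous snapshot of the same item (the job the self-join does),
    and rows with no progress are dropped.
    """
    previous = {}
    rows = []
    for date, item, cumulative in sorted(snapshots):
        delta = cumulative - previous.get(item, 0)
        previous[item] = cumulative
        if delta != 0:
            rows.append((date, item, delta))
    return rows


# Hypothetical weekly cumulative reports (ISO dates sort correctly as strings).
cumulative = [
    ("2019-01-07", "cable", 10),
    ("2019-01-14", "cable", 10),  # no progress this week
    ("2019-01-21", "cable", 25),
]
# Hypothetical time-stamped records, already one row per completed quantity.
timestamped = [("2019-01-09", "spool", 1)]

# Append both sources into one normalized table of actuals.
actuals = sorted(timestamped + cumulative_to_increments(cumulative))
print(actuals)
```

Once both sources are in this per-period shape, quantity per period is a plain sum and year to date is a running total over the same table.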

Now we have a normalized Actual table, where quantity per period, year to date and all date calculations are very easy to calculate.

The pbix files and the source data are saved here.

I know it is tempting to just load the data and start making visuals and writing complex DAX calculations, but it is not sustainable and it will make your life miserable. A simple data model will make further development much easier.