Analyse Big Spatial Dataset using hex bin maps in PowerBI

Come across this blog post in Twitter and I thought this approach can be very useful in PowerBI to deal with Big GIS dataset.

The idea is simple, let’s say you have a big spatial dataset with million of coordinates, so instead of plotting all the individual points which is unfeasible for performance reason, you simply do this,

– First generate a grid for the area you want to analyse using something like QGIS.

– Then count how my point belong to every cell in the grid and color the results accordingly, you need a spatial Join for that, either using QGIS , SQL Server etc

– Drill down to individual points using WKT multi point

Publish to web here and pbix

Generate a Grid System

After a lot of googling as I did not even know what to look for, turn out QGIS can very easily generate a Grid, here is a good tutorial I found

and here is a results for Chicago area as I am using it as an example for this blog, Please notice, the grid should have less than 30K polygons which is the maximum number of rows that PowerBI can plot.

Spatial Join

I could have used QGIS to do the spatial join, but my data is from BigQuery public dataset (bigquery-public-data.chicago_crime.crime) and for convenience, it is easier to do the join there, the previous link show how to do the join in QGIS.

After loading the grid geometry to BigQuery, I am using this Query to join the two dataset, Grid and my current example, Crime data in Chicago

  crime AS (
      latitude) AS geo,
  XXXX.GIS.Grid AS Grid

The Query is checking every coordinates and looking up which polygons it belong to, so the Grid table is something like this

And the Query results is like this, we go the ID from the Grid Dataset, Any DB that support geometry can be used SQL Server, PostgreSQL etc


Let’s import those two table to PowerBI, the two tables are joined by the field ID

Luckily PowerBI support WKT format using Icon Map, show the grid with the count of number of coordinates is straightforward

Ok, it is great we have an aggregation of 4 Millions points and we can see the areas with the highest Number of recorded crime, you can filter by FBI code which classify the crime gravity (I have zero knowledge about this dataset, I just used as an example)

Drill Down to individual Cell

because we have the individual points, all I need to do is to generate Multi point geometry using DAX , that’s why I love Icon Map so flexible , The measure will simply detect if the cell ID is filtered or not and then switch between Polygons and Multi point

Geometry =
VAR polygon =
    MIN ( Grid_View[polygon] )
VAR concat =
    CONCATENATEX ( VALUES ( crime[geo] ), crime[geo], "," )
VAR Multi_point = "MULTIPOINT (" & concat & ")"
    IF ( ISFILTERED ( dummy_id[id] ), Multi_point, polygon )

Notice Here, as a PowerBI Visual can not filter itseelf, I had to add a dummy dimension

and here is the final result

Take Away

PowerBI Icon map and QGIS is a very powerful combination, and the fact that DAX can generate a geometry on the fly is very interesting Pattern.

Model Header/Details Schema using Nested Data in BigQuery

I was watching a Video by Alberto regarding the best way to model Header/Details Schema and how Star Schema is always the best option at least in PowerBI , even if you have never used PowerBI I think it is a great video to watch.

An interesting information I learned from the Video, although PowerBI columnar database is extremely fast in joins, still when the join column has a massive number of distinct values, performance will degrade, in his case, one of the fact has 1.4 B rows and 94 million distinct values on the join !!!, the solution was basically to flatten the two fact Tables.

ok, what this has to do with BigQuery !! for some reason, an idea occurred to me, this scenario will be a perfect scenario for Nested Data, but as I never used nested data before, in this blog, I am just messing around and see how it goes !!!

The Dataset

As usual, I will use the Australia electricity market as an example, it has 72 Million records with a size of 10 GB uncompressed, here is a sample f the Data, the dataset is public using this adreess


As you can see the same attribute is repeated for every 5 minutes, what change is the date and the production in Mw

let’s see how it looks as a nested format

in a pure denormalized form they are two tables in reality, the dimension which contains the generator attribute and the production every 5 minutes

How to Nest data in BigQuery

I literally copied from this blog by Lak, just go and read it, I will just show the SQL Query here

drop table if exists XXXXX.nested.UNITARCHIVE_nested;
  cast(SUM(mwh) as numeric) AS Mwh,
    time ASC) AS obs
  test-187010.ReportingDataset.UNITARCHIVE tt
  left join (select DUID,row_number() over() as DUID_ID
  group by 1 ) as xx
  on tt.DUID=XX.duid

Compare Performance

For the same level of details, the size went from 10.1 GB to 1.13 GB !!! the cost of storage is rather insignificant, but if you use Query on demand, you pay by size scanned, the less the better.

Let’s check the total sum of Mwh , the same Query went from 1.1 GB to 5.1 MB !!! that nearly 100 X Cheaper ( in BigQuery you pay a minimum of 10 MB)

let’s check the 5 top days by production, same went from 1.6 GB to 7.7 MB that’s 160 X Cheaper , and faster too from 0.9 Second to 0.6 Second, it is funny both return subsecond results and it is not cached 🙂

What’s the catch !!!

There is a catch though, when you Query inside the Nested Data, the performance is nearly the same, went from 1.6 GB to 1.1 GB still it is a saving of 0.5 GB, but what surprised me, unnesting is a bit faster than reading from flat data, I was expecting the opposite

Visualise the Results in your Favorite BI Tools

I am using PowerBI to show the final results, using Direct Query Mode, the Dimensions are Imported to improve the slicer performance, Obviously it works with any BI tools that support SQL.

Looker has a nice support for nested data, unfortunately there is no free version to test it.

As of this Writing BI Engine does not support Unnest Data,Hopefully they fix the issue very soon., Nice, it is supported under BI Engine SQL Interface !!!!

And a sample report using nested data (5 minute Dimension)

Take Away

let’s say you have a Header and details tables, there are some options

  • Join at run time in the BI tools, can be slow for high cardinality joins
  • Merge the two Fact at the ETL Stage, very fast but potentially generate multiple grains and you need to using distinct count to remove duplicates
  • Nested Data seems to be very elegant solution, no joins and you keep the granularity and it is cheaper to Query

“Modern” BI tools are missing something like PowerBI Composable SQL Generator

If you are on twitter or LinkedIn, Probably you heard already about the emergence of the “Modern” Data Stack and the triumph of the Cloud Data warehouse , this is all good, I use BigQuery BI Engine and I am amazed at what can be done with a little Bit of SQL.

Before I used to have an obsession with semantic model, I always thought you can’t have a decent analytical workflow without a proper multi fact semantic Model or what’s called now “Metric Store” , but I started changing my mind, flat wide table with multi grains are definitely ugly but they work , and storage is cheap, the point is ; there is a workaround it is not a showstopper, I think the issue is Further downstream.

To explain my point I will use PowerBI , and compare it to Apache Superset as an example, but the same can be said about other “Modern” tools too.

Analysis of Simple Table

we will try to analyse only one table using Direct Query mode, the Table is loaded in an external Database, so no ETL , no OLAP, no data movement.

Then I created this simple Report

The main metric is Actual Hours, which is basically sum (Actual Hours) grouped by category, to my surprise, it was trivial to reproduce the first three metrics easily using Apache Superset

The SQL was not too complex

SQL Aggregate can not simply ignore filters

The Problem started with the Metric “sum_all_ignore_all_filter”, which basically sum the total hours and ignore any filters or grouping, In PowerBI, we use this syntax to get this behavior

sum_all_ignore_all_filter = CALCULATE(sum(actual[actualhours]),REMOVEFILTERS())

now let’s see the Query generated by Superset

As far as I know there is no way in SQL to add an aggregation and tell SQL to ignore the where clause , I presume you can create another Query and do a join, but I don’t think Superset permit this in the Metric definition, another solution is to use Parameters to pass the filter values, but then you will lose cross filter functionality.

Now you may wonder how PowerBI solved this Problem, how it is possible by just writing a generic Metric, it will always works regardless of filters or which dimension is used in group by ?

I know it is pretentious statement, but in my View this is the Ethos of the Product, this the killer feature of PowerBI

The idea is very simple, PowerBI DAX Engine will just send multiple SQL statement and assemble the results , the implementation is very complex involving two engines ( formula and Storage Engine), complex Query plan, and even a patent which I think it is worth reading , but obviously this is all hidden from the user.

here is a look at how the PowerBI process the report

And the Actual SQL Queries generated

Take Away

Obviously this is only my own understanding from an analyst point of View, but as far as I can see , all I can read is SQL is better than proprietary DSL, but no one explain how to handle very complex calculation that ignore filters and level of details using SQL and keeping the flexibility of Metrics, where they are just supposed to work.

The Previous example was just simple use case with one table, now imagine a complex model with multiple tables and Non additive Metrics.

I am not suggesting PowerBI is the only software that can do that, all what I am saying “Modern” consumption layer tools need to have more sophisticated SQL Generators.

Share Data from a PowerBI Dataset using Dataflow

Using PowerBI Dataflow to share data between departments is increasingly getting more popular, with the rise of managed self service offering, it is not unusual to find a non centralised Data warehouse but rather every department managing it is own data, I don’t know enough to say anything useful on which model works better in the long term.

this blog is about one use case I encountered recently, and thought it maybe worth sharing and to be very clear it is always good practise to share the data from the upstream Data source , but in this case, there was a non trivial business logic applied inside PowerBI dataset.

Usually we have two options.

  • A user can just create a new thin report assuming he has build permission
  • With Direct Query on composite model, a user can mix between live connection to a dataset and import different sources.

Both approach are fine, but in this case, the data consumer from another departement is just interested in 1 table (a couple of dimension and some measures) and because there is a non trivial logic in the Model, we can’t get the data from the source system, and it is rather important not to provide Build permission.

In my humble opinion sometimes giving access to a whole Data model with a lot of tables, relationship etc does seems an overkill and potentially confusing , instead I am suggesting an easier approach, sharing a dataflow that run a DAX Query against the model, and return exactly the agreed schema,basically a flat table which will not change even if the structure of Model change.

it is very straightforward and works with both Pro and premium ( Premium works better with large models, the timeout duration is more generous, and anyway PPU is very cheap)

Build a visual that show the required data

I will use a random dataset as an example, let’s say the data consumer wants this data updated daily ( real life example is substantially more complex)

click on performance analyzer , refresh visual then copy the Query

Clean the Query in DAX Studio

Paste the Query in DAX studio, don’t get intimidated by the apparent complexity , The main Query start by VAR __DS0Core

Now in a simplified format

// DAX Query


      TREATAS({"5 Minutes"}, 'unit'[description]),
      "average_Price", '00-Measures'[average_Price],
      "INITIALMW_Sum", '00-Measures'[INITIALMW_Sum]

Alternatively you can build The Query straight in DAX Studio using the Query Builder

Connect to PowerBI service endpoint using PowerBI Desktop

you can read the documentation here, basically you connect to SQL Server Analysis Services which is the Engine of PowerBI,

Note : I am using my personal tenant.

Copy PowerQuery code to Dataflow

Dataflow don’t have a connector to Analysis Services at least in the UI, but coping the Query just works ( notice the double Quotes in QLD)

    Query1 = AnalysisServices.Database("powerbi://", "aemo_data",
    // DAX Query


      TREATAS({""5 Minutes""}, 'unit'[description]),
      ""average_Price"", '00-Measures'[average_Price],
      ""INITIALMW_Sum"", '00-Measures'[INITIALMW_Sum]

I am not suggesting it is a good practise, in an ideal world everything is in a nice Cloud DWH with business logic and all, and the BI tool is just a front end to run aggregation and show nice charts, but real life is not perfect and it is useful to have the option when you need it.

Edit : The original title of this blog was using Dataflow as Poor Man’s Data mesh, but I had to be realistic as of this writing Dataflow can only be consumed by a PowerBI Client, not even Excel is supported and I know about Azure Storage, but with the current paranoia about security, it is extremely hard to have that option on for a lot of companies , Hopefully Dataflow product team open up a little bit for other Client too.