“Modern” BI tools are missing something like PowerBI Composable SQL Generator

If you are on twitter or LinkedIn, Probably you heard already about the emergence of the “Modern” Data Stack and the triumph of the Cloud Data warehouse , this is all good, I use BigQuery BI Engine and I am amazed at what can be done with a little Bit of SQL.

Before I used to have an obsession with semantic model, I always thought you can’t have a decent analytical workflow without a proper multi fact semantic Model or what’s called now “Metric Store” , but I started changing my mind, flat wide table with multi grains are definitely ugly but they work , and storage is cheap, the point is ; there is a workaround it is not a showstopper, I think the issue is Further downstream.

To explain my point I will use PowerBI , and compare it to Apache Superset as an example, but the same can be said about other “Modern” tools too.

Analysis of Simple Table

we will try to analyse only one table using Direct Query mode, the Table is loaded in an external Database, so no ETL , no OLAP, no data movement.

Then I created this simple Report

The main metric is Actual Hours, which is basically sum (Actual Hours) grouped by category, to my surprise, it was trivial to reproduce the first three metrics easily using Apache Superset

The SQL was not too complex

SQL Aggregate can not simply ignore filters

The Problem started with the Metric “sum_all_ignore_all_filter”, which basically sum the total hours and ignore any filters or grouping, In PowerBI, we use this syntax to get this behavior

sum_all_ignore_all_filter = CALCULATE(sum(actual[actualhours]),REMOVEFILTERS())

now let’s see the Query generated by Superset

As far as I know there is no way in SQL to add an aggregation and tell SQL to ignore the where clause , I presume you can create another Query and do a join, but I don’t think Superset permit this in the Metric definition, another solution is to use Parameters to pass the filter values, but then you will lose cross filter functionality.

Now you may wonder how PowerBI solved this Problem, how it is possible by just writing a generic Metric, it will always works regardless of filters or which dimension is used in group by ?

I know it is pretentious statement, but in my View this is the Ethos of the Product, this the killer feature of PowerBI

The idea is very simple, PowerBI DAX Engine will just send multiple SQL statement and assemble the results , the implementation is very complex involving two engines ( formula and Storage Engine), complex Query plan, and even a patent which I think it is worth reading , but obviously this is all hidden from the user.

here is a look at how the PowerBI process the report

And the Actual SQL Queries generated

Take Away

Obviously this is only my own understanding from an analyst point of View, but as far as I can see , all I can read is SQL is better than proprietary DSL, but no one explain how to handle very complex calculation that ignore filters and level of details using SQL and keeping the flexibility of Metrics, where they are just supposed to work.

The Previous example was just simple use case with one table, now imagine a complex model with multiple tables and Non additive Metrics.

I am not suggesting PowerBI is the only software that can do that, all what I am saying “Modern” consumption layer tools need to have more sophisticated SQL Generators.

First look at Apache Superset from a PowerBI user perspective

Apache Superset is a very well known open source BI tool, I knew about it for the last couple of years, but was not thrilled by the idea of running a docker image on my laptop, recently though a new company Preset created by original author of the software start offering a cloud hosted service, basically they maintain the service and the consumer just use the software.

The pricing is very competitive, a free starter Plan with up to 5 users, then a pro plan for 20 $/Month/user ( the same as PowerBI PPU) and an Enterprise plan with custom pricing, kudos for the vendor for providing a free limited plan for users to learn and test.

The selling Point of Preset is very simple , it is open source software, so there is zero risk of vendor lock-in, if you are not happy with their service, you can always move your assets to another Provider or hosted yourself on-premise or in any cloud provider, this arguments is very attractive and make sense for a lot of customers , and companies like Red Hat have a good success using this business Model.

User interface

The user interface is simple and predictable which is a good thing, if you have used PowerBI or Data Studio you will feel at home.

Generally speaking there are two type of BI tools

  • Integrated : The BI tools provide, ETL, Storage, Modeling and the viz layer , as an example : PowerBI, Tableau, Qlik , Data Studio
  • Light : The BI tool scope is only Modeling and the viz layer, they need an External Database to host the data and run the SQL Query, Example Looker, Bipp etc ( I reference only the tool I personally used)

I guess the big assumption here, is that you have already a mature Data infrastructure which can be Queried from the same Data Base , Superset like Looker can not Query two DB in the same Dataset, The sweet spot for Superset is when all your data is hosted in a Cloud Data warehouse

Connecting to the Database

As usual I used BigQuery as my Data Source, the connection was pretty trivial to setup, although I could not figure out how to activate CSV upload

Building a Dataset

Dataset is the Data model it is the same concept as PowerBI with dimensions and Metrics (Measure), but it does not support relationships , you have to manually build a Query that join multiple tables or leverage a View in the Database , The metrics and dimension use SQL as calculation language.

One confusing aspect of the User interface; if you want to build a Dataset based on one table, you just click Data, Dataset then add Dataset, simple, but if you want to build it based on a SQL Query, you need to click on SQL Lab, then SQL Editor , you write the Query then save it as a dataset.

Example : Dataset based on A SQL Query and Parameter

I have a simple multi Fact semantic model in PowerBI, and usually used it as test case, 3 Facts with different grains and 3 dimension

I add this SQL Query to generate the Dataset, Parameter is used to pass date as a filter, yes I am biased, I used PowerBI for so long, that writing SQL Query to return Dataset seems strange.

{% set param_Date = filter_values('day')[0]  %}
SELECT category, commodity,  actualhours ,  budgethours ,  forecast
FROM (
  SELECT category, commodity , SUM(actualhours) AS actualhours
  FROM `testing-bi-engine.starschema.actual`
   WHERE date <= date(TIMESTAMP_MILLIS({{cache_key_wrapper(param_Date)}}))
  GROUP BY category, commodity
) t2
FULL OUTER JOIN (
  SELECT category, commodity , SUM(budgethours) AS budgethours
  FROM `testing-bi-engine.starschema.budget`
  GROUP BY category, commodity
) t3 USING(category, commodity)
FULL OUTER JOIN (
  SELECT category, commodity , SUM(forecast) AS forecast
  FROM `testing-bi-engine.starschema.baseline`
  WHERE date <= date(TIMESTAMP_MILLIS({{cache_key_wrapper(param_Date)}}))
  GROUP BY category, commodity
) t4 USING(category, commodity)

The metrics

The Columns

I am using BigQuery BI engine, one fact is 43 Million rows and another is 1 Million, and although Preset Cloud is hosted in AWS even with network transfer the experience is very smooth, as of this writing Publish to web is not supported yet, so all I can show is this GIF

Chart library

Apache Superset has a very decent visual library, the first I checked is Pivot table and it is not bad at all , my understanding cross filtering will be supported only on E Charts Viz

Take Away

Superset is a mature open source data consumption layer with an enterprise support provided by Preset Cloud, if you have a mature data infrastructure and you know SQL very well, then it is worth a shot, I suspect it will be very popular with tech savvy companies and startups.

Edit : added a new post about Superset SQL Metrics

Share Data from a PowerBI Dataset using Dataflow

Using PowerBI Dataflow to share data between departments is increasingly getting more popular, with the rise of managed self service offering, it is not unusual to find a non centralised Data warehouse but rather every department managing it is own data, I don’t know enough to say anything useful on which model works better in the long term.

this blog is about one use case I encountered recently, and thought it maybe worth sharing and to be very clear it is always good practise to share the data from the upstream Data source , but in this case, there was a non trivial business logic applied inside PowerBI dataset.

Usually we have two options.

  • A user can just create a new thin report assuming he has build permission
  • With Direct Query on composite model, a user can mix between live connection to a dataset and import different sources.

Both approach are fine, but in this case, the data consumer from another departement is just interested in 1 table (a couple of dimension and some measures) and because there is a non trivial logic in the Model, we can’t get the data from the source system, and it is rather important not to provide Build permission.

In my humble opinion sometimes giving access to a whole Data model with a lot of tables, relationship etc does seems an overkill and potentially confusing , instead I am suggesting an easier approach, sharing a dataflow that run a DAX Query against the model, and return exactly the agreed schema,basically a flat table which will not change even if the structure of Model change.

it is very straightforward and works with both Pro and premium ( Premium works better with large models, the timeout duration is more generous, and anyway PPU is very cheap)

Build a visual that show the required data

I will use a random dataset as an example, let’s say the data consumer wants this data updated daily ( real life example is substantially more complex)

click on performance analyzer , refresh visual then copy the Query

Clean the Query in DAX Studio

Paste the Query in DAX studio, don’t get intimidated by the apparent complexity , The main Query start by VAR __DS0Core

Now in a simplified format

// DAX Query

  EVALUATE 

    SUMMARIZECOLUMNS(
      'MstDate'[SETTLEMENTDATE],
      'REGION'[REGION],
      TREATAS({"QLD"}, 'REGION'[REGION]),
      TREATAS({"5 Minutes"}, 'unit'[description]),
      "average_Price", '00-Measures'[average_Price],
      "INITIALMW_Sum", '00-Measures'[INITIALMW_Sum]
    )

Alternatively you can build The Query straight in DAX Studio using the Query Builder

Connect to PowerBI service endpoint using PowerBI Desktop

you can read the documentation here, basically you connect to SQL Server Analysis Services which is the Engine of PowerBI,

Note : I am using my personal tenant.

Copy PowerQuery code to Dataflow

Dataflow don’t have a connector to Analysis Services at least in the UI, but coping the Query just works ( notice the double Quotes in QLD)

let
    Query1 = AnalysisServices.Database("powerbi://api.powerbi.com/v1.0/myorg/aemo_data", "aemo_data",
     [Query=
    "
    // DAX Query

  EVALUATE 

    SUMMARIZECOLUMNS(
      'MstDate'[SETTLEMENTDATE],
      'REGION'[REGION],
      TREATAS({""QLD""}, 'REGION'[REGION]),
      TREATAS({""5 Minutes""}, 'unit'[description]),
      ""average_Price"", '00-Measures'[average_Price],
      ""INITIALMW_Sum"", '00-Measures'[INITIALMW_Sum]
    )
    "
    ,Implementation="2.0"])
in
    Query1

I am not suggesting it is a good practise, in an ideal world everything is in a nice Cloud DWH with business logic and all, and the BI tool is just a front end to run aggregation and show nice charts, but real life is not perfect and it is useful to have the option when you need it.

Edit : The original title of this blog was using Dataflow as Poor Man’s Data mesh, but I had to be realistic as of this writing Dataflow can only be consumed by a PowerBI Client, not even Excel is supported and I know about Azure Storage, but with the current paranoia about security, it is extremely hard to have that option on for a lot of companies , Hopefully Dataflow product team open up a little bit for other Client too.

Calculate route distance, Tableau vs PowerBI

My previous  blog turn out to be very popular but unfortunately it is still not good enough, the Visual Interaction was just a hack by combining two Maps together with a big compromise the Zoom does not work and that’s very annoying ( Icon map is amazing, but still has to operate inside PowerBI framework)

The Second issue PowerBI is very chatty and insteading of sending 1 SQL Query by selection, it sends two SQL Queries, so selecting two points start and finish will generate 4 SQL Queries, usually I don’t care much, but in this case, it is relatively heavy calculation, and 1 extra Query can change the experience from this is a nice demo to I can do that at Work.

This behavior is hard to fix because it is basically how PowerBI was designed in the first place, the assumption was; you create a visual and the values change based on external filter selection or cross filtering from other visual, but there are a class of User interaction where the visual itself is the filter , a typical scenario is updating a visual title based on rows selected.

Instead of theoretical talks let’s rebuild the same visual in Tableau using parameter actions

Define a Parameter

Define SQL Query

WITH
      initial_parameter AS (
      SELECT
        *
      FROM
        UNNEST(split(<Parameters.TagsList> , ', ')) AS element ),
      mynetwork AS (
      SELECT
        ARRAY_AGG(geometry) roads   FROM   `test-187010.gis_free.brisbane_Road_Network_cluster` ),
      calculation AS(
      SELECT
        `libjs4us.routing.geojson_path_finder`(roads, st_geogfromtext(a.element), st_geogfromtext(b.element)) AS tt
      FROM
        mynetwork,    initial_parameter a,    initial_parameter b
      WHERE
        a.element>b.element
      LIMIT
        100),
        result as (
    SELECT
       1 as dummy,tt.weight    AS len,tt.path AS GEO
    FROM  calculation),
    points_extract as (
    SELECT ST_PointN(geo, idx) mypoint,len, dummy,idx
  FROM result,
  unnest(GENERATE_ARRAY(1,ST_NUMPOINTS(geo))) idx  )
  select len,  st_y(mypoint) as lat, st_x(mypoint) as lng, cast (idx as string) idx ,1 as type from points_extract
  union all
  select null as len,st_y(center_location) as lat, st_x(center_location) as lng, st_astext(center_location) as idx, 2 as type  from test-187010.gis_free.Brisbane_Bus_Stop

Define Parameter Action

in Tableau you can define exactly what happen when you select point either it will filter or update a parameter in this case we want Tableau to add the value selected to the previous selection , as for calculating distance you need Start and finish point, PowerBI by default will do that using icon map, so far so good both Product have the same behavior

Auto Update Behavior

Unlike PowerBI, Tableau auto update the Map if the data change, and because I am using only 1 visual, the zoom works and the experience is more pleasant, Tableau send only 1 SQL Query by selection.

See here another example, when you click on a suburb, the Map Auto update and change colors relative to the suburb selected, very simple yet very powerful

And please I have zero interest in this useless debate who is a better BI tool, I just want PowerBI to implement a similar behavior or giving us an alternative, Maybe make a Visual accept multiple datasets.

As a user all I can do is to show my displeasure in public and hope other users will vote