Aggregate 100 GB using PowerQuery and DuckDB

It is perhaps not widely known, but you can run a Python script inside PowerQuery in PowerBI Desktop, and deploy it to the PowerBI service using a personal gateway.

This is just a POC that shows you can do a lot with nothing more than a decent laptop and Python.

Let's say you have a folder with 100 GB of Parquet files and you want to aggregate some results for further analysis in PowerBI. Obviously, importing 100 GB may not be a great idea; in this scenario, from what I can see, people recommend tools like PySpark or Dask, which I am sure are great options, but if you want a lightweight alternative with a simple pip install, then DuckDB is amazing!

For testing, first write your SQL query against a single Parquet file; when you are happy with the results, change the path from "folder/x.parquet" to "folder/*.parquet".

In this example I have a folder of 100 GB of Parquet files.

In PowerQuery you just need to insert a Python script; you don't need to know anything about Python, as it is just a standard SQL query. PowerQuery is smart enough to recognize that df is a dataframe and will show the result in the next step.

import duckdb

# In-memory DuckDB connection; no database file is needed
con = duckdb.connect()

# While testing, point the query at a single file, then switch to the
# '*.parquet' wildcard to scan the whole folder in parallel
df = con.execute('''
    SELECT L_RETURNFLAG, L_LINESTATUS, COUNT(*) AS row_count
    FROM 'C:/xxxxx/parquet/lineitem/*.parquet'
    GROUP BY 1, 2
''').df()
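
If you prefer to paste the step straight into the Advanced Editor rather than using the Run Python script dialog, the generated M wraps the script in Python.Execute and then drills into the df dataframe. A rough sketch (the script string is simply whatever you typed in the dialog, with #(lf) standing for line breaks):

let
    // Power Query wraps the Python script in Python.Execute
    Source = Python.Execute("import duckdb#(lf)con = duckdb.connect()#(lf)df = con.execute(""select L_RETURNFLAG, L_LINESTATUS, count(*) from 'C:/xxxxx/parquet/lineitem/*.parquet' group by 1,2"").df()"),
    // the result is a table of dataframes; pick the one named df
    df = Source{[Name = "df"]}[Value]
in
    df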

And here are the final results.

The duration will depend mainly on your CPU and SSD speed. DuckDB consumes very little memory, less than 2 GB in this example; the secret sauce is the parallel scan of the Parquet files.

Traditionally, Big Data meant data that doesn't fit into RAM; nowadays, Big Data is how much data can fit on your SSD, and that's a welcome change.

Loading 1 Billion Rows of the New York Taxi Dataset into Datamart

I was chatting with David Eldersveld and he mentioned that he wanted to run a competition using the famous New York Taxi dataset with Datamart. Long story short, I ended up publishing my attempt before he had time to start the competition; my sincere apologies.

The report uses my personal PPU instance, and the data is located here. Personally, I wanted to use only the tools available in PowerBI out of the box: no Synapse, no Azure stuff, no BigQuery, just pure self-service tools.

Initially I didn't really believe that PowerQuery could download such a big volume of data. My first setback: PowerQuery cannot read Parquet files from a public URL directly, but as usual Chris Webb has an excellent blog explaining the reason and offering a workaround.

I added the code here; it works with PowerQuery in Dataflow, Datamart and PowerBI Desktop, but unfortunately Excel is not supported yet. When prompted for authentication, use anonymous.

The only section of the code you should pay attention to is the one selecting the number of files to download; by default it is 2, but you can increase it.

Only while writing this blog did I notice that the files for 2022 use a slightly different URL; I will update the code later. Code updated.
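
For reference, the core of the workaround looks roughly like this: Parquet.Document cannot consume the streamed binary returned by Web.Contents, so the response is buffered first, and List.FirstN controls how many monthly files are downloaded. The base URL and file-name pattern below are placeholders, not the real location of the dataset:

let
    // how many monthly files to download, the parameter mentioned above (default 2)
    NumberOfFiles = 2,
    // hypothetical list of months; the real code builds the full list of available files
    Months = List.FirstN({"2021-01", "2021-02", "2021-03", "2021-04"}, NumberOfFiles),
    // Binary.Buffer works around Parquet.Document not reading streamed web binaries
    Tables = List.Transform(
        Months,
        each Parquet.Document(
            Binary.Buffer(Web.Contents("https://example.com/trip-data/yellow_tripdata_" & _ & ".parquet")))),
    Combined = Table.Combine(Tables)
in
    Combined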

To reduce the database size, I had to split the datetime column into separate date and time columns; low cardinality is good for performance too.
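
In PowerQuery the split looks something like the sketch below; the column name tpep_pickup_datetime is just an assumption about the taxi schema, and the tiny inline table only stands in for the real one:

let
    // tiny stand-in for the real trips table
    Trips = #table(
        type table [tpep_pickup_datetime = datetime, fare_amount = number],
        {{#datetime(2022, 1, 1, 8, 30, 0), 12.5}}),
    // split the datetime into a low-cardinality date column and a time column
    WithDate = Table.AddColumn(Trips, "pickup_date", each DateTime.Date([tpep_pickup_datetime]), type date),
    WithTime = Table.AddColumn(WithDate, "pickup_time", each DateTime.Time([tpep_pickup_datetime]), type time),
    // drop the original high-cardinality datetime column
    Result = Table.RemoveColumns(WithTime, {"tpep_pickup_datetime"})
in
    Result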

Loading into Datamart

I don't know exactly how long it took Datamart to load the data, as the query refresh history is currently broken, but I think it was more than 6 hours. I may be wrong, but Datamart seems to take a while to generate the tables with a clustered columnstore index.

Initially I loaded only 2 files, then 30, just to see how Datamart behaved, and finally I went for 100 files; to my surprise, it worked again.

And here is the lineage view of the report.

Performance

The performance is not bad at all, considering the data was loaded as-is and is not sorted. Although the Parquet files are organized by month, unfortunately there are some outliers in every file (see the example), so you get overlapping segments.

You can check the database size on disk by running this query:

EXEC sp_spaceused

Optimisation

Pretty much the only optimisation you can do in Datamart is to pre-sort the data before loading it, but when you have 1 billion rows saved in Parquet files, sorting is a very expensive operation. There are a couple of options, though, I think.

Create another Datamart, load it from the "raw" Datamart, and define incremental refresh, which will create partitions; yes, partitions should improve the performance.
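
As a rough sketch, the second Datamart could read from the raw Datamart's SQL endpoint and filter on the reserved RangeStart/RangeEnd parameters so incremental refresh can build its partitions. The server, database, table and column names here are placeholders:

let
    // hypothetical SQL endpoint of the "raw" Datamart; copy the real one from the Datamart settings
    Source = Sql.Database(
        "xxxxx.datamart.pbidedicated.windows.net",
        "raw_taxi",
        [Query = "SELECT * FROM trips"]),
    // RangeStart and RangeEnd are the reserved datetime parameters used by incremental refresh
    Filtered = Table.SelectRows(
        Source,
        each [pickup_datetime] >= RangeStart and [pickup_datetime] < RangeEnd)
in
    Filtered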

Use a hybrid table in a PowerBI dataset, where only the recent data is cached in Vertipaq and the history is kept in the Datamart in DirectQuery mode.

Final thoughts

The publish-to-web report is here. A very big missing piece is the option to append data to an existing Datamart, which would make adding new data without a full refresh extremely trivial. I know about incremental refresh, and I am sure a hack like this may work, but we want the real deal. Dataflow people, hurry up 🙂

I noticed something interesting: because the price of PPU is fixed, I felt I could experiment without the fear of getting a massive bill. Maybe reserved pricing is not a bad thing after all.

My first reaction when I saw Datamart was that it would be a big validation for PowerQuery, and it is. As Alex said, PowerQuery everything!!!

Poor man's snapshot using PowerBI Dataflow

This is a quick hack that may be useful. Let's say you want to keep a snapshot of previous data; for example, you have a dimension table and you want to see how the data has changed between multiple loads.

This scenario was famously hard to do in PowerBI, because when you load a table into the PowerBI database, by default it overwrites the existing data. There is an option for incremental refresh, but it assumes your data changes daily, or at least regularly.

The canonical solution for this scenario nowadays is to attach a workspace to Azure Storage; unfortunately, that is very problematic for a couple of reasons:

– Business users don’t usually have access to Azure Resources.

– As of today, a PowerBI admin needs to opt in to that option, and it then applies to all workspace admins; it would have been easier if the option could be granted to only some workspace admins.

XMLA Endpoint

The Idea is very simple.

  • Get the data already loaded into a PowerBI dataset using the XMLA endpoint (see this blog for how to use it).
  • Load the new data using a Dataflow.
  • In the same Dataflow you can append only the new data, or apply any operation you want (like merging new data); it is very flexible and can use any attribute, not only a date (see the sketch after this list).
  • Just make sure there is a lag between the Dataflow refresh and the PowerBI dataset refresh.
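
A minimal sketch of such a Dataflow query, assuming a hypothetical workspace, dataset and dimension table; the real query reads back whatever you already published and appends the new load to it:

let
    // rows already loaded into the published dataset, read back over the XMLA endpoint
    Existing = AnalysisServices.Database(
        "powerbi://api.powerbi.com/v1.0/myorg/MyWorkspace",   // hypothetical workspace
        "MyDataset",                                          // hypothetical dataset
        [Query = "EVALUATE 'DimCustomer'"]),                  // hypothetical dimension table
    // new rows from the source system: any connector works, as long as the column
    // names match the existing table (the XMLA result may come back as 'Table'[Column],
    // so a rename step might be needed before combining)
    NewRows = Table.PromoteHeaders(
        Csv.Document(Web.Contents("https://example.com/DimCustomer.csv"))),
    // keep the history and append the new load
    Snapshot = Table.Combine({Existing, NewRows})
in
    Snapshot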

Dataflow Should be Better

The obvious question: since Dataflow keeps snapshots of recent loads, why do we need the extra step of attaching Azure Storage? Why not expose this data through some extra option in the UI?

I think after all these years it is time for Dataflow to tackle more use cases. Can we have an option to append data instead of overwriting it? Or why not an easy-to-use upsert?
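
In the meantime, an upsert can be approximated by hand: keep only the old rows whose key does not appear in the new load, then append the new rows. A self-contained toy sketch (table and column names are made up):

let
    // previous snapshot and new load, standing in for the real queries
    OldRows = #table({"CustomerKey", "City"}, {{1, "Sydney"}, {2, "Perth"}}),
    NewRows = #table({"CustomerKey", "City"}, {{2, "Melbourne"}, {3, "Brisbane"}}),
    // keys present in the new load win
    NewKeys = List.Buffer(Table.Column(NewRows, "CustomerKey")),
    Kept = Table.SelectRows(OldRows, each not List.Contains(NewKeys, [CustomerKey])),
    // result: 1/Sydney, 2/Melbourne, 3/Brisbane
    Upserted = Table.Combine({Kept, NewRows})
in
    Upserted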

PowerBI made data modeling nearly trivial; hopefully Dataflow can make enterprise ETL available to everyone.

Share Data from a PowerBI Dataset using Dataflow

Using PowerBI Dataflow to share data between departments is getting increasingly popular. With the rise of managed self-service offerings, it is not unusual to find, instead of a centralised data warehouse, every department managing its own data; I don't know enough to say anything useful about which model works better in the long term.

This blog is about one use case I encountered recently that I thought might be worth sharing. To be very clear, it is always good practice to share the data from the upstream data source, but in this case there was non-trivial business logic applied inside the PowerBI dataset.

Usually we have two options.

  • A user can just create a new thin report, assuming they have Build permission.
  • With DirectQuery on composite models, a user can mix a live connection to a dataset with other imported sources.

Both approaches are fine, but in this case the data consumer from the other department is interested in just one table (a couple of dimensions and some measures), and because there is non-trivial logic in the model, we can't get the data from the source system; it is also rather important not to grant Build permission.

In my humble opinion, giving access to a whole data model with a lot of tables, relationships, etc. sometimes seems like overkill and is potentially confusing. Instead, I am suggesting an easier approach: sharing a dataflow that runs a DAX query against the model and returns exactly the agreed schema, basically a flat table that will not change even if the structure of the model changes.

It is very straightforward and works with both Pro and Premium (Premium works better with large models, as the timeout duration is more generous, and anyway PPU is very cheap).

Build a visual that shows the required data

I will use a random dataset as an example; let's say the data consumer wants this data updated daily (the real-life example is substantially more complex).

Click on Performance Analyzer, refresh the visual, then copy the query.

Clean the Query in DAX Studio

Paste the query into DAX Studio and don't be intimidated by the apparent complexity; the main query starts at VAR __DS0Core.

Now in a simplified format

// DAX Query
EVALUATE
SUMMARIZECOLUMNS(
    'MstDate'[SETTLEMENTDATE],
    'REGION'[REGION],
    TREATAS({"QLD"}, 'REGION'[REGION]),
    TREATAS({"5 Minutes"}, 'unit'[description]),
    "average_Price", '00-Measures'[average_Price],
    "INITIALMW_Sum", '00-Measures'[INITIALMW_Sum]
)

Alternatively, you can build the query directly in DAX Studio using the Query Builder.

Connect to PowerBI service endpoint using PowerBI Desktop

You can read the documentation here; basically, you connect to SQL Server Analysis Services, which is the engine of PowerBI.

Note: I am using my personal tenant.

Copy PowerQuery code to Dataflow

Dataflow doesn't have a connector to Analysis Services, at least in the UI, but copying the query just works (notice the doubled quotes around QLD).

let
    Query1 = AnalysisServices.Database(
        "powerbi://api.powerbi.com/v1.0/myorg/aemo_data",
        "aemo_data",
        [Query = "
            // DAX Query
            EVALUATE
            SUMMARIZECOLUMNS(
                'MstDate'[SETTLEMENTDATE],
                'REGION'[REGION],
                TREATAS({""QLD""}, 'REGION'[REGION]),
                TREATAS({""5 Minutes""}, 'unit'[description]),
                ""average_Price"", '00-Measures'[average_Price],
                ""INITIALMW_Sum"", '00-Measures'[INITIALMW_Sum]
            )
        ",
        Implementation = "2.0"])
in
    Query1

I am not suggesting this is good practice; in an ideal world everything lives in a nice cloud DWH, business logic and all, and the BI tool is just a front end that runs aggregations and shows nice charts. But real life is not perfect, and it is useful to have this option when you need it.

Edit: the original title of this blog was "Using Dataflow as a Poor Man's Data Mesh", but I had to be realistic: as of this writing, Dataflow can only be consumed by a PowerBI client; not even Excel is supported. I know about Azure Storage, but with the current paranoia about security, it is extremely hard to get that option turned on in a lot of companies. Hopefully the Dataflow product team opens up a little bit to other clients too.