Aggregate 100 GB using PowerQuery and DuckDB

Maybe not known enough, but you can run a python script inside PowerQuery in the desktop, and deploy it to PowerBI service using a personal gateway.

it is just a POC that showcase you can do a lot; just using a decent laptop and python.

let’s say, you have a folder with 100 GB of parquet files, and you want to aggregate some results for further analysis in PowerBI, Obviously importing 100 GB maybe not be a great idea, in this scenario, from what I can see, people recommend stuff Like PySpark or Dask , which I guess are great option, but if you want a lightweight option with a simple pip install then DuckDB is amazing !!

For testing, first write your SQL Query using only 1 parquet file, when you are happy with the results, change the path from “folder/x.parquet” to “folder/*.parquet”

In this example I have a folder of 100 GB of Parquet files.

in PowerQuery you need just to insert a python script, you don’t need to know anything about python, it is just a standard SQL Query, PowerQuery is smart enough to know that df is a dataframe, and it will show the result in the next step.

import duckdb 
con = duckdb.connect()
df =con.execute('''
select  L_RETURNFLAG,L_LINESTATUS,count(*) from
 'C:/xxxxx/parquet/lineitem/*.parquet' 
        group by 1,2
''').df()

And here is the final results.

The duration will depend mainly on your CPU and SSD speed, DuckDB consume little memory, in that example less than 2 GB, the secret sauce is parallel scan of Parquet files.

Traditionally Big Data means; Data that don’t fit into RAM, nowadays, Big Data is how much Data can fit in your SSD, and that’s a welcome change.

How an Open Source PowerBI Desktop May Look like

Was playing with Malloy and DuckDB and had this silly idea, can I just reproduce something Like PowerBI Desktop Just using Open Source Tools, obviously I am talking about the Overall architecture, not exact functionalities.

For a Typical Data Analytical Workload, Broadly Speaking you get something like this :

Using a Combination of Python script for ETL, DuckDB as a SQL Engine and Malloy as a Data Model you get this nice folder of files, Currently Malloy Don’t support DuckDB as a storage format, instead I am just using a Parquet file as a local Cache, but it is possible to Query the remote DB Directly, you can find the files here

And here is a nice chart in Visual Studio Code, as of this writing, Malloy extension in VSC don’t support interactivity, they have a Viewer App, but it is not available in Windows yet.

Obviously it is a nice theoretical exercise, but one can imagine, if all those assets were bundled together in a zip file and a viewer that can natively read it. who knows, we may have something like an Open Source PowerBI Desktop equivalent.

I know it is a weird blog post, what I am trying to convey , Building a full stack BI Application is doable, it is just a matter of a lot engineering effort, and contrary to 10 years ago, we have a lot of open source building Block, Tableau had to pay for hyper, now not only a very fast SQL Engine is available as open source but a Modeling layer too, and if you care about analytics, it is a very good thing to the overall market.

Building Complex Data Model using Nested Data in Malloy

Last year, I wrote a blog on how to build a nested complex Data Model using BigQuery , please read it first, unfortunately using standard SQL to nest and unnest Data was a painful exercise, maybe it was due to my lack of experience, or simply my brain is wired to think only about Tabular Data. happy to say, using the new Data Modeling language Malloy the experience became way simpler.

for a start Malloy will show a nice expended view of the nest table

The Model is self explanatory.

Now you need just to write Queries, Malloy will generate a SQL behind the scene

For example Total values of all measures :

No Idea what the SQL Means to be honest

Now Grouped by Category

Now More tricky, let’s group by year, notice, the Budget has no date values, the results is not correct !!!!

instead, you first group by year then nest the second level of grouping something like this

Malloy has a native export to nested Data Structure, but for this exercise I want a tabular results, you can easily flatten the previous Query using Project, notice how you can simply reference a previous Query, the measure Progress is simply Actual/Budget

I think it is simple enough that a “Normal” BI person can work it, you can find the Model here, The Model currently works only with BigQuery, found some bugs with DuckDB connector, hopefully it will be fixed in the next update of Malloy.

if you works with nested data, Malloy make it nearly easy 🙂

Update :

Malloy just added a semantic “All”, to ignore a group by dimension, see the same Query using the new syntax , this is Amazing !!!!

First Look at Google Malloy

Malloy is a new Modeling Language created by the original Author of Looker, it was released last year under an Open Source license, and made available to Windows users just last week, as someone who is enthusiast about Data Modeling I thought it is worth having a look at it.

Currently, Malloy is available under a free extension in Visual Studio Core, Malloy don’t have any “calculation Engine”, all it does, you build your model in a text where you define Source Tables, measures, relationship and dimension and then you write Queries, Malloy will parse those Queries and Generate SQL Code, it is more or less how every BI tools works behind the scene.

Interestingly, Malloy extension has DuckDB installed by default, it means, we have a full semantic layer and a fast OLAP engine as an open source offering, that’s a very big deal !!!

I spend sometime building a simple model, just one fact table and two dimensions, you can see the code here

When you run a Query, it will show the results in tabular format and you can see the SQL generated, you can even define some basic Visual like bar chart et

BI Engine not supported

Malloy has excellent support for BigQuery ( for obvious reason) and does support PostgreSQL too, but the SQL Generated is non trivial, BigQuery default Engine render the SQL extremely fast, no problem with that, but BI Engine struggle, basically any non trivial calculation are not supported. ( cross join and correlated variable etc)

Measure Behavior

let’s create three trivial measures

red is Quantity { where:  color = 'red' } 
black is Quantity { where:  color = 'BLACK' } 
red_or_black is Quantity { where:  color = 'red' or  color = 'black'}

then I run a simple group by

As a DAX users the results are rather unexpected

  • Measure Black : Return 0, it is case sensitive , maybe Malloy should maybe add a setting for DuckDB to ignore case sensitivity ( or it is by design, I don’t know)
  • Measure Red : return 2 only for red and 0 elsewhere, in DAX by default it will return 2 everywhere
  • red_or_black : was expecting to see the sum of both red and blacked which is 4 repeated in all rows

here is the same using DAX ( you can change that behavior by using keepfilters but I am interested only in the default behavior)

I don’t know enough about the language yet, but it would have being useful to have an option in the measure to ignore the group by ( it seems it is coming )

Why You Should care

For some historical reason, all Modeling language were proprietary and based only on vendor implementation, as far as I know, this is the first fully open source implementation, I am sure Google has a long term vision for Malloy and will show up in more service, I would not be surprised if BigQuery somehow integrated Malloy as a free semantic layer, after all it works well in a consumption model, I have not used it enough to have any good intuition, but I like the direction of the product, and Good on Google for making it Open Source, and DuckDB for being such an awesome SQL Engine.

%d bloggers like this: