Building a Complex Data Model Using Nested Data in Malloy

Last year, I wrote a blog post on how to build a complex nested data model using BigQuery; please read it first. Unfortunately, using standard SQL to nest and unnest data was a painful exercise. Maybe it was due to my lack of experience, or maybe my brain is simply wired to think only about tabular data. I am happy to say that with the new data modeling language Malloy, the experience became much simpler.
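
To make that concrete, here is a toy illustration of what nesting and unnesting look like in plain SQL. The table and values are made up, and I am running it through DuckDB's Python client purely to keep it reproducible (the original post used BigQuery):

import duckdb

# a tiny nested table: each order carries a list of line-item structs,
# the same general shape as the nested model in the BigQuery post
duckdb.sql("""
    CREATE TABLE orders AS
    SELECT 1 AS order_id,
           [{'sku': 'a', 'qty': 2}, {'sku': 'b', 'qty': 1}] AS items
""")

# getting back to a flat, tabular view takes an explicit unnest; on a
# deeper model this is where hand-written SQL becomes painful
print(duckdb.sql("""
    SELECT order_id, item.sku AS sku, item.qty AS qty
    FROM (SELECT order_id, unnest(items) AS item FROM orders)
"""))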

For a start, Malloy shows a nice expanded view of the nested table:

The model is self-explanatory.

Now you just need to write queries; Malloy generates the SQL behind the scenes.

For example, the total values of all measures:

No idea what the generated SQL means, to be honest.

Now grouped by category:

Now for something trickier: let's group by year. Notice that the budget has no date values, so the results are not correct!

Instead, you first group by year, then nest the second level of grouping, something like this:

Malloy has native export to a nested data structure, but for this exercise I want tabular results. You can easily flatten the previous query using project; notice how you can simply reference a previous query. The measure Progress is simply Actual/Budget.
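
Since the screenshots don't carry well, here is a rough sketch of that flattened query in plain SQL (DuckDB through Python, with made-up budget and actuals tables standing in for the real BigQuery model): the budget joins at the category level, the actuals group by year, and Progress is just Actual/Budget.

import duckdb

# toy stand-ins: budget has no date column, so it can only ever be
# grouped by category, which is why grouping it by year went wrong
duckdb.sql("CREATE TABLE budget AS SELECT 'roads' AS category, 100 AS budget")
duckdb.sql("""
    CREATE TABLE actuals AS
    SELECT * FROM (VALUES
        ('roads', DATE '2021-06-01', 40),
        ('roads', DATE '2022-06-01', 55)
    ) AS t(category, date, actual)
""")

# the flattened shape: actuals grouped by year and category, budget
# brought in per category, and progress derived as actual / budget
print(duckdb.sql("""
    SELECT year(a.date) AS year,
           a.category,
           sum(a.actual)            AS actual,
           b.budget,
           sum(a.actual) / b.budget AS progress
    FROM actuals a
    JOIN budget b USING (category)
    GROUP BY 1, 2, b.budget
"""))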

I think it is simple enough that a "normal" BI person can work with it. You can find the model here; it currently works only with BigQuery, as I found some bugs with the DuckDB connector, which will hopefully be fixed in the next update of Malloy.

If you work with nested data, Malloy makes it nearly easy 🙂

Update:

Malloy just added an "all" semantic to ignore a group-by dimension; see the same query using the new syntax. This is amazing!
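
I haven't dug into the new syntax beyond the announcement, but in plain SQL terms the effect is roughly an aggregate that ignores the current group_by, which a window over the grouped result can mimic. A minimal DuckDB sketch with made-up data:

import duckdb

# the window sum ignores the group_by on category, the way an
# "all"-style measure returns the same total on every row
print(duckdb.sql("""
    SELECT category,
           sum(amount)              AS amount,
           sum(sum(amount)) OVER () AS amount_all
    FROM (VALUES ('a', 2), ('a', 3), ('b', 5)) AS sales(category, amount)
    GROUP BY category
"""))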

First Look at Google Malloy

Malloy is a new modeling language created by the original author of Looker. It was released last year under an open source license and was made available to Windows users just last week; as someone who is enthusiastic about data modeling, I thought it was worth a look.

Currently, Malloy is available as a free extension for Visual Studio Code. Malloy doesn't have any "calculation engine"; you build your model in a text file where you define source tables, measures, relationships, and dimensions, then you write queries, and Malloy parses those queries and generates SQL code. This is more or less how every BI tool works behind the scenes.

Interestingly, the Malloy extension ships with DuckDB installed by default, which means we get a full semantic layer plus a fast OLAP engine as an open source offering. That's a very big deal!

I spent some time building a simple model, just one fact table and two dimensions; you can see the code here.

When you run a query, it shows the results in tabular format along with the generated SQL; you can even define some basic visuals like bar charts, etc.

BI Engine not supported

Malloy has excellent support for BigQuery (for obvious reasons) and supports PostgreSQL too, but the generated SQL is non-trivial. BigQuery's default engine runs the SQL extremely fast, no problem there, but BI Engine struggles: basically, any non-trivial calculation (cross joins, correlated variables, etc.) is not supported.

Measure Behavior

Let's create three trivial measures:

red is Quantity { where: color = 'red' }
black is Quantity { where: color = 'BLACK' }
red_or_black is Quantity { where: color = 'red' or color = 'black' }

Then I run a simple group by:

As a DAX user, the results are rather unexpected:

  • Measure black: returns 0; the comparison is case sensitive. Maybe Malloy should add a setting for DuckDB to ignore case (or it is by design, I don't know).
  • Measure red: returns 2 only for red and 0 elsewhere; in DAX, by default it would return 2 everywhere.
  • Measure red_or_black: I was expecting to see the sum of both red and black, which is 4, repeated in all rows (the sketch below shows what actually happens).
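
I can't reproduce Malloy's generated SQL here, but the behavior matches plain SQL filtered aggregates: the filter applies inside each group and never escapes it the way an unfiltered DAX measure does. A minimal DuckDB sketch with made-up data (one row per color):

import duckdb

# 'red' only has a value in the red group (NULL elsewhere, where the
# Malloy output showed 0), 'BLACK' matches nothing because the string
# comparison is case sensitive, and red_or_black never becomes a grand
# total of 4: each group only sees its own rows
print(duckdb.sql("""
    SELECT color,
           sum(quantity) FILTER (WHERE color = 'red')   AS red,
           sum(quantity) FILTER (WHERE color = 'BLACK') AS black,
           sum(quantity) FILTER (WHERE color = 'red' OR color = 'black') AS red_or_black
    FROM (VALUES ('red', 2), ('black', 1), ('blue', 1)) AS t(color, quantity)
    GROUP BY color
"""))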

Here is the same using DAX (you can change that behavior by using KEEPFILTERS, but I am interested only in the default behavior).

I don't know enough about the language yet, but it would be useful to have an option in the measure to ignore the group by (it seems it is coming).

Why You Should Care

For some historical reason, all modeling languages have been proprietary and based only on vendor implementations; as far as I know, this is the first fully open source implementation. I am sure Google has a long-term vision for Malloy and it will show up in more services; I would not be surprised if BigQuery somehow integrated Malloy as a free semantic layer, since after all it works well in a consumption model. I have not used it enough to have any good intuition yet, but I like the direction of the product. Good on Google for making it open source, and on DuckDB for being such an awesome SQL engine.

Delta Lake with Python, Local Storage and DuckDB

TL;DR: I added a Streamlit app here.

Experimental support for writing the Delta storage format using only Python was added recently, and I thought it was a nice opportunity to play with it.

Apache Spark has had native support since day one, but personally the volume of data I deal with does not justify running Spark, hence my excitement when I learned we can finally just use Python.

Instead of another hot take on how Delta works, I just built a Python notebook that downloads files from a website (the Australian Energy Market), creates a Delta table, and then uses DuckDB and Vega-Lite to show a chart; all you need to do is define the location of the Delta table. I thought it might be a useful example; all the code is located here.
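
As a minimal sketch of the notebook's core step (the real notebook downloads AEMO files; here a tiny made-up DataFrame and a local path keep it runnable), writing a Delta table takes nothing but the deltalake package:

import pandas as pd
from deltalake import write_deltalake

# stand-in for the downloaded energy market data
df = pd.DataFrame({"region": ["NSW", "VIC"], "demand_mw": [7500.0, 5200.0]})

# create (or overwrite) a Delta table on local storage: pure Python, no Spark
write_deltalake("./energy_delta", df, mode="overwrite")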

And I added a Power BI report using the Delta connector.

Some Observations

Currently, DuckDB doesn't support Delta natively; instead, we first read the Delta table using pyarrow, which DuckDB can read automatically. At this stage, I am not sure whether DuckDB can push down filter selection or read the stats saved in the log file, and currently it seems only AWS S3 is supported.
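
Continuing the sketch above (same made-up local table), the pyarrow bridge looks like this:

import duckdb
from deltalake import DeltaTable

# DuckDB can't open the Delta log itself, so expose the table as a
# pyarrow dataset and let DuckDB's replacement scan pick it up by name
dataset = DeltaTable("./energy_delta").to_pyarrow_dataset()
print(duckdb.sql("SELECT region, sum(demand_mw) AS demand_mw FROM dataset GROUP BY region"))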

Using DuckDB with Power BI

DuckDB is one of the most promising OLAP engines on the market: it is open source, very lightweight, has virtually no dependencies, works in-process (think the good old MS Access), is extremely fast, especially at reading and querying Parquet files, and has amazing SQL support.

The ODBC driver is getting more stable, so I thought it was an opportunity to test it with Power BI. Note that JDBC was always supported and can be used with SQL frontends like DBeaver, and obviously Python and R have native integrations.

I downloaded the ODBC driver, version 0.3.3 at the time of writing; always check for the latest release and make sure you pick the right file.

Installing the binary is straightforward, but unfortunately you need to be an administrator.

Configuring Power BI

Select ODBC; if the driver was installed correctly, you should see an entry for DuckDB.

As of this writing, there is a bug in the driver: if you add a path to a DuckDB database file, the driver will not recognise the tables and views inside it. Instead, I selected:

database=:memory:

and defined the base table as a CTE, reading directly from a folder of Parquet files.
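
The statement itself is ordinary DuckDB SQL; the same text goes into the SQL statement box of the ODBC connector in Power BI. Here it is run through the Python client instead, just to show the shape (the folder path is hypothetical):

import duckdb

# base table as a CTE over a folder of Parquet files: the glob makes
# DuckDB scan every file in the folder as a single table
print(duckdb.sql("""
    WITH base AS (
        SELECT * FROM parquet_scan('C:/data/taxi/*.parquet')
    )
    SELECT count(*) AS row_count FROM base
"""))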

Just for fun, I duplicated the Parquet file enough times to reach the 1 billion row mark.
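
That step is nothing clever, just copying the same file into the scanned folder until the row count adds up (paths hypothetical):

import shutil

# each copy adds the same rows again; the glob scan above picks them all up
for i in range(32):
    shutil.copy("C:/data/taxi/base.parquet", f"C:/data/taxi/copy_{i}.parquet")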

The total size is 30 GB compressed.

1 Billion Rows on a Laptop

And here are the results in Power BI.

The size of the Power BI report is only 48 KB, as I import only the results of the query, not the whole 30 GB of data. Yes, separation of storage and compute makes a lot of sense in this case.

Although the POC in this blog was just for fun, the query takes 70 seconds using the ODBC driver in Power BI (which is still at an alpha stage); the same query using DBeaver takes 19 seconds with the more mature JDBC driver. Also, the ODBC driver works only with import mode; for DirectQuery you would need a custom connector and the use of a gateway. But I see a lot of potential.

A lot of people are doing very interesting things, like building extremely fast and cheap ETL pipelines using just Parquet and DuckDB running on cloud functions. I think we will hear more about DuckDB in the coming years.
