First Look at Google Malloy

Malloy is a new modeling language created by the original author of Looker. It was released last year under an open-source license and made available to Windows users just last week. As someone who is enthusiastic about data modeling, I thought it was worth having a look at it.

Currently, Malloy is available as a free extension for Visual Studio Code. Malloy doesn't have any "calculation engine" of its own: you build your model in a text file where you define source tables, measures, relationships and dimensions, then you write queries. Malloy parses those queries and generates SQL code, which is more or less how every BI tool works behind the scenes.

Interestingly, the Malloy extension ships with DuckDB by default, which means we get a full semantic layer and a fast OLAP engine as an open-source offering. That's a very big deal!

I spent some time building a simple model, just one fact table and two dimensions; you can see the code here.

When you run a query, the results are shown in tabular format and you can see the generated SQL; you can even define some basic visuals like bar charts.

BI Engine not supported

Malloy has excellent support for BigQuery (for obvious reasons) and supports PostgreSQL too, but the generated SQL is non-trivial. BigQuery's default engine renders the SQL extremely fast, no problem there, but BI Engine struggles: basically, any non-trivial construct (cross joins, correlated subqueries, etc.) is not supported.

Measure Behavior

Let's create three trivial measures:

red is Quantity { where: color = 'red' }
black is Quantity { where: color = 'BLACK' }
red_or_black is Quantity { where: color = 'red' or color = 'black' }

Then I run a simple group by.

As a DAX user, I find the results rather unexpected:

  • Measure black: returns 0 because the comparison is case sensitive. Maybe Malloy should add a setting for DuckDB to ignore case sensitivity (or maybe it is by design, I don't know).
  • Measure red: returns 2 only for the red row and 0 elsewhere; in DAX, by default, it would return 2 everywhere.
  • Measure red_or_black: I was expecting to see the sum of both red and black, which is 4, repeated in all rows.

Here is the same using DAX (you can change that behavior by using KEEPFILTERS, but I am interested only in the default behavior).

I don't know enough about the language yet, but it would be useful to have an option in the measure to ignore the group by (it seems it is coming).

Why You Should Care

For historical reasons, modeling languages have all been proprietary and tied to a single vendor's implementation; as far as I know, this is the first fully open-source one. I am sure Google has a long-term vision for Malloy and it will show up in more services; I would not be surprised if BigQuery somehow integrated Malloy as a free semantic layer, since it works well in a consumption model. I have not used it enough to have any good intuition, but I like the direction of the product. Good on Google for making it open source, and on DuckDB for being such an awesome SQL engine.

Delta Lake with Python, Local Storage and DuckDB

TL;DR: I added a Streamlit app here.

Experimental support for writing the Delta storage format using only Python was added recently, and I thought it was a nice opportunity to play with it.

Apache Spark has had native support since day one, but personally the volume of data I deal with does not justify running Spark, hence the excitement when I learned we can finally just use Python.

Instead of another hot take on how Delta works, I just built a Python notebook that downloads files from a website (the Australian Energy Market), creates a Delta table, and then uses DuckDB and Vega-Lite to show a chart. All you need to do is define the location of the Delta table. I thought it might be a useful example; all the code is located here.
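For a rough idea of the flow, here is a minimal sketch (not the actual notebook code; the file name, table location and measure are placeholders): it writes a local Delta table with the Python deltalake package and then lets DuckDB query it through pyarrow.

import duckdb
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Placeholder paths and file names, for illustration only
csv_file = "aemo_download.csv"
delta_path = "./delta/aemo"

# Write (or append) the downloaded data to a local Delta table
df = pd.read_csv(csv_file)
write_deltalake(delta_path, df, mode="append")

# Expose the Delta table to DuckDB as a pyarrow dataset
dataset = DeltaTable(delta_path).to_pyarrow_dataset()

con = duckdb.connect()
con.register("aemo", dataset)
print(con.execute("SELECT COUNT(*) AS row_count FROM aemo").df())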

And I added a PowerBI report using the Delta connector.

Some Observations

Currently DuckDB doesn't support Delta natively; instead, we first read the Delta table as a pyarrow dataset, which DuckDB can query automatically. At this stage, I am not sure whether DuckDB can push down filter selection or read the stats saved in the log file, and currently it seems only AWS S3 is supported.

Using DuckDB with PowerBI

DuckDB is one of the most promising OLAP engines on the market: it is open source, very lightweight, has virtually no dependencies, works in-process (think the good old MS Access), is extremely fast, especially at reading and querying parquet files, and has amazing SQL support.

The ODBC driver is getting more stable, so I thought it was an opportunity to test it with PowerBI. Note that JDBC was always supported and can be used with SQL frontends like DBeaver, and obviously Python and R have native integrations.
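As a point of comparison with the ODBC route, the native Python integration is essentially a one-liner (a tiny sketch; the parquet file name is a placeholder):

import duckdb

# Query a parquet file directly through the in-process engine, no server needed
print(duckdb.query("SELECT COUNT(*) FROM 'scada.parquet'").df())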

I downloaded the ODBC driver, currently at version 0.3.3; always check for the latest release and make sure it is the right file.

Installing the binary is straightforward, but unfortunately you need to be an administrator.

Configuring PowerBI

Select ODBC; if the driver was installed correctly, you should see an entry for DuckDB.

As of this writing there is a bug in the driver: if you add a path to a DuckDB database file, the driver will not recognise the tables and views inside it. Instead I selected

database=:memory:

and defined the base table as a CTE, reading directly from a folder of parquet files.
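The same pattern is easy to prototype outside PowerBI before pasting the SQL into the ODBC connector (a sketch only; the folder path is a placeholder for wherever the parquet files live):

import duckdb

# In-memory database, with the base table defined as a CTE over a folder of parquet files
sql = """
WITH base AS (
    SELECT * FROM read_parquet('C:/parquet_folder/*.parquet')
)
SELECT COUNT(*) FROM base
"""
print(duckdb.connect(database=":memory:").execute(sql).fetchall())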

Just for fun, I duplicated the parquet files to reach the 1 billion row mark.

The total size is 30 GB compressed.

1 Billion Rows on a Laptop

And here are the results in PowerBI.

The size of the PowerBI report is only 48 KB, as I import only the results of the query, not the whole 30 GB of data. Yes, separation of storage and compute makes a lot of sense in this case.

Although the POC in this blog was just for fun: the query takes 70 seconds using the ODBC driver in PowerBI (which is still in an alpha stage), while the same query in DBeaver takes 19 seconds using the more mature JDBC driver. It also works only with import mode; for DirectQuery you need a custom connector and the gateway. But I see a lot of potential.

A lot of people are doing very interesting things, like building extremely fast and cheap ETL pipelines using just Parquet and DuckDB running on cloud functions. I think we will hear more about DuckDB in the coming years.

Benchmark Snowflake, BigQuery, SingleStore, Databricks, Datamart and DuckDB using TPC-H-SF10

Edit 18 May 2022: Microsoft released Datamart, which has excellent performance for this type of workload.

Another blog on my favorite topic, interactive live BI workloads with low latency and high concurrency, but this time, hopefully, with numbers to compare.

I tested only the databases that I am familiar with: BigQuery, Snowflake, Databricks, SingleStore, PowerBI Datamart and DuckDB.

TPC-H

The most widely used benchmarks for testing BI workloads are TPC-DS and TPC-H, produced by the independent organization TPC. Unfortunately, most of the published benchmarks are for big datasets starting from 1 TB. As I said before, I am more interested in smaller workloads, for a simple reason: after nearly 5 years of doing business intelligence for different companies, most of the data models I see are really small (my biggest was 70 million rows with 4 small dimension tables).

Benchmarking is a very complex process, and I am not claiming that my results are definitive; all I wanted as a user is an order of magnitude, and a benchmark can give you a high-level impression of a database's performance.

Schema

I like TPC-H as it has a simpler schema, 8 tables, and only 22 queries, compared to TPC-DS which requires 99 queries.

[Image: TPC-H schema]

Some Considerations

  • The results cache is not counted.
  • The results use a warm cache with at least one cold run; I ran the 22 queries multiple times.
  • Databricks by default provides a sample database, TPC-H SF05, where the main table, lineitem, is 30 million rows. I don't know enough to import the data and apply the proper sorting etc., so I preferred to use the smaller dataset. I did create a local copy using CREATE TABLE AS SELECT (later loaded with the SF10 data; see the edit below).
  • Snowflake and SingleStore provide SF10 and other scale factors by default.
  • BigQuery: I imported the data from Snowflake and sorted the tables for better performance; it is a bit odd that BigQuery doesn't provide such an important public dataset by default.
  • Microsoft Datamart: no sorting or partitioning was applied; the data was imported from BigQuery.

No Results Cache

Most DWHs support a results cache: basically, if you run the same query and the base tables did not change, the engine returns the same results very quickly. Obviously, in any benchmark, you need to filter out those queries.

  • In Snowflake, you can use this statement to turn the results cache off:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
  • In Databricks:
SET use_cached_result = false
  • In BigQuery, just add an option in the UI (the same can be done from the Python client; see the sketch after this list).
  • SingleStore and Datamart do not have a results cache per se; the engine just keeps a copy of the query plan but scans the data every time.
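For completeness, here is a minimal sketch (not from the original post; the project, dataset and table names are placeholders) of disabling the result cache when submitting a benchmark query from the BigQuery Python client:

from google.cloud import bigquery

# Assumes default credentials/project; `tpch.lineitem` is a hypothetical dataset.table
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_query_cache=False)  # bypass the results cache

rows = client.query("SELECT COUNT(*) FROM `tpch.lineitem`", job_config=job_config).result()
for row in rows:
    print(row)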

Warm Cache

Snowflake, SingleStore and Databricks leverage a local SSD cache: when you run a query for the first time, the engine scans the data from cloud storage, which is a slow operation; when you run it again, the query will try to use the data already copied to the local disk, which is substantially faster. Especially with Snowflake, if you want to keep the local cache warm it makes sense to keep your cluster running a bit longer.

BigQuery is a different beast: there is no VM, and the data is read straight from Google Cloud Storage. Yes, Google's cloud network is famous for being very fast, but I guess it cannot compete with a local SSD disk. That's why we have BI Engine, which basically caches the data in RAM, but not all queries are supported; actually, only 6 are fully accelerated as of this writing (see the limitations).

Query History

Getting query timings is very straightforward using information_schema, except for Databricks, where it seems to be supported only through an API; there I just copied one warm run, pasted it into Excel and loaded it from there.
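As an illustration of the information_schema approach, here is a minimal sketch for Snowflake (the account, credentials and warehouse names are placeholders, and the lineitem filter is just an example):

import snowflake.connector

# Placeholders only; use your own account, credentials and warehouse
con = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="BENCH_XS", database="SNOWFLAKE_SAMPLE_DATA",
)
cur = con.cursor()
cur.execute("""
    SELECT query_text, total_elapsed_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 500))
    WHERE query_text ILIKE '%lineitem%'
    ORDER BY start_time DESC
""")
for query_text, elapsed_ms in cur.fetchall():
    print(elapsed_ms, query_text[:60])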

Engines Used

  • Snowflake: X-Small (lowest tier)
  • Databricks: 2X-Small (lowest tier)
  • SingleStore: S-0
  • BigQuery: on-demand + 1 GB reservation of BI Engine
  • Datamart: included with PowerBI Premium, official spec not disclosed
  • DuckDB: my laptop, 16 GB RAM 🙂

Results

The 22 queries are saved in this repo; I am using PowerBI to combine all the results.

Let's start with:

Snowflake vs BigQuery

Snowflake vs SingleStore

Snowflake vs Databricks

Notice that Databricks is using the smaller SF05 dataset (30 million rows), and still Snowflake shows better performance.

Overall

Edit: due to feedback, I am adding the sum of all queries. You can download the results here.

Edit 26-Jan-2022: I updated the results for Databricks SF10; I uploaded the same data used for BigQuery, then created Delta tables and applied OPTIMIZE ZORDER.

Takeaways

  • Snowflake is very fast and has consistent results across all 22 queries, except Query 13, which is a bit odd.

  • SingleStore is remarkable, but Query 13 is not good at all and skews the overall performance.

  • BigQuery is fantastic when BI Engine works (only 11 of the 22 queries are supported).

  • Databricks' performance on TPC-H-SF05 is problematic; I just hope they release a proper TPC-H-SF10 dataset and an information schema like the other DWHs.

  • Datamart has the best user experience, being the only data platform where you can load the data without writing any code. The same as SingleStore, Query 13 has a very big cost on the overall performance.

  • DuckDB: Query 9 skews the overall performance, and I probably need a new laptop 🙂