Benchmarking Snowflake, Databricks, Synapse, BigQuery, Redshift, Trino, DuckDB and Hyper using TPCH-SF100

(Disclaimer: I use BigQuery for a personal project and it is virtually free for smaller workloads; at work we use SQL Server as a data store. I will try my best to be objective.)

TL;DR

I ran the TPCH-SF100 benchmark (base table: 600 million rows) to understand how different engines behave on this workload using just their lowest tier offering; you can download the results here.

Introduction

I was playing with the Snowflake free trial (maybe for the fifth time) and, for no apparent reason, ran the queries on the TPCH-SF100 dataset. Usually I am interested in smaller datasets, but I wondered how Snowflake would behave with bigger data using the smallest cluster. Long story short, I got 102 seconds, posted it on LinkedIn, and a common reaction was that Snowflake is somehow cheating.

Obviously I did not buy the cheating explanation, as it would be too risky and Databricks would make it international news.

Load the Data Again

Ideally I would have generated the data myself and loaded it into Snowflake, but generating 600 million records on my laptop is not trivial; my tool of choice, DuckDB, has a utility for that, but it is currently single threaded. Instead:

  • I exported the data from Snowflake to Azure Storage as parquet files
  • I downloaded it to my laptop and generated new files using DuckDB, since in Snowflake you can't control the minimum size of the files, only the maximum (see the sketch below)
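To give an idea of the DuckDB step, the rewrite is basically a COPY with fewer output files; a minimal sketch (paths are hypothetical, and the thread count controls how many files you get):

```python
import duckdb

# Sketch: read the many small files exported by Snowflake and rewrite them as a
# handful of larger parquet files. PER_THREAD_OUTPUT writes one file per thread,
# so a lower thread count means fewer, bigger files. Paths are hypothetical.
con = duckdb.connect()
con.execute("SET threads = 4")
con.execute("""
    COPY (SELECT * FROM read_parquet('snowflake_export/lineitem/*.parquet'))
    TO 'lineitem_rewritten' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE)
""")
```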

Snowflake Parquet External Table

My plan was to run the queries directly on parquet hosted on Azure Storage. The experience was not great at all; Snowflake got the join order of Query 5 wrong.

Snowflake Internal Table

I loaded the parquet files generated by DuckDB, and Snowflake got extremely good results. What I learnt: whatever magic Snowflake is doing, it is related to their proprietary file format.

BigQuery External Table

I have no frame of reference for this kind of workload, so I loaded the data into BigQuery using an external table in Google Cloud. BigQuery took 5 minutes for one run, at a cost of $2.5!!!

BigQuery Internal Table

I loaded the data into BigQuery's internal format (notice that BigQuery doesn't charge for this operation): 2 minutes 16 seconds, one cold run.
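For reference, loading parquet from cloud storage into a native table is a one-liner with the BigQuery Python client; a minimal sketch, with hypothetical bucket and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Bucket, dataset and table names below are hypothetical.
load_job = client.load_table_from_uri(
    "gs://my-bucket/tpch/lineitem/*.parquet",
    "my-project.tpch.lineitem",
    job_config=job_config,
)
load_job.result()  # wait for completion; the load job itself is not billed
print(client.get_table("my-project.tpch.lineitem").num_rows)
```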

BigQuery Standard Edition

BigQuery added a new pricing model where you pay by the second after the first minute. I used the Standard Edition with a small size and ran the same queries twice; unfortunately the new distributed disk cache doesn't seem to be working, with the same 5 minute result both times, which was disappointing.

Redshift Serverless

I imported the same parquet files into Redshift Serverless. The schema was defined without distribution keys. The results are for 3 runs: the first run was a bit slower as it fetches the data from managed storage to the compute SSD, while the other 2 runs are substantially faster, so I thought it fair to use an average. I used the lowest tier, 8 RPU ($2.88/hour).

Redshift Serverless's hot run was maybe the fastest performance I have seen so far, but they still need to improve on their cold run.

I was surprised by the system's overall performance; from my reading, it seems AWS basically rewrote the whole thing, including separating compute from storage. Overall I think it is a good DWH.

Trino

Trino did not run Query 15; I had to use a modified syntax but got the same results. One run from cold storage, using the excellent service from Starburst Data.

Synapse Serverless

Honestly, I was quite surprised by the performance of Synapse Serverless. Initially I tested with the smaller files generated by Snowflake and it did work: the first run failed but the second worked just fine, and I did like that it failed quickly. Notice that Synapse runs statistics on parquet files, so you would expect a more stable performance: not the fastest, but rather resilient.

Anyway, it took from 8 to 11 minutes; to be clear, that's not the Synapse from two years ago.

Not related to the benchmark, but I did enjoy the lake database experience.

Databricks External Table

I did not have a great experience with Databricks. I could not simply pass authentication to Databricks SQL; you need a service principal and to register an app, and the documentation keeps talking about Unity, which is not installed by default. This is a new install; why isn't Unity embedded if it is such a big deal?

Anyway, first I created an external table in Databricks using the excellent passthrough technique on a single node cluster; Databricks got 12 minutes.

Databricks Delta table

Let's try again with Delta. I created a new managed table and ran OPTIMIZE and ANALYZE (I always thought Delta already had the stats), but it didn't seem to make a big difference: still around 11 minutes, and this was running from disk, so no network bottleneck.
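For context, the maintenance step was essentially these two statements, here as a sketch from a Databricks notebook (the table name is hypothetical):

```python
# In a Databricks notebook, `spark` is already defined.
spark.sql("OPTIMIZE tpch.lineitem")  # compact small files
spark.sql("ANALYZE TABLE tpch.lineitem COMPUTE STATISTICS FOR ALL COLUMNS")  # collect stats
```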

DuckDB

My plan was to run DuckDB on Azure ML, but I needed a bigger VM than the one provided by default and could not find a way to increase my quota. I know it sounds silly, and I am just relating my experience: it turns out the Azure ML VM quota is different from the Azure VM quota, and it drove me crazy that I could get any VM in Databricks while Azure ML kept complaining that I didn't have enough CPU.

Unfortunately I hit two bugs. First, the native DuckDB file format seemed to generate double the size of parquet; the dev was very quick to identify the issue, and the workaround is to define the table schema first and then load the data using INSERT. The file became 24 GB, compared to the original 40 GB of parquet files.
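The workaround looks roughly like this; a minimal sketch with the column list abbreviated and hypothetical paths:

```python
import duckdb

# Sketch of the workaround: declare the schema first, then bulk-insert from
# parquet, instead of CREATE TABLE AS. Column list abbreviated, paths hypothetical.
con = duckdb.connect("tpch_sf100.duckdb")
con.execute("""
    CREATE TABLE lineitem (
        l_orderkey BIGINT,
        l_partkey  BIGINT,
        l_quantity DECIMAL(15,2),
        l_shipdate DATE
        -- ...remaining TPC-H columns
    )
""")
con.execute("""
    INSERT INTO lineitem
    SELECT * FROM read_parquet('parquet/lineitem/*.parquet')
""")
```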

I ended up going with parquet files; I was not really excited about uploading a 24 GB file to a storage account.

I ran the queries on an Azure Databricks VM, E8ds_v4 (8 cores and 64 GB of RAM).

As I am using fsspec with a disk cache, the remote storage is hit only on the first run. After 4 tries, Query 21 kept crashing the VM 😦
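The fsspec disk cache setup looked roughly like the sketch below; account and container names are hypothetical, and the exact URL wiring can differ between fsspec/DuckDB versions:

```python
import duckdb
import fsspec

# Wrap the remote filesystem in a whole-file disk cache: the first query downloads
# the parquet files to the local SSD, subsequent runs read from the cache.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="abfs",
    target_options={"account_name": "myaccount"},  # hypothetical storage account
    cache_storage="/tmp/tpch_cache",
)

con = duckdb.connect()
con.register_filesystem(fs)  # DuckDB's fsspec integration
con.sql("""
    SELECT l_returnflag, count(*)
    FROM read_parquet('filecache://tpch/lineitem/*.parquet')
    GROUP BY 1
""").show()
```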

Tableau Hyper

Tableau Hyper was one of the biggest surprises; unfortunately, I hit a bug with Query 18, otherwise it would have been the cheapest option.

Some Observations

  • Initially I was worried I had made a mistake in the Snowflake results; the numbers are just impressive for a single node tier. One explanation is that the execution engine mostly operates on compressed data with little materialization, but whatever they are doing, it has to do with the internal table format, which brings up a whole discussion of performance vs openness. Personally, in BI scenarios I want the best performance possible, and I wonder if they can get the same speed using Apache Iceberg.
  • Synapse Serverless improved a lot from last year; it worked well regardless of the individual parquet file sizes I threw at it, and in my short testing it was faster than Databricks. You pay by data scanned, so strictly speaking pure speed is not such a big deal, but without a free result cache like BigQuery's, it is still a hard sell.
  • Azure ML's quota policy was very confusing to me, and honestly I don't want to deal with a support ticket.
  • Databricks may well be the fastest to run 100 TB, but for a 100 GB workload, color me unimpressed.
  • DuckDB is impressive for an open source project that has not even reached version 1. I am sure those issues will be fixed soon.
  • Everything I heard about Redshift from Twitter was wrong; it is a very good DWH with excellent performance.
  • BigQuery, as I expected, has excellent performance both for parquet and the native table format. The challenge is to keep that up with the new autoscale offering; I added the autoscale performance, and I think Google should do better.

Summary Results

You can find the results here. If you are a vendor and you don't like the results, feel free to host a TPCH-SF100 dataset in your service and let people test it themselves.

Note: using the SQL query history, BigQuery is one cold run; Synapse Serverless, Redshift Serverless and Snowflake are a mix of cold and warm runs.

(Note: Synapse Serverless always reads from remote storage.)

For Databricks I am showing the best run from disk; there is no system table, so I had to copy-paste the results from the console.

Pricing

I did not keep the durations for data loading; this is just the cost for reads. Obviously it is a theoretical exercise and does not reflect real life usage, which depends on other factors like concurrency performance, how you can share a pool of resources across multiple departments, a free results cache, the performance of your ODBC drivers, etc.

It is extremely important to understand what's included in the base price, for example:

Results cache:

The BigQuery, Snowflake and Redshift results caches are free and don't need a running cluster; in Databricks you pay for it, and Synapse doesn't offer a result cache at all.

Data loading:

BigQuery data loading is a free operation, as are other services like sorting and partitioning; in other DBs you need to pay.

Egress fees:

Snowflake and BigQuery don't charge egress fees; with other vendors you may pay, so you need to check.

Note:

BigQuery: for this workload it makes more sense to pay by compute, not data scanned, either using autoscale, reserved pricing, etc. I will try to test autoscaling later.

Snowflake: I used the Standard Edition of Snowflake.

Edit: I used a Google Colab notebook with a bigger VM for Hyper and DuckDB; see the full reproducible notebook.

Final Thoughts

Cloud DWHs are amazing tech, and only competition can drive innovation, not FUD and dishonesty. Regardless of what platform you use, keep an eye on what other vendors are doing, and test using your own workload; you may be surprised by what you find.

Implementing a Poor Man’s Lakehouse in Azure

TL;DR

A simple script that loads some data from Azure Storage into a disk cache, runs complex queries using DuckDB and saves the results into a destination bucket, using a cheap Azure ML notebook. The resulting bucket can be consumed in Synapse Serverless, Power BI, a notebook, etc.

Introduction

Last year, the Azure Synapse team published an excellent article on how to build a lakehouse architecture using Azure Synapse. What I really liked is this diagram; it is very simple and to the point.

Notice that I am more interested in the overall Azure ecosystem, so this is vendor neutral.

In practical terms, lakehouse here means the storage system with an open table format. Now, if you ask three different people about this diagram, they will probably give you four different answers:

  • An old school dedicated pool professional will argue this is an over complicated system, and all you need is source system -> data integration tool -> dedicated pool
  • As I have a soft spot for Serverless, ideally I would say: add write capabilities to Serverless and call it a day
  • A Snowflake or Databricks person will argue that one engine should do everything: prepare and serve
  • My colleague, who is an Azure data engineer, will say the whole thing does not make any sense; ADF and SQL Server are all you need 🙂

What if you don't have big data?

The previous diagram assumes a big data workload; Spark has a massive overhead in compute usage and cost and does not make much sense for smaller data.

What if you have a smaller data size: can we keep this overall architecture while keeping the cost low? Maybe we can. I will argue it may even become useful very soon 🙂

Why are you keeping the Serverless pool?

I think it may be the obvious question. DuckDB is an awesome execution engine, but it is not a client-server DB; you can't have a SQL endpoint that you just use to run queries from Power BI etc.

There are projects to do that, but they are not ready yet, and even if you find some hacks, trying to implement governance and access controls will be non-trivial.

Obviously you can use an Azure ML notebook just to do exploratory analysis, but that's not something that makes sense for business intelligence people 🙂

PowerBI

The theory is that because the data is prepared and cleaned at the storage level, Power BI can just import the parquet files; a SQL endpoint is not strictly needed.

Ok sweet, so where do you run this Duck thing?

Honestly that was the hardest part. Using Synapse compute defeats the purpose, as it is designed for Spark and the minimum is three VMs. It turns out Azure has an amazing machine learning service that we can just use for data engineering: you can run a VM as small as 1 core (8 cents/hour), auto shutdown is available so you pay only for what you use, and you can schedule jobs. Yes, it is supposed to be for ML, but it works just fine for data engineering jobs.
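For illustration, submitting such a script as an Azure ML job looks roughly like this with the v2 Python SDK; workspace, compute and environment names are hypothetical placeholders:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Hypothetical workspace/compute/environment names throughout.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                       # folder containing pipeline.py
    command="python pipeline.py",
    environment="duckdb-env@latest",    # custom environment with duckdb/fsspec installed
    compute="cheap-1core-instance",     # the low-cost compute described above
    display_name="duckdb-lakehouse-pipeline",
)
ml_client.jobs.create_or_update(job)
```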

Show me an Example

The pipeline is very simple:

- Read data from an Azure Storage bucket

- Run some complex queries (22 queries)

- Save the results in a bucket

The compute used is an Azure ML VM that costs 14 cents/hour, the raw data is around 3 GB with the main table at 60 million rows, and the overall pipeline took 2 minutes and 37 seconds. I think this is the most cost-effective way to run this workload on Azure. (Synapse Serverless would have been perfect with the pay-by-scan model, but its write capabilities are rudimentary: basically you can't overwrite a folder, though that may change anytime.)
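A minimal sketch of the pipeline, assuming one .sql file per query and hypothetical storage account/container names:

```python
import pathlib

import duckdb
import fsspec
import pyarrow.parquet as pq

# Read the source data through a local disk cache, run the queries in DuckDB,
# and write each result back to a destination container.
cached_fs = fsspec.filesystem(
    "filecache",
    target_protocol="abfs",
    target_options={"account_name": "myaccount"},
    cache_storage="/tmp/lake_cache",
)
out_fs = fsspec.filesystem("abfs", account_name="myaccount")

con = duckdb.connect()
con.register_filesystem(cached_fs)
con.sql("CREATE VIEW lineitem AS SELECT * FROM read_parquet('filecache://source/lineitem/*.parquet')")
# ...create the remaining TPC-H views the same way

for path in sorted(pathlib.Path("tpch_queries").glob("*.sql")):
    result = con.sql(path.read_text()).arrow()
    with out_fs.open(f"results/{path.stem}.parquet", "wb") as f:
        pq.write_table(result, f)
```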

And The results

I appreciate that TPC-H is not a benchmark for heavy writes, but I used something easily accessible that I can compare to other systems.

Thanks to Koen Vossen for showing how to use the disk cache with fsspec.

What’s the catch?

As of this writing, Python support for open table formats (Iceberg, Delta) is very limited. Personally I found that Arrow datasets are the most mature offering today, but they support only Hive tables and only append or overwrite.
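To make that concrete, here is a sketch of what Arrow datasets do support: appending to or overwriting files in a Hive-partitioned folder (paths and the partition column are hypothetical):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pq.read_table("staging/orders.parquet")
ds.write_dataset(
    table,
    base_dir="lake/orders",
    format="parquet",
    partitioning=["o_orderpriority"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",  # append-style writes; no row-level merge/update
)
```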

If you need a merge or update directly on remote storage, your only option is to do those operations in DuckDB using the native file format and then overwrite the files in the remote storage; it works well for small data but it will not scale.
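The workaround looks roughly like this sketch (paths and the delete predicate are hypothetical):

```python
import duckdb

# Pull the table into a native DuckDB file, apply the change there, then rewrite
# the parquet output that gets pushed back to the remote bucket.
con = duckdb.connect("scratch.duckdb")
con.execute("CREATE OR REPLACE TABLE orders AS SELECT * FROM read_parquet('lake/orders/*.parquet')")
con.execute("DELETE FROM orders WHERE o_orderstatus = 'F'")
con.execute("COPY orders TO 'out/orders.parquet' (FORMAT PARQUET)")
# ...then overwrite the folder in the storage account (fsspec, azcopy, etc.)
```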

Note: my experience with Python Delta tables (not Spark) was mainly on GCP; it turns out Azure has better support (it can read, write, show history, vacuum). Still, as of today I think Arrow datasets are more stable.

Final Thoughts

Regardless of what you use, I think it is important to ask your vendor: what's your solution for smaller data? Are you paying a premium for a big data solution, and is it justified by your workload?

Apache Spark Benchmark for TPCH-SF10

TL;DR: a notebook to compare TPCH-SF10 results for Spark and DuckDB on a single node.

Apache Spark is a system designed for big data workloads using distributed computing; it scales out very well just by adding more compute nodes. In real life, though, it is used a lot with smaller data: people use whatever they have access to.

I don't have a lot of experience with PySpark, but I know enough to run some SQL queries, and what better way to test that than the TPC-H benchmark? In this case, I used a scale factor of 10, where the main table, lineitem, has 60 million records.

The test was run in Google Colab using the free compute (2 vCPUs). I did try it with a paid VM (8 CPUs) but strangely PySpark had a memory error (although it worked fine with low memory, go figure).

I tried to install it on my laptop too, but it complained about some Java compatibility issue, and I have no time for this kind of shenanigans; I used to love it when I was much younger, but now I just want stuff to work.

The data is first downloaded to the local disk, so all queries are pretty much CPU bound. I used DuckDB as a reference as it has excellent single-node performance.

The Beauty of SQL

PySpark has excellent support for SQL; actually the same SQL is run for both Spark and DuckDB. We are talking 22 queries with nearly 900 lines of code, and it is impressive that it just works without modification on both engines.

The Results

I used a Python function to run every query and show the duration and results.
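The helper looked roughly like the sketch below; it assumes the TPC-H tables are already registered as views in both engines:

```python
import time

import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpch-sf10").getOrCreate()
con = duckdb.connect()

def run_query(name: str, sql: str) -> None:
    t0 = time.time()
    spark.sql(sql).collect()              # tables registered as Spark temp views
    t_spark = time.time() - t0

    t0 = time.time()
    rows = con.sql(sql).fetchall()        # same tables as DuckDB views over parquet
    t_duck = time.time() - t0

    print(f"{name}: spark={t_spark:.1f}s duckdb={t_duck:.1f}s rows={len(rows)}")
```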

And here are the results, using PySpark, DuckDB running queries directly from parquet, and, just for reference, DuckDB running from a DuckDB file:

DuckDB around 5 minutes

Spark around 21 minutes

DuckDB using the DuckDB file format: 2 minutes (it takes 3 minutes to import)

Takeaway

For smaller datasets that don't fit in memory, which is a problem for Pandas, it is useful to look around for alternatives before jumping to a distributed system; you don't want to be paying for expensive compute overhead. Tools like DuckDB and Polars can do wonders these days. There is a lot to like about Spark, but performance on a single node is not one of them.

Having said that, those new engines have a lot of work to do to reach feature parity with Spark for data engineering jobs; for example, support for table formats like Iceberg and Delta is still experimental.

First Look at Open Table Format

TL;DR: I wrote a simple notebook showing how to create a Delta table using PySpark, delete some rows and run some queries using DuckDB.
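A rough sketch of the notebook flow (paths, configs and the delete predicate are illustrative, not the exact notebook):

```python
import duckdb
from deltalake import DeltaTable          # delta-rs Python bindings
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable as SparkDeltaTable
from pyspark.sql import SparkSession

# 1) Create a Delta table with PySpark
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.read.parquet("parquet/lineitem").write.format("delta").save("/tmp/delta/lineitem")

# 2) Delete some rows
SparkDeltaTable.forPath(spark, "/tmp/delta/lineitem").delete("l_quantity > 49")

# 3) Query the current snapshot with DuckDB: deltalake gives the list of live
#    parquet files, which DuckDB can scan directly.
files = DeltaTable("/tmp/delta/lineitem").file_uris()
print(duckdb.sql(f"SELECT count(*) FROM read_parquet({files})").fetchall())
```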

When you write a record in a database like SQL Server or SQLite, it needs to be stored somewhere, and usually that location was your disk. Your local storage system is very powerful: you can write, read and modify existing files, and it is very fast, so this problem was already solved 50 years ago. Why should we care about data lake table formats (Delta, Iceberg, etc.)?

The Internet happened

Internet tech companies started getting massive amounts of data generated by the exponential use of the internet; saving that data on the local disks of a big database was just too expensive, and there was a need for a different approach.

Separation of Storage and Compute

The solution was very simple: separate the computers that run queries from the system that stores the data, so both can be scaled independently. If you don't need to run analytical queries, fine, just turn the compute off and keep adding data to your storage system. That was a genius idea!!! (I think Google's MapReduce was first, then Yahoo basically copied the idea by creating the open source Hadoop, but it was slow, and then Spark took over.)

AWS S3

Another important event: in 2006, AWS released S3, a massive object store. Basically you can save your files for a very low cost, but there is a catch: the files are immutable, you can't simply open a file, make some modifications and save it back.

Basically, you can only add and delete files.

Table formats solve this limitation with some tricks. For example, when you want to delete some records, the DB just adds a new file recording which records were deleted, and later when you read the table, you ignore those records (this is called merge-on-read).

Another approach is for the DB to simply write a new file that doesn't contain the deleted records and flag the existing file as removed, so when you read the table later, you read the new file (copy-on-write). As an example, if you have a table with 1 million records and want to delete 1 record, the DB will write a new file with 999,999 records; that doesn't seem very efficient to me!!!

Proprietary Table Format

Commercial DB vendors like BigQuery (Capacitor) and Snowflake (PAX) have used such formats internally since day one, but they are proprietary: you can't read them with your own query engine, although BigQuery provides an API that talks directly to the storage.

My understanding is that at that time, open source systems (Hadoop, Spark) used a very simple table format called Hive, something that looks like this:
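Roughly, a Hive-style table is nothing more than a partitioned folder of files that engines can read directly; a sketch with a hypothetical layout:

```python
# A "Hive style" table is just a partitioned folder of files, e.g.:
#   sales/
#     year=2022/month=01/part-0000.parquet
#     year=2022/month=02/part-0000.parquet
#     ...
import pyarrow.dataset as ds

table = ds.dataset("sales/", format="parquet", partitioning="hive").to_table()
```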

But it had a lot of problems, basically no ACID support. As an example, if someone is adding files while you are reading the table, you will get wrong results, and if you overwrite a table and your script times out for any reason, your table will be corrupted (I learnt that the hard way).

Open Source Table Format

Open source developers understood that they needed to do something about it, and they knew commercial DB vendors had already solved this problem; the result was the three major formats.

But at the very basic level, a table storage format is just a folder that contains:

  • Data, usually in parquet format
  • Metadata that contains the statistics of every file and its location

Something like this (obviously the exact implementation details differ for every format).

Because it is open source, you will always find some developers who will build something slightly different (see Databend as an example), and even new proprietary vendors (see Firebolt's F3).

I don't delete, why should I care?

That's a respectable position: if all you do is read-only and never delete, one may argue that a folder of parquet files with decent Hive-style partitioning is all you need. I still think that even in that use case a table format may be worth it: listing a directory is an extremely slow operation and costs money; instead you can just read a metadata file and know exactly the location of every file. Not only that, some query engines are smart enough to read the statistics from the metadata file and skip files that don't contain the data needed for the query.

Another example: if you want to count the total records of a table, or get a min/max, you can potentially just read it from the metadata and it will take milliseconds.
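As an illustration, with the Python deltalake package you can poke at that metadata without touching the data; a sketch with hypothetical paths (exact column names depend on the library version):

```python
from deltalake import DeltaTable

dt = DeltaTable("lake/orders")
print(dt.files())  # the file list, no expensive directory listing needed

# Per-file statistics (row counts, min/max values) live in the log as well.
stats = dt.get_add_actions(flatten=True).to_pandas()
print(stats[["path", "num_records"]].head())  # column names may vary by version
```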

Which one is better?

If you google it, you will find all kinds of articles about why one table format is better than the others. It is a very technical subject and I don't know enough to comment on it, but my impression is that Delta is rather controlled by Databricks (and some features are still proprietary), which makes the other vendors a bit suspicious; hence they are more in the Iceberg camp (Snowflake, Trino, Dremio and, to a lesser extent, BigQuery).

Should you care about table formats?

Actually, we should not. I never cared where SQL Server saves its data; it is just there, and this is exactly what is happening with data lake vendors: they are slowly turning into a classical DW approach. Surprisingly, nowadays the best practice is to read the table using a catalog, as there is no way to have proper row level security or column masking just by reading from files. (It is a bit ironic; it is like 2010 again.)

Another interesting twist: cloud DW vendors like Snowflake are adding support for open storage formats as native tables (see this excellent presentation), and Trino and Dremio already have solid read and write support for Iceberg.

Microsoft's and Google's approach is to use Spark for writing the table format and their DW offering for reads. I find this approach rather lazy; maybe it is just a workaround till they have native support.

I suspect we will see a commoditization of open table formats in the coming year(s), and users will just go back to caring about what really matters:

Cost, Performance and Concurrency.