Benchmarking Snowflake, Databricks, Synapse, BigQuery, Redshift, Trino, DuckDB and Hyper using TPCH-SF100

(Disclaimer: I use BigQuery for a personal project, where it is virtually free for smaller workloads; at work we use SQL Server as a data store. I will try my best to be objective.)

TL;DR

I ran the TPCH-SF100 benchmark (the base table has 600 million rows) to understand how different engines behave on this workload using just the lowest-tier offering. You can download the results here

Introduction

I was playing with the Snowflake free trial (maybe for the fifth time) and, for no particular reason, I ran the queries on the TPCH-SF100 dataset. I am usually interested in smaller datasets, but I wondered how Snowflake would behave with bigger data on the smallest cluster. Long story short, I got 102 seconds, posted it on LinkedIn, and a common reaction was that Snowflake is somehow cheating.

Obviously I did not buy the cheating explanation, as it would be too risky and Databricks would make it international news.

Load the Data Again

Ideally I would have generated the data myself and loaded it into Snowflake, but generating 600 million records on my laptop is not trivial. My tool of choice, DuckDB, has a utility for that, but it is currently single-threaded. Instead:

I exported the data from Snowflake to Azure Storage as Parquet files.
I downloaded it to my laptop and generated new files using DuckDB, because in Snowflake you can control the maximum file size but not the minimum.
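
The rewrite step could look roughly like the sketch below. It is not the exact script I used: the folder names, the helper `build_rewrite_sql` and the row group size are assumptions made for illustration. Only `read_parquet` and `COPY ... TO ... (FORMAT PARQUET)` are actual DuckDB features.

```python
# Sketch: merge the many small Parquet files exported by Snowflake into one
# larger file per table using DuckDB. Paths and sizes are illustrative only.

TPCH_TABLES = ["customer", "lineitem", "nation", "orders",
               "part", "partsupp", "region", "supplier"]


def build_rewrite_sql(table, src_dir="snowflake_export", dst_dir="tpch_sf100",
                      row_group_size=1_000_000):
    """Return a COPY statement that reads every small file of one table
    and writes a single Parquet file (hypothetical folder layout)."""
    return (
        f"COPY (SELECT * FROM read_parquet('{src_dir}/{table}/*.parquet')) "
        f"TO '{dst_dir}/{table}.parquet' "
        f"(FORMAT PARQUET, ROW_GROUP_SIZE {row_group_size})"
    )


def rewrite_all():
    """Run the rewrite for all eight tables; dst_dir must already exist."""
    import duckdb  # imported lazily so the helper above works without it

    con = duckdb.connect()
    for table in TPCH_TABLES:
        con.execute(build_rewrite_sql(table))
    con.close()
```

Calling `rewrite_all()` after downloading the export would produce one file per table; writing several files of a chosen size instead would need a different COPY layout.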

Snowflake Parquet External Table

My plan was to run the queries directly on Parquet files hosted on Azure Storage. The experience was not great at all: Snowflake got the join order of Query 5 wrong.

Snowflake Internal Table

I loaded the Parquet files generated by DuckDB, and Snowflake got extremely good results. What I learned: whatever magic Snowflake is doing, it is related to their proprietary file format.

BigQuery External Table

I have no frame of reference for this kind of workload, so I loaded the data to BigQuery using an external table in Google Cloud. Google got 5 minutes, one run, 2.5 $!

BigQuery Internal Table

I loaded the data into BigQuery's internal format (note that BigQuery doesn't charge for this operation): 2 minutes 16 seconds, 1 cold run.

BigQuery Standard Edition

BigQuery added a new pricing model where you pay by the second after the first minute. I used the Standard Edition with a small size and ran the same queries twice. Unfortunately the new distributed disk cache doesn't seem to be working: same result, 5 minutes. That was disappointing.

Redshift Serverless

I imported the same Parquet files into Redshift Serverless. The schema was defined without distribution keys. The results are for 3 runs: the first run was a bit slower as it fetches the data from managed storage to the compute SSD, and the other 2 runs were substantially faster, so I thought it fair to take an average. I used the lowest tier, 8 RPU (2.88 $/hour).

The Redshift Serverless hot run was maybe the fastest performance I have seen so far, but they still need to improve their cold run.

I was surprised by the system's overall performance. From my reading, it seems AWS basically rewrote the whole thing, including separating compute from storage. Overall I think it is a good DWH.
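
To make the averaging and the hourly rate concrete, here is a minimal sketch of how a duration turns into a cost per run. It assumes billing is simply proportional to run time at the hourly rate, with no minimum charge modeled, and the three durations are made-up placeholders, not my measured results.

```python
# Sketch: turn a measured duration into a cost, assuming billing is simply
# proportional to run time at the hourly rate (no minimum charge modeled).

RPU_8_PRICE_PER_HOUR = 2.88  # $/hour for the 8 RPU tier quoted in this post


def average_seconds(run_seconds):
    """Mean duration of several runs (cold and warm mixed together)."""
    return sum(run_seconds) / len(run_seconds)


def cost_per_run(seconds, price_per_hour=RPU_8_PRICE_PER_HOUR):
    """Cost in dollars of keeping the compute busy for `seconds`."""
    return price_per_hour * seconds / 3600


# Hypothetical durations, NOT the measured results: one cold run, two warm.
runs = [150.0, 90.0, 90.0]
avg = average_seconds(runs)
print(f"average: {avg:.0f} s, cost: {cost_per_run(avg):.3f} $")
```

Averaging hides the cold/warm gap, so it is worth keeping the individual run times around as well.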

Trino

Trino did not run Query 15; I had to use a modified syntax, but it gives the same results. 1 run from cold storage. I am using the excellent service from Starburst Data.

Synapse Serverless

Honestly, I was quite surprised by the performance of Synapse Serverless. Initially I tested with the smaller files generated by Snowflake and it did work: the first run failed but the second worked just fine. I did like that it failed quickly. Note that Synapse computes statistics on Parquet files, so you would expect more stable performance: not the fastest, but rather resilient.

Anyway, it took 8-11 minutes. To be clear, that's not the Synapse of two years ago.

Not related to the benchmark, but I did enjoy the lake database experience.

Databricks External Table

I did not have a great experience with Databricks. I could not simply pass authentication through to Databricks SQL; you need a service principal and to register an app, and the documentation keeps talking about Unity, which is not installed by default. This is a new install, so why is Unity not embedded if it is such a big deal?

Anyway, first I created an external table in Databricks using the excellent passthrough technique on the Single Node cluster. Databricks got 12 minutes.

Databricks Delta table

Let's try again with Delta. I created a new managed table and ran OPTIMIZE and ANALYZE (I always thought Delta already had the stats), but it didn't seem to make a big difference: still around 11 minutes, and this was running from disk, so no network bottleneck.

DuckDB

My plan was to run DuckDB on Azure ML, but I needed a bigger VM than the one provided by default, and I could not find a way to increase my quota. I know it sounds silly, and I am just relating my experience, but it turns out the Azure ML VM quota is different from the Azure VM quota. It drove me crazy that I could get any VM in Databricks while Azure ML kept complaining I didn't have enough CPU.

Unfortunately I hit two bugs. First, the native DuckDB file format seems to generate double the size of Parquet. The dev was very quick to identify the issue; the workaround is to define the table schema and then load the data using INSERT. The file became 24 GB, compared to the original 40 GB Parquet files.

I ended up going with Parquet files; I was not really excited about loading a 24 GB file into a storage account.

I ran the queries on an Azure Databricks VM, E8ds_v4 (8 cores and 64 GB of RAM).

As I am using fsspec with a disk cache, the remote storage is used only on the first run. After 4 tries, Query 21 kept crashing the VM 😦
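
For reference, the workaround looks something like the sketch below, shown for `lineitem` only. The column names are the standard TPC-H ones; the column types, the database file name and the Parquet path are assumptions for illustration, not necessarily what I used.

```python
# Sketch of the workaround: create the table with an explicit schema first,
# then INSERT from Parquet. Types and paths below are assumptions.

LINEITEM_DDL = """
CREATE TABLE lineitem (
    l_orderkey      BIGINT,
    l_partkey       BIGINT,
    l_suppkey       BIGINT,
    l_linenumber    INTEGER,
    l_quantity      DECIMAL(15,2),
    l_extendedprice DECIMAL(15,2),
    l_discount      DECIMAL(15,2),
    l_tax           DECIMAL(15,2),
    l_returnflag    VARCHAR,
    l_linestatus    VARCHAR,
    l_shipdate      DATE,
    l_commitdate    DATE,
    l_receiptdate   DATE,
    l_shipinstruct  VARCHAR,
    l_shipmode      VARCHAR,
    l_comment       VARCHAR
)
"""

# SELECT * relies on the Parquet columns being in the same order as the DDL.
LOAD_SQL = "INSERT INTO lineitem SELECT * FROM read_parquet('{path}')"


def load_lineitem(db_file="tpch.duckdb",
                  parquet_path="tpch_sf100/lineitem.parquet"):
    """Create the table explicitly, then load it with INSERT."""
    import duckdb  # imported lazily; needs the duckdb package installed

    con = duckdb.connect(db_file)
    con.execute(LINEITEM_DDL)
    con.execute(LOAD_SQL.format(path=parquet_path))
    con.close()
```

The other seven tables would follow the same pattern, each with its own DDL.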

Tableau Hyper

Tableau Hyper was one of the biggest surprises. Unfortunately I hit a bug with Query 18; otherwise, it would have been the cheapest option.

Some Observations

Initially I was worried I had made a mistake in the Snowflake results; the numbers are just impressive for a single-node tier. One explanation is that the execution engine mostly operates on compressed data with little materialization, but whatever they are doing, it has to do with the internal table format, which brings up a whole discussion of performance vs openness. Personally, in a BI scenario, I want the best performance possible, and I wonder if they can get the same speed using Apache Iceberg.

Synapse Serverless improved a lot from last year. It worked well regardless of the size of the individual Parquet files I threw at it, and in my short testing it was faster than Databricks. You pay by data scanned, so strictly speaking pure speed is not such a big deal, but without a free result cache like BigQuery's, it is still a hard sell.

The Azure ML quota policy was very confusing to me, and honestly I don't want to deal with a support ticket.

Databricks may well be the fastest at running 100 TB, but for a 100 GB workload, color me unimpressed.

DuckDB is impressive for an open source project that has not even reached version 1. I am sure those issues will be fixed soon.

Everything I heard about Redshift on Twitter was wrong; it is a very good DWH with excellent performance.

BigQuery, as I expected, has excellent performance for both Parquet and the native table format. ~~The challenge is to keep the same performance using the new autoscale offering.~~ Update: I added the autoscale performance; I think Google should do better.

Summary Results

You can find the results here. If you are a vendor and you don't like the results, feel free to host a TPCH-SF100 dataset in your service and let people test it themselves.

Note: the timings come from the SQL query history. BigQuery: one cold run. Synapse Serverless, Redshift Serverless and Snowflake: a mix of cold and warm runs.

(Note: Synapse Serverless always reads from remote storage.)

For Databricks I am showing the best run from disk; there is no system table, so I had to copy and paste the results from the console.

Pricing

I did not keep the durations for data load; this is just the cost of reads. Obviously it is a theoretical exercise and does not reflect real-life usage, which depends on other factors like concurrency performance, how you can share a pool of resources across multiple departments, free results cache, the performance of your ODBC drivers, etc.

It is extremely important to understand what's included in the basic price, for example:

Results cache:

The BigQuery, Snowflake and Redshift results caches are free and you don't need a running cluster; in Databricks you pay for it, and Synapse doesn't offer a result cache at all.

Data loading:

BigQuery data loading is a free operation, as are other services like sorting and partitioning; in other DBs you need to pay.

Egress fees:

Snowflake and BigQuery offer free egress; with other vendors you may pay, so you need to check.

Note:

BigQuery: for this workload it makes more sense to pay by compute rather than by data scanned, using either autoscale or reserved pricing. I will try to test autoscaling later.

Snowflake: I used the Standard edition of Snowflake.

Edit: I used a Google Colab notebook with a bigger VM for Hyper and DuckDB; see the full reproducible notebook.
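
The trade-off between the two billing models can be sketched as below. Every number in the example is a placeholder, not a quoted price or a measured value: check the current price list before drawing conclusions. The only thing taken from this post is the one-minute minimum before per-second billing kicks in.

```python
# Sketch: compare the two billing models for one run of the benchmark.
# All prices and sizes used in the example are placeholders.

def on_demand_cost(tb_scanned, price_per_tb):
    """Pay by data scanned: duration does not matter."""
    return tb_scanned * price_per_tb


def capacity_cost(seconds, slots, price_per_slot_hour, minimum_seconds=60):
    """Pay by compute: billed per second, with a one-minute minimum."""
    billed_seconds = max(seconds, minimum_seconds)
    return slots * price_per_slot_hour * billed_seconds / 3600


# Placeholder inputs: 1 TB scanned at 5 $/TB versus 100 slots for 5 minutes.
scanned = on_demand_cost(tb_scanned=1.0, price_per_tb=5.0)
compute = capacity_cost(seconds=300, slots=100, price_per_slot_hour=0.04)
print(f"on-demand: {scanned:.2f} $, capacity: {compute:.2f} $")
```

With pay-by-compute a slower run costs more, so the comparison flips once the run gets long enough or the slot count grows.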

Final Thoughts

Cloud DWHs are amazing tech, and only competition can drive innovation, not FUD and dishonesty. Regardless of what platform you use, keep an eye on what other vendors are doing and test using your own workload; you may be surprised by what you find.

13 thoughts on “Benchmarking Snowflake, Databricks, Synapse, BigQuery, Redshift, Trino, DuckDB and Hyper using TPCH-SF100”

SUSHANT JAIN says:

March 9, 2023 at 10:44 pm

The numbers would be more unbiased and rational if you could sample data that is not associated with Snowflake.


1. mim says:
  
  March 9, 2023 at 11:32 pm
  
  I said the data was rewritten using DuckDB.
  
  
  1. DWH says:
    
    June 22, 2023 at 3:21 am
    
    TPC-H sample table data from Snowflake has already been sorted on their cluster keys, which can differ from TPC-H out of the box. I suggest using the TPC-H data generator tool or an unbiased TPC-H source in future benchmarks.
    
  2. mim says:
    
    June 22, 2023 at 11:36 am
    
    I did 🙂 https://datamonkeysite.com/2023/04/10/the-unreasonable-effectiveness-of-snowflake-sql-engine/
    Same results: Snowflake is very fast.
    
Hi says:

March 10, 2023 at 3:34 am

terrible comparison. Author needs to learn how to use all platforms before benchmarking


Nikhil Lakshman says:

March 10, 2023 at 11:41 am

Interesting insights… any idea if x-small for Snowflake and 2x-small for Databricks use the same underlying VM configuration? Can you share those details as well?


1. mim says:
  
  March 10, 2023 at 1:32 pm
  
  Databricks uses 2 Standard_E8ds_v4 VMs (1 worker, 1 driver).
  Snowflake doesn't disclose the hardware spec, but it seems to use 1 VM with 8 cores as a worker; the driver's job is handled by the service layer.
  
  
Isaac says:

March 10, 2023 at 8:11 pm

Hi!
This is an interesting post, but why didn't you also test the current leader in the TPC-H benchmarks, Exasol?
Indeed, they have been the leader for the last 14 years in a row… Sadly other vendors like Snowflake never took part in this official benchmark, audited by independent consultants. Maybe this is what they want to avoid…

In any case, here you have a link to the smallest TPC-H test run by Exasol:

Click to access hpe~tpch~1000~hpe_dl325_gen10~fdr~2021-04-02~v02.pdf

It is the 1TB (or 1,000GB test), run in 2019 with an already outdated version of Exasol, 6.2. Current Exasol version, 7.1, is way faster, so today’s results would be better.
If I have the chance I’ll take a look at that TPCH-SF100. If it is the 100GB test by the TPC-H, I think it will fit into the Exasol virtual machine I have in my laptop. It is so small that Exasol never bothered to officially benchmark it in the TPC-H.

By the way, one single run of the whole 22 queries in the TPC-H 1,000GB, the one linked above, takes just 24.46 seconds in Exasol (check the tables in the report).

Oh, and one great thing about Exasol is that it is really easy to use. If you can manage with SQL you know 90% of it, the rest is easy to find in the manual and as it is totally self tuning, you don’t need to configure or tweak anything to reach maximum speed from the very first moment.

(Disclaimer: Yes, I work for Exasol. Yes, I encourage you to try it in any of its flavours, trial is free).


1. raja says:
  
  March 15, 2023 at 2:20 pm
  
  Can you include price per run as part of the results? Let's say two vendors' lowest tiers ran in almost equal time. But what if the lowest tier of one vendor costs twice as much as the other? Price per run would be the most relevant metric for a data warehouse.
  
  
  1. mim says:
    
    March 16, 2023 at 8:10 am
    
    done
    
2. Fred says:
  
  May 22, 2024 at 10:32 pm
  
  Exasol did do a 100GB test run. Twice in the past with v4.0 in 2011 and v5.0 in 2014.
  
  The 100GB test run in 2011 using EXASolution 4.0 used only 2 servers with 24GB of RAM each and X5690 CPUs. The power run of 22 TPC-H queries took 34 seconds.
  
  https://tpc.org/results/individual_results/Dell/Dell-R710.100G-2N.110405.01.es.pdf
  
  The 100GB test run in 2014 using EXASolution 5.0 used 6 servers with 16GB of RAM each and e5-2680v2 CPUs. The power run of 22 TPC-H queries took 9 seconds.
  
  https://www.tpc.org/results/fdr/tpch/dell~tpch~100~dell_poweredge_r720xd_using_exasolution_5.0~fdr~2014-09-23~v02.pdf
  
  I would love to see a run of v8. Too bad they don’t publish TPC-H and TPC-DS benchmarks anymore. Or any on the internet for that matter. Would be interesting to see how they would do on an SAP BI BW benchmark. Would also be cool to see how they compare doing ClickBench: https://benchmark.clickhouse.com/
  
  (Disclaimer: No, I don’t work for Exasol. I just want them to benchmark more or create a duckdb competitor for everybody to use.)
  
  
