I was experimenting in a Fabric notebook, running TPC-H benchmarks and gradually increasing the data size from 1 GB to 1 TB, using Spark and DuckDB (reading both Parquet files and DuckDB's native format). It is not a very rigorous test; I was more interested in the overall system behavior than in specific numbers, and again, benchmarks mean nothing, the only thing that matters is your actual workload.
The notebook cluster has 1 driver and up to 9 executors; Spark dynamically adds and removes executors. DuckDB is a single-node system and runs only on the driver.
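To make the setup concrete, here is a minimal sketch of how such a side-by-side run can be wired up in a Fabric notebook. The lakehouse paths and the (simplified, TPC-H Q1-style) query are hypothetical placeholders, not the exact files or queries used in the test.

```python
import time
import duckdb

# Hypothetical lakehouse path to the TPC-H lineitem Parquet files
PARQUET_PATH = "/lakehouse/default/Files/tpch/lineitem/*.parquet"

QUERY = """
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity)      AS sum_qty,
           SUM(l_extendedprice) AS sum_base_price
    FROM {table}
    WHERE l_shipdate <= DATE '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
"""

# DuckDB runs entirely on the driver node
con = duckdb.connect()
t0 = time.time()
con.sql(QUERY.format(table=f"read_parquet('{PARQUET_PATH}')")).show()
print(f"DuckDB (Parquet): {time.time() - t0:.1f}s")

# Spark distributes the same scan/aggregation across the executors
# ('spark' is the session already provided by the Fabric notebook)
spark.read.parquet("Files/tpch/lineitem").createOrReplaceTempView("lineitem")
t0 = time.time()
spark.sql(QUERY.format(table="lineitem")).show()
print(f"Spark (Parquet): {time.time() - t0:.1f}s")
```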
Some observations
- When the data fits on one machine, it runs faster than on a distributed system (assuming the engines have comparable performance); in this test, up to 100 GB, DuckDB on Parquet is faster than Spark.
- It is clear that DuckDB is optimized for its native file format, and there is still plenty of room to improve its Parquet performance (see the conversion sketch after this list).
- At 700 GB, DuckDB's native file format is still competitive with Spark running on multiple nodes. That is a very interesting technical achievement, although not very useful in the Fabric ecosystem, since only DuckDB can read that format.
- At 1 TB, DuckDB on Parquet timed out and the performance of the native file format degraded significantly, an indication that a bigger machine is needed.
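For reference, here is a rough sketch of how the "native format" runs can be set up: load the Parquet files once into a DuckDB database file, then query the native table instead of re-scanning Parquet. The file names and paths are hypothetical.

```python
import duckdb

# Hypothetical path for the DuckDB database file
con = duckdb.connect("/lakehouse/default/Files/tpch.duckdb")

# One-time load: rewrites the Parquet data into DuckDB's own storage format
con.sql("""
    CREATE TABLE IF NOT EXISTS lineitem AS
    SELECT * FROM read_parquet('/lakehouse/default/Files/tpch/lineitem/*.parquet')
""")

# Subsequent queries hit the native table directly
con.sql("SELECT COUNT(*) FROM lineitem").show()
```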
Although I am clearly a DuckDB fan, I appreciate Spark's dynamic allocation of resources; Spark is popular for a reason 🙂
Yes, I could have used a bigger machine for DuckDB, but that is a manual process and changes based on the specific workload. One can imagine a world where a single node gets resources dynamically based on the workload.
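For illustration only, these are the standard Spark settings behind dynamic executor allocation. In a Fabric notebook the session is created and largely configured for you, so this is a sketch of the mechanism, not the exact configuration used in the test.

```python
from pyspark.sql import SparkSession

# Standard dynamic allocation settings; the executor counts here simply
# mirror the 1-driver / up-to-9-executors cluster described above.
spark = (
    SparkSession.builder
    .appName("tpch-dynamic-allocation")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "9")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```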
I think the main takeaway is: if a workload fits on a single machine, that will give you the best performance you can get.
Edit: to be clear, so far the main use case for something like DuckDB inside Fabric is cheap ETL. I have used it for more than a year and it works great, especially since single-node notebooks in Fabric start in less than 10 seconds.
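As a final illustration, this is the kind of cheap ETL I mean: read raw files, do the transformation in DuckDB on the single driver node, and write the result back out as Parquet. Paths and column names are hypothetical.

```python
import duckdb

con = duckdb.connect()

# Aggregate raw CSV files and write the curated result as a Parquet file
con.sql("""
    COPY (
        SELECT CAST(order_date AS DATE) AS order_date,
               customer_id,
               SUM(amount) AS total_amount
        FROM read_csv_auto('/lakehouse/default/Files/raw/orders/*.csv')
        GROUP BY 1, 2
    )
    TO '/lakehouse/default/Files/curated/daily_orders.parquet' (FORMAT PARQUET)
""")
```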