First Impression of Databricks SQL

Edit: 23 Dec 2021, Shant Hovsepian from Databricks was kind enough to give me some suggestions; blog updated.

Edit : 25 Dec 2021, added a note regarding Delta Lake open storage format.

This is another short blog in a series of first impressions of different data warehouse engines and how they behave under a BI workload. I am particularly interested in small datasets, and mainly looking at concurrency and latency.

How to conduct a simple Test

The approach is the same: build a simple PowerBI report using Direct Query (you can use Tableau with a live connection, or your favorite BI tool).

I run one instance of the report (I like to use a play-axis slicer to simulate user interaction), then a second instance, and so on, and see how the engine behaves.

Setup Databricks SQL

Databricks has done an amazing job. The workspace is very neat and intuitive, there is no mention of the word Spark at all, and they have hidden all the complexity; as far as I am concerned, it acts like any cloud data warehouse. I had some hiccups though: Azure complained about a CPU quota, which was easy to fix but very annoying. The Google Cloud setup was easier, but as of this writing there is no SQL interface yet and you have to pay a 100-200 $ cost for Kubernetes, so I ended up using Azure.

When you explore a new SQL database, the first thing you check is the sample data. Strangely, it asks for a compute to be running just to have a look at the metadata.

Creating a new compute is very simple, and I really like that you can select which version of the engine you want to run, Current or Preview. Auto Stop works as expected, but it is really slow to start a cluster (around 4 minutes).

The minimum duration for Auto Stop is 10 minutes (in practice, you should increase it to 1 hour; a user waiting 5 minutes for their report to start is not fun).

Databricks, at least in Azure, is a platform as a service: when you create a new cluster, it uses your own resources to provision new VMs (CPU, disk, etc.), so it is no surprise that starting a new cluster takes so long.

I noticed that when you resize a cluster, the engine goes offline; it probably makes sense to use auto scaling instead.

For the pricing, you pay in Databricks Units (DBU), 1 DBU = $0.22/hour, plus the underlying resources. I can’t find the reference, but it seems a 2X-Small requires 2 CPUs at a cost of $0.64/hour each.

So the total for my test cluster is 4 × $0.22 + 2 × $0.64 = $2.16/hour.

Testing PowerBI

Connecting to PowerBI and Tableau was literally two clicks away; in PowerBI you click a link and it generates a PowerBI report file. Fantastic.


I used some sample data provided by Databricks; the main fact table is 1 GB and has 30 million records.

Here is my PowerBI data model.

And here is the PowerBI report: it basically loops on customer key and generates some simple aggregates; the report generates 3 SQL queries every 5 seconds.

When I ran only one instance of the report, it worked rather well; with a second instance it still behaved well, but when I added a third instance, it became unusable and queries started to be added to a queue. The cluster did not keep up with the workload, and I was surprised by the results.

I noticed something interesting: it seems Databricks does not support a result cache; what is currently supported is an SSD cache. The engine caches the raw data on the local SSD, but it will re-run the same query even when the table has not changed and the SQL expression is identical.
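A quick way to see this for yourself (a minimal sketch against the same sample lineitem table used below; the exact table path may differ in your workspace): run the same aggregate twice in a row without touching the table and watch the query history. With a result cache the second run would return almost instantly with no data read; here both runs scan data, the second one just reads faster from the SSD cache.

-- run this twice in a row without changing the table;
-- both runs report bytes read in the query profile,
-- the second is only faster because of the local SSD cache
select
  l_returnflag,
  sum(l_quantity) as total_qty,
  count(1) as row_count
from
  hive_metastore.default.lineitem
group by
  l_returnflag;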

Using the Performance Analyzer in PowerBI, queries return in around 1.5 to 3 seconds; definitely not sub-second territory here.

Second try

It turns out the sample data provided by Databricks is located in Washington state. In my defence, I thought that when you create a new account they copy the data into your account; that is not the case.

I copied the same data into my local storage.

I redid the same test using 4 instances of PowerBI instead of 3, and Databricks behaved way better!

Result Cache

The result cache implementation in Databricks is a bit unusual. Take this query as an example:

select
  `o_custkey`,
  sum(`l_quantity`) as `C1`,
  count(1) as `C2`
from
  (
    select
      `OTBL`.`l_quantity`,
      `ITBL`.`o_custkey`
    from
      `hive_metastore`.`default`.`lineitem` as `OTBL`
      inner join (
        select
          `o_orderkey`,
          `o_custkey`,
          `o_orderstatus`,
          `o_totalprice`,
          `o_orderdate`,
          `o_orderpriority`,
          `o_clerk`,
          `o_shippriority`,
          `o_comment`
        from
          `hive_metastore`.`default`.`orders`
        where
          `o_custkey` in (5, 11, 140, 4)
      ) as `ITBL` on (`OTBL`.`l_orderkey` = `ITBL`.`o_orderkey`)
  ) as `ITBL`
group by
  `o_custkey`
limit
  1000001

The first run took 2.75 seconds.

The second run, which should be cached as I did not change the tables, took 1.19 seconds, which is not a great result. I know Snowflake's result cache returns in around 50 ms, and BigQuery's in around 100-200 ms. If I understood correctly, because Databricks uses an open storage format, it always has to go back to Azure storage and check whether something has changed, which introduces extra latency.

Random Thoughts

Databricks SQL is a data warehouse; in my opinion, all this talk about the lakehouse is just a distraction. As far as I can see, it is a solid DWH with an open storage format (which is a good thing), it is multi-cloud (a big advantage), and the team is investing a lot in new functionality.

I am aware that the biggest competitive advantage of Databricks compared to Snowflake is Delta Lake, its open table format: basically you can read your data for free (see an example here with PowerBI) or use literally another compute engine. But in this first look, I was only interested in query performance, not the overall architecture. (BigQuery has an open Storage API, but it is not free.)

Engine startup time is really slow; 5 minutes is too much, especially when other vendors offer 5-second startup (looking forward to the serverless preview in Azure).

I am not sure exactly what is going on with concurrency; Databricks seems really good at aggregating massive data, but I am not sure it is suited for the high-concurrency, low-latency profile needed for interactive BI workloads.

Databricks got me interested in their engine. I need to do further testing, but it seems we have another interesting Azure DWH offering (beside Snowflake).

Having quickly tested Azure Synapse (both Serverless and Dedicated) and Snowflake, I think that for high-concurrency, low-latency, small-dataset workloads, Snowflake has the advantage, Databricks comes second, and Synapse does not support this workload at all.

Some readers thought I was a bit unfair to Databricks, but that was not the intention. As far as I am concerned, in the last 10 years we have seen some serious innovation in the data warehouse space:

BigQuery separating Storage from Compute.

Snowflake introducing the per-second pricing model (with a 60-second minimum), and being multi-cloud.

Databricks going even further and making the Storage open, so your data is not tied to one Engine.

First Look at Synapse Dedicated Pool

Edit: March 2022. I am keeping the blog post as it was my first impression at the time, but I have to admit I was wrong: Dedicated Pool is designed for big data workloads, starting from terabytes, and it does not work well at all for small datasets. Basically, I was testing it with a load it was not designed for.

For data sizes around 100 GB, SQL Server with a columnstore index is probably a better option.

In a previous blog, I showed that Synapse Serverless is not great for interactive BI workloads, which is rather obvious, as even Microsoft doesn't recommend it as a best practice.

The logical next step is to have a look at Dedicated Pool (formerly SQL DW), where you don't pay by data scanned but rather by reserved resources; this model makes more sense for a steady interactive workload.

To be very clear, this is a genuine first impression. I did not try any optimization (to be honest, I don't even know how to do it); all I wanted was to get a feel for the product.

Creating a new cluster is trivial. Note that I already have a Synapse Analytics workspace; Microsoft is very good at creating integrated solutions.

You can directly start running queries against data stored in Azure storage (see an example here); in this example, I am selecting Serverless.
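For reference, querying files directly from Serverless is a one-liner with OPENROWSET; a minimal sketch (the storage account, container and folder names are illustrative):

SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/scada/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];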

Now using Dedicated Pool, I first had to start the cluster as it was suspended; you need to wait a couple of minutes.

And for some reason, OPENROWSET did not work; instead, I needed to create an external table, which is still very easy.

The GUI generates the SQL script, very nice.

DUID is a short string, but I will leave the default as nvarchar(4000). For some reason the column SETTLEMENTDATE is recognized as a number when it should be a datetime (maybe the issue is with the parquet file itself), but that's not a big issue for this blog. The point is that you can query your data lake files using either Dedicated Pool or Serverless.
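The generated script is not reproduced here, but an external table in Dedicated Pool follows this general shape (a sketch only: the data source, file format, storage path and any column other than DUID and SETTLEMENTDATE are illustrative, and the storage credential setup is omitted):

CREATE EXTERNAL DATA SOURCE [Lake]
WITH (TYPE = HADOOP, LOCATION = 'abfss://<container>@<storageaccount>.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.Scada_External
(
    SETTLEMENTDATE DATETIME2,      -- the wizard detected a number, it should be a datetime
    DUID           NVARCHAR(4000), -- wizard default, a short varchar would do
    SCADAVALUE     FLOAT           -- illustrative measure column
)
WITH (
    LOCATION    = '/scada/',
    DATA_SOURCE = [Lake],
    FILE_FORMAT = [ParquetFormat]
);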

Native Storage

Dedicated Gen2 does not support a real separation between storage and compute: when you load a table, it has to be allocated to a specific DWH, and cross-querying between DWHs is not supported (it has been planned for 5 years).

Loading data is very easy, same approach: a GUI will generate a script.

And here is the generated script; I changed the text column to varchar(20).
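The screenshot of that script is not reproduced here, but loading parquet files into a Dedicated Pool table boils down to a COPY INTO statement along these lines (table name and storage path are illustrative; the target table must already exist and authentication options are omitted):

-- bulk load all parquet files under the folder into an existing table
COPY INTO dbo.Scada
FROM 'https://<storageaccount>.blob.core.windows.net/<container>/scada/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET'
);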

Now, instead of loading only 1 file, I loaded 536 files; it took 8 minutes, which seems very slow.

Performance

Obviously, this is not a benchmark, just a way to get a general feeling. When I tested it with a very small table, I liked the results in PowerBI (Direct Query took 940 ms and DAX literally 5 ms).


A bit later, I tried it again with another table, and at first I got an error.


It was my fault: when a cluster is suspended, it will not wake up automatically when it receives queries; you need to start it manually.

The PowerBI report is a simple Direct Query with a few tables, just to generate multiple SQL queries.

Initially it felt alright, but when I added another instance of the report, things started to feel slow. I had a look at the service, and the SQL pool had already started adding SQL queries to the queue!

Take Away

The Good :

  • Integration with Azure Storage

The Bad :

  • $1.181/hour is far from cheap for an entry-level tier, especially as auto suspend does not work.
  • Dedicated Pool is the only major DWH that does not separate storage from compute; you can't simply spin up a compute and cross-query another dataset.

The Ugly :

  • Performance and concurrency seem particularly poor out of the box.

I am sure there are optimizations that can be done, for example switching on the result set cache (why is it not on by default?), maybe partitioning, etc., but I just wanted to give a first impression, and honestly, color me unimpressed. Gen3 had better be good.
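For reference, result set caching in Dedicated Pool is switched on per database with a single statement; a minimal sketch (the pool name is illustrative):

-- run against the master database of the logical server
ALTER DATABASE [MyDedicatedPool] SET RESULT_SET_CACHING ON;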

Edit: I got feedback from Andy that DW100c is limited to 4 concurrent queries.

First Impression of Snowflake from a BigQuery user perspective.

TL;DR: Random observations after using Snowflake for a couple of hours; there is a lot to like, but mixed feelings about the cost.

For no obvious reason, I felt an urge to try Snowflake. The setup was trivial: you get a 30-day trial with $400 of credit, no credit card required.

Snowflake is a multi-cloud product, so first I had to choose the cloud provider and the region. My personal PowerBI instance is in Melbourne; unfortunately, as of this writing, Snowflake is only available in the Azure Sydney region. It makes sense to choose the same region for two reasons:

  • Latency: inter-region transfers take more time
  • Egress cost: cloud providers charge for inter-region transfers

For the record, my personal data is in GCP Tokyo, but Snowflake is not available there.

User interface

The user interface is very neat and minimal (a good thing); I did not need to check any documentation.

Snowflake provides free sample data by default.

And obviously, you can browse all kinds of data from the Marketplace; it is very well integrated and seems trivial to use. As you have already noticed, Snowflake, like BigQuery, has a total separation between storage and compute. So far so good.

Preview Data

I clicked on Data Preview and got a warning message.

Yes, unlike BigQuery, data preview is a paid operation and requires a running cluster.

Create a new Cluster

This is the core feature of Snowflake. Creating a new cluster is trivial; as a test, I created the smallest possible cluster.

The cluster was up and running in a couple of seconds, very impressive, and the way it works is very simple.

If there is no query running, it shuts down after 1 minute (or whatever you choose). When a new query shows up, for example from a BI tool, the engine wakes up very quickly!

The minimum cluster I could set up was X-Small, and it costs 1 credit/hour ($2.75/hour), but you pay per second with a 1-minute minimum. I am using the Standard edition; the Enterprise edition costs more.
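The same warehouse can be created from SQL instead of the UI; a minimal sketch (the warehouse name is illustrative, 60 seconds is the shortest auto-suspend I used):

CREATE WAREHOUSE TEST_WH
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- seconds of inactivity before suspending
  AUTO_RESUME         = TRUE    -- wake up automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;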

Note: as a BigQuery enthusiast, I hope Google releases auto flex slots.

Notice that Snowflake is not simply a cluster to run some queries; it has a service layer that does a lot of operations behind the scenes. Personally, I am mainly interested in the free result cache, which is freaking fast, as low as 50 ms!
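If you want to measure the warehouse itself rather than the result cache, Snowflake lets you turn the cache off per session; as far as I can tell, it is just:

-- disable the global result cache for the current session only
ALTER SESSION SET USE_CACHED_RESULT = FALSE;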

Query History Log

There is a nice query history log. I really liked the Client Driver column: you can easily tell whether queries are coming from an external BI tool or from the console. One very annoying thing: if you leave that view and come back, you lose the selection and have to select the columns again. I wish Snowflake could save the column customization.

Query Console

It was not very obvious, but to select column names you need first to click on a table, which opens a panel, then click on those three little dots (it is obvious once you know it). There is no multi-tab support; a new query opens a new window, but honestly, seeing the new BigQuery UI, maybe that is not a bad idea after all 🙂

Performance

My initial plan was to copy a SQL script from BigQuery that uses loops, but it turns out Snowflake doesn't support DO WHILE (it is coming for SQL). There is a workaround using JavaScript stored procedures, which I may use later (see the sketch below), but instead, just to get a first impression, I used a PowerBI report in DirectQuery mode to see how it goes.
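For completeness, the JavaScript workaround looks roughly like this (a sketch only: the procedure name and the statement inside the loop are illustrative; arguments are exposed to JavaScript in upper case):

CREATE OR REPLACE PROCEDURE RUN_LOOP(ITERATIONS FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // run the same statement N times, simulating the BigQuery looping script
  for (var i = 0; i < ITERATIONS; i++) {
    snowflake.createStatement({
      sqlText: "SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM"
    }).execute();
  }
  return "done: " + ITERATIONS + " iterations";
$$;

CALL RUN_LOOP(5);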


The Snowflake driver for PowerBI is amazing; I had never seen a sub-second Direct Query in PowerBI before, even when returning 1 row from cache. (The BigQuery driver for PowerBI is not optimized; I didn't really know whether to blame Google or Microsoft, or probably both; it turns out it is Google's responsibility.)

I am using the TPCH_SF10 dataset; the main fact table contains 60 million records.

And here is the Snowflake data model (pun intended) in PowerBI using DirectQuery.

Some results are really intriguing. The ones without bytes scanned are cached, but look at this query that scanned 2.1 GB and returned in 770 ms; that's BigQuery BI Engine territory right there!

The reason the query is sub-second is again another type of cache: Snowflake caches the raw data on the local SSD drive of the cluster. Hence, for better performance, it makes sense to keep the cluster running a bit longer; if you suspend a cluster, there is no guarantee that Snowflake will resume the same one.

By the way, the query plan visual is very detailed and explains every step, very nice.

Takeaway

The quick auto-suspend and resume of clusters, the global query result cache, and the fact that it is a multi-cloud offering are the key strengths of Snowflake.

I was really surprised by the experience of using Snowflake Direct Query with PowerBI, and the data marketplace was very polished.

I will show my bias here: for $2.75/hour I can reserve a BigQuery BI Engine instance with 50 GB of in-memory RAM; it will be interesting to compare the performance and concurrency of both engines.

Synapse Serverless vs BigQuery BI Engine using a dataset under 10 GB

Disclaimer: this is not a scientific, reproducible benchmark.

Edit: TL;DR, don't use Synapse Serverless with Direct Query mode in PowerBI; it is an extremely bad idea.

One aspect that bothers me about technical blogs nowadays is that cost is rarely considered. When the subject is a fixed-cost product like a PowerBI Pro license ($10/user/month), that is fine, we know what to expect; but when we talk about usage-based pricing, the cost structure is extremely important. A solution may be great for a certain usage load but become simply exorbitant when the load increases.

One particular architecture that some people have started promoting as some kind of magical solution is the use of Synapse Serverless as a logical data warehouse that can somehow also serve as a live query layer for PowerBI. I will argue in this blog that this setup is simply too expensive.

Testing Synapse Serverless Indirectly

My thinking is very simple: I was not very excited by the prospect of paying 200 dollars just to test Synapse Serverless, so instead I will test it indirectly. BigQuery BI Engine has a nice feature that shows how much data was scanned; it is for information only, as we don't pay by data scanned but by in-memory reservation ($1/GB compressed/day, minimum 1 hour).

The approach here is to take the volume of data scanned and multiply it by $5/TB (Synapse Serverless pricing). I appreciate it is not 100% accurate, but I hope it shows the general pattern.

Load test using a PowerBI report

The fact table is 12 GB, 72 million rows, with new data added every 5 minutes. The model is a simple star schema, and I am using dual mode for the dimension tables.

The test consists of using a play axis to loop over some dimension values every 5 seconds; I launched multiple copies of the same report to generate more SQL queries.

The Results

This table summarizes the test results; more details are in the report. Please keep the filter between 22 Nov and 27 Nov 2021, as testing was done in that period.

BigQuery BI Engine is very fast, but that's not the subject of the current blog. What's interesting here is the volume of data scanned, 16.75 TB; that's a lot of data, and it does not account for the cache.

For simplification purposes, we estimate the cache to have the same ratio as the scanned TB: (16.46 × 44.89% / 54.42%) = 13.58 TB.

Synapse Serverless is 25X more expensive than BigQuery BI Engine, and even if they add a result set cache it would still be 14X more expensive (the same as BigQuery without a reservation).

Key findings

Interactive BI reports generate a massive number of SQL queries, 76K queries in our example, which simply makes SQL engines that charge based on data scanned too expensive (Synapse Serverless and BigQuery default mode).

This scenario would be better served by dedicated capacity, but as of this writing, Synapse does not support auto suspend and auto resume, which makes it too expensive; and in any case, Synapse Dedicated Pool does not scale down well for small data (hopefully Gen3 will fix that).

BigQuery BI Engine makes Direct Query in PowerBI a viable solution, which is a great achievement, and still with very competitive pricing.

Synapse Serverless is an interesting SQL query engine, but it is not designed for heavy interactive BI loads; I just hope people stop suggesting otherwise.

I think next year the battle for 100 GB interactive, sub-second BI workloads will be an interesting space to watch; let's see what Dedicated Pool Gen3, Databricks, Snowflake and Firebolt will bring to the table 🙂
