First Impression of Databricks SQL

Edit: 23 Dec 2021, Shant Hovsepian from Databricks was kind enough to give me some suggestions; blog updated.

Edit : 25 Dec 2021, added a note regarding Delta Lake open storage format.

This is another short blog in a series of first impressions of different data warehouse engines and how they behave under a BI workload. I am particularly interested in small datasets, and mainly looking at concurrency and latency.

How to conduct a simple Test

The approach is the same: build a simple PowerBI report using Direct Query (you can use Tableau with a live connection, or your favorite BI tool).

I run one instance of the report (I like to use a play-axis slicer to simulate user interaction), then a second instance, and so on, and see how the engine behaves.

Setup Databricks SQL

Databricks has done an amazing job. The workspace is very neat and intuitive, there is no mention of the word Spark at all, and they have hidden all the complexity; as far as I am concerned, it acts like any cloud data warehouse. I had some hiccups though: Azure complained about a CPU quota, which was easy to fix but very annoying. The Google Cloud setup was easier, but as of this writing there is no SQL interface yet and you have to pay a 100-200 $ cost for Kubernetes, so I ended up using Azure.

When you explore a new SQL database, the first thing you check is the sample data. Strangely, it asks for a compute to be running just to have a look at the metadata.

Creating a new compute is very simple, and I really like that you can select which version of the engine you want to run, Current or Preview. Auto Stop works as expected, but it is really slow to start a cluster (around 4 minutes).

The minimum duration for Auto Stop is 10 minutes (in practice, you should increase it to 1 hour; a user waiting 5 minutes for their report to start is not fun).

Databricks, at least in Azure, is a platform as a service: when you create a new cluster, it uses your own resources to provision new VMs (CPU, disk, etc.), so it is no surprise that starting a new cluster takes so long.

I noticed that when you resize a cluster, the engine goes offline; it probably makes sense to use auto scaling instead.

For the pricing, you pay in Databricks Units (DBU), 1 DBU = $0.22/hour, plus the underlying resources. I can’t find the reference, but it seems a 2X-Small requires 2 CPUs at a cost of $0.64/hour each.

So the total for my test cluster is 4 × $0.22 + 2 × $0.64 = $2.16/hour.

Testing PowerBI

Connecting to PowerBI and Tableau was literally two clicks away; in PowerBI you click a link and it generates a PowerBI report file. Fantastic.


I used some sample data provided by Databricks; the main fact table is 1 GB and has 30 million records.

Here is my PowerBI data model.

And here is the PowerBI report: it basically loops on customer key and generates some simple aggregates; the report generates 3 SQL queries every 5 seconds.

When I ran only one instance of the report, it worked rather well; with a second instance it still behaved well, but when I added a third instance, it became unusable and queries started to be added to a queue. The cluster did not keep up with the workload, and I was surprised by the results.

I noticed something interesting: it seems Databricks does not support a result cache; what is currently supported is an SSD cache. The engine caches the raw data on the local SSD, but it will re-run the same query even when the table has not changed and the SQL expression is identical.
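A quick way to see this for yourself (a minimal sketch against the same sample lineitem table used below; the exact table path may differ in your workspace): run the same aggregate twice in a row without touching the table and watch the query history. With a result cache the second run would return almost instantly with no data read; here both runs scan data, the second one just reads faster from the SSD cache.

-- run this twice in a row without changing the table;
-- both runs report bytes read in the query profile,
-- the second is only faster because of the local SSD cache
select
  l_returnflag,
  sum(l_quantity) as total_qty,
  count(1) as row_count
from
  hive_metastore.default.lineitem
group by
  l_returnflag;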

Using the Performance Analyzer in PowerBI, queries return in around 1.5 to 3 seconds; definitely not sub-second territory here.

Second try

It turns out the sample data provided by Databricks is located in Washington state. In my defence, I thought that when you create a new account they copy the data into your account; that is not the case.

I copied the same data into my local storage.

I redid the same test using 4 instances of PowerBI instead of 3, and Databricks behaved way better!

Result Cache

The result cache implementation in Databricks is a bit unusual. Take this query as an example:

select
  `o_custkey`,
  sum(`l_quantity`) as `C1`,
  count(1) as `C2`
from
  (
    select
      `OTBL`.`l_quantity`,
      `ITBL`.`o_custkey`
    from
      `hive_metastore`.`default`.`lineitem` as `OTBL`
      inner join (
        select
          `o_orderkey`,
          `o_custkey`,
          `o_orderstatus`,
          `o_totalprice`,
          `o_orderdate`,
          `o_orderpriority`,
          `o_clerk`,
          `o_shippriority`,
          `o_comment`
        from
          `hive_metastore`.`default`.`orders`
        where
          `o_custkey` in (5, 11, 140, 4)
      ) as `ITBL` on (`OTBL`.`l_orderkey` = `ITBL`.`o_orderkey`)
  ) as `ITBL`
group by
  `o_custkey`
limit
  1000001

The first run took 2.75 seconds.

The second run, which should be cached as I did not change the tables, took 1.19 seconds, which is not a great result. I know Snowflake's result cache returns in around 50 ms, and BigQuery's in around 100-200 ms. If I understood correctly, because Databricks uses an open storage format, it always has to go back to Azure storage and check whether something has changed, which introduces extra latency.

Random Thoughts

Databricks SQL is a data warehouse; in my opinion, all this talk about the lakehouse is just a distraction. As far as I can see, it is a solid DWH with an open storage format (which is a good thing), it is multi-cloud (a big advantage), and the team is investing a lot in new functionality.

I am aware that the biggest competitive advantage of Databricks compared to Snowflake is Delta Lake, its open table format: basically you can read your data for free (see an example here with PowerBI) or use literally another compute engine. But in this first look, I was only interested in query performance, not the overall architecture. (BigQuery has an open Storage API, but it is not free.)

Engine startup time is really slow; 5 minutes is too much, especially when other vendors offer 5-second startup (looking forward to the serverless preview in Azure).

I am not sure exactly what is going on with concurrency; Databricks seems really good at aggregating massive data, but I am not sure it is suited for the high-concurrency, low-latency profile needed for interactive BI workloads.

Databricks got me interested in their engine. I need to do further testing, but it seems we have another interesting Azure DWH offering (beside Snowflake).

Having quickly tested Azure Synapse (both Serverless and Dedicated) and Snowflake, I think that for high-concurrency, low-latency, small-dataset workloads, Snowflake has the advantage, Databricks comes second, and Synapse does not support this workload at all.

Some readers thought I was a bit unfair to Databricks, but that was not the intention. As far as I am concerned, in the last 10 years we have seen some serious innovation in the data warehouse space:

BigQuery separating Storage from Compute.

Snowflake introducing the per-second pricing model (with a 60-second minimum), and being multi-cloud.

Databricks going even further and making the Storage open, so your data is not tied to one Engine.

First Look at Synapse Dedicated Pool

Edit: March 2022. I am keeping the blog post as it was my first impression at the time, but I have to admit I was wrong: Dedicated Pool is designed for big data workloads, starting from terabytes, and it does not work well at all for small datasets. Basically, I was testing it with a load it was not designed for.

For data sizes around 100 GB, SQL Server with a columnstore index is probably a better option.

In a previous blog, I showed that Synapse Serverless is not great for interactive BI workloads, which is rather obvious, as even Microsoft doesn't recommend it as a best practice.

The logical next step is to have a look at Dedicated Pool (formerly SQL DW), where you don't pay by data scanned but rather by reserved resources; this model makes more sense for a steady interactive workload.

To be very clear, this is a genuine first impression. I did not try any optimization (to be honest, I don't even know how to do it); all I wanted was to get a feel for the product.

Creating a new cluster is trivial. Note that I already have a Synapse Analytics workspace; Microsoft is very good at creating integrated solutions.

You can directly start running queries against data stored in Azure storage (see an example here); in this example, I am selecting Serverless.
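For reference, querying files directly from Serverless is a one-liner with OPENROWSET; a minimal sketch (the storage account, container and folder names are illustrative):

SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/scada/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];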

Now using Dedicated Pool, I first had to start the cluster as it was suspended; you need to wait a couple of minutes.

And for some reason, OPENROWSET did not work; instead, I needed to create an external table, which is still very easy.

The GUI generates the SQL script, very nice.

DUID is a short string, but I will leave the default as nvarchar(4000). For some reason the column SETTLEMENTDATE is recognized as a number when it should be a datetime (maybe the issue is with the parquet file itself), but that's not a big issue for this blog. The point is that you can query your data lake files using either Dedicated Pool or Serverless.
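The generated script is not reproduced here, but an external table in Dedicated Pool follows this general shape (a sketch only: the data source, file format, storage path and any column other than DUID and SETTLEMENTDATE are illustrative, and the storage credential setup is omitted):

CREATE EXTERNAL DATA SOURCE [Lake]
WITH (TYPE = HADOOP, LOCATION = 'abfss://<container>@<storageaccount>.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.Scada_External
(
    SETTLEMENTDATE DATETIME2,      -- the wizard detected a number, it should be a datetime
    DUID           NVARCHAR(4000), -- wizard default, a short varchar would do
    SCADAVALUE     FLOAT           -- illustrative measure column
)
WITH (
    LOCATION    = '/scada/',
    DATA_SOURCE = [Lake],
    FILE_FORMAT = [ParquetFormat]
);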

Native Storage

Dedicated Gen2 does not support a real separation between storage and compute: when you load a table, it has to be allocated to a specific DWH, and cross-querying between DWHs is not supported (it has been planned for 5 years).

Loading data is very easy, same approach: a GUI will generate a script.

And here is the generated script; I changed the text column to varchar(20).
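The screenshot of that script is not reproduced here, but loading parquet files into a Dedicated Pool table boils down to a COPY INTO statement along these lines (table name and storage path are illustrative; the target table must already exist and authentication options are omitted):

-- bulk load all parquet files under the folder into an existing table
COPY INTO dbo.Scada
FROM 'https://<storageaccount>.blob.core.windows.net/<container>/scada/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET'
);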

Now, instead of loading only 1 file, I loaded 536 files; it took 8 minutes, which seems very slow.

Performance

Obviously, this is not a benchmark, just a way to get a general feeling. When I tested it with a very small table, I liked the results in PowerBI (Direct Query took 940 ms and DAX literally 5 ms).


A bit later, I tried it again with another table, and at first I got an error.


It was my fault: when a cluster is suspended, it will not wake up automatically when it receives queries; you need to start it manually.

The PowerBI report is a simple Direct Query with a few tables, just to generate multiple SQL queries.

Initially it felt alright, but when I added another instance of the report, things started to feel slow. I had a look at the service, and the SQL pool had already started adding SQL queries to the queue!

Take Away

The Good :

  • Integration with Azure Storage

The Bad :

  • $1.181/hour is far from cheap for an entry-level tier, especially as auto suspend does not work.
  • Dedicated Pool is the only major DWH that does not separate storage from compute; you can't simply spin up a compute and cross-query another dataset.

The Ugly :

  • Performance and concurrency seem particularly poor out of the box.

I am sure there are optimizations that can be done, for example switching on the result set cache (why is it not on by default?), maybe partitioning, etc., but I just wanted to give a first impression, and honestly, color me unimpressed. Gen3 had better be good.
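For reference, result set caching in Dedicated Pool is switched on per database with a single statement; a minimal sketch (the pool name is illustrative):

-- run against the master database of the logical server
ALTER DATABASE [MyDedicatedPool] SET RESULT_SET_CACHING ON;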

Edit: I got feedback from Andy that DW100c is limited to 4 concurrent queries.

First Impression of Snowflake from a BigQuery user perspective.

TL;DR: Random observations after using Snowflake for a couple of hours; there is a lot to like, but mixed feelings about the cost.

For no obvious reason, I felt an urge to try Snowflake. The setup was trivial: you get a 30-day trial with $400 of credit, no credit card required.

Snowflake is a multi-cloud product, so first I had to choose the cloud provider and the region. My personal PowerBI instance is in Melbourne; unfortunately, as of this writing, Snowflake is only available in the Azure Sydney region. It makes sense to choose the same region for two reasons:

  • Latency: inter-region transfers take more time
  • Egress cost: cloud providers charge for inter-region transfers

For the record, my personal data is in GCP Tokyo, but Snowflake is not available there.

User interface

The user interface is very neat and minimal (a good thing); I did not need to check any documentation.

Snowflake provides free sample data by default.

And obviously, you can browse all kinds of data from the Marketplace; it is very well integrated and seems trivial to use. As you have already noticed, Snowflake, like BigQuery, has a total separation between storage and compute. So far so good.

Preview Data

I clicked on Data Preview and got a warning message.

Yes, unlike BigQuery, data preview is a paid operation and requires a running cluster.

Create a new Cluster

This is the core feature of Snowflake. Creating a new cluster is trivial; as a test, I created the smallest possible cluster.

The cluster was up and running in a couple of seconds, very impressive, and the way it works is very simple.

If there is no query running, it shuts down after 1 minute (or whatever you choose). When a new query shows up, for example from a BI tool, the engine wakes up very quickly!

The minimum cluster I could set up was X-Small, and it costs 1 credit/hour ($2.75/hour), but you pay per second with a 1-minute minimum. I am using the Standard edition; the Enterprise edition costs more.
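The same warehouse can be created from SQL instead of the UI; a minimal sketch (the warehouse name is illustrative, 60 seconds is the shortest auto-suspend I used):

CREATE WAREHOUSE TEST_WH
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- seconds of inactivity before suspending
  AUTO_RESUME         = TRUE    -- wake up automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;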

Note: as a BigQuery enthusiast, I hope Google releases auto flex slots.

Notice that Snowflake is not simply a cluster to run some queries; it has a service layer that does a lot of operations behind the scenes. Personally, I am mainly interested in the free result cache, which is freaking fast, as low as 50 ms!
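If you want to measure the warehouse itself rather than the result cache, Snowflake lets you turn the cache off per session; as far as I can tell, it is just:

-- disable the global result cache for the current session only
ALTER SESSION SET USE_CACHED_RESULT = FALSE;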

Query History Log

There is a nice query history log. I really liked the Client Driver column: you can easily tell whether queries are coming from an external BI tool or from the console. One very annoying thing: if you leave that view and come back, you lose the selection and have to select the columns again. I wish Snowflake could save the column customization.

Query Console

It was not very obvious, but to select column names you need first to click on a table, which opens a panel, then click on those three little dots (it is obvious once you know it). There is no multi-tab support; a new query opens a new window, but honestly, seeing the new BigQuery UI, maybe that is not a bad idea after all 🙂

Performance

My initial plan was to copy a SQL script from BigQuery that uses loops, but it turns out Snowflake doesn't support DO WHILE (it is coming for SQL). There is a workaround using JavaScript stored procedures, which I may use later (see the sketch below), but instead, just to get a first impression, I used a PowerBI report in DirectQuery mode to see how it goes.
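For completeness, the JavaScript workaround looks roughly like this (a sketch only: the procedure name and the statement inside the loop are illustrative; arguments are exposed to JavaScript in upper case):

CREATE OR REPLACE PROCEDURE RUN_LOOP(ITERATIONS FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // run the same statement N times, simulating the BigQuery looping script
  for (var i = 0; i < ITERATIONS; i++) {
    snowflake.createStatement({
      sqlText: "SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM"
    }).execute();
  }
  return "done: " + ITERATIONS + " iterations";
$$;

CALL RUN_LOOP(5);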


The Snowflake driver for PowerBI is amazing; I had never seen a sub-second Direct Query in PowerBI before, even when returning 1 row from cache. (The BigQuery driver for PowerBI is not optimized; I didn't really know whether to blame Google or Microsoft, or probably both; it turns out it is Google's responsibility.)

I am using the TPCH_SF10 dataset; the main fact table contains 60 million records.

And here is the Snowflake data model (pun intended) in PowerBI using DirectQuery.

Some results are really intriguing. The ones without bytes scanned are cached, but look at this query that scanned 2.1 GB and returned in 770 ms; that's BigQuery BI Engine territory right there!

The reason the query is sub-second is again another type of cache: Snowflake caches the raw data on the local SSD drive of the cluster. Hence, for better performance, it makes sense to keep the cluster running a bit longer; if you suspend a cluster, there is no guarantee that Snowflake will resume the same one.

By the way, the query plan visual is very detailed and explains every step, very nice.

Takeaway

The quick auto-suspend and resume of clusters, the global query result cache, and the fact that it is a multi-cloud offering are the key strengths of Snowflake.

I was really surprised by the experience of using Snowflake Direct Query with PowerBI, and the data marketplace was very polished.

I will show my bias here: for $2.75/hour I can reserve a BigQuery BI Engine instance with 50 GB of in-memory RAM; it will be interesting to compare the performance and concurrency of both engines.

Synapse Serverless vs BigQuery BI Engine using a dataset under 10 GB

Disclaimer: this is not a scientific, reproducible benchmark.

Edit: TL;DR, don't use Synapse Serverless with Direct Query mode in PowerBI; it is an extremely bad idea.

One aspect that bothers me about technical blogs nowadays is that cost is rarely considered. When the subject is a fixed-cost product like a PowerBI Pro license ($10/user/month), that is fine, we know what to expect; but when we talk about usage-based pricing, the cost structure is extremely important. A solution may be great for a certain usage load but become simply exorbitant when the load increases.

One particular architecture that some people have started promoting as some kind of magical solution is the use of Synapse Serverless as a logical data warehouse that can somehow also serve as a live query layer for PowerBI. I will argue in this blog that this setup is simply too expensive.

Testing Synapse Serverless Indirectly

My thinking is very simple: I was not very excited by the prospect of paying 200 dollars just to test Synapse Serverless, so instead I will test it indirectly. BigQuery BI Engine has a nice feature that shows how much data was scanned; it is for information only, as we don't pay by data scanned but by in-memory reservation ($1/GB compressed/day, minimum 1 hour).

The approach here is to take the volume of data scanned and multiply it by $5/TB (Synapse Serverless pricing). I appreciate it is not 100% accurate, but I hope it shows the general pattern.

Load test using a PowerBI report

The fact table is 12 GB, 72 million rows, with new data added every 5 minutes. The model is a simple star schema, and I am using dual mode for the dimension tables.

The test consists of using a play axis to loop over some dimension values every 5 seconds; I launched multiple copies of the same report to generate more SQL queries.

The Results

This table summarizes the test results; more details are in the report. Please keep the filter between 22 Nov and 27 Nov 2021, as testing was done in that period.

BigQuery BI Engine is very fast, but that's not the subject of the current blog. What's interesting here is the volume of data scanned, 16.75 TB; that's a lot of data, and it does not account for the cache.

For simplification purposes, we estimate the cache to have the same ratio as the scanned TB: (16.46 × 44.89% / 54.42%) = 13.58 TB.

Synapse Serverless is 25X more expensive than BigQuery BI Engine, and even if they add a result set cache it would still be 14X more expensive (the same as BigQuery without a reservation).

Key findings

Interactive BI reports generate a massive number of SQL queries, 76K queries in our example, which simply makes SQL engines that charge based on data scanned too expensive (Synapse Serverless and BigQuery default mode).

This scenario would be better served by dedicated capacity, but as of this writing, Synapse does not support auto suspend and auto resume, which makes it too expensive; and in any case, Synapse Dedicated Pool does not scale down well for small data (hopefully Gen3 will fix that).

BigQuery BI Engine makes Direct Query in PowerBI a viable solution, which is a great achievement, and still with very competitive pricing.

Synapse Serverless is an interesting SQL query engine, but it is not designed for heavy interactive BI loads; I just hope people stop suggesting otherwise.

I think next year the battle for 100 GB interactive, sub-second BI workloads will be an interesting space to watch; let's see what Dedicated Pool Gen3, Databricks, Snowflake and Firebolt will bring to the table 🙂
