Using materialized view to create a different sort order of the base table can reduce cost substantially, in this example up to 200 X for some queries.
The Problem : You can’t have multiple sort order.
One of most effective technique in columnar database to improve speed is to sort the table which will help the engine scan less data, but what if you have some queries that works better with different sorting, obviously you can’t have independent column sorted in the same table , turn out BigQuery has a very nice functionality, Materialized views which are used mainly to aggregate Data, see example here, can also works fine without aggregation and you can even change the sort order of the base table.
Example using PowerBI
In this report, which show analyze Electricity production for the Australian Market, the base table is sorted by day, that make perfect sense as most uses will want to see only a specific time period then aggregate by some attribute like region, technology etc
The Model is very simple , a fact table in Direct Query mode, and two dimensions in Import mode, a very simple star schema
The main report is a couple of charts using some aggregation, nothing particularly interesting, the queries use the Materialized view as we are aggregating per day, the base table is by minutes.
Drill Down to details
The trouble start here, let’s say a user want to see all the data for one particular Station name to the lowest level of details.
The Engine will have to scan the full table, as the table was sorted by date not by Station name
as you can see 4.61 GB is a lot of data scanned just to see one station name.
Non Aggregation Materialized View
let’s create a new MV with a different sort order, notice there is no group by, it is the same number of rows as the base table and you can’t have different partitions, only sort can be changed.
create materialized view `XXX_MV_Sort`
cluster by DUID,DAY as
The Same Query is substantially cheaper 🙂 from 4.6 GB to 19 MB, that’s 230 times cheaper.
you can see the two Query Plan, one scanning 101 Millions rows vs 404000 rows
Because BigQuery already knows about all the queries, it may be useful to have some kind of service that give suggestions, like changing sort or adding a Materialized view, in the meantime, I think Modeling still matter and can save you a lot of money.
TL;DR : Random observations after using Snowflake for a couple of hours, there is a lot to likes but mixed feeling about the cost.
For no obvious reason, I felt an urge to try Snowflake, The setup was trivial, you get 30 days trial with $400, no Credit card required.
Snowflake is multi cloud product, first I had to choose the cloud provider and the region, My Personal PowerBI instance is in Melbourne, unfortunately as of this writing it is Only Available in Azure Sydney Region, it make sense to choose the same region for two reasons
Latency, inter region Transfer take more time
Egress Cost, Cloud Provider charge for Inter Region Transfer
For the record my personal data is in GCP Tokyo , but Snowflake is not available there.
The User interface is very neat and simplistic ( Good thing), I did not need to check any documentation
Snowflake provide free sample data by default
and Obviously, you can browser all kind of data from the Marketplace , it is very well integrated and seems trivial to use, as you have noticed already, Snowflake like BigQuery has a total separation between Storage and Compute, so far so good.
I click on Data Preview, and I got this message
Yes Unlike BigQuery, Data Preview is a paid operation and require a Cluster running.
Create a new Cluster
This is the core feature of Snowflake, creating a new cluster is trivial, as a test, I create the smallest possible Cluster.
The Cluster was up and running in a couple of seconds, very Impressive, and the way it works is very simple.
if there is no Query Running, it will shut down after 1 minute ( or whatever you choose), When a new Query showed up for example from a BI tool, the Engine very quickly wake up !!!!
The Minimum Cluster, I could setup was X-Small and it cost 1 credit/Hour ( 2,75 $/Hour), but you pay per second with a 1 minute minimum , I am using Standard Edition, Enterprise edition cost more.
Note : As a BigQuery enthusiast, I hope Google release the auto flex slot
Notice here, Snowflake is not simply a cluster to Run some Queries, but it does have a Service Layer which do a lot of operations behind the scene, personally I am mainly interested in free Results cache, which is freaking fast , as low as a 50 ms !!!!
Query History Log
A nice Query history log, I really liked the Client Driver, you can easily tell, if the Queries are coming from an external BI tools or Query from the Console, one very very annoying thing, if you change that view and you came back, you lose the selection and you have to select the columns again, I wish Snowflake could save the customization of the columns.
it was not very obvious, but to select columns name, you need first to click on a table which open a panel then Click on those three little dot, (it is obvious once you know it) , there is no multi tab support, new Query open a new Window, but honestly seeing the new BigQuery UI, maybe it is not a bad idea after all 🙂
Snowflake Driver for PowerBI is amazing, I never saw a sub second Direct Query in PowerBI before, even when returning 1 row from cache, ( BigQuery Driver for PowerBI is not optimized, I don’t really knows who to blame, Google or Microsoft or probably both, turn out it is Google responsability)
I am using TPCH_SF10 dataset, the main fact Table contains 60 Million records
And here is the Snowflake data Model ( pun intended) in PowerBI using DirectQuery.
Some results a really intriguing, The ones that don’t have Bytes scanned are cached, but look at this Query that scanned 2.1 GB and returned in 770ms, that’s a BigQuery BI Engine territory right there !!!
The Reason, the Query is in the subsecond, is again because of another type of cache, Snwoflake cache the raw data in the local SSD drive of the cluster, hence it make sense for better performance to keep the Cluster running for a bit longer, if you suspend a Cluster, there is no guarantee that Snowflake will resume the same one.
Btw, the Query plan visual is very detailed and explain every step, very nice.
The Quick Auto suspend and resume of Clusters, Global Query result cache and the fact it is a Multi Cloud offering are the key strength of Snowflake.
I was really surprised by the experience of using Snowflake Direct Query with PowerBI, and the data marketplace was a very polished experience.
I will show my bias here , for 2.75 $/Hour I can reserved a BigQuery BI Engine instance with 50 GB in-memory RAM, it will be interesting to compare the performance and concurrency of Both Engine.