Sharing Data Between OneLake and Snowflake

TL;DR: OneLake is really open storage and works with any tool that understands the Azure ADLS Gen2 API. The endpoint naming is different, which was confusing (at least for me), but there is no catch: it is the same tech under the hood, there is no vendor lock-in, and it is a genuinely open ecosystem.

Use Case

For no obvious reason, I thought about an unusual scenario: load data to OneLake using Dataflow Gen2 and query the results using Snowflake. It works with any engine that understands Delta tables, and mind you, no Spark was involved at all; everything runs on native code.

Azure AD

Fabric uses Azure AD for authentication and supports service principals too; that is the recommended route. I am not sure whether Snowflake understands the OneLake endpoint format, though. I think it expects something like this: “accountname.blob.core.windows.net/container_name/”

But with OneLake you get something like this:

“sharing@onelake.dfs.fabric.microsoft.com/sharing.Lakehouse/Tables/scada”

Maybe it works, but I have not tested it; that is probably a blog for another day.
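
For the record, here is a minimal sketch of what pointing a standard ADLS Gen2 client at the OneLake endpoint could look like, using the azure-storage-file-datalake SDK. The workspace, lakehouse and table names are the ones from the URL above, and I have not verified this against OneLake myself, so treat it as an illustration only.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake speaks the same DFS API as ADLS Gen2; the workspace plays the role of the container
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
workspace = service.get_file_system_client("sharing")

# list the files backing the Delta table inside the Lakehouse
for p in workspace.get_paths(path="sharing.Lakehouse/Tables/scada"):
    print(p.name)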

Shortcut

Another approach, which is what I did, is to create a shortcut to an existing ADLS Gen2 storage account. It is trivial to set up and supports both read and write.

I can write the data using either Spark or Dataflow Gen2; the Data Warehouse is not supported, as it can only read through the OneLake endpoint. For this example I am using Dataflow Gen2 to write the data.

It turns out PowerQuery has a native Delta table writer 🙂, as you can see in the log.

And here is the Data in Azure Storage

I feel that if you want to share data with external parties, an external Azure container may be the more flexible solution. It may sound silly, but the fact that I can see the data with Storage Explorer in Azure, independently of Fabric, makes it feel more neutral.

Reading from Snowflake

Snowflake recommends using a service principal, but for this quick demo I used a SAS token, making sure to grant read and list permissions.
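
For reference, here is a rough sketch of generating such a container-level SAS token with the azure-storage-blob SDK in Python; the account and container names match the stage definition below, while the account key placeholder and the seven-day expiry are just illustrative values.

from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

# SAS token scoped to the "scada" container, granting only read and list
sas_token = generate_container_sas(
    account_name="fabricshare",
    container_name="scada",
    account_key="<storage-account-key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(days=7),
)
print(sas_token)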

I then used this SQL script to define the external table:

-- create a database and schema for the demo
create database ONELAKE ;
create schema AEMO ;
USE ONELAKE.AEMO ;

-- external stage pointing at the Azure container that holds the Delta table
CREATE OR REPLACE STAGE onelake
  URL='azure://fabricshare.blob.core.windows.net/scada/'
  CREDENTIALS=(AZURE_SAS_TOKEN='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') ;

-- external table over the stage; each column is extracted from the Parquet "value" variant
CREATE OR REPLACE EXTERNAL TABLE scada(
 SETTLEMENTDATE TIMESTAMP AS (value:SETTLEMENTDATE::VARCHAR::TIMESTAMP),
 DUID varchar AS (value:DUID::varchar),
 SCADAVALUE NUMBER(12,5) AS (value:SCADAVALUE::NUMBER(12,5)),
 file varchar AS (value:file::varchar))
 LOCATION=@ONELAKE
 REFRESH_ON_CREATE = FALSE
 AUTO_REFRESH = FALSE
 FILE_FORMAT = (TYPE = PARQUET)
 TABLE_FORMAT = DELTA;

-- register the current Delta snapshot, then query the table
ALTER EXTERNAL TABLE SCADA REFRESH;
select settlementdate, sum(scadavalue) as mw from scada group by all;

To pick up the latest data, you can run this SQL again:

ALTER EXTERNAL TABLE SCADA  REFRESH;

And because it is a Delta table, you will not get dirty reads; only fully committed data is read.

And here is a sample chart from Snowflake

Writing from Snowflake

Snowflake doesn't support writing to Delta tables, which is a pity, as I think it would be a great use case for PowerBI users with Direct Lake mode. I guess for now a user can use Fabric Data Factory to get the data, but I feel there is a missed opportunity here.

Parting thoughts

I used to have mixed feelings about Delta tables and the Lakehouse concept in general; I felt it was too tied to the Spark ecosystem and only really useful for big data setups. I changed my mind, first by using Delta Rust, and now Fabric is doing a good job making it even more accessible for business users. I think that's progress, and to be very clear, this setup works with Databricks, DuckDB and any database that reads Delta tables.

Implementing a Poor Man’s Lakehouse in Azure

TLDR

A simple script that loads some data from Azure Storage into a disk cache, runs complex queries using DuckDB and saves the results into a destination bucket, all on a cheap Azure ML notebook. The resulting bucket can be consumed from Synapse serverless, PowerBI, notebooks, etc.

Introduction

Last year, the Azure Synapse team published an excellent article on how to build a lakehouse architecture using Azure Synapse. What I really liked is this diagram; it is very simple and to the point.

Note that I am more interested in the overall Azure ecosystem, so the discussion stays vendor neutral.

In practical terms, Lakehouse here means the storage system plus an open table format. Now, if you ask three different people about this diagram, they will probably give you four different answers.

  • An old-school Dedicated Pool professional will argue this is an over-complicated system, and all you need is source system —> data integration tool —> Dedicated Pool.
  • As I have a soft spot for Serverless, I would ideally say: add write capabilities to Serverless and call it a day.
  • A Snowflake or Databricks person will argue that one engine should do everything: prepare and serve.
  • My colleague, an Azure data engineer, will say the whole thing does not make any sense; ADF and SQL Server are all you need 🙂

What if you don't have big data?

The previous diagram assumes a big data workload; Spark has a massive overhead in compute usage and cost and does not make much sense for smaller data.

What if you have a smaller data size? Can we keep the overall architecture at a lower cost? Maybe we can, and I will argue it may even become genuinely useful very soon 🙂

Why are you keeping the Serverless pool?

It is probably the obvious question: DuckDB is an awesome execution engine, but it is not a client-server database; you can't have a SQL endpoint that you just point PowerBI and other tools at to run queries.

There are projects working on that, but they are not ready yet, and even if you find some hacks, implementing governance and access controls will be non-trivial.

Obviously you can use an Azure ML notebook just to do exploratory analysis, but that's not something that makes sense for business intelligence people 🙂

PowerBI

The theory is that because the data is prepared and cleaned at the storage level, PowerBI can just import the Parquet files; a SQL endpoint is not strictly needed.

Ok sweet, so where do you run this Duck thing?

Honestly, that was the hardest part. Using Synapse compute defeats the purpose, as it is designed for Spark and the minimum is three VMs. It turns out Azure has an amazing machine learning service that we can simply use for data engineering: you can run a VM as small as 1 core (8 cents/hour), auto-shutdown is available so you pay only for what you use, and you can schedule jobs. Yes, it is supposed to be for ML, but it works just fine for data engineering jobs.

Show me an Example

The pipeline is very simple:

- Read data from an Azure Storage bucket

- Run some complex queries (22 queries)

- Save the results in a bucket
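
Here is a minimal sketch of what that pipeline looks like in Python. The account, container paths and the single query shown are placeholders (the real job runs the 22 TPC-H queries), but the structure is the same: cache the files locally, query with DuckDB, push the results back.

import os
import adlfs
import duckdb

# fsspec filesystem for the storage account (placeholder credentials)
fs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")

# 1. copy the raw Parquet files into a local disk cache
os.makedirs("/tmp/cache", exist_ok=True)
fs.get("raw/tpch/lineitem/", "/tmp/cache/lineitem/", recursive=True)

# 2. run the heavy queries locally with DuckDB (only one query shown here)
os.makedirs("/tmp/results", exist_ok=True)
con = duckdb.connect()
con.execute("""
    COPY (
        SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty
        FROM read_parquet('/tmp/cache/lineitem/*.parquet')
        GROUP BY l_returnflag, l_linestatus
    ) TO '/tmp/results/q1.parquet' (FORMAT PARQUET)
""")

# 3. push the results back to the destination bucket
fs.put("/tmp/results/", "curated/tpch/", recursive=True)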

The compute used is an Azure ML VM that costs 14 cents/hour, the raw data is around 3 GB with the main table at 60 million rows, and the overall pipeline took 2 minutes and 37 seconds. I think this is the most cost-effective way to run this workload on Azure. (Synapse serverless would have been perfect with its pay-by-scan model, but its write capabilities are rudimentary; basically you can't overwrite a folder, though that may change at any time.)

And The results

I appreciate that TPC-H is not a benchmark for heavy writes, but I used something easily accessible so I can compare with other systems.

Thanks to Koen Vossen for showing how to use the disk cache with fsspec

What’s the catch?

As of this writing, Python support for open table formats (Iceberg, Delta) is very limited. Personally, I found the Arrow dataset API to be the most mature offering today, but it supports only Hive-style tables and only append or overwrite.
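
As an illustration, this is roughly what an append/overwrite of a Hive-partitioned table with the Arrow dataset API looks like; the tiny table, the container path and the partition column are made up for the example.

import pyarrow as pa
import pyarrow.dataset as ds
import adlfs

fs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")
table = pa.table({"Date": ["2023-01-01", "2023-01-02"], "mw": [100.0, 120.0]})

# write a Hive-partitioned Parquet dataset; you can append new files or overwrite, but not merge
ds.write_dataset(
    table,
    "container/scada",
    format="parquet",
    partitioning=["Date"],
    partitioning_flavor="hive",
    filesystem=fs,
    existing_data_behavior="overwrite_or_ignore",
)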

If you need a merge or update directly on remote storage, your only option is to do those operations in DuckDB using its native format and then overwrite the files in the remote storage. That works well for small data, but it will not scale.
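
A sketch of that workaround, with made-up paths: pull the table into DuckDB's native format, update it locally, then rewrite everything.

import duckdb

con = duckdb.connect()

# load the current Parquet files (already copied to the local cache) into a DuckDB table
con.execute("CREATE TABLE scada AS SELECT * FROM read_parquet('/tmp/cache/scada/*.parquet')")

# apply the update locally; a merge would follow the same pattern
con.execute("UPDATE scada SET SCADAVALUE = 0 WHERE SCADAVALUE < 0")

# export the whole table again; the files in remote storage are then overwritten wholesale,
# e.g. with fs.put(...), which is why this works for small data but will not scale
con.execute("COPY scada TO '/tmp/scada.parquet' (FORMAT PARQUET)")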

Note: my experience with Delta tables in Python (not Spark) was mainly on GCP. It turns out Azure has better support (read, write, show history, vacuum), but as of today I still think the Arrow dataset API is more stable.

Final Thoughts

Regardless of what you use, I think it is important to ask your vendor: what's your solution for smaller data? Are you paying a premium for a big data solution, and is it justified by your workload?

Querying Azure storage using DuckDB

DuckDB just added support for fsspec, which makes querying object stores like GCP and Azure Storage possible. Note that the AWS S3 API was already supported natively.

Previously, to query Azure Storage you had to use a pyarrow dataset as a workaround; with the recent update, that is no longer needed.

Here is a simple example querying a folder of Parquet files partitioned Hive-style. Notice DuckDB is smart enough to recognize Date as a partition field.

import os
import duckdb
import adlfs
from dotenv import load_dotenv

# credentials and the table path are read from a local ".env" file
load_dotenv()
AZURE_STORAGE_ACCOUNT_NAME = os.getenv('AZURE_STORAGE_ACCOUNT_NAME')
AZURE_STORAGE_ACCOUNT_KEY = os.getenv('AZURE_STORAGE_ACCOUNT_KEY')
table_path = os.getenv('table_path')

# fsspec filesystem for Azure Blob / ADLS Gen2
fs = adlfs.AzureBlobFileSystem(account_name=AZURE_STORAGE_ACCOUNT_NAME, account_key=AZURE_STORAGE_ACCOUNT_KEY)

con = duckdb.connect()
# register the fsspec filesystem so DuckDB can read "abfs://" paths directly
con.register_filesystem(fs)
df = con.execute(f'''
    select *
    from read_parquet('{table_path}/scada/data/*/*.parquet', hive_partitioning=true)
    limit 10
    '''
).df()
con.unregister_filesystem('abfs')
df

and here is the result

Make sure to have the ".env" file in place when running the notebook from your computer; it simply defines AZURE_STORAGE_ACCOUNT_NAME, AZURE_STORAGE_ACCOUNT_KEY and table_path. Here is an example of what it looks like.

As a PowerBI user, I see potential for a lightweight ETL process using just Python that does complex transformations and outputs the results as Parquet files, which PowerBI can consume.

As I deal with small data (less than 30 GB), Apache Spark does not make much sense to me. Hopefully Synapse will provide us with a cheap single-node notebook experience; I suspect it would be useful for a lot of customers.

First Look at Datamart

The Power BI product team kindly provided us with a private preview of Datamart a couple of months ago. The product has many components and may mean different things to people with different technical backgrounds, but at its core it is a SQL Server database with columnar compression and a SQL endpoint. The data is loaded using Power BI dataflows, and there is a brand-new web experience for data modelling and measure authoring; no desktop is required.

Before we get into the details, let's cover some key features first.

  • It is Premium only and, to my surprise, it is available with PPU too ($20/month/user is an insane value proposition).
  • Security is managed by Power BI using Azure Active Directory; you don't need any obscure SQL security knowledge.
  • Sharing a database is a two-click operation (even with external users).
  • Datamart is very fast; sometimes I had to double-check I was not in import mode.
  • It is a single-node server; data is currently limited to 100 GB per database.
  • Currently only Power BI dataflows can load data (incremental refresh is supported too), and there is no support for creating views yet.
  • Multiple users can use the Datamart UI at the same time.
  • You don't need to know anything about compression, partitioning, table distribution, data skew, shuffles, etc.; Datamart handles it automatically, creating a clustered columnstore index, statistics, partitions and so on.
  • The product team promises that full DDL/DML is coming soon.
  • I find this blog very useful for explaining how the SQL Server clustered columnstore index works.

The first success we had was making some data from a Power BI dataflow available through the SQL endpoint. It was trivial: just one click on the server settings and voila, a connection string that any tool with SQL Server support can use. Obviously Power BI, SSMS, Excel and Tabular reports, but also Tableau, R, Python, you name it.
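
As a quick illustration, here is roughly what connecting from Python looks like. The connection string is a placeholder you would copy from the Datamart settings page, and the schema/table name in the query is just an example of how the tables were exposed in my preview tenant.

import pyodbc
import pandas as pd

# connection string copied from the Datamart server settings (placeholder below)
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-datamart-sql-endpoint>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# query the Datamart like any other SQL Server database (example schema/table name)
df = pd.read_sql("SELECT TOP 10 * FROM model.orders", conn)
print(df)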

I am a “traditional” Power BI developer, what does it mean for me?

Nothing has changed: it is not Power BI version 2, and it is not replacing Vertipaq; you can just keep using Power BI Desktop as usual. In my opinion, it is the Power BI product team trying to reimagine a better data experience and solve some pain points.

  • Make Power BI dataflows a more “enterprise” solution; with a SQL endpoint, data engineers will take PowerQuery more seriously now 🙂
  • An online modelling experience; Mac users no longer need to install a VM.
  • A new experience for users with SQL skills: yes, you can just write SQL and get results without modelling, and you don't need to know anything about DAX.
  • Addressing a market that was poorly served before: an easy-to-use SQL experience for business users who don't necessarily know SQL and don't want to model anything.
  • Bonus point: no Azure access is required; it is just the Power BI experience as we know and love it.

So how does it work?

You load the data using a Power BI dataflow and it is instantly available in a SQL database. If you know SQL, you can just write queries and download an Excel spreadsheet that contains a live connection to the Datamart SQL Server with your SQL query; the data will always be up to date, it is not simply a copy of the results.

One nice thing about using SQL Server as the execution engine is that the documentation available on the internet is amazing. Let's say I want to know how much memory the server is using: I just copy a query from the docs and it works.

Let's check which SQL Server version Datamart uses (a compatibility level of 150 means SQL Server 2019).

Here you can see the table structure created by Datamart: one partition and multiple segments (the table has 60 million rows, and each segment is around 700K rows). I think with incremental refresh it creates multiple partitions (I have not tested it yet).
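
For the curious, this is the kind of query you can copy from the SQL Server documentation to inspect the row groups (segments) yourself, run here through Python for consistency with the earlier sketch; the conn object is the pyodbc connection from above.

import pandas as pd

# one row per columnstore row group: table, partition, row count and state
row_groups = pd.read_sql(
    """
    SELECT object_name(object_id) AS table_name,
           partition_number,
           row_group_id,
           total_rows,
           state_desc
    FROM sys.dm_db_column_store_row_group_physical_stats
    ORDER BY table_name, row_group_id
    """,
    conn,  # pyodbc connection from the sketch above
)
print(row_groups)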

Visual SQL

If you don't want to write SQL, fine, just use the visual experience to get the results. Here I am getting the top customers by number of orders without writing any SQL code; bonus point, the SQL is generated behind the scenes, it is like a visual dbt 🙂

And the SQL generated

OK, you don't want ad hoc stuff? Fine, you can model your data by defining relationships and row-level security if necessary.

To be clear, tables, relationships and row-level security are pure “SQL artefacts” and can be used by any front-end tool as long as it can connect to SQL Server and has proper access through Azure Active Directory. And no, measures cannot be defined in SQL 🙂

Obviously, you can connect using SSMS. (notice the relationship columns)

It is Still Power BI

I presume that if you paid for the Power BI service, you probably still want to use Power BI as a visualisation tool 🙂

You can define measures in DAX (calculated columns and tables are not supported; everything must be loaded using PowerQuery).

And here is the trick: Datamart auto-generates a read-only dataset. It is only a logical semantic model, no data is duplicated, and it uses a DirectQuery connection to the Datamart SQL Server.

Auto Generated dataset

I will be honest, I did not like it initially. I wanted more control: I want to configure which tables to include in the dataset and which ones to exclude. DirectQuery performance is amazing, but I still want to define some tables as dual mode, or simply import. And I am not going to be apologetic about it: I like calculated columns and calculated tables, I use them a lot, and I don't care if they need some extra size; some patterns are trivial in DAX and very hard (at least for me) in SQL and PowerQuery.

Then I changed my mind. Charles Webb (the PM for Datamart) was very clear in his explanation about the different personas: it is free, and you don't have to use it in all cases. People with a SQL background and less Power BI experience will find it amazing, as they don't have to deal with Power BI Desktop or fancy modelling; people with Power BI experience may prefer more control and will extend the dataset; other, more advanced developers may want to use Tabular Editor, and that is fine.

Flexible Open Architecture

You can have a Datamart that manages everything, or you can keep the dataflow, Datamart and dataset separated. It is really a nice architecture and provides a lot of flexibility. It depends: if you have a Datamart containing a lot of tables that update on different schedules, it makes sense to have the dataflow do the ETL first, then load the results into the Datamart later. (I think dataflows will eventually provide an even more elegant solution.)

Anyway, it is flexible: it can be tightly coupled if you want, or you can pick and choose which components to use, and I am sure users will use it in unexpected ways (I may have loaded a DAX calculated table and exposed it through the SQL endpoint).

One aspect I really like for philosophical reasons: even if all the downstream users end up using only Power BI reports or Excel, having a SQL endpoint is extremely important. It makes this an enterprise-grade solution that can talk to other tools when needed, not a self-service silo.

PBIX Download is annoying 

The auto-generated dataset cannot be downloaded, for a good technical reason, but if you create a report based on it, the report downloads as a pbix that contains a DirectQuery connection to SQL Server. IMHO a better approach would be: if you build a thin report from the auto-generated dataset and download the pbix, it should use only a live connection (which is the behaviour when you build the report in the Desktop), whereas if you download the pbix from the Datamart itself it should use DirectQuery mode to SQL Server. It is not a problem per se, just unexpected behaviour.

Final thoughts

I think it is an amazing solution that will keep evolving. The Microsoft data team managed to create an exceptional user experience for building a full data stack that anyone can use regardless of their technical background. I remember when I first used PowerPivot in 2015, I knew it would change the industry (and it did), and I am sure Datamart will do the same.

Initially I wrote a long commentary on how Datamart fits into the “Modern Data Stack” and how it compares to other vendors' offerings. Then something occurred to me: Microsoft is just building a useful product that solves hard problems for a very large segment of the market, and anything else does not matter. Call it the “Post-Modern Data Stack” if you want.