Sharing Data Between OneLake and Snowflake

TL;DR: OneLake is really open storage and works with any tool that understands the Azure ADLS Gen2 API. The endpoint naming is different, which was confusing (at least for me), but there is no catch: it is the same tech under the hood, there is no vendor lock-in, and it is a totally open ecosystem.

Use Case

For no obvious reason, I thought about an unusual scenario: load data to OneLake using Dataflow Gen2 and query the results using Snowflake. It works with any engine that understands Delta tables; mind you, no Spark was involved at all, it is all native code.

Azure AD

Fabric uses Azure AD for authentication and supports service principals too; this is the recommended route. I am not sure whether Snowflake understands the OneLake endpoint format, I think it expects something like this: “accountname.blob.core.windows.net/container_name/”

But with OneLake you get something like this:

“sharing@onelake.dfs.fabric.microsoft.com/sharing.Lakehouse/Tables/scada”

Maybe it works, but I have not tested it; probably a blog for another day.

Shortcut

Another approach is to create a shortcut to an existing ADLS Gen2 storage account. It is trivial to set up and supports both read and write, which is what I did.

I can write the data using either Spark or Dataflow Gen2; the Data Warehouse is not supported, you can only read it using the OneLake endpoint. Anyway, for this example I am using Dataflow Gen2 to write the data.

Turns out, PowerQuery has a native Delta table writer 🙂; you can see it in the log.

And here is the Data in Azure Storage

If you want to share data with external parties, maybe an external Azure container is the more flexible solution. I know it may sound silly, but the fact that I can see the data using Storage Explorer in Azure, independently of Fabric, makes it feel more neutral.

Reading from Snowflake

Snowflake recommends using a service principal, but for this quick demo I used a SAS token, making sure to grant read and list permissions.

And then used this SQL script to define the external table:

CREATE DATABASE ONELAKE;
CREATE SCHEMA AEMO;
USE ONELAKE.AEMO;

CREATE OR REPLACE STAGE onelake
  URL = 'azure://fabricshare.blob.core.windows.net/scada/'
  CREDENTIALS = (AZURE_SAS_TOKEN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');

CREATE OR REPLACE EXTERNAL TABLE scada (
  SETTLEMENTDATE TIMESTAMP AS (value:SETTLEMENTDATE::VARCHAR::TIMESTAMP),
  DUID VARCHAR AS (value:DUID::VARCHAR),
  SCADAVALUE NUMBER(12,5) AS (value:SCADAVALUE::NUMBER(12,5)),
  file VARCHAR AS (value:file::VARCHAR))
  LOCATION = @onelake
  REFRESH_ON_CREATE = FALSE
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET)
  TABLE_FORMAT = DELTA;

ALTER EXTERNAL TABLE scada REFRESH;

SELECT settlementdate, SUM(scadavalue) AS mw FROM scada GROUP BY ALL;

To get the latest data, you can run this SQL code:

ALTER EXTERNAL TABLE SCADA  REFRESH;

And because it is a Delta table, you will not get dirty reads; only fully committed data is read.
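This no-dirty-reads property comes from the Delta transaction log: readers resolve the table snapshot from the committed JSON entries in `_delta_log`, so a data file a writer has dropped into storage but not yet committed is simply invisible. Here is a minimal, much-simplified sketch of the idea (the real protocol also has checkpoints, protocol and metadata actions, and more):

```python
import json
import os
import tempfile

def committed_files(table_path):
    """Replay the JSON commit files in _delta_log: only files named in a
    committed 'add' action (and not later 'remove'd) belong to the table."""
    log_dir = os.path.join(table_path, "_delta_log")
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

# Toy table: one committed data file, plus an orphan file mid-write.
table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, "_delta_log"))
commit = os.path.join(table, "_delta_log", "00000000000000000000.json")
with open(commit, "w") as f:
    f.write(json.dumps({"add": {"path": "part-0000.parquet"}}) + "\n")
# A writer has started dropping a new data file but has not committed yet:
open(os.path.join(table, "part-0001.parquet"), "w").close()

print(committed_files(table))  # only the committed file is visible
```

A reader like Snowflake (or DuckDB, Delta Rust, etc.) works off this log, which is why a refresh only ever sees fully committed data.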

And here is a sample chart from Snowflake

Writing from Snowflake

Snowflake doesn’t support writing to Delta tables, which I think would be a great use case for PowerBI users with Direct Lake mode. I guess for now a user can use Fabric Data Factory to get the data; I feel there is a missed opportunity here.

Parting thoughts

I had mixed feelings about Delta tables and the concept of the Lakehouse in general; I felt it was too tied to the Spark ecosystem and useful only for big data setups. I changed my mind, first by using Delta Rust, and now Fabric is doing a good job making it even more accessible for business users. I think that’s progress. And to be very clear, this setup works with Databricks, DuckDB and any database that reads Delta tables.

First Look at Dataflow Gen2

TL;DR: I wish I had this in 2019

Background

Long story short: in 2019 I was working at a large-scale solar farm with access only to a PowerBI Pro license. I wanted to build a dashboard showing power generation data with a freshness of 5 minutes. I tried multiple workarounds, but none worked as reliably as it should. The solution was to have some kind of database I could incrementally write to; I did not have access to Azure, and even if I had, Azure Data Factory was too alien to me. COVID happened and I moved to another division, but I never forgot about it. I felt a deep disappointment in PowerQuery, and I even questioned the whole point of self-service data preparation if a relatively simple scenario could not be handled easily.

Enter Fabric

My first encounter with Dataflow Gen2 was extremely negative: there was a bug where the auto-generated dataset was consuming a lot of compute resources for no reason. I was distressed that they had messed up the product; luckily I was wrong, and the bug was fixed after 2 months. The graph does not need an explanation.

Lakehouse Destination

Initially Dataflow supported only writing to the Lakehouse (the DWH destination was added later). The performance is very odd: every refresh inserts only around 500 records (which is small), yet something feels not right.

Anyway, I reported the issue and the devs are looking at it.

DWH Destination

Because the data is public and I am just testing, I wanted to publish the report to the web. The messaging was not clear on whether this is supported; anyway, the documentation was updated recently, and it is a current limitation of Direct Lake mode. That is perfectly fine, it is a preview after all, but in this particular case I care about publish to web more than Direct Lake.

I changed the destination to DWH and what a wonderful surprise

What’s the big deal about Dataflow Gen2 anyway?

Just one option: you can append the result to a table, which makes incremental refresh a trivial task. Obviously it was no easy feat; previously PowerQuery could write only to the PowerBI internal engine, but now there are multiple destinations and more are coming.

Not only that, because you are writing to a database you can even run further data transformations using SQL; I recommend this excellent tutorial on what’s possible now.

The idea is very simple: get the names of the files already loaded, and load only the new files.
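The pattern can be sketched in a few lines. This is an illustration of the idea only, not the actual Dataflow implementation; the file names and helper functions are made up:

```python
import os
import tempfile

def new_files(folder, loaded_log):
    """Return Parquet files in `folder` not yet recorded in the log."""
    loaded = set()
    if os.path.exists(loaded_log):
        with open(loaded_log) as f:
            loaded = {line.strip() for line in f if line.strip()}
    return sorted(f for f in os.listdir(folder)
                  if f.endswith(".parquet") and f not in loaded)

def mark_loaded(loaded_log, names):
    """Append freshly loaded file names to the log."""
    with open(loaded_log, "a") as f:
        for name in names:
            f.write(name + "\n")

# Simulate two refreshes over a landing folder.
folder = tempfile.mkdtemp()
log = os.path.join(folder, "_loaded.log")
open(os.path.join(folder, "scada_0005.parquet"), "w").close()

batch1 = new_files(folder, log)   # first refresh picks up the file
mark_loaded(log, batch1)

open(os.path.join(folder, "scada_0010.parquet"), "w").close()
batch2 = new_files(folder, log)   # second refresh sees only the new file
print(batch1, batch2)
```

In Dataflow Gen2 the "log" is simply the destination table itself: an anti-join of the folder listing against the already-loaded file names gives you the delta to append.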

And here is the whole data pipeline, from data ingestion to the final report, all using one product.

You can have a look at the report here; publish to web caches the data, so you may not see the latest update.

I appreciate this is a very short blog, but if you know PowerQuery, there is nothing new to learn: same tech, but substantially more powerful as you are loading to a DWH. You want a database? Here is one, just like that 🙂

Final Thoughts

Dataflow Gen2 is finally blurring the line between self-service data preparation and professional ETL development. I understand marketing people will focus more on the Lakehouse and Direct Lake, but most potential Fabric users are just PowerBI users, and Dataflow Gen2 is exactly what was missing from PowerBI.

I am really looking forward to the GA of Fabric F2. Dataflow Gen2 is a killer feature and I hope they get the resource utilization right.

Loading files from a folder to Fabric DWH using a Data Factory pipeline

I know it is very basic for a Data Factory professional, but as a PowerBI dev it was like painting with my feet, so I thought it might be worth sharing with my people 🙂

Use Case 

Let’s say I have a folder of Parquet files (it does work with JSON, Excel, etc.)

And I want to load this data to a DWH, not Lakehouse tables. I don’t want to use a notebook, and I am not particularly interested in a delta lake as an intermediary step; all I want is to load data from a folder to a DB staging table. That’s a very common scenario, supported by virtually any DWH on the market, including Synapse Serverless.

Define a Variable

Define an array that contains the names of the subfolders. I know it could be automated, but I am a slow learner (it seems it is not supported yet). Make sure you click on the empty canvas to get that view.

Add a ForEach activity

Add a ForEach activity, click Settings, then Items, add dynamic content, and select the previous variable.

You will get something like this

Edit Activity inside For Each

Use the Copy Data assistant

Just pretend you are loading only one file and make sure it works; when it is done, change the hard-coded name to a parameter.

At the End you will get something like this

Source

Destination

Mapping

Make sure no mapping is assigned, as different tables will have different schemas.

Change hard coded names to dynamic values

Change the source name to

Change destination to 

Item will use the current item from the previous variable

Validate and Test Run

You will get something like this.

Alternatively, if you want to copy files directly using SQL, then please vote for this idea.

First Look at Fabric F2 

Edit, 14 December 2024: all the issues were fixed; please see the updated experience: https://youtu.be/zduuh35SAac

Edit, 26 December 2023: the bug was fixed; changing capacity works as expected.

TL;DR: a simple change of capacity generated a cascade of failures in the Fabric workspace.

Use case

For the last month, I have been testing Fabric in my personal tenant using the free F64 trial. After analyzing three different possible workflows for the same workload, I concluded the following:

  • Using OneLake shortcuts from an existing Azure storage bucket: I did not like the storage transaction cost, and it defeats the purpose anyway if I have to use an external service.
  • Dataflow Gen2: the auto-generated dataset and lakehouse consume a lot of compute, and there is no way yet to turn that off.
  • Spark notebook: the cheapest option in terms of compute usage, around 45K CU(s), and it is pretty stable; Direct Lake for the custom dataset works and everything seems OK.

The report is straightforward: download a file every 5 minutes and append it to an existing delta lake table, run a compaction every 4 hours, and update a dimension table every day. Here is the overall diagram.

And here is the final result; the key here is to have a freshness of 5 minutes at any time.

Day 1 : Changing capacity to Fabric F2

I know it may sound weird, but when I test something seriously, I usually pay for it. It is all fun to use something for free, but the real test is when you spend your own money: is the tool worth it?

I take Fabric seriously; I think it is an interesting take on how a data platform should look.

First in Azure, I created a new Fabric F2 capacity

306 USD, that’s nearly 460 AUD: a lot of money for a service that does not auto-suspend. In this case, I can’t manually pause and resume, as the whole point is the 5-minute freshness. My thinking was I could use it for 2 days and delete it later; it was a big mistake.

F2 should be enough, ignoring bursting and smoothing and all that stuff.

F2 total compute = 2 × 24 × 3600 = 172,800 CU(s) per day; the workload consumes 45,000 CU(s), so it should be good to go.
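Spelling out the arithmetic (a rough capacity check, ignoring bursting and smoothing as noted):

```python
# Back-of-the-envelope capacity check for an F2 SKU (2 capacity units).
cu = 2
daily_capacity = cu * 24 * 3600   # capacity-unit seconds available per day
workload = 45_000                 # observed daily Spark usage, in CU(s)

print(daily_capacity)                                   # 172800
print(f"utilisation: {workload / daily_capacity:.0%}")  # roughly a quarter
```

On paper the workload fits in an F2 with plenty of headroom, which is what made the failures that followed so surprising.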

The theory is that Fabric compute is stateless: you change the license, and it seamlessly switches the compute.

Everything seems fine

Day 2

I had a look at Fabric usage in F2; Spark usage for the main notebook was null. It seems the notebook scheduler was stuck with the previous capacity (to be honest, I don’t know for sure). Anyway, I stopped the scheduler, suspended F2, then restarted the scheduler and the capacity, and the usage started to appear. Then I noticed something weird: the Direct Lake custom dataset kept failing. I sent the errors to the Microsoft devs to have a look; long story short, apparently I hit a race condition and the database did not get evicted. It is a preview, and stuff happens.

Day 3

Anyway, 13 Australian dollars for this experiment, that’s OK. The good news is that even if F2 does not work, at least the cost per day is fixed.

As I was busy trying to understand why Direct Lake did not work, I did not notice the more critical problem: data was duplicated!

I don’t know how it is possible; the notebook is very simple: download a file, append the result to a delta table, and add the filename to a log (a simple parquet file). I used the log just to avoid reading the delta table every time only to get the distinct filenames.

Maybe somehow there were concurrent writes from the two capacities (F2 and the trial, both somehow stuck on the same workspace). Delta tables support that just fine, but maybe the log was locked by one writer; or maybe the issue is unrelated and it was just bad luck. I have never had any issue with that code, as I use it with a cloud function where there is a guarantee of only one writer at a time.
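If two writers did race on that side log, the failure mode would be a classic lost update: both read the log, both see the file as not yet loaded, and both append it to the table. This is a hypothetical reconstruction of the duplication, not the actual notebook code:

```python
# Sketch of how a non-transactional "loaded files" log can duplicate data.
# Delta handles the concurrent appends fine; it is the check-then-act
# against a stale snapshot of the side log that races.
loaded_log = {"file_0005"}          # stand-in for the parquet log of names
table = ["rows from file_0005"]     # stand-in for the delta table

def load_if_new(filename, snapshot):
    """Check-then-act against a (possibly stale) snapshot of the log."""
    if filename not in snapshot:
        table.append(f"rows from {filename}")  # append to the delta table
        loaded_log.add(filename)               # record it as loaded

# Interleaving: both capacities snapshot the log before either writes.
snap_a = set(loaded_log)
snap_b = set(loaded_log)
load_if_new("file_0010", snap_a)    # writer A loads the file
load_if_new("file_0010", snap_b)    # writer B loads it again

print(table.count("rows from file_0010"))  # 2 -> duplicated data
```

With a single writer (as in the cloud function setup) this interleaving cannot happen, which would explain why the code had always been reliable before.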

But that’s not the issue: after digging further into the recent runs, I learnt that the UI only tells you that the notebook ran, not that it ran successfully.

I started looking at runs with longer durations, and here is an example; it does not look like a successful run to me.

Errors happen, and in any production pipeline you add another step to get rid of duplicate records, but I think it would be nice to show errors in the user interface.

Parting Thoughts

I wrote a big rant, then I realized it doesn’t matter; I am accepting the whole experience as it is. I deleted F2 though, and Synapse as retaliation. At the end of the day it is my own unrealistic expectation: I was hoping for Fabric to be something it is not. It is not a serverless product (which is fine for an enterprise setup with a constant workload), but for a scale-down scenario or personal usage, the tech is not there; even with reserved instance pricing, the math doesn’t add up.

My first impression did not change though: Fabric’s sweet spot is F64+, and that’s fine, but a man can dream.