Make sure to have the file”.env” when running the notebook from your computer, here is an example how it looks like
As a PowerBI user, I see a potential for a lightweight ETL process using just python that do complex transformation and output the results as a parquet files which PowerBI can consume.
As I deal with small data( less than 30 GB), Apache Spark does not make much sense to me, Hopefully Synapse will provide us with a cheap single node Notebook experience. I suspect it may be useful for a lot of customers.
Power BI product team kindly provided us a private Preview of Datamart a couple of Months ago, the product has many components and may mean different thing to different people( with different technical background ), but at the core, it is a SQL Server DB with columnar compression, and a SQL Endpoint, the Data is loaded using Power BI dataflow, and a brand new Web Experience for Data Modelling and measures authoring, no desktop is required.
Before we start talking about details, let’s get some key feature first.
It is premium only and to my surprise, it is available to PPU too ( 20 $/Month/User is an insane value proposition).
Security is managed by Power BI using Azure active directory, you don’t need any obscure SQL Security knowledge.
Sharing a database is a two click Operation. ( even with external users )
Datamart is very fast, sometimes I had to double check I am not in import mode.
It is a single node server, Data currently is limited to 100 GB per Database.
Currently Only Power BI dataflow can load data ( incremental is supported too), and no support for creating views yet.
Multiple users can use Datamart UI at the same time.
You don’t need to know anything about compression, partition etc, table distribution, data skew, shuffle etc, Datamart handles it automatically, it does create Clustered Columnstore index, statistics, partitions etc.
The product team promises that full DDL/DML is coming soon.
The First success we had was to make some data from Power BI dataflow available using the SQL Endpoint, it was just trivial, Just one Click on the server setting, and voila, a connection string that any tool with SQL Server support can use , Obviously Power BI, SSMS, Excel, Tabular reports, but Tableau, R, Python you name it.
I am a “Traditional” Power BI developer, what does it mean to me ?
Nothing has changed, it is not Power BI version 2,it is not replacing the Vertipaq, you can just keep using Power BI desktop as usual, in my opinion it is just Power BI product team trying to reimagine a better data experience and solving some pain points.
Make Power BI Dataflow a more “Enterprise” solution, there is a SQL Endpoint point, Data engineer will take PowerQuery more seriously now 🙂
An online Modelling Experience, no need to install a VM for Mac users.
A new experience for users with SQL skills, yes you can just write SQL and get results without out Modelling, you don’t need to Know anything about DAX.
Addressing a market that was poorly served before, an easy to use SQL Experience for Business users that don’t necessarily know SQL, and don’t want to model anything.
Bonus point, no Azure access requirement, it is just Power BI experience as we know it and love it.
So how it Works ?
You load the data using Power BI dataflow, it is instantly available in a SQL database, if you know SQL, you can just write Queries, and download an Excel spreadsheet that contains a connection to Datamart SQL Server with a SQL Query , the data will be always up to date, it is not a simply a copy of the results.
One nice thing about using SQL Server as an Execution Engine, the documentation available on the internet is amazing, let’s say I want to know how much memory the server is using, I just copy it and it works.
Let’s check which SQL server edition, Datamart uses ( 150 means, SQL Server 2019)
here you can see the table structure created by Datamart, 1 partition and multiple segments ( the table has 60 Million rows, each segments is around 700K rows), I think with incremental refresh it create multiple partitions ( I have not test it yet)
If you don’t want to write SQL, fine just use a Visual Experience to get the results, here I am getting the top customers by number of orders without writing any SQL code, bonus point the SQL is written behind the scene, it is like a Visual dbt 🙂
And the SQL generated
Ok, you don’t want ad hoc stuff fine, you can Model your data by defining relationship and row level security if necessary
To be clear, Tables, relationship and row level security are a pure “SQL artefact” and can be used by any front-end tool as long as it can connect to SQL Server and has proper access using Azure Active directory, and no Measures can not be defined as a SQL 🙂
Obviously, you can connect using SSMS. (notice the relationship columns)
It is Still Power BI
I presume, if you paid for Power BI service, probably you want still to use Power BI as a Visualisation tool 🙂
you can define Measures in DAX ( calculated column and tables are not supported, everything must be loaded using PowerQuery)
And here is the Trick Datamart Auto generate a read only Dataset, it is only a logical semantic Model, no data is duplicated, it uses a Direct Query Connection to the SQL Server Datamart.
Auto Generated dataset
I will be honest I did not like it initially, I wanted more control, I want to configure which table to include into the dataset and which one to exclude, Direct Query performance is amazing, but still I want to define some tables as dual Mode, or simply import, and I am not going to be apologetic about it, I like calculated column and calculated table, I used them a lot, and I don’t care if they need some extra size, some pattern are trivial using DAX and very hard ( at least for me) using SQL and PowerQuery.
One day, I changed my Mind , Charles Webb (the PM for Datamart) was very clear in his explanation about the different personas, it is free, and you don’t have to use it in all cases, people with SQL Background and less experience with Power BI will find it amazing, they don’t have to deal with Power BI desktop or extra fancy modelling, people with Power BI experience may prefer to have more control and will extend the Dataset, other advance Developers may want to use Tabular Editor and it is fine.
Flexible Open Architecture
You can have a Datamart that manage everything, or you can keep Dataflow, Datamart, Dataset separated, it is really a nice architecture and provide a lot of flexibility, I think it depends, if you have a Datamart that contains a lot of tables that update based on different schedule it make sense to have dataflow do the ETL first then load the results later using Datamart. ( I think later Dataflow will provide even a more elegant solution)
Anyway, it is flexible and can be tightly coupled if you want to, or you can pick and choose which component to use, and I am sure users will use it in some unexpected way ( I may have loaded some DAX calculated table and expose it to the SQL Endpoint)
One aspect I really like for philosophical reason, even if all the downstream users will end up using Only Power BI reports or Excel, having a SQL Endpoints is extremely important, it is an Enterprise grade solution that can talk to other tools when needed and not a self-service silo.
PBIX Download is annoying
The auto generated dataset cannot be downloaded for a good technical reason, but if you create a report based on it, the report will be download with a pbix that contains a DQ mode to SQL Server, IMHO a better approach is, if you build a thin report from the auto generated dataset, and you download the pbix, it should use only a live connection ( which is the same behaviour if you build a report using the Desktop), but if you download the pbix from Datamart then it should show a Direct Query Mode to SQL Server, it is not a problem per se, but rather unexpected behaviour.
I think it is an amazing solution that will keep evolving, Microsoft Data team manage to create an exceptional user experience to build a full data stack that anyone can use regardless of his technical background, I remember when I first used PowerPivot in 2015, I knew that it will change the industry (and it did), and I am sure Datamart will do the same.
Initially I wrote a long commentary on how Datamart fit in the “Modern Data Stack”, and how it compares to Other vendors offering, then something occurred to me, Microsoft is just building a useful product that solve hard problems for a very large segment of the market and anything else does not matter, call it “Post Modern Data Stack” if you want.
When I start that AEMO Dashboard , I had a hard time dealing with PowerBI gateway, it is just setting there, my laptop has to be online whenever I need to schedule a refresh, it just annoyed me, and I could not understand how cloud based data needs on-premises gateway anyway, obviously later I learned that strictly speaking it was not required, there was just undocumented feature to get away of it ( the trick is in the first blog, thanks @Rad_Reza).
but before I was aware of that, I went to some rabbit holes dealing with new tools that are out of my comfort zone, and I think they are worth sharing.
my first thought was instead of accessing the data directly from the website, let’s instead copy the data to a cloud storage then read it from there, I have already a google storage account, it is very generous with a free 5GB storage, my data is not big around 2 GB of zipped csv..
first setback, there is no native connector to Google storage and even if there was ,we have something called egress fees, in a nutshell, cloud storage is really low cost, loading data is free, but getting your data out is not free, unless it is for the same provider and the same region, most of the cloud vendors use the same model, as my data will be processed in PowerBI, the clear choice is azure blob storage
Azure Blob Storage
the setup is very simple I used the following options :
the same region as my PowerBI region ( otherwise your pay the egress fees)
for replication I used LRS
as PowerBI don’t support data lake V2, I used the classical Blob Storage.
When you want to copy data, the official tool in azure is data factory, I tried to play around with copy activities, it is straightforward, my first attempt did work and it was fast , actually too fast 😊, no zip was transferred but rather an HTML
probably copy data just handle this case just fine, but when you use your own credit card on a cloud tool and you don’t know what you are doing, better stay back and take the time to understand how it works, I deleted the new created resources and went to the second option, Python !!!
Normally I go with R but blob storage has no API for R, I have very limited experience with Python , just using it for the excellent package altair , let’s try something new.
I was very pleasantly surprised, the amount of documentation for Python is just amazing, actually once I asked a question on stackoverflow and got a very succinct answer in less than a minute, no one was judgemental or downvoted my question ( the question was very basic). the only drawback is that sometimes the code works well for python 2, but I am using Python 3 anyway enough talking let’s show some pseudo code.
step 1 : get a list of files name from the web site
url = “web address where the files are saved”
result = urlopen(url).read().decode(‘utf-8’)
pattern = re.compile(r'[\w.]*.zip’)
here is a snapshot of the results, the full list is 60 items.
step 2 : get a list of files name from the blob storage
in the first run, the list is empty as we did not load anything yet, I load a couple of files manually just to test if it is working, the API for blob storage are very simple, you only need to provide your storage account name and key and I love that.
step 4 : Upload the new files to Azure Blob Storage
the same here, the Azure API are very simple and clear, I had only when issues, when the script upload in a loop, it does not wait until the transfer is completed before jumping to the next file, my workaround was just to use sleep ( sync is supported but not in this scenario where the input is from an url), anyone i got the answer in stackoverflow
ok, so we do have a script that works, now we need to run it on a schedule, once per day at 5 AM, keep in mind the whole purpose of this workflow is not to use on-premise software, I just need to find the service that runs a script on the cloud on a schedule, as I am already on azure, let’s stick in that ecosystem.
and it is a personal project, I prefer a free solution, my script runs only every 24 hours, for a couple of minutes, a quick google search and i find this little treasure, I will not repeat here the steps, WebJobs is a service that just do that.
note that the package azure-blob-storage is not a base package in Python you need to install first in WebJobs, the schedule functionality is very flexible as it is using CRON, I wish we had something like that in PowerBI Dataflows.
Every day at exactly 5 AM, a new file show up in the azure storage, although I don’t need those files, I am using now another approach to load the files directly in PowerBI, it is important to build a data lake ( yes, I just said that, I am just joke, data lake is folder in the cloud where you save the raw files, nothing more), storage is cheap but most importantly the requirement may change, I may need to report on another dimension and it is crucial to keep the raw unprocessed data.
Python is awesome
Azure API for python are straightforward
Azure is awesome.
Be careful of Egress fees
CRON is awesome wish it was supported in PowerBI dataflows.
Wish PowerBI dataflows could save a raw file, Powerquery is amazing but it does not copy raw files.