Easymorph is a very powerful data preparation tool for business users; you can export your results to CSV, a growing list of databases, or a PowerBI push dataset.
I think that with the new composite models, the push dataset has become rather interesting, as it behaves more or less like a regular dataset (see the limitations here): you can add relationships to other tables, etc.
No Code Data Pipeline
Just to test it, I loaded 7 CSV files, applied the usual transformations (filters, select columns, unpivot), then generated two PowerBI datasets, a fact and a dimension.
As push datasets append by default, I first delete all the rows in the PowerBI dataset before loading new data, just to avoid duplicates. The third export is a CSV file (I wish PowerQuery had that option one day, without using hacks).
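For context, the "delete rows" step is a single call to the push dataset REST API; here is a rough sketch of what that looks like in Python (the dataset id, table name, and token are placeholders, and Easymorph does this for you):

import requests

# push datasets append by default, so clear the table before loading a fresh extract
DATASET_ID = "<dataset-id>"          # placeholder
TABLE_NAME = "Fact"                  # placeholder table name
TOKEN = "<azure-ad-access-token>"    # placeholder AAD token

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/tables/{TABLE_NAME}/rows"
resp = requests.delete(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()              # 200 OK means the table is empty and ready for the new push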
And here is the model in the service:
PowerBI Support
There are two types of support:
Power BI Command
You can list workspaces, refresh datasets and dataflows, etc.; see the documentation.
Export Data to PowerBI
Microsoft provides an API only for push datasets; see the documentation.
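For reference, that API covers both creating the push dataset and pushing rows into it. A minimal sketch of creating one with the requests library (the dataset name, columns, and token below are placeholders, not what Easymorph generates):

import requests

TOKEN = "<azure-ad-access-token>"    # placeholder
payload = {
    "name": "EasymorphDemo",         # placeholder dataset name
    "defaultMode": "Push",
    "tables": [{
        "name": "Fact",              # placeholder table with illustrative columns
        "columns": [
            {"name": "Date", "dataType": "Datetime"},
            {"name": "Category", "dataType": "String"},
            {"name": "Value", "dataType": "Double"},
        ],
    }],
}
resp = requests.post(
    "https://api.powerbi.com/v1.0/myorg/datasets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(resp.status_code, resp.json())   # 201 Created returns the new dataset id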
Easymorph has a very generous free tier, and the license is reasonably priced; I think it is worth having a look.
PowerBI has to be more Open
Personally, I think PowerBI would gain more by being more open to third-party tools. I hope that one day the Vertipaq engine will be as open as SQL Server, in the sense that any tool can read and write data; it is a database after all. I am not suggesting making it open source or free (you would obviously still need to pay for a license), but adding a PowerBI REST API for regular datasets, for example, would be a good start.
Maybe I am dreaming now, but when I see a file format like Parquet, I wonder why we don't have an open API to load and read the Vertipaq engine's storage format: it has amazing compression, it is columnar, and it supports multiple schemas in the same dataset.
Marco Russo has expressed this idea more gracefully here
TL;DR: we use the filepath() function to get the partition date, which can be filtered from PowerBI when used with Synapse Serverless, potentially making substantial cost savings when used with incremental refresh, as fewer files will be scanned.
Edit: there is another technique to incrementally load Parquet files without a database.
This blog is just me experimenting with the possibility of passing the filters from PowerBI to a Parquet file using Synapse Serverless.
When you deal with on-demand query engines like Athena, BigQuery, Synapse Serverless, etc., the fewer files read, the less cost incurred; so if you somehow manage to read only the files that contain the data, without scanning all your folders, you can make substantial savings.
To be clear, the first time I used Parquet was yesterday, so I have only very basic knowledge at this stage.
1- Get the Source Data
I have a CSV file with 3 million rows; here is the schema:
2- Partition The Data
As far as I understand the Synapse Serverless engine, when you filter by date, for example, the engine will scan the whole file (maybe if the column is sorted it will scan less data; I don't know and have not tried it yet). Instead, we are going to partition the table by date, which simply means generating a lot of files split by the date values. I imagine generating a lot of small files is not optimal either: it will reduce cost but potentially make the query slower (at least this is how other engines work).
I am using this Python script to generate the partitions.
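In essence, the partitioning step boils down to something like the sketch below (assuming pandas + pyarrow; the CSV file name and column names are placeholders, not necessarily what my script uses):

import pandas as pd

# read the source CSV and derive a date key from the timestamp column
df = pd.read_csv("nem.csv", parse_dates=["SETTLEMENTDATE"])    # placeholder file/column names
df["Date"] = df["SETTLEMENTDATE"].dt.date

# one folder per day: nem.parquet/Date=2018-04-11/part-0.parquet, ...
df.to_parquet("nem.parquet", engine="pyarrow", partition_cols=["Date"])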
And here is the result: a folder of Parquet files grouped by date,
which I can read using PowerBI Desktop.
I just casually read a Parquet file without any programming language!!! (Kudos to the product team.) This is only to check that the Parquet files were properly generated.
3- Load The files to Azure Storage
This is only a showcase; in a production workflow you might use something like ADF or Python cloud functions. Anyway, to upload a lot of files, I am using Azure Storage Explorer.
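If you did want to script the upload instead, a rough sketch with the azure-storage-blob SDK could look like this (the connection string, container name, and local folder are placeholders):

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")   # placeholder
container = service.get_container_client("parquet")                                 # placeholder container

# walk the local partition folder and keep the Date=YYYY-MM-DD/ structure in the blob names
for root, _, files in os.walk("nem.parquet"):
    for name in files:
        local_path = os.path.join(root, name)
        blob_name = local_path.replace(os.sep, "/")
        with open(local_path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)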
Now we can start using Synapse Analytics.
4- Create a View In Synapse Serverless
We are going to leverage the filepath() function to get the partition date (please see the documentation); here is the query to create the view:
USE [test];
GO
DROP VIEW IF EXISTS parquet;
GO
CREATE VIEW parquet AS
SELECT
    *,
    CONVERT(datetime, result.filepath(1), 120) AS date
FROM
    OPENROWSET(
        BULK 'https://xxxxxxxxxx.dfs.core.windows.net/parquet/nem.parquet/Date=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS result
5- Test if the Partition Filter Works
Let's run some tests to see how much data is scanned.
No filter, scanning all the files: data processed 48 MB.
Now with a filter on a single date (11/04/2018): data processed 1 MB.
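The same test can also be scripted from Python against the serverless SQL endpoint, for example with pyodbc (the server, database, and credentials below are placeholders):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"   # placeholder serverless endpoint
    "Database=test;UID=<user>;PWD=<password>"             # placeholder credentials
)
cursor = conn.cursor()

# thanks to filepath() in the view, only the Date=2018-04-11 (11/04/2018) folder should be scanned
cursor.execute("SELECT COUNT(*) FROM dbo.parquet WHERE date = '2018-04-11'")
print(cursor.fetchone()[0])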
6- Test in PowerBI
I built this small model: the Date table is import and the Parquet table is DirectQuery. (Be cautious when using DirectQuery: PBI can be very chatty and generate a lot of SQL queries. Currently, it is better to use only import mode, until cache support is added.)
Case 1: View all Dates
Data processed: 43 MB
Case 2: Filter Some Dates
Let's filter only 2 days.
Note: please use the query reduction option in PowerBI Desktop; otherwise, every time you move the slicer, a query will be generated.
Here is the result
I was really excited when I saw the result: 1 MB. Synapse Serverless was smart enough to scan only 2 files.
7- Incremental Refresh
Instead of DirectQuery, let's try a more practical use case: I configured incremental refresh to refresh only the last 3 days.
And here is the list of queries generated by PowerBI:
I only have data for 2018 and 2019; the second refresh (which started at 3:45 PM) only refreshed the data for the last three days, and because there are no files for those dates, the queries processed 0 MB, which is great.
Another nice detail: the select top 100 query generated by PowerQuery to check the field types scanned only 1 MB!!!!
Just to be sure, I did another refresh at 4:24 PM and checked the partitions using SSMS.
Only the last three partitions were refreshed, and PQ sent only 6 queries (for each day, one selects the data and the other checks the field types).
I think Azure Storage + Synapse Serverless + PowerBI incremental refresh may end up being a very powerful pattern.
I was always intrigued by the push dataset in PowerBI, but it had a big showstopper: you can't mix it with other tables, which made it of little use for me. That changed in Dec 2020, when it became possible to use it in a composite model; see this excellent introduction.
To see how it works, I tried it with this dataset: the power generation in Australia every 5 minutes.
1- Create Dataset in PowerBI service
2- Get the URL
3- Use some tool to start streaming data
You need some tool to start pushing data into that table; for example, you can use Easymorph. Personally, I have an existing Python script that downloads the data and pushes it to a DB; I only had to add a couple of lines of code to make it push to the PowerBI dataset directly. I copied the code from this function.
Please make sure the datetime format is correct, otherwise it will not work.
###### post to PowerBI
import urllib.request          # urllib2 is Python 2 only; urllib.request is the Python 3 equivalent
from datetime import datetime

# the push API expects ISO 8601 datetimes, otherwise the rows are rejected
df['SETTLEMENTDATE'] = [datetime.strftime(item, "%Y-%m-%dT%H:%M:%SZ") for item in df['SETTLEMENTDATE']]
REST_API_URL = "https://api.powerbi.com/beta/XXXXXXXXXXXXXXXXXXXXXX"  # push URL copied from the PowerBI service
body = bytes(df.to_json(orient='records'), encoding='utf-8')          # one JSON record per row
req = urllib.request.Request(REST_API_URL, body)                      # a request with a body is sent as POST
response = urllib.request.urlopen(req)
print("PowerBI: HTTP {0} {1}\n".format(response.getcode(), response.read()))
################################################
4- Build a Composite Model in PowerBI Desktop
DirectQuery to the push dataset, and a dimension table as import; we need the dimension table to know the fuel source of every generator (wind, coal, etc.). The dimension table changes very rarely, less than 20 new rows per year!!!!
And here is the report
Maybe not a big deal for a lot of users, but Export to web does not currently work with composite models (as of 5-Jan-2021).
5- Refresh the Page
Please notice that even if the data source refreshes, you still need to click refresh to get the latest data on the visuals. When I try to add automatic page refresh, I get this error (I am using a Pro license, not PPU!!!!).
Edit: Fred Kaffenberger suggested a very clever workaround, using a hidden Play Axis custom visual to force the refresh, and indeed it works very well!!!
Take away
To be very clear, pushing a 5-minute dataset is not something new, but it was a relatively complex task; with the new composite model it becomes nearly trivial, and that's the magic of PowerBI.
In Dec 2020, PowerBI introduced a fundamental change to the architecture of the product: now, when you connect to an existing model, you can enhance it by adding your own data.
Personally, I find this functionality extremely useful. For example, I had access to an enterprise data model that contains Oracle Primavera data, but it was not very useful alone; that data makes sense for my use case only when combined with other sources, and now that is possible.
To make this functionality possible, the product team added the option to connect to an existing PowerBI dataset using DirectQuery; it is not SQL but DAX instead. I think you should read this first, if you haven't already.
In this blog, I am experimenting with a new scenario just out of curiosity; I am not sure if it is useful at all, but it is fun!!!! In PowerBI, by default, when you refresh a dataset all the tables are refreshed; the only way to control that is the XMLA endpoint, which involves some coding and requires a Premium license (PPU works too).
The idea is simple: let's say you have a model with 4 tables, and only 1 table needs to be refreshed frequently.
– Create a new model that contains only that one table, and set up a scheduled refresh for how often you want to see the data updated (or trigger the refresh through the REST API, as sketched below).
– Delete that table from the existing model, and connect to it from the new model using DirectQuery.
The table that refreshes frequently can even be a real-time dataset.
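If the scheduled cadence is not enough, the small model can also be refreshed on demand with the PowerBI REST API; a rough sketch (the workspace id, dataset id, and token are placeholders):

import requests

GROUP_ID = "<workspace-id>"          # placeholder
DATASET_ID = "<dataset-id>"          # placeholder
TOKEN = "<azure-ad-access-token>"    # placeholder

url = f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}/datasets/{DATASET_ID}/refreshes"
resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"},
                     json={"notifyOption": "NoNotification"})
print(resp.status_code)              # 202 Accepted means the refresh was queued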
Testing
Again, don't read too much into it; it is just to give you an indication. The data is power generation every 5 minutes; it makes sense to update only the data for the current day, as all previous data does not change. The visual will show the data for today and yesterday.
1- All Tables are imported
Notice that Settlementdate is a datetime field; the data is imported using incremental refresh.
And here is the model:
Here is the result: 378 ms.
2- History Imported, Today's Data DirectQuery
When you use DirectQuery mode, the performance depends on the modelling used; here the measure Mw sums the values from the History table and the Today table.
If we use Settlementdate as the X axis, the results return in 80 seconds.
Now, using two dimensions, Date and Time, instead of Settlementdate, the performance is nearly the same as import: 492 ms.
I can't find a way to make date and time a continuous axis in the visual.
I noticed that if you use the DirectQuery table without dimensions from other models, the performance is extremely fast.
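To put a rough number on why the Date/Time split helps (assuming a two-year range at a 5-minute grain, matching my 2018-2019 data), a single Settlementdate key has a huge cardinality compared to separate Date and Time dimensions:

import pandas as pd

# two years of 5-minute settlement periods (assumed range)
idx = pd.date_range("2018-01-01", "2019-12-31 23:55", freq="5min")

print(len(idx))                        # ~210,000 distinct Settlementdate values
print(idx.normalize().nunique())       # 730 distinct dates
print(pd.Series(idx.time).nunique())   # 288 distinct 5-minute times per day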
Take Away
I am not going to pretend I am an expert in DAX optimization, and I am probably doing one or two things wrong, and as always it depends on a lot of factors 🙂 but as a rule of thumb:
– DirectQuery on a dataset does not like dimensions with high cardinality.
– An import model will always be more performant and more tolerant of bad modelling.
– Data modelling is very important, and now it becomes even more critical.
PowerBI import mode is so fast and powerful that even badly written DAX and poor data modelling will just work. DirectQuery mode on a PowerBI dataset opens up all kinds of new scenarios that were not possible before, but you have to be more careful about your modelling.