Refresh individual Tables using The Composite Model in PowerBI

In Dec 2020, PowerBI introduced a fundamental change to the architecture of the product: now, when you connect to an existing model, you can enhance it by adding your own data.

Personally, I find this functionality extremely useful. For example, I had access to an Enterprise Data Model that contains Oracle Primavera data, but it was not very useful on its own; that data made sense for my use case only when combined with other sources, and now that is possible.

To make this functionality possible, the product team added the option to connect to an existing PowerBI dataset using DirectQuery; the queries sent are not SQL but DAX. I think you should read this first, if you haven't already.

In this blog, I am experimenting with a new scenario just out of curiosity; I am not sure if it is useful at all, but it is fun! In PowerBI, by default, when you refresh a dataset all the tables are refreshed; the only way to control that is the XMLA endpoint, which involves some coding and requires a Premium license (PPU works too).

The idea is simple: let's say you have a model with 4 tables, and only 1 table needs to be refreshed frequently.

– Create a new model that contains only that 1 table, and set up a scheduled refresh for how often you want the data updated.

– Delete that table from the existing model, and connect to it from the new model using DirectQuery.

The table that refreshes frequently can even be a real-time dataset.

Testing

Again, don't read too much into it; it is just to give you an indication. The data is power generation every 5 minutes, so it makes sense to update only the data for the current day, as all the previous data does not change. The visual will show the data for today and yesterday.

1- All Tables are imported

Notice that Settlementdate is a datetime field; the data is imported using incremental refresh.

And here is the model.

Here is the result: 378 ms.

2- History Imported, Today's Data in DirectQuery

When you use DirectQuery mode, the performance will depend on the modelling used; here the measure Mw will sum the values from the History table and the Today table.

If we use Settlementdate as the X axis, the results return in 80 seconds.

Now, using two dimensions, Date and Time, instead of Settlementdate, the performance is nearly the same as import: 492 ms.
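
To give an idea of what the two-dimension approach can look like, here is a minimal Power Query sketch (the sample rows and the Mw column are made up for illustration, not taken from the actual model): the high-cardinality Settlementdate is split into a low-cardinality Date key and Time key, which can then relate to dedicated Date and Time dimension tables.

let
    // tiny inline sample standing in for the Today fact table (hypothetical data)
    Source = #table(type table [Settlementdate = datetime, Mw = number],
        {{#datetime(2021, 1, 1, 0, 5, 0), 120}, {#datetime(2021, 1, 1, 0, 10, 0), 118}}),
    // split the high-cardinality datetime into separate Date and Time keys
    AddDate = Table.AddColumn(Source, "Date", each DateTime.Date([Settlementdate]), type date),
    AddTime = Table.AddColumn(AddDate, "Time", each DateTime.Time([Settlementdate]), type time)
in
    AddTime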

One limitation: I can't find a way to make date and time a continuous axis in the visual.

I noticed that if you use the DirectQuery table without dimensions from other models, the performance is extremely fast.

Take Away

I am not going to pretend I am an expert in DAX optimization, and probably I am doing one or two things wrong, and as always it depends on a lot of factors 🙂 but as a rule of thumb:

DirectQuery on a PowerBI dataset does not like dimensions with high cardinality.

An import model will always be more performant and more tolerant of bad modelling.

Data modelling is very important, and now it becomes even more critical.

PowerBI import mode is so fast and powerful that even badly written DAX and poor data modelling will just work. DirectQuery mode on a PowerBI dataset will open all kinds of new scenarios that were not possible before, but you have to be more careful about your modelling.

Building a Modern Data Stack using BigQuery, Dataform and PowerBI

Google Cloud recently bought Dataform and made it available for free. Although I had played with it before, now I thought it was a good time to use it more seriously. This is not a review, but my own experience as a data analyst who is more comfortable with Microsoft self-service data tools and does not use SQL in day-to-day work.

I have an existing data pipeline in BigQuery: the data is loaded using Python, and there are scheduled queries using Python and BigQuery's native scheduler. Although the whole thing has worked very well for the last 15 months, I would not say working with multiple views and tables was a pleasant experience, to be honest. Because I was afraid of breaking something, I had not touched it much; everything changed once I started using Dataform to manage it.

What Dataform does (and I imagine dbt too) is implement some very simple functionalities that make the whole workflow extremely easy to manage: you write your SQL code in Dataform, dependencies between tables are auto-generated, and when you click run, it builds those tables and views in BigQuery.

I think a general overview of what I did will hopefully give you a sense of the big picture.

1- Define your Source Tables

Here is the representation in Dataform

For example, for the table “DREGION”, you write this code:

config {
  type: "declaration",
  schema: "aemodataset",
  name: "DREGION",
  description: "Price every 5 minutes, history"
}

You repeat the same for all the source tables.

And here is the view in the dependency tree.

2- Remove hard-coded references to Tables in SQL Queries

Let's say you have an existing view:

SELECT
   *
 FROM
   xxxxxx.aemodataset.rooftoptoday

Instead of hard-coding the table, you change it to this:

config {
   type: "view",
   schema: "PowerBI",
   tags: ["PowerBI"]
 }
 SELECT
   *
 FROM
   ${ref("rooftoptoday")}

This format is called SQLX; as you can see, it is still SQL, but with some added functionality. In the config you define whether it is a table or a view, the dataset where it will be located, and a tag (which will be useful later for scheduled refresh).

Now, repeat this for all your tables and you get this beautiful dependency tree.

3- Schedule Queries

And that's where the magic is: when you schedule a query, you have the option to schedule all dependent tables. For example, I set up a daily refresh for the table “UNITARCHIVE”, and the two tables “archive_view” and “revenue” will be run in sequence without me writing any extra code.

The Dataform project is published on GitHub here; it is really nice to see the history of all the changes, made so easy with the version control integration.

4- Here are the final Views in BigQuery

I think it is good practice to always expose only views to PowerBI, as you can change the logic later without breaking the connection to PowerBI.

5- Connect PowerBI to BigQuery

PowerBI connects to BigQuery using incremental refresh to reduce the time required to update; it is pretty trivial to set up.
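
For reference, incremental refresh just needs the source query to be filtered by the two reserved datetime parameters RangeStart and RangeEnd; PowerBI then builds the partitions, and the filter should fold down to BigQuery so each partition reads only its own slice. A minimal sketch, assuming hypothetical project, dataset, view and column names:

// RangeStart and RangeEnd are created as DateTime parameters in PowerQuery;
// PowerBI overrides their values for each partition it refreshes.
let
    Source = GoogleBigQuery.Database(),
    project = Source{[Name = "my-project"]}[Data],                    // placeholder project id
    dataset = project{[Name = "PowerBI", Kind = "Schema"]}[Data],     // dataset holding the reporting views
    fact = dataset{[Name = "archive_view", Kind = "Table"]}[Data],    // placeholder view name
    // filter on the datetime column so each partition only reads its own slice
    Filtered = Table.SelectRows(fact, each [SETTLEMENTDATE] >= RangeStart and [SETTLEMENTDATE] < RangeEnd)
in
    Filtered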

Although the data changes every 5 minutes, I am using a PowerBI Pro license, which is limited to 8 refreshes/day; if Premium Per User turns out to have a reasonable price, I will upgrade 🙂

Hopefully in 2021 we will have the option to serve PowerBI using BI Engine, as currently DirectQuery from BigQuery can get expensive very quickly if you have a lot of usage. Obviously, if you are on a flat rate, it is not a problem.

To clarify, BigQuery is very cost effective; the current pipeline actually costs me less than $2/month. You just have to be careful with PowerBI and use only import mode: PowerBI is very chatty when used in live mode, and it simply generates too many SQL queries.

6- Semantic Model in PowerBI

Dataform data models are not meant to replace a semantic model; all Dataform does is take raw tables and generate reporting tables that can be consumed by a BI tool (to clarify, BigQuery generates the views and tables; Dataform just manages the SQL code and the scheduled refreshes, but the compute is done by the database).

For a simple scenario, some flat tables are all you need (in fact, I am using Google Data Studio too for this example), but for anything slightly complex you need a semantic model on top; here I am using PowerBI to host the semantic model.

I would have loved to test Looker's semantic model, but currently you need to call a sales department to schedule a demo, which I am not really interested in doing.

7- Final Reports

Here are the final reports; as the data is public, I am using publish to web.

What I really like about the dependency tree is that it gives visual clues about redundant logic. It gave me the confidence to simplify my workflow, and when I delete or change a table name, it automatically raises an error that a dependency will be out of sync.

I keep reading about how ELT will be the next big thing, and to be honest I never bought the concept, but with Dataform I can see myself writing very complex workflows using SQL.

Change Dimension Dynamically using Parameter in PowerBI

At last, PowerBI added support for parameters that can be changed by the end user. I guess from a business perspective it is mostly useful when you deal with big data loads and want to control exactly the query generated at the data source level, but in this short blog I will show how some use cases that were hard or clunky using DAX become extremely easy using parameters.

The pbix file is here: notice it is connecting to my DB instance, so it will not work, but you can see the data model.

I think it is wise to read the documentation here first.

Chris Webb has a great use case using Azure Data Explorer here.

Update: I added a new use case here, changing the weekend date dynamically.

We want to change a dimension based on a user selection from a slicer. Currently only DirectQuery is supported, and to be honest the documentation does not say which data sources work; we know SQL Server is not one of them (thanks to Alex for his clarification). Luckily BigQuery works (that was a very nice surprise, to be honest).

I am using the Covid19 dataset as an example (as it is free and doesn't incur any charge till Sept 2021); we want to switch dynamically between countries and continents.

1- Load the main Table in import mode

2- Create a parameter “Level_Details”
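
For reference, a parameter created in PowerQuery is just a query like the sketch below; this is roughly the M that Desktop generates for a text parameter, and the default value “Country” is a placeholder that has to match the values used in the following steps.

// "Level_Details" parameter query; its current value is later bound to the slicer column
"Country" meta [IsParameterQuery = true, Type = "Text", IsParameterQueryRequired = true]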

3- Import a dimension Table with the country and continent values in DirectQuery mode:

I created a view in BigQuery, as PowerQuery stopped folding when I tried to remove duplicates. Although it is a free data source, it is important to use DirectQuery only with dimension tables, to reduce cost and data volume.

4- Include the parameter logic in the Dimension Table

I created a new column “Grouping_Details” based on the parameter value; it will take either the country or the continent.
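
Steps 3 and 4 together look roughly like the sketch below in PowerQuery; the project, dataset, view and column names are placeholders for whatever your BigQuery dimension view exposes, and the query stays in DirectQuery mode so the condition can fold to BigQuery.

let
    Source = GoogleBigQuery.Database(),
    project = Source{[Name = "my-project"]}[Data],                  // placeholder project id
    dataset = project{[Name = "covid", Kind = "Schema"]}[Data],     // placeholder dataset
    dim = dataset{[Name = "dim_geo", Kind = "Table"]}[Data],        // BigQuery view with country and continent
    // the new column switches between the two levels of detail based on the parameter value
    Grouped = Table.AddColumn(dim, "Grouping_Details",
        each if Level_Details = "Continent" then [continent] else [country_name], type text)
in
    Grouped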

5- Create a new Table that contains all the possible values for the Parameter

By the way, you can use any table, either imported or generated using DAX; this is a very clever implementation by the PowerBI team compared to other BI tools.
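
For illustration, here is one way such a disconnected table could be built directly in PowerQuery (a hypothetical sketch; the values just have to match what the parameter and the Grouping_Details logic expect):

// disconnected "Selection_Details" table listing the values the user can pick in the slicer
#table(type table [Selection = text], {{"Country"}, {"Continent"}})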

6- Bind the value of the column “Selection” to the Parameter

Here is a view of the Data Model.

It is very important that “Selection_Details” stays a disconnected table, otherwise it will create a new filter selection in the queries, which we don't want; it would still work, but we want to control exactly the query generated by PowerBI.

And the Report

The feature is in preview, and I am sure they will introduce more data sources and functionality. By adding support for BigQuery, Microsoft sent a clear message: PowerBI is the best data analytics tool, and they will support any third-party data warehouse, even if it is a direct competitor.

Personally, I am very excited by the thought that we are very close to finally having Parameter Actions in PowerBI, and that will introduce a new class of visual analytics interactions that was not possible before. Please, we need some votes here.

Btw, if you use BigQuery with PowerBI, I would appreciate some votes here; we need support for custom SQL queries with parameters.

Using PowerBI with Azure Synapse Serverless, First Look

Recently I came across a new use case where I thought Azure Synapse serverless may make sense; if you have never heard of it before, here is a very good introduction.

TL;DR: An interesting new tool! I will definitely have another serious look when they support caching for repeated queries.

Basically, a new file arrives daily in Azure storage and needs to be processed and later consumed in PowerBI.

The setup is rather easy; here is an example of the user interface. This is not a step-by-step tutorial, just my first impressions.

I will use AEMO (Australian Energy Market Operator) data as an example; the raw data is located here.

Load Raw Data

First I load the CSV file as is, defining the columns to be loaded from 1 to 44. Make sure you load only 1 file to experiment; then, when you are ready, you change this line:

'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_201804010000_20180402040501.CSV',
'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_*.CSV',

Then it will load all the files. Notice that when you use filename(), it adds a column with the file name, which is very handy.

USE [test];
GO

DROP VIEW IF EXISTS aemo;
GO

CREATE VIEW aemo AS
SELECT
result.filename() AS [filename],
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_201804010000_20180402040501.CSV',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
    )
    with (
c1   varchar(255),
c2   varchar(255),
c3   varchar(255),
c4   varchar(255),
c5   varchar(255),
c6   varchar(255),
c7   varchar(255),
c8   varchar(255),
c9   varchar(255),
c10   varchar(255),
c11   varchar(255),
c12   varchar(255),
c13   varchar(255),
c14   varchar(255),
c15   varchar(255),
c16   varchar(255),
c17   varchar(255),
c18   varchar(255),
c19   varchar(255),
c20   varchar(255),
c21   varchar(255),
c22   varchar(255),
c23   varchar(255),
c24   varchar(255),
c25   varchar(255),
c26   varchar(255),
c27   varchar(255),
c28   varchar(255),
c29   varchar(255),
c30   varchar(255),
c31   varchar(255),
c32   varchar(255),
c33   varchar(255),
c34   varchar(255),
c35   varchar(255),
c36   varchar(255),
c37   varchar(255),
c38   varchar(255),
c39   varchar(255),
c40   varchar(255),
c41   varchar(255),
c42   varchar(255),
c43   varchar(255),
c44   varchar(255)
     )
 AS result

The previous query creates a view that reads the raw data.

Create a View for Clean Data

As you can imagine, raw data by itself is not very useful; we will create another view that references the raw data view and extracts a nice table (in this case, the power generation every 30 minutes).

USE [test];
GO

DROP VIEW IF EXISTS TUNIT;
GO

CREATE VIEW TUNIT AS
select [_].[filename] as [filename],
   convert(Datetime,[_].[c5],120) as [SETTLEMENTDATE],
    [_].[c7] as [DUID],
   cast( [_].[c8] as DECIMAL(18, 4)) as [INITIALMW]
from [dbo].[aemo] as [_]
where (([_].[c2] = 'TUNIT' and [_].[c2] is not null) and ([_].[c4] = '1' and [_].[c4] is not null)) and ([_].[c1] = 'D' and [_].[c1] is not null)

Connecting PowerBI

Connecting to Azure Synapse is extremely easy; PowerBI just sees it as a normal SQL Server.

Here is the M script:

let
    Source = Sql.Databases("xxxxxxxxxxx-ondemand.sql.azuresynapse.net"),
    test = Source{[Name="test"]}[Data],
    dbo_GL_Clean = test{[Schema="dbo",Item="TUNIT"]}[Data]
in
    dbo_GL_Clean

And the SQL query generated by PowerQuery (which folds):

select [$Table].[filename] as [filename],
[$Table].[SETTLEMENTDATE] as [SETTLEMENTDATE],
[$Table].[DUID] as [DUID],
[$Table].[INITIALMW] as [INITIALMW]
from [dbo].[TUNIT] as [$Table]

Click refresh and, perfect, here are the 31 files loaded.

Everything went rather smoothly: nothing to set up, and I now have an enterprise-grade data warehouse in Azure. How cool is that!

How Much Does it Cost?

The Azure Synapse serverless pricing model is based on how much data is processed.

First let's try with only 1 file, running the query from the Synapse workspace. The file is 85 MB; good so far, the data processed is 90 MB, the file size plus some metadata.

Now let's see the queries generated by PowerBI. In theory my files total 300 MB in size, so I should be paying only for 300 MB; let's have a look at the metrics.

My first reaction was that there must be a bug: 2.4 GB! I refreshed again and it is the same number!

A look at the PowerQuery diagnostics and a clear picture emerges: PowerBI's SQL connectors are famous for being “chatty”. In this case you would expect PowerQuery to send only 1 query, but in reality it sends multiple queries, at least 1 of them checking the top 1000 rows to determine the field types.

Keep in mind that Azure Synapse serverless has no cache (they are working on it), so if you run the same query multiple times, even with the same data, it will “scan” the files multiple times; and as there are no data statistics, a select of 1000 rows will read all the files, even without an order by.

Obviously, I was using import mode; as you can imagine, using DirectQuery would generate substantially more queries.

Just to be sure, I tried a refresh on the service.

The same: it is still 2.4 GB. I think it is fair to say there is no way to control how many times PowerQuery sends a SQL query to Synapse.

Edit 17 October 2020:

I got feedback that my PowerBI Desktop was probably open when I ran the test on the service; it turns out that was true. I tried again with the Desktop closed and it worked as expected: one refresh generates 1 query.

Notice that even if the CSV files were compressed, it would not make a difference; Azure Synapse bills on uncompressed data.

Parquet files would make a difference, as only the columns used would be charged, but I did not want to use another tool in this example.

Take Away

It is an interesting technology: the integration with Azure cloud storage is straightforward, the setup is easy, you can do transformations using only SQL, you pay only for what you use, and Microsoft is investing a lot of resources in it.

But the lack of a cache is a showstopper!

I will definitely check it again when they add caching and cost control; after all, it is still in preview 🙂