Using PowerBI with Azure Synapse Serverless, First Look

Recently I came across a new use case where I thought Azure Synapse serverless might make sense. If you have never heard of it before, here is a very good introduction

TL;DR: an interesting new tool! I will definitely have another serious look when they support caching for repeated queries.

Basically, a new file arrives daily in an Azure storage account and needs to be processed and later consumed in PowerBI.

The setup is rather easy; here is an example of the user interface. This is not a step-by-step tutorial, just my first impression.

I will use AEMO (Australian Energy Market Operator) data as an example; the raw data is located here

Load Raw Data

First I load the CSV file as it is, defining the columns to be loaded from 1 to 44. Make sure you load only one file while you experiment; when you are ready, you change this line

'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_201804010000_20180402040501.CSV',
'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_*.CSV',

Then it will load all the files. Notice that when you use filename(), it adds a column with the file name, which is very handy.

USE [test];
GO

DROP VIEW IF EXISTS aemo;
GO

CREATE VIEW aemo AS
SELECT
result.filename() AS [filename],
     *
FROM
    OPENROWSET(
        BULK 'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_201804010000_20180402040501.CSV',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
    )
    with (
c1   varchar(255),
c2   varchar(255),
c3   varchar(255),
c4   varchar(255),
c5   varchar(255),
c6   varchar(255),
c7   varchar(255),
c8   varchar(255),
c9   varchar(255),
c10   varchar(255),
c11   varchar(255),
c12   varchar(255),
c13   varchar(255),
c14   varchar(255),
c15   varchar(255),
c16   varchar(255),
c17   varchar(255),
c18   varchar(255),
c19   varchar(255),
c20   varchar(255),
c21   varchar(255),
c22   varchar(255),
c23   varchar(255),
c24   varchar(255),
c25   varchar(255),
c26   varchar(255),
c27   varchar(255),
c28   varchar(255),
c29   varchar(255),
c30   varchar(255),
c31   varchar(255),
c32   varchar(255),
c33   varchar(255),
c34   varchar(255),
c35   varchar(255),
c36   varchar(255),
c37   varchar(255),
c38   varchar(255),
c39   varchar(255),
c40   varchar(255),
c41   varchar(255),
c42   varchar(255),
c43   varchar(255),
c44   varchar(255)
     )
 AS result

The previous query creates a view that reads the raw data.
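As a quick sanity check, something like this should return a few rows (just a sketch, picking an arbitrary handful of the raw columns):

SELECT TOP 10
    [filename],
    c1, c2, c5, c7, c8
FROM dbo.aemo;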

Create a View for the Clean Data

As you can imagine, raw data by itself is not very useful, so we will create another view that references the raw data view and extracts a nice table (in this case, the power generation every 30 minutes).

USE [test];
GO

DROP VIEW IF EXISTS TUNIT;
GO

CREATE VIEW TUNIT AS
select [_].[filename] as [filename],
   convert(Datetime,[_].[c5],120) as [SETTLEMENTDATE],
    [_].[c7] as [DUID],
   cast( [_].[c8] as DECIMAL(18, 4)) as [INITIALMW]
from [dbo].[aemo] as [_]
where (([_].[c2] = 'TUNIT' and [_].[c2] is not null) and ([_].[c4] = '1' and [_].[c4] is not null)) and ([_].[c1] = 'D' and [_].[c1] is not null)
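
Just as an illustration, the clean view can now be queried like any other table; for example, a sketch of a query that totals the generation per unit per day:

SELECT
    DUID,
    CAST(SETTLEMENTDATE AS date) AS [Day],
    SUM(INITIALMW) AS Total_MW
FROM dbo.TUNIT
GROUP BY DUID, CAST(SETTLEMENTDATE AS date);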

Connecting PowerBI

Connecting to Azure Synapse is extremely easy; PowerBI just sees it as a normal SQL Server.

Here is the M script:

let
    Source = Sql.Databases("xxxxxxxxxxx-ondemand.sql.azuresynapse.net"),
    test = Source{[Name="test"]}[Data],
    dbo_GL_Clean = test{[Schema="dbo",Item="TUNIT"]}[Data]
in
    dbo_GL_Clean

And here is the SQL query generated by PowerQuery (which folds):

select [$Table].[filename] as [filename],
[$Table].[SETTLEMENTDATE] as [SETTLEMENTDATE],
[$Table].[DUID] as [DUID],
[$Table].[INITIALMW] as [INITIALMW]
from [dbo].[TUNIT] as [$Table]

Click refresh and, perfect, here are the 31 files loaded.

Everything went rather smoothly: nothing to set up, and I now have an enterprise-grade data warehouse in Azure. How cool is that!

How Much Does It Cost?

The Azure Synapse serverless pricing model is based on how much data is processed.

First let's try with only one file, running the query from the Synapse Workspace. The file is 85 MB; so far so good, the data processed is 90 MB, the file size plus some metadata.

Now let's see the queries generated by PowerBI. In theory, my files total 300 MB, so I should be paying for only 300 MB. Let's have a look at the metrics.

My first reaction was that there must be a bug: 2.4 GB! I refreshed again and it was the same number.

A look at the PowerQuery diagnostics and a clear picture emerges. The PowerBI SQL connector is famous for being "chatty": you would expect PowerQuery to send only one query, but in reality it sends multiple queries, at least one of them checking the top 1000 rows to determine the field types.
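The extra queries look roughly like this (a reconstruction for illustration; the exact text in the diagnostics will differ):

select top 1000
    [$Table].[filename] as [filename],
    [$Table].[SETTLEMENTDATE] as [SETTLEMENTDATE],
    [$Table].[DUID] as [DUID],
    [$Table].[INITIALMW] as [INITIALMW]
from [dbo].[TUNIT] as [$Table]

On serverless, each of these extra queries scans the underlying files again, which is where the extra gigabytes come from.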

Keep in mind that Azure Synapse serverless has no cache (they are working on it), so if you run the same query multiple times, even on the same data, it will "scan" the files multiple times; and as there are no data statistics, a select of 1000 rows will read all the files, even without an order by.

Obviously, I was using import mode; as you can imagine, using DirectQuery would generate substantially more queries.

Just to be sure, I tried a refresh in the service.

The same: it is still 2.4 GB. I think it is fair to say there is no way to control how many times PowerQuery sends a SQL query to Synapse.

Edit 17 October 2020 :

I got feedback that my PowerBI Desktop was probably open when I ran the test in the service. It turns out that was true: I tried again with the desktop closed and it worked as expected, one refresh generating one query.

Notice that even if the CSV files were compressed, it would not make a difference: Azure Synapse bills on the uncompressed data size.

Parquet files would make a difference, as only the columns used would be charged, but I did not want to use another tool in this example.
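For reference, reading Parquet with serverless looks much the same; this is only a sketch with a hypothetical Parquet path, since the files in this post are CSV:

SELECT result.filename() AS [filename], *
FROM OPENROWSET(
    BULK 'https://xxxxxxxx.dfs.core.windows.net/tempdata/PUBLIC_DAILY_*.parquet',
    FORMAT = 'PARQUET'
) AS result;

With Parquet, only the columns referenced in the query are read, so a query selecting a handful of columns is billed for a fraction of the file size.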

Take Away

It is an interesting technology: the integration with Azure cloud storage is straightforward, the setup is easy, you can do transformations using only SQL, you pay only for what you use, and Microsoft is investing a lot of resources in it.

But the lack of a cache is a showstopper!

I will definitely check it again when they add caching and cost control; after all, it is still in preview 🙂

How to build Compact layout Pivot Table in Google Data Studio

TL;DR: the report is here, and I would appreciate a vote on this bug report.

First, don't get too excited: it is a silly workaround and introduces its own problems, but if you are like me and need to deliver a nice-looking pivot table in Google Data Studio, it may be worth the hassle.

The Problem.

Show the spent and the budget by campaign and country. The spent is at the country level, the budget at the campaign level. Here is a sample data set.

The Solution, First try

You are probably saying: this is too easy, why write a blog about it, GDS supports pivot tables! Let's see the result.

We have three problems already (one bug, one limitation and one by design):

Bug: you cannot return a null for the metric spent.

By design: GDS does not understand hierarchies; a null country (the campaign-level budget rows) is shown as just another row.

Limitation: the famous Excel compact layout is not supported.

Here is the deal: contrary to what you may read on the internet, the pivot table is the most used viz in reporting (OK, maybe second after the table), and users will want their pivot table to look exactly like their beloved Excel. In my own experience, if you show a user a map, for example, and they ask for a feature which is not possible, you can say "I can't do it" and people will tolerate that. But for their Excel-looking pivot table there is zero tolerance; if you can't reproduce it, they will think either:

  1. Your BI is not good
  2. You don’t know the tool

The Solution, SQL!

Write a SQL query that returns a column showing the campaign and the country in the same field, using a union.

Assuming your data is in a Google Sheet:

  1. Link the Google Sheet to an external table in BigQuery, as sketched below.
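
A rough sketch of the DDL (the sheet URL is a placeholder, and you can also create the external table through the BigQuery UI instead):

CREATE OR REPLACE EXTERNAL TABLE `test-187010.work.factraw`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],
  skip_leading_rows = 1
);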

  2. Write the query

Connect to that table using a custom query:

SELECT project, SUM(budget) AS budget, SUM(spent) AS spent
FROM `test-187010.work.factraw`
GROUP BY 1

UNION ALL

SELECT CONCAT("\U0001f680", country), budget, spent
FROM `test-187010.work.factraw`
WHERE country IS NOT NULL

  3. BI Engine does not support external tables

Every time you open the report, Data Studio will issue a new query, which costs a minimum of 10 MB even if the data is 1 KB (it is a big data thing, after all). To avoid that, we extract the data.

  4. Profit 🙂

We use conditional formatting to highlight the campaign rows.

Needless to say, you should not use this unless you have to; cross-filtering will be a mess. Hopefully GDS will improve pivot table formatting in the near future.

How to Model Primavera Activity ID and Quantity Measurement System using Multiple to Multiple in PowerBI

Whenever I need to join the Primavera activity ID to the quantity measurement system, I use this pattern; it has served me well over the years. Recently I started a new project where, for the first time, I don't get an Excel extract but a proper live connection to SQL Server 🙂

To get something running quickly, I started with the same approach: load the Primavera export, unpivot the dates and normalize them so every activity has a spread from 0 to 100%, then merge it with a table from SQL Server. All worked as expected.

Although it works well, it is a bit clunky, especially since the export from Primavera does not change frequently: the baseline maybe once a year and the forecast once a month. So instead of merging the data using PowerQuery, I loaded the Primavera data as a separate table. Here is what the model looks like.

As you have guessed, the activity ID is duplicated in both tables.

Now, the metric I am looking for is how to spread the budget hours from the BOQ table using the spread (0-100%) from Primavera. Let's say I filter one row from the BOQ; the result should be something like this.

As it is a multiple-to-multiple relationship, if you simply multiply hours by spread you get duplicated values:

Planned_Hours_no_filter = sumx(Primavera, [remaining_hrs]*Primavera[Spread]) = 950K hours

Obviously, this is wrong: the total remaining hours are only 49K, so the maximum spread should be 49K (or less, if some activity IDs are not mapped).

The solution is to create an explicit filter and get the hours only for the specific activity ID:

Planned_Hours = sumx(Primavera, CALCULATE([remaining_hrs],filter(BOQ,BOQ[Activity ID]=Primavera[Activity ID]))*Primavera[Spread])

And here is the result

I checked against the old model and all the results match. To be honest, I am not a huge fan of multiple-to-multiple relationships, but in this case it is worth it: less refresh time, and I got rid of two big tables.

You can download the pbix here.

Custom SQL in Google Data Studio

Update August 2020: SQL parameters are better supported now, please go to this updated blog.

In the last 12 months, Google Data Studio has added many interesting new features, especially the integration with BigQuery BI Engine and custom SQL queries.

Obviously, I am a huge PowerBI fan, and I think it is the best thing that has happened to analytics since Excel, but if you want to share a secure report without requiring a license for every user, Data Studio is becoming a valid option.

I have already blogged about building a near real-time dashboard using BigQuery and Data Studio, but in this quick blog I will try to show that, using SQL, one can create reports with more complex business logic.

I am using a typical dataset from our industry: a lot of fact tables with different granularities. The fact tables don't all update at the same time; planned values change only when there is a program revision, actuals change every day.

Instead of writing the steps here, please view the report, which includes the how-to and the results.

The approach is pretty simple. All modern BI software works more or less the same way (at least PowerBI and Qlik; Tableau is coming soon): you load data into different tables, you model the data by creating relationships between the tables, and then you create measures. When you click on a filter, for example, or add a dimension to a chart, the software generates a SQL query against the data source based on the relationships defined in the data model. It is really amazing; even without knowing any SQL you can do very complicated analysis.
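As a purely hypothetical illustration (the table and column names are made up), clicking a date filter on a chart of actual cost by area might make the tool issue something like this behind the scenes:

SELECT area, SUM(actual_cost) AS actual_cost
FROM `my-project.my_dataset.actuals`
WHERE report_date = DATE '2019-06-30'
GROUP BY area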

Data Studio is no different from other tools; the data modeling is called blending. It links all the tables together using a left join, which is a big limitation: if some values exist in one table and not in the others, you will miss data.
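This is exactly the kind of gap custom SQL can close. As a sketch (again with hypothetical table and column names), aggregating each fact table first and then using a full outer join keeps the rows that exist on only one side:

WITH planned AS (
  SELECT wbs, SUM(planned_hours) AS planned_hours
  FROM `my-project.my_dataset.planned`
  GROUP BY wbs
),
actual AS (
  SELECT wbs, SUM(actual_hours) AS actual_hours
  FROM `my-project.my_dataset.actual`
  GROUP BY wbs
)
SELECT
  COALESCE(planned.wbs, actual.wbs) AS wbs,
  planned.planned_hours,
  actual.actual_hours
FROM planned
FULL OUTER JOIN actual ON planned.wbs = actual.wbs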

The idea is to bypass the modeling layer and write some SQL code, and to make it dynamic with parameters. It is not an ideal solution for the average business user (we don't particularly like code), but it is a workaround until Data Studio improves its offering.