First Look at Datamart

Power BI product team kindly provided us a private Preview of Datamart a couple of Months ago, the product has many components and may mean different thing to different people( with different technical background ), but at the core, it is a SQL Server DB with columnar compression, and a SQL Endpoint, the Data is loaded using Power BI dataflow, and a brand new Web Experience for Data Modelling and measures authoring, no desktop is required.

Before we start talking about details, let’s get some key feature first.

  • It is premium only and to my surprise, it is available to PPU too ( 20 $/Month/User is an insane value proposition).
  • Security is managed by Power BI using Azure active directory, you don’t need any obscure SQL Security knowledge.
  • Sharing a database is a two click Operation. ( even with external users )
  • Datamart is very fast, sometimes I had to double check I am not in import mode.
  • It is a single node server, Data currently is limited to 100 GB per Database.
  • Currently Only Power BI dataflow can load data ( incremental is supported too), and no support for creating views yet.
  • Multiple users can use Datamart UI at the same time.
  • You don’t need to know anything about compression, partition etc, table distribution, data skew, shuffle etc, Datamart handles it automatically, it does create Clustered Columnstore index, statistics, partitions etc.
  • The product team promises that full DDL/DML is coming soon.
  • I find this blog very useful on explaining how SQL Server Clustered Columnstore Index Works

The First success we had was to make some data from Power BI dataflow available using the SQL Endpoint, it was just trivial, Just one Click on the server setting, and voila, a connection string that any tool with SQL Server support can use , Obviously Power BI, SSMS, Excel, Tabular reports,  but Tableau, R, Python you name it.

I am a “Traditional” Power BI developer, what does it mean to me ?

Nothing has changed, it is not Power BI version 2,it is not replacing the Vertipaq, you can just keep using Power BI desktop as usual, in my opinion it is just Power BI product team trying to reimagine a better data experience  and solving some pain points.

  • Make Power BI Dataflow a more “Enterprise” solution, there is a SQL Endpoint point, Data engineer will take PowerQuery more seriously now 🙂
  • An online Modelling Experience, no need to install a VM for Mac users.
  • A new experience for users with SQL skills, yes you can just write SQL and get results without out Modelling, you don’t need to Know anything about DAX.
  • Addressing a market that was poorly served before, an easy to use SQL Experience for Business users that don’t necessarily know SQL, and don’t want to model anything.
  • Bonus point, no Azure access requirement, it is just Power BI experience as we know it and love it.

   So how it Works ?

You load the data using Power BI dataflow, it is instantly available in a SQL database, if you know SQL, you can just write Queries, and download an Excel spreadsheet that contains a connection to Datamart SQL Server with a SQL Query , the data will be always up to date, it is not a simply a copy of the results.

One nice thing about using  SQL Server as an Execution Engine, the documentation available on the internet is amazing, let’s say I want to know how much memory the server is using, I just copy it and it works.

Let’s check which SQL server edition, Datamart uses ( 150 means, SQL Server 2019)

here you can see the table structure created by Datamart, 1 partition and multiple segments ( the table has 60 Million rows, each segments is around 700K rows), I think with incremental refresh it create multiple partitions ( I have not test it yet)

Visual SQL

If you don’t want to write SQL, fine just use a Visual Experience to get the results, here I am getting the top customers by number of orders without writing any SQL code, bonus point the SQL is written behind the scene, it is like a Visual dbt 🙂

And the SQL generated

Ok, you don’t want ad hoc stuff fine, you can Model your data by defining relationship and row level security if necessary

To be clear, Tables, relationship and row level security are a pure “SQL artefact” and can be used by any front-end tool as long as it can connect to SQL Server and has proper access using Azure Active directory, and no Measures can not be defined as a SQL 🙂

Obviously, you can connect using SSMS. (notice the relationship columns)

It is Still Power BI

I presume, if you paid for Power BI service, probably you want still to use Power BI as a Visualisation tool 🙂

you can define Measures in DAX ( calculated column and tables are not supported, everything must be loaded using PowerQuery)

And here is the Trick Datamart  Auto generate a read only Dataset, it is only a logical semantic Model, no data is duplicated, it uses a Direct Query Connection to the SQL Server Datamart.

Auto Generated dataset

I will be honest I did not like it initially, I wanted more control, I want to configure which table to include into the dataset and which one to exclude, Direct Query performance is amazing, but still I want to define some tables as dual Mode, or simply import, and I am not going to be apologetic about it, I like calculated column and calculated table, I used them a lot, and I don’t care if they need some extra size, some pattern are trivial using DAX and very hard ( at least for me) using SQL and PowerQuery.

One day, I changed my Mind , Charles Webb (the PM for Datamart) was very clear in his explanation about the different personas, it is free, and you don’t have to use it in all cases, people with SQL Background and less experience with Power BI will find it amazing, they don’t have to deal with Power BI desktop or extra fancy modelling, people with Power BI experience may prefer to have more control and will extend the Dataset, other advance Developers may want to use Tabular Editor and it is fine.

Flexible Open Architecture

You can have a Datamart that manage everything, or you can keep Dataflow, Datamart, Dataset separated, it is really a nice architecture and provide a lot of flexibility, I think it depends, if you have a Datamart that contains a lot of tables that update based on different schedule it make sense to have dataflow do the ETL first then load the results later using Datamart. ( I think later Dataflow will provide even a more elegant solution)

Anyway, it is flexible and can be tightly coupled if you want to, or you can pick and choose which component to use, and I am sure users will use it in some unexpected way ( I may have loaded some DAX calculated table and expose it to the SQL Endpoint)

One aspect I really like for philosophical reason, even if all the downstream users will end up using Only Power BI reports or Excel, having a SQL Endpoints is extremely important, it is an Enterprise grade solution that can talk to other tools when needed and not a self-service silo.

PBIX Download is annoying 

The auto generated dataset cannot be downloaded for a good technical reason, but if you create a report based on it, the report will be download with a pbix that contains a DQ mode to SQL Server, IMHO a better approach is, if you build a thin report from the auto generated dataset, and you download the pbix, it should use only a live connection ( which is the same behaviour if you build a report using the Desktop), but if you download the pbix from Datamart then it should show a Direct Query Mode to SQL Server, it is not a problem per se, but rather unexpected behaviour.

Final thoughts

I think it is an amazing solution that will keep evolving, Microsoft Data team manage to create an exceptional user experience to build a full data stack that anyone can use regardless of his technical background,  I remember when I first used PowerPivot in 2015, I knew that it will change the industry (and it did), and I am sure Datamart will do the same.

Initially I wrote a long commentary on how Datamart fit in the “Modern Data Stack”, and how it compares to Other vendors offering, then something occurred to me, Microsoft is just building a useful product that solve hard problems for a very large segment of the market and anything else does not matter, call it “Post Modern Data Stack” if you want.

PowerBI vs Thoughtspot Semantic Model and Why Pricing is an Engineering Problem.

A couple of days ago, Thoughtspot released a new tier tailored for Small and medium enterprise, see the pricing here, The big surprise, unlimited users is a standard offering even for the entry level tier which cost 95 $/Month, yes there are some limitation, but you can add any number of users !!!!

One may ask, how they can afford it? Thoughtspot is not a charity and they have a margin to maintain !!! and why a bigger player Like Microsoft can’t do that, the hypothesis of this blog is, it is mainly an engineering problem , which bring us to the semantic Model Architecture for Both product.

Let’s build a very simple Semantic Model in PowerBI and Thoughtspot, two fact tables with one common dimension, Both Data is hosted in BigQuery

now let’s have a simple report, in both PowerBI and Thoughtspot

Now let’s check the Queries sent to BigQuery.

PowerBI send three Queries

Thoughtspot Send one Query that return the results exactly as the visual

Why PowerBI needs three SQL Queries ?

That’s the key Question in this Blog, PowerBI was designed to have two tiers, a first tier to scan data either from the internal Database Vertipaq ( or external DB in live more) and the second tier to do ” result merge” and more complex calculation, it is a very clever design decision as it works with less sophisticated DB, after all , all External DB needs to provide is a simple scan and inner & left join, all the complexity is handled by the Formula Engine. ( Tableau has more or less the same architecture)

Thoughtspot made different design decisions, all the calculations are done by the external Database, The “Front End” job is to generate SQL based on the relationship between tables and the definition of measures, this was made possible because Cloud Data warehouse are fast and can do complex calculation. ( I know it is more complex than that like indexing and AI stuff but that the general idea )

My speculation is Thoughtspot Semantic Model require less compute resources, which made unlimited users possible even for the small tier.

I would argue even if thoughtspot decide to add an internal Database later, it will be still cheaper as the Storage/ compute can be shared and it is separated anyway from the Logical Data Model.

To be very Clear, I have no Idea how vendors decide their pricing, but my speculation is architectural design have consequences, and paying per user pricing model is archaic, and hopefully will be history.

The simplest way to merge Multiple Tables in Google Data Studio

Assume you have multiple Data Source and you can’t build a conformed dimension ( IT people like to used it to sound smart, it simply means common field 🙂 and you want to merge the results using a common field, notice every table contains different values of the same field, and before you get excited, filtering does not work except for Dates

we have three Tables, Budget, Actual and Baseline, the common field is Category and commodity

And you want this Final Results

Notice that the category Civil exist only in forecast, and instrumentation, mechanical, electrical exist only in the Table Budget, how to show the results in the same Visual

Full Outer Join for the win

the trick is to join the tables using full joins so not values get dropped

which give you this results

Obviously, you want only 1 column Category not three, too easy just add a column in the table calculation using the function coalesce

unfortunately as of this writing you can’t add a dimension in the results of a blending, it has to be in the Visual, which means, you can’t use that common dimension in filters, which make this approach not very useful to be honest.

I added a link to the report, which contains other approach for the same problem, if you are still reading, it is really unfortunate, Blending is very powerful but it needs constant improvements from the product team

Tableau vs PowerBI behavior in Direct Query Mode , Result Cache

I was intrigued why Tableau and PowerBI have a different behavior when operating in a Direct Query Mode ( Tableau call it Live Mode), I always assumed it is just a driver difference, but it seems it is a little bit more complicated than That.

It is a long weekend, and Tableau released the next version as a beta (which is free), so it is a perfect opportunity to do some testing, Hopefully it will be a series, but let’s start with the fundamental difference, Query Results Cache

Again, this is not about import mode, also known as extract in Tableau, which generally speaking works the same way (PowerBI can have mixed mode, which is not supported in Tableau as far as I know)

The Data Model

The Model is very simple Star Schema, 1 Fact ( 5 years of electricity Data) and two dimensions, Calendar Table and Power Generation Plan attribute

I built this simple report, Total Mwh by substation

Tableau generate an Inner Join, same behavior as PowerBI

SELECT `DUID_DIM`.`StationName` AS `StationName__DUID_DIM_`,
  SUM(`UNITARCHIVE`.`Mwh`) AS `sum_Mwh_ok`
FROM `test-187010.ReportingDataset`.`UNITARCHIVE` `UNITARCHIVE`
  INNER JOIN `test-187010.ReportingDataset`.`DUID_DIM` `DUID_DIM` ON (`UNITARCHIVE`.`DUID` = `DUID_DIM`.`DUID`)

Filtering Data

  • Tableau

I noticed filtering data is basically instantaneous, it is hard to believe it is using Direct Query Mode, you can see it here.

Tableau cached the results of the first Query, when you filter a substation, the data is already there, it does not need to send another SQL Query

  • PowerBI

Built the same report in PowerBI, every selection will fire a new SQL Query, yes it is still fast ( under a 2 second), but it is not instantaneous.

Here is an example in BigQuery Console

Take away

That was a simple example, but imagine 100 of users with a lot of Visuals, I suspect it will create a massive Workload on the source system, I think Tableau behavior, as a lot of other BI tools (Superset, Looker etc) make a lot of sense, and maybe it will be useful too for PowerBI.

%d bloggers like this: