PowerBI vs Thoughtspot Semantic Models, and Why Pricing Is an Engineering Problem

A couple of days ago, Thoughtspot released a new tier tailored for small and medium enterprises (see the pricing here). The big surprise: unlimited users is a standard offering even for the entry-level tier, which costs $95/month. Yes, there are some limitations, but you can add any number of users!

One may ask: how can they afford it? Thoughtspot is not a charity, and they have a margin to maintain! And why can't a bigger player like Microsoft do the same? The hypothesis of this blog is that it is mainly an engineering problem, which brings us to the semantic model architecture of both products.

Let's build a very simple semantic model in PowerBI and Thoughtspot: two fact tables with one common dimension. Both tables are hosted in BigQuery.

Now let's build a simple report in both PowerBI and Thoughtspot.

Now let's check the queries sent to BigQuery.

PowerBI sends three queries.

Thoughtspot sends one query that returns the results exactly as shown in the visual.

Why does PowerBI need three SQL queries?

That's the key question in this blog. PowerBI was designed to have two tiers: a first tier to scan data, either from the internal database Vertipaq or from an external database in live mode, and a second tier to do the "result merge" and more complex calculations. It is a very clever design decision, as it works with less sophisticated databases; after all, all an external database needs to provide is a simple scan plus inner and left joins, and all the complexity is handled by the Formula Engine. (Tableau has more or less the same architecture.)
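To make this concrete, for a model with two fact tables sharing one date dimension, the scan queries pushed down to BigQuery would look roughly like this (a sketch with hypothetical table and column names, not the exact SQL PowerBI generates):

-- scan 1: aggregate the first fact at the common grain
SELECT d.Year, SUM(s.Amount) AS Sales
FROM FACT_SALES s
  INNER JOIN DIM_DATE d ON s.DateKey = d.DateKey
GROUP BY 1;

-- scan 2: aggregate the second fact at the same grain
SELECT d.Year, SUM(b.Amount) AS Budget
FROM FACT_BUDGET b
  INNER JOIN DIM_DATE d ON b.DateKey = d.DateKey
GROUP BY 1;

The Formula Engine then merges the two result sets by Year on the PowerBI side, which is why a single visual can translate into multiple queries against the source.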

Thoughtspot made different design decisions: all the calculations are done by the external database. The "front end" job is to generate SQL based on the relationships between tables and the definitions of measures. This was made possible because cloud data warehouses are fast and can do complex calculations. (I know it is more complex than that, with indexing and AI stuff, but that's the general idea.)
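With that approach, the same visual can be answered by one consolidated query: each fact is aggregated at the common grain in a subquery, and the warehouse stitches them together with a full outer join, returning a result set shaped exactly like the visual. Something along these lines (again a hedged sketch with the same hypothetical names, not Thoughtspot's actual generated SQL):

SELECT COALESCE(s.Year, b.Year) AS Year, s.Sales, b.Budget
FROM (
  SELECT d.Year, SUM(f.Amount) AS Sales
  FROM FACT_SALES f
    INNER JOIN DIM_DATE d ON f.DateKey = d.DateKey
  GROUP BY 1
) s
FULL OUTER JOIN (
  SELECT d.Year, SUM(f.Amount) AS Budget
  FROM FACT_BUDGET f
    INNER JOIN DIM_DATE d ON f.DateKey = d.DateKey
  GROUP BY 1
) b ON s.Year = b.Year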

My speculation is that Thoughtspot's semantic model requires fewer compute resources, which makes unlimited users possible even for the small tier.

I would argue that even if Thoughtspot decides to add an internal database later, it will still be cheaper, as the storage/compute can be shared, and it is separated from the logical data model anyway.

To be very clear, I have no idea how vendors decide their pricing, but my speculation is that architectural design has consequences, that the per-user pricing model is archaic, and that hopefully it will become history.

Tableau vs PowerBI Behavior in Direct Query Mode: Result Cache

I was intrigued by why Tableau and PowerBI behave differently when operating in Direct Query mode (Tableau calls it Live mode). I always assumed it was just a driver difference, but it seems it is a little more complicated than that.

It is a long weekend, and Tableau released the next version as a beta (which is free), so it is a perfect opportunity to do some testing. Hopefully this will become a series, but let's start with the fundamental difference: the query results cache.

Again, this is not about Import mode (known as extracts in Tableau), which generally speaking works the same way in both tools. (PowerBI can have a mixed mode, which is not supported in Tableau as far as I know.)

The Data Model

The model is a very simple star schema: one fact table (5 years of electricity data) and two dimensions, a calendar table and power generation plant attributes.

I built this simple report: total MWh by substation.

Tableau generates an inner join, the same behavior as PowerBI:

SELECT `DUID_DIM`.`StationName` AS `StationName__DUID_DIM_`,
  SUM(`UNITARCHIVE`.`Mwh`) AS `sum_Mwh_ok`
FROM `test-187010.ReportingDataset`.`UNITARCHIVE` `UNITARCHIVE`
  INNER JOIN `test-187010.ReportingDataset`.`DUID_DIM` `DUID_DIM` ON (`UNITARCHIVE`.`DUID` = `DUID_DIM`.`DUID`)
GROUP BY 1

Filtering Data

  • Tableau

I noticed that filtering data is basically instantaneous; it is hard to believe it is using Direct Query mode. You can see it here.

Tableau cached the results of the first query, so when you filter on a substation, the data is already there; it does not need to send another SQL query.

  • PowerBI

I built the same report in PowerBI; every selection fires a new SQL query. Yes, it is still fast (under 2 seconds), but it is not instantaneous.

Here is an example in the BigQuery console.
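For reference, the per-selection query is essentially the same query as before with a filter added, something like this (a simplified sketch, not the exact SQL PowerBI generates; the station name is just an example):

SELECT `DUID_DIM`.`StationName`,
  SUM(`UNITARCHIVE`.`Mwh`) AS `Mwh`
FROM `test-187010.ReportingDataset`.`UNITARCHIVE` `UNITARCHIVE`
  INNER JOIN `test-187010.ReportingDataset`.`DUID_DIM` `DUID_DIM` ON (`UNITARCHIVE`.`DUID` = `DUID_DIM`.`DUID`)
WHERE `DUID_DIM`.`StationName` = 'Bayswater'
GROUP BY 1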

Takeaway

That was a simple example, but imagine hundreds of users with a lot of visuals; I suspect that would create a massive workload on the source system. I think Tableau's behavior, shared by a lot of other BI tools (Superset, Looker, etc.), makes a lot of sense, and maybe it would be useful for PowerBI too.

Using DuckDB with PowerBI

DuckDB is one of the most promising OLAP engines on the market: it is open source, very lightweight, has virtually no dependencies, and works in-process (think the good old MS Access). It is extremely fast, especially at reading and querying Parquet files, and has amazing SQL support.

The ODBC driver is getting more stable, so I thought it was an opportunity to test it with PowerBI. Note that JDBC was always supported and can be used with SQL front ends like DBeaver, and obviously Python and R have native integrations.

I downloaded the ODBC driver, using the latest version at the time, 0.3.3; always check for the latest release and make sure you pick the right file.

Installing the binary is straightforward, but unfortunately you need to be an administrator.

Configuring PowerBI

Select ODBC; if the driver was installed correctly, you should see an entry for DuckDB.

As of this writing there is a bug in the driver: if you add a path to a DuckDB database file, the driver will not recognise the tables and views inside it. Instead I selected

database=:memory:

and defined the base table as a CTE, reading directly from a folder of Parquet files.
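The query looks something like this (a minimal sketch: the folder path and column names are placeholders, and parquet_scan is DuckDB's function for reading Parquet files, including glob patterns over a folder):

-- base table defined as a CTE over a folder of Parquet files
WITH unitarchive AS (
  SELECT * FROM parquet_scan('C:/data/parquet/*.parquet')
)
SELECT DUID, SUM(Mwh) AS Total_Mwh
FROM unitarchive
GROUP BY 1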

Just for fun, I duplicated the Parquet files just to reach the 1-billion-row mark.

The total size is 30 GB compressed.

1 Billion Rows on a Laptop

And here are the results in PowerBI.

The size of the PowerBI report is only 48 KB, as I import only the results of the query, not the whole 30 GB of data. Yes, separation of storage and compute makes a lot of sense in this case.

Although the POC in this blog was just for fun, note that the query takes 70 seconds using the ODBC driver in PowerBI (which is still in an alpha stage), while the same query in DBeaver takes 19 seconds using the more mature JDBC driver. It also works only in Import mode; for Direct Query you need a custom connector and the use of the gateway. But I see a lot of potential.

A lot of people are exploring very interesting scenarios, like building extremely fast and cheap ETL pipelines using just Parquet and DuckDB running on cloud functions. I think we will hear more about DuckDB in the coming years.

PowerBI Direct Query Modeling Using a Flat Table

Flat table modeling in PowerBI can generate some very heated arguments. Every time someone suggests that it may be useful for a particular use case, the reaction is nearly universal: flat tables are bad. I think it may be useful in some very specific scenarios.

Let's say you have a nice wide fact table generated by dbt and hosted in a fast cloud DWH, with all dimensions pre-joined. To be very clear, you will not need to join it with another fact; it is a simple analysis of one fact table at a very specific grain.

I will use power generation in the Australian market over the last 5 years as an example.

Import Mode

When using Import mode, PowerBI imports the data into its internal database, Vertipaq. It is a columnar database with some very specific behavior: because the whole table is loaded into memory, fewer columns mean lower memory requirements, which is faster; and because it uses index joins between the fact and dimension tables when you define relationships, counterintuitively, the cost of doing a join is lower than the cost of carrying the whole column in the base table.

In Import mode, it is a no-brainer: a star schema is nearly always the right choice.

Direct Query Mode

In Direct Query mode, the whole way of thinking changes: PowerBI is just sending SQL queries to the source system and getting back results, so you optimize for the source system. Contrary to popular belief, a star schema is not always the most performant option (it is rather the exception); see this excellent blog for more details. Basically, pre-joining will often give you better performance.
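To see why, compare the shape of the SQL each design produces for the same visual (a sketch with hypothetical table names, not PowerBI's exact output):

-- flat table: a single scan, no joins
SELECT StationName, SUM(Mwh) AS Total_Mwh
FROM UNITARCHIVE_FLAT
GROUP BY 1;

-- star schema: the warehouse has to join the fact to the dimension first
SELECT d.StationName, SUM(f.Mwh) AS Total_Mwh
FROM UNITARCHIVE f
  INNER JOIN DUID_DIM d ON f.DUID = d.DUID
GROUP BY 1;

On engines where large joins are the expensive part, the pre-joined version is the cheaper query.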

Let's test it with one fact table (the table is 80 million rows, with a materialized view to aggregate data).

And the glorious model in PowerBI: yes, just one table.

And let's build some visuals.

Now let's check the data source performance.

Slow Slicer

The slicer is rather slow. You will probably say, of course, scanning a whole 80-million-row column is not very smart; actually, that's not the problem.

For example, when I expand the state NSW, PowerBI generates two SQL queries:

the first one to get the station names, which took 481 ms,

and the second query to get the regions, which took 361 ms.
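The two queries are essentially distinct-value lookups against the flat table, along these lines (hypothetical column names; a sketch, not the exact generated SQL):

-- fetch the child values for the expanded state
SELECT DISTINCT StationName
FROM UNITARCHIVE_FLAT
WHERE State = 'NSW';

SELECT DISTINCT Region
FROM UNITARCHIVE_FLAT
WHERE State = 'NSW';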

The PowerBI Formula Engine then takes some time to stitch the results together (1,023 ms, which seems very slow to me).

In this case it is only 5 states, so not a big deal; the query results will eventually be cached after the report users expand all the options.

Is 3 seconds a good enough experience for an end user? I don't think so; slicers have to be instantaneous. A visual can take up to 5 seconds and I personally don't mind, but as a user I have higher expectations for slicer responsiveness. I will still use Dual mode with a star schema.

Takeaway

If your database can give you sub-second response times for slicer selections, you have a very limited and clear analysis to do, and you have to do it in Direct Query mode, then a flat wide table is just fine, as long as you are happy with the generated SQL.
