Last year, I wrote a blog on how to build a nested complex Data Model using BigQuery , please read it first, unfortunately using standard SQL to nest and unnest Data was a painful exercise, maybe it was due to my lack of experience, or simply my brain is wired to think only about Tabular Data. happy to say, using the new Data Modeling language Malloy the experience became way simpler.
for a start Malloy will show a nice expended view of the nest table
The Model is self explanatory.
Now you need just to write Queries, Malloy will generate a SQL behind the scene
For example Total values of all measures :
No Idea what the SQL Means to be honest
Now Grouped by Category
Now More tricky, let’s group by year, notice, the Budget has no date values, the results is not correct !!!!
instead, you first group by year then nest the second level of grouping something like this
Malloy has a native export to nested Data Structure, but for this exercise I want a tabular results, you can easily flatten the previous Query using Project, notice how you can simply reference a previous Query, the measure Progress is simply Actual/Budget
I think it is simple enough that a “Normal” BI person can work it, you can find the Model here, The Model currently works only with BigQuery, found some bugs with DuckDB connector, hopefully it will be fixed in the next update of Malloy.
if you works with nested data, Malloy make it nearly easy 🙂
Malloy just added a semantic “All”, to ignore a group by dimension, see the same Query using the new syntax , this is Amazing !!!!
This blog is a POC of something that I always wanted to have in a BI tool, and I tried Tableau, PowerBI and Data Studio, without success ( not interested in adding an invisible grid as a hack), The idea is extremely simple yet very powerful, retrieve data when you click on a map, you may think it should simple, it seems BI tool are good at retrieving data based on filter, but it is very hard to push a parameter from a map back to a source data.
Streamlit is a code first, web app platform using only Python, web page are generated behind the scene, and there are a lot of component where you need to write a minimum of code, and deployment is absolutely trivial using Streamlit Cloud, and because it is open source, you can deploy using alternative approach like Cloud Run, or Azure
I came across this component Streamlit-Folium recently, and it is magnificent work, when you click on a map, it does provide variable back on the last location clicked zoom, bounds etc, all for free, no code required !!!!
All I have done is copied the code from the source and built a SQL Query that take the last clicked item filter all the “cafe” in a radius of 500 m, the SQL Code is copied from this previous Blog
The Source Data is nearly half a Million, as you can imagine plotting a massive dataset just to see a small portion is a waste of computer resources.
here is the final results
Here an example of a SQL Query generated.
I added the code here, again it was too easy to write as I nearly copied everything from the component sample code, the tricky part was how to update the value of a variable which was already declared, Streamlit has a brilliant solution using State Management, the solution is very simple
Assign a default value when the Streamlit run for the first time
if 'key' not in st.session_state:
st.session_state.key = '( 153.024198,-27.467276)'
st.session_state.key1 = [-27.467276, 153.024198]
st.session_state.key2 = 16
point_clicked = st.session_state.key
location_ini = st.session_state.key1
zoom_Start = st.session_state.key2
Update the values when a user click on the map, the next run in the same session will use those new values
Currently I don’t know how to stop Streamlit from redrawing the map, as I am only interested in updating the markers.
it works with Any Database as long as it has a minimum support for GIS functions, Currently I used bigQuery BI Engine as I am familiar with it and to speak freely :), it is very cheap for this kind of workload, small Data and potentially a lot of concurrency 🙂
I tried PowerBI Datamart but it seems Python access is blocked , DuckDB don’t support GIS functions yet, I am sure you can reproduce the results using only SQL, but I did not bother.
I think there is a third way between no code and only Code, Streamlit managed to create a new category, maybe simple code 🙂 having said that BI Vendor should up their games, Location Parameter should not be that hard to implement.
I was intrigued why Tableau and PowerBI have a different behavior when operating in a Direct Query Mode ( Tableau call it Live Mode), I always assumed it is just a driver difference, but it seems it is a little bit more complicated than That.
It is a long weekend, and Tableau released the next version as a beta (which is free), so it is a perfect opportunity to do some testing, Hopefully it will be a series, but let’s start with the fundamental difference, Query Results Cache
Again, this is not about import mode, also known as extract in Tableau, which generally speaking works the same way (PowerBI can have mixed mode, which is not supported in Tableau as far as I know)
The Data Model
The Model is very simple Star Schema, 1 Fact ( 5 years of electricity Data) and two dimensions, Calendar Table and Power Generation Plan attribute
I built this simple report, Total Mwh by substation
Tableau generate an Inner Join, same behavior as PowerBI
SELECT `DUID_DIM`.`StationName` AS `StationName__DUID_DIM_`,
SUM(`UNITARCHIVE`.`Mwh`) AS `sum_Mwh_ok`
FROM `test-187010.ReportingDataset`.`UNITARCHIVE` `UNITARCHIVE`
INNER JOIN `test-187010.ReportingDataset`.`DUID_DIM` `DUID_DIM` ON (`UNITARCHIVE`.`DUID` = `DUID_DIM`.`DUID`)
GROUP BY 1
I noticed filtering data is basically instantaneous, it is hard to believe it is using Direct Query Mode, you can see it here.
Tableau cached the results of the first Query, when you filter a substation, the data is already there, it does not need to send another SQL Query
Built the same report in PowerBI, every selection will fire a new SQL Query, yes it is still fast ( under a 2 second), but it is not instantaneous.
Here is an example in BigQuery Console
That was a simple example, but imagine 100 of users with a lot of Visuals, I suspect it will create a massive Workload on the source system, I think Tableau behavior, as a lot of other BI tools (Superset, Looker etc) make a lot of sense, and maybe it will be useful too for PowerBI.
Flat table modeling in PowerBI can generated some very heated arguments, every time someone suggest that that it may be useful for a particular use case, the reaction is nearly universal, flat table are bad, I think it may be useful in some very specific scenarios.
let’s say you have a nice wide fact table generated by dbt and hosted in a fast Cloud DWH, all dimensions are pre joined, , to be very clear you will not need to join it with another fact, it is a simple analysis of 1 fact table at a very specific grain
I will use Power generation in the Australian market for the last 5 years as an example.
When using Import Mode, PowerBI import the data to the internal Database Vertipaq, it is just a columnar database, with some very specific behavior, because the whole table is loaded into memory, less columns, means less memory requirement, which is faster, and because it does uses index joins between Fact and dimensions when you define relationships, counterintuitively, the cost of doing join is less expensive than loading a whole column in the base table.
In Import Mode, it is a no-brainer, Star Schema is nearly always the right choice.
Direct Query Mode
In Direct Query Mode, the whole way of thinking change, PowerBI is just sending SQL Queries to the source system and get back results, you try to optimize to the source system, and contrary to popular beliefs Star Schema is not always the most performant ( it is rather the exception), see this excellent blog for more details , basically pre join will often give you better performance.
let’s test it with with one fact table ( The Table is 80 millions with a materialized view to aggregate data)
And the glorious Model in PowerBI, yes, just 1 Table
and let’s build some visuals
Now let’s check the Data Source performance
The Slicer is rather slow, probably you will say, of course scanning a whole 80 million columns is not very smart, actually that’s not the Problem.
for example when I extend the State NSW, PowerBI will generate two SQL Queries
the first one to get the station Name and took 481 ms
And the second Query to get the regions, 361 ms
PowerBI Formula Language will take some time to stitch the results together ( 1023 ms, that’s seems very slow to me ?)
in this case it is only 5 states, not a big deal, the Query results will be cached eventually after the report users expand all the options.
Is 3 second ? a good enough experience for an end user, I don’t think so, slicers have to be instantaneous, Visual can take up to 5 second, I don’t mind personally , but as a user I have a higher expectation for the slicers responsiveness, I will still use Dual Mode with a star schema
If your Database can give you a sub second response time for the slicer selection and you have a very limited and clear analysis to do and you have to do it in Direct Query Mode, then flat wide table is just fine as long as you are happy with the SQL Generated.