BigQuery GEOGRAPHY Support in Data Studio

Google Data Studio recently added support for the BigQuery GEOGRAPHY field type, which is a fantastic development and opens up all kinds of new scenarios for creating free-to-share maps.

It is straightforward: you add a geography field and it just renders. For example, I have this dataset that contains polygons, lines and points:

and here is the result

Some Observations

  • It seems the dev team’s initial focus was on polygons, which are fully supported
  • You can’t set a color for lines yet
  • Although points support color coding, I could not make them smaller (the piles in a solar farm are obviously much smaller)

BI Engine to the rescue

Now the confusing part: BI Engine for Data Studio does not accelerate GEOGRAPHY yet, so you will incur BigQuery costs. But if you connect through the new BI Engine SQL interface, the query will be accelerated. According to the devs, the BI Engine used in Data Studio should be considered version 1 and the SQL interface the next version, and the two will eventually be merged. (This should be in the documentation.)

Here is an example of a query generated by Data Studio. I would say it is a very complex SQL query, with analytical functions, UNNEST, you name it, and the new BI Engine supports it just fine (I really like seeing Bytes billed: 0 B).

A more interesting use case

I came across this excellent dataset, https://openflights.org/data.html, and thought, let’s try it with Data Studio. First I imported the two tables, airports.dat and routes.dat, then used this SQL query to generate the routes, each of which is a line between the source coordinates and the destination coordinates:

WITH
  -- Join every route to its source and destination airport and draw a line between them
  routes_geo AS (
  SELECT
    src.Name AS source_name,
    src.City AS source_city,
    src.Country AS source_country,
    dst.Name AS destination_name,
    dst.City AS destination_city,
    dst.Country AS destination_country,
    r.DestinationairportID,
    r.SourceairportID,
    ST_MAKELINE(ST_GEOGPOINT(src.Longitude,
        src.Latitude),
      ST_GEOGPOINT(dst.Longitude,
        dst.Latitude)) AS route
  FROM
    `testing-bi-engine.test.airportroute` r
  LEFT JOIN
    `testing-bi-engine.test.airport` src
  ON
    r.SourceairportID = src.AirportID
  LEFT JOIN
    `testing-bi-engine.test.airport` dst
  ON
    r.DestinationairportID = dst.AirportID),
  -- De-duplicate the routes; the line is grouped as WKT text because
  -- GEOGRAPHY columns cannot be used in GROUP BY
  routes_dedup AS (
  SELECT
    source_name,
    source_city,
    source_country,
    destination_name,
    destination_city,
    destination_country,
    DestinationairportID,
    SourceairportID,
    ST_ASTEXT(route) AS route_wkt
  FROM
    routes_geo
  GROUP BY
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9)
-- Convert the WKT back to GEOGRAPHY and add a row id
SELECT
  *,
  ST_GEOGFROMTEXT(route_wkt) AS route,
  ROW_NUMBER() OVER() AS id
FROM
  routes_dedup
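
To make the geometry step a bit more concrete, here is a minimal Python sketch of what the query does conceptually: look up both airports for each route, build a line as WKT text (longitude first, as with ST_GEOGPOINT), and drop duplicate routes. The `Airport` type, the function names and the sample coordinates are made up for illustration; this is not how BigQuery implements ST_MAKELINE or ST_ASTEXT, and the exact text BigQuery emits may differ.

```python
from typing import NamedTuple


class Airport(NamedTuple):
    # Hypothetical, simplified stand-in for a row of the airports table
    airport_id: int
    name: str
    longitude: float
    latitude: float


def route_wkt(source: Airport, destination: Airport) -> str:
    """Return a WKT LINESTRING from source to destination (longitude first)."""
    return (
        f"LINESTRING({source.longitude} {source.latitude}, "
        f"{destination.longitude} {destination.latitude})"
    )


def build_routes(airports, routes):
    """Join (source_id, destination_id) pairs to airports and de-duplicate.

    Pairs with an unknown airport are skipped here; the SQL version uses
    LEFT JOIN, so it would keep such rows with a NULL route instead.
    """
    by_id = {a.airport_id: a for a in airports}
    seen = set()
    result = []
    for source_id, destination_id in routes:
        if (source_id, destination_id) in seen:
            continue  # same effect as the GROUP BY: one row per distinct route
        seen.add((source_id, destination_id))
        source = by_id.get(source_id)
        destination = by_id.get(destination_id)
        if source is None or destination is None:
            continue
        result.append({
            "id": len(result) + 1,  # plays the role of ROW_NUMBER()
            "source_name": source.name,
            "destination_name": destination.name,
            "route_wkt": route_wkt(source, destination),
        })
    return result


# Illustrative data only, not taken from the OpenFlights files
airports = [Airport(1, "A", 10.0, 20.0), Airport(2, "B", 30.5, -40.0)]
routes = [(1, 2), (1, 2), (2, 1), (2, 99)]
for row in build_routes(airports, routes):
    print(row)
```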

I saved the query result in a table, then plotted it using Data Studio

and here is the result, which I shared on Reddit 🙂

I think it is fair to say that people love maps, and a lot of users appreciate that you can download the data straight from Data Studio. You can play with the report here

Really excellent work by the Data Studio team.

First Look at BigQuery BI Engine with PowerBI

Google made BigQuery BI Engine available in a public preview; you need to enroll first here. For the last two years it was available only for Google Data Studio, and I had used it extensively for this project, so I was really curious how it would work with PowerBI.

I don’t think I know enough to even try to produce a benchmark. Instead, I am interested in only one question: how much value can I get from the lowest tier of BI Engine, and can PowerBI work smoothly enough to make Direct Query a realistic option?

The BigQuery team was nice enough to offer a 100 GB reservation free of charge for the preview period. Just to keep it realistic, I kept the reservation at 1 GB, which costs $30 per month. I built a couple of reports in PowerBI and observed how BI Engine behaves and what the query statistics look like. The report is located here; the data uses Direct Query mode, and the query statistics update every hour.

How BI Engine Works

It is extremely easy to set up: just select how much memory you want to reserve per project, and that’s all. You pay per GB reserved per hour.

Keep in mind that the project used for the reservation can query any other project as long as it is in the same region. In PowerBI, you can define which project is used for the query

After you wait a couple of minutes for BI Engine to start, this is more or less how it works:
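
As a rough back-of-the-envelope check, the $30 per month for a 1 GB reservation mentioned earlier can be turned into an approximate hourly rate. The sketch below assumes an average month of 730 hours (my assumption, not an official figure) and the helper names are made up; check the current price list before relying on it.

```python
HOURS_PER_MONTH = 730        # assumption: average number of hours in a month
MONTHLY_PRICE_PER_GB = 30.0  # from the post: about $30 per month for 1 GB


def hourly_rate_per_gb() -> float:
    """Approximate price of 1 GB of reservation for one hour."""
    return MONTHLY_PRICE_PER_GB / HOURS_PER_MONTH


def reservation_cost(gb: float, hours: float) -> float:
    """Approximate cost of keeping `gb` reserved for `hours` hours."""
    return gb * hours * hourly_rate_per_gb()


print(f"per GB-hour: ${hourly_rate_per_gb():.4f}")
print(f"2 GB kept for one week: ${reservation_cost(2, 7 * 24):.2f}")
```

Since billing is per hour, a reservation that is only kept during working hours would cost proportionally less than the full monthly figure.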

1- A query is received by BigQuery. Based on the columns used in the query, BI Engine loads only those columns and partitions into memory. The first query will always be slower, as it has to scan the data from BigQuery storage and compress it in memory; in my case this usually takes 1-2 seconds

2- A second query arrives; the data is already in memory, so it is very fast, 100 ms to 500 ms

3- The same query as in 2 arrives; BigQuery will just hit the cache. That’s the sweet spot, less than 100 ms

4- A new query arrives that targets a different table. That’s the interesting part: based on the size of the scanned columns and the available reservation, BI Engine either evicts the old table from memory or decides that there is not enough reservation, in which case it falls back to the default BigQuery engine, where you pay for the data scanned

5- A query arrives that contains a feature not supported by BI Engine; it will fall back to the default engine

6- Data is appended to the base table, or the table changes; BI Engine will invalidate the cache, then load the delta into memory, or load everything again if the table was truncated
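
The six cases above can be sketched as a toy decision model. This is only my mental model written as Python, not BI Engine’s actual algorithm: the class, the method names and the “evict the oldest table first” rule are all assumptions, it ignores per-column loading, and on a table change it simply drops everything for that table instead of loading a delta.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBIEngine:
    """Toy model of the cases above; NOT the real BI Engine logic."""
    reservation_bytes: int
    loaded: dict = field(default_factory=dict)        # table -> bytes in memory
    result_cache: dict = field(default_factory=dict)  # query text -> table read

    def invalidate(self, table):
        """Case 6: the table changed, so forget its data and cached results."""
        self.loaded.pop(table, None)
        self.result_cache = {
            q: t for q, t in self.result_cache.items() if t != table
        }

    def serve(self, query, table, column_bytes, supported=True):
        """Return where the query is served from."""
        if not supported:
            return "bigquery"   # case 5: unsupported feature, default engine
        if query in self.result_cache:
            return "cache"      # case 3: identical query seen before
        if table in self.loaded:
            self.result_cache[query] = table
            return "memory"     # case 2: data already in memory
        if column_bytes > self.reservation_bytes:
            return "bigquery"   # case 4: does not fit, pay for scanned bytes
        # Case 4: make room by evicting the oldest loaded tables (assumed policy)
        while sum(self.loaded.values()) + column_bytes > self.reservation_bytes:
            oldest = next(iter(self.loaded))
            self.loaded.pop(oldest)
        self.loaded[table] = column_bytes
        self.result_cache[query] = table
        return "load"           # case 1: first query, scan storage and load


engine = ToyBIEngine(reservation_bytes=1_000)
print(engine.serve("q1", "big_table", 600))    # first query on a table
print(engine.serve("q2", "big_table", 600))    # different query, same table
print(engine.serve("q2", "big_table", 600))    # identical query again
print(engine.serve("q3", "other_table", 500))  # forces an eviction
print(engine.serve("q4", "huge_table", 5_000)) # larger than the reservation
```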

Obviously it is much more complex behind the scenes, but I find it fascinating that BI Engine decides, in a fraction of a second, the best way to serve the query (cache, memory or storage)

Personally I am very interested in case 4: obviously, if I reserve memory I want to keep scanned storage as low as possible. Here is the result for the last 10 days. I think that’s a great result; my “big table” is 6 GB and 50 million rows, and the other tables are smaller. The dev team said they are working on further improving how the BI Engine algorithm deals with smaller tables; so far I am happy with that. (It is fixed now; memory consumption is extremely low.)

I appreciate that users with flat-rate pricing would not care that much about bytes scanned, but for users with usage-based pricing it is a very important factor

Query Performance

Again, the results are based on my usage; the only way to know is to try it yourself. Nearly 72 % of queries render in less than 100 ms. I think it is fair to say we are moving into a different kind of data warehouse architecture

PowerBI Performance

Query performance is only part of the whole story; you need to add network latency and PowerBI overhead (DAX calculation, visual rendering, etc.). My data is in the Tokyo region and the PowerBI Service is located in Melbourne, a bit of distance I would say, and using Publish to web adds extra latency.

The good thing is that using Direct Query on a 51 million row fact table with 5 dimensions is an achievement in itself, and I feel I can use that in production. At the same time, with another report, it seems I am hitting a bug in the ODBC driver, and the performance is not good.

But to be totally honest, it seems the PowerBI driver for BigQuery is far from optimized. It seems to use the Simba ODBC driver, whereas other BI tools use the native API, which is substantially faster. Still, I have reason to believe the PowerBI team will invest in better integration (PowerBI parameters in SQL queries are coming, for example)

I think it is an extremely interesting new development, especially if you have big fact tables or data that changes very frequently. Direct Query mode has a lot of advantages: it is very simple to set up, the data is always fresh and there is no data movement. And BI Engine is fast, extremely fast, and cheap! I am using a state-of-the-art data warehouse for $30 per month! Now it is up to the PowerBI team to take advantage of that.

PowerBI Incremental refresh Parquet files, without a Database.

TL;DR: you can incrementally refresh PowerBI using Parquet files stored in Azure Storage without any database in the middle. You can download a sample pbix here

I am using this blog post by Gilbert Quevauvilliers, which is based on a technique from Rafael Mendonça. Please read it first

Maybe read this too: it uses Synapse Serverless, but has a section on how to partition your data to Parquet using Python

1-Add a new Table, Parquet

Make sure it is not loaded. Here is the M code

let
     Source = AzureStorage.DataLake("https://xxxxxx.core.windows.net/parquet"),
     #"Removed Other Columns" = Table.SelectColumns(Source,{"Content", "Folder Path"}),
     #"Inserted Text Between Delimiters" = Table.AddColumn(#"Removed Other Columns", "Text Between Delimiters", each Text.BetweenDelimiters([Folder Path], "D", "/", 1, 0), type text),
     #"Renamed Columns" = Table.RenameColumns(#"Inserted Text Between Delimiters",{{"Text Between Delimiters", "Date"}}),
     #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"Date", type datetime}}),
     #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Folder Path"})
 in
     #"Removed Columns"

here is the result

3-Merge using Inner Join

To read the Parquet file content we use this function. Notice that we used an inner join in the previous step to avoid reading null Content, which generates errors when you refresh in the service

Parquet.Document([Content])

and here is the final table

We configure incremental refresh to refresh the last 2 days
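
To illustrate why only a couple of files end up being read, here is a minimal Python sketch of the filtering idea: each folder name carries a date, and only the folders that fall inside the refresh window are kept. The folder layout (`D2021-03-04/`), the function names and the exact window boundaries are assumptions for illustration; PowerBI’s own RangeStart/RangeEnd handling may cut the window differently.

```python
from datetime import date, datetime, timedelta


def folder_date(folder_path: str) -> date:
    """Parse the date from a folder named like '.../D2021-03-04/' (assumed layout)."""
    folder_name = folder_path.rstrip("/").rsplit("/", 1)[-1]
    return datetime.strptime(folder_name, "D%Y-%m-%d").date()


def folders_to_refresh(folder_paths, today: date, days: int = 2):
    """Keep only folders dated within the last `days` days, today included."""
    range_start = today - timedelta(days=days - 1)
    return [
        path for path in folder_paths
        if range_start <= folder_date(path) <= today
    ]


# Hypothetical folder listing, one folder of Parquet files per day
folders = [f"parquet/D2021-03-0{day}/" for day in range(1, 6)]
print(folders_to_refresh(folders, today=date(2021, 3, 5)))
```

Only the files in the returned folders would then be opened with Parquet.Document; everything older stays untouched in its existing partition.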

4- Testing in PowerBI Service

As you can see, the second refresh is way faster than the first one

here is the partition Table

Now let’s check the transaction history from Azure Storage. I refreshed again just to be sure

The second refresh reads substantially less data, as only two files are read

I think that with PowerBI Desktop supporting Parquet, we will see more exciting scenarios. I can’t wait for Dataflow to support export to Parquet!

If you are still reading, I would appreciate a vote on this idea: having an option in Dataflow to export to a dynamic file name

Load Data to PowerBI Push Dataset using Easymorph

Easymorph is a very powerful data preparation tool for business users. You can export your results to csv, to a growing list of databases, or to a PowerBI Push Dataset

I think that with the new composite models, the Push Dataset has become rather interesting, as it behaves more or less like a regular dataset (see limitations here); you can add relationships to other tables, etc.

No Code Data Pipeline

Just to test it, I loaded 7 csv files, applied the usual transformations (filter, select columns, unpivot), then generated two PowerBI datasets, Fact and Dimension.

As Push Datasets append by default, I first delete all the rows in the PowerBI dataset before loading new data, just to avoid duplicates. The third export is a csv file (I wish that one day PowerQuery will have that option without using hacks)
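
For readers curious about what the “delete then load” pattern looks like against the PowerBI REST API, here is a hedged Python sketch that only builds the HTTP requests (it does not send them). I don’t know what Easymorph does internally; this just shows the same pattern using the Push Dataset rows endpoint under “My workspace”. The function names are made up, the access token is a placeholder, and the 10,000 rows per POST batch size reflects the documented limit as I remember it, so please verify it against the current documentation.

```python
import json
from urllib.request import Request

API_ROOT = "https://api.powerbi.com/v1.0/myorg"
MAX_ROWS_PER_POST = 10_000  # assumed Push Dataset limit per request; verify


def rows_url(dataset_id: str, table_name: str) -> str:
    """URL of the rows collection of a Push Dataset table (My workspace)."""
    return f"{API_ROOT}/datasets/{dataset_id}/tables/{table_name}/rows"


def build_reload_requests(dataset_id, table_name, rows, token):
    """Build one DELETE (clear the table) followed by batched POSTs (new rows)."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    url = rows_url(dataset_id, table_name)
    # Push Datasets append by default, so clear existing rows first
    requests = [Request(url, headers=headers, method="DELETE")]
    for start in range(0, len(rows), MAX_ROWS_PER_POST):
        batch = rows[start:start + MAX_ROWS_PER_POST]
        body = json.dumps({"rows": batch}).encode("utf-8")
        requests.append(Request(url, data=body, headers=headers, method="POST"))
    return requests


# Placeholder ids and token, purely illustrative
sample_rows = [{"id": i, "value": i * 2} for i in range(25_000)]
for req in build_reload_requests("my-dataset-id", "Fact", sample_rows, "TOKEN"):
    print(req.get_method(), req.full_url)
```

Each request could then be sent with `urllib.request.urlopen`, in order, once a real token is available.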

and here is the Model in the service

PowerBI Support

There are two types of support

Power BI Command

You can list workspaces, refresh Datasets and Dataflows, etc.; see Documentation

Export Data to PowerBI

Microsoft provides an API only for Push Datasets; see Documentation

Easymorph has a very generous free tier, and the license is reasonable. I think it is worth having a look.

PowerBI has to be more Open

Personally I think PowerBI would gain more by being more open to third-party tools. I hope that one day the Vertipaq engine will be as open as SQL Server, in the sense that any tool can write and read data; it is a database after all. I am not suggesting making it open source or free (you would obviously still need to pay for a license); adding a PowerBI REST API for regular datasets, for example, would be a good start.

Now maybe I am dreaming, but when I see file formats like Parquet, I wonder why we don’t have an open API to load into and read from the Vertipaq engine storage format: it has amazing compression, it is columnar and it supports multiple schemas in the same dataset.

Marco Russo has expressed this idea more gracefully here