Running a Serverless DuckDB on Google Cloud

TL;DR: it is easy to set up and works relatively well, but there is a catch: watch out for Cloud Storage throughput. I shared a notebook here.

In a previous blog, I showed a POC of how to run queries from a Colab notebook against a Delta Lake table. But what if you want to run the same query from other tools, or run a query in a different region and avoid egress fees? It turns out it is extremely easy to set up.

And here is an overall view of the architecture. The most important decision is to make sure Cloud Storage and the Cloud Function are in the same region, in my case “us-central1”; you can call the function from anywhere.

Google Cloud Functions

As I said before, the code is very simple; I spent some time googling how to convert the results from bytes to JSON to a DataFrame. I think that, besides BigQuery, Cloud Functions is the easiest service to set up: just write your code and Google Cloud handles the rest. Just for fun, I used a machine with 8 CPUs and 32 GB of RAM.

import pyarrow.dataset as ds
import duckdb
import json

# Register the Parquet files in Cloud Storage as an Arrow dataset; DuckDB can
# query the Python variable "lineitem" directly thanks to replacement scans.
lineitem = ds.dataset("gs://xxxxx/lineitem", format="parquet", partitioning="hive")
con = duckdb.connect()

def Query(request):
    # Cloud Functions HTTP entry point: the SQL text is passed in the "name" key.
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    # to_json() already returns a JSON string; json.dumps encodes it once more,
    # so the caller has to decode twice.
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}

How to Call the Web API

A lot of tools can send a web API call as long as they have the correct authentication. I started with a Python script for a simple reason: it was the easiest code to find on the internet.

Here is the interesting part: you write any arbitrary SQL, send an API call, and get back a JSON with the results. Users don’t need to know anything about the Cloud Function; all they need is the web address and a correct SQL query.
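
Here is a minimal sketch of such a caller, assuming the function only accepts authenticated invokers; the URL and the run_query helper are placeholders of mine, not the original script.

import json
import requests
import google.auth.transport.requests
from google.oauth2 import id_token

# Placeholder URL: use the address shown by the Cloud Functions deployment.
FUNCTION_URL = "https://us-central1-myproject.cloudfunctions.net/Query"

def run_query(sql):
    # Fetch an identity token for the function (using the local gcloud or
    # service-account credentials) so the call passes IAM authentication.
    auth_request = google.auth.transport.requests.Request()
    token = id_token.fetch_id_token(auth_request, FUNCTION_URL)
    # The function reads the SQL text from the "name" key of the JSON body.
    response = requests.post(
        FUNCTION_URL,
        json={"name": sql},
        headers={"Authorization": f"Bearer {token}"},
        timeout=300,
    )
    response.raise_for_status()
    # The function double-encodes the DataFrame, so the body is a JSON string
    # of records; decode it once more to get a list of dicts.
    body = response.json()
    return json.loads(body) if isinstance(body, str) else body

print(run_query("SELECT count(*) AS cnt FROM lineitem"))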

Performance considerations

This is where it gets very interesting. I have no clue where the bottleneck is, but we can ask some questions.

Cold Start

  • The first example took 334 ms, which is impressive, but I was cheating and showed the best-case scenario: the Cloud Function (or more precisely, Cloud Run) was already running, so there was no cold start, and DuckDB was running a local query that did not require a call to Cloud Storage.
  • Currently the cold start for Cloud Functions Gen 2 (built on Cloud Run) is around 10 seconds; notice it is still in preview.

Transfer from Google Cloud Storage

Let’s try this simple query; we get the result in 29 seconds.

The same query on my laptop takes 600 ms; by the way, the lineitem table contains 60 million rows!

I don’t know why there is such a massive difference. I presumed network speed was the limit, but when I look at the bucket stats, the throughput is actually very good: nearly 500 MB/s.

I am no network expert, but that number doesn’t seem right. When I check how much data the Cloud Function is receiving, the discrepancy starts making more sense: on average I am getting around 30 MB/s, with a maximum of 50 MB/s. I have to say this is really slow!
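
One rough way to check whether the limit is the function’s network rather than the Arrow layer is to time a direct download with the google-cloud-storage client from the same region; the bucket name and prefix below are placeholders.

import time
from google.cloud import storage

client = storage.Client()
start = time.perf_counter()
total_bytes = 0
# Download every Parquet file under the lineitem prefix and measure the rate.
for blob in client.list_blobs("xxxxx", prefix="lineitem/"):
    total_bytes += len(blob.download_as_bytes())
elapsed = time.perf_counter() - start
print(f"downloaded {total_bytes / 1e6:.0f} MB at {total_bytes / elapsed / 1e6:.0f} MB/s")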

File Pruning

The Arrow dataset is smart enough to prune the columns that are not used by the query. In this query I added a filter on L_shipdate; notice the Parquet files are sorted on that field, and as expected the performance is very good, 1.7 seconds: DuckDB scans only the row groups that contain the date ‘1998-09-02’.
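
As an illustration, a filtered query of this shape (sent through the hypothetical run_query helper sketched earlier) only touches the row groups whose min/max statistics for l_shipdate cover the requested date.

# Filter on the sort column: only row groups whose statistics include
# '1998-09-02' are actually read from Cloud Storage.
sql = """
    SELECT l_returnflag, count(*) AS cnt
    FROM lineitem
    WHERE l_shipdate = DATE '1998-09-02'
    GROUP BY l_returnflag
"""
print(run_query(sql))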

Dataset Catalog

I am defining a very rudimentary catalog; the user can just call “SHOW TABLES”.

You can even check the table schema
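
For example, assuming the Arrow dataset is registered as a view inside the function (e.g. with con.register("lineitem", lineitem)), these catalog queries go through the same endpoint.

# Catalog queries answered by DuckDB itself; they assume the dataset was
# registered as a view, e.g. con.register("lineitem", lineitem).
print(run_query("SHOW TABLES"))        # list the registered tables/views
print(run_query("DESCRIBE lineitem"))  # column names and types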

Cloud Storage throughput is the bottleneck.

Having a speed of 30 MB/s makes the whole setup good only for a POC or for querying a small dataset. I don’t know the reason for such poor performance from Cloud Run; I suspect the Apache Arrow implementation is not optimized for GCP, although it works very well on a local file system.

Another missing piece is caching: it would have been good if DuckDB could somehow cache the data already queried. But a cache is very hard to implement, especially if you want cache invalidation, and you risk reinventing a full data warehouse. I genuinely hope this is a bug and Cloud Run can provide better network performance.
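
For what it is worth, a naive result cache keyed on the SQL text is easy to sketch; the hard part it ignores is exactly the invalidation when the underlying Parquet files change.

# Naive result cache keyed on the SQL text; illustrative only, it never
# invalidates, so it will happily serve stale results if the files change.
cache = {}

def Query(request):
    SQL = request.get_json().get('name')
    if SQL not in cache:
        df = con.execute(SQL).df()
        cache[SQL] = json.dumps(df.to_json(orient="records"))
    return cache[SQL], 200, {'Content-Type': 'application/json'}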

Pushdown Filters to Parquet From PowerBI Using Synapse Serverless

TL;DR: we use the filepath() function to get the partition date, which can be passed from PowerBI when used with Synapse Serverless, potentially making substantial cost savings when used with incremental refresh, as fewer files will be scanned.

Edit: there is another technique to incrementally load Parquet files without a database.

This blog is just me experimenting with the possibility of passing the filters from PowerBI to a Parquet file using Synapse Serverless.

When you deal with on-demand query engines like Athena, BigQuery, Synapse Serverless, etc., the fewer files read, the less cost incurred. So if you somehow manage to read only the files that contain the data, without scanning all your folders, you can make substantial savings.

To be clear, the first time I used Parquet was yesterday, so I have only very basic knowledge at this stage.

1- Get the Source Data

I have a CSV file with 3 million rows; here is the schema.

2- Partition The Data

As far as I understand the Synapse Serverless engine, when you filter by date, for example, the engine will scan the whole file (maybe if the column is sorted it will scan less data; I don’t know and have not tried it yet). Instead, we are going to partition the table by date, which simply generates a lot of files split by the date values. I imagine generating a lot of small files is not optimal either: it will reduce cost but potentially make the query slower (at least this is how other engines work).

I am using this Python script to generate the partitions.

import pandas as pd

# Read the CSV and derive a Date column from the settlement timestamp.
df = pd.read_csv('box.csv')
df['SETTLEMENTDATE'] = pd.to_datetime(df['SETTLEMENTDATE'])
df['Date'] = df['SETTLEMENTDATE'].dt.date

# Write one Parquet folder per date value (hive-style partitioning).
df.to_parquet('nem.parquet', partition_cols=['Date'], allow_truncated_timestamps=True)

And here is the result: a folder of Parquet files grouped by Date,

which I can read using PowerBI Desktop.

I just casually read a Parquet file without any programming language (kudos to the product team)! This is only to test that the Parquet was properly generated.

3- Load The files to Azure Storage

This is only a showcase; in a production workflow you may be using something like ADF or Python cloud functions. Anyway, to upload a lot of files, I am using Azure Storage Explorer.
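
For a scripted alternative, something like the azure-storage-blob client can walk the partition folder and upload each file; the connection string, container name and local folder below are placeholders.

import os
from azure.storage.blob import ContainerClient

# Placeholders: the real connection string and container name come from your storage account.
container = ContainerClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="parquet",
)

for dirpath, _, filenames in os.walk("nem.parquet"):
    for name in filenames:
        path = os.path.join(dirpath, name)
        # Keep the Date=... folder structure as part of the blob name.
        blob_name = path.replace(os.sep, "/")
        with open(path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)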

Now we can start using Synapse Analytics.

4- Create a View In Synapse Serverless

We are going to leverage the filepath() function to get the partition date (please see the documentation). Here is the query to create the view:

USE [test];
GO
DROP VIEW IF EXISTS parquet;
GO

CREATE VIEW parquet AS
SELECT
    *,
    CONVERT(datetime, result.filepath(1), 120) AS date
FROM
    OPENROWSET(
        BULK 'https://xxxxxxxxxx.dfs.core.windows.net/parquet/nem.parquet/Date=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS result;

5- Test if the Partition Filter Works

Let’s run some tests to see how much data is scanned.

No filter, scanning all the files: data processed 48 MB.

Now with a filter on only the date 11/04/2018: data processed 1 MB.
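
The same test can be run outside Synapse Studio from any SQL client; here is a sketch with pyodbc, where the server name, database and credentials are placeholders.

import pyodbc

# Serverless SQL endpoint, database and credentials are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=xxxxxxxxxx-ondemand.sql.azuresynapse.net;"
    "Database=test;Uid=user;Pwd=password;Encrypt=yes;"
)
cursor = conn.cursor()
# Only the Date=2018-04-11 folder should be scanned, thanks to the
# filepath() column exposed by the view.
cursor.execute("SELECT count(*) FROM parquet WHERE [date] = '2018-04-11'")
print(cursor.fetchone()[0])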

6- Test in PowerBI

I built this small model: the Date table is import and the Parquet table is DirectQuery. (Be cautious when using DirectQuery: PBI can be very chatty and generate a lot of SQL queries. Currently it is better to use only import mode, until cache support is added.)

Case 1: View all Dates

Data processed: 43 MB

Case 2: Filter Some Dates

Let’s filter only 2 days.

Note: please use the query reduction option in PowerBI Desktop; otherwise, every time you move the slicer a query will be generated.

Here is the result

I was really excited when I saw the results: 1 MB. Synapse Serverless was smart enough to scan only the 2 files.

7- Incremental Refresh

Instead of DirectQuery, let’s try a more practical use case: I configured incremental refresh to refresh the data only for the last 3 days.

And here is the list of queries generated by PowerBI.

I only have data for 2018 and 2019; the second refresh (which started at 3:45 PM) just refreshed the data for the last three days, and because there are no files for those dates, the query returned 0 MB, which is great.

Another nice touch: the SELECT TOP 100 query generated by Power Query to check the field types scanned only 1 MB!

Just to be sure, I did another refresh at 4:24 PM and checked the partitions using SSMS.

Only the last three partitions were refreshed, and PQ sent only 6 queries (for each day, one selects the data and the other checks the field types).

I think Azure Storage + Synapse Analytics Serverless + PowerBI incremental refresh may end up being a very powerful pattern.