Delta lake with Python, Local Storage and DuckDB

TL;DR : I added a streamlit app here

a new experimental support for Writing Delta storage format using only Python was added recently and I thought it is a nice opportunity to play with it.

Apache Spark had a native support since day one, but personally the volume of data I deal with does not justify running Spark, hence the excitement when I learned we can finally just use Python.

instead of another hot take on how Delta Works, I just built a Python Notebook, that download files from a web site ( Australian Energy Market), create a delta table, then I use DuckDB and vega lite to show a chart, all you need to do is to define the Location of the Delta Table, I thought it maybe a useful example, all the code are located here

And I added a PowerBI report using the delta Connector

Some Observations

Currently DuckDB don’t support Delta natively, instead we first read the Delta table using pyarrow which DuckDB can read automatically, at this stage, I am not sure if DuckDB can push down filter selection or read the stats saved in the Log file, and currently, it seems only AWS S3 is supported.

Advertisement

One thought on “Delta lake with Python, Local Storage and DuckDB”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: