Sharing Public Data Using OneLake, Delta and Cloudflare R2

The use case is very simple: share public data continuously and as cheaply as possible, especially when the consumers are in a different geographic region.

Note: This is not an officially supported solution, and the data can be inconsistent while it is being copied to R2, but it is good enough for public data.

How to 

1- The data is prepared and cleaned using Fabric and saved in OneLake.

2- Copy the data to Cloudflare R2 using code. As of today, S3 shortcuts do not support write operations. Dataflow Gen2 (data pipeline) supports S3 as a destination, although I did not test it; I used code because I already had some from a previous project (a sketch of the copy step follows these steps). You pay egress fees for this operation, plus R2 storage and transaction costs.

3- Provide an access token to users, or make the bucket public. You don't pay egress fees from Cloudflare to end users, but the throughput is not guaranteed.
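For step 2, here is a minimal sketch of the copy using PyArrow's S3-compatible filesystem. The lakehouse mount path, the placeholder credentials and the "auto" region are assumptions; the bucket and prefix match the ones used in the compaction example later in this post.

import pyarrow.fs as fs

# R2 is S3-compatible: point PyArrow's S3 filesystem at the account endpoint.
r2 = fs.S3FileSystem(
    access_key="<R2_ACCESS_KEY_ID>",
    secret_key="<R2_SECRET>",
    endpoint_override="<account_id>.r2.cloudflarestorage.com",  # https is the default scheme
    region="auto",  # R2 expects the pseudo-region "auto"
)

# In a Fabric notebook the lakehouse Files area is usually mounted at
# /lakehouse/default/Files (assumption; adjust to your environment).
local = fs.LocalFileSystem()

fs.copy_files(
    "/lakehouse/default/Files/aemo/scada/data",  # source folder in OneLake
    "delta/aemo/scada/data",                     # destination bucket/prefix in R2
    source_filesystem=local,
    destination_filesystem=r2,
)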

Today, Fabric shortcuts require the ListBuckets permission; please vote for this idea to remove this requirement.

For example, I am writing public data in Fabric US-West and consuming it in Fabric Melbourne.

Make sure you turn on caching for OneLake; it helps performance a lot.

You can try it  

You can try it yourself using these credentials; they are temporary, and I may delete them at any time.

Access Key ID: 3a3d5b5ce8c296e41a6de910d30e7fb6

Secret: 9a080220941f3ff0f22ac93c7d2f5ec1d73a77cd3a141416b30c1239efc50777

Endpoint: https://c261c23c6a526f1de4652183768d7019.r2.cloudflarestorage.com
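For example, you can read the shared data with PyArrow from anywhere. The bucket name and prefix below are the ones used in the compaction example later in this post; adjust them if the layout changes.

import pyarrow.dataset as ds
import pyarrow.fs as fs

r2 = fs.S3FileSystem(
    access_key="3a3d5b5ce8c296e41a6de910d30e7fb6",
    secret_key="9a080220941f3ff0f22ac93c7d2f5ec1d73a77cd3a141416b30c1239efc50777",
    endpoint_override="c261c23c6a526f1de4652183768d7019.r2.cloudflarestorage.com",
    region="auto",  # R2 expects the pseudo-region "auto"
)

# Hive-partitioned Parquet folder in the shared bucket
dataset = ds.dataset("delta/aemo/scada/data", filesystem=r2,
                     format="parquet", partitioning="hive")
print(dataset.head(5))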

Using Apache Arrow Dataset to compact old partitions

It is a trick I learned today and thought might be useful to share. I have a folder of Parquet files, partitioned by day using the Hive style. The data is ingested every 5 minutes, which ends up generating 288 small Parquet files per day. That is rather nice for a write scenario, but reading the data will be slow, as opening individual files generates a big overhead. It is a well documented problem, and more sophisticated table formats like Delta Table and Iceberg fix it by using compaction, but that does not work with Python. (Edit: by Python, I mean engines like DataFusion, DuckDB and Pandas, not Spark, which does not make sense for a small dataset.)

In my example I use only Python and the PyArrow dataset API, which does not support compaction, but maybe there is a solution.

Just for illustration, here is a view of my bucket in Cloudflare R2 (PyArrow supports S3, GCS and Azure).

Warning: the code will delete existing files; use at your own risk.

  • Read the existing partitions except today's data; otherwise you may end up with a concurrent write, which will corrupt your table.
  • Filter only the partitions that contain more than 1 file, something like this using DuckDB:
create view base as select * from parquet_scan('s3://delta/aemo/scada/data/*/*.parquet', HIVE_PARTITIONING = 1, filename = 1) where Date < '{cut_off}';
create view filter as select Date, count(distinct filename) as cnt from base group by 1 having cnt > 1;
  • Read the data using the previous filter; again, we are not touching today's partition, to avoid any conflicts:
tb = con.execute('''select SETTLEMENTDATE, DUID, SCADAVALUE, file, cast(base.Date as date) as Date
                    from base inner join filter on base.Date = filter.Date''').arrow()
ds.write_dataset(tb, "delta/aemo/scada/data/", filesystem=s3, format="parquet",
                 partitioning=['Date'], partitioning_flavor="hive",
                 min_rows_per_group=120000, existing_data_behavior="delete_matching")

Again, there is no support for transactions; if your code for whatever reason does not complete, you will end up with an unstable table.

  • And here are the results: all old partitions have only 1 file.

You need to run the job only once a day. Hopefully sometime next year either Apache Iceberg or Delta Table will provide compaction for the Python client; in the meantime maybe this approach is good enough :). You can see the full code here, and a condensed sketch of the whole job follows below.
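For reference, here is a minimal self-contained sketch of the job described above. It assumes DuckDB with the httpfs extension and PyArrow pointed at R2, uses the bucket layout and credentials from this post, and keeps only the data columns in the final select; treat it as a starting point, not production code.

from datetime import date

import duckdb
import pyarrow.dataset as ds
import pyarrow.fs as fs

ACCOUNT = "c261c23c6a526f1de4652183768d7019"
KEY = "3a3d5b5ce8c296e41a6de910d30e7fb6"
SECRET = "9a080220941f3ff0f22ac93c7d2f5ec1d73a77cd3a141416b30c1239efc50777"

cut_off = date.today().isoformat()  # never touch today's partition

# PyArrow writes the compacted files back to R2
s3 = fs.S3FileSystem(access_key=KEY, secret_key=SECRET, region="auto",
                     endpoint_override=f"{ACCOUNT}.r2.cloudflarestorage.com")

# DuckDB reads the existing partitions over the S3-compatible API
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute(f"SET s3_endpoint='{ACCOUNT}.r2.cloudflarestorage.com'")
con.execute(f"SET s3_access_key_id='{KEY}'")
con.execute(f"SET s3_secret_access_key='{SECRET}'")
con.execute("SET s3_region='auto'")
con.execute("SET s3_url_style='path'")

# Old partitions that still contain more than one file
con.execute(f"""create view base as select * from
    parquet_scan('s3://delta/aemo/scada/data/*/*.parquet', HIVE_PARTITIONING=1, filename=1)
    where Date < '{cut_off}'""")
con.execute("""create view filter as select Date, count(distinct filename) as cnt
    from base group by 1 having cnt > 1""")

# Rewrite each of those partitions as a single, larger file
tb = con.execute("""select SETTLEMENTDATE, DUID, SCADAVALUE, cast(base.Date as date) as Date
    from base inner join filter on base.Date = filter.Date""").arrow()

if tb.num_rows > 0:
    ds.write_dataset(tb, "delta/aemo/scada/data/", filesystem=s3, format="parquet",
                     partitioning=["Date"], partitioning_flavor="hive",
                     min_rows_per_group=120000,
                     existing_data_behavior="delete_matching")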

Another approach is copy-on-write: basically, every time you ingest new data, you copy the existing data, append it to the new data and overwrite the existing files. But it may be an expensive operation, especially if your job runs more frequently. A rough sketch follows.
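Here is a rough sketch of that approach, assuming the same R2 filesystem object s3 as above and a PyArrow table new_tb holding the freshly ingested rows (both names are illustrative).

import pyarrow as pa
import pyarrow.dataset as ds

def copy_on_write(s3, new_tb, path="delta/aemo/scada/data/"):
    # Read everything that already exists in the bucket (this is the expensive part)
    existing = ds.dataset(path, filesystem=s3, format="parquet",
                          partitioning="hive").to_table()
    # Append the new rows; assumes new_tb has the same schema, including Date
    merged = pa.concat_tables([existing, new_tb])
    # Overwrite the matching partitions in place
    ds.write_dataset(merged, path, filesystem=s3, format="parquet",
                     partitioning=["Date"], partitioning_flavor="hive",
                     existing_data_behavior="delete_matching")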