In this blog, I will show how to use Python (or R) from within PowerQuery to implement an incremental refresh in PowerBI. Nothing here is really new (I am sure Imke and Maxim have blogged about it before).
In a previous blog, I showed how to use the R and Python integration to load data into a database.
This approach makes sense only when you do a lot of heavy transformation and your data source changes over time.
As an example, in my previous job we received a new Excel file every Monday (300K rows), and that file was approved and corrected every Thursday.
The workflow was: save the files in a folder, then do the transformation. That was fine at first, but after the first year there were around 52 files. Technically you only need to transform the latest file, but as PowerBI does not support incremental refresh, we redid everything twice a week. After two years, the refresh took nearly 30 minutes and we sometimes got out-of-memory errors.
In the big picture, half an hour was not that bad (we had a desktop just for the refresh). The worst part was that you would refresh the model and, as soon as it finished, a new revision would arrive and you had to refresh again.
Now, using a Python/R script, the idea is that every file gets transformed only once, regardless of how many times you refresh, simply by exporting the transformed result of each file as a csv in a staging folder.
- The first run is slow, as it processes all the existing files in Source Data, but subsequent runs transform only the new files.
- Let’s say File 2 was revised: all you need to do is delete File2.csv and it will be transformed again, but only that file.
- OK, if you look at step 4, the files are reloaded each time. I am not too worried about that, as batch loading csv files from a folder with PowerQuery is relatively fast (yes, a bit slow compared to R); the bottleneck is rather the transformation.
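The selection logic described in the bullets above can be sketched in plain Python. This is only an illustration of the idea, not the actual PowerQuery steps: the function name `files_to_transform`, the folder layout and the file names are hypothetical, and it assumes each staged csv keeps the same base name as its source file.

```python
from pathlib import Path
import tempfile


def files_to_transform(source_dir, staging_dir, source_ext=".zip"):
    """Return the source files that have no matching csv in the staging folder."""
    # Base names (without extension) of everything already exported
    staged = {p.stem for p in Path(staging_dir).glob("*.csv")}
    return sorted(
        p.name
        for p in Path(source_dir).glob(f"*{source_ext}")
        if p.stem not in staged
    )


# Demo with throwaway folders and hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "source")
    staging = Path(tmp, "staging")
    source.mkdir()
    staging.mkdir()
    for name in ("File1.zip", "File2.zip", "File3.zip"):
        (source / name).touch()
    for name in ("File1.csv", "File2.csv"):
        (staging / name).touch()

    first = files_to_transform(source, staging)   # File3 has not been staged yet
    (staging / "File2.csv").unlink()              # File 2 was revised: drop its csv
    second = files_to_transform(source, staging)  # File2 is picked up again
    print(first, second)
```

Deleting a staged csv is the only "invalidate" operation in this scheme, which is what makes it easy to reprocess a single revised file.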
Here is the code for the Python script. As you can see, the PowerQuery integration is amazing: just add a new step and you get a dataframe, that’s all.
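# 'dataset' holds the input data for this script
# Split the incoming dataframe into one group per source file
df_by_filename = dataset.groupby("filename")
for (filename, filename_df) in df_by_filename:
    # Build the name of the staged csv from the source file name
    filename = filename.replace("zip", "csv")
    filename = filename.replace("PUBLIC_DAILY", "UNIT_PUBLIC_DAILY")
    # Export the rows of this file to the staging folder
    filename_df.to_csv("C:/results/" + filename, index=False)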
The script splits the dataframe by the filename column and then exports each group as a separate file. Currently it saves to a local folder, but you can easily save those files to cloud storage instead.
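One small caveat with the renaming step: `filename.replace("zip", "csv")` replaces every occurrence of "zip" in the name, not just the extension. A slightly more defensive variant might look like the sketch below. The helper name `staged_name` and the example file name are hypothetical, and it assumes the source names end in `.zip`.

```python
from pathlib import PurePath


def staged_name(source_name):
    """Map a source archive name to the csv name used in the staging folder."""
    # Swap only the extension, so a "zip" elsewhere in the name is left alone
    csv_name = PurePath(source_name).with_suffix(".csv").name
    # Same prefix change as in the script above
    return csv_name.replace("PUBLIC_DAILY", "UNIT_PUBLIC_DAILY")


example = staged_name("PUBLIC_DAILY_20190101.zip")
print(example)
```

Inside the loop, this would stand in for the two `replace` calls; the rest of the script stays the same.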
To test it, I built a quick workflow using public data (PBIX here). The source data is zip files on a public website, with a new zip file added daily. It is a relatively complex transformation, as you need to unzip each file, split it, delete some columns, etc. The first run is slow, as it processes all the files (62 files), but the next run will process just 1 file. You can simulate that by deleting some csv files in the staging folder: when you refresh again, only the deleted files will be processed again.
I think the main takeaway is that the Python and R integrations are amazing tools for implementing things that are not necessarily available in PowerBI, and you don’t need to be a programmer to use them; a serious search on Stack Overflow will get you started quickly.