Small Data And self service

Advance Geospatial analysis using location Parameter with Streamlit

This blog is a POC of something that I always wanted to have in a BI tool, and I tried Tableau, PowerBI and Data Studio, without success ( not interested in adding an invisible grid as a hack), The idea is extremely simple yet very powerful, retrieve data when you click on a map, you may think it should simple, it seems BI tool are good at retrieving data based on filter, but it is very hard to push a parameter from a map back to a source data.

Traditionally, if you want to have this kind of interactivity, you need to write code, to be honest the idea of writing javascript and learning how to deploy a web server was not very interesting for me, but luckily we have a new Option in Streamlit

Streamlit is a code first, web app platform using only Python, web page are generated behind the scene, and there are a lot of component where you need to write a minimum of code, and deployment is absolutely trivial using Streamlit Cloud, and because it is open source, you can deploy using alternative approach like Cloud Run, or Azure

I came across this component Streamlit-Folium recently, and it is magnificent work, when you click on a map, it does provide variable back on the last location clicked zoom, bounds etc, all for free, no code required !!!!

All I have done is copied the code from the source and built a SQL Query that take the last clicked item filter all the “cafe” in a radius of 500 m, the SQL Code is copied from this previous Blog

The Source Data is nearly half a Million, as you can imagine plotting a massive dataset just to see a small portion is a waste of computer resources.

here is the final results

Here an example of a SQL Query generated.

State management

I added the code here, again it was too easy to write as I nearly copied everything from the component sample code, the tricky part was how to update the value of a variable which was already declared, Streamlit has a brilliant solution using State Management, the solution is very simple

Assign a default value when the Streamlit run for the first time

if 'key' not in st.session_state:
    st.session_state.key = '( 153.024198,-27.467276)'
    st.session_state.key1 = [-27.467276, 153.024198]
    st.session_state.key2 = 16
point_clicked = st.session_state.key
location_ini  = st.session_state.key1
zoom_Start    = st.session_state.key2

Update the values when a user click on the map, the next run in the same session will use those new values

 st.session_state.key = point_clicked
       st.session_state.key1 = location_ini
       st.session_state.key2 = map_data['zoom']

Currently I don’t know how to stop Streamlit from redrawing the map, as I am only interested in updating the markers.

Database

it works with Any Database as long as it has a minimum support for GIS functions, Currently I used bigQuery BI Engine as I am familiar with it and to speak freely :), it is very cheap for this kind of workload, small Data and potentially a lot of concurrency 🙂

I tried PowerBI Datamart but it seems Python access is blocked , DuckDB don’t support GIS functions yet, I am sure you can reproduce the results using only SQL, but I did not bother.

ST_DWithin(ST_GeogPoint(lng,lat),params.center,params.maxdist_m)

Take Away

I think there is a third way between no code and only Code, Streamlit managed to create a new category, maybe simple code 🙂 having said that BI Vendor should up their games, Location Parameter should not be that hard to implement.

Delta lake with Python, Local Storage and DuckDB

TL;DR : I added a streamlit app here

a new experimental support for Writing Delta storage format using only Python was added recently and I thought it is a nice opportunity to play with it.

Apache Spark had a native support since day one, but personally the volume of data I deal with does not justify running Spark, hence the excitement when I learned we can finally just use Python.

instead of another hot take on how Delta Works, I just built a Python Notebook, that download files from a web site ( Australian Energy Market), create a delta table, then I use DuckDB and vega lite to show a chart, all you need to do is to define the Location of the Delta Table, I thought it maybe a useful example, all the code are located here

And I added a PowerBI report using the delta Connector

Some Observations

Currently DuckDB don’t support Delta natively, instead we first read the Delta table using pyarrow which DuckDB can read automatically, at this stage, I am not sure if DuckDB can push down filter selection or read the stats saved in the Log file, and currently, it seems only AWS S3 is supported.

CPM is Broken – The WHY

I spend a lot of time thinking about project controls, it is my love and passion. Specifically in the planning world, P6 is weapon we are tasked with us. While I firmly believe the way in which P6 is used on projects needs to be expanded (extended) with the use of Agile (JIRA), nothing tops the user base and acceptance P6 has in our business.

P6 is fundamentally, just a CPM platform. We have activities and we have relationships (and constraints). It would seem simple to use at first; however, in practice, it is anything but simple and the way we build schedules is not helping matters.

What has specifically changed over the past 15-20 years is the DATA SOURCE from which we derive our data. 20years ago, we would run progress reports from P6 (we still do, but have complicated matters). Today there is much better visibility into deliverables on a project. What follows are some example from a construction project; however I feel as if the issue is more visible in the engineering and study side of things. Additionally, what has changed is the global adoption of EPR tools (Enterprise Progress Tools – aka Ecosys).

And that is the problem – we now have 2 or 3 sources of data that everyone looks at – and everyone thinks are integrated. Everyone has clear visibility into document progress by way of our document control tools, everyone has clear visibility into our EPR reports, and everyone struggles to understand what is in the schedule and why it is different. EPR is forcing document progress to drive overall reporting, but P6 does not track documents. We do not build schedules from a bottoms up document world. We (should) build schedules based on overall workflows. I will demonstrate this in many examples that follow and what we have been forced to do inside P6 to try to solve this conundrum, and exactly why our solutions all fail because we have fundamentally deviated from the CPM roots that used to make schedules valuable.

This is what a schedule is meant to look like

In reality, below is what the schedule would look like today. Extended durations on everything, broken relationships, out of sequence work and worse than anything – NO CRITICAL PATH! This is not a good example – however I am confident you have dealt with a schedule like this and know this isn’t just an isolated issue anymore.

What is driving this insanity is the method (the data source) for where our progress comes from. The concept of what “site prep” meant in the original schedule was not the full site prep scope, but only the key site prep work required to commence the excavation.

What changed is the method and data source for where the progress is coming for our activities. Because we are not progressing the schedule, and instead are progressing an excel file to load into our EPR systems – to most likely drive a progress payment – the way data then flows into the schedule becomes meaningless.

There are many that would counter that we are simply running our progress system wrong, and that indeed the site prep should have been 100% earlier. And, likely true. In the above I am just being silly, but we have all seen this and we all know the issue.

So what is the solution?

The solution is to understand that when progress is measured outside the schedule and at a level that is at a finer level of detail than the schedule – the schedule becomes meaningless at certain levels. You therefore need to build schedules AT A HIGHER LEVEL.

Instead of tracking the work at a lower level, for the schedule to bring back meaning, we need to only represent the high level workflows and leave the details in our progress system.

Keep your detail in Excel or you EPR System

The project should retain clear visibility to the detail and the detailed % complete; however, the detail items should not flow into the schedule because they can not be scheduled in a proper CPM way.

Again what is driving this is the method of progress no longer aligning with the methods of schedule. I understand this is severely controversial but when you start looking at your schedules these days and see countless in progress parallel activities that are simply being driven by interfaces with EPR systems to mirror a % – and not a proper schedule workflow – something is broken in our business. We would like to think our progress are all nice CPM logically driven schedules, and obviously I am not suggesting all our schedules are broken. However, the more I see issues such as this popping up in our As Builts Schedules, the more I want our industry to understand we are simply building crap schedules that are not agile enough to capture the intricate data now available in our other systems.

This article is only meant to better illustrate the problem – the why CPM is broken. Solutions (using CPM) are available. In the above I have actually suggested one solution, but several different models that can retain alignment with detailed progress items and retain a workflow CPM model are possible. Hopefully I will dive into those in subsequent discussions.

Loading 1 Billion New York Taxi Dataset into Datamart

Was chatting with David Eldersveld and he suggested that he wants to run a competition using the famous New York Taxi Dataset with Datamart, long story short, I did endup publishing my attempt before he had the time to start the competition, my sincere apology.

The report is using my personal PPU instance, the data is located here , personally I wanted to exclusively use tools available in PowerBI out of the box, no Synapse nor Azure stuff or BigQuery, Just pure self service tools.

Initially I didn’t really believed that PowerQuery can download such a big volume of data, my first set back , PowerQuery can not read parquet files available in public url, but as usual Chris Webb has an excellent blog explaining the reason, and gave a workaround.

I added the code here, it does work with any PowerQuery in Dataflow, Datamart and PowerBI desktop, unfortunately Excel is not supported yet, when prompted for authentication use anonymous

The only section of the code that you should pay attention to is selecting the number of files to download by default it is 2, but you can increase it

Only when I was writing this blog, I noticed the files for 2022 are using a slightly different URL, ~~will update the code later.~~ Code Updated.

To reduce the database size, I had to split Datetime to date and time, low cardinality is good for performance too.

Loading into Datamart

I don’t know how much it took datamart to load the data, currently Query refresh history is broken, but I think it is more than 6 hours, I maybe wrong, but Datamart take a bit of time to generate the tables with Clustered Columnstore Index

Initially I loaded only 2 then 30 files just to see how Datamart behave and finally I went for 100 files, and it did work again to my surprise.

and the Lineage View of the report

Performance

The Performance is not bad at all considering, the data was loaded as it is and it not sorted, although the parquet files are organized by month, unfortunately there are some outlier in every file see for example, so you get overlapping segments.

You can check the database size on disk by running this Query

EXEC sp_spaceused

Optimisation

Pretty much the only optimisation you can do in Datamart is to pre sort the data before loading it, but when you have 1 billion rows saved in parquet files, sorting is a very expensive operation, but there are options I think.

Create another Datamart and load it from the “raw” Datamart and define incremental refresh which will create partitions, yes partitions should improve the performance.

Hybrid table in PowerBI Dataset where only the recent data is cached in Vertipaq and the history kept in Datamart as a Direct Query Mode.

Final thoughts

The Publish to web report is here, a very big missing piece is the option to append data to an existing Datamart, this will make adding new data without a full refresh extremely trivial, I know about incremental refresh, and I am sure a hack like this may work, but we want the real deal, Dataflow people hurry up 🙂

I notice something interesting because the price of PPU is fixed, I felt I can experiment without the fear of getting a massive bill, maybe reserved pricing is not a bad thing after all.

My first reaction when I saw datamart was, that it will be a big validation for PowerQuery, and it is, as Alex said, PowerQuery everything !!!

	Querying a Fabric La… on Writing to SQL Server using…
	Benjamin on Running DuckDB at 10 TB s…
	mim on Running DuckDB at 10 TB s…
	Benjamin on Running DuckDB at 10 TB s…
	Running DuckDB at 10… on Running DuckDB at 10 TB s…

State management

Database

Take Away

Share this:

Some Observations

Share this:

Share this:

Loading into Datamart

Performance

Optimisation

Final thoughts

Share this: