Does a Single-Node Python Notebook Scale?

I was giving a presentation about Microsoft Fabric Python notebooks and someone asked if they scale. The short answer is yes. You can download the notebook and try it for yourself. For the long answer, keep reading.

The dataset I used contains the last seven years of Australian electricity market data. Although it’s public, the government agency only keeps archives for two months. I had saved the data during a previous job and kept it around as a hobby. It’s a great real-world workload with realistic data distribution. The CSV files are messy. Technically, they’re more like reports, with different sections stacked on top of each other and varying numbers of columns. That’s often what you encounter in real projects, not the neat, well-structured datasets you see in demos.

For example, being able to read a CSV file with a variable number of columns is a critical feature. Yet this rarely gets mentioned in synthetic benchmarks.

To create a clean environment for testing, I copied the data from one Lakehouse in onelake to a brand-new workspace. I could have used a shortcut, but I wanted to start from scratch. The binary copy took just 2 minutes, with no transformations, which gives a throughput of 1.4 GB per second. That’s pretty good for a 150 GB uncompressed dataset.

The default configuration for Fabric Python notebooks includes 2 cores and 16 GB of RAM. That’s roughly the same size as Google Colab. But you can easily increase the number of cores to 4, 8, 16, 32, or even 64. At 64 cores, you get nearly half a terabyte of RAM. That’s a serious machine.

The job itself is simple. Ingest and process the data using several Python engines, then save the result as a Delta table. The raw data has around one billion records, and you end up extracting 311 million. If your engine cannot push down filters to the CSV level, you’re going to have a hard time. The trick here is not to be fast, but to avoid doing unnecessary work.

I used the following engines: DuckDB, Daft, Polars, CHDB (basically ClickHouse for Python), DataFusion, PyArrow, and Pandas. Technically, Pandas is not ideal here because you can’t pass a list of files without using a loop. But I had used it for nearly seven years, so I kept it for sentimental reasons.

I’m fairly confident using all of these engines except PyArrow and DataFusion. Their syntax is very intimidating, and I probably missed some configuration settings. I couldn’t get them to use more than a single thread, so CPU utilization stayed very low.

Results

Polars support streaming writes, but doesn’t allow exporting a record batch. This means the Delta writer has to load all data into memory. It works fine with 32 cores and 256 GB of RAM, but you’ll run into out-of-memory issues with 16 cores and below.
Chdb 3.5 added a user friendly way to export arrow record batch, it is the first release so still some bugs, for example got an error with 2 cores, I am sure it will get fixed soon

Daft is the only engine that supports native writing to Delta. It uses the Deltalake package only to commit the transaction log. The actual Parquet write is handled by the engine itself.
DuckDB preserves the sort order of the input files. it is trick to appeal to Pandas users who care about index ordering. For best performance though, you should turn this off. (Honestly, I think it should be off by default)
DuckDB exports Arrow tables by default. You need to explicitly use record_batch(). I’ve lost count of how many out-of-memory issues I’ve solved just by changing the export format.
Overall, DuckDB delivered the best performance, especially considering it’s not even writing Parquet files directly. It simply streams Arrow data to the writer.

When I first ran the test with DuckDB and saw it finish in under 4 minutes, I thought I made a mistake. It wasn’t until CHDB finished in under 5 minutes that I realized these engines are seriously impressive.

We’re talking about 625 MB per second for processing and ingestion on a single node.

Another key observation: using DuckDB and Daft, even with just 16 GB of RAM, the data was processed correctly. It took about an hour, but it worked without errors, that’s 10 X the size of the RAM

To verify correctness, I simply checked the total sum of a column and the number of records. Everything checked out.

Choosing the Right Size

Now that I know these notebooks work, choosing the right size becomes more nuanced. Surprisingly, the cheapest configuration in term of capacity usage was the 2 cores 🙂

In practice though, using more compute makes sense. A single node has no concept of fault tolerance. If something goes wrong, you need to restart the entire job. Personally, I’m not a fan of long-running jobs. Too many things can go wrong. I used 2 cores just to make a point. That said, using 64 cores doesn’t make much sense either. You’re doubling your compute cost to save 30 seconds.

One more thing: while Daft scales down very well, it doesn’t seem to scale up as efficiently as I had hoped. Ideally, you want a flat performance curve. The total amount of work is fixed, so adding more cores should just reduce execution time. I know the reality is more complex. It’s not easy to keep all processors busy at higher scales.

What This Means

As you may have guessed, I’m a big fan of single-node setups and DuckDB. But I don’t want just one engine to dominate every benchmark or deliver results that no other engine in its class can match. That’s why I was genuinely excited by Daft’s performance. I’m also looking forward to seeing Polars and CHDB add Arrow streaming support.

To be honest, I look at the world from a storage perspective. More competition between engines is a good thing. All of these tools are open source under the MIT license. Most of them can write to Delta in one form or another. and as a user you can choose any engine you want, I think that’s a fantastic thing to have.

So yes, Python notebooks do scale. The experience is far from being perfect, and there’s still room for improvement. But scalability is not something you should worry about, unless of course you are really doing real big data, then you go distributed 🙂 DWH and Spark are robust options in Fabric.

Edit : tested with chdb 3.5 which has support for arrow streaming

Stress Testing Iceberg shortcut in Onelake

TL;DR: Shared a notebook showing the results of Iceberg metadata conversion to Delta in Onelake.

I’ve been following the evolution of Iceberg shortcuts to OneLake and I’m genuinely impressed with how the engineering team has invested so much energy into making it more robust, it is a good idea to read the documentation.

Essentially, XTable is used behind the scenes. Think of it as a translator for your open table format. Instead of requiring you to convert data from one format (like Iceberg) to another (like Delta) just to query them together, XTable allows you to access and interact with tables in different formats as if they were a single, unified table within OneLake—all without user intervention.

To truly put this to the test, I recently ran an experiment in a real production environment using my paid tenant—no sandboxes here! Here’s the logic from the Python notebook:

Accessing data from an Iceberg table using a shortcut (sourced from Snowflake; the data can be stored anywhere—Azure, S3, GCP, or OneLake, You can use BigQuery too or any Iceberg writer).
Inserting arbitrary data and performing delete operations.
Counting the total rows using Snowflake.
Counting the total rows using Fabric notebook as a Delta Table.
Recording the record counts in a results table to track and visualize the comparison over time.

The results were quite awesome. I plotted the total record counts from both the Iceberg and Delta perspectives using two distinct colors and observed a perfect match. This confirms the seamless interoperability provided by XTable.

Lesson learned:

See the code snippet below for inserting data in Snowflake:

snow.execute(f'insert into ONELAKE.ICEBERG.scada select * from ONELAKE.AEMO.SCADARAW limit {limit};')
snow.execute('delete from ONELAKE.ICEBERG.scada where INITIALMW = 0')
snow.execute("SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('ONELAKE.iceberg.scada');")

In rare cases—especially when running multiple transactions at the same time—Snowflake may not instantly generate the metadata. To be 100% sure, run this SQL statement

SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('Table_name')

to force the engine to write new Iceberg metadata. It’s an annoying aspect of Iceberg: every commit generates three files. That’s a bit excessive. Some engines prefer to group multiple commits to reduce the size of the metadata. Again, it’s rare—but it does happen.

A Non-scientific Benchmark of Text-to-SQL using Small Language Models

TL;DR ; This post shares a quick experiment I ran to test how effective (or ineffective) small language models are at generating SQL from natural language questions when provided with a well-defined semantic model. It is purely an intellectual curiosity; I don’t think we are there yet. Cloud Hosted LLMs are simply too good, efficient, and cost-effective.

You can download the notebook and the semantic model here.

⚠️ This is not a scientific benchmark.
I’m not claiming expertise here—just exploring what small-scale models can do to gain an intuition for how they work. Large language models use so much computational power that it’s unclear whether their performance reflects true intelligence or simply brute force. Small-scale models, however, don’t face this issue, making their capabilities easier to interpret.

Introduction

I used Ollama to serve models locally on my laptop and DuckDB for running the SQL queries. DuckDB is just for convenience—you could use any SQL-compatible database

For a start I used Qwen3, 4B, 8B and 14B, it is open weight and I heard good reviews considering it’s size, but the same approach will works with any models, notice I turn off thinking mode in Qwen.

To be honest, I tested other small models too, and they didn’t work as well. For example, they couldn’t detect my graphics card. I probably missed some configuration, but since I don’t know enough, I prefer to leave it at that.

0. Semantic Model Prompt

A semantic_model.txt file acts as the system prompt. This guides the model to produce more accurate and structured SQL outputs , the semantic model itself was generated by another LLM, it does include non trivial verified SQL queries ,sample values, relationships , measures etc, custom instructions

“no_think” is to turn off the thinking mode in Qwen3

1. Setup and Environment

The notebook expects an Ollama instance running locally, with the desired models (like qwen3:8b, qwen3:4b) already pulled using ollama run <model_name>.

2. How It Works

Two main functions handle the process:

get_ollama_response:
This function takes your natural language question, combines it with the semantic prompt, sends it to the Local Ollama server, and returns the generated SQL.
execute_sql_with_retry:
It tries to run the SQL in DuckDB. If the query fails (due to syntax or binding errors), it asks the model to fix it and retries—until it either works or hits a retry limit.

In short, you type a question, the model responds with SQL, and if it fails, the notebook tries to self-correct and rerun.

3. Data Preparation

The data was generated using a Python script with a scale factor (e.g., 0.1). If the corresponding DuckDB file didn’t exist, the script created one and populated it with the needed tables. Again, the idea was to keep things lightweight and portable.

Figure: Example semantic model

4. Testing Questions

Here are some of the questions I tested, some are simple questions others a bit harder and require more efforts from the LLM

“total sales”
“return rate”
“Identify the top 10 item categories with the highest total return amount from customers born in ‘USA’ who made returns in 2001.”
“customer age group with the worst return rate?”
“return rate per year”
“any days with unusual return rate?, use fancy statistics”

Each question was sent to different models (qwen3:14b, qwen3:8b, qwen3:4b) to compare their performance. I also used %%time to measure how long each model took to respond, some questions were already in the semantic model, verified query answers, so in a sense it is a test too to see how the model stick with the instruction

5. What Came Out

For every model and question, I recorded:

The original question
Any error messages and retries
The final result (or failure)
The final SQL used
Time taken per question and total time per model

6. Observations

Question 6 : about detecting unusual return rates with “fancy statistics “stood out:

8B model:
Generated clean SQL using CTEs and followed a star-schema-friendly join strategy. No retries needed.
14B model:
Tried using Z-scores, but incorrectly joined two fact tables directly. This goes against explicit instruction in the semantic model.
4B model:
Couldn’t handle the query at all. It hit the retry limit without producing usable SQL.

By the way, the scariest part isn’t when the SQL query fails to run, it’s when it runs, appears correct, but silently returns incorrect results

Another behavior which I like very much, I asked a question about customers born in the ‘USA’, the model was clever enough to leverage the sample values and use ‘UNITED STATES’ instead in the filter.

Execution Times

14B: 11 minutes 35 seconds
8B: 7 minutes 31 seconds
4B: 4 minutes 34 seconds

Tested on a laptop with 8 cores, 32 GB RAM, and 4 GB VRAM (Nvidia RTX A2000), the data is very small all the time is spent on getting the SQL , so although the accuracy is not too bad, we are far away from interactive use case using just laptop hardware.

7- Testing with simpler questions Only

I redone the test with 4B but using only simpler questions :

questions = [
          'total sales',
          'return rate',
          "Identify the top 10 item categories with the highest total return amount from customers born in 'USA' who made returns in 2001.",
          'return rate per year',
          'most sold items',
          ]
ask_question(questions,'qwen3:4b')

the 5 questions took less than a minutes, that promising !!!

Closing Thought

instead of a general purpose SLM, maybe a coding and sql fine tuned model with 4B size will be an interesting proposition, we live in an interesting time

Vibe Coding a Simple SQL orchestrator in a Fabric Python Notebook

Note: The blog and especially the code were written with the assistance of an LLM.

TL;DR

I built a simple Fabric Python notebook to orchestrate sequential SQL transformation tasks in OneLake using DuckDB and delta-rs. It handles task order, stops on failure, fetches SQL from external sources (like GitHub or a Onelake folder), manages Delta Lake writes, and uses Arrow recordbacth for efficient data transfer, even for large datasets. This approach helps separate SQL logic from Python code and simulates external table behavior in DuckDB. Check out the code on GitHub: https://github.com/djouallah/duckrun

pip install duckrun

Introduction

Inspired by tools like dbt and sqlmesh, I started thinking about building a simple SQL orchestrator directly within a Python notebook. I was showing a colleague a Fabric notebook doing a non-trivial transformation, and although it worked perfectly, I noticed that the SQL logic and Python code were mixed together – clear to me, but spaghetti code to anyone else. With Fabric’s release of the user data function, I saw the perfect opportunity to restructure my workflow:

Data ingestion using a User-Defined Function (UDF), which runs in a separate workspace.
Data transformation in another workspace, reading data from the ingestion workspace as read-only.
All transformations are done in pure SQL, there 8 tables, every table has a sql file, I used DuckDB, but feel free to use anything else that understands SQL and output arrow (datafusion, chdb, etc).
Built Python code to orchestrate the transformation steps.
PowerBI reports are in another workspace

I think this is much easier to present 🙂

I did try yato, which is a very interesting orchestrator, but it does not support parquet materialization

How It Works

The logic is pretty simple, inspired by the need for reliable steps:

Your Task List: You provide the function with a list (tasks_list). Each item has table_name (same SQL filename, table_name.sql) and how to materilize the data in OneLake (‘append’ , ‘overwrite’,ignore and None)
Going Down the List: The function loops through your tasks_list, taking one task at a time.
Checking Progress: It keeps track of whether the last task worked out using a flag (like previous_task_successful). This flag starts optimistically as True.
Do or Don’t: Before tackling the current task, it checks that flag.

If the flag is True, it retrieves the table_name and mode from the current task entry and passes them to another function, likely called run_sql. This function performs the actual work of running your transformation SQL and writing to OneLake.
If the flag is False, it knows something went wrong earlier, prints a quick “skipping” message, and importantly, uses a break statement to exit the loop immediately. No more tasks are run after a failure.

Updating the Status: After run_sql finishes, run_sql_sequence checks if run_sql returned 1 (our signal for success). If it returns 1, the previous_task_successful flag stays True. If not, the flag flips to False.
Wrap Up: When the loop is done (either having completed all tasks or broken early), it prints a final message letting you know if everything went smoothly or if there was a hiccup.

The run_sql function is the workhorse called by run_sql_sequence. It’s responsible for fetching your actual transformation SQL (that SELECT … FROM raw_table). A neat part here is that your SQL files don’t have to live right next to your notebook; they can be stored anywhere accessible, like a GitHub repository, and the run_sql function can fetch them. It then sends the SQL to your DuckDB connection and handles the writing part to your target OneLake table using write_deltalake for those specific modes. It also includes basic error checks built in for file reading, network stuff, and database errors, returning 1 if it succeeds and something else if it doesn’t.

You’ll notice the line con.sql(f””” CREATE or replace SECRET onelake … “””) inside run_sql; this is intentionally placed there to ensure a fresh access token for OneLake is obtained with every call, as these tokens typically have a limited validity period (around 1 hour), keeping your connection authorized throughout the sequence.

When using the overwrite mode, you might notice a line that drops DuckDB view (con.sql(f’drop VIEW if exists {table_name}’)). This is done because while DuckDB can query the latest state of the Delta Lake files, the view definition in the current session needs to be refreshed after the underlying data is completely replaced by write_deltalake in overwrite mode. Dropping and recreating the view ensures that subsequent queries against this view name correctly point to the newly overwritten data.

The reason we do this kind of hacks is, duckdb does not support external table yet, so we are just simulating the same behavior by combining duckdb and delta rs, spark obviousely has native support

Handling Materialization in Python

One design choice here is handling the materialization strategy (whether to overwrite or append data) within the Python code (run_sql function) rather than embedding that logic directly into the SQL scripts.

Why do it this way?

Consider a table like summary. You might have a nightly job that completely recalculates and overwrites the summary table, but an intraday job that just appends the latest data. If the overwrite or append command was inside the SQL script itself, you’d need two separate SQL files for the exact same transformation logic – one with CREATE OR REPLACE TABLE … AS SELECT … and another with INSERT INTO … SELECT ….

By keeping the materialization mode in the Python run_sql function and passing it to write_deltalake, you can use the same core SQL transformation script for the summary table in both your nightly and intraday pipelines. The Python code dictates how the results of that SQL query are written to the Delta Lake table in OneLake. This keeps your SQL scripts cleaner, more focused on the transformation logic itself, and allows for greater flexibility in how you materialize the results depending on the context of your pipeline run.

Efficient Data Transfer with Arrow Record batch

A key efficiency point is how data moves from DuckDB to Delta Lake. When DuckDB executes the transformation SQL, it returns the results as an Apache Arrow RecordBatch. Arrow’s columnar format is highly efficient for analytical processing. Since both DuckDB and the deltalake library understand Arrow, data transfers with minimal overhead. This “zero-copy” capability is especially powerful for handling datasets larger than your notebook’s available RAM, allowing write_deltalake to process and write data efficiently without loading everything into memory at once.

Example:

you pass Onelake location, schema and the number of files before doing any compaction

first it will load all the existing Delta table

Here’s an example showing how you might define and run different task lists for different scenarios:

sql_tasks_to_run_nightly = [
    ['price', 'append'],
    ['scada', 'append'],
    ['duid', 'ignore'],
    ['summary', 'overwrite'], # Overwrite summary nightly
    ['calendar', 'ignore'],
    ['mstdatetime', 'ignore'],
]

sql_tasks_to_intraday = [
    ['price_today', 'append'],
    ['scada_today', 'append'],
    ['duid', 'ignore'],
    ['summary', 'append'] # Append to summary intraday using the *same* SQL script
]

You can then use Python logic to decide which pipeline to run based on conditions, like the time of day:

start = time(4, 0)
end = time(5, 30)

if start <= now_brisbane <= end:
    run_sql_sequence(sql_tasks_to_run_nightly)

Here’s an example of an error I encountered during a run, it will automatically stop the remaining tasks:

Attempting to run SQL for table: price_today with mode: append
Running in mode: append for table: price_today
Error writing to delta table price_today in mode append: Parser Error: read_csv cannot take NULL list as parameter
Error updating data or creating view in append mode for price_today: Parser Error: read_csv cannot take NULL list as parameter
Failed to run SQL for table: price_today. Stopping sequence.
One or more SQL tasks failed.

here is some screenshots from actual runs

as it is a delta table, I can use SQL endpoints to get some stats

For example the table scada has nearly 300 Million rows, the raw data is around 1 billion of gz.csv

It took nearly 50 minutes to process using 2 cpu and 16 GB of RAM, notice although arrow is supposed to be zero copy, writing parquet directly from Duckdb is substantially faster !!! but anyway, the fact it works at all is a miracle 🙂

in the summary table we remove empty rows and other business logic, which reduce the total size to 119 Million rows.

here is an example report using PowerBI direct lake mode, basically reading delta directly from storage

In this run, it did detect that the the night batch table has changed

Conclusion

To be clear, I am not suggesting that I did anything novel, it is a very naive orchestrator, but the point is I could not have done it before, somehow the combination of open table table format, robust query engines and an easy to use platform to run it make it possible and for that’s progress !!!

I am very bad at remembering python libraries syntax but with those coding assistants, I can just focus on the business logic and let the machine do the coding. I think that’s good news for business users.

	Querying a Fabric La… on Writing to SQL Server using…
	Benjamin on Running DuckDB at 10 TB s…
	mim on Running DuckDB at 10 TB s…
	Benjamin on Running DuckDB at 10 TB s…
	Running DuckDB at 10… on Running DuckDB at 10 TB s…

Results

Choosing the Right Size

What This Means

Share this:

Share this:

Introduction

0. Semantic Model Prompt

1. Setup and Environment

2. How It Works

3. Data Preparation

4. Testing Questions

5. What Came Out

6. Observations

Execution Times

7- Testing with simpler questions Only

Closing Thought

Share this:

TL;DR

Introduction

How It Works

Handling Materialization in Python

Efficient Data Transfer with Arrow Record batch

here is some screenshots from actual runs

Conclusion

Share this: