Writing to SQL Server using DuckDB

I don’t know much about SQL Server. The closest I ever got to it was having read-only access to a database. I remember 10 years ago we had a use case for a database, and IT decided, for some reason, that we were not allowed to install SQL Server Express, even though it was free and a Microsoft product. To this day it is still a mystery to me. Anyway, at that time I was introduced to PowerPivot and PowerQuery, and the rest was history.

Although I knew very little about SQL Server, I knew that SQL Server users are in love with the product. I worked with a smart data engineer who had a very clear world view:

I used SQL Server for years. It is rock solid. I am not interested in any new tech.

At the time, I thought he lacked imagination. Now I think I see his point.

When SQL Server was added to Fabric, I was like, oh, that’s interesting. But I don’t really do operational workloads anyway, so I kind of ignored it.

Initially I tried to make it fit my workflow, which is basically developing Python notebooks using DuckDB or Polars (depending on my mood) inside VSCode with GitHub Copilot, then deploying them later into Fabric. Of course you can insert a dataframe into SQL Server, but it did not really click for me at first. To be clear, I am not saying it is not possible; it just did not feel natural in my workflow (messing with pyodbc is not fun).

By the way, the SQL extension inside VSCode is awesome.

A week ago I was browsing the DuckDB community extensions and came across the mssql extension. And boy !!! that was an emotional rollercoaster (the last time I had this experience was when I first used Tabular Editor, a very long time ago).

You just attach a SQL Server database using either a username and password or just a token. That’s it. The rest is managed by the extension, and suddenly everything made sense to me!!!

import time
import duckdb
from azure.identity import DefaultAzureCredential

# PLATFORM, host, pr (port), db, user and pw are assumed to be defined elsewhere
conn = duckdb.connect()
conn.sql("INSTALL mssql FROM community; LOAD mssql;")

if PLATFORM == 'fabric':
    # inside a Fabric notebook you could also use notebookutils.credentials.getToken("sql")
    token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token
    for attempt in range(3):
        try:
            conn.sql(f"""
                ATTACH IF NOT EXISTS
                'Server={host};Database={db}'
                AS db (TYPE mssql, ACCESS_TOKEN '{token}')
            """)
            break
        except Exception as e:
            if attempt < 2:
                print(f"Attempt {attempt+1} failed, waiting 60s for serverless wake-up...")
                time.sleep(60)
            else:
                raise e
else:
    conn.sql(f"""
        ATTACH OR REPLACE
        'Server={host},{pr};Database={db};User Id={user};Password={pw};Encrypt=yes'
        AS db (TYPE mssql)
    """)

conn.sql("SET mssql_query_timeout = 6000; SET mssql_ctas_drop_on_failure = true;")
print(f"Connected to SQL Server via {PLATFORM}")

Again, I know there are other, more efficient ways to load data, but if I have a small CSV that I processed using Python, nothing compares to the simplicity of a dataframe. Here are some things I learned in that week. I know they are obvious to anyone who has used SQL Server for a while, but for me it is like I was living under a rock all these years 🙂
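For example, once the database is attached as db (as in the snippet above), pushing a small dataframe into SQL Server is just a CREATE TABLE … AS SELECT against the attached catalog. This is only a minimal sketch: the file name, schema and table name are placeholders, and I am assuming CTAS into the attached catalog is what the extension supports, which the mssql_ctas_drop_on_failure setting above suggests.

import pandas as pd

# a small CSV cleaned up in Python first (hypothetical file name)
df = pd.read_csv("small_file.csv")

# expose the dataframe to DuckDB, then let the extension bulk-load it into SQL Server
conn.register("staging_df", df)
conn.sql("CREATE TABLE db.dbo.staging AS SELECT * FROM staging_df")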

If you run SHOW ALL TABLES in DuckDB, you get something like this:

TDS and bulk insertion

You don’t need ODBC. You can talk to SQL Server directly using TDS, which is the native protocol it understands. There is also something called BCP, which basically lets you batch load data efficiently instead of pushing rows one by one. Under the hood it streams the data in chunks, and the performance is actually quite decent. It is not some hacky workaround. It feels like you are speaking SQL Server’s own language, and that changes the whole experience.

SQL Server is not only for OLTP

Turns out people use SQL Server for analytics too, with a columnar table format (clustered columnstore indexes).

CREATE CLUSTERED COLUMNSTORE INDEX cci_{table}
ON {schema}.{table}
ORDER ({order_col});

I tested a typical analytical benchmark and more or less it performs like a modern single node data warehouse.

Accelerating Analytics for row store

Basically, there is a batch mode where the engine processes row-based tables in batches instead of strictly row by row. The engine can apply vectorized operations, better CPU cache usage, and smarter memory management even on traditional rowstore tables. It is something DuckDB added with great fanfare to accelerate PostgreSQL heap tables. I was a bit surprised that SQL Server already had it for years.

RLS/CLS for untrusted engines

If you have a CLS or RLS Lakehouse table and you want to query it from an untrusted engine, let’s say DuckDB running on your laptop, today you can’t, for a good reason: direct storage access is blocked. This extension solves it, since you query the SQL endpoint itself.
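A minimal sketch of what that looks like, reusing the ATTACH pattern from above (the endpoint host, database and table names are placeholders; the token is acquired exactly as in the Fabric branch of the connection code):

token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token
conn.sql(f"""
    ATTACH 'Server=<your-endpoint>.datawarehouse.fabric.microsoft.com;Database=<lakehouse>'
    AS lake (TYPE mssql, ACCESS_TOKEN '{token}')
""")
# RLS/CLS is enforced by the SQL endpoint, not by DuckDB
conn.sql("SELECT * FROM lake.dbo.secured_table LIMIT 10").show()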

Most of the fancy things were already invented

Basically, many of the things people think of as next-generation technologies were already implemented decades ago: SQL control flow, temp tables, complex transactions, fine-grained security, workload isolation. It was all already there.

I think the real takeaway for me is that user experience is as important as the SQL engine itself, if not more, and when a group of very smart people likes something, there is probably a very good reason for it.

Vibe Coding a Simple SQL orchestrator in a Fabric Python Notebook

Note: The blog and especially the code were written with the assistance of an LLM.

TL;DR

I built a simple Fabric Python notebook to orchestrate sequential SQL transformation tasks in OneLake using DuckDB and delta-rs. It handles task order, stops on failure, fetches SQL from external sources (like GitHub or a OneLake folder), manages Delta Lake writes, and uses Arrow record batches for efficient data transfer, even for large datasets. This approach helps separate SQL logic from Python code and simulates external table behavior in DuckDB. Check out the code on GitHub: https://github.com/djouallah/duckrun

pip install duckrun

Introduction

Inspired by tools like dbt and sqlmesh, I started thinking about building a simple SQL orchestrator directly within a Python notebook. I was showing a colleague a Fabric notebook doing a non-trivial transformation, and although it worked perfectly, I noticed that the SQL logic and Python code were mixed together – clear to me, but spaghetti code to anyone else. With Fabric’s release of user data functions, I saw the perfect opportunity to restructure my workflow:

  • Data ingestion using a User-Defined Function (UDF), which runs in a separate workspace.
  • Data transformation in another workspace, reading data from the ingestion workspace as read-only.
  • All transformations are done in pure SQL; there are 8 tables, and every table has its own SQL file. I used DuckDB, but feel free to use anything else that understands SQL and outputs Arrow (DataFusion, chDB, etc.).
  • Built Python code to orchestrate the transformation steps.
  • PowerBI reports are in another workspace

I think this is much easier to present 🙂

I did try yato, which is a very interesting orchestrator, but it does not support parquet materialization

How It Works

The logic is pretty simple, inspired by the need for reliable steps (sketched in code right after this list):

  1. Your Task List: You provide the function with a list (tasks_list). Each item has a table_name (matching the SQL filename, table_name.sql) and how to materialize the data in OneLake ('append', 'overwrite', 'ignore' or None).
  2. Going Down the List: The function loops through your tasks_list, taking one task at a time.
  3. Checking Progress: It keeps track of whether the last task worked out using a flag (like previous_task_successful). This flag starts optimistically as True.
  4. Do or Don’t: Before tackling the current task, it checks that flag.
  • If the flag is True, it retrieves the table_name and mode from the current task entry and passes them to another function, run_sql. This function performs the actual work of running your transformation SQL and writing to OneLake.
  • If the flag is False, it knows something went wrong earlier, prints a quick “skipping” message, and importantly, uses a break statement to exit the loop immediately. No more tasks are run after a failure.
  5. Updating the Status: After run_sql finishes, run_sql_sequence checks if run_sql returned 1 (our signal for success). If it returns 1, the previous_task_successful flag stays True. If not, the flag flips to False.
  6. Wrap Up: When the loop is done (either having completed all tasks or broken early), it prints a final message letting you know if everything went smoothly or if there was a hiccup.
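Putting the steps above together, here is a minimal sketch of the sequencing logic (not the actual duckrun code; run_sql is the worker function described next, and returning 1 is the success signal):

def run_sql_sequence(tasks_list):
    previous_task_successful = True
    for table_name, mode in tasks_list:
        if not previous_task_successful:
            # something went wrong earlier: report it and stop the whole sequence
            print(f"Skipping {table_name}: a previous task failed.")
            break
        if run_sql(table_name, mode) == 1:
            previous_task_successful = True
        else:
            print(f"Failed to run SQL for table: {table_name}. Stopping sequence.")
            previous_task_successful = False
    if previous_task_successful:
        print("All SQL tasks completed successfully.")
    else:
        print("One or more SQL tasks failed.")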

The run_sql function is the workhorse called by run_sql_sequence. It’s responsible for fetching your actual transformation SQL (that SELECT … FROM raw_table). A neat part here is that your SQL files don’t have to live right next to your notebook; they can be stored anywhere accessible, like a GitHub repository, and the run_sql function can fetch them. It then sends the SQL to your DuckDB connection and handles the writing part to your target OneLake table using write_deltalake for those specific modes. It also includes basic error checks built in for file reading, network stuff, and database errors, returning 1 if it succeeds and something else if it doesn’t.
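A sketch of that worker function, with the SQL fetched from a raw GitHub URL (sql_base_url, onelake_path and the con connection are placeholders, and the error handling is simplified compared to the real thing):

import requests
from deltalake import write_deltalake

def run_sql(table_name, mode):
    try:
        # fetch the transformation SQL, e.g. <sql_base_url>/<table_name>.sql on GitHub
        sql_text = requests.get(f"{sql_base_url}/{table_name}.sql", timeout=30).text
        result = con.execute(sql_text).fetch_arrow_table()
        if mode in ("append", "overwrite"):
            write_deltalake(f"{onelake_path}/{table_name}", result, mode=mode)
        return 1        # success signal checked by run_sql_sequence
    except Exception as e:
        print(f"Error running SQL for table {table_name} in mode {mode}: {e}")
        return 0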

You’ll notice the line con.sql(f""" CREATE OR REPLACE SECRET onelake … """) inside run_sql; this is intentionally placed there to ensure a fresh access token for OneLake is obtained with every call, as these tokens typically have a limited validity period (around 1 hour), keeping your connection authorized throughout the sequence.
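For reference, the secret creation could look something like this. This is a sketch only: the CREDENTIAL_CHAIN provider is one way to let DuckDB’s azure extension pick up the ambient credentials, while my notebook passes a freshly acquired token instead.

con.sql("""
    CREATE OR REPLACE SECRET onelake (
        TYPE AZURE,
        PROVIDER CREDENTIAL_CHAIN,
        ACCOUNT_NAME 'onelake'
    )
""")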

When using the overwrite mode, you might notice a line that drops the DuckDB view (con.sql(f'DROP VIEW IF EXISTS {table_name}')). This is done because, while DuckDB can query the latest state of the Delta Lake files, the view definition in the current session needs to be refreshed after the underlying data is completely replaced by write_deltalake in overwrite mode. Dropping and recreating the view ensures that subsequent queries against this view name correctly point to the newly overwritten data.

The reason we do this kind of hack is that DuckDB does not support external tables yet, so we are just simulating the same behavior by combining DuckDB and delta-rs; Spark obviously has native support.
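A sketch of that refresh, using delta_scan from DuckDB’s delta extension (the OneLake path variable is a placeholder):

con.sql(f"DROP VIEW IF EXISTS {table_name}")
con.sql(f"""
    CREATE VIEW {table_name} AS
    SELECT * FROM delta_scan('{onelake_path}/{table_name}')
""")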

Handling Materialization in Python

One design choice here is handling the materialization strategy (whether to overwrite or append data) within the Python code (run_sql function) rather than embedding that logic directly into the SQL scripts.

Why do it this way?

Consider a table like summary. You might have a nightly job that completely recalculates and overwrites the summary table, but an intraday job that just appends the latest data. If the overwrite or append command was inside the SQL script itself, you’d need two separate SQL files for the exact same transformation logic – one with CREATE OR REPLACE TABLE … AS SELECT … and another with INSERT INTO … SELECT ….

By keeping the materialization mode in the Python run_sql function and passing it to write_deltalake, you can use the same core SQL transformation script for the summary table in both your nightly and intraday pipelines. The Python code dictates how the results of that SQL query are written to the Delta Lake table in OneLake. This keeps your SQL scripts cleaner, more focused on the transformation logic itself, and allows for greater flexibility in how you materialize the results depending on the context of your pipeline run.
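Concretely, it boils down to something like this (names are illustrative, not the actual duckrun code; summary.sql is the shared transformation script and mode comes from the task list):

from pathlib import Path
from deltalake import write_deltalake

sql_text = Path("sql/summary.sql").read_text()      # hypothetical local copy of the script
result = con.execute(sql_text).fetch_arrow_table()
# mode is "overwrite" in the nightly task list and "append" in the intraday one
write_deltalake(f"{onelake_path}/summary", result, mode=mode)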

Efficient Data Transfer with Arrow Record Batches

A key efficiency point is how data moves from DuckDB to Delta Lake. When DuckDB executes the transformation SQL, it returns the results as an Apache Arrow RecordBatch. Arrow’s columnar format is highly efficient for analytical processing. Since both DuckDB and the deltalake library understand Arrow, data transfers with minimal overhead. This “zero-copy” capability is especially powerful for handling datasets larger than your notebook’s available RAM, allowing write_deltalake to process and write data efficiently without loading everything into memory at once.
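A sketch of the streaming variant (fetch_record_batch returns a pyarrow RecordBatchReader that write_deltalake can consume batch by batch; the query text, path and table name are placeholders):

from deltalake import write_deltalake

reader = con.execute(transform_sql).fetch_record_batch()   # batches are produced lazily
write_deltalake(f"{onelake_path}/{table_name}", reader, mode="append")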

Example:

You pass the OneLake location, the schema, and the number of files before doing any compaction.

First, it will load all the existing Delta tables.

Here’s an example showing how you might define and run different task lists for different scenarios:

sql_tasks_to_run_nightly = [
    ['price', 'append'],
    ['scada', 'append'],
    ['duid', 'ignore'],
    ['summary', 'overwrite'], # Overwrite summary nightly
    ['calendar', 'ignore'],
    ['mstdatetime', 'ignore'],
]

sql_tasks_to_intraday = [
    ['price_today', 'append'],
    ['scada_today', 'append'],
    ['duid', 'ignore'],
    ['summary', 'append'] # Append to summary intraday using the *same* SQL script
]



You can then use Python logic to decide which pipeline to run based on conditions, like the time of day:

from datetime import datetime, time
from zoneinfo import ZoneInfo

# current local time in Brisbane; the schedule below is expressed in local time
now_brisbane = datetime.now(ZoneInfo("Australia/Brisbane")).time()

start = time(4, 0)
end = time(5, 30)

if start <= now_brisbane <= end:
    run_sql_sequence(sql_tasks_to_run_nightly)

Here’s an example of an error I encountered during a run; it automatically stops the remaining tasks:

Attempting to run SQL for table: price_today with mode: append
Running in mode: append for table: price_today
Error writing to delta table price_today in mode append: Parser Error: read_csv cannot take NULL list as parameter
Error updating data or creating view in append mode for price_today: Parser Error: read_csv cannot take NULL list as parameter
Failed to run SQL for table: price_today. Stopping sequence.
One or more SQL tasks failed.

Here are some screenshots from actual runs.

As it is a Delta table, I can use the SQL endpoint to get some stats.

For example, the table scada has nearly 300 million rows; the raw data is around 1 billion rows of gz.csv files.

It took nearly 50 minutes to process using 2 CPUs and 16 GB of RAM. Notice that although Arrow is supposed to be zero-copy, writing Parquet directly from DuckDB is substantially faster !!! But anyway, the fact that it works at all is a miracle 🙂

In the summary table we remove empty rows and apply other business logic, which reduces the total size to 119 million rows.

Here is an example report using Power BI Direct Lake mode, basically reading Delta directly from storage.

In this run, it did detect that the nightly batch table had changed.

Conclusion

To be clear, I am not suggesting that I did anything novel; it is a very naive orchestrator. But the point is I could not have done it before. Somehow the combination of an open table format, robust query engines and an easy-to-use platform to run it all makes it possible, and that’s progress !!!

I am very bad at remembering Python library syntax, but with these coding assistants I can just focus on the business logic and let the machine do the coding. I think that’s good news for business users.

Experimenting with Text-to-SQL: Lessons from Optimizing Product Return Analysis

🌟 Introduction

While testing the DuckDB ODBC driver, which is getting better and better (not production ready but less broken compared to two years ago), I noticed something unexpected. Running queries through Power BI in DirectQuery mode was actually faster than executing them directly in the DuckDB native UI.

Naturally, that does not make sense !!!

What followed was an investigation that turned into a fun and insightful deep dive into text-to-SQL generation, Power BI’s query behavior, and the enduring relevance of manual SQL tuning.

🧩 The Goal: Find the Worst Product by Return Rate

The task was straightforward:

Calculate total sales, total returns, and return rate by product. Rank the products and find the top 5 with the highest return rates.

To make it interesting, I decided to try:

  1. Letting an LLM generate the SQL by loading the semantic model.
  2. Using PowerBI in Direct Query Mode.
  3. Finally, manually tuning the query.

📝 Step 1: LLM-generated SQL — Clean and Understandable

ChatGPT generated a good starting point:

WITH sales_by_product AS (
    SELECT
        i.i_product_name AS product_name,
        SUM(ss.ss_sales_price * ss.ss_quantity) AS total_sales
    FROM store_sales ss
    JOIN item i ON ss.ss_item_sk = i.i_item_sk
    WHERE i.i_product_name IS NOT NULL
    GROUP BY i.i_product_name
),

returns_by_product AS (
    SELECT
        i.i_product_name AS product_name,
        SUM(sr.sr_return_amt) AS total_returns
    FROM store_returns sr
    JOIN item i ON sr.sr_item_sk = i.i_item_sk
    WHERE i.i_product_name IS NOT NULL
    GROUP BY i.i_product_name
),

combined AS (
    SELECT
        COALESCE(s.product_name, r.product_name) AS product_name,
        COALESCE(s.total_sales, 0) AS total_sales,
        COALESCE(r.total_returns, 0) AS total_returns
    FROM sales_by_product s
    FULL OUTER JOIN returns_by_product r
        ON s.product_name = r.product_name
)

SELECT
    product_name,
    ROUND((total_returns / NULLIF(total_sales, 0)) * 100, 2) AS return_rate
FROM combined
WHERE total_sales > 0 -- Avoid divide by zero
ORDER BY return_rate DESC
LIMIT 5;

Pros:

  • Clean and easy to read.
  • Logically sound.
  • Good for quick prototyping.

🔍 Observation: However, it used product_name (a text field) as the join key in the combined CTE. Initially I was testing with TPC-DS 10 and the performance was good, but when I changed to TPC-DS 100, performance degraded very quickly!!! I should have known better, but I did not notice that product_name has a lot of distinct values.

The sales table is nearly 300 million rows and I am running on my laptop, so it is not too bad.

And it is nearly 26 GB of highly compressed data (just to keep it in perspective).

📊 Step 2: Power BI DirectQuery Surprises

Power BI automatically generates SQL queries based on the data model. Basically, you define measures using DAX, then you add a visual, which generates a DAX query that gets translated to SQL. Based on some complex logic, it may or may not push just one query to the source system; in this case, it generated multiple SQL queries and stitched the results together.

🔍 Insight: Power BI worked exactly as designed:

  • It split measures into independent queries.
  • It grouped by product_name, because that was the visible field in my model.
  • And surprisingly, it was faster than running the same query directly in DuckDB CLI!

Here’s my screenshot showing Power BI results and DAX Studio:


🧩 Step 3: DuckDB CLI — Slow with Text Joins

Running the same query directly in DuckDB CLI was noticeably slower, 290 seconds !!!


⚙️ Step 4: Manual SQL Tuning — Surrogate Keys Win

To fix this, I rewrote the SQL manually:

  • Switched to item_sk, a surrogate integer key.
  • Delayed lookup of human-readable fields.

Here’s the optimized query:

WITH sales_by_product AS (
    SELECT
        ss.ss_item_sk AS item_sk,
        SUM(ss.ss_sales_price * ss.ss_quantity) AS total_sales
    FROM store_sales ss
    GROUP BY ss.ss_item_sk
),

returns_by_product AS (
    SELECT
        sr.sr_item_sk AS item_sk,
        SUM(sr.sr_return_amt) AS total_returns
    FROM store_returns sr
    GROUP BY sr.sr_item_sk
),

combined AS (
    SELECT
        COALESCE(s.item_sk, r.item_sk) AS item_sk,
        COALESCE(s.total_sales, 0) AS total_sales,
        COALESCE(r.total_returns, 0) AS total_returns
    FROM sales_by_product s
    FULL OUTER JOIN returns_by_product r ON s.item_sk = r.item_sk
)

SELECT
    i.i_product_name AS product_name,
    ROUND((combined.total_returns / NULLIF(combined.total_sales, 0)) * 100, 2) AS return_rate
FROM combined
LEFT JOIN item i ON combined.item_sk = i.i_item_sk
WHERE i.i_product_name IS NOT NULL
ORDER BY return_rate DESC
LIMIT 5;

🚀 Result: Huge performance gain, from 290 seconds down to 41 seconds!

Check out the improved runtime in DuckDB CLI:


🌍 In real-world models, surrogate keys aren’t typically used

Unfortunately, in real life people still use text as a join key; luckily, Power BI seems to do better there !!!


🚀 Final Thoughts

LLMs are funny. When I asked ChatGPT why it did not suggest a better SQL query, I got this answer 🙂

I guess the takeaway is this:


If you’re writing SQL queries, always prefer integer types for your keys!

And maybe, just maybe, DuckDB (and databases in general) could get even better at optimizing joins on text columns. 😉

But perhaps the most interesting question is:
What if, one day, LLMs not only generate correct SQL queries but also fully performance-optimized ones?

Now that would be exciting.

You can download the data here; it uses a very small scale factor: https://github.com/djouallah/Fabric_Notebooks_Demo/tree/main/SemanticModel

Edit: running EXPLAIN ANALYZE shows that the GROUP BY is taking most of the time, not the joins.
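If you want to reproduce that, profiling is a one-liner in DuckDB (a sketch: optimized_query is just a Python variable holding the SQL from the previous section, and con is whatever connection holds your TPC-DS data):

# prints the analyzed plan with per-operator timings
con.sql("EXPLAIN ANALYZE " + optimized_query).show()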

The optimized query already assumed that i.i_item_sk is unique; it is not obvious for DuckDB to rewrite the query without knowing the type of the joins !!! I guess LLMs still have a lot to learn.

Using OBSTORE to Load and Download Arbitrary Files to OneLake

Prerequisites

Before proceeding, ensure you have the necessary tools installed and configured. For more details, refer to the official installation guide: Install Azure CLI on Windows.

  1. Install Azure CLI (one-time setup): winget install Microsoft.AzureCLI. Ensure you have the latest version installed.
  2. Login to Azure: az login, then follow the browser-based authentication flow.

Installing Required Python Package

The obstore package is a Python API for the Rust-based Object Store crate, which simplifies interaction with cloud storage systems.

pip install obstore --upgrade

Connecting to OneLake Storage

Once installed, you can connect to OneLake using the obstore package:

import obstore
from obstore.store import from_url

# Define storage path
store = from_url('abfss://sqlengines@onelake.dfs.fabric.microsoft.com/power.Lakehouse/Files', azure_use_azure_cli=True)

There is a PR from someone in the community after which azure_use_azure_cli=True will no longer be needed; the system will automatically pick up the available authentication.

Listing Files in OneLake

To list the files and folders inside OneLake, always specify a prefix to avoid long processing times:

obstore.list(store, 'tmp').collect()
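Each entry returned by collect() is an ObjectMeta; a quick way to eyeball the listing looks like this (the key names here are my reading of the obstore docs, so treat them as an assumption):

for meta in obstore.list(store, 'tmp').collect():
    print(meta["path"], meta["size"])   # assumed dict-like keys on ObjectMeta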

Uploading Local Files to OneLake

To upload files from a local directory to OneLake, use the following script:

import os

folder_path = '/test'  # Change this to the directory containing your files

for root, dirs, files in os.walk(folder_path):
    for file in files:
        local_path = os.path.join(root, file).replace("\\", "/")
        print(f"Uploading: {local_path}")
        obstore.put(store, local_path, local_path)

Downloading Files

For downloading files, you use get:

xx = obstore.get(store,'plan/plan.png').bytes()
with open('output_file.png', 'wb') as file:
    file.write(xx)

Compatibility with Other Storage Solutions

The beauty of this approach is that the code remains largely the same whether you’re using OneLake or an S3-compatible storage service. The main differences lie in updating:

  • The storage path
  • Authentication credentials
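For example, pointing the same code at S3 might look like this (a sketch only: the bucket, prefix and credential values are placeholders, and I am assuming from_url accepts the S3 config keys as keyword arguments):

# same API, different backend: only the URL and credentials change
s3_store = from_url(
    's3://my-bucket/some/prefix',
    region='us-east-1',
    access_key_id='AKIA...',
    secret_access_key='...',
)
obstore.list(s3_store, 'tmp').collect()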

Note: OpenDAL provides a similar solution, but it does not currently support Entra OAuth 2.

Summary

This short blog outlines a straightforward way to load files into OneLake using Python. With Azure CLI authentication and obstore, managing files in OneLake becomes both simple and standardized.

Obviously, it was always possible to do the same using the Azure Storage SDK, but the API is far from user friendly (personal opinion); it is designed for developers. As a business user, I like this package 🙂

Thanks Kyle Barron for creating this package  

You can download a sample notebook here :