Writing to SQL Server using DuckDB

I don’t know much about SQL Server. The closest I ever got to it was having read only access to a database. I remember 10 years ago we had a use case for a database, and IT decided for some reason that we were not allowed to install SQL Server Express. Even though it was free and a Microsoft product. To this day, it is still a mystery to me, anyway, at that time I was introduced to PowerPivot and PowerQuery, and the rest was history.

Although I knew very little about SQL Server, I knew that SQL Server users are in love with the product. I worked with a smart data engineer who had a very clear world view:

I used SQL Server for years. It is rock solid. I am not interested in any new tech.

At the time, I thought he lacked imagination. Now I think I see his point.

When SQL Server was added to Fabric, I was like, oh, that’s interesting. But I don’t really do operational workloads anyway, so I kind of ignored it.

Initially I tried to make it fit my workflow, which is basically developing Python notebooks using DuckDB or Polars (depending on my mood) inside VSCode with GitHub Copilot. and deploy it later into Fabric, of course you can insert a dataframe into SQL Server, but it did not really click for me at first. To be clear, I am not saying it is not possible. It just did not feel natural in my workflow( messing with pyodbc is not fun).

btw the SQL extension inside VSCode is awesome

A week ago I was browsing the DuckDB community extensions and I came across the mssql extension. And boy !!! that was an emotional rollercoaster (The last time I had this experience was when I first used tabular editor a very long time ago).

You just attach a SQL Server database using either username and password or just a token. That’s it. The rest is managed by the extension, suddenly everything make sense to me!!!

conn = duckdb.connect()

if PLATFORM == 'fabric':
    token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token

# notebookutils.credentials.getToken("sql") inside Fabric notebook
    for attempt in range(3):
        try:
            conn.sql(f"""
                ATTACH IF NOT EXISTS
                'Server={host};Database={db}'
                AS db (TYPE mssql, ACCESS_TOKEN '{token}')
            """)
            break
        except Exception as e:
            if attempt < 2:
                print(f"Attempt {attempt+1} failed, waiting 60s for serverless wake-up...")
                time.sleep(60)
            else:
                raise e
else:
    conn.sql(f"""
        ATTACH OR REPLACE
        'Server={host},{pr};Database={db};User Id={user};Password={pw};Encrypt=yes'
        AS db (TYPE mssql)
    """)

conn.sql("SET mssql_query_timeout = 6000; SET mssql_ctas_drop_on_failure = true;")
print(f"Connected to SQL Server via {PLATFORM}")

again, I know there other ways to load data which are more efficiently, but if I have a small csv that I processed using python, nothing compare to the simplicity of a dataframe, in that week; here are some things I learned, I know it is obvious for someone who used it !!! but for me, it is like I was living under a rock all these years 🙂

if you run show all tables in duckdb, you get something like this

TDS and bulk insertion

You don’t need ODBC. You can talk to SQL Server directly using TDS, which is the native protocol it understands. There is also something called BCP, which basically lets you batch load data efficiently instead of pushing rows one by one. Under the hood it streams the data in chunks, and the performance is actually quite decent. It is not some hacky workaround. It feels like you are speaking SQL Server’s own language, and that changes the whole experience.

SQL Server is not only for OLTP

Turns out people use SQL Server for analytics too, with columnar table format.

CREATE CLUSTERED COLUMNSTORE INDEX cci_{table}
ON {schema}.{table}
ORDER ({order_col});

I tested a typical analytical benchmark and more or less it performs like a modern single node data warehouse.

Accelerating Analytics for row store

Basically, there is a batch mode where the engine processes row-based tables in batches instead of strictly row by row. The engine can apply vectorized operations, better CPU cache usage, and smarter memory management even on traditional rowstore tables. It is something DuckDB added with great fanfare to accelerate PostgreSQL heap tables. I was a bit surprised that SQL Server already had it for years.

RLS/CLS for untrusted Engine

If you have a CLS or RLS Lakehouse table and you want to query it from an untrusted engine, let’s say DuckDB running on your laptop, today, you can’t for a good reason as the direct storage access is blocked, this extension solves it, as you query the SQL Endpoint itself.

Most of fancy things were already invented

Basically, many of the things’ people think are next generation technologies were already implemented decades ago. SQL control flow, temp tables, complex transactions, fine grained security, workload isolation, it was all already there.

I think the real takeaway for me; user experience is as important – if not more- than the SQL Engine itself, and when a group of very smart people like something then there is probably a very good reason for it.

Query Onelake Iceberg REST catalog using Duckdb SQL

it is a quick post on how to query Onelake Iceberg REST Catalog using pure SQL with DuckDB, and yes you need a service principal that has access to the lakehouse

CREATE or replace PERSISTENT secret onelake_identity_iceberg (
    TYPE ICEBERG,
    CLIENT_ID 'xxxxxxxxxxxxxx',
    CLIENT_SECRET 'yyyyyyyyyyyyyyyyyyy' ,
    OAUTH2_SCOPE   'https://storage.azure.com/.default' ,
    OAUTH2_SERVER_URI 'https://login.microsoftonline.com/TENANT_ID /oauth2/v2.0/token' ,
    ENDPOINT 'https://onelake.table.fabric.microsoft.com/iceberg'
);
CREATE or replace PERSISTENT secret azure_spn (
    TYPE azure,
    PROVIDER service_principal,
    TENANT_ID 'ccccccc',
    CLIENT_ID 'iiiiiiiiiiiiii',
    CLIENT_SECRET 'xbndlfrewi' ,
    ACCOUNT_NAME 'onelake'
);

it works reasonably well assuming your region is not far from your laptop, or even better , if you run it inside Fabric then there is no network shenanigans, I recorded a video showing my experience

Why read operations do not always need full consistency checks

I hope DuckDB eventually adds an option that allows turning off table state checks for purely read scenarios. The current behaviour is correct because you always need the latest state when writing in order to guarantee consistency. However, for read queries it feels unnecessary and hurts the overall user experience. PowerBI solved this problem very well with its concept of framing, and something similar in DuckDB would make a big difference, notice duckdb delta reader already support pin version.

just vorder don’t try to understand how vertipaq works

I know it is a niche topic, and in theory, we should not think about it. in import mode, Power BI ingests data and automatically optimizes the format without users knowing anything about it. With Direct Lake, it has become a little more nuanced. The ingestion part is done downstream by other processes that Power BI cannot control. The idea is that if your Parquet files are optimally written using Vorder, you will get the best performance. The reality is a bit more complicated. External writers know nothing about Vorder, and even some internal Fabric Engines do not write it by default.

To show the impact of read performance for Vordered Parquet vs sorted Parquet, I ran some tests. I spent a lot of time ensuring I was not hitting the hot cache, which would defeat the purpose. One thing I learned is to test only one aspect of the system carefully.

Cold Run: Data loaded from OneLake with on-the-fly building of dictionaries, relationships, etc

Warm Run: Data already in memory format.

Hot Run: The system has already scanned the same data before. You can disable this by calling the ClearCache command using XMLA (you cannot use a REST API call).


Prepare the Data

Same model, three dimensions, and one fact table, you can generate the data using this notebook

  • Deltars: Data prepared using Delta Rust sorted by date, id, then time.
  • Spark_default: Same sorted data but written by Spark.
  • Spark_vorder: Manually enabling Vorder (Vorder is off by default for new workspaces).

Note: you can not sort and vorder at the same time, at least in Spark, DWH seems to support it just fine

The data layout is as follows:

I could not make Delta Rust produce larger row groups. Notice that ZSTD gives better compression, although it seems Power BI has to uncompress the data in memory, so it seems it is not a big deal

I added a bar chart to show how the data is sorted per date. As expected, Vorder is all over the place, the algorithm determines that this is the best row reordering to achieve the best RLE encoding.


The Queries

The queries are very simple. Initially, I tried more complex queries, but they involved other parts of the system and introduced more variables. I was mainly interested in testing the scan performance of VertiPaq.

Basically, it is a simple filter and aggregation.


The Test

You can download the notebook here :

def run_test(workspace, model_to_test):
    for i, dataset in enumerate(model_to_test):
        try:
            print(dataset)
            duckrun.connect(f"{ws}/{lh}.Lakehouse/{dataset}").deploy(bim_url)
            run_dax(workspace, dataset, 0, 1)
            time.sleep(300)
            run_dax(workspace, dataset, 1, 5)
            time.sleep(300)
        except Exception as e:
            print(f"Error: {e}")
    return 'done'

Deploy automatically generates a new semantic model. If it already exists, it will call clearvalue and perform a full refresh, ensuring no data is kept in memory. When running DAX, I make sure ClearCache is called.

The first run includes only one query; the second run includes five different queries. I added a 5-minute wait time to create a clearer chart of capacity usage.


Capacity Usage

To be clear, this is not a general statement, but at least in this particular case with this specific dataset, the biggest impact of Vorder seems to appear in the capacity consumption during the cold run. In other words:

Transcoding a vanilla Parquet file consumes more compute than a Vordered Parquet file.

Warm runs appear to be roughly similar.


Impact on Performance

Again, this is based on only a couple of queries, but overall, the sorted data seems slightly faster in warm runs (although more expensive). Still, 120 ms vs 340 ms will not make a big difference. I suspect the queries I ran aligned more closely with the column sorting—that was not intentional.


Takeaway

Just Vorder if you can. Make sure you enable it when starting a new project. ETL, data engineering have only one purpose, make the experience of the end users the best possible way, ETL job that take 30 seconds more is nothing compared to a slower PowerBI reports.

now if you can’t, maybe you are using a shortcut from an external engine, check your powerbi performance if it is not as good as you expect then make a copy, the only thing that matter is the end user experience.

Another lesson is that VertiPaq is not your typical OLAP engine; common database tricks do not apply here. It is a unique engine that operates entirely on compressed data. Better-RLE encoded Parquet will give you better results, yes you may have cases where the sorting align better with your queries pattern, but in the general case, Vorder is always the simplest option.

First Look at OneLake Diagnostics

While preparing for a presentation about the FabCon announcement, one item was about OneLake Diagnostics. all ll I knew was that it had something to do with security and logs. As a Power BI user, that’s not exactly the kind of topic that gets me excited, but I needed to know at least the basic, so I can answer questions if someone ask 🙂

Luckily, we have a tradition at work , whenever something security-related comes up, we just ping Amnjeet 🙂

He showed me how it works , and I have to say, I loved it. It’s refreshingly simple.

You can download the notebook here:


You just select a folder in your Lakehouse and turn it on.

That’s it , the system automatically starts generating JSON files, neatly organized using Hive-style partitions, By default, user identity and IP tracking are turned off unless an admin explicitly enables them. You can find more details about the schema and setup here.


What the Logs Look Like

Currently, the logs are aggregated at the hourly level, but the folder structure also includes a partition for minutes (even though they’re all grouped at 00 right now).

Parsing the JSON Logs

Once the logs were available, I wanted to do some quick analysis , not necessarily about security, just exploring what’s inside.

There are probably half a dozen ways to do this in Fabric ; Shortcut Transform, RTI, Dataflow Gen2, DWH, Spark, and probably some AI tools too, Honestly, that’s a good problem to have.

But since I like Python notebooks and the data is relatively small, I went with DuckDB (as usual), but Instead of using plain DuckDB and delta_rs to store the results, I used my little helper library, duckrun, to make things simpler ( Self Promotion alert).

Then I asked Copilot to generate a bit of code for registering existing functions to look up the workspace name and lakehouse name from their GUIDs in DuckDB, using SQL to call python is cool 🙂


The data is stored incrementally, using the file path as a key , so you end up with something like this:

import duckrun

con = duckrun.connect('bigdata/tpch.lakehouse/dbo')

onelake_logs_path = (
    'abfss://bigdata@onelake.dfs.fabric.microsoft.com/'
    'tpch.Lakehouse/Files/DiagnosticLogs/OneLake/Workspaces/*/'
    'y=*/m=*/d=*/h=*/m=*/*.json'
)

Then I added only the new logs with this SQL script:

try:
    con.sql(f"""  
        CREATE VIEW IF NOT EXISTS logs(file) AS SELECT 'dummy';
        SET VARIABLE list_of_files =
        (
            WITH new_files AS (
                SELECT file
                FROM glob('{onelake_logs_path}')
                WHERE file NOT IN (SELECT DISTINCT file FROM logs)
                ORDER BY file
            )
            SELECT list(file) FROM new_files
        );
        SELECT * EXCLUDE(data), data.*, filename AS file  
        FROM read_json_auto(
            GETVARIABLE('list_of_files'),
            hive_partitioning = true,
            union_by_name = 1,
            FILENAME = 1
        )
    """).write.mode("append").option("mergeSchema", "true").saveAsTable('logs')
except Exception as e:
    print(f"An error occurred: {e}")

1- Using glob() to collect file names means you don’t open any files unnecessarily , a small but nice performance win.

2- DuckDB expand the struct using this expression data.*

3- union_by_name = 1 in case the json has different schemas

4- option(“mergeSchema”, “true”) for schema evolution in Delta table


Exploring the Data

Once the logs are in a Delta table, you can query them like any denormalize table.

For example, here’s a simple query showing API calls per engine:

Note : using AI to get working regex is maybe the best thing ever 🙂

SELECT
    regexp_extract(resource, '([^&/]+)/([^&/]+)/(Tables|Files)(?:/([^&/]+))?(?:/([^&/]+))?', 4) AS schema_name,
    get_workspace_name(workspaceid) AS workspace_name,
    get_lakehouse_name(workspaceid, itemId) AS lakehouse_name,
    originatingApp,
    COUNT(*) AS API_calls
FROM logs
GROUP BY ALL
ORDER BY API_calls DESC
LIMIT 5;

Fun fact: OneLake tags Python notebook as Spark.
Also, I didn’t realize Lineage calls OneLake too!

as I have already register Python functions as UDFs, which is how I pulled in the workspace and lakehouse names in the query above.


Takeaway

This was just a bit of tinkering, but I’m really impressed with how easy OneLake Diagnostics is to set up and use.

I still remember the horrors of trying to connect Dataflow Gen1 to Azure Storage ,that was genuinely painful (and I never even got access from IT anyway).

It’s great to see how Microsoft Fabric is simplifying these scenarios. Not everything can always be easy, but making the first steps easy really gives the feature a very good impression.