First Look at OneLake Diagnostics

While preparing for a presentation about the FabCon announcements, one item was OneLake Diagnostics. All I knew was that it had something to do with security and logs. As a Power BI user, that's not exactly the kind of topic that gets me excited, but I needed to know at least the basics so I could answer questions if someone asked 🙂

Luckily, we have a tradition at work: whenever something security-related comes up, we just ping Amnjeet 🙂

He showed me how it works, and I have to say, I loved it. It's refreshingly simple.

You can download the notebook here:


You just select a folder in your Lakehouse and turn it on.

That's it. The system automatically starts generating JSON files, neatly organized using Hive-style partitions. By default, user identity and IP tracking are turned off unless an admin explicitly enables them. You can find more details about the schema and setup here.


What the Logs Look Like

Currently, the logs are aggregated at the hourly level, but the folder structure also includes a partition for minutes (even though they’re all grouped at 00 right now).
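
For illustration, a single log file lands under a path shaped roughly like this (the workspace GUID and dates below are made up; the pattern matches the glob used later in this post):

Files/DiagnosticLogs/OneLake/Workspaces/&lt;workspace-guid&gt;/y=2025/m=11/d=04/h=09/m=00/&lt;log-file&gt;.json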

Parsing the JSON Logs

Once the logs were available, I wanted to do some quick analysis, not necessarily about security, just exploring what's inside.

There are probably half a dozen ways to do this in Fabric: Shortcut Transform, RTI, Dataflow Gen2, DWH, Spark, and probably some AI tools too. Honestly, that's a good problem to have.

But since I like Python notebooks and the data is relatively small, I went with DuckDB (as usual). Instead of using plain DuckDB and delta_rs to store the results, I used my little helper library, duckrun, to make things simpler (self-promotion alert).

Then I asked Copilot to generate a bit of code that registers existing Python functions to look up the workspace name and lakehouse name from their GUIDs in DuckDB. Using SQL to call Python is cool 🙂
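
The exact code is in the notebook; below is a minimal sketch of the idea using plain DuckDB's create_function, with hypothetical GUID-to-name dictionaries standing in for the real lookups:

import duckdb
from duckdb.typing import VARCHAR

# Hypothetical lookups; the real functions in the notebook resolve the names differently
WORKSPACES = {"1111-aaaa": "bigdata"}
LAKEHOUSES = {("1111-aaaa", "2222-bbbb"): "tpch"}

def get_workspace_name(workspace_id: str) -> str:
    return WORKSPACES.get(workspace_id, workspace_id)

def get_lakehouse_name(workspace_id: str, item_id: str) -> str:
    return LAKEHOUSES.get((workspace_id, item_id), item_id)

con = duckdb.connect()
# Register the Python functions so SQL can call them
con.create_function("get_workspace_name", get_workspace_name, [VARCHAR], VARCHAR)
con.create_function("get_lakehouse_name", get_lakehouse_name, [VARCHAR, VARCHAR], VARCHAR)

print(con.sql("SELECT get_workspace_name('1111-aaaa')").fetchall())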


The data is stored incrementally, using the file path as a key, so you end up with something like this:

import duckrun

con = duckrun.connect('bigdata/tpch.lakehouse/dbo')

onelake_logs_path = (
    'abfss://bigdata@onelake.dfs.fabric.microsoft.com/'
    'tpch.Lakehouse/Files/DiagnosticLogs/OneLake/Workspaces/*/'
    'y=*/m=*/d=*/h=*/m=*/*.json'
)

Then I added only the new logs with this script:

try:
    con.sql(f"""  
        -- Placeholder view so the NOT IN filter below works on the very first run
        CREATE VIEW IF NOT EXISTS logs(file) AS SELECT 'dummy';
        -- Collect only the files that have not been loaded yet, using the file path as the key
        SET VARIABLE list_of_files =
        (
            WITH new_files AS (
                SELECT file
                FROM glob('{onelake_logs_path}')
                WHERE file NOT IN (SELECT DISTINCT file FROM logs)
                ORDER BY file
            )
            SELECT list(file) FROM new_files
        );
        -- Read the new JSON files and expand the nested 'data' struct into columns
        SELECT * EXCLUDE(data), data.*, filename AS file  
        FROM read_json_auto(
            GETVARIABLE('list_of_files'),
            hive_partitioning = true,
            union_by_name = 1,
            FILENAME = 1
        )
    """).write.mode("append").option("mergeSchema", "true").saveAsTable('logs')
except Exception as e:
    print(f"An error occurred: {e}")

1- Using glob() to collect file names means you don't open any files unnecessarily, a small but nice performance win.

2- DuckDB expands the struct using the expression data.*

3- union_by_name = 1 handles JSON files with different schemas

4- option("mergeSchema", "true") enables schema evolution in the Delta table
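
To double-check the incremental behaviour, a quick sanity query (assuming the duckrun connection returns results like a regular DuckDB relation):

# How many distinct source files have already been appended to the Delta table?
loaded = con.sql("SELECT count(DISTINCT file) AS files_loaded FROM logs")
print(loaded)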


Exploring the Data

Once the logs are in a Delta table, you can query them like any denormalized table.

For example, here’s a simple query showing API calls per engine:

Note: using AI to get a working regex is maybe the best thing ever 🙂

SELECT
    regexp_extract(resource, '([^&/]+)/([^&/]+)/(Tables|Files)(?:/([^&/]+))?(?:/([^&/]+))?', 4) AS schema_name,
    get_workspace_name(workspaceid) AS workspace_name,
    get_lakehouse_name(workspaceid, itemId) AS lakehouse_name,
    originatingApp,
    COUNT(*) AS API_calls
FROM logs
GROUP BY ALL
ORDER BY API_calls DESC
LIMIT 5;

Fun fact: OneLake tags Python notebooks as Spark.
Also, I didn’t realize Lineage calls OneLake too!

I had already registered the Python functions as UDFs, which is how I pulled in the workspace and lakehouse names in the query above.


Takeaway

This was just a bit of tinkering, but I’m really impressed with how easy OneLake Diagnostics is to set up and use.

I still remember the horrors of trying to connect Dataflow Gen1 to Azure Storage; that was genuinely painful (and I never even got access from IT anyway).

It's great to see how Microsoft Fabric is simplifying these scenarios. Not everything can always be easy, but making the first steps easy leaves a very good first impression of the feature.

Some Observations on Running TPCH 1 TB on Microsoft Fabric

This is not an official Microsoft benchmark, just my personal experience.

Last week, I came across a new TPCH generator written in Rust. Luckily, someone ported it to Python, which makes generating large datasets possible even with a small amount of RAM. For example, it took 2 hours and 30 minutes to generate a 1 TB scale dataset using the smallest Fabric Python notebook (2 cores and 16 GB of RAM).

Having the data handy, I tested Fabric DWH and SQL Endpoint. I also tested DuckDB as a sanity check. To be honest, I wasn’t sure what to expect.

I shared all the notebooks used and the results here.

I ran the test 30 times over three days, so I think I have enough data to say something useful. In this blog, I will focus only on the results for the cold and warm runs, along with some observations.

For readers unfamiliar with Fabric, DWH and SQL Endpoint refer to the same distributed SQL engine. With DWH, you ingest data that is stored as a Delta table (which can be read by any Delta reader). With SQL Endpoint, you query external Delta tables written by Spark and other writers (this is called a Lakehouse table). Both use Delta tables.

Notes:

  • All the runs used a Python notebook
  • To send queries to DWH/SQL Endpoint, all you need is the following (see the timing sketch after this list):
    conn = notebookutils.data.connect_to_artifact("data")
    conn.query("select 42")
  • I did not include the cost of ingestion for the DWH
  • The cost includes compute and storage transactions, and assumes a pay-as-you-go rate of $0.18 per CU hour
  • For extracting Capacity usage, I used this excellent blog
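
For illustration, here is a rough sketch of how I timed the queries from the Python notebook. The query below is just a stand-in (the shared notebooks contain the real 22 TPC-H queries), and the item name "data" is the one from my workspace:

import time
import notebookutils  # available inside Fabric notebooks

# Connect to the DWH / SQL Endpoint item
conn = notebookutils.data.connect_to_artifact("data")

# Stand-in for the 22 TPC-H queries
queries = {"row_count": "SELECT count(*) FROM lineitem"}

results = []
for name, sql in queries.items():
    start = time.time()
    conn.query(sql)  # run the query against the SQL engine
    results.append((name, round(time.time() - start, 2)))

print(results)

As for the cost column, the arithmetic is simple: a run that consumes, say, 120 CU-seconds costs 120 / 3600 × $0.18 ≈ $0.006 at the rate above.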

Cold Run

  • The first-ever run on SQL Endpoint incurs an overhead; apparently the system builds statistics. This overhead happened only once across all tests.
  • Point 2 is an outlier but an interesting one 🙂
  • The number of dots displayed is less than the number of test runs because some results match perfectly, which is a good sign that the system is predictable!
  • vorder improves performance for both SQL Endpoint and DuckDB. The data was generated by Rust and rewritten using Spark; it seems to be worth the effort.
  • Costs are roughly the same for DWH and SQL Endpoint when the Delta is optimized by vorder, but DWH is still faster.
  • DuckDB, running in a Python notebook with 64 cores, is the cheapest (but the slowest). Query 17 did not run, so that result is moot. Still, it's a testament to the OneLake architecture: third-party engines can perform well without any additional Microsoft integration. Lakehouse for the win.
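
For the DuckDB runs, here is roughly how they were issued, reusing the duckrun helper from the first section (workspace, lakehouse and table names are from my setup; the memory setting is the one from the edit note at the end):

import duckrun

# Connect to the TPC-H lakehouse schema
con = duckrun.connect('bigdata/tpch.lakehouse/dbo')

# Query 17 at the 1 TB scale needs an explicit memory cap (see the edit note below)
con.sql("SET memory_limit='500GB'")

# Any of the 22 TPC-H queries can then run directly against the Delta tables in OneLake
con.sql("SELECT count(*) FROM lineitem")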

Warm Run

  • vorder is better than vanilla Parquet.
  • DWH is faster and a bit cheaper than SQL Endpoint.
  • DuckDB's behavior is a bit surprising; I was expecting better performance, considering the data is already loaded into RAM.

Impact on the Parquet Writer

I added a chart showing the impact of using different writers on read performance. I used only warm runs to remove the impact of the first-ever run, as it does not happen in the DWH (since the data was ingested there).

  • Given the same table layout, DWH and SQL Endpoint perform the same, which is expected since it is the same engine
  • Surprisingly, the initial raw Delta table and the Spark optimized write gave more or less the same read performance, at least for this particular workload.
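
For reference, the Spark rewrite step looks roughly like the sketch below. The config names are as I recall them from the Fabric documentation and the paths are made up, so verify both before relying on it:

# Run in a Fabric Spark notebook; 'spark' is the session the notebook provides
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")               # enable vorder
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")    # optimized write

# Rewrite the raw Parquet produced by the Rust generator as a vorder-optimized Delta table
df = spark.read.parquet("Files/tpch_raw/lineitem")
df.write.format("delta").mode("overwrite").saveAsTable("lineitem")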

Final Thoughts

Running the test was a very enjoyable experience, and for me, that’s the most important thing. I particularly enjoyed using Python notebooks to interact with Fabric DWH. It makes a lot of sense to combine a client-server distributed system with a lightweight client that costs very little.

There are new features coming that will make working with DWH even more streamlined.

Edit:

  • Updated the figures for DuckDB: Query 17 does run, but you need to limit the memory manually with SET memory_limit='500GB'
  • Added a graph on the impact of the Parquet layout.

Using OBSTORE to Load and Download Arbitrary Files to OneLake

Prerequisites

Before proceeding, ensure you have the necessary tools installed and configured. For more details, refer to the official installation guide: Install Azure CLI on Windows.

  1. Install Azure CLI (one-time setup): run winget install Microsoft.AzureCLI and make sure you have the latest version installed.
  2. Login to Azure: run az login and follow the browser-based authentication flow.

Installing Required Python Package

The obstore package is a Python API for the Rust-based object_store crate, which simplifies interaction with cloud storage systems.

pip install obstore --upgrade

Connecting to OneLake Storage

Once installed, you can connect to OneLake using the obstore package:

import obstore
from obstore.store import from_url

# Define storage path
store = from_url('abfss://sqlengines@onelake.dfs.fabric.microsoft.com/power.Lakehouse/Files', azure_use_azure_cli=True)

There is a PR by someone from the community; once it lands, azure_use_azure_cli=True will no longer be needed and the system will automatically pick the available authentication.

Listing Files in OneLake

To list the files and folders inside OneLake, always specify a prefix to avoid long processing times:

obstore.list(store, 'tmp').collect()
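
Each returned entry is a small dict-like record; for example, to print the path and size of every object under the same prefix:

# Print the path and size (in bytes) of each object under the 'tmp' prefix
for meta in obstore.list(store, 'tmp').collect():
    print(meta['path'], meta['size'])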

Uploading Local Files to OneLake

To upload files from a local directory to OneLake, use the following script:

import os

folder_path = '/test'  # Change this to the directory containing your files

# Walk the local folder and upload every file, reusing the local path as the OneLake key
for root, dirs, files in os.walk(folder_path):
    for file in files:
        local_path = os.path.join(root, file).replace("\\", "/")
        print(f"Uploading: {local_path}")
        # obstore.put(store, destination_path, source): here the local path is used for both
        obstore.put(store, local_path, local_path)

Downloading Files

For downloading files, you use get:

# Read the object into memory, then write it to a local file
xx = obstore.get(store, 'plan/plan.png').bytes()
with open('output_file.png', 'wb') as file:
    file.write(xx)

Compatibility with Other Storage Solutions

The beauty of this approach is that the code remains largely the same whether you’re using OneLake or an S3-compatible storage service. The main differences lie in updating:

  • The storage path
  • Authentication credentials
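
For example, pointing the same code at a hypothetical S3 bucket only changes the from_url call (credentials are picked up from the usual AWS environment variables):

import obstore
from obstore.store import from_url

# Hypothetical bucket, prefix and region
s3_store = from_url('s3://my-bucket/data', region='us-east-1')

# list / put / get work exactly the same; only the store object changes
print(obstore.list(s3_store, 'tmp').collect())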

Note: OpenDAL provides a similar solution, but it does not currently support Entra OAuth 2.

Summary

This short blog outlines a straightforward way to load files into OneLake using Python. With Azure CLI authentication and obstore, managing files in OneLake becomes both simple and, more importantly, standardized.

Obviously, it was always possible to do the same using the Azure Storage SDK, but the API is far from user friendly (personal opinion); it is designed for developers. As a business user, I like this package 🙂

Thanks to Kyle Barron for creating this package.

You can download a sample notebook here: