OneLake – Page 4 – Small Data And self service

Reading a Delta Table Hosted in OneLake from Snowflake

I recently had a conversation about this topic and realized that it’s not widely known that Snowflake can read Delta tables hosted in OneLake. So, I thought I’d share this in a blog post.

Fundamentally, this process is similar to how XTable in Fabric works, but in reverse—it converts a Delta table to Iceberg by translating the table metadata ( AFAIK, Snowflake don’t use Xtable but an internal tool)

How It Works

External Volume and File Section

When creating an external volume in Snowflake that points to OneLake, only the Files section is supported. This isn’t an issue because you can simply add a shortcut that points to a schema.

SQL Code to Set Up External Volume and Map an Existing Table

CREATE OR REPLACE EXTERNAL VOLUME onelake_personal_tenant_delta
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'onelake_personal_tenant_delta',
            STORAGE_PROVIDER = 'AZURE',
            STORAGE_BASE_URL = 'azure://onelake.dfs.fabric.microsoft.com/python/data.Lakehouse/Files',
            AZURE_TENANT_ID = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
         )
      );

DESC EXTERNAL VOLUME onelake_personal_tenant_delta;

One-Time Setup: Authentication

You’ll need to complete a one-time authentication step:

Copy the AZURE_CONSENT_URL from the output.
Open it in your browser: "AZURE_CONSENT_URL":"https://login.microsoftonline.com/xx/oauth2/authorize?client_id=yy&response_type=code"
Ensure you select the correct email address that has access to your tenant.
This will create an Azure multi-tenant app with a service principal.
Add this service principal to your workspace so that Snowflake can access the data.

Setting Up the Iceberg Table

CREATE OR REPLACE CATALOG INTEGRATION delta_catalog_integration
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = DELTA
  ENABLED = TRUE;

CREATE DATABASE IF NOT EXISTS ONELAKE;

CREATE SCHEMA IF NOT EXISTS delta;

CREATE ICEBERG TABLE ONELAKE.delta.test
  CATALOG='delta_catalog_integration'
  EXTERNAL_VOLUME = 'onelake_personal_tenant_delta'
  BASE_LOCATION = 'aemo/test/';

SELECT COUNT(*) FROM ONELAKE.delta.test;

and here is the result

Limitations

Deletion Vectors Are Not Supported:
If your table contains deletion vectors, make sure to run OPTIMIZE to handle them properly.

Reading BigQuery Iceberg Tables in Fabric

This is a quick guide on correctly reading Iceberg tables from BigQuery. Currently, there are two types of Iceberg tables in BigQuery, based on the writer:

BigQuery Iceberg Table

This is the table written using BigQuery engine

Iceberg Tables Written Using the BigQuery Metastore

Currently, only Spark is supported. I assume that, at some point, other engines will be added. The implementation is entirely open source but currently supports only Java (having a REST API would have been a nice addition).

How OneLake Iceberg Shortcuts Work

OneLake reads both the data and metadata of an Iceberg table from its storage location and dynamically generates a Delta Lake log. This is a quick and cost-effective operation, as it involves only generating JSON files. See an example here

The Delta log is added to OneLake, while the source data remains read-only. Whenever you make changes to the Iceberg table, new metadata is generated and translated accordingly. The process is straightforward.

BigQuery Iceberg Doesn’t Publish Metadata Automatically

BigQuery uses an internal system to manage transactions. When querying data from the BigQuery SQL endpoint, the results are always consistent. However, reading directly from storage may return an outdated state of the table.

For BigQuery Iceberg tables, you need to manually run the following command to update the metadata:

EXPORT TABLE METADATA FROM dataset.iceberg_table;

you can run it on a schedule, or make it the last step in an ETL pipeline.

Iceberg Tables Using the BigQuery Metastore (Written by Spark)

If the Iceberg table is written using the BigQuery metastore (e.g., by Spark), no additional steps are required. The metadata is automatically updated.

The interesting part about Iceberg’s translation to a Delta table in OneLake is that it is completely transparent to Fabric workloads. For example, Power BI simply recognizes it as a regular Delta table. 😊

Fabric for Small Enterprises

While experimenting with different access modes in Power BI, I thought it is maybe worth sharing as a short blog to show why the Lakehouse architecture offers versatile options for Power BI developers. Even when they use Only Import Mode.

And Instead of sharing a conceptual piece, perhaps focus on presenting some dollar figures 🙂

Scenario: A Small Consultancy

According to local regulations, a small enterprise is defined as having fewer than 15 employees. Let’s consider this setup:

Data Storage: The data resides in Microsoft OneLake, utilizing an F2 SKU.
Number of Users: 15 employees.
Data Size: Approximately 94 million rows.
Pricing Model: For simplicity, assume the F2 SKU uses a reserved pricing model.

Monthly Costs:

Power BI Licensing: 15 users × 15 AUD = 225 AUD.
F2 SKU Reserved Pricing: 293 AUD.
Total Cost: 518 AUD per month.

ETL Workload

Currently, the ETL workload consumes approximately 50% of the available capacity.

For comparison, I ran the same workload on another Lakehouse vendor. To minimize costs, the schedule was adjusted to operate only from 8 AM to 6 PM. Despite this adjustment, the cost amounted to:

Daily Cost: 40 AUD.
Monthly Cost: 1,200 AUD.

In contrast, the F2 SKU’s reserved price of 293 AUD per month is significantly more economical. Even the pay-as-you-go model, which costs 500 AUD per month, remains competitive.

Key Insight:

While serverless billing is attractive, what matter is how much you end up paying per month.

For smaller workloads (less than 100 GB of data), data transformation becomes commoditized, and charging a premium for it is increasingly challenging.

Analytics in Power BI

I prefer to separate Power BI reports from the workspace used for data transformation. End users care primarily about clean, well-structured tables—not the underlying complexities.

With OneLake, there are multiple ways to access the stored data:

Import Mode: Directly import data from OneLake.
DirectQuery: Use the Fabric SQL Endpoint for querying.
Direct Lake Model: Access data with minimal latency.
Composite Models: All the above ( this is me trying to be funny)

All the Semantic Models and reports are hosted in the Pro license workspace, Notice that an import model works even when the capacity is suspended ( if you are using pay as you go pricing)

The Trade-Off Triangle

In analytical databases, including Power BI, there is always a trade-off between cost, freshness, and query latency. Here’s a breakdown:

Import Mode: Ideal if eight refreshes per day suffice and the model size is small. Reports won’t consume Fabric capacity (Onelake Transactions cost are insignificant for small data import)
Direct Lake Model: Provides excellent freshness and latency but will probably impacts F2 capacity, in other words, it will cost more.
DirectQuery: Balances freshness and latency (seconds rather than milliseconds) while consuming less capacity. This approach is particularly efficient as Fabric treats those Queries as background operations, with low consumption rates in many cases. Looking forward to the release of Fabric DWH result cache.

Key Takeaways

Cost-Effectiveness: Reserved pricing for smaller Fabric F SKUs combined with Power BI Pro license offers a compelling value proposition for small enterprises.
Versatility: OneLake provides flexible options for ETL workflows, even when using import mode exclusively.

The Lakehouse architecture and Power BI’s diverse access modes make it possible to efficiently handle analytics, even for smaller enterprises with limited budgets.

TPC-DS 100GB with Only 2 Cores and 16 GB of RAM

As the year comes to a close, I decided to explore a fun yet somewhat impractical challenge: Can DuckDB run the TPC-DS benchmark using just 2 cores and 16 GB of RAM? The answer is yes, but with a caveat—it’s slow. Despite the limitations, it works!

Notice; I am using lakehouse mounted storage, for a background on the different access mode, you can read the previous blog

Data Generation Challenges

Initially, I encountered an out-of-memory error while generating the dataset. Upgrading to the development release of DuckDB resolved this issue. However, the development release currently lacks support for reading Delta tables, as Delta functionality is provided as an extension available only in the stable release.

Here are some workarounds:

Increase the available RAM.
Use the development release to generate the data, then switch back to version 1.1.3 for querying.
Wait for the upcoming version 1.2, which should resolve this limitation.

The data is stored as Delta tables in OneLake, it was exported as a parquet files by duckdb and converted to delta table using delta_rs (the conversion was very quick as it is a metadata only operation)

Query Performance

Running all 99 TPC-DS queries worked without errors, albeit very slowly( again using only 2 cores ).

I also experimented with different configurations:

4, 8, and 16 cores: Predictably, performance improved as more cores were utilized.

For comparison, I ran the same test on my laptop, which has 8 cores and reads my from local SSD storage, The Data was generated using the same notebook.

Results

Python notebook compute consumption is straightforward, 2 cores = 1 CUs, the cheapest option is the one that consume less capacity units, assuming speed of execution is not a priority.

Cheapest configuration: 8 cores offered a good balance between cost and performance.
Fastest configuration: 16 cores delivered the best performance.

Interestingly, the performance of a Fabric notebook with 8 cores reading from OneLake was comparable to my laptop with 8 cores and an SSD. This suggests that OneLake’s throughput is competitive with local SSDs.

Honestly, It’s About the Experience

At the end of the day, it’s not just about the numbers. There’s a certain joy in using a Python notebook—it just feels right. DuckDB paired with Python creates an intuitive, seamless experience that makes analytical work enjoyable. It’s simply a very good product.

Conclusion

While this experiment may not have practical applications, it highlights DuckDB’s robustness and adaptability. Running TPC-DS with such limited resources showcases its potential for lightweight analytical workloads.

You can download the notebook for this experiment here:

	Querying a Fabric La… on Writing to SQL Server using…
	Benjamin on Running DuckDB at 10 TB s…
	mim on Running DuckDB at 10 TB s…
	Benjamin on Running DuckDB at 10 TB s…
	Running DuckDB at 10… on Running DuckDB at 10 TB s…

Category: OneLake

Reading a Delta Table Hosted in OneLake from Snowflake

Recommended Documentation

How It Works

External Volume and File Section

SQL Code to Set Up External Volume and Map an Existing Table

One-Time Setup: Authentication

Setting Up the Iceberg Table

Limitations

Fabric for Small Enterprises

Monthly Costs:

ETL Workload

Key Insight:

Analytics in Power BI

The Trade-Off Triangle

Key Takeaways

TPC-DS 100GB with Only 2 Cores and 16 GB of RAM

Data Generation Challenges

Query Performance

Results

Honestly, It’s About the Experience

Conclusion

Recommended Documentation

How It Works

External Volume and File Section

SQL Code to Set Up External Volume and Map an Existing Table

One-Time Setup: Authentication

Setting Up the Iceberg Table

Limitations

Share this:

BigQuery Iceberg Table

Iceberg Tables Written Using the BigQuery Metastore

How OneLake Iceberg Shortcuts Work

BigQuery Iceberg Doesn’t Publish Metadata Automatically

Iceberg Tables Using the BigQuery Metastore (Written by Spark)

Share this:

Monthly Costs:

ETL Workload

Key Insight:

Analytics in Power BI

The Trade-Off Triangle

Key Takeaways

Share this:

Data Generation Challenges

Query Performance

Results

Honestly, It’s About the Experience

Conclusion

Share this: