This is a quick post on how to query the OneLake Iceberg REST Catalog using pure SQL with DuckDB, and yes, you need a service principal that has access to the lakehouse.
It works reasonably well assuming your region is not far from your laptop, or even better, if you run it inside Fabric there are no network shenanigans at all. I recorded a video showing my experience.
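For context, here is a minimal sketch of what the DuckDB side looks like. It is not the exact script from the video: the endpoint, tenant/client IDs, warehouse, and table names are placeholders, and while the CREATE SECRET / ATTACH options follow DuckDB's Iceberg REST catalog support, the exact OAuth wiring for a Fabric service principal may differ, so check the iceberg extension docs for your version.

import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg;")
con.sql("LOAD iceberg;")

# Service principal credentials (OAuth2 client-credentials flow); all values are placeholders
con.sql(f"""
CREATE SECRET (
    TYPE ICEBERG,
    CLIENT_ID '{client_id}',
    CLIENT_SECRET '{client_secret}',
    OAUTH2_SERVER_URI 'https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token'
);
""")

# Attach the OneLake Iceberg REST Catalog, then it is pure SQL from here
con.sql(f"""
ATTACH '{warehouse_name}' AS onelake (
    TYPE ICEBERG,
    ENDPOINT '{onelake_iceberg_rest_endpoint}'
);
""")
con.sql("SHOW ALL TABLES;").show()
con.sql("SELECT count(*) FROM onelake.dbo.scada;").show()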
Why read operations do not always need full consistency checks
I hope DuckDB eventually adds an option to turn off table-state checks for purely read scenarios. The current behaviour is correct, because a writer always needs the latest table state to guarantee consistency, but for read queries it feels unnecessary and hurts the overall user experience. Power BI solved this problem very well with its concept of framing, and something similar in DuckDB would make a big difference; note that the DuckDB Delta reader already supports pinning a version.
I’ve been following the evolution of Iceberg shortcuts to OneLake, and I’m genuinely impressed with how much energy the engineering team has invested in making them more robust; it is a good idea to read the documentation.
Essentially, XTable is used behind the scenes. Think of it as a translator for your open table format. Instead of requiring you to convert data from one format (like Iceberg) to another (like Delta) just to query them together, XTable lets you access and interact with tables in different formats as if they shared a single, unified format within OneLake—all without user intervention.
To truly put this to the test, I recently ran an experiment in a real production environment using my paid tenant—no sandboxes here! Here’s the logic from the Python notebook:
Accessing data from an Iceberg table using a shortcut (sourced from Snowflake; the data can be stored anywhere: Azure, S3, GCP, or OneLake, and you can use BigQuery or any other Iceberg writer).
Inserting arbitrary data and performing delete operations.
Counting the total rows using Snowflake.
Counting the total rows using a Fabric notebook, reading the same data as a Delta table.
Recording the record counts in a results table to track and visualize the comparison over time (a sketch of these steps follows the Snowflake snippets below).
The results were quite awesome. I plotted the total record counts from both the Iceberg and Delta perspectives using two distinct colors and observed a perfect match. This confirms the seamless interoperability provided by XTable.
Lesson learned:
See the code snippet below for inserting data in Snowflake:
# Insert a batch of rows from the raw AEMO feed into the Iceberg table
snow.execute(f'insert into ONELAKE.ICEBERG.scada select * from ONELAKE.AEMO.SCADARAW limit {limit};')
# Delete some rows so the experiment covers both inserts and deletes
snow.execute('delete from ONELAKE.ICEBERG.scada where INITIALMW = 0')
# Ask Snowflake for the table information, which forces fresh Iceberg metadata (see the note below)
snow.execute("SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('ONELAKE.iceberg.scada');")
In rare cases, especially when running multiple transactions at the same time, Snowflake may not instantly generate the Iceberg metadata. To be 100% sure, run the last SQL statement above (SYSTEM$GET_ICEBERG_TABLE_INFORMATION) to force the engine to write new Iceberg metadata. It's an annoying aspect of Iceberg: every commit generates three files, which is a bit excessive, so some engines prefer to group multiple commits to reduce the size of the metadata. Again, it's rare, but it does happen.
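For completeness, here is a rough sketch of steps 3 to 5 from the list above: counting the rows from the Snowflake (Iceberg) side and from the Fabric (Delta) side, then appending both numbers to a results table. The snow cursor, the spark session, and the table identifiers mirror my notebook, but the exact helper calls here are illustrative assumptions rather than the verbatim notebook code.

from datetime import datetime, timezone

# Step 3: total rows as seen by Snowflake through the Iceberg table
snow.execute('select count(*) from ONELAKE.ICEBERG.scada;')
iceberg_count = snow.fetchone()[0]

# Step 4: total rows as seen by the Fabric notebook reading the same data as Delta
delta_count = spark.sql('select count(*) as c from scada').collect()[0]['c']

# Step 5: append both counts to a results table so they can be plotted over time
row = [(datetime.now(timezone.utc), iceberg_count, delta_count)]
results = spark.createDataFrame(row, ['run_time', 'iceberg_count', 'delta_count'])
results.write.format('delta').mode('append').saveAsTable('results')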
This is more or less the industry consensus on how a Lakehouse architecture should look in 2025.
By now, it’s become clear that Parquet is the de facto standard for storing data, and using an object store to separate storage from compute makes a lot of sense.
Another interesting development is how vendors want to package this offering. Storage vendors saw an opportunity to do more—after all, there’s no law that says the metastore belongs to the data warehouse! So you get things like S3 Tables and Cloudflare R2, which I think is a good thing, especially if you’re a smaller analytics vendor. Life becomes much easier when table maintenance is done upstream, allowing you to focus solely on making the query engine faster.
Encouraging things are also happening in the table format space. I know a bit about Iceberg and Delta, but not much about the others. One very interesting development is Iceberg adopting deletion vectors from Delta in the V3 spec, while Delta will require a catalog for reads and writes (at least for catalog-managed tables). I like to call it the “Icebergification” of Delta.
Another trend is the Delta Java writer making it easier to auto-generate Iceberg metadata, and XTable is doing the same regardless of the Delta writer. At this stage, one could argue: why do we need two table formats that are becoming virtually identical?
Data Analyst—How About Me?
These improvements mostly impact the write path, which is primarily managed by data engineers. But what about data analysts and end users?
If you have Fabric OneLake, you can use Direct Lake in OneLake mode. Marco has a great article about it. It’s a fantastic improvement compared to the initial version of Direct Lake. However, it doesn’t solve the problem if your data is hosted in S3 Tables or a BigQuery Iceberg table. Yes, you can create a shortcut to OneLake and read it from there, but that still depends on a data engineer setting it up.
Now imagine a world where an Excel, Tableau, or Power BI Desktop user (or any arbitrary client tool) can just point to a Lakehouse using a standard API, discover tables, read data, and build reports. Honestly, this isn’t a big ask; we already have this when connecting to databases using ODBC, and I don’t see any technical reason why we can’t have the same experience with Lakehouses.
We Already Have This API
For me, the most promising development in the Lakehouse ecosystem is the Iceberg Catalog REST API, and I genuinely hope it becomes a standard—just like ODBC is today (and hopefully ADBC in the future, but that’s another topic).
Again, speaking as a data analyst, I want my tools to support the read part of the API—just the ability to list tables and scan a table. That’s all. I have zero interest in how the data is stored or which table format is used. The catalog should be smart enough to generate metadata on the fly.
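To make that concrete, here is what the read-only surface looks like with PyIceberg against any REST catalog; the endpoint, token, warehouse, namespace, and table names are placeholders for your own setup.

from pyiceberg.catalog import load_catalog

# Connect to any Iceberg REST catalog; uri/token/warehouse values are placeholders
catalog = load_catalog(
    'default',
    **{
        'type': 'rest',
        'uri': rest_endpoint,
        'token': token,
        'warehouse': warehouse,
    },
)

# Discovery: list namespaces and the tables inside one of them
print(catalog.list_namespaces())
print(catalog.list_tables('aemo'))

# Scan: read a table into Arrow, with no interest in how or where it is stored
table = catalog.load_table('aemo.scada')
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)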
The Good News
We’re getting there—at least if you’re using a Python notebook. Here’s an example where I use the same Iceberg REST API to query a table from four different Lakehouse implementations using Daft.
from pyiceberg.catalog import load_catalog
from pyiceberg.catalog.rest import RestCatalog

def connect_catalog(cat):
    match cat:
        case 'polaris':
            # Polaris
            catalog = load_catalog(
                'default',
                uri=polaris_endpoint,
                warehouse='dwh',
                scope='PRINCIPAL_ROLE:data_engineer',
                credential=polaris_key,
            )
        case 's3':
            # AWS S3 Tables, authenticated with SigV4
            catalog = load_catalog(
                'default',
                **{
                    "type": "rest",
                    "warehouse": s3_warehouse,
                    "uri": "https://s3tables.us-east-2.amazonaws.com/iceberg",
                    "rest.sigv4-enabled": "true",
                    "rest.signing-name": "s3tables",
                    "rest.signing-region": "us-east-2",
                },
            )
        case 'uc':
            # Unity Catalog Iceberg REST endpoint
            catalog = load_catalog(
                'default',
                token=token,
                uri=endpoint,
                warehouse='ne',
            )
        case 'r2':
            # Cloudflare R2 Data Catalog
            catalog = RestCatalog(
                name='default',
                token=token_r2,
                uri=endpoint_r2,
                warehouse=r2_warehouse,
            )
        case _:
            raise ValueError(f'unknown catalog: {cat}')
    return catalog
Then, I run a standard SQL query using Daft SQL.
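Putting the two pieces together, this is roughly what that query step looks like, assuming Daft's read_iceberg and SQL support; the 'r2' choice and the aemo.scada identifier are placeholders from my setup.

import daft

# Pick one of the four catalogs and load the table through PyIceberg
catalog = connect_catalog('r2')
scada = daft.read_iceberg(catalog.load_table('aemo.scada'))

# Daft SQL can reference the in-scope DataFrame by its variable name
result = daft.sql("select count(*) as total_rows from scada").collect()
result.show()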
Final Thoughts
It took Parquet a decade to become a standard. We may or may not have a single standard table format—and maybe we don’t need one. But if we want this Lakehouse vision to become mainstream, then everyone should support the Iceberg Catalog REST API, at least for read operations.
I recently had a conversation about this topic and realized that it’s not widely known that Snowflake can read Delta tables hosted in OneLake. So, I thought I’d share this in a blog post.
Fundamentally, this process is similar to how XTable in Fabric works, but in reverse—it converts a Delta table to Iceberg by translating the table metadata (AFAIK, Snowflake doesn’t use XTable but an internal tool).
Recommended Documentation
For detailed information, I strongly recommend reading the official Snowflake documentation: 🔗 Create Iceberg Table from Delta
How It Works
External Volume and File Section
When creating an external volume in Snowflake that points to OneLake, only the Files section is supported. This isn’t an issue because you can simply add a shortcut that points to a schema.
SQL Code to Set Up External Volume and Map an Existing Table