Optimizing Parquet Layout for Power BI Direct Lake Mode

TL:DR; Delta_rs does not support Vorder so we need workaround, we notice by changing row groups size and sorting data we improved a direct Lake model from not working at all to returning queries in 100 ms

I was having another look at Fabric F2 (hint: I like it very much; you can watch the video here). I tried to use Power BI in Direct Lake mode, but it did not work well, and I encountered memory errors. My first instinct was to switch it to Fabric DWH in Direct Query mode, and everything started working again.

Obviously, I did complain to the Power BI team that I was not happy with the results, and their answer was to just turn on V-order, which I was not using. I had used Delta_rs to write the Delta table, and the reason I never thought about Parquet optimization was that when I used F64, everything worked just fine since that SKU has more hardware. However, F2 is limited to 3 GB of RAM.

There are many scenarios where Fabric is primarily used for reading data that is produced externally. In such cases, it is important to understand how to optimize those Parquet files for better performance.

Phil Seamark (yes the same guy who built a 3D games using DAX) gave me some very good advice: you can still achieve very good performance even without using V-order. Just sort by date and partition as a first step, and you can go further by splitting columns.

As I got the report working perfectly, even inside F2, I thought it was worth sharing what I learned.

Note: I used the term ‘parquet’ as it is more relevant than specifying the table format, whether it is a Delta table or Iceberg, after all this is where the data is stored, there is no standard for Parquet layout, different Engines will produce different files with massive difference in row group, file size, encoding etc.

Memory Errors

This is not really something you would like to see, but that’s life. F2 is the lowest tier of Fabric, and you will encounter hardware limitations.

A screenshot of a computer

Description automatically generated

When trying to run a query from DAX Studio, you will get the same error:


Rule of Thumb

Split datetime into date and time

It may be a good idea to split the datetime column into two separate columns, date and time. Using a datetime column in a relationship will not give you good performance as it usually has a very high number of distinct values.

Reduce Precision if You Don’t Need It

If you have a number with a lot of decimals, changing the type to decimal(18,4) may give you better compression.

Sorting Data

Find Column Cardinality

First, let’s check the cardinality of all columns in the Delta table. I used DuckDB for this, but you can use any tool—it’s just a personal choice.

First, upgrade DuckDB and configure authentication:

!pip install -q duckdb --upgrade

token = notebookutils.credentials.getToken('storage')

duckdb.sql(f"""  
  CREATE OR REPLACE SECRET onelake (TYPE AZURE, PROVIDER ACCESS_TOKEN, ACCESS_TOKEN '{notebookutils.credentials.getToken('storage')}')  
""")

Then, run SUMMARIZE:

display(duckdb.sql(f"""  
  SUMMARIZE (from delta_scan('abfss://python@onelake.dfs.fabric.microsoft.com/data.Lakehouse/Tables/aemo/summary'))  
""").df())

And here is the result:

Automatic Sorting from Low to High Cardinality

A very simple rule is to sort the columns from low to high cardinality , in this example : time, duid, date, price, mw.

Columns like cutoff don’t matter as they have only one value.

The result isn’t too bad—I went from 753 MB to 643 MB.

However, this assumes that the column has a uniform data distribution, which is rarely the case in real life. In more serious implementations, you need to account for data skewness.


Sort Based on Query Patterns

I built the report, so I know exactly the types of queries. The main columns used for aggregation and filters are date and duid, so that’s exactly what I’m going to use: sort by date, then duid, and then from low to high cardinality.

I think I just got lucky—the size is now 444 MB, which is even better than V-order. Again, this is not a general rule; it just happened that for my particular fact table, with that particular data distribution, this ordering gave me better results.

But more importantly, it’s not just about the Parquet size. Power BI in Direct Lake mode (and import mode) can keep only the columns used by the query into memory at the row group level. If I query only the year 2024, there is a good chance that only the row groups containing that data will be kept into memory. However, for this to work, the data must be sorted. If 2024 data is spread all over the place, there is no way to page out the less used row groups.

edit : to be very clear, Vertipaq needs to see all the data to build a global dictionary, so initially all columns needed for a query has to be fully loaded into Memory.

More Advanced Sorting Heuristics

These are more complex row-reordering algorithms. Instead of simply sorting by columns, they analyze the entire dataset and reorder rows to achieve the best Run-Length Encoding (RLE) compression across all columns simultaneously. I suspect that V-order uses something similar, but this is a more complex topic that I don’t have much to say about.

To make matters more complex, it’s not just about finding a near-optimal solution; the algorithm must also be efficient. Consuming excessive computational resources to determine the best reordering might not be a practical approach.

If you are into this kind of thing, there is a nice phd thesis here to read

Split Decimal Numbers

Reducing column cardinality by splitting decimals into separate columns can also help. For example, instead of storing price = 73.3968, store it in two columns: 73 and 3968.

Indeed, this gave even better results—a size reduction to 365 MB.

To be totally honest, though, while it gave the best compression result, I don’t feel comfortable using it. Let’s just say it’s for aesthetic reasons, and because the data is used not only for Power BI but also for other Fabric engines. Additionally, you pay a cost later when combining those two columns.

Partitioning

Once the sorting is optimized, note that compression occurs at the row group level. Small row groups won’t yield better results.

For this particular example, Delta_rs generates row groups with 2 million rows, even when I changed the options. I might have been doing something wrong. Using Rust as an engine reduced it to 1 million rows. If you’re using Delta_rs, consider using pyarrow instead:

RG = 8_000_000

write_deltalake(
    path, df, mode="overwrite", 
    max_rows_per_file=RG, max_rows_per_group=RG, min_rows_per_group=RG, 
    engine="pyarrow"
)

Notice here, I’m not partitioning by column but by file. This ensures uniform row groups, even when data distribution is skewed. For example, if the year 2020 has 30M rows and 2021 has 50M rows, partitioning by year would create two substantially different Parquet files.

Testing Again in Fabric F2

Using F2 capacity, let’s see how the new semantic models behave. Notice that the queries are generated from the Power BI report. I manually used the reports to observe capacity behavior.

A screenshot of a computer

Description automatically generated

Testing Using DAX Studio

To understand how each optimization works, I ran a sample query using DAX Studio. Please note, this is just one query and doesn’t represent the complexity of existing reports, as a single report generates multiple queries using different group-by columns. So, this is not a scientific benchmark—just a rough indicator.

Make sure that PowerBI has no data in memory, you can use the excellent semantic lab for that

!pip install -q semantic-link-labs
import sempy_labs as labs
import sempy.fabric as fabric
def clear(sm):
    labs.refresh_semantic_model(sm, refresh_type='clearValues')
    labs.refresh_semantic_model(sm)
    return "done"
for x in ['sort_partition_split_columns','vorder','sort_partition','no_sort']:
    clear(x)

Cold Run

The cold run is the most expensive query, as VertiPaq loads data from OneLake and builds dictionaries in memory. Only the columns needed are loaded. Pay attention to CPU duration, as it’s a good indicator of capacity unit usage.

Warm Run

All queries read from memory, so they’re very fast, often completing in under 100 milliseconds, SQL endpoint return the data around 2 seconds, yes, it is way slower than vertipaq but virtually it does not consume any interactive compute.

using DAX Studio you can view the column loaded into Memory, any column not needed will not be loaded

and the total data size into Memory

Hot Run

This isn’t a result cache. VertiPaq keeps the result of data scans in a buffer (possibly called datacache). If you run another query that can reuse the same data cache, it will skip an unnecessary scan. Fabric DWH would greatly benefit from having some sort of result cache (and yes, it’s coming).

Takeaways

  • VertiPaq works well with sorted Parquet files. V-order is one way to achieve that goal, but it is not a strict requirement.
  • RLE Encoding is more effective with larger row groups, and when the data is sorted
  • Writing data in Fabric is inexpensive; optimizing for user-facing queries is more important.
  • Don’t dismiss Direct Query mode in Fabric DWH—it’s becoming good enough for interactive, user-facing queries.
  • Fabric DWH background consumption of compute unit is an attractive proposition
  • Power BI will often read Parquet files written by Engines other than Fabric. A simple UI to display whether a Delta Table is optimized would be beneficial.
  • Fabric DWH appears to handle less optimized Parquet files with greater tolerance

Building an Ad Hoc Disk Cache with DuckDB and Fabric Notebook

This weekend, I came up with an idea to speed up query execution when running DuckDB inside a Fabric Notebook—and it actually works! 🎉

You can download the notebook here


Approach

  1. Parse the Query
    • Use SQGLot to parse the SQL query and extract the list of Delta tables that need to be scanned from OneLake.
  2. Track Table Metadata
    • Capture the Delta table version and ID to ensure consistency.
  3. Selective Copy
    • Copy only the necessary tables locally to satisfy the query.
  4. Reuse Cached Data
    • For subsequent queries, check if the Delta table has changed:
      • If unchanged, read data from the local SSD.
      • If new tables are required, repeat the caching process for those tables.

Why It Works

This approach effectively creates a temporary, ad hoc disk cache in the notebook. The cache:

  • Persists only for the session’s duration.
  • Evicts automatically when the session ends.
  • Ensures consistency by validating whether the Delta table in OneLake has changed before reusing cached data.
    • Thanks to the Delta format, this validation is a relatively cheap operation.
  • Leverages the user-level isolation in Fabric notebooks to eliminate risks of data inconsistency.

Despite its simplicity, this method has proven to be highly effective for query acceleration! 🚀


Limitations

Yes, I know—the cache is rather naïve since it loads the entire table. Other systems go further by:

  • Copying only the columns needed for the query.
  • Fetching just the row groups relevant to the query.

However, these optimizations would need to be implemented natively by the engine itself.


Industry Gap

Although virtually all Python engines (e.g., Polars, DataFusion, etc.) support reading formats like Delta and Iceberg, almost none offer built-in disk or RAM caching. This lack of caching support limits performance optimization opportunities.

Hopefully, this will change in the future, enabling more efficient workflows out of the box.

Btw, this is really fast !!! just a hint, this is faster than the results obtained by a state of the art DWH in 2022 !!!

Reading Delta table in Fabric Python notebook without attaching a lakehouse. 

This is just a code snippet on how to read a delta table using pure Python (no Spark) without attaching a lakehouse.

First in spark notebook you need to install those two package ( this step will be unnecessary with pure Python notebook)

%pip install deltalake
!pip install duckdb

Then get the abfss path of the table, it can be anywhere, even in different workspace.

from deltalake import DeltaTable
import duckdb
access_token = notebookutils.credentials.getToken('storage')
storage_options=     {"bearer_token": access_token, "use_fabric_endpoint": "true"}
TAXI = DeltaTable('abfss://xxx@onelake.dfs.fabric.microsoft.com/NY.Lakehouse/Tables/ny/taxi',storage_options = storage_options).to_pyarrow_dataset()

For this example the table is not small ( 1.3 billion rows) , having filter pushdown is a must for a good user experience, for smaller data it does not matter.

let’s show 5 rows, the limit is pushed to the source, we don’t need to scan 20 GB just to see some rows.

display(duckdb.sql(f''' SELECT * from TAXI limit 5 ''').df())

Note : you can use any Engine that understand arrow dataset , personally I prefer duckdb but it is a personal taste

Now let’s filter the data only for this year, again, filter pruning works, what I really like ; although the table is not partitioned somehow the scan is leveraging the stats in the delta table log

data = duckdb.sql(f''' SELECT date ,  ROUND (SUM (fare_amount),0) as TotalFares , ROUND (AVG (fare_amount),0) as AVGFares
             FROM TAXI where year = 2024 GROUP BY ALL ''').df()
display(data)

Not everything is perfect yet 😦

max (column) , count(*) unfortunately does not use the delta log and trigger a whole table scan.

let’s show some interactive chart

I have the aggregated data already, using the excellent library Altair, I can easily plot an interactive chart

import altair as alt
brush = alt.selection_interval()
details = alt.Chart(data).mark_bar().encode(alt.X('date:T'),  alt.Y('TotalFares:Q'), tooltip=[alt.Tooltip('date:T',format='%Y-%m-%d %H'),'TotalFares:Q']
).properties( width=1400,  height=400 ).add_params( brush)

summary = alt.Chart(data).mark_square().encode( alt.X('date:T'), alt.Y('AVGFares:Q'), tooltip=['AVGFares:Q'] ).properties(  width=1400,  height=400).transform_filter( brush)
details & summary

honestly, I did not know how much covid had impacted the Taxi industry 😦

You can use Other Engines too

as I said it is not specific to duckdb, for example using Polars

import polars as pl
access_token = notebookutils.credentials.getToken('storage')
storage_options=     {"bearer_token": access_token, "use_fabric_endpoint": "true"}
scada = pl.scan_delta('abfss://xxx@onelake.dfs.fabric.microsoft.com/NY.Lakehouse/Tables/ny/taxi', storage_options=storage_options)
x = scada.limit(10).collect()

and Daft

import daft
from daft.io import IOConfig, AzureConfig
io_config = IOConfig(azure=AzureConfig(storage_account="onelake",endpoint_url="https://onelake.blob.fabric.microsoft.com",bearer_token=access_token))
df = daft.read_deltalake('abfss://xxx@onelake.dfs.fabric.microsoft.com/NY.Lakehouse/Tables/ny/taxi', io_config=io_config)
df

Process 1 Billion rows of raw csv in Fabric Notebook for less than 20 Cents 

The Use case

Data source is around 2200 files with a total of 897 Million rows of weird csv files (the file has more columns than the header) , This is a real world data not some synthetic dataset, it is relatively small around 100 GB uncompressed.

The Pipeline will read those files and extract clean data from it using non trivial transformation and save it as a Delta Table.

we used the smallest Compute available in Fabric Notebook which is 4 cores with 32 GB. to be clear this is a real single node (not 1 driver and 1 executor), Although the Runtime is using Spark, All the Engines interact Directly with the Operating system, as far as I can tell, Spark has a very minimum overhead when not used Directly by the Python code.

You need to pick the Engine

Nowadays we have plenty of high quality Calculation Engines,  but two seems to gain traction (Polars and DuckDB) , at least by the number of package downloaded and the religious wars that seems to erupt occasionally in twitter 🙂

For a change I tried to use Polars, as I was accused of having a bias toward DuckDB, long story short, hit a bug with Polars , I tried Datafusion too but did managed to get a working code, there is not enough documentation on the web, after that I did test Clickhouse chdb, but find a bug, anyway the code is public, feel free to test your own Engine.

So I ended up using DuckDB, the code is published here , it is using only 60 files as it is available publicly, the whole archive is saved in my tenant (happy to share it if interested) 

The results is rather surprising (God bless Onelake throughput), I am using the excellent Python Package Delta Lake to write to Onelake

26 minutes, that’s freaking fast, using Fabric F2, the total cost will be

0.36 $/Hour X(26/60) =  15 Cents

you need to add a couple of cents for Onelake Storage Transactions.

As far as I can tell, this is maybe one of the cheapest option in the Market.

0.36 $/Hour is the rate for pay as you go, if you have a reservation then it is substantially cheaper.

because it is Delta Table Then Any Fabric Engine ( SQL, PowerBI, Spark) can read it.

What’s the catch ?

Today DuckDB can not write directly to Delta Table ( it is coming though eventually) instead it will export data to Delta Lake writer using Arrow Table, it is supposed to be zero copy but as far as I can tell, it is the biggest bottleneck and will generate out of memory errors , the solution is easy ; process the files in chunks , not all at once

#############################################
list_files=[os.path.basename(x) for x in glob.glob(Source+'*.CSV')]
files_to_upload_full_Path = [Source + i for i in list_files]
if len(files_to_upload_full_Path) >0 :
  for i in range(0, len(files_to_upload_full_Path), chunk_len):
    chunk = files_to_upload_full_Path[i:i + chunk_len]
    df=get_scada(chunk)
    write_deltalake("/lakehouse/default/Tables/scada_duckdb",df,mode="append",engine='rust',partition_by=['YEAR'],storage_options={"allow_unsafe_rename":"true"})
    del df

By experimentation, I notice 100 files works fine with 16 GB, 200 files with 32 GB etc

When exporting to Parquet, DuckDB managed the memory natively and it is faster too.

Native Lakehouse Is the future of Data Engineering

The combination of Open table format like Delta and Iceberg with ultra efficient Open Source Engine like DuckDB, Polars, Velox, datafusion all written in C++/Rust will give data engineers an extremely powerful tools to build more flexible and way cheaper data solutions.

if I have to give an advice for young Data engineers/Analysts, Learn Python/SQL.

Would like to thanks Pedro Holanda for fixing some very hard to reproduce bugs in the DuckDB csv reader.

And Ion Koutsouris for answering my silly questions about Delta lake writer.