LLMs Are Getting Better at SQL

For no particular reason, I recently opened a Power BI report with two fact tables, exported the model definition to TMDL, saved it as a file, and loaded it into several state-of-the-art LLMs, including ChatGPT, Grok, Anthropic, and Qwen QWQ. I wasn’t particularly interested in comparing specific models—I just wanted to get a general sense of how they performed. To keep things simple, I used only the free tiers.

To my surprise, these models are getting really smart. They all recognized that the file represented a Power BI semantic model. They understood the concept of tables, columns, relationships, and measures, which was quite impressive.

You can download the Model here , i added a semantic Model as a text file in case you don’t have PowerBI

The Power of Semantic Models

Power BI, and before that PowerPivot, has supported multiple fact tables from day one. I remember the first time I used PowerPivot— a colleague showed me how to handle multiple fact tables: just don’t join fact to fact. Instead, use a common dimension, and everything will work fine. Back then, I had no idea about Kimball, multi-stage SQL queries, chasm trap etc. As a business user, I only cared that I got the correct results. It took me years to appreciate the ingenuity and engineering that went into making this work seamlessly. To me, this is an example of great product design: solve a very hard problem, and people will use it.

But this blog isn’t about Power BI and DAX—that’s a solved problem. Instead, I wanted to explore whether LLMs, by reading only metadata, could generate correct SQL queries for relatively complex questions.

Note: for people not familiar with TMDL, it is a human readable format of PowerBI semantic Model ( if human can read it, then LLMs will understand it too)

Testing LLMs with SQL Generation

I started with simple queries involving only one fact table, and the results were straightforward. The models correctly referenced measure definitions, avoiding column name hallucinations. Good news so far.

Things got more interesting when I asked for a query using a more complex measure, such as Net Sales:

Net Sales = [Total Sales] - [Total Returns]

This measure involves two tables: store_sales and store_returns.

Expected SQL Approach: Drilling Across

My expectation was that the models would use the Drilling Across technique. For example, when asked for net sales and cumulative sales by year and brand, a well-structured query would look like this:

  1. Compute total sales by year and brand.
  2. Compute total returns by year and brand.
  3. Join these results on year and brand.
  4. Calculate net sales as [Total Sales] - [Total Returns].
  5. Calculate cumulative Sales

Most LLMs generated a similar approach. Some required a hint (“don’t ever join fact to fact”), but the biggest issue was the final join. Many models defaulted to a LEFT JOIN, which is problematic—some returns for a brand might not occur in the same period as the sales, leading to incomplete results.

That said, some models got it right on the first attempt, which was very impressive. something like this:

WITH sales AS (
    SELECT
        d.d_year,
        i.i_brand,
        SUM(s.ss_sales_price * s.ss_quantity) AS total_sales
    FROM store_sales s
    JOIN item i ON s.ss_item_sk = i.i_item_sk
    JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
    GROUP BY d.d_year, i.i_brand
),
returns AS (
    SELECT
        d.d_year,
        i.i_brand,
        SUM(r.sr_return_amt) AS total_returns
    FROM store_returns r
    JOIN item i ON r.sr_item_sk = i.i_item_sk
    JOIN date_dim d ON r.sr_returned_date_sk = d.d_date_sk
    GROUP BY d.d_year, i.i_brand
)
SELECT
    COALESCE(s.d_year, r.d_year) AS year,
    COALESCE(s.i_brand, r.i_brand) AS brand,
    COALESCE(s.total_sales, 0) AS total_sales,
    COALESCE(r.total_returns, 0) AS total_returns,
    (COALESCE(s.total_sales, 0) - COALESCE(r.total_returns, 0)) AS net_sales,
    SUM(COALESCE(s.total_sales, 0) - COALESCE(r.total_returns, 0))
        OVER (PARTITION BY COALESCE(s.i_brand, r.i_brand) ORDER BY COALESCE(s.d_year, r.d_year)
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_net_sales
FROM sales s
FULL OUTER JOIN returns r
ON s.d_year = r.d_year AND s.i_brand = r.i_brand
ORDER BY COALESCE(s.i_brand, r.i_brand), COALESCE(s.d_year, r.d_year);

An Alternative Approach: UNION ALL

One LLM got creative and proposed a different solution using UNION ALL:

WITH combined AS (
    SELECT d.d_year, i.i_brand, 'sales' AS type, s.ss_sales_price * s.ss_quantity AS amount
    FROM store_sales s
    JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
    JOIN item i ON s.ss_item_sk = i.i_item_sk
    UNION ALL
    SELECT d.d_year, i.i_brand, 'returns' AS type, r.sr_return_amt AS amount
    FROM store_returns r
    JOIN date_dim d ON r.sr_returned_date_sk = d.d_date_sk
    JOIN item i ON r.sr_item_sk = i.i_item_sk
)
SELECT
    d_year AS year,
    i_brand AS brand,
    SUM(CASE WHEN type = 'sales' THEN amount ELSE 0 END) AS total_sales,
    SUM(CASE WHEN type = 'returns' THEN amount ELSE 0 END) AS total_returns,
    (SUM(CASE WHEN type = 'sales' THEN amount ELSE 0 END) - SUM(CASE WHEN type = 'returns' THEN amount ELSE 0 END)) AS net_sales,
    SUM(SUM(CASE WHEN type = 'sales' THEN amount ELSE 0 END) - SUM(CASE WHEN type = 'returns' THEN amount ELSE 0 END))
        OVER (PARTITION BY i_brand ORDER BY d_year ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_net_sales
FROM combined
GROUP BY d_year, i_brand
ORDER BY i_brand, d_year;

Edit : people who know what they are doing , did not like the union all solution for performance reason

So, Is SQL a Solved Problem?

No, of course not. There’s still no guarantee of getting the same answer when asking the same question multiple times. Additionally, AI has no real understanding of the data. The moment you introduce filters, you can get all sorts of unexpected results,Although Providing sample values from columns into a semantic model helps a lot, still real-world models are far more complex and often contain ambiguities.

While LLMs generate correct SQL, they don’t always generate the most efficient queries, But you can argue, it is the Database problem to solve 🙂

Compute Considerations

I don’t know enough to make a strong statement about the current cost of running LLMs, but I ran a simple experiment: I tested QWQ-32 on my laptop (which has an NVIDIA RTX A2000 GPU with 4GB of dedicated memory). The good news? It worked. Running an open-source “thinking model” on a personal laptop is already impressive. The bad news? It was painfully slow—so much so that I’m not eager to repeat the test.

From a practical perspective, generating SQL queries this way seems extremely expensive. A decent BI tool can generate similar queries in milliseconds, using orders of magnitude less compute power, but as LLMs continue to improve in efficiency, maybe in a couple of years, the compute requirement to generate SQL Queries will become trivial ?

Final Thoughts

Having experimented with LLMs since 2023, I can say with confidence that they have improved significantly ( LLMs in 2023 were not very good to be honest). They are getting better and, more importantly, cheaper, opening the door for new applications.

The initial promise was that you could just throw data at an AI system, and it would figure out everything. I suspect that’s not true. In reality, to get useful results, you need more structure, and well-defined semantic models will play a crucial role in making asking your Data a reality.

Optimizing Fabric Capacity Consumption: A Practical Approach

If you’re utilizing Microsoft Fabric and notice consistently high usage—perhaps averaging 90%—or experience occasional throttling, it’s a clear indication that you’re nearing your capacity limit. This situation could pose challenges as you scale your workload. To address this, you have three primary options to consider:

  1. Upgrade to a Higher SKU
    Increasing your capacity by upgrading to a higher SKU is a straightforward solution, though it comes with additional costs. This option ensures your infrastructure can handle increased demand without requiring immediate changes to your workflows.
  2. Optimize Your Workflows
    Workflow optimization can reduce capacity consumption, though it’s not always a simple task. Achieving meaningful improvements often demands a higher level of expertise than merely maintaining functional operations. This approach requires a detailed analysis of your processes to identify inefficiencies and implement refinements.
  3. Reduce Data Refresh Frequency
    A practical and often overlooked option is to adjust the freshness of your data. Review the workflows consuming the most capacity and assess whether the current refresh rate is truly necessary. For instance, if a process runs continuously or frequently throughout the day, consider scheduling it to operate within a specific window—say, from 9 AM to 5 PM—or at reduced intervals. As an example, I adjusted a Python notebook that previously refreshed every 5 minutes to run hourly, limiting it to 10 executions per day. The results demonstrated a significant reduction in capacity usage without compromising core business needs.

8000 CU(s) total usage for a whole solution is just a bargain !!! Python Notebook and Onelake are just too good !!!  the red bar is the limit of F2

Choosing the Right Path

If your organization has a genuine requirement for frequent data refreshes, reducing freshness may not suffice. In such cases, you’ll need to weigh the benefits of optimization against the simplicity of upgrading your SKU. 

Reading a Delta Table Hosted in OneLake from Snowflake

I recently had a conversation about this topic and realized that it’s not widely known that Snowflake can read Delta tables hosted in OneLake. So, I thought I’d share this in a blog post.

Fundamentally, this process is similar to how XTable in Fabric works, but in reverse—it converts a Delta table to Iceberg by translating the table metadata ( AFAIK, Snowflake don’t use Xtable but an internal tool)

Recommended Documentation

For detailed information, I strongly recommend reading the official Snowflake documentation:
🔗 Create Iceberg Table from Delta


How It Works

External Volume and File Section

When creating an external volume in Snowflake that points to OneLake, only the Files section is supported. This isn’t an issue because you can simply add a shortcut that points to a schema.

SQL Code to Set Up External Volume and Map an Existing Table

CREATE OR REPLACE EXTERNAL VOLUME onelake_personal_tenant_delta
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'onelake_personal_tenant_delta',
            STORAGE_PROVIDER = 'AZURE',
            STORAGE_BASE_URL = 'azure://onelake.dfs.fabric.microsoft.com/python/data.Lakehouse/Files',
            AZURE_TENANT_ID = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
         )
      );

DESC EXTERNAL VOLUME onelake_personal_tenant_delta;

One-Time Setup: Authentication

You’ll need to complete a one-time authentication step:

  1. Copy the AZURE_CONSENT_URL from the output.
  2. Open it in your browser: "AZURE_CONSENT_URL":"https://login.microsoftonline.com/xx/oauth2/authorize?client_id=yy&response_type=code"
  3. Ensure you select the correct email address that has access to your tenant.
  4. This will create an Azure multi-tenant app with a service principal.
  5. Add this service principal to your workspace so that Snowflake can access the data.

Setting Up the Iceberg Table

CREATE OR REPLACE CATALOG INTEGRATION delta_catalog_integration
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = DELTA
  ENABLED = TRUE;

CREATE DATABASE IF NOT EXISTS ONELAKE;

CREATE SCHEMA IF NOT EXISTS delta;

CREATE ICEBERG TABLE ONELAKE.delta.test
  CATALOG='delta_catalog_integration'
  EXTERNAL_VOLUME = 'onelake_personal_tenant_delta'
  BASE_LOCATION = 'aemo/test/';

SELECT COUNT(*) FROM ONELAKE.delta.test;


and here is the result

Limitations

  • Deletion Vectors Are Not Supported:
    If your table contains deletion vectors, make sure to run OPTIMIZE to handle them properly.

Reading BigQuery Iceberg Tables in Fabric

This is a quick guide on correctly reading Iceberg tables from BigQuery. Currently, there are two types of Iceberg tables in BigQuery, based on the writer:

BigQuery Iceberg Table

This is the table written using BigQuery engine

Iceberg Tables Written Using the BigQuery Metastore

Currently, only Spark is supported. I assume that, at some point, other engines will be added. The implementation is entirely open source but currently supports only Java (having a REST API would have been a nice addition).

How OneLake Iceberg Shortcuts Work

OneLake reads both the data and metadata of an Iceberg table from its storage location and dynamically generates a Delta Lake log. This is a quick and cost-effective operation, as it involves only generating JSON files. See an example here

The Delta log is added to OneLake, while the source data remains read-only. Whenever you make changes to the Iceberg table, new metadata is generated and translated accordingly. The process is straightforward.

BigQuery Iceberg Doesn’t Publish Metadata Automatically

BigQuery uses an internal system to manage transactions. When querying data from the BigQuery SQL endpoint, the results are always consistent. However, reading directly from storage may return an outdated state of the table.

For BigQuery Iceberg tables, you need to manually run the following command to update the metadata:

EXPORT TABLE METADATA FROM dataset.iceberg_table;

you can run it on a schedule, or make it the last step in an ETL pipeline.

Iceberg Tables Using the BigQuery Metastore (Written by Spark)

If the Iceberg table is written using the BigQuery metastore (e.g., by Spark), no additional steps are required. The metadata is automatically updated.

The interesting part about Iceberg’s translation to a Delta table in OneLake is that it is completely transparent to Fabric workloads. For example, Power BI simply recognizes it as a regular Delta table. 😊