Using gpt-oss 20B for Text to SQL 

TL;DR : 

As a quick first impression, I tested generating SQL queries from a YAML-based semantic model (all the files are stored here). Considering I have only 4 GB of VRAM, it is not bad at all!

To be clear, this is not a very rigorous benchmark; I just used the results of the last runs, and different runs will give you slightly different results. It is just to get a feeling for the model, but it is a workload I do care about, which is the only thing that matters, really.

The Experimental Setup

The experiment uses a SQL generation test based on the TPC-DS dataset (scale factor 0.1), featuring a star schema with two fact tables (store_sales, store_returns) and multiple dimension tables (date_dim, store, customer, item). The models were challenged with 20 questions ranging from simple aggregations to complex analytical queries requiring proper dimensional modeling techniques.

The Models Under Test

  1. O3-Mini (Reference Model): Cloud-based Azure model serving as the “ground truth”
  2. Qwen3-30B-A3B-2507: Local model via LM Studio
  3. GPT-OSS-20B: Local model via LM Studio

I usually use Ollama, but I moved to LM Studio because of its MCP tools support. As for Qwen3, strangely, code did not perform well at all, and the thinking mode is simply too slow.

Key Technical Constraints

  • Memory Limit: Both local models run on 4 GB of VRAM; my laptop has 32 GB of RAM
  • Timeout: 180 seconds per query
  • Retry Logic: Up to 1 attempt at syntax error correction (a minimal harness sketch follows this list)
  • Validation: Results compared using value-based matching (exact, superset, subset)
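
To make these constraints concrete, here is a minimal sketch of how a 180-second budget and a single correction attempt could be wired up. This is not the actual harness; generate_sql and run_duckdb are hypothetical helpers standing in for the model call and the query execution:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT_S = 180   # budget per question
MAX_FIXES = 1     # one extra attempt to correct syntax errors

def one_attempt(question, error, generate_sql, run_duckdb):
    sql = generate_sql(question, error=error)   # hypothetical model call
    return run_duckdb(sql)                      # hypothetical DuckDB execution

def answer_question(question, generate_sql, run_duckdb):
    error = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(MAX_FIXES + 1):
            future = pool.submit(one_attempt, question, error, generate_sql, run_duckdb)
            try:
                return future.result(timeout=TIMEOUT_S)
            except TimeoutError:
                return None       # too slow: recorded as a failure
            except Exception as exc:
                error = str(exc)  # feed the error back for one correction attempt
    return None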

The Testing Framework

The experiment employs a robust testing framework with several features:

Semantic Model Enforcement

The models were provided with a detailed semantic model (an illustrative excerpt follows the list) that explicitly defines:

  • Proper dimensional modeling principles
  • Forbidden patterns (direct fact-to-fact joins)
  • Required CTE patterns for combining fact tables
  • Specific measure definitions and business rules
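
The real semantic model lives in the linked files; as a flavour of what these rules look like, here is an illustrative (not verbatim) excerpt, shown as the system-prompt string the models receive:

# Illustrative excerpt only; the actual YAML semantic model is in the linked files.
SEMANTIC_MODEL = """
tables:
  store_sales:   {type: fact, grain: one row per item per ticket}
  store_returns: {type: fact, grain: one row per returned item}
  date_dim:      {type: dimension, key: d_date_sk}
  item:          {type: dimension, key: i_item_sk}
rules:
  - Never join store_sales to store_returns directly (no fact-to-fact joins).
  - To combine the two facts, aggregate each one in its own CTE first,
    then join the CTEs on the shared surrogate key.
measures:
  total_sales: SUM(ss_sales_price * ss_quantity)
  total_returns: SUM(sr_return_amt)
  return_rate: total_returns / NULLIF(total_sales, 0)
"""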

Multi-Level Result Validation

Results are categorized into five match types (a minimal matching sketch follows the list):

  • Exact Match: Identical results including order
  • Superset: Model returns additional valid data
  • Subset: Model returns partial but correct data
  • Mismatch: Different results
  • Error: Execution or generation failures
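
A minimal sketch of this value-based matching, assuming both result sets arrive as lists of rows; edge cases such as duplicate rows or reordered results are glossed over here, and the real framework likely handles type coercion more carefully:

def classify_match(model_rows, reference_rows):
    # "Error" is decided upstream, when generation or execution fails.
    if model_rows == reference_rows:
        return "exact"                 # identical results, including order
    model_set = set(map(tuple, model_rows))
    ref_set = set(map(tuple, reference_rows))
    if model_set > ref_set:
        return "superset"              # all reference rows plus additional data
    if model_set < ref_set:
        return "subset"                # partial but correct data
    return "mismatch"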

The Results: Not bad at all

Overall Performance Summary

Both local models achieved 75-85% accuracy, which is remarkable considering they're running on consumer-grade hardware with just 4 GB of VRAM. GPT-OSS-20B slightly outperformed Qwen3, with 85% accuracy versus 75%, although it is way slower.

I guess we are not there yet for interactive use cases; it is simply too slow for a local setup, especially for complex queries.

Tool calling

A more practical use case is tool calling: you can use it to interact with a database or Power BI through an MCP server, and because it is totally local, you can go ahead, read the data, and do whatever you want, as it stays completely isolated on your own computer.

The Future is bright

I don't want to sound negative: just 6 months ago I could not make this work at all, and now I have the choice between multiple vendors and it is all open source. I am very confident that those models will get even more efficient with time.

AI is Coming for Us

There are moments in life when you know things will never be the same. I remember distinctly when Gary showed me PowerPivot 10 years ago, and I knew that working with data would become as easy as playing with Excel. Another such moment was two days ago when I connected Claude Desktop to a database and asked, “What do you think?”

It was a strange experience. It wasn’t your typical “chat with your data and give me a nice chart” interaction. It was more like talking to a human and asking them to create a report. The LLM started by listing all the tables, examining the data, and making sense of what the dataset was about. Somehow, it figured out that the power generation figures were in MW and that to convert them to MWh, you need to divide by 12.

There’s a simple reason why this approach is so powerful compared to a typical chat with your data workflow: the LLM has read access to the data. It’s still secure and can only read what you’re authorized to access. As far as I know, these LLMs don’t auto-learn and don’t use the data for training, at least when you use an enterprise API.

Another interesting observation: as a non-programmer, I watched AI’s progress in coding with great excitement and never felt much sympathy for human coders. I thought they were exaggerating the threat. Somehow, my reaction changed when I noticed that AI will get very good at analytics too.

Note: I’ll refer to LLMs as AI for simplicity. Kurt has an excellent blog post worth reading, and thanks to Pawel for telling me about this whole MCP thing.

Typical “Chat with Your Data” Workflow

The important thing here is that the AI doesn't have access to your data at all. You collect as much knowledge about your data as you can and send your questions along with that knowledge. You get back SQL or DAX statements that you send to your server to get answers. If the question is not clear enough, the AI will ask for clarification: ask “what is the biggest country in the world?” and it will reply “by size, by GDP, ...?”. It's much more complex in real life, but that's the core idea.

Basically, we spend a lot of effort making sure AI can’t see your data. Sometimes, as a user, you wonder why this AI can’t answer some very obvious questions. Just imagine: as a data analyst, if someone asked you to give them a report without even seeing any numbers!

Using MCP

In this setup, the AI is unleashed. It can read the data directly (again, using only what you're allowed to access, and ideally read-only). The AI acts like an agent with more autonomy; it is not limited to just your metadata.

Example Using Data from OneLake

I have this data in OneLake, and it’s cleansed data:

Because we don’t have an MCP server yet for Fabric DWH, I used the DuckDB MCP server to read the data from OneLake. For convenience, instead of using direct query, I imported the data into a local DuckDB file:

import duckdb

# Create the local DuckDB file and make it the default database
con = duckdb.connect()
con.sql("ATTACH 'aemo_delta.duckdb' AS db; USE db")

# Copy each Delta table from OneLake into the local file
# (delta_scan comes from DuckDB's delta extension; OneLake access also needs Azure auth)
for tbl in ['duid', 'summary', 'calendar', 'mstdatetime']:
    con.sql(f"""
        CREATE OR REPLACE TABLE {tbl} AS
        FROM delta_scan('abfss://serving@onelake.dfs.fabric.microsoft.com/datamart.Lakehouse/Tables/aemo/{tbl}')
    """)

con.close()

You need to install the MCP server and configure the connection in Claude Desktop. To be clear, it should work with any MCP client, but so far that's the best I could find. Who knows, maybe one day Power BI Desktop will act as an MCP client (I literally made up this idea; it is not a hint or anything).

Then you add this config to Claude Desktop:

{
  "mcpServers": {
    "mcp-server-motherduck": {
      "command": "uvx",
      "args": [
        "mcp-server-motherduck",
        "--db-path",
        "/tmp/llm/aemo_delta.duckdb"
      ]
    }
  }
}

For me, it feels like ODBC for AI. The protocol is getting adopted by everyone.

The Experience

Since the data is public, I shared the whole chat. What I really like is how AI approaches the problem, first by looking at the tables. This is very human-like behavior.

If you read the chat, you’ll see it’s not perfect. It casually skipped hydro from the renewable conversation and didn’t calculate MWh correctly, although it did yesterday.

Some Observations

  • Even for a simple use case, you still need a semantic model. If I had a measure MWh = MW/12, the AI would always use it, at least in theory. For a complex model, it's even more critical. Having said that, AI can do modeling just fine 🙂 do we still need humans for that?
  • Surprisingly, in that simple workflow I can replace every compute engine; what's really critical is storage!
  • All my data is publicly available, so I wasn't worried about security. For any enterprise work, you can't really use something like Claude Desktop, but rather solutions like Azure AI Foundry.
  • For now, most models don't acquire new knowledge during serving, but who knows what will happen in the next 10 years? You can imagine an AI that learns just from interacting with users and data, which opens all kinds of new questions. Do you need specific models for every tenant, for every user? We're not there yet, but it is something we will have to deal with.
  • Never give MCP write access to anything.

A Non-scientific Benchmark of Text-to-SQL using Small Language Models

TL;DR: This post shares a quick experiment I ran to test how effective (or ineffective) small language models are at generating SQL from natural language questions when provided with a well-defined semantic model. It is purely an intellectual curiosity; I don't think we are there yet. Cloud-hosted LLMs are simply too good, efficient, and cost-effective.

You can download the notebook and the semantic model here.

⚠️ This is not a scientific benchmark.
I’m not claiming expertise here—just exploring what small-scale models can do to gain an intuition for how they work. Large language models use so much computational power that it’s unclear whether their performance reflects true intelligence or simply brute force. Small-scale models, however, don’t face this issue, making their capabilities easier to interpret.

Introduction

I used Ollama to serve models locally on my laptop and DuckDB for running the SQL queries. DuckDB is just for convenience; you could use any SQL-compatible database.

To start, I used Qwen3 at 4B, 8B, and 14B. It is open weight and I had heard good reviews considering its size, but the same approach will work with any model. Note that I turned off thinking mode in Qwen3.

To be honest, I tested other small models too, and they didn’t work as well. For example, they couldn’t detect my graphics card. I probably missed some configuration, but since I don’t know enough, I prefer to leave it at that.

0. Semantic Model Prompt


A semantic_model.txt file acts as the system prompt. This guides the model to produce more accurate and structured SQL output. The semantic model itself was generated by another LLM; it includes non-trivial verified SQL queries, sample values, relationships, measures, and custom instructions.

“no_think” is used to turn off the thinking mode in Qwen3.
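
As a rough illustration of how the pieces fit together (the notebook has the real code; only the file name comes from the post, the rest is an assumption), the semantic model is loaded as the system prompt and the no-think switch is appended to the user question:

# Load the semantic model once and use it as the system prompt
with open("semantic_model.txt", "r", encoding="utf-8") as f:
    SYSTEM_PROMPT = f.read()

def build_messages(question: str) -> list[dict]:
    # "/no_think" is Qwen3's soft switch for disabling thinking mode
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{question} /no_think"},
    ]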

1. Setup and Environment

  • The notebook expects an Ollama instance running locally, with the desired models (like qwen3:8b, qwen3:4b) already pulled using ollama run <model_name>.

2. How It Works

Two main functions handle the process:

  • get_ollama_response:
    This function takes your natural language question, combines it with the semantic prompt, sends it to the Local Ollama server, and returns the generated SQL.
  • execute_sql_with_retry:
    It tries to run the SQL in DuckDB. If the query fails (due to syntax or binding errors), it asks the model to fix it and retries—until it either works or hits a retry limit.

In short, you type a question, the model responds with SQL, and if it fails, the notebook tries to self-correct and rerun.
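
In condensed form, and reusing build_messages from the sketch above, the two functions look roughly like this (a sketch assuming the Ollama Python client and DuckDB are used as described; details such as stripping code fences from the model output are omitted):

import duckdb
import ollama

def get_ollama_response(question: str, model: str) -> str:
    # Send the semantic-model system prompt plus the question to the local Ollama server
    response = ollama.chat(model=model, messages=build_messages(question))
    return response["message"]["content"].strip()

def execute_sql_with_retry(con: duckdb.DuckDBPyConnection, question: str,
                           model: str, max_retries: int = 2):
    sql = get_ollama_response(question, model)
    for attempt in range(max_retries + 1):
        try:
            return con.sql(sql).fetchall(), sql
        except Exception as exc:
            if attempt == max_retries:
                return None, sql   # give up once the retry limit is hit
            # Ask the model to fix its own query using the error message
            sql = get_ollama_response(
                f"{question}\nThe previous SQL failed with: {exc}\n"
                f"Previous SQL:\n{sql}\nReturn a corrected DuckDB query.", model)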

3. Data Preparation

The data was generated using a Python script with a scale factor (e.g., 0.1). If the corresponding DuckDB file didn’t exist, the script created one and populated it with the needed tables. Again, the idea was to keep things lightweight and portable.
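
For reference, one way to build such a file is DuckDB's own TPC-DS generator; the original script may have done it differently, the file name is a placeholder, and I have not verified fractional scale factors on every DuckDB version:

import duckdb

con = duckdb.connect("tpcds_sf0_1.duckdb")   # placeholder file name
con.sql("INSTALL tpcds")
con.sql("LOAD tpcds")
con.sql("CALL dsdgen(sf = 0.1)")             # scale factor 0.1, as used in the post
print(con.sql("SELECT count(*) FROM store_sales").fetchone())
con.close()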

Figure: Example semantic model

4. Testing Questions

Here are some of the questions I tested; some are simple, others a bit harder and require more effort from the LLM:

  • “total sales”
  • “return rate”
  • “Identify the top 10 item categories with the highest total return amount from customers born in ‘USA’ who made returns in 2001.”
  • “customer age group with the worst return rate?”
  • “return rate per year”
  • “any days with unusual return rate?, use fancy statistics”

Each question was sent to the different models (qwen3:14b, qwen3:8b, qwen3:4b) to compare their performance. I also used %%time to measure how long each model took to respond. Some questions were already answered in the semantic model as verified queries, so in a sense it is also a test of how well the model sticks to the instructions. The driver loop is sketched below.
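
The notebook drives this through the ask_question helper used again in section 7; stripped of the bookkeeping, it looks roughly like this (a sketch that reuses the connection and execute_sql_with_retry from the earlier sketch, not the notebook code verbatim):

import time

def ask_question(questions: list[str], model: str):
    # con is the global DuckDB connection opened during data preparation
    results = []
    total_start = time.time()
    for q in questions:
        start = time.time()
        rows, sql = execute_sql_with_retry(con, q, model)
        results.append({
            "question": q,
            "result": rows,       # None when the retry limit was hit
            "sql": sql,
            "seconds": round(time.time() - start, 1),
        })
    print(f"{model}: {round(time.time() - total_start, 1)} s total")
    return results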


5. What Came Out

For every model and question, I recorded:

  • The original question
  • Any error messages and retries
  • The final result (or failure)
  • The final SQL used
  • Time taken per question and total time per model

6. Observations

Question 6, about detecting unusual return rates with “fancy statistics”, stood out:

  • 8B model:
    Generated clean SQL using CTEs and followed a star-schema-friendly join strategy. No retries needed.
  • 14B model:
    Tried using Z-scores, but incorrectly joined two fact tables directly. This goes against explicit instruction in the semantic model.
  • 4B model:
    Couldn’t handle the query at all. It hit the retry limit without producing usable SQL.

By the way, the scariest part isn't when the SQL query fails to run; it's when it runs, appears correct, but silently returns incorrect results.

Another behavior I liked very much: I asked a question about customers born in the ‘USA’, and the model was clever enough to leverage the sample values and filter on ‘UNITED STATES’ instead.

Execution Times

  • 14B: 11 minutes 35 seconds
  • 8B: 7 minutes 31 seconds
  • 4B: 4 minutes 34 seconds

Tested on a laptop with 8 cores, 32 GB of RAM, and 4 GB of VRAM (Nvidia RTX A2000). The data is very small, so almost all the time is spent generating the SQL; although the accuracy is not too bad, we are far from interactive use cases on laptop hardware alone.

7. Testing with Simpler Questions Only

I redid the test with the 4B model, using only simpler questions:

questions = [
    'total sales',
    'return rate',
    "Identify the top 10 item categories with the highest total return amount from customers born in 'USA' who made returns in 2001.",
    'return rate per year',
    'most sold items',
]
ask_question(questions, 'qwen3:4b')

The 5 questions took less than a minute, which is promising!

Closing Thought

Instead of a general-purpose SLM, maybe a 4B model fine-tuned for coding and SQL would be an interesting proposition. We live in interesting times.

Experimenting with Text-to-SQL: Lessons from Optimizing Product Return Analysis

🌟 Introduction

While testing the DuckDB ODBC driver, which is getting better and better (not production ready but less broken compared to two years ago), I noticed something unexpected. Running queries through Power BI in DirectQuery mode was actually faster than executing them directly in the DuckDB native UI.

Naturally, that does not make sense !!!

What followed was an investigation that turned into a fun and insightful deep dive into text-to-SQL generation, Power BI's query behavior, and the enduring relevance of manual SQL tuning.

🧩 The Goal: Find the Worst Product by Return Rate

The task was straightforward:

Calculate total sales, total returns, and return rate by product. Rank the products and find the top 5 with the highest return rates.

To make it interesting, I decided to try:

  1. Letting an LLM generate the SQL by loading the semantic model.
  2. Using PowerBI in Direct Query Mode.
  3. Finally, manually tuning the query.

📝 Step 1: LLM-generated SQL — Clean and Understandable

ChatGPT generated a good starting point:

WITH sales_by_product AS (
    SELECT
        i.i_product_name AS product_name,
        SUM(ss.ss_sales_price * ss.ss_quantity) AS total_sales
    FROM store_sales ss
    JOIN item i ON ss.ss_item_sk = i.i_item_sk
    WHERE i.i_product_name IS NOT NULL
    GROUP BY i.i_product_name
),

returns_by_product AS (
    SELECT
        i.i_product_name AS product_name,
        SUM(sr.sr_return_amt) AS total_returns
    FROM store_returns sr
    JOIN item i ON sr.sr_item_sk = i.i_item_sk
    WHERE i.i_product_name IS NOT NULL
    GROUP BY i.i_product_name
),

combined AS (
    SELECT
        COALESCE(s.product_name, r.product_name) AS product_name,
        COALESCE(s.total_sales, 0) AS total_sales,
        COALESCE(r.total_returns, 0) AS total_returns
    FROM sales_by_product s
    FULL OUTER JOIN returns_by_product r
        ON s.product_name = r.product_name
)

SELECT
    product_name,
    ROUND((total_returns / NULLIF(total_sales, 0)) * 100, 2) AS return_rate
FROM combined
WHERE total_sales > 0 -- Avoid divide by zero
ORDER BY return_rate DESC
LIMIT 5;

Pros:

  • Clean and easy to read.
  • Logically sound.
  • Good for quick prototyping.

🔍 Observation: However, it used product_name (a text field) as the join key in the combined CTE. Initially I was testing with TPC-DS at scale factor 10 and the performance was good, but when I changed it to scale factor 100, performance degraded very quickly! I should have known better, but I did not notice that product_name has a lot of distinct values.

The sales table is nearly 300 M rows on my laptop, so it is not too bad.

And it is nearly 26 GB of highly compressed data (just to keep it in perspective).
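
One quick way to catch this kind of trap before it bites is to check the cardinality of the candidate join key; grouping and joining on long, high-cardinality strings is much more expensive than on integer surrogate keys. A sketch against the same data (the file name is a placeholder):

import duckdb

con = duckdb.connect("tpcds.duckdb")   # placeholder file name
print(con.sql("""
    SELECT
        count(DISTINCT i_product_name) AS distinct_product_names,
        count(DISTINCT i_item_sk)      AS distinct_item_keys
    FROM item
""").fetchall())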

📊 Step 2: Power BI DirectQuery Surprises

Power BI automatically generates SQL queries based on the data model. Basically, you define measures using DAX and add a visual, which generates a DAX query that gets translated to SQL. Based on some complex logic, it may or may not push just one query to the source system; in this case, it generated multiple SQL queries and stitched the results together.

🔍 Insight: Power BI worked exactly as designed:

  • It split measures into independent queries (roughly sketched after this list).
  • It grouped by product_name, because that was the visible field in my model.
  • And surprisingly, it was faster than running the same query directly in DuckDB CLI!
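
I did not capture the exact SQL Power BI pushed, but the shape of the split is roughly two independent aggregates, one per measure, each grouped by the visible field, with the results stitched together on the client. Reproduced by hand against DuckDB, it would look something like this (illustrative only, placeholder file name):

import duckdb

con = duckdb.connect("tpcds.duckdb")   # placeholder file name

# One lightweight aggregate per measure, both grouped by the visible column
sales = con.sql("""
    SELECT i.i_product_name, SUM(ss.ss_sales_price * ss.ss_quantity) AS total_sales
    FROM store_sales ss JOIN item i ON ss.ss_item_sk = i.i_item_sk
    GROUP BY i.i_product_name
""").df()

returns = con.sql("""
    SELECT i.i_product_name, SUM(sr.sr_return_amt) AS total_returns
    FROM store_returns sr JOIN item i ON sr.sr_item_sk = i.i_item_sk
    GROUP BY i.i_product_name
""").df()

# The "stitching" step that Power BI does internally
combined = sales.merge(returns, on="i_product_name", how="outer")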

Here’s my screenshot showing Power BI results and DAX Studio:


🧩 Step 3: DuckDB CLI — Slow with Text Joins

Running the same query directly in the DuckDB CLI was noticeably slower: 290 seconds!


⚙️ Step 4: Manual SQL Tuning — Surrogate Keys Win

To fix this, I rewrote the SQL manually:

  • Switched to item_sk, a surrogate integer key.
  • Delayed lookup of human-readable fields.

Here’s the optimized query:

WITH sales_by_product AS (
    SELECT
        ss.ss_item_sk AS item_sk,
        SUM(ss.ss_sales_price * ss.ss_quantity) AS total_sales
    FROM store_sales ss
    GROUP BY ss.ss_item_sk
),

returns_by_product AS (
    SELECT
        sr.sr_item_sk AS item_sk,
        SUM(sr.sr_return_amt) AS total_returns
    FROM store_returns sr
    GROUP BY sr.sr_item_sk
),

combined AS (
    SELECT
        COALESCE(s.item_sk, r.item_sk) AS item_sk,
        COALESCE(s.total_sales, 0) AS total_sales,
        COALESCE(r.total_returns, 0) AS total_returns
    FROM sales_by_product s
    FULL OUTER JOIN returns_by_product r ON s.item_sk = r.item_sk
)

SELECT
    i.i_product_name AS product_name,
    ROUND((combined.total_returns / NULLIF(combined.total_sales, 0)) * 100, 2) AS return_rate
FROM combined
LEFT JOIN item i ON combined.item_sk = i.i_item_sk
WHERE i.i_product_name IS NOT NULL
ORDER BY return_rate DESC
LIMIT 5;

🚀 Result: Huge performance gain, from 290 seconds down to 41 seconds!

Check out the improved runtime in DuckDB CLI:


🌍 In real-world models, surrogate keys aren’t typically used

Unfortunately, in real life people still use text as a join key; luckily, Power BI seems to do better there!


🚀 Final Thoughts

LLMs are funny: when I asked ChatGPT why it did not suggest a better SQL query, I got this answer 🙂

I guess the takeaway is this:


If you’re writing SQL queries, always prefer integer types for your keys!

And maybe, just maybe, DuckDB (and databases in general) could get even better at optimizing joins on text columns. 😉

But perhaps the most interesting question is:
What if, one day, LLMs not only generate correct SQL queries but also fully performance-optimized ones?

Now that would be exciting.

You can download the data here; it uses a very small scale factor: https://github.com/djouallah/Fabric_Notebooks_Demo/tree/main/SemanticModel

Edit: running EXPLAIN ANALYZE shows that the GROUP BY is taking most of the time, not the joins.

The optimized query already assumed that i.i_item_sk is unique; it is not very obvious for DuckDB to rewrite the query that way without knowing the join types! I guess LLMs still have a lot to learn.
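
If you want to verify this on your own copy of the data, DuckDB makes the profiling part easy; original_query and optimized_query below are placeholders for the two SQL strings shown earlier:

import duckdb

con = duckdb.connect("tpcds.duckdb")   # placeholder file name

# Prints the physical plan with per-operator timings, which shows whether
# the GROUP BY or the joins dominate the runtime
con.sql("EXPLAIN ANALYZE " + original_query).show()
con.sql("EXPLAIN ANALYZE " + optimized_query).show()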