Analytics Engineering

Text-to-SQL for Operational Analytics — Beyond the Toy Examples

Making natural language querying work against real freight and procurement data requires more than a capable model.

Pankaj Kumar•June 2026•7 min read

The demo for text-to-SQL always looks the same. A clean schema. A natural language question. A correct SQL query returned instantly. Applause.

The production system looks different. The schema has 40 tables, columns with ambiguous names, date fields in three different formats, and business logic embedded in views that no single person fully understands.

Building text-to-SQL that works against real operational data — freight datasets, procurement records, carrier pricing tables — is an engineering problem much more than a modeling problem. Here is what that actually involves.

Why schema complexity is the real bottleneck

Most toy text-to-SQL benchmarks use schemas with five to ten tables and obvious column names. Operational databases look nothing like this.

Freight and procurement datasets typically involve dozens of tables with abbreviated or ambiguous column names, denormalized structures from legacy systems, and implicit business rules that live in analyst knowledge rather than the schema itself. A column called org might mean origin port, originating office, or organization depending on which table it appears in.

The model cannot reason correctly over ambiguous schema if the schema is all it receives. Schema understanding requires:

Column aliasing and disambiguation in the prompt context
Business terminology mapping (what "lane" means in this context, what counts as "active" cargo)
Relationship documentation beyond foreign keys
Example queries that demonstrate non-obvious joins

Without this layer, even capable models generate plausible-looking SQL that returns the wrong rows.

The self-correction loop

A query generated by a language model cannot be trusted without validation. The architecture pattern that consistently improves reliability is a self-correction loop: generate, execute, inspect, regenerate if needed.

The loop works roughly as follows. The model generates a SQL query. The query engine executes it and returns either a result or an error. If the query fails, the error message is injected back into the prompt alongside the original question, and the model attempts to correct. If the query succeeds but returns zero rows — or an implausibly large result set — that signal is also fed back.

This loop catches the majority of syntax errors and obvious logical mistakes before they reach the user. It does not catch semantic errors — queries that execute correctly but answer the wrong question — which is why validation against known ground truth remains important.

Date and entity normalization

Operational queries are full of implicit references that require normalization before SQL generation can succeed.

"Last quarter's volume" requires resolving the current date, computing the quarter boundary, and mapping that to the correct date field in the schema. "Hamburg freight" requires resolving "Hamburg" to the port code the database actually uses. "Active shipments" requires knowing how the system defines active — which is usually not a single column flag.

These normalization steps belong in the retrieval and preprocessing layer, not inside the model prompt. Asking the model to do implicit arithmetic over dates and terminology at the same time as generating SQL increases hallucination risk significantly.

A pre-processing layer that resolves entities, normalizes date ranges, and surfaces relevant schema context before the model sees the query produces substantially more reliable results.

Token budget and context management

A fully documented operational schema is too large to inject into a model context window in its entirety. Token budgeting requires a retrieval step just for schema context: given the user's question, retrieve the relevant tables and columns rather than the entire schema.

In practice, this means maintaining a lightweight schema index — table names, column descriptions, sample values — and using semantic similarity to surface the two to five most relevant tables for each query. The selected tables and their documented columns then enter the prompt; the rest do not.

Getting this step right reduces context noise and substantially improves SQL accuracy on multi-table queries.

What "good enough" looks like

Text-to-SQL for operational analytics does not need to be perfect. It needs to be reliable enough that users trust it and specific enough that wrong answers are obvious rather than plausible.

The failure mode to avoid is confident wrongness: a query that executes cleanly, returns a number, and is silently incorrect. A system that errors visibly — or surfaces its uncertainty — is operationally safer than one that always returns an answer regardless of confidence.

The teams that build useful text-to-SQL systems treat it as a retrieval pipeline problem first, a prompt engineering problem second, and a model selection problem last. That ordering is usually the inverse of how organizations approach it.