From the trenches · Part 1

Field notes from a sovereign-AI build

Six weeks shipping a private AI assistant over a customer’s own data — what worked, what got rebuilt, and what the experience suggests for anyone heading the same way.

Dr Martin Bosticky · 6 min read

Notes from a recent engagement. Names and details fictionalised.

The brief was modest in scope and unusually clear in spirit: build a small, realistic AI assistant that lets a business ask natural-language questions of its own data — without that data ever leaving its own infrastructure. No third-party AI service, no hosted vector database, no telemetry leaving the building.

What started as a quick spike turned into a few intense weeks of iteration and three meaningful rewrites. The arc of that work is familiar enough — and increasingly common — that it’s worth writing down.

The shape of the problem

Picture a mid-sized business sitting on a structured database of operational data: customers, contracts, invoices, jobs, schedules. They’d like to ask questions of it the way they’d ask a thoughtful colleague: which contracts are renewing next month? Which jobs are overdue and unassigned? Who hasn’t paid a recent invoice?

They don’t want to learn a query language. They don’t want a new dashboard. And, increasingly, they don’t want their information flowing through a service they don’t control.

That last constraint is the interesting one. Take it seriously and a lot of the modern AI stack stops being available to you, which forces a useful conversation about what was actually load-bearing and what was just convention.

Phase one — the simple thing that mostly works

The first version was deliberately unglamorous: describe the structure of the data to the model, let it propose a query in response to a plain-English question, run that query against a read-only copy of the data, and let the model summarise what came back.

For a focused demo over well-organised data, this approach is astonishingly capable. Most questions a business would actually ask have a precise answer, and a precise answer is what you get — not a hallucinated paraphrase of one.
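
For the curious, all of phase one fits on a page. Here is a minimal sketch in Python, assuming SQLite and a placeholder llm() standing in for whatever local model sits behind it; the schema, table names, and file path are illustrative, not the customer’s:

    import sqlite3

    # Placeholder for the locally hosted model: any on-prem text-in,
    # text-out inference call fits here. The name is an assumption.
    def llm(prompt: str) -> str:
        raise NotImplementedError

    # A plain-text description of the data's structure (abridged, invented).
    SCHEMA = """
    customers(id, name)
    contracts(id, customer_id, renews_on)
    invoices(id, customer_id, due_on, paid)
    """

    def answer(question: str) -> str:
        # 1. Let the model propose a query for the plain-English question.
        sql = llm(f"Schema:\n{SCHEMA}\nWrite one read-only SQL query answering: {question}")
        # 2. Run it against a read-only copy of the data.
        rows = sqlite3.connect("file:ops.db?mode=ro", uri=True).execute(sql).fetchall()
        # 3. Let the model summarise what came back.
        return llm(f"Question: {question}\nRows: {rows}\nSummarise the answer plainly.")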

Phase two — the unglamorous middle

Then reality showed up. A friendly “hi” shouldn’t trigger a database query. A two-part question shouldn’t be answered as if it were one. A multi-turn conversation needs to actually remember what was said earlier. The model occasionally says something confident and slightly wrong, and the system needs to notice before the user does.

None of this is novel. None of it is research. All of it is real work. Anyone who has shipped an AI feature recognises this stretch — the part between “working demo” and “something a customer can actually use” — and almost everyone underestimates it.
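
Much of that middle turns out to be small guard logic in front of the same pipeline. A rough sketch, reusing llm() and answer() from the phase-one example; the intent labels and prompts are illustrative:

    def route(message: str, history: list[str]) -> str:
        # Classify before touching the database: a friendly "hi" must
        # never trigger a query.
        intent = llm(
            "Classify the message as one of: chitchat, data_question, multi_part.\n"
            f"Recent turns: {history[-6:]}\nMessage: {message}"
        ).strip()
        if intent == "chitchat":
            return llm(f"Reply briefly and politely to: {message}")
        if intent == "multi_part":
            # A two-part question gets split and answered part by part.
            parts = llm(f"Split into separate questions, one per line: {message}")
            return "\n".join(answer(p) for p in parts.splitlines() if p.strip())
        return answer(message)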

Phase three — letting the customer choose the engine

A reasonable customer question, asked early: which AI model should we run? The honest answer is that it depends, and that the right answer will change. The practical answer is to keep that decision separate from the rest of the system, so that swapping the underlying model is a configuration change rather than a project.

That bit of restraint paid back more than any other piece of architecture. Demos became conversations: the same question, the same data, several different models. The customer could see for themselves what worked on hardware they already owned.
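
Concretely, the separation can be as thin as a config object and a factory. A sketch, assuming a generic local inference server; the request and response shapes below are assumptions, not any particular server’s API:

    import json
    import urllib.request
    from dataclasses import dataclass

    @dataclass
    class ModelConfig:
        name: str      # whichever model runs on the hardware they own
        endpoint: str  # URL of the local inference server

    def make_llm(cfg: ModelConfig):
        # Build the text-in, text-out callable the rest of the system uses.
        # Swapping the model is then an edit to the config, not the code.
        def llm(prompt: str) -> str:
            body = json.dumps({"model": cfg.name, "prompt": prompt}).encode()
            req = urllib.request.Request(cfg.endpoint, body,
                                         {"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)["text"]  # response shape is an assumption
        return llm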

Phase four — when the data stops being a toy

As the underlying data grew toward something realistic, two things started to wobble. Describing the whole structure to the model in plain text became expensive and a little leaky — the more it had to keep in mind, the more subtle mistakes it made. And ordinary questions started to be ambiguous: which record did they mean? for what date range? with which assumptions?

For a while, the answer was a longer and more careful set of instructions. That works — until it doesn’t.

Phase five — the rewrite worth doing

The most useful change came late, and in hindsight should have come earlier. Instead of asking the model to author a query from scratch every time, the system was reshaped around a small, well-defined set of operations the business actually cares about: look up a record by name, search a category by a handful of useful filters, fetch the things due this week.

The model’s job became simpler, almost clerical: pick the right operation and fill in its parameters. The complicated details — how the data is shaped, which fields matter, how edge cases should be handled — moved out of the prompt and into ordinary, testable code.
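
In code, the reshaped core is small. A sketch of the idea, reusing llm() from earlier; the operation names, fields, and filters are invented for illustration:

    import json
    import sqlite3

    # The closed set of operations the business actually asks about.
    OPERATIONS = {
        "find_customer": lambda db, name: db.execute(
            "SELECT * FROM customers WHERE name LIKE ?", (f"%{name}%",)).fetchall(),
        "contracts_renewing_before": lambda db, date: db.execute(
            "SELECT * FROM contracts WHERE renews_on <= ?", (date,)).fetchall(),
        "unpaid_overdue_invoices": lambda db: db.execute(
            "SELECT * FROM invoices WHERE paid = 0 AND due_on < date('now')").fetchall(),
    }

    def answer(question: str, db: sqlite3.Connection) -> str:
        # The model's job is clerical: pick one operation, fill its parameters.
        choice = json.loads(llm(
            f"Operations: {list(OPERATIONS)}\n"
            f'For the question "{question}", reply with JSON only: '
            '{"op": "<name>", "args": ["<value>", ...]}'
        ))
        rows = OPERATIONS[choice["op"]](db, *choice["args"])
        return llm(f"Question: {question}\nRows: {rows}\nSummarise plainly.")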

Before → After

  • The model writes a fresh query every turn → the model picks an operation and fills in the blanks
  • The data’s shape lives inside the prompt → the data’s shape lives in code, where it can be reviewed
  • Edge cases are managed by careful wording → edge cases are handled where edge cases belong
  • Each run is a little different and hard to test → the behaviour is predictable and easy to test
  • Smaller models struggle → smaller models work fine; they only choose, not author
  • The prompt grows over time → the prompt is stable; the system grows around it

The benefits compound. New capabilities became small, ordinary additions to existing code rather than ever-longer paragraphs in an ever-more-fragile prompt. The prose around the system calmed down. Smaller, faster models became viable. The whole thing got easier to explain.

What mattered less than expected

  • Exotic retrieval techniques. For well-structured data, they were rarely the right starting point. They have their place, but as a second layer rather than a foundation.
  • Bigger models. Beyond a certain point, the limiting factor stopped being the model. Capability is real, but it is rarely the bottleneck once the system around the model is well shaped.
  • Heavyweight frameworks. A clear, simple loop with sensible limits did the work. The temptation to reach for elaborate orchestration faded once the underlying operations were clean.

What mattered more than expected

  • Showing the working. Letting the user see, in some form, what the system was actually doing on their behalf was the single biggest trust-builder. Transparency is a feature, not a debugging tool (see the sketch after this list).
  • The user experience around “thinking” models. Models that show their reasoning, and the choices about how much of that to surface, have bigger UX consequences than engineering ones.
  • Sovereignty as a story. “Your data never leaves your premises” did more in customer conversations than any technical comparison. For regulated-adjacent industries, this is the headline, not the footnote.
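
On the first of these, “showing the working” can be as small as returning a trace alongside the answer. A sketch, reusing OPERATIONS and llm() from the phase-five example:

    def answer_with_trace(question: str, db) -> tuple[str, str]:
        # Same clerical step as before, but the chosen operation is kept
        # and surfaced, so the UI can show what was actually run.
        choice = json.loads(llm(
            f"Operations: {list(OPERATIONS)}\n"
            f'For "{question}", reply with JSON only: '
            '{"op": "<name>", "args": ["<value>", ...]}'
        ))
        rows = OPERATIONS[choice["op"]](db, *choice["args"])
        trace = f'{choice["op"]}({", ".join(map(repr, choice["args"]))}) → {len(rows)} rows'
        summary = llm(f"Question: {question}\nRows: {rows}\nSummarise plainly.")
        return summary, trace  # the trace, shown plainly, is the trust-builder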

If you’re heading the same way

  1. Start with the simplest thing that could work. A working answer to a real question, on real data, in front of a real customer is worth more than any architecture diagram.
  2. Keep the model swappable. The right choice today is not the right choice in six months. Decouple it early.
  3. Make the system’s working visible. Trust scales with transparency, not with cleverness.
  4. Notice when the prompt is doing too much. When “just one more rule” becomes a daily ritual, that’s the signal to move complexity into code.
  5. Budget for the boring middle. The bit between a demo and something usable is where the real work lives. It’s also where the difference between systems you can rely on and systems you can’t becomes clear.

The end state is unspectacular by design. A small system, on ordinary infrastructure, that quietly answers questions about a customer’s own data without ever sending it anywhere it shouldn’t go. For the right kind of customer, that is exactly the point.

If you’re thinking about something similar, the most interesting questions in this space are rarely about the model itself. They’re about where “private”, “useful”, and “maintainable” have to all be true at once — and how to keep them that way.