Giving the search box real query power

A plain-language look at the gap a structured text query would close — letting searches combine words with AND / OR / NOT, match exact phrases, require words near each other, and grow word-stems with * — with three real examples of what it unlocks.
2026-05-30 · background note for issue #363 (split from #214 item 4) · supersedes offline/explorations/fts-inert-text-vs-query-dsl.html
The idea in one line
Our text search treats whatever you type as one literal phrase to find, never as a query with logic in it. A structured textQuery would let you (or an LLM searching for you) compose real boolean, phrase, proximity and prefix searches — safely — over the catalogue's free-text fields.

The gap today

Our collection search runs on SQLite's full-text engine, FTS5, which has its own little query language: AND, OR, NOT, proximity (NEAR), quoted phrases, and prefixes like paint*. But the server doesn't let any of it through. Before your words reach the database we strip out the operators and wrap everything in quotes, so the whole thing is matched as one exact phrase. Your text becomes pure data to look up, never instructions for how to search.

An analogy
Imagine a very literal-minded librarian. You ask for "paintings of cats and dogs" hoping for pictures with either animal. They don't read "and" as logic — they go hunting for a book titled, word for word, "paintings of cats and dogs." That literal-mindedness is exactly what the search enforces on purpose.

This is a deliberate safety choice: raw FTS5 input can crash on a stray bracket, silently change the meaning of ordinary words like "and", or let callers poke at the internal shape of our index. Force-quoting kills all three risks at once. The cost is expressiveness — and there are four common things a researcher simply cannot ask for today:

The current workaround is our separate flat filters (title, description, inscription, curatorialNarrative), which are simply combined with AND, plus semantic_search for fuzzy meaning. That covers a great deal — but not precise, composed text logic.

What a structured query would add

The fix (detailed in #363) keeps the user's words inert — always quoted, always stripped of operators, exactly as today — but lets the structure around them be built by our own code from a small, validated request. The user's text only ever lands inside a safe quoted slot; the logic is ours, so it can never be malformed. Here are the building blocks, each with a tiny example:

should — either/or
Match any of several clauses. { should: [ A, B ] } → "A or B". This is the only way to search two different fields with an either/or, which the flat filters can't do.
must / mustNot — and / not
must requires every clause; mustNot excludes. { must: [A], mustNot: [B] } → "A but not B".
phrase vs. any — exact words vs. loose words
{ phrase: "cum privilegio" } matches those two words in that order; { any: ["x", "y"] } matches either word anywhere.
near — proximity
{ near: { terms: ["gesigneerd", "gedateerd"], distance: 4 } } → the two words within four words of each other.
prefix — grow a word-stem
{ prefix: "excud" } matches excudit, excud., exc. — useful when the same idea is spelled many ways.

The three fields it would search

The catalogue's long-form text lives in three quite different corpora, which is exactly why composed queries help — a concept may be phrased one way in one field and another way (or another language) in the next.

FieldRoughlyCharacter
description~512,000 worksDutch, denotative — what is shown, the materials, the signature facts
curatorialNarrative~14,000 worksMostly English, interpretive — meaning, theme, the story behind the work
inscription~502,000 worksLiteral transcription — the words actually written on the object

Three things it would unlock

Each example below leans on a different building block, and each reaches works the flat filters and semantic_search genuinely cannot. (These appear as scenarios 26–28 in docs/research-scenarios.md.)

Example 1 · either/or across fields

A theme written up two different ways

"Find works about the Beeldenstorm — the 1566 wave of iconoclasm — whether a Dutch cataloguer called it beeldenstorm in the description, or an English curator called it iconoclasm in the wall text."

textQuery: {
  should: [ { field: "description",        phrase: "beeldenstorm" },
            { field: "curatorialNarrative", any: ["iconoclasm","iconoclastic"] } ],
  mustNot: [ { field: "title", phrase: "geschiedenis" } ]
}

The two corpora describe the same event with different words, in different languages, and the matches barely overlap. Today's flat filters can only AND the two fields — giving the near-empty overlap — never the union. The should (either/or across fields) is the missing piece.

Example 2 · proximity

Signed and dated, together

"Find paintings the museum describes as both signed and dated in the same breath — the strongest statement that a work is genuinely by the artist's own hand."

textQuery: { field: "description",
             near: { terms: ["gesigneerd","gedateerd"], distance: 4 } }

A description that mentions a signature in one place and a date somewhere else is not the same as one asserting both together ("gesigneerd … en gedateerd 1749"). Requiring the two Dutch words to sit close drops the loose coincidences. Neither vocabulary filters nor semantic_search can demand that two words be near each other.

Example 3 · prefix + and/or

Who made the print — designer, engraver, publisher

"Old prints name their makers on the plate in Latin: who designed it (invenit), who engraved it (sculpsit/fecit), who published it (excudit). Find prints that record all three roles."

textQuery: { field: "inscription",
  must: [ { anyPrefix: ["inven","delineav"] },             // designer
          { anyPrefix: ["sculp","incid"], any: ["fecit"] }, // engraver
          { prefix: "excud" } ] }                           // publisher

The same role is written many ways — sculpsit, sculptor, Sculpt., sculp. — because engravers abbreviated to fit the margin. A prefix like sculp* finds many times more works than the exact word sculpsit alone. The structured "production role" filter can't help here: it normalises everything into a few tidy English labels, so it loses the literal Latin and can't spot one person doing two jobs ("sculptor et excudit"). The inscription is the original source; this reads it directly.

How it stays safe

The core trick
We already do a tiny version of this. A helper called expandFtsQuery takes one word and builds a safe ("paint" OR "paints" OR "painted") group — each variant individually quoted, the OR supplied by us. A structured textQuery just generalises that pattern: walk the small request, quote every leaf word, and emit the operators from our own code. The user's words can never break out of their quotes.
Option A — raw passthrough

Just stop stripping operators; let clients send FTS5 query text directly.

~10 lines. But it re-opens every risk — crashes, confusion, index probing. Not recommended.

Option B — structured compiler (recommended)

Accept the small { should, must, mustNot, near, prefix, field } request and compile it into a guaranteed-safe query.

~80–150 lines. User words stay inert; we generate only well-formed queries. As a bonus, combining everything into one match means relevance ranking finally covers the whole text query, not just the first field.

It would run entirely on the full-text indexes we already ship — no Elasticsearch, no new database, no re-harvest. It would be added as an opt-in field, so the plain string search stays exactly as it is.

Open questions before we'd commit

Bottom line
The "inert text" behaviour is a sound safety choice, not a bug — but it's also why precise searches quietly fail. We could lift the ceiling without giving up the safety, by having the server build the query logic around the user's still-inert words. The three examples above — an either/or across the bilingual catalogue, a proximity test for signed-and-dated works, and a prefix search over Latin maker's inscriptions — each reach works that nothing else in the toolkit can. The capability is already in the database; what's undecided is whether the demand justifies the added complexity. Tracked as #363, implementation deliberately left open.
References. Issue #363 (this proposal, split from #214 item 4) · Scenarios 26–28 in docs/research-scenarios.md · Code: src/utils/db.tsescapeFts5, escapeFts5Token, expandFtsQuery; src/api/VocabularyDb.ts — text-FTS filter block (~lines 4423–4459). The four full-text indexes already shipped: vocabulary_fts, artwork_texts_fts, person_names_fts, entity_alt_names_fts.