Structured subject vocabulary for 830,000 artworks
Iconclass is a hierarchical classification system for art subject matter — the standard vocabulary used by museums, libraries, and art historians worldwide to describe what an artwork depicts. rijksmuseum-mcp+ indexes the full system (39,802 notations across 13 languages) and links it to the Rijksmuseum collection, enabling subject-based discovery from broad categories down to individual scenes.
An art collection’s most basic search need — “show me paintings of the Crucifixion” — is surprisingly hard to serve with title search alone. Titles are brief, inconsistent, and often in Dutch. Subject metadata is richer, but it uses free-text labels that vary across cataloguers and languages.
The Rijksmuseum’s BM25 text index only covers titles. A painting titled “De Kruisiging” (The Crucifixion) won’t appear in a search for “crucifixion” because the title is in Dutch. Many works have purely descriptive titles like “Biblical scene” that reveal nothing about the specific subject.
Free-text dc:subject tags vary by cataloguer and era.
The same biblical scene might be tagged as “Crucifixion”,
“Christ on the Cross”, or “Golgotha”. Searching for one misses
the others. Without a controlled vocabulary, recall is inherently incomplete.
Iconclass addresses both problems. Each notation is a language-independent code (73D6 = Crucifixion) with labels in
13 languages. The hierarchy groups related concepts: searching for “73D” (Passion of Christ)
automatically includes the Crucifixion, Last Supper, and Gethsemane. And because the Rijksmuseum
catalogues its artworks with Iconclass notations, the mapping between notation and artwork is
already in the data, so no NLP or inference is required.
Iconclass notations are alphanumeric codes that encode a path through a hierarchy.
Each character extends the specificity: 7 → Bible,
73 → New Testament, 73D → Passion of Christ,
73D6 → Crucifixion. The system has 10 top-level divisions covering
all representational subject matter.
| Code | Division | Examples | Artworks |
|---|---|---|---|
| 0 | Abstract, Non-representational Art | geometric patterns, colour fields | 1,515 |
| 1 | Religion and Magic | Christian iconography, saints, angels | 36,000+ |
| 2 | Nature | landscapes, animals, plants, weather | 78,000+ |
| 3 | Human Being, Man in General | anatomy, senses, death, clothing | 65,000+ |
| 4 | Society, Civilization, Culture | trade, war, law, education, heraldry | 91,000+ |
| 5 | Abstract Ideas and Concepts | virtues, vices, time, fortune | 5,400+ |
| 6 | History | historical events, persons, places | 96,000+ |
| 7 | Bible | Old and New Testament narratives | 12,000+ |
| 8 | Literature | classical literature, mythology | 2,100+ |
| 9 | Classical Mythology and Ancient History | Greek/Roman gods, myths, legends | 9,500+ |
Maximum hierarchy depth: 11 levels. Most notations cluster at depth 5–7. Each level adds specificity while inheriting its parent’s meaning.
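The prefix structure described above can be illustrated with a minimal sketch. It assumes a plain alphanumeric notation in which each character adds one level, as in the 73D6 example. Real Iconclass notations may also contain doubled letters (such as 25FF21), bracketed text, and keys, which this sketch does not handle; the function names are hypothetical.

```python
from typing import List


def ancestor_path(notation: str) -> List[str]:
    """Return every prefix of a plain notation, broadest first.

    Assumption: one character per hierarchy level. Bracketed text,
    keys, and doubled letters are outside the scope of this sketch.
    """
    if not notation.isalnum():
        raise ValueError(f"only plain alphanumeric notations are handled: {notation!r}")
    return [notation[:i] for i in range(1, len(notation) + 1)]


def depth(notation: str) -> int:
    """Depth of a plain notation under the one-character-per-level assumption."""
    return len(ancestor_path(notation))


print(ancestor_path("73D6"))  # ['7', '73', '73D', '73D6']
print(depth("73D6"))          # 4
```

A prefix walk like this is only a first approximation; a database built from the data dump can rely on the explicit parent and child links instead.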
The Iconclass DB is a self-contained SQLite database built from the Iconclass CC0 data dump and cross-referenced against the vocabulary database for artwork counts.
1. Read notations.txt, extract the hierarchy (children, refs), and compute ancestor paths via a reverse (child→parent) map built from the parent→child lists.
2. Read txt/{lang}/*.txt for 13 languages (279,231 label entries) and build an FTS5 full-text index.
3. Read kw/{lang}/*.txt (780,049 keyword entries) and build a separate FTS5 keyword index.
4. Query vocabulary.db for per-notation artwork counts and update the rijks_count column.
5. Generate 384d embeddings for all ~39,800 notations on a Modal cloud GPU (e5-small, int8).
Database size: 130 MB (including embeddings). Stored as a single SQLite file alongside the vocabulary and embeddings databases.
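The ancestor-path computation in the first build step could look roughly like the sketch below. It assumes the hierarchy has already been parsed into a dictionary that maps each notation to its list of children; the parsing of notations.txt itself is not shown, and the function names and the toy fragment are illustrative rather than taken from the actual build script.

```python
from typing import Dict, List


def build_parent_map(children: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert parent -> children lists into a child -> parent lookup."""
    parent_of: Dict[str, str] = {}
    for parent, kids in children.items():
        for kid in kids:
            parent_of[kid] = parent
    return parent_of


def ancestors(notation: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk up the child -> parent map; return ancestors, broadest first.

    Assumes the hierarchy is a tree (no cycles).
    """
    path: List[str] = []
    current = notation
    while current in parent_of:
        current = parent_of[current]
        path.append(current)
    path.reverse()
    return path


# Toy fragment of the hierarchy; the real file lists many more children.
children = {"7": ["73"], "73": ["73D"], "73D": ["73D6"], "73D6": []}
parent_of = build_parent_map(children)
print(ancestors("73D6", parent_of))  # ['7', '73', '73D']
```

Inverting the map once keeps each ancestor lookup proportional to the depth of the notation, which the text above puts at 11 levels at most.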
The Iconclass server supports multiple search modes, each suited to a different
stage of research. Results include artwork counts so the user can gauge collection
coverage before querying search_artwork.
Keyword: FTS5 full-text search across labels and keywords in all 13 languages. Matching is exact-word with no stemming, so “crucifixion” won’t match “crucified”. Results are ranked by artwork count.
Browse: navigate the hierarchy by notation code. Returns the entry with its full ancestry path and direct children, which is ideal for exploring what a notation contains or narrowing a broad category.
Semantic: conceptual search using 384-dimensional embeddings (intfloat/multilingual-e5-small). Finds notations by meaning rather than exact keywords: “domestic animals in everyday life” finds 34B1 (domestic animals kept in the house), 34B2 (kept outside the house), and even 25FF21 (fabulous animals ~ domestic animals). An optional onlyWithArtworks filter restricts results to notations that appear in the collection.
Each Iconclass notation is embedded as a composite text that combines its label, Dutch translation, keywords, and full category path. This gives the embedding model richer context than a bare label — critical for short or ambiguous entries.
A notation like 31A33 has the bare label “smell, smelling
(one of the five senses)”. Without the category path
Human Being > Senses > Taste & Smell,
the embedding can’t distinguish it from “smell” in a
chemical or environmental sense. The hierarchy path acts as
a disambiguation signal.
Dutch labels add multilingual coverage — a query in Dutch matches via embedding similarity even though FTS5 only does exact word matching.
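A composite text of this kind might be assembled as in the following sketch. The field order, the separators, and the function name are assumptions made for illustration; the source only states which ingredients are combined, not how they are joined.

```python
from typing import List, Optional, Sequence


def composite_text(
    label_en: str,
    label_nl: Optional[str],
    keywords: Sequence[str],
    path: Sequence[str],
) -> str:
    """Join label, Dutch label, keywords, and category path into one string.

    The " | " and " > " separators are assumed, not taken from the project.
    Empty or missing parts are skipped.
    """
    parts: List[str] = [label_en]
    if label_nl:
        parts.append(label_nl)
    if keywords:
        parts.append(", ".join(keywords))
    if path:
        parts.append(" > ".join(path))
    return " | ".join(parts)


# 31A33 from the example above; the Dutch label and keywords are omitted here.
text = composite_text(
    "smell, smelling (one of the five senses)",
    None,
    [],
    ["Human Being", "Senses", "Taste & Smell"],
)
print(text)
```

With the path appended, the string handed to the embedding model carries the "five senses" context even when the label alone would be ambiguous.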
Embeddings are generated on Modal cloud GPU (NVIDIA T4) using
intfloat/multilingual-e5-small (384 dimensions, int8 quantized).
The same model encodes user queries at runtime on the host machine (CPU, ~130 MB).
Iconclass labels are available in 13 languages, though coverage varies widely. The top five languages cover >99% of notations; the long tail drops sharply. A language fallback chain (requested → English → Dutch → any) ensures every notation returns a label regardless of the query language.
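The fallback chain can be sketched as a small lookup function. It assumes labels for one notation are held in a dictionary keyed by language code, with "en" and "nl" standing for English and Dutch; both the data shape and the function name are hypothetical, and the labels below are toy data.

```python
from typing import Dict, Optional


def pick_label(labels: Dict[str, str], requested: str) -> Optional[str]:
    """Resolve a label via: requested language -> English -> Dutch -> any."""
    for lang in (requested, "en", "nl"):
        text = labels.get(lang)
        if text:
            return text
    # "any": fall back to the first non-empty label in whatever language exists
    for text in labels.values():
        if text:
            return text
    return None


print(pick_label({"en": "dog", "de": "Hund"}, "de"))  # Hund
print(pick_label({"en": "dog", "de": "Hund"}, "fr"))  # dog
print(pick_label({"de": "Hund"}, "fr"))               # Hund
```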
20,220 of the 39,802 notations (51%) are linked to at least one artwork. The distribution is heavily skewed — a few broad notations cover tens of thousands of works, while the long tail describes rare or highly specific subjects.
| Notation | Subject | Division | Artworks |
|---|---|---|---|
| 61B2 | historical persons (portraits, depictions) | 6 | 90,471 |
| 61BB2 | historical persons — women | 6 | 14,026 |
| 31D14 | adult man | 3 | 13,156 |
| 11Q712 | church (exterior) | 1 | 11,465 |
| 46A122 | armorial bearing, heraldry | 4 | 10,859 |
| 31D15 | adult woman | 3 | 10,154 |
| 25I1 | city view in general (‘veduta’) | 2 | 9,518 |
| 46C24 | sailing ship, sailing boat | 4 | 7,979 |
| 25H213 | river | 2 | 7,898 |
| 34B11 | dog | 3 | 6,291 |
Artwork counts are pre-computed at database build time and approximate.
Use search_artwork(iconclass: "73D6") for current, precise results.
Iconclass integrates into the broader search pipeline through a two-step pattern:
first discover the right notation code via the Iconclass server, then pass it
to search_artwork as a structured filter. This separates the vocabulary
navigation problem from the collection query problem.
1. Iconclass server (keyword, browse, or semantic search) → notation code + artwork count
2. search_artwork with iconclass: "73D6", combined with any other filters (creator, date, material, type …)
The iconclass parameter also works in collection_stats,
enabling aggregate analysis: “which creators produced the most works tagged with this
subject?” or “what materials are most common for works depicting dogs?”
Several implementation choices affect performance and capability.
Labels and keywords live in separate FTS5 virtual tables with a UNION query and deduplication by notation. This avoids contaminating label matches with keyword noise and vice versa, while still returning a single ranked result list.
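The two-table layout with a UNION and a ranking by artwork count can be tried out with the sketch below. Apart from rijks_count, every table and column name is an assumption, the rows are toy data with made-up counts, and the SQLite build must include FTS5. The actual query used by the server may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE notations (notation TEXT PRIMARY KEY, rijks_count INTEGER);
    -- Separate FTS5 tables so label and keyword matches stay apart.
    CREATE VIRTUAL TABLE label_fts USING fts5(notation UNINDEXED, body);
    CREATE VIRTUAL TABLE keyword_fts USING fts5(notation UNINDEXED, body);
    """
)
# Toy rows: the counts are invented for the example.
conn.executemany(
    "INSERT INTO notations VALUES (?, ?)",
    [("73D", 300), ("73D6", 120), ("34B11", 50)],
)
conn.executemany(
    "INSERT INTO label_fts VALUES (?, ?)",
    [("73D", "Passion of Christ"), ("73D6", "Crucifixion"), ("34B11", "dog")],
)
conn.executemany(
    "INSERT INTO keyword_fts VALUES (?, ?)",
    [("73D", "passion crucifixion"), ("73D6", "crucifixion cross")],
)

SEARCH_SQL = """
SELECT n.notation, n.rijks_count
FROM notations AS n
JOIN (
    SELECT notation FROM label_fts WHERE label_fts MATCH :q
    UNION  -- UNION (not UNION ALL) removes duplicate notations
    SELECT notation FROM keyword_fts WHERE keyword_fts MATCH :q
) AS hits ON hits.notation = n.notation
ORDER BY n.rijks_count DESC
"""


def search(conn: sqlite3.Connection, term: str):
    """Exact-word search over labels and keywords, ranked by artwork count."""
    # Quote the term so user input is treated as a phrase, not FTS5 syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    return conn.execute(SEARCH_SQL, {"q": phrase}).fetchall()


print(search(conn, "crucifixion"))  # [('73D', 300), ('73D6', 120)]
print(search(conn, "crucified"))    # [] because there is no stemming
```

Here 73D6 matches in both tables but appears once in the result, and the default tokenizer leaves “crucified” unmatched, which mirrors the exact-word behaviour described for keyword search.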
Pure semantic search uses the vec0 virtual table (fast brute-force scan).
When onlyWithArtworks is set, a slower path restricts the regular embeddings
table to notations with rijks_count > 0 and ranks the remaining rows with
vec_distance_cosine(). Both paths use int8 quantized vectors.
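The logic of the two paths can be imitated in plain Python, without sqlite-vec, as in the sketch below. It is a stand-in only: the quantization scheme (normalize, then scale to the range -127..127), the 3-dimensional toy vectors, the placeholder notation names, and the function names are all assumptions, not details from the project.

```python
import math
from typing import List, Sequence, Tuple


def quantize_int8(vec: Sequence[float]) -> List[int]:
    """Normalize to unit length, then scale and round into -127..127 (assumed scheme)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [max(-127, min(127, round(x / norm * 127))) for x in vec]


def cosine_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def knn(
    query: Sequence[int],
    rows: Sequence[Tuple[str, List[int], int]],
    k: int,
    only_with_artworks: bool = False,
) -> List[str]:
    """Brute-force nearest neighbours over (notation, vector, rijks_count) rows."""
    candidates = [
        (cosine_distance(query, vec), notation)
        for notation, vec, count in rows
        if not only_with_artworks or count > 0
    ]
    return [notation for _, notation in sorted(candidates)[:k]]


# Placeholder notations with toy vectors and toy artwork counts.
rows = [
    ("N1", quantize_int8([1.0, 0.0, 0.0]), 0),
    ("N2", quantize_int8([0.9, 0.1, 0.0]), 5),
    ("N3", quantize_int8([0.0, 1.0, 0.0]), 2),
]
query = quantize_int8([1.0, 0.05, 0.0])
print(knn(query, rows, k=2))                           # ['N1', 'N2']
print(knn(query, rows, k=2, only_with_artworks=True))  # ['N2', 'N3']
```

The filtered call shows why the option changes results and not just speed: the closest notation has no artworks, so the next candidates move up.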
Every label lookup tries: requested language → English → Dutch → any. Graceful degradation ensures queries always return labels, even for languages with <1% coverage. Prepared statements are cached for the lifetime of the DB connection.
Iconclass embeddings are always 384d (e5-small native). Artwork embeddings may be
MRL-truncated to a shorter dimension. A runtime check in registration.ts
prevents mismatched KNN queries — if the artwork embedding model outputs a
different dimension, the Iconclass semantic search path is silently disabled.