Iconclass Subject Search

Structured subject vocabulary for 830,000 artworks

Iconclass is a hierarchical classification system for art subject matter — the standard vocabulary used by museums, libraries, and art historians worldwide to describe what an artwork depicts. rijksmuseum-mcp+ indexes the full system (39,802 notations across 13 languages) and links it to the Rijksmuseum collection, enabling subject-based discovery from broad categories down to individual scenes.

39,802 Iconclass notations · 13 languages · 20,220 notations linked to artworks · 3 search modes

The problem: finding artworks by subject

An art collection’s most basic search need — “show me paintings of the Crucifixion” — is surprisingly hard to serve with title search alone. Titles are brief, inconsistent, and often in Dutch. Subject metadata is richer, but it uses free-text labels that vary across cataloguers and languages.

Title search is too narrow

The Rijksmuseum’s BM25 text index only covers titles. A painting titled “De Kruisiging” (The Crucifixion) won’t appear in a search for “crucifixion” because the title is in Dutch. Many works have purely descriptive titles like “Biblical scene” that reveal nothing about the specific subject.

Subject labels are inconsistent

Free-text dc:subject tags vary by cataloguer and era. The same biblical scene might be tagged as “Crucifixion”, “Christ on the Cross”, or “Golgotha”. Searching for one misses the others. Without a controlled vocabulary, recall is inherently incomplete.

Iconclass solves both problems

Each notation is a language-independent code (73D6 = Crucifixion) with labels in 13 languages. The hierarchy groups related concepts: searching for “73D” (Passion of Christ) automatically includes the Crucifixion, Last Supper, and Gethsemane. And because the Rijksmuseum catalogues its artworks with Iconclass notations, the mapping between notation and artwork is already in the data — no NLP or inference required.

How the notation system works

Iconclass notations are alphanumeric codes that encode a path through a hierarchy. Each character extends the specificity: 7 → Bible, 73 → New Testament, 73D → Passion of Christ, 73D6 → Crucifixion. The system has 10 top-level divisions covering all representational subject matter.

Code | Division | Examples | Artworks
0 | Abstract, Non-representational Art | geometric patterns, colour fields | 1,515
1 | Religion and Magic | Christian iconography, saints, angels | 36,000+
2 | Nature | landscapes, animals, plants, weather | 78,000+
3 | Human Being, Man in General | anatomy, senses, death, clothing | 65,000+
4 | Society, Civilization, Culture | trade, war, law, education, heraldry | 91,000+
5 | Abstract Ideas and Concepts | virtues, vices, time, fortune | 5,400+
6 | History | historical events, persons, places | 96,000+
7 | Bible | Old and New Testament narratives | 12,000+
8 | Literature | classical literature, mythology | 2,100+
9 | Classical Mythology and Ancient History | Greek/Roman gods, myths, legends | 9,500+

Hierarchy depth: from “Bible” to a single scene

7 Bible (139 artworks)
  73 New Testament (220)
    73D Passion of Christ
      73D6 Crucifixion: Christ’s death on the cross; Golgotha (362)
        73D61 comprehensive representations on Golgotha (56)
        73D64 crucified Christ, with particular persons under the cross (85)
        73D66 Christ on the cross alone, without bystanders (58)

Maximum hierarchy depth: 11 levels. Most notations cluster at depth 5–7. Each level adds specificity while inheriting its parent’s meaning.
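
Because each character extends the parent's code, the ancestry of a plain notation can be read off its prefixes. The following is a minimal Python sketch of that idea, not the server's code. It assumes plain alphanumeric notations only (bracketed text and keys are not handled) and assumes that a doubled letter, as in 25FF21, counts as a single level. The function name `ancestor_path` is illustrative.

```python
def ancestor_path(notation: str) -> list:
    """Return the chain of prefixes from the top-level division down to `notation`."""
    path = []
    i = 0
    while i < len(notation):
        i += 1
        # Assumption: a doubled letter in third position (e.g. "25FF") is one level.
        if (i == 3 and len(notation) > 3
                and notation[2].isalpha() and notation[3] == notation[2]):
            i = 4
        path.append(notation[:i])
    return path


print(" > ".join(ancestor_path("73D6")))  # 7 > 73 > 73D > 73D6
```

The build pipeline described in the next section derives ancestry from explicit parent and child links instead, which does not depend on such string conventions.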

How the Iconclass database is built

The Iconclass DB is a self-contained SQLite database built from the Iconclass CC0 data dump and cross-referenced against the vocabulary database for artwork counts.

Phase 1: Parse notations
Read notations.txt → extract hierarchy (children, refs) → compute ancestor paths via parent→child reverse map

Phase 2: Parse texts
Read txt/{lang}/*.txt for 13 languages → 279,231 label entries → build FTS5 full-text index

Phase 3: Parse keywords
Read kw/{lang}/*.txt → 780,049 keyword entries → separate FTS5 keyword index

Phase 4: Cross-reference
Query vocabulary.db for per-notation artwork counts → update rijks_count column

Phase 5: Embed
Generate 384d embeddings for all ~39,800 notations on Modal cloud GPU (e5-small, int8)
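
The reverse-map step in Phase 1 can be illustrated with a small Python sketch. It assumes the children lists have already been parsed into a dictionary (the notations.txt file format is not reproduced here), and the helper names `build_parent_map` and `ancestors` are hypothetical.

```python
def build_parent_map(children: dict) -> dict:
    """Invert a notation -> children mapping into child -> parent."""
    return {child: parent for parent, kids in children.items() for child in kids}


def ancestors(notation: str, parent_of: dict) -> list:
    """Walk upward from `notation` and return its ancestors, root first."""
    chain = []
    current = notation
    while current in parent_of:
        current = parent_of[current]
        chain.append(current)
    return chain[::-1]


# Toy hierarchy taken from the Crucifixion example above.
children = {
    "7": ["73"],
    "73": ["73D"],
    "73D": ["73D6"],
    "73D6": ["73D61", "73D64", "73D66"],
}
parent_of = build_parent_map(children)
print(ancestors("73D64", parent_of))  # ['7', '73', '73D', '73D6']
```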

Database contents: 279K labels, 780K keywords, 40K notations, plus embeddings.

Database size: 130 MB (including embeddings). Stored as a single SQLite file alongside the vocabulary and embeddings databases.

Three ways to find a notation

The Iconclass server supports three search modes, each suited to a different stage of research. Results include artwork counts so the user can gauge collection coverage before querying search_artwork.

Keyword search

FTS5 full-text search across labels and keywords in all 13 languages. Exact word matching (no stemming) — “crucifixion” won’t match “crucified”. Results ranked by artwork count.

query: "crucifixion"
→ 52 notations matched
→ 73D6 (362 artworks), 73D5 (41),
   73F2165 "Peter crucified" (37) …

Browse

Navigate the hierarchy by notation code. Returns the entry with its full ancestry path and direct children — ideal for exploring what a notation contains or narrowing a broad category.

notation: "73D6"
→ path: 7 › 73 › 73D › 73D6
→ 9 children: 73D61…73D69
→ keywords: Calvary, Golgotha,
   crucifixion, death, last hours

Semantic search

Conceptual search using 384-dimensional embeddings (intfloat/multilingual-e5-small). Finds notations by meaning rather than exact keywords — “domestic animals in everyday life” finds 34B1 (domestic animals kept in the house), 34B2 (kept outside the house), and even 25FF21 (fabulous animals ~ domestic animals). Optional onlyWithArtworks filter restricts results to notations that appear in the collection.

semanticQuery: "domestic animals in everyday life"
→ 34B1 (11 artworks, dist: 0.148)
→ 34B "domestic animals, kept in and
   outside the house" (6, dist: 0.148)
→ 34B13 "domestic birds" (8, dist: 0.154)
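
Ranking by embedding distance, with the optional onlyWithArtworks filter, can be sketched in plain Python as follows. This is an illustration under stated assumptions: the vectors are 3-dimensional toy values rather than real 384-dimensional embeddings, the row shape `(notation, rijks_count, vector)` and the `toy-unused` entry are invented for the example, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, rows, k=3, only_with_artworks=False):
    """Return up to k (notation, rijks_count, distance) tuples, closest first."""
    scored = [
        (notation, count, cosine_distance(query_vec, vec))
        for notation, count, vec in rows
        if not only_with_artworks or count > 0
    ]
    scored.sort(key=lambda item: item[2])
    return scored[:k]


# Toy rows: made-up vectors, not real embeddings.
rows = [
    ("34B1", 11, [0.9, 0.1, 0.0]),
    ("34B13", 8, [0.7, 0.7, 0.0]),
    ("toy-unused", 0, [1.0, 0.0, 0.0]),  # hypothetical notation with no artworks
]
query = [1.0, 0.0, 0.0]
print([n for n, _, _ in nearest(query, rows)])
print([n for n, _, _ in nearest(query, rows, only_with_artworks=True)])
```

With the filter on, the closest vector is dropped because its notation has no linked artworks, which is the behaviour the onlyWithArtworks option describes.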

Embedding strategy: composite texts

Each Iconclass notation is embedded as a composite text that combines its label, Dutch translation, keywords, and full category path. This gives the embedding model richer context than a bare label — critical for short or ambiguous entries.

What goes into a single embedding

Notation 73D6
[Description] the crucifixion of Christ: Christ's death on the cross; Golgotha (Matthew 27:45-58 …)
[Description NL] Christus' kruisdood
[Keywords] Calvary, Golgotha, crucifixion, death, last hours
[Category] Bible > New Testament > Passion of Christ > Crucifixion

Why composite?

A notation like 31A33 has the bare label “smell, smelling (one of the five senses)”. Without the category path Human Being > Senses > Taste & Smell, the embedding can’t distinguish it from “smell” in a chemical or environmental sense. The hierarchy path acts as a disambiguation signal.

Dutch labels add multilingual coverage — a query in Dutch matches via embedding similarity even though FTS5 only does exact word matching.

Embeddings are generated on Modal cloud GPU (NVIDIA T4) using intfloat/multilingual-e5-small (384 dimensions, int8 quantized). The same model encodes user queries at runtime on the host machine (CPU, ~130 MB).
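
Assembling the composite text for one notation might look like the Python sketch below. The bracketed field tags follow the 73D6 example above; the newline separator, the choice to omit a missing Dutch label, and the function name `composite_text` are assumptions made for illustration.

```python
def composite_text(label_en, label_nl, keywords, category_path):
    """Join label, Dutch label, keywords and category path into one string."""
    parts = [f"[Description] {label_en}"]
    if label_nl:  # Assumption: notations without a Dutch label skip this field.
        parts.append(f"[Description NL] {label_nl}")
    if keywords:
        parts.append("[Keywords] " + ", ".join(keywords))
    if category_path:
        parts.append("[Category] " + " > ".join(category_path))
    return "\n".join(parts)


text = composite_text(
    "the crucifixion of Christ: Christ's death on the cross; Golgotha",
    "Christus' kruisdood",
    ["Calvary", "Golgotha", "crucifixion", "death", "last hours"],
    ["Bible", "New Testament", "Passion of Christ", "Crucifixion"],
)
print(text)
```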

Multilingual coverage

Iconclass labels are available in 13 languages, though coverage varies widely. The top six languages each cover >99% of notations; the long tail drops sharply. A language fallback chain (requested → English → Dutch → any) ensures every notation returns a label regardless of the query language.

English: 40,670 (99.9%)
German: 40,645
French: 40,642
Italian: 40,639
Japanese: 40,476
Portuguese: 40,354
Finnish: 13,615 (33.5%)
Spanish: 11,800 (29%)
Chinese: 6,709 (16.5%)
Dutch: 1,900 (4.7%)
Polish: 1,493 (3.7%)
Hungarian: 288 (0.7%)
Czech: <0.1%

Most-used notations in the Rijksmuseum

20,220 of the 39,802 notations (51%) are linked to at least one artwork. The distribution is heavily skewed — a few broad notations cover tens of thousands of works, while the long tail describes rare or highly specific subjects.

Notation | Subject | Division | Artworks
61B2 | historical persons (portraits, depictions) | 6 | 90,471
61BB2 | historical persons - women | 6 | 14,026
31D14 | adult man | 3 | 13,156
11Q712 | church (exterior) | 1 | 11,465
46A122 | armorial bearing, heraldry | 4 | 10,859
31D15 | adult woman | 3 | 10,154
25I1 | city view in general (‘veduta’) | 2 | 9,518
46C24 | sailing ship, sailing boat | 4 | 7,979
25H213 | river | 2 | 7,898
34B11 | dog | 3 | 6,291

Artwork counts are pre-computed at database build time, so they are approximate. Use search_artwork(iconclass: "73D6") for current, precise results.

The two-step workflow: lookup, then search

Iconclass integrates into the broader search pipeline through a two-step pattern: first discover the right notation code via the Iconclass server, then pass it to search_artwork as a structured filter. This separates the vocabulary navigation problem from the collection query problem.

Step 1: Discover notation
Iconclass server search / browse
keyword / browse / semantic → notation code + artwork count

Step 2: Search collection
search_artwork(iconclass: "73D6")
Combines with all other filters (creator, date, material, type …)

Example: finding Crucifixion prints from the 17th century

1. Iconclass search(query: "crucifixion")
   → 73D6 (362 artworks)

2. search_artwork(
   iconclass: "73D6",
   type: "print",
   creationDateFrom: 1600,
   creationDateTo: 1699
  )

Example: heraldic objects in silver

1. Iconclass search(query: "heraldry")
   → 46A122 (10,859 artworks)

2. search_artwork(
   iconclass: "46A122",
   material: "silver"
  )
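
The hand-off between the two steps can be sketched in Python as a small helper that turns lookup results into search_artwork arguments. Everything here is hypothetical: the result record shape (`notation`, `count`), the helper name `build_artwork_query`, and the policy of automatically picking the notation with the most artworks. In practice the user or the calling model chooses the notation.

```python
def build_artwork_query(lookup_results, **filters):
    """Pick the notation with the most artworks and merge it with other filters."""
    if not lookup_results:
        raise ValueError("no notations found; try another search mode")
    best = max(lookup_results, key=lambda r: r["count"])
    return {"iconclass": best["notation"], **filters}


# Step 1 output, using the counts quoted in the keyword search example.
results = [
    {"notation": "73D5", "count": 41},
    {"notation": "73D6", "count": 362},
    {"notation": "73F2165", "count": 37},
]

# Step 2 arguments for search_artwork.
args = build_artwork_query(
    results, type="print", creationDateFrom=1600, creationDateTo=1699
)
print(args)
```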

Integration with collection_stats

The iconclass parameter also works in collection_stats, enabling aggregate analysis: “which creators produced the most works tagged with this subject?” or “what materials are most common for works depicting dogs?”

collection_stats(
  dimension: "creator",
  iconclass: "34B11" // dog
)
→ top creators of works depicting dogs

Under the hood

Key implementation choices that affect performance and capability.

Dual FTS5 indexes

Labels and keywords live in separate FTS5 virtual tables. A single UNION query searches both and deduplicates the hits by notation. This avoids contaminating label matches with keyword noise and vice versa, while still returning a single ranked result list.
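
A minimal Python sketch of this pattern with the standard sqlite3 module is shown below. It assumes an SQLite build with FTS5 enabled. The table and column names (`label_fts`, `keyword_fts`, `body`) and the toy rows are illustrative; only `rijks_count` is named in this document, and the real schema may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE notations (notation TEXT PRIMARY KEY, rijks_count INTEGER);
    CREATE VIRTUAL TABLE label_fts USING fts5(notation UNINDEXED, body);
    CREATE VIRTUAL TABLE keyword_fts USING fts5(notation UNINDEXED, body);
""")
con.executemany("INSERT INTO notations VALUES (?, ?)",
                [("73D6", 362), ("73F2165", 37)])
con.executemany("INSERT INTO label_fts VALUES (?, ?)", [
    ("73D6", "the crucifixion of Christ: Christ's death on the cross; Golgotha"),
    ("73F2165", "Peter crucified"),
])
con.executemany("INSERT INTO keyword_fts VALUES (?, ?)", [
    ("73D6", "Calvary"), ("73D6", "Golgotha"), ("73D6", "crucifixion"),
])


def keyword_search(con, word):
    """Search labels and keywords, one row per notation, most artworks first."""
    phrase = '"' + word.replace('"', '""') + '"'  # quote as an FTS5 string
    sql = """
        SELECT n.notation, n.rijks_count
        FROM notations AS n
        JOIN (
            SELECT notation FROM label_fts WHERE label_fts MATCH :q
            UNION  -- UNION (not UNION ALL) removes duplicate notations
            SELECT notation FROM keyword_fts WHERE keyword_fts MATCH :q
        ) AS hits ON hits.notation = n.notation
        ORDER BY n.rijks_count DESC
    """
    return con.execute(sql, {"q": phrase}).fetchall()


print(keyword_search(con, "crucifixion"))  # matched by label and keyword, listed once
print(keyword_search(con, "crucified"))    # no stemming: a different word
```

In this toy data, "crucifixion" hits 73D6 through both its label and a keyword but appears once in the result, and "crucified" matches only the label that contains that exact word.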

Dual KNN paths

Pure semantic search uses the vec0 virtual table (fast brute-force scan). When onlyWithArtworks is set, a slower path joins against rijks_count > 0 using vec_distance_cosine() on the regular embeddings table. Both use int8 quantized vectors.
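
To make the int8 point concrete, here is a small Python sketch of one common quantization scheme: unit-normalize a vector, then scale each component to an integer in [-127, 127]. This is an assumption for illustration and not necessarily the scheme the server or its vector extension uses; the function names are hypothetical.

```python
import math


def quantize_int8(vec):
    """Unit-normalize, then map each component from [-1, 1] to [-127, 127]."""
    norm = math.sqrt(sum(x * x for x in vec))  # assumes a non-zero vector
    return [max(-127, min(127, round(x / norm * 127))) for x in vec]


def cosine_distance(a, b):
    """1 - cosine similarity; works for float and integer vectors alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [0.2, 0.5, -0.1]
b = [0.25, 0.4, 0.0]
exact = cosine_distance(a, b)
approx = cosine_distance(quantize_int8(a), quantize_int8(b))
print(f"float: {exact:.4f}  int8: {approx:.4f}")
```

The two distances agree closely for these toy vectors, which is why int8 storage (one byte per dimension instead of four) is a reasonable trade for nearest-neighbour ranking.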

Language fallback chain

Every label lookup tries: requested language → English → Dutch → any. Graceful degradation ensures queries always return labels, even for languages with <1% coverage. Prepared statements are cached for the lifetime of the DB connection.
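
The fallback order can be sketched in a few lines of Python. The language codes ("en", "nl") and the dictionary-based label store are assumptions for illustration; the server itself resolves labels through cached SQL statements.

```python
from typing import Optional


def pick_label(labels: dict, requested: str) -> Optional[str]:
    """Try requested language, then English, then Dutch, then any available label."""
    for lang in (requested, "en", "nl"):
        if lang in labels:
            return labels[lang]
    # Last resort: whatever label exists, or None if there is none at all.
    return next(iter(labels.values()), None)


labels_73d6 = {
    "en": "the crucifixion of Christ",
    "nl": "Christus' kruisdood",
}
print(pick_label(labels_73d6, "nl"))  # Christus' kruisdood
print(pick_label(labels_73d6, "hu"))  # the crucifixion of Christ
```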

Dimension guard

Iconclass embeddings are always 384d (e5-small native). Artwork embeddings may be MRL-truncated to a shorter dimension. A runtime check in registration.ts prevents mismatched KNN queries — if the artwork embedding model outputs a different dimension, the Iconclass semantic search path is silently disabled.
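
The guard amounts to comparing two dimensions before exposing the semantic mode. The Python sketch below paraphrases that idea; it is not the code in registration.ts, and the function name and mode labels are hypothetical.

```python
ICONCLASS_DIM = 384  # native output size of e5-small, per the text above


def available_search_modes(query_model_dim: int, index_dim: int = ICONCLASS_DIM):
    """Return the search modes to register; drop 'semantic' on a dimension mismatch."""
    modes = ["keyword", "browse"]
    if query_model_dim == index_dim:
        modes.append("semantic")
    # On mismatch nothing is raised: the semantic mode is simply not offered.
    return modes


print(available_search_modes(384))  # ['keyword', 'browse', 'semantic']
print(available_search_modes(256))  # ['keyword', 'browse']
```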