src — search

Module: src-search Cohesion: 0.80 Members: 0

The src/search module provides a hybrid search engine that combines traditional keyword-based search with vector similarity search. The hybrid approach aims to improve recall and precision by matching both the literal terms in a query and its semantic meaning.

It is built around three core components:

  1. BM25Index: An in-memory implementation of the BM25 (Best Match 25) ranking algorithm for keyword search.
  2. USearchVectorIndex: A high-performance vector index leveraging the usearch library for approximate nearest neighbor (ANN) search, enabling semantic similarity.
  3. HybridSearchEngine: An orchestrator that combines results from both BM25 and vector indexes, applies configurable weighting, and manages data sources and caching.

The module exposes APIs for indexing, searching, and managing search data across application domains such as user memories, code snippets, and messages.

Architecture Overview

The HybridSearchEngine acts as the central coordinator. On initialization, it sets up a dedicated BM25Index and USearchVectorIndex for each configured SearchSource (e.g., 'memories', 'code', 'messages'). When a search query arrives, it obtains embeddings from an EmbeddingProvider, dispatches the query to the relevant BM25 and vector indexes, and then merges and ranks the results using the configured weights.

graph TD
    subgraph Search Module
        HSE[HybridSearchEngine] -->|uses| BMI[BM25Index]
        HSE -->|uses| USVI[USearchVectorIndex]
        HSE -->|gets embeddings from| EP(EmbeddingProvider)
        HSE -->|loads data from| MR(MemoryRepository)
    end

    subgraph BM25Index
        BMI -->|tokenizes & stems| Tokenization(tokenize/stem)
    end

    subgraph USearchVectorIndex
        USVI -->|dynamic import| U(usearch library)
        USVI -->|fallback to| FVI(FallbackVectorIndex)
    end

    style HSE fill:#f9f,stroke:#333,stroke-width:2px
    style BMI fill:#bbf,stroke:#333,stroke-width:2px
    style USVI fill:#bbf,stroke:#333,stroke-width:2px
    style EP fill:#cfc,stroke:#333,stroke-width:2px
    style MR fill:#cfc,stroke:#333,stroke-width:2px

Key Components

1. BM25 Keyword Search (src/search/bm25.ts)

This component provides an in-memory implementation of the BM25 ranking function, a standard for keyword-based full-text search.

Tokenization and Stemming

Before indexing or searching, text content is processed to extract meaningful terms.
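
The tokenizer and stemmer used by bm25.ts are not reproduced on this page. As a rough illustration only, the sketch below assumes lowercasing, splitting on non-alphanumeric characters, dropping a small stop-word list, and naive suffix stripping. The names tokenize and stem follow the architecture diagram; the stop-word list and the suffix rules are hypothetical stand-ins, not the module's actual behavior.

```typescript
// Hypothetical stop-word list; the module's real list is not shown in this document.
const STOP_WORDS = new Set<string>(["a", "an", "the", "is", "to", "of", "and", "for"]);

// Naive suffix stripping, a stand-in for a real stemmer such as Porter's.
function stem(token: string): string {
  for (const suffix of ["ing", "ed", "es", "s"]) {
    // Only strip when at least three characters of the word remain.
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, token.length - suffix.length);
    }
  }
  return token;
}

// Lowercase, split on anything that is not a letter or digit,
// drop empty tokens and stop words, then stem what is left.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t))
    .map(stem);
}
```

With these assumed rules, tokenize("Indexing the documents") yields ["index", "document"]. The same function would be applied to both documents and queries so that their terms line up.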

BM25Index Class

The BM25Index class manages the inverted index and calculates BM25 scores.

When a document is added and a document with the same id already exists, the existing one is removed first (an upsert operation).

  1. Tokenizes and stems the query.
  2. Iterates through all documents in the index.
  3. For each document and each query term, it calculates the BM25 score using the formula:

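     score = IDF * TF_normalized

     where IDF is the inverse document frequency of the term and TF_normalized is the term's frequency in the document, saturated and normalized by document length. The per-term scores are summed to give the document's score.

  4. Returns a sorted list of { id, score } pairs, limited by limit.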
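
The scoring steps above can be sketched as follows. This is a minimal illustration of the standard BM25 formulation, not the code in bm25.ts: the constants K1 = 1.2 and B = 0.75 are commonly used defaults assumed here, the IDF variant (with +1 inside the logarithm) is one of several in use, and the Doc type and bm25Search function are hypothetical names. Documents are assumed to be tokenized already.

```typescript
// Assumed defaults; the module's actual k1 and b are not stated in this document.
const K1 = 1.2;
const B = 0.75;

// Hypothetical shape: a document reduced to its (already tokenized) terms.
type Doc = { id: string; terms: string[] };

function bm25Search(
  docs: Doc[],
  queryTerms: string[],
  limit: number
): { id: string; score: number }[] {
  const n = docs.length;
  if (n === 0) return [];
  const avgLen = docs.reduce((sum, d) => sum + d.terms.length, 0) / n;

  const results: { id: string; score: number }[] = [];
  for (const doc of docs) {
    let score = 0;
    for (const term of queryTerms) {
      // Term frequency in this document.
      const tf = doc.terms.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Document frequency: how many documents contain the term.
      const df = docs.filter((d) => d.terms.includes(term)).length;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      // Saturating, length-normalized term frequency.
      const tfNorm =
        (tf * (K1 + 1)) / (tf + K1 * (1 - B + (B * doc.terms.length) / avgLen));
      score += idf * tfNorm;
    }
    if (score > 0) results.push({ id: doc.id, score });
  }
  // Highest score first, truncated to the requested limit.
  return results.sort((x, y) => y.score - x.score).slice(0, limit);
}
```

A real inverted index would precompute document frequencies and visit only the postings for the query terms; the nested scans here keep the sketch short at the cost of efficiency.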

Singleton Management

The bm25.ts module also provides functions to manage named BM25Index instances as singletons.
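
The function names that bm25.ts exports for this are not listed on this page. The sketch below shows one common way to manage named singletons, using a module-level Map keyed by name. TinyIndex, getNamedIndex, and removeNamedIndex are hypothetical names chosen for illustration.

```typescript
// Stand-in for BM25Index; only identity matters for this illustration.
class TinyIndex {
  readonly docs = new Map<string, string>();
}

// Module-level registry: one instance per name.
const instances = new Map<string, TinyIndex>();

// Return the instance registered under `name`, creating it on first use.
function getNamedIndex(name: string): TinyIndex {
  let index = instances.get(name);
  if (!index) {
    index = new TinyIndex();
    instances.set(name, index);
  }
  return index;
}

// Drop the instance registered under `name`; returns false if none existed.
function removeNamedIndex(name: string): boolean {
  return instances.delete(name);
}
```

Repeated calls with the same name return the same object, so separate parts of an application can share one index per source without passing references around.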

2. USearch Vector Search (src/search/usearch-index.ts)

This component integrates the usearch library for high-performance vector similarity search, enabling semantic search capabilities. It supports approximate nearest neighbor (ANN) search using the HNSW algorithm, whose search cost grows roughly logarithmically with the number of indexed vectors.

USearchVectorIndex Class

The USearchVectorIndex class wraps the native usearch index and manages vector data, IDs, and metadata.

FallbackVectorIndex

This internal class provides a basic, brute-force vector search implementation. It is used automatically if the native usearch library cannot be loaded, ensuring the application can still function (albeit with reduced performance for large datasets). It implements add, remove, and search using standard distance calculations (cosine, L2 squared, inner product).
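
A brute-force index of this kind can be sketched in a few lines. The code below is an illustration under assumptions, not the FallbackVectorIndex source: the class name BruteForceIndex, the metric labels "cos", "l2sq", and "ip", and the result shape are hypothetical, and distances are defined so that smaller always means closer.

```typescript
type Metric = "cos" | "l2sq" | "ip";

// Distance between two equal-length vectors; smaller means more similar.
function distance(a: number[], b: number[], metric: Metric): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  let l2 = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
    l2 += diff * diff;
  }
  if (metric === "l2sq") return l2;
  if (metric === "ip") return 1 - dot;
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as maximally dissimilar instead of dividing by zero.
  return denom === 0 ? 1 : 1 - dot / denom;
}

class BruteForceIndex {
  private readonly vectors = new Map<string, number[]>();
  private readonly metric: Metric;

  constructor(metric: Metric = "cos") {
    this.metric = metric;
  }

  add(id: string, vector: number[]): void {
    this.vectors.set(id, vector);
  }

  remove(id: string): boolean {
    return this.vectors.delete(id);
  }

  // Compare the query against every stored vector: O(n) per search.
  search(query: number[], k: number): { id: string; distance: number }[] {
    const scored = Array.from(this.vectors.entries()).map(([id, v]) => ({
      id,
      distance: distance(query, v, this.metric),
    }));
    return scored.sort((x, y) => x.distance - y.distance).slice(0, k);
  }
}
```

Because every query scans all stored vectors, this approach is exact but scales linearly with index size, which is the performance trade-off mentioned above.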

Singleton Management

Similar to BM25Index, USearchVectorIndex instances can be managed as singletons.

3. Hybrid Search Engine (src/search/hybrid-search.ts)

The HybridSearchEngine is the primary API for performing searches. It orchestrates the BM25Index and USearchVectorIndex to provide a unified, configurable search experience.
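
How HybridSearchEngine merges the two result lists is not spelled out on this page. One common approach, sketched below as an assumption, is to normalize each list's scores to the range [0, 1] by dividing by the list maximum and then take a weighted sum per document id. The parameter names bm25Weight and vectorWeight mirror the search options used in the usage example; Scored, normalize, and mergeResults are hypothetical, and vector scores are assumed to be similarities (higher is better).

```typescript
type Scored = { id: string; score: number };

// Scale scores so the best result in the list gets 1.
function normalize(results: Scored[]): Map<string, number> {
  const max = results.reduce((m, r) => Math.max(m, r.score), 0);
  const out = new Map<string, number>();
  for (const r of results) {
    out.set(r.id, max > 0 ? r.score / max : 0);
  }
  return out;
}

// Weighted sum of normalized keyword and vector scores, keyed by document id.
function mergeResults(
  bm25: Scored[],
  vector: Scored[],
  bm25Weight: number,
  vectorWeight: number,
  limit: number
): Scored[] {
  const combined = new Map<string, number>();
  normalize(bm25).forEach((s, id) => combined.set(id, bm25Weight * s));
  normalize(vector).forEach((s, id) =>
    combined.set(id, (combined.get(id) ?? 0) + vectorWeight * s)
  );
  return Array.from(combined, ([id, score]) => ({ id, score }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

A document found by only one index still appears in the merged list, but it contributes only that index's weighted share, so documents matched by both tend to rank higher.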

HybridSearchEngine Class

Singleton Management

4. Types (src/search/types.ts)

This file defines all the essential interfaces and types used across the search module, ensuring type safety and clarity.
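
The contents of types.ts are not reproduced here. As a reading aid, the shapes below are inferred from the source names and the option fields that appear elsewhere on this page; they are not copied from the file. The interface names SearchOptions and IndexableDocument, and which fields are optional, are assumptions.

```typescript
// Inferred from this page, not copied from src/search/types.ts.
type SearchSource = "memories" | "code" | "messages";

// Options accepted by a search call, as used in the usage example.
interface SearchOptions {
  query: string;
  limit?: number;
  sources?: SearchSource[];
  vectorOnly?: boolean; // skip the BM25 index
  bm25Only?: boolean; // skip the vector index
  vectorWeight?: number;
  bm25Weight?: number;
}

// A document handed to the engine for keyword indexing.
interface IndexableDocument {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
}

const exampleOptions: SearchOptions = {
  query: "authentication flow",
  bm25Only: true,
  limit: 3,
  sources: ["memories", "code"],
};

const exampleDocument: IndexableDocument = {
  id: "mem1",
  content: "The user mentioned a new authentication flow using OAuth2.",
};
```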

Integration Points

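The src/search module integrates with other parts of the codebase, including the EmbeddingProvider (which supplies embeddings) and the MemoryRepository (from which data is loaded), as shown in the architecture diagram.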

Usage Examples

The src/search/index.ts file serves as the main entry point for consuming the search module and provides convenient exports.

import { getHybridSearchEngine } from './search'; // Or from '@your-package/search'

async function runSearchExamples() {
  const engine = getHybridSearchEngine();
  await engine.initialize();

  // Add some example documents (these would typically come from your application's data layer)
  await engine.indexDocument('memories', {
    id: 'mem1',
    content: 'The user mentioned a new authentication flow using OAuth2.',
    metadata: { type: 'conversation', timestamp: Date.now() }
  });
  await engine.indexDocument('code', {
    id: 'code1',
    content: 'function handleOAuthCallback(code: string) { /* ... */ }',
    metadata: { language: 'typescript', file: 'auth.ts' }
  });
  await engine.indexDocument('memories', {
    id: 'mem2',
    content: 'Remember to implement rate limiting for API endpoints.',
    metadata: { type: 'task', timestamp: Date.now() }
  });

  // For vector search, you'd also index embeddings.
  // In a real app, these would be generated by the EmbeddingProvider.
  // For this example, we'll simulate it.
  const mockEmbedding1 = Array.from({ length: 384 }, () => Math.random());
  const mockEmbedding2 = Array.from({ length: 384 }, () => Math.random());
  await engine.indexVector('memories', 'mem1', mockEmbedding1, { type: 'conversation' });
  await engine.indexVector('memories', 'mem2', mockEmbedding2, { type: 'task' });


  console.log('--- Hybrid Search (default weights) ---');
  const results = await engine.search({
    query: 'how to handle user login securely',
    limit: 5,
    sources: ['memories', 'code'],
  });
  console.log(results);

  console.log('\n--- Vector-only Search ---');
  const vectorResults = await engine.search({
    query: 'semantic similarity search for API security',
    vectorOnly: true,
    limit: 3,
    sources: ['memories'],
  });
  console.log(vectorResults);

  console.log('\n--- BM25-only Search ---');
  const keywordResults = await engine.search({
    query: 'authentication flow',
    bm25Only: true,
    limit: 3,
    sources: ['memories', 'code'],
  });
  console.log(keywordResults);

  console.log('\n--- Custom Weights Search ---');
  const customResults = await engine.search({
    query: 'rate limiting implementation',
    vectorWeight: 0.5,
    bm25Weight: 0.5,
    limit: 3,
    sources: ['memories'],
  });
  console.log(customResults);

  console.log('\n--- Search Stats ---');
  console.log(engine.getStats());

  // Clean up
  engine.dispose();
}

runSearchExamples().catch(console.error);