tests — input

Module: tests-input Cohesion: 0.80 Members: 0

This page documents the src/input/multimodal-input.ts module, which manages multimodal input (primarily images) within the application. The behavior described here is inferred from the test file tests/input/multimodal-input.test.ts.

Multimodal Input Manager

The multimodal-input module handles multimodal input, with a focus on image management. It lets the application load, store, retrieve, and prepare images for use with AI models, and it detects which multimodal capabilities the host system supports.

Purpose

The primary goals of this module are:

  1. Image Lifecycle Management: Provide a centralized mechanism to load images from files, validate them, store them in memory, and prepare them for API consumption.
  2. Capability Detection: Determine the system's ability to perform multimodal operations such as taking screenshots, accessing the clipboard, performing OCR, and general image processing.
  3. Configuration & Isolation: Allow configuration of image handling parameters (e.g., temporary directory, max size, supported formats) and ensure a clean state for testing or different contexts.
  4. Event-Driven Updates: Notify other parts of the application about significant events, such as initialization completion or image loading/removal.
  5. Singleton Access: Provide a consistent, globally accessible instance of the manager.

Key Components

The module exposes a class and two utility functions:

MultimodalInputManager Class

This is the core class responsible for all multimodal input operations.

    new MultimodalInputManager(options: {
      tempDir: string;
      maxImageSize: number;
      supportedFormats: string[];
    });

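The constructor initializes the manager with three configuration options: tempDir (the temporary directory), maxImageSize (the maximum accepted image size), and supportedFormats (the accepted image formats).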

initialize() detects and caches the system's multimodal capabilities. Call it once at application startup. It returns a Promise that resolves with a Capabilities object whose boolean fields screenshotAvailable, clipboardAvailable, ocrAvailable, and imageProcessingAvailable indicate what the system supports. Subsequent calls return the cached capabilities.
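The caching behavior described above can be sketched as follows. This is a minimal illustration, not the module's implementation: CapabilityCache and the injected detect function are hypothetical names, and only the four capability flags come from the description.

```typescript
// The four flags are taken from the description; everything else is assumed.
interface Capabilities {
  screenshotAvailable: boolean;
  clipboardAvailable: boolean;
  ocrAvailable: boolean;
  imageProcessingAvailable: boolean;
}

// Hypothetical helper that runs detection once and reuses the result.
class CapabilityCache {
  private cached: Promise<Capabilities> | null = null;
  private readonly detect: () => Promise<Capabilities>;

  constructor(detect: () => Promise<Capabilities>) {
    this.detect = detect;
  }

  initialize(): Promise<Capabilities> {
    // Store the promise itself so concurrent callers share one detection run.
    if (this.cached === null) {
      this.cached = this.detect();
    }
    return this.cached;
  }
}

// Fake detector that counts how often it is invoked.
let detectCalls = 0;
const cache = new CapabilityCache(async () => {
  detectCalls += 1;
  return {
    screenshotAvailable: false,
    clipboardAvailable: false,
    ocrAvailable: false,
    imageProcessingAvailable: false,
  };
});

const first = cache.initialize();
const second = cache.initialize(); // reuses the promise from the first call
```

Caching the promise rather than the resolved value means a second caller that arrives while detection is still running does not trigger a second run. One trade-off of this sketch is that a rejected detection would be cached too.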

loadImageFile(filePath) loads an image from the specified file path.

  1. Performs validation against maxImageSize and supportedFormats.
  2. Reads the file, converts it to base64, and stores it internally with a unique ID.
  3. Emits an image:loaded event.

Throws an error if the file is not found, its format is not in supportedFormats, or its size exceeds maxImageSize.
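The validation step could look roughly like the sketch below. It rests on two assumptions that the tests do not confirm: that maxImageSize is a byte count, and that supportedFormats holds lower-case file extensions without a leading dot. ImageLimits and validateImage are hypothetical names used only for illustration.

```typescript
// Hypothetical shape: only the two option names come from the constructor.
interface ImageLimits {
  maxImageSize: number; // assumed to be a byte count
  supportedFormats: string[]; // assumed to be lower-case extensions, e.g. "png"
}

// Throws if the file's extension or size violates the limits.
function validateImage(filePath: string, sizeBytes: number, limits: ImageLimits): void {
  // Strip any directory part so dots in folder names are ignored.
  const lastSeparator = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const fileName = filePath.slice(lastSeparator + 1);

  // A leading dot (e.g. ".png") is treated as a hidden file, not an extension.
  const dot = fileName.lastIndexOf(".");
  const extension = dot > 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  if (!limits.supportedFormats.includes(extension)) {
    throw new Error(`Unsupported image format: "${extension}"`);
  }
  if (sizeBytes > limits.maxImageSize) {
    throw new Error(`Image is ${sizeBytes} bytes; the limit is ${limits.maxImageSize}`);
  }
}

const limits: ImageLimits = { maxImageSize: 1024, supportedFormats: ["png", "jpg"] };
validateImage("/tmp/photo.PNG", 512, limits); // passes: extension matches, size within limit
```

Checking the format before the size keeps the error message focused on the first reason a file is rejected.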

getImage(id) retrieves a previously loaded image by its unique ID.

getAllImages() returns an array of all currently loaded images.

removeImage(id) removes a stored image by its ID. It emits an image:removed event if the image was removed, and returns true if the image was removed or false otherwise.

Removes all currently loaded images from the manager's internal storage.
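The storage, retrieval, and removal behavior described above can be sketched with a Map and Node's EventEmitter. This is an assumption-laden illustration: ImageStore, StoredImage, add, and clear are hypothetical names, and the real manager may store additional metadata per image.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical record shape; the real image object may carry more fields.
interface StoredImage {
  id: string;
  base64: string;
  mimeType: string;
}

class ImageStore extends EventEmitter {
  private readonly images = new Map<string, StoredImage>();

  // Stand-in for the storing step of loadImageFile.
  add(image: StoredImage): void {
    this.images.set(image.id, image);
    this.emit("image:loaded", image);
  }

  getImage(id: string): StoredImage | undefined {
    return this.images.get(id);
  }

  getAllImages(): StoredImage[] {
    return Array.from(this.images.values());
  }

  // Returns true and emits image:removed only when the ID was present.
  removeImage(id: string): boolean {
    const image = this.images.get(id);
    if (image === undefined) {
      return false;
    }
    this.images.delete(id);
    this.emit("image:removed", image);
    return true;
  }

  // Hypothetical name for the "remove all images" operation.
  clear(): void {
    this.images.clear();
  }
}

const store = new ImageStore();
store.on("image:removed", (image: StoredImage) => {
  console.log(`Image removed: ${image.id}`);
});
store.add({ id: "img-1", base64: "AAAA", mimeType: "image/png" });
store.removeImage("img-1"); // logs "Image removed: img-1" and returns true
```

Emitting the event only after the entry is deleted means listeners that call getAllImages() from the handler already see the updated state.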

prepareForAPI(id) prepares a loaded image for submission to an external API by returning its base64-encoded data and MIME type. It throws an error if the image ID is not found.
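A minimal sketch of this lookup, assuming images are kept in a Map keyed by ID. The payload fields base64 and mimeType match the description; the standalone function signature, the StoredImage shape, and the toDataUrl helper are hypothetical additions for illustration.

```typescript
// Hypothetical record shape for a stored image.
interface StoredImage {
  id: string;
  base64: string;
  mimeType: string;
}

// Payload fields named in the description.
interface ApiImagePayload {
  base64: string;
  mimeType: string;
}

function prepareForAPI(images: Map<string, StoredImage>, id: string): ApiImagePayload {
  const image = images.get(id);
  if (image === undefined) {
    throw new Error(`Image not found: ${id}`);
  }
  return { base64: image.base64, mimeType: image.mimeType };
}

// Optional helper: some APIs accept images as data URLs (RFC 2397).
function toDataUrl(payload: ApiImagePayload): string {
  return `data:${payload.mimeType};base64,${payload.base64}`;
}

const images = new Map<string, StoredImage>([
  ["img-1", { id: "img-1", base64: "aGVsbG8=", mimeType: "image/png" }],
]);
const payload = prepareForAPI(images, "img-1");
console.log(toDataUrl(payload)); // data:image/png;base64,aGVsbG8=
```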

formatSummary() generates a human-readable summary string of the manager's current state, including loaded images and detected capabilities. It is useful for debugging or for displaying status to the user.

The manager extends an event emitter, so other components can subscribe to lifecycle events such as image:loaded and image:removed.

getMultimodalInputManager(): MultimodalInputManager

This function provides access to a singleton instance of the MultimodalInputManager. It ensures that only one instance of the manager exists throughout the application, promoting consistent state management.

resetMultimodalInputManager(): void

This utility function clears the singleton instance, forcing getMultimodalInputManager() to create a new instance on its next call. This is primarily useful for testing or scenarios where a fresh, unconfigured manager is required.
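The accessor and reset pair follows the familiar lazy-singleton pattern, sketched below. ManagerStub, getManager, and resetManager are hypothetical stand-ins for the real class and functions, which are not reproduced here.

```typescript
// Stand-in for MultimodalInputManager; the real class is not reproduced here.
class ManagerStub {
  readonly images: string[] = [];
}

// Module-level slot that holds the single shared instance.
let instance: ManagerStub | null = null;

// Creates the instance on first use and returns the same one afterwards.
function getManager(): ManagerStub {
  if (instance === null) {
    instance = new ManagerStub();
  }
  return instance;
}

// Drops the instance so the next getManager() call builds a new one.
function resetManager(): void {
  instance = null;
}

const firstManager = getManager();
const sameManager = getManager();
resetManager();
const freshManager = getManager();
console.log(firstManager === sameManager, firstManager === freshManager); // true false
```

In a test suite, calling the reset function in a setup hook gives each test a fresh instance, so state such as loaded images cannot leak between tests.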

Image Lifecycle Flow

The following diagram illustrates the typical flow of an image through the MultimodalInputManager:

graph TD
    A["File Path"] --> B{"loadImageFile(filePath)"}
    B -- Validation --> C{"Image Data (base64, mimeType)"}
    C --> D["Store Image (ID, metadata)"]
    D --> E["Emit 'image:loaded' event"]

    D -- Retrieve --> F{"getImage(id)"}
    D -- Prepare for API --> G{"prepareForAPI(id)"}
    G --> H["API Payload"]

    D -- Remove --> I{"removeImage(id)"}
    I --> J["Emit 'image:removed' event"]
    I --> K["Remove from storage"]

Integration and Usage

Other modules should interact with the MultimodalInputManager primarily through the getMultimodalInputManager() singleton accessor.

import { getMultimodalInputManager } from "./multimodal-input";

async function setupMultimodalInput() {
  const manager = getMultimodalInputManager();

  // Configuration: the tests pass options (tempDir, maxImageSize, supportedFormats)
  // to the MultimodalInputManager constructor. How the singleton receives them is
  // not shown, so this example assumes it is already configured or uses defaults.

  // Detect capabilities (cached after the first call)
  const capabilities = await manager.initialize();
  console.log("Multimodal capabilities:", capabilities);

  // Listen for events
  manager.on("image:loaded", (image) => {
    console.log(`Image loaded: ${image.id} (${image.source})`);
    // Update UI, log, etc.
  });

  manager.on("image:removed", (image) => {
    console.log(`Image removed: ${image.id}`);
    // Update UI, log, etc.
  });

  // Load an image from a file
  try {
    const image = await manager.loadImageFile("/path/to/my/image.png");
    console.log("Loaded image ID:", image.id);

    // Get all loaded images
    const allImages = manager.getAllImages();
    console.log("Total images loaded:", allImages.length);

    // Prepare an image for an API call
    const apiPayload = await manager.prepareForAPI(image.id);
    // Send apiPayload.base64 and apiPayload.mimeType to an AI model API

    // Get a summary of the current state
    console.log(manager.formatSummary());

    // Remove the image (emits "image:removed")
    manager.removeImage(image.id);
  } catch (error) {
    console.error("Failed to handle image:", error);
  }
}

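// initialize() runs outside the try block, so catch its rejection here.
setupMultimodalInput().catch((error) => {
  console.error("Multimodal input setup failed:", error);
});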

Configuration Considerations

The tests construct MultimodalInputManager directly with options (tempDir, maxImageSize, supportedFormats), whereas getMultimodalInputManager() is called without arguments. The tests therefore do not establish how the singleton receives its configuration: it may be created with defaults, accept options when it is first created, or expose a separate configure() method. Check the implementation before relying on any of these, and make sure the manager is configured for your application's needs before loading images.