tests — input
This page documents the src/input/multimodal-input.ts module, which manages multimodal input (primarily images) for the application. The behavior described here is inferred from the test file tests/input/multimodal-input.test.ts.
Multimodal Input Manager
The multimodal-input module handles multimodal input, chiefly images. It lets the application load, store, retrieve, and prepare images for use with AI models, and it detects which multimodal capabilities the host system supports.
Purpose
The primary goals of this module are:
- Image Lifecycle Management: Provide a centralized mechanism to load images from files, validate them, store them in memory, and prepare them for API consumption.
- Capability Detection: Determine the system's ability to perform multimodal operations such as taking screenshots, accessing the clipboard, performing OCR, and general image processing.
- Configuration & Isolation: Allow configuration of image handling parameters (e.g., temporary directory, max size, supported formats) and ensure a clean state for testing or different contexts.
- Event-Driven Updates: Notify other parts of the application about significant events, such as initialization completion or image loading/removal.
- Singleton Access: Provide a consistent, globally accessible instance of the manager.
Key Components
The module exposes a class and two utility functions:
MultimodalInputManager Class
This is the core class responsible for all multimodal input operations.
- Constructor:
new MultimodalInputManager(options: {
  tempDir: string;
  maxImageSize: number;
  supportedFormats: string[];
});
Initializes the manager with configuration options:
- tempDir: A directory for temporary file operations (e.g., for image processing).
- maxImageSize: The maximum allowed size for images, in bytes.
- supportedFormats: An array of file extensions (e.g., ".png", ".jpg") that the manager will accept.
initialize(): Promise<Capabilities>
Detects and caches the system's multimodal capabilities. This method should be called once at application startup.
Returns a Promise that resolves with a Capabilities object, indicating whether screenshotAvailable, clipboardAvailable, ocrAvailable, and imageProcessingAvailable are true or false. Subsequent calls return the cached capabilities.
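The caching behavior described above can be sketched in isolation. This is a minimal illustration, not the module's implementation: the CapabilityCache class and the Detector type are hypothetical names, and the detection logic is injected because the real probing code is not documented. Only the four flag names come from the description above.

```typescript
// The four flags are named in the documentation; the interface itself is an assumed shape.
interface Capabilities {
  screenshotAvailable: boolean;
  clipboardAvailable: boolean;
  ocrAvailable: boolean;
  imageProcessingAvailable: boolean;
}

// Hypothetical stand-in for the real (undocumented) detection logic.
type Detector = () => Promise<Capabilities>;

class CapabilityCache {
  private readonly detect: Detector;
  private cached: Promise<Capabilities> | undefined;

  constructor(detect: Detector) {
    this.detect = detect;
  }

  // The first call runs detection; later calls reuse the stored promise.
  initialize(): Promise<Capabilities> {
    if (this.cached === undefined) {
      this.cached = this.detect();
    }
    return this.cached;
  }
}
```

Caching the promise rather than the resolved value means concurrent callers share one detection run. A failed detection would also stay cached in this sketch, so a real implementation may want to clear the cache on rejection.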
loadImageFile(filePath: string): Promise<Image>
Loads an image from the specified file path.
- Performs validation against maxImageSize and supportedFormats.
- Reads the file, converts it to base64, and stores it internally with a unique ID.
- Emits an image:loaded event.
Throws an error if the file is not found, unsupported, or too large.
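The two validation checks can be pictured as a pure function, kept free of file I/O here so it runs anywhere. This is a sketch under assumptions: validateImageFile and ImageLimits are hypothetical names, the case-insensitive extension match and the inclusive size limit are guesses, and the error messages are illustrative only.

```typescript
interface ImageLimits {
  maxImageSize: number;       // bytes
  supportedFormats: string[]; // extensions such as ".png"
}

// Returns an error message, or null when the file passes both checks.
function validateImageFile(
  fileName: string,
  sizeBytes: number,
  limits: ImageLimits
): string | null {
  // Extract the extension, including the leading dot.
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot).toLowerCase();

  // Assumption: extensions are compared case-insensitively.
  const allowed = limits.supportedFormats.map((f) => f.toLowerCase());
  if (allowed.indexOf(ext) === -1) {
    return `Unsupported format: ${ext || "(none)"}`;
  }

  // Assumption: a file exactly at the limit is still accepted.
  if (sizeBytes > limits.maxImageSize) {
    return `Image too large: ${sizeBytes} > ${limits.maxImageSize} bytes`;
  }
  return null;
}
```

A loader built on this would throw when the result is non-null and only then read and encode the file.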
getImage(id: string): Image | undefined
Retrieves a previously loaded image by its unique ID.
getAllImages(): Image[]
Returns an array of all currently loaded images.
removeImage(id: string): boolean
Removes a stored image by its ID.
Emits an image:removed event if the image was successfully removed.
Returns true if removed, false otherwise.
clearAll(): void
Removes all currently loaded images from the manager's internal storage.
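The storage methods above (getImage, getAllImages, removeImage, clearAll) behave like a thin wrapper over a keyed collection. The sketch below assumes a Map keyed by image ID. ImageStore and StoredImage are hypothetical names, and the real Image type likely carries more metadata than shown.

```typescript
// Hypothetical minimal record; field names beyond `id` are assumptions.
interface StoredImage {
  id: string;
  base64: string;
  mimeType: string;
}

class ImageStore {
  private images = new Map<string, StoredImage>();

  add(image: StoredImage): void {
    this.images.set(image.id, image);
  }

  getImage(id: string): StoredImage | undefined {
    return this.images.get(id);
  }

  getAllImages(): StoredImage[] {
    return Array.from(this.images.values());
  }

  // Map.delete returns true only when the key existed,
  // which matches the documented true/false result.
  removeImage(id: string): boolean {
    return this.images.delete(id);
  }

  clearAll(): void {
    this.images.clear();
  }
}
```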
prepareForAPI(id: string): Promise<{ base64: string; mimeType: string }>
Prepares a loaded image for submission to an external API. This typically involves retrieving its base64 encoded data and MIME type. Throws an error if the image ID is not found.
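The { base64, mimeType } shape comes from the signature above; how the payload is sent onward is not documented. One common option, shown here purely as an assumption, is to combine the two fields into a data URL (RFC 2397) for embedding in a JSON request body. The toDataUrl helper is a hypothetical name.

```typescript
// Mirrors the documented return shape of prepareForAPI().
interface ApiImagePayload {
  base64: string;
  mimeType: string;
}

// Builds a data URL of the form data:<mimeType>;base64,<data>.
function toDataUrl(payload: ApiImagePayload): string {
  return `data:${payload.mimeType};base64,${payload.base64}`;
}
```

Whether a given model API expects a data URL or separate base64 and MIME type fields depends on that API, so check its documentation before choosing a format.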
formatSummary(): string
Generates a human-readable summary string of the manager's current state, including loaded images and detected capabilities. Useful for debugging or displaying status to the user.
- Event Emitter (on method)
The manager extends an event emitter, allowing other components to subscribe to important lifecycle events:
- initialized: Emitted after initialize() completes.
- image:loaded: Emitted when an image is successfully loaded via loadImageFile(). The event payload includes the Image object.
- image:removed: Emitted when an image is removed via removeImage(). The event payload includes the Image object that was removed.
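The three events can be described with a typed event map so that listeners receive correctly typed payloads. The sketch below uses a tiny hand-written emitter as a stand-in for whatever event emitter the manager actually extends. TinyEmitter, ManagerEvents, and ImageRef are hypothetical names, and the payload of initialized is not documented, so it is typed as void here.

```typescript
// Hypothetical payload type; only `id` is needed for this sketch.
interface ImageRef {
  id: string;
}

// Event names come from the documentation; payload shapes are assumptions.
interface ManagerEvents {
  "initialized": void;
  "image:loaded": ImageRef;
  "image:removed": ImageRef;
}

type Listener<T> = (payload: T) => void;

class TinyEmitter<Events> {
  private listeners: { [K in keyof Events]?: Array<Listener<Events[K]>> } = {};

  // Registers a listener for one event name.
  on<K extends keyof Events>(event: K, listener: Listener<Events[K]>): void {
    const list: Array<Listener<Events[K]>> = this.listeners[event] ?? [];
    list.push(listener);
    this.listeners[event] = list;
  }

  // Calls every listener registered for the event, in registration order.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    (this.listeners[event] ?? []).forEach((listener) => listener(payload));
  }
}
```

With this map, subscribing to "image:loaded" gives the callback an ImageRef parameter without a cast, and a misspelled event name fails at compile time.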
getMultimodalInputManager(): MultimodalInputManager
This function provides access to a singleton instance of the MultimodalInputManager. It ensures that only one instance of the manager exists throughout the application, promoting consistent state management.
resetMultimodalInputManager(): void
This utility function clears the singleton instance, forcing getMultimodalInputManager() to create a new instance on its next call. This is primarily useful for testing or scenarios where a fresh, unconfigured manager is required.
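The accessor and reset pair follows a common lazy-singleton pattern, sketched below with stand-in names. Manager, getManager, and resetManager are hypothetical, and how the real singleton receives its options is not shown by the tests.

```typescript
// Stand-in for the real manager class; only instance identity matters here.
class Manager {
  readonly createdAt = Date.now();
}

let instance: Manager | null = null;

// Creates the instance on first use and returns the same one afterwards.
function getManager(): Manager {
  if (instance === null) {
    instance = new Manager();
  }
  return instance;
}

// Drops the cached instance so the next getManager() call builds a fresh one.
function resetManager(): void {
  instance = null;
}
```

In a test suite, calling the reset function in a setup or teardown hook keeps state from leaking between test cases.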
Image Lifecycle Flow
The following diagram illustrates the typical flow of an image through the MultimodalInputManager:
graph TD
    A[File Path] --> B{"loadImageFile(filePath)"}
    B -- Validation --> C{"Image Data (base64, mimeType)"}
    C --> D["Store Image (ID, metadata)"]
    D --> E["Emit 'image:loaded' event"]
    D -- Retrieve --> F{"getImage(id)"}
    D -- Prepare for API --> G{"prepareForAPI(id)"}
    G --> H[API Payload]
    D -- Remove --> I{"removeImage(id)"}
    I --> J["Emit 'image:removed' event"]
    I --> K[Remove from storage]
Integration and Usage
Other modules should interact with the MultimodalInputManager primarily through the getMultimodalInputManager() singleton accessor.
import { getMultimodalInputManager } from "./multimodal-input";

async function setupMultimodalInput() {
  const manager = getMultimodalInputManager();

  // Configure the manager (typically done once at app startup).
  // Note: In a real app, configuration might come from a global config object
  // or be passed to the initial call of getMultimodalInputManager if it supports it.
  // The tests show configuration via the constructor, implying the singleton might be
  // initialized with options or configured after creation.
  // For this example, we'll assume it's already configured or uses defaults.

  // Initialize capabilities
  const capabilities = await manager.initialize();
  console.log("Multimodal capabilities:", capabilities);

  // Listen for events
  manager.on("image:loaded", (image) => {
    console.log(`Image loaded: ${image.id} (${image.source})`);
    // Update UI, log, etc.
  });
  manager.on("image:removed", (image) => {
    console.log(`Image removed: ${image.id}`);
    // Update UI, log, etc.
  });

  // Load an image from a file
  try {
    const image = await manager.loadImageFile("/path/to/my/image.png");
    console.log("Loaded image ID:", image.id);

    // Get all loaded images
    const allImages = manager.getAllImages();
    console.log("Total images loaded:", allImages.length);

    // Prepare an image for an API call
    const apiPayload = await manager.prepareForAPI(image.id);
    // Send apiPayload.base64 and apiPayload.mimeType to an AI model API

    // Get a summary of the current state
    console.log(manager.formatSummary());

    // Remove an image
    manager.removeImage(image.id);
  } catch (error) {
    console.error("Failed to handle image:", error);
  }
}

setupMultimodalInput();
Configuration Considerations
When using getMultimodalInputManager(), it is important to understand how the singleton gets its configuration. The tests construct MultimodalInputManager directly with options (tempDir, maxImageSize, supportedFormats). For the singleton, those options would typically be supplied once, when the instance is first created, or through a separate configure() method if one exists. The test structure implies that the singleton is either created with defaults or configured externally before getMultimodalInputManager() is first called. Confirm which applies in src/input/multimodal-input.ts and make sure the manager is configured for your application's needs before first use.