tests — desktop-automation
The desktop-automation module provides a robust, cross-platform API for interacting with the desktop environment, including mouse, keyboard, window, application, and screen operations. It abstracts away platform-specific implementations, allowing developers to write automation scripts that can run on Linux, Windows, and macOS. Additionally, it includes a "Smart Snapshot" system for intelligent UI element detection, combining accessibility and OCR capabilities.
This documentation focuses on the core components, their interactions, and how to leverage them for desktop automation tasks.
Core Concepts
The module is built around three primary concepts:
- DesktopAutomationManager: The central facade that provides a unified API for all desktop automation tasks. It manages the underlying automation providers and offers configuration, event handling, and safety features.
- IAutomationProvider: An interface (or abstract class in practice) that defines the contract for platform-specific or library-based automation implementations. The DesktopAutomationManager delegates actual operations to an active IAutomationProvider.
- SmartSnapshotManager: A system for taking "snapshots" of the UI, detecting elements using either accessibility APIs or Optical Character Recognition (OCR), and providing a structured view of the interactive elements on the screen.
Architecture Overview
The DesktopAutomationManager acts as a central orchestrator. It can be configured to use a specific IAutomationProvider or automatically select the most suitable one based on the operating system and available tools. All high-level automation commands (e.g., click, type, focusWindow) are routed through the manager to the currently active provider. The SmartSnapshotManager operates in conjunction, providing intelligent element detection capabilities that can be integrated into automation workflows.
graph TD
A[DesktopAutomationManager] --> B{IAutomationProvider};
B --> C[MockAutomationProvider];
B --> D[NutJsProvider];
B --> E[LinuxNativeProvider];
B --> F[WindowsNativeProvider];
B --> G[MacOSNativeProvider];
A -- Manages --> H[SmartSnapshotManager];
H -- Uses --> I[ScreenshotTool];
H -- Uses --> J[OCRTool];
DesktopAutomationManager
The DesktopAutomationManager is the primary entry point for developers. It provides a comprehensive set of methods for desktop interaction and manages the lifecycle and configuration of automation providers.
Getting an Instance
The manager is designed as a singleton, accessible via getDesktopAutomation(). This ensures that only one instance manages desktop automation resources at a time.
import { getDesktopAutomation } from '../../src/desktop-automation/index.js';
const manager = getDesktopAutomation();
await manager.initialize(); // Initialize the underlying provider
// ... perform automation ...
await manager.shutdown(); // Clean up resources
You can reset the singleton instance using resetDesktopAutomation() for testing or specific scenarios. The first call to getDesktopAutomation() can also accept an initial configuration.
import { getDesktopAutomation, resetDesktopAutomation } from '../../src/desktop-automation/index.js';
resetDesktopAutomation(); // Clear any existing instance
const manager = getDesktopAutomation({ debug: true, provider: 'nutjs' });
await manager.initialize();
Initialization and Provider Selection
Upon initialize(), the manager attempts to find and initialize an IAutomationProvider. By default, it prioritizes native providers (platform-specific tools), then nutjs, and finally mock (for testing). You can explicitly specify a provider in the configuration.
- initialize(): Initializes the selected automation provider.
- shutdown(): Shuts down the active provider and releases resources.
- registerProvider(provider: IAutomationProvider): Allows registering custom or additional providers.
- getProviderStatus(): Returns the status of the currently active provider.
- getAllProviderStatuses(): Returns the status of all registered providers.
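The selection order described above can be sketched roughly as follows. This is a minimal illustration, not the manager's actual implementation: `ProviderCandidate`, `selectProvider`, and the sample candidates are hypothetical names, and only the priority order (native, then nutjs, then mock) and the `isAvailable()` check come from this document.

```typescript
// Hypothetical shape: only the two members this sketch needs.
interface ProviderCandidate {
  name: string;
  isAvailable(): Promise<boolean>;
}

// Return the explicitly requested provider, or else the first candidate
// (in priority order) that reports itself available.
async function selectProvider(
  candidates: ProviderCandidate[],
  preferred?: string,
): Promise<ProviderCandidate> {
  if (preferred) {
    const match = candidates.find((c) => c.name === preferred);
    if (!match) throw new Error(`Provider not registered: ${preferred}`);
    if (!(await match.isAvailable())) {
      throw new Error(`Provider not available: ${preferred}`);
    }
    return match;
  }
  for (const candidate of candidates) {
    if (await candidate.isAvailable()) return candidate;
  }
  throw new Error("No automation provider is available");
}

// Sample candidates listed in priority order; availability is hardcoded here.
const candidates: ProviderCandidate[] = [
  { name: "native", isAvailable: async () => false },
  { name: "nutjs", isAvailable: async () => true },
  { name: "mock", isAvailable: async () => true },
];
```

With these sample candidates, automatic selection skips the unavailable native entry and settles on nutjs, while passing `"mock"` as the preferred name forces the mock entry.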
Configuration
The manager's behavior can be configured via updateConfig() and retrieved with getConfig().
const config = manager.getConfig();
console.log(config.provider); // e.g., 'native'
manager.updateConfig({
  defaultDelays: {
    mouseMove: 50,
    keyPress: 20,
  },
  safety: {
    failSafe: false, // Disable fail-safe for specific scenarios
  },
});
Safety Features: The manager includes built-in safety mechanisms:
- failSafe: Enabled by default, this feature allows stopping automation by moving the mouse to a corner of the screen.
- minActionDelay: A minimum delay between automation actions to prevent overwhelming the system or making actions too fast for human observation.
- resetFailSafe(): Resets the fail-safe state.
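To make the two safety checks concrete, here is a minimal sketch of how a corner-based fail-safe and a minimum action delay could be computed. The function names, the 5-pixel margin, and the coordinate convention (origin at the top-left, last pixel at width - 1) are assumptions for illustration; the document does not describe how the manager implements either check.

```typescript
interface Point { x: number; y: number }
interface ScreenSize { width: number; height: number }

// True when the pointer is within `margin` pixels of any screen corner.
// The margin value is an assumption, not a documented setting.
function isInFailSafeCorner(pos: Point, screen: ScreenSize, margin = 5): boolean {
  const nearLeft = pos.x <= margin;
  const nearRight = pos.x >= screen.width - 1 - margin;
  const nearTop = pos.y <= margin;
  const nearBottom = pos.y >= screen.height - 1 - margin;
  return (nearLeft || nearRight) && (nearTop || nearBottom);
}

// Milliseconds still to wait so two actions are at least `minActionDelay` apart.
function remainingDelay(lastActionAt: number, now: number, minActionDelay: number): number {
  return Math.max(0, minActionDelay - (now - lastActionAt));
}

const fullHd: ScreenSize = { width: 1920, height: 1080 };
console.log(isInFailSafeCorner({ x: 0, y: 0 }, fullHd));     // true: top-left corner
console.log(isInFailSafeCorner({ x: 960, y: 540 }, fullHd)); // false: screen centre
console.log(remainingDelay(1000, 1030, 50));                 // 20
```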
Event System
The DesktopAutomationManager emits events for various desktop interactions, allowing for monitoring or reactive automation.
manager.on('mouse-move', (pos) => {
  console.log(`Mouse moved to: ${pos.x}, ${pos.y}`);
});
manager.on('key-press', (key, modifiers) => {
  console.log(`Key pressed: ${key} with modifiers: ${modifiers.join(',')}`);
});
manager.on('window-focus', (windowInfo) => {
  console.log(`Window focused: ${windowInfo.title} (${windowInfo.handle})`);
});
Key events include:
- mouse-move, mouse-click
- key-press, key-type
- window-focus, window-change
- app-launch, app-close
Core Automation Methods
The manager exposes a comprehensive API for desktop interaction, mirroring the IAutomationProvider interface:
Mouse Operations:
- getMousePosition(): Get current mouse coordinates.
- moveMouse(x, y): Move mouse to absolute coordinates.
- click(x?, y?, options?): Perform a click (left by default).
- doubleClick(x?, y?, options?): Perform a double click.
- rightClick(x?, y?, options?): Perform a right click.
- drag(startX, startY, endX, endY): Drag the mouse from one point to another.
- scroll(options): Scroll the mouse wheel.
Keyboard Operations:
- keyPress(key, options?): Press and release a single key.
- keyDown(key): Press down a key.
- keyUp(key): Release a key.
- type(text): Type a string of text.
- hotkey(...keys): Execute a hotkey combination (e.g., ctrl+c).
Window Operations:
- getActiveWindow(): Get information about the currently focused window.
- getWindows(filter?): Get a list of all open windows, optionally filtered by title.
- getWindow(handle): Get information for a specific window by its handle.
- findWindow(query): Find a window by title (string or regex).
- focusWindow(handle): Bring a window to the foreground.
- minimizeWindow(handle): Minimize a window.
- maximizeWindow(handle): Maximize a window.
- restoreWindow(handle): Restore a minimized/maximized window.
- closeWindow(handle): Close a window.
- moveWindow(handle, x, y): Move a window to new coordinates.
- resizeWindow(handle, width, height): Resize a window.
- setWindow(handle, options): Set window position and/or size.
Application Operations:
- getRunningApps(): Get a list of all running applications.
- findApp(query): Find an application by name (string or regex).
- launchApp(path): Launch an application from its executable path.
- closeApp(pid): Close an application by its process ID.
Screen Operations:
- getScreens(): Get information about all connected displays.
- getPrimaryScreen(): Get information about the primary display.
- getPixelColor(x, y): Get the RGB color of a pixel at specified coordinates.
Clipboard Operations:
- getClipboard(): Get the current clipboard content (text, image, etc.).
- getClipboardText(): Get only the text content from the clipboard.
- setClipboard(content): Set the clipboard content.
- copyText(text): Convenience method to set clipboard text.
- clearClipboard(): Clear the clipboard.
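The hotkey(...keys) method listed under Keyboard Operations can be thought of as a composition of keyDown and keyUp. The sketch below shows one plausible way to build it: press the keys in order, then release them in reverse. The document does not say how hotkey is actually implemented, so `KeyActions`, `pressHotkey`, and the recording stub are hypothetical.

```typescript
// Hypothetical subset of the keyboard API: just the two calls this sketch needs.
interface KeyActions {
  keyDown(key: string): Promise<void>;
  keyUp(key: string): Promise<void>;
}

// Press keys in the given order, then release them in reverse order.
async function pressHotkey(actions: KeyActions, ...keys: string[]): Promise<void> {
  const held: string[] = [];
  try {
    for (const key of keys) {
      await actions.keyDown(key);
      held.push(key);
    }
  } finally {
    // Release whatever was pressed, even if a keyDown failed midway.
    for (const key of held.reverse()) {
      await actions.keyUp(key);
    }
  }
}

// Recording stub so the sequence can be inspected without touching the desktop.
const keyLog: string[] = [];
const recorder: KeyActions = {
  keyDown: async (key) => { keyLog.push(`down:${key}`); },
  keyUp: async (key) => { keyLog.push(`up:${key}`); },
};
```

Calling `pressHotkey(recorder, "ctrl", "c")` records ctrl down, c down, c up, ctrl up. Releasing in a `finally` block avoids leaving a modifier stuck if a key press fails partway through.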
IAutomationProvider Interface
The IAutomationProvider interface defines the contract for any desktop automation implementation. Each provider must implement these methods and declare its capabilities.
interface IAutomationProvider {
  name: string;
  capabilities: {
    mouse: boolean;
    keyboard: boolean;
    windows: boolean;
    apps: boolean;
    clipboard: boolean;
    ocr: boolean;
    screenshots?: boolean;
    colorPicker?: boolean;
  };
  isAvailable(): Promise<boolean>;
  initialize(): Promise<void>;
  shutdown(): Promise<void>;
  // ... methods for mouse, keyboard, window, app, screen, clipboard operations ...
}
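Because each provider declares its capabilities, a caller can check them before attempting an operation. The following is a small sketch of such a check; `missingCapabilities` is a hypothetical helper, not part of the module, and the sample values are modelled on the mock provider's description in this document.

```typescript
// Mirrors the capabilities shape from the interface above.
interface Capabilities {
  mouse: boolean;
  keyboard: boolean;
  windows: boolean;
  apps: boolean;
  clipboard: boolean;
  ocr: boolean;
  screenshots?: boolean;
  colorPicker?: boolean;
}

// Hypothetical helper: list the capabilities a provider lacks.
function missingCapabilities(
  caps: Capabilities,
  needed: (keyof Capabilities)[],
): (keyof Capabilities)[] {
  // Optional flags that are left undefined count as unsupported.
  return needed.filter((name) => caps[name] !== true);
}

// Sample values: true for the core capabilities, false for ocr.
const mockCaps: Capabilities = {
  mouse: true,
  keyboard: true,
  windows: true,
  apps: true,
  clipboard: true,
  ocr: false,
};

// Reports ocr and screenshots as missing for the sample values.
console.log(missingCapabilities(mockCaps, ["mouse", "ocr", "screenshots"]));
```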
Concrete Automation Providers
The module includes several concrete implementations of IAutomationProvider:
1. MockAutomationProvider
- Purpose: Primarily used for testing the DesktopAutomationManager and other components without requiring actual desktop interaction.
- Capabilities: Reports true for most capabilities (mouse, keyboard, windows, apps, clipboard) but false for ocr.
- Behavior: Simulates desktop actions with predictable, hardcoded responses (e.g., mouse position starts at 500,500; a fixed set of mock windows and apps).
2. NutJsProvider
- Purpose: Integrates with the nut.js library, providing cross-platform automation capabilities.
- Capabilities: Supports mouse, keyboard, windows, clipboard, and screen operations.
- Limitations: Has limited support for application operations (getRunningApps returns an empty array; launchApp and closeApp throw "not supported" errors).
- Availability: Checks if nut.js is installed and functional.
3. Native Providers
These providers leverage platform-specific command-line tools or APIs for optimal performance and deeper integration. They are typically preferred when available.
##### LinuxNativeProvider
- Platform: Linux (specifically X11-based desktop environments).
- Dependencies: Relies on external tools like xdotool, xclip, wmctrl, and xrandr. These must be installed on the system.
- Limitations:
  - Wayland: Does not function on Wayland-based desktop environments (e.g., modern GNOME, KDE Plasma on Wayland) due to its reliance on X11 tools. isAvailable() will return false if XDG_SESSION_TYPE is wayland.
  - Requires xdotool for core functionality; initialize() will throw if it's not found.
- Key Operations:
  - Mouse: Uses xdotool mousemove, xdotool click.
  - Keyboard: Uses xdotool key, xdotool type.
  - Windows: Uses xdotool getactivewindow, wmctrl, xdotool windowactivate, xdotool windowminimize, etc.
  - Clipboard: Uses xclip.
  - Screens: Parses output from xrandr --query.
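Two of the behaviours above, the Wayland check and the xrandr parsing, lend themselves to a short sketch. The code below is an illustration under stated assumptions, not the provider's source: `isWaylandSession`, `parseXrandr`, and `ScreenInfo` are hypothetical names, and the parser only handles the common "NAME connected [primary] WxH+X+Y" line shape that xrandr --query prints for active outputs.

```typescript
interface ScreenInfo {
  name: string;
  primary: boolean;
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when the session type reported by the environment is Wayland.
// The environment is passed in so the check is easy to test.
function isWaylandSession(env: Record<string, string | undefined>): boolean {
  return env["XDG_SESSION_TYPE"] === "wayland";
}

// Extract active outputs from `xrandr --query` text.
function parseXrandr(output: string): ScreenInfo[] {
  const pattern = /^(\S+) connected( primary)? (\d+)x(\d+)\+(\d+)\+(\d+)/;
  const screens: ScreenInfo[] = [];
  for (const line of output.split("\n")) {
    const m = pattern.exec(line);
    if (!m) continue; // header, mode lines, disconnected or inactive outputs
    const [, name = "", primary, width, height, x, y] = m;
    screens.push({
      name,
      primary: primary !== undefined,
      width: Number(width),
      height: Number(height),
      x: Number(x),
      y: Number(y),
    });
  }
  return screens;
}

// Sample text in the shape xrandr prints; the values are made up.
const sampleXrandr = [
  "Screen 0: minimum 320 x 200, current 4480 x 1440, maximum 16384 x 16384",
  "eDP-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 344mm x 194mm",
  "   1920x1080     60.00*+",
  "HDMI-1 connected 2560x1440+1920+0 (normal left inverted right x axis y axis) 597mm x 336mm",
  "DP-1 disconnected (normal left inverted right x axis y axis)",
].join("\n");

console.log(parseXrandr(sampleXrandr).map((s) => s.name)); // eDP-1 and HDMI-1
```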
##### WindowsNativeProvider
- Platform: Windows (including WSL environments).
- Dependencies: Primarily uses PowerShell scripts executed via child_process.
- WSL Support: Can operate within a Windows Subsystem for Linux (WSL) environment by targeting the host Windows PowerShell. The constructor accepts a wsl: true option.
- Capabilities: Supports mouse, keyboard, windows, clipboard, and includes screenshots.
- Key Operations: Executes PowerShell commands for all automation tasks.
##### MacOSNativeProvider
- Platform: macOS.
- Dependencies: Leverages macOS-specific commands and tools.
  - cliclick: Used for some mouse and keyboard actions. It's optional; the provider can still initialize without it, but some capabilities might be limited.
  - pbpaste/pbcopy: For clipboard operations.
  - system_profiler: For screen information.
- Limitations:
  - getPixelColor(): Currently throws a "not supported" error, as it requires screenshot analysis which is not directly implemented in this provider.
- Key Operations:
  - Clipboard: Uses pbpaste to get text.
  - Screens: Parses JSON output from system_profiler SPDisplaysDataType.
SmartSnapshotManager
The SmartSnapshotManager provides capabilities for intelligent UI element detection, allowing automation scripts to interact with elements based on their visual or accessibility properties rather than just coordinates.
Getting an Instance
The SmartSnapshotManager is typically instantiated directly, often configured with a detection method and defaultTtl.
import { SmartSnapshotManager } from '../../src/desktop-automation/smart-snapshot.js';
const snapshotManager = new SmartSnapshotManager({
  method: 'ocr', // or 'accessibility'
  defaultTtl: 30_000, // Snapshots are valid for 30 seconds
});
});
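The defaultTtl option implies that a snapshot stops being trustworthy after a fixed lifetime. A minimal sketch of such a freshness check follows. The document does not show how the manager stores or enforces the TTL, so `SnapshotMeta`, `isSnapshotExpired`, and the choice to treat a snapshot as expired exactly at the TTL boundary are all assumptions.

```typescript
// Hypothetical snapshot metadata; the real snapshot type is not shown here.
interface SnapshotMeta {
  takenAt: number; // epoch milliseconds
  ttl: number;     // lifetime in milliseconds
}

// A snapshot counts as expired once its age reaches the TTL.
function isSnapshotExpired(meta: SnapshotMeta, now: number = Date.now()): boolean {
  return now - meta.takenAt >= meta.ttl;
}

const snapshotMeta: SnapshotMeta = { takenAt: 1_000, ttl: 30_000 };
console.log(isSnapshotExpired(snapshotMeta, 20_000)); // false: 19 seconds old
console.log(isSnapshotExpired(snapshotMeta, 31_000)); // true: exactly 30 seconds old
```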
Snapshot Creation
- takeSnapshot(): Captures the current screen state and processes it to identify UI elements.
  - If method: 'ocr', it uses ScreenshotTool to capture an image and OCRTool to extract text blocks and their bounding boxes. It attempts to infer roles (button, link, text-field) from the text content.
  - If method: 'accessibility', it would use platform-specific accessibility APIs (not fully detailed in the provided tests, but implied).
- getCurrentSnapshot(): Retrieves the most recently taken snapshot.
- getElement(ref): Retrieves a specific element from the current snapshot using its unique reference number.
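Inferring a role from OCR text is necessarily heuristic. The sketch below shows what such a heuristic might look like. The specific keyword lists and patterns are invented for illustration and are not taken from the module. Only the role names button, link, and text-field come from this document, and the fallback role "text" is an assumption.

```typescript
type InferredRole = "button" | "link" | "text-field" | "text";

// Guess a UI role from a block of OCR text using simple patterns.
function inferRole(text: string): InferredRole {
  const trimmed = text.trim();
  // URLs are treated as links.
  if (/^(https?:\/\/|www\.)/i.test(trimmed)) return "link";
  // Short, common action words are treated as buttons.
  if (/^(ok|cancel|submit|save|apply|close|yes|no|next|back)$/i.test(trimmed)) return "button";
  // Labels ending in a colon, or prompts like "Search", suggest an input field.
  if (/:$/.test(trimmed) || /^(search|enter|type)\b/i.test(trimmed)) return "text-field";
  return "text";
}

console.log(inferRole("Submit"));              // button
console.log(inferRole("https://example.com")); // link
console.log(inferRole("Username:"));           // text-field
```

Heuristics like this produce false positives (a paragraph that happens to read "Next" is not a button), which is one reason to prefer accessibility data when it is available.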
Element Referencing
- getNextRef(): Provides a monotonically increasing integer to serve as a unique reference (ref) for UI elements. Each SmartSnapshotManager instance maintains its own counter.
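The per-instance counter behaviour can be illustrated with a few lines. `RefCounter` is a hypothetical stand-in for the relevant part of SmartSnapshotManager, and the starting value of 1 is an assumption; the document only states that refs increase monotonically and that each instance counts independently.

```typescript
// Hypothetical stand-in for the ref-numbering part of SmartSnapshotManager.
class RefCounter {
  private next = 1; // starting value is an assumption

  // Return the current value, then advance the counter.
  getNextRef(): number {
    return this.next++;
  }
}

const counterA = new RefCounter();
const counterB = new RefCounter();
const firstA = counterA.getNextRef();
const secondA = counterA.getNextRef();
const firstB = counterB.getNextRef(); // unaffected by calls on counterA
console.log(firstA, secondA, firstB); // 1 2 1
```

Since counters are per instance, a ref is only meaningful together with the manager that issued it.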
Injecting Browser Elements
A powerful feature of the SmartSnapshotManager is its ability to combine desktop-level UI elements with elements sourced from a browser context. This is crucial for hybrid automation scenarios.
- injectBrowserElements(elements: UIElement[], sourceName?: string): Adds a list of UIElement objects (e.g., from a browser's DOM inspection) to the current desktop snapshot.
- Injected elements are tagged with an attributes.source property, defaulting to 'browser-accessibility' if sourceName is not provided.
- This allows the DesktopAutomationManager to find and interact with elements that might not be visible or detectable via desktop-level accessibility/OCR, but are known from a browser's internal state.
// Example: Injecting elements from a browser automation tool
const browserElements = [
  {
    ref: snapshotManager.getNextRef(),
    role: 'button',
    name: 'Submit Form',
    bounds: { x: 100, y: 200, width: 150, height: 40 },
    center: { x: 175, y: 220 },
    interactive: true,
    focused: false,
    enabled: true,
    visible: true,
  },
];
snapshotManager.injectBrowserElements(browserElements, 'my-browser-plugin');
// Now, a subsequent call to manager.findUIElement() could find 'Submit Form'
// even if it's within a browser window that desktop OCR/accessibility can't fully parse.
Developer Guide
Basic Usage Flow
- Get the Manager: Obtain the singleton instance.
- Initialize: Prepare the underlying automation provider.
- Perform Actions: Use the manager's methods for mouse, keyboard, window, etc.
- Shutdown: Release resources when done.
import { getDesktopAutomation } from '../../src/desktop-automation/index.js';
async function automateTask() {
  const automation = getDesktopAutomation();
  await automation.initialize();
  try {
    // Move mouse and click
    await automation.moveMouse(100, 100);
    await automation.click();
    // Type text
    await automation.type('Hello, Desktop!');
    // Find and focus a window
    const terminalWindow = await automation.findWindow('Terminal');
    if (terminalWindow) {
      await automation.focusWindow(terminalWindow.handle);
      console.log(`Focused: ${terminalWindow.title}`);
    }
    // Set the clipboard, then read it back
    await automation.copyText('Copied from automation');
    const clipboardText = await automation.getClipboardText();
    console.log(`Clipboard: ${clipboardText}`);
  } catch (error) {
    console.error('Automation failed:', error);
  } finally {
    await automation.shutdown();
  }
}
automateTask();
Using Smart Snapshots for Intelligent Interaction
import { getDesktopAutomation, SmartSnapshotManager } from '../../src/desktop-automation/index.js';
async function interactWithUI() {
  const automation = getDesktopAutomation();
  await automation.initialize();
  const snapshotManager = new SmartSnapshotManager({ method: 'ocr' }); // Use OCR for element detection
  try {
    // Take a snapshot of the current screen
    const snapshot = await snapshotManager.takeSnapshot();
    console.log(`Snapshot taken with ${snapshot.elements.length} elements.`);
    // Find an element by name (e.g., a button detected by OCR)
    const submitButton = snapshot.elements.find(e => e.name === 'Submit' && e.role === 'button');
    if (submitButton) {
      console.log(`Found 'Submit' button at ${submitButton.center.x}, ${submitButton.center.y}`);
      await automation.click(submitButton.center.x, submitButton.center.y);
    } else {
      console.log("Submit button not found.");
    }
  } catch (error) {
    console.error('UI interaction failed:', error);
  } finally {
    await automation.shutdown();
  }
}
interactWithUI();
Extending with a New Automation Provider
To add support for a new platform or library:
- Create a new class that implements the IAutomationProvider interface.
- Implement all required methods (mouse, keyboard, window, etc.) using the new platform's APIs or tools.
- Define name and capabilities for your provider.
- Implement isAvailable() to check if the necessary tools/libraries are present on the system.
- Register your provider with the DesktopAutomationManager using manager.registerProvider(yourNewProviderInstance). You can then configure the manager to use it.
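The steps above can be sketched end to end with a deliberately tiny example. To stay self-contained it does not import the real module: `MiniProvider` is a heavily trimmed, hypothetical version of IAutomationProvider, `InMemoryProvider` is an invented provider, and `MiniRegistry` merely stands in for the manager's registerProvider() call.

```typescript
// Hypothetical, trimmed contract: the real IAutomationProvider also declares
// capabilities and the mouse, keyboard, window, app, screen, and clipboard methods.
interface MiniProvider {
  name: string;
  isAvailable(): Promise<boolean>;
  initialize(): Promise<void>;
  shutdown(): Promise<void>;
  getMousePosition(): Promise<{ x: number; y: number }>;
}

// Invented provider that needs no external tools.
class InMemoryProvider implements MiniProvider {
  readonly name = "in-memory";
  private ready = false;

  async isAvailable(): Promise<boolean> {
    return true; // a real provider would probe for its tools or libraries here
  }

  async initialize(): Promise<void> {
    this.ready = true;
  }

  async shutdown(): Promise<void> {
    this.ready = false;
  }

  async getMousePosition(): Promise<{ x: number; y: number }> {
    if (!this.ready) throw new Error("Provider not initialized");
    return { x: 0, y: 0 };
  }
}

// Stand-in for the manager's provider registry.
class MiniRegistry {
  private providers = new Map<string, MiniProvider>();

  registerProvider(provider: MiniProvider): void {
    this.providers.set(provider.name, provider);
  }

  get(name: string): MiniProvider | undefined {
    return this.providers.get(name);
  }
}

const registry = new MiniRegistry();
registry.registerProvider(new InMemoryProvider());
```

Against the real module, the last two lines would become a call to manager.registerProvider(...) followed by selecting the provider by name in the configuration.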
This modular design ensures that the desktop-automation module can be easily extended to support new environments or integrate with different automation backends.