""" Installation and Usage Guide for the Smooth Operator Agent Tools Python Library """

Installation

Using pip

The Smooth Operator Agent Tools Python library can be installed using pip:

pip install smooth-operator-agent-tools

This will automatically install the library and all its dependencies, including the server executable.

From Source

To install from source:

Clone the repository:

git clone https://github.com/fstandhartinger/smooth-operator-client-python.git
cd smooth-operator-client-python

Install the package:
```
pip install -e .
```

Basic Usage

Initializing the Client

from smooth_operator_agent_tools import SmoothOperatorClient

# Initialize the client
client = SmoothOperatorClient(api_key="YOUR_API_KEY")  # Get API key for free at https://screengrasp.com/api.html

# Start the server
client.start_server()

# Stop the server when done
client.stop_server()

You can also use the client as a context manager:

from smooth_operator_agent_tools import SmoothOperatorClient

with SmoothOperatorClient() as client:
    client.start_server()
    # Use the client here
    # Server will be automatically stopped when exiting the context
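
When the client is created elsewhere and a `with` block does not fit, the explicit `start_server()` / `stop_server()` calls can be guarded with `try`/`finally` so the server is stopped even if the task raises. The following is a minimal sketch: `run_with_server` and its `task` callback are hypothetical helper names, not part of the library. Only `start_server()` and `stop_server()` are taken from the examples above.

```python
from typing import Any, Callable


def run_with_server(client: Any, task: Callable[[Any], Any]) -> Any:
    """Start the server, run task(client), and always stop the server.

    Hypothetical helper: it relies only on the start_server() and
    stop_server() methods shown in the examples above.
    """
    client.start_server()
    try:
        return task(client)
    finally:
        # Runs on normal return and when task() raises.
        client.stop_server()
```

It could be called as `run_with_server(client, lambda c: c.screenshot.take())`.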

Taking Screenshots

# Take a screenshot - returns image in a form that can easily be passed to LLMs
screenshot = client.screenshot.take()

# Access the screenshot data
image_bytes = screenshot.image_bytes
image_base64 = screenshot.image_base64
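
Many multimodal LLM APIs accept images as data URLs, so the base64 form can be wrapped accordingly. This is a sketch under two assumptions that the guide does not confirm: that `image_base64` is a plain base64 string without a prefix, and that the image is a PNG. `to_data_url` is a hypothetical helper name, not a library function.

```python
import base64


def to_data_url(image_base64: str, mime_type: str = "image/png") -> str:
    """Wrap a base64-encoded image in a data URL.

    Assumption: "image/png" is only a default guess; pass the real MIME
    type if the screenshot uses a different format.
    """
    # Fail early (binascii.Error) if the payload is not valid base64.
    base64.b64decode(image_base64, validate=True)
    return f"data:{mime_type};base64,{image_base64}"
```

For example, `to_data_url(screenshot.image_base64)` yields a string that can be placed into an image field of an LLM request.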

Mouse Operations

# Click at coordinates
client.mouse.click(500, 300)

# Right-click at coordinates
client.mouse.right_click(500, 300)

# Double-click at coordinates
client.mouse.double_click(500, 300)

# Drag from one position to another
client.mouse.drag(100, 100, 200, 200)

# Scroll at coordinates
client.mouse.scroll(500, 300, 5)  # Scroll down 5 clicks
client.mouse.scroll(500, 300, -5)  # Scroll up 5 clicks

AI-Powered UI Interaction

# Find and click a UI element by description
client.mouse.click_by_description("the Submit button")

# Find and right-click a UI element by description
client.mouse.right_click_by_description("the Context menu icon")

# Find and double-click a UI element by description
client.mouse.double_click_by_description("the File icon")

# Drag from one element to another by description
client.mouse.drag_by_description("the invoice pdf file", "the 'invoices' folder")

Keyboard Operations

# Type text
client.keyboard.type("Hello, world!")

# Press a key combination
client.keyboard.press("Ctrl+C")
client.keyboard.press("Alt+F4")

# Type text in a UI element
client.keyboard.type_at_element("the Username field", "user123")

Chrome Browser Control

# Open Chrome browser
client.chrome.open_chrome("https://www.example.com")

# Navigate to a URL
client.chrome.navigate("https://www.google.com")

# Get information about the current tab.
# Useful for finding likely interactable elements on the page: it marks
# HTML elements with robust CSS selectors that can be passed to
# functions like click_element() or simulate_input().
# The response can also be passed to an LLM to pick the right selector.
tab_details = client.chrome.explain_current_tab()

# Click an element using CSS selector
client.chrome.click_element("#search-button")

# Input text into a form field
client.chrome.simulate_input("#username", "user123")

# Execute JavaScript
result = client.chrome.execute_script("return document.title")

# Generate and execute JavaScript based on a description
result = client.chrome.generate_and_execute_script("Extract all links from the page")

System Operations

# Get a system overview.
# Contains the list of windows, the apps available on the system, and
# detailed information about the currently focused UI element and window.
# Can be used as a source of UI element IDs for automation functions
# like invoke() (= click) or set_value().
# Can be used as a source of window IDs for get_window_details(window_id).
# Consider sending the JSON-serialized form of this result to an LLM together
# with a task description. The format is chosen to be LLM-friendly, so the LLM
# should be able to find the relevant UI element IDs and window IDs this way.
overview = client.system.get_overview()

# Open an application
client.system.open_application("notepad")

# Get window details - contains the UI automation tree of elements.
# Consider using the response in an LLM prompt.
window_id = overview.windows[0].id
window_details = client.system.get_window_details(window_id)
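
The example above simply takes the first window in the list. To select a specific window instead, the list can be filtered first. This is a minimal sketch: `find_window_id` is a hypothetical helper, and the only attribute it relies on is `id`, which appears in the example above. Any other attribute used in the predicate (such as a window title) is an assumption to verify against the actual response object.

```python
from typing import Any, Callable, Iterable, Optional


def find_window_id(
    windows: Iterable[Any], matches: Callable[[Any], bool]
) -> Optional[Any]:
    """Return the id of the first window for which matches(window) is true.

    Hypothetical helper: returns None when no window matches.
    """
    for window in windows:
        if matches(window):
            return window.id
    return None
```

A call might look like `find_window_id(overview.windows, lambda w: "Notepad" in w.title)`, where `title` is an assumed attribute name.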

Windows Automation

# Invoke (click) a UI element by its ID
# Element IDs can be acquired from get_overview() and get_window_details()
client.automation.invoke(element_id)

# Set the value of (type text into) a UI element by its ID
# Element IDs can be acquired from get_overview() and get_window_details()
client.automation.set_value(element_id, "john doe")

# Bring a window to the front
client.automation.bring_to_front(window_id)

Code Execution

# Execute C# code
result = client.code.execute_csharp("return 2 + 2;")

# Generate and execute C# code based on a description - example 1
result = client.code.generate_and_execute_csharp("Calculate the factorial of 5")

# Generate and execute C# code based on a description - example 2
result = client.code.generate_and_execute_csharp("Return content of the biggest file in folder c:\\temp")

# Generate and execute C# code based on a description - example 3
result = client.code.generate_and_execute_csharp("Connect to Outlook via Interop and return text and date of the latest email from pricelist@vendor.com")

Advanced Usage

Using Different AI Mechanisms

For AI-vision powered operations (provided by Screengrasp.com), you can specify different AI mechanisms:

from smooth_operator_agent_tools import MechanismType

# Use a different AI mechanism
client.mouse.click_by_description("the Submit button", mechanism=MechanismType.OPENAI_COMPUTER_USE)

Converting Responses to JSON - use LLMs to analyze

Most response objects have a to_json_string() method that converts the response to a JSON string:

# Get a response
screenshot = client.screenshot.take()

# Convert to JSON string
json_str = screenshot.to_json_string()

# Use the JSON string (e.g., pass it to a language model)
print(json_str)

A recommended pattern is to pass these JSON strings to an LLM for analysis.

For example, you can prompt GPT-4o to extract the CSS selector of "the UI element that can be clicked to submit the form" by providing a textual instruction together with the JSON string.

Use GPT-4o's JSON mode (called structured output by some LLM providers) to ensure the answer arrives in a form you can easily parse.
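
The pattern described above can be sketched without committing to a particular LLM client. The helper names `build_selector_prompt` and `parse_selector_reply`, the prompt wording, and the `{"selector": ...}` reply shape are all assumptions made for illustration. The actual call to the model is left out.

```python
import json


def build_selector_prompt(instruction: str, tab_json: str) -> str:
    """Combine a task instruction and a JSON payload into one prompt."""
    return (
        "You are given a JSON description of a browser tab.\n"
        f"Task: find the CSS selector of {instruction}.\n"
        'Answer with a JSON object of the form {"selector": "<css selector>"}.\n\n'
        f"JSON:\n{tab_json}"
    )


def parse_selector_reply(reply: str) -> str:
    """Extract the selector from the model's JSON reply, failing loudly."""
    data = json.loads(reply)  # raises ValueError on malformed JSON
    selector = data.get("selector") if isinstance(data, dict) else None
    if not isinstance(selector, str) or not selector:
        raise ValueError(f"unexpected reply: {reply!r}")
    return selector
```

Assuming the object returned by `explain_current_tab()` offers `to_json_string()`, its output would be passed as `tab_json`, and the parsed selector could then go to `client.chrome.click_element(...)`.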

Platform Support

The Smooth Operator Agent Tools Python library is designed to work on Windows platforms, as the server executable is a Windows application. Support for other platforms may be added in the future.
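
Because of this restriction, a script can check the operating system before it tries to start the server. This is a minimal sketch using only the standard library. The function names are hypothetical, and it is not confirmed whether the library performs a similar check itself.

```python
import platform


def is_supported_platform(system_name: str) -> bool:
    """Return True for the only platform the guide lists as supported."""
    return system_name == "Windows"


def ensure_supported_platform() -> None:
    """Raise RuntimeError when not running on Windows."""
    system_name = platform.system()  # e.g. "Windows", "Linux", "Darwin"
    if not is_supported_platform(system_name):
        raise RuntimeError(
            f"Windows is required to run the server, but found {system_name!r}"
        )
```

Calling `ensure_supported_platform()` before `client.start_server()` gives a clear error message on other systems.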
