
Commit 45ba6e6

Merge branch 'OthersideAI:main' into installer-script
2 parents: 9a8d364 + 285d62d

3 files changed: +30 -16 lines changed

README.md

Lines changed: 8 additions & 3 deletions
@@ -20,7 +20,12 @@
 > **Note:** GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
 
 ### Ongoing Development
-At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing a multimodal model with more accurate click location predictions.
+At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision a multimodal model with more accurate click location predictions.
+
+### Agent-1-Vision Model API Access
+We will soon be offering API access to our Agent-1-Vision model.
+
+If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).
 
 ### Additional Thoughts
 We recognize that some operating system functions may be more efficiently executed with hotkeys such as entering the Browser Address bar using `command + L` rather than by simulating a mouse click at the correct XY location. We plan to make these improvements over time. However, it's important to note that many actions require the accurate selection of visual elements on the screen, necessitating precise XY mouse click locations. A primary focus of this project is to refine the accuracy of determining these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
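
For illustration only (not part of this commit), a hotkey-based action like the `command + L` example above could be issued directly with pyautogui; the helper below is a hypothetical sketch, not the project's implementation:

```python
import platform
import pyautogui

def focus_address_bar():
    # Hypothetical helper: focus the browser address bar via the OS hotkey
    # instead of estimating an XY click location for it.
    if platform.system() == "Darwin":
        pyautogui.hotkey("command", "l")  # macOS browsers use command+L
    else:
        pyautogui.hotkey("ctrl", "l")     # Windows/Linux browsers use ctrl+L
```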
@@ -83,7 +88,7 @@ operate
 - **Adding New Multimodal Models**: Integration of new multimodal models is welcomed. If you have a specific model in mind that you believe would be a valuable addition, please feel free to integrate it and submit a PR.
 - **Framework Architecture Improvements**: Think you can enhance the framework architecture described in the intro? We welcome suggestions and PRs.
 
-For any input on improving this project, feel free to reach out to me on [Twitter](https://twitter.com/josh_bickett).
+For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.
 
 ### Follow HyperWriteAI for More Updates

@@ -92,4 +97,4 @@ Stay updated with the latest developments:
 - Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).
 
 ### Compatibility
-- This project is only compatible with MacOS at this time.
+- This project is compatible with Mac OS, Windows, and Linux (with X server installed).
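
On Linux, the new screenshot path added in this commit talks to the X server; a purely illustrative pre-flight check (not part of this diff) might look like:

```python
import os
import platform

# Illustrative check: the Xlib/ImageGrab screenshot path on Linux needs a
# reachable X server, normally advertised through the DISPLAY variable.
if platform.system() == "Linux" and not os.environ.get("DISPLAY"):
    print("No DISPLAY set - an X server is required for screen capture.")
```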

operate/main.py

Lines changed: 21 additions & 13 deletions
@@ -11,12 +11,13 @@
 import pyautogui
 import argparse
 import platform
+import Xlib.display
 
 from prompt_toolkit import prompt
 from prompt_toolkit.shortcuts import message_dialog
 from prompt_toolkit.styles import Style as PromptStyle
 from dotenv import load_dotenv
-from PIL import Image, ImageDraw, ImageFont
+from PIL import Image, ImageDraw, ImageFont, ImageGrab
 import matplotlib.font_manager as fm
 from openai import OpenAI

@@ -27,6 +28,7 @@
 
 client = OpenAI()
 client.api_key = os.getenv("OPENAI_API_KEY")
+client.base_url = os.getenv("OPENAI_API_BASE_URL", client.base_url)
 
 VISION_PROMPT = """
 You are a Self-Operating Computer. You use the same operating system as a human.
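
The new `OPENAI_API_BASE_URL` override lets the client be pointed at an OpenAI-compatible endpoint instead of the default. A minimal sketch of how it might be used (the gateway URL is a placeholder, not from the project):

```python
import os
from openai import OpenAI

# Hypothetical: route requests through an OpenAI-compatible gateway.
os.environ["OPENAI_API_BASE_URL"] = "https://llm-gateway.example.com/v1"  # placeholder URL

client = OpenAI()
client.api_key = os.getenv("OPENAI_API_KEY")
# Same pattern as the diff: fall back to the library default when the variable is unset.
client.base_url = os.getenv("OPENAI_API_BASE_URL", client.base_url)
```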
@@ -62,7 +64,7 @@
 Objective: Open Spotify and play the beatles
 SEARCH Spotify
 __
-Objective: Find a image of a banana
+Objective: Find an image of a banana
 CLICK {{ "x": "50%", "y": "60%", "description": "Click: Google Search field", "reason": "This will allow me to search for a banana" }}
 __
 Objective: Go buy a book about the history of the internet
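
The CLICK action in this prompt expresses coordinates as percentages of the screen. An illustrative sketch of turning such values into an actual click (the helper name and parsing are hypothetical, not the project's code):

```python
import pyautogui

def click_at_percent(x_percent: str, y_percent: str) -> None:
    # Hypothetical helper: convert "50%" / "60%" style values from a CLICK
    # action into absolute pixel coordinates, then click there.
    width, height = pyautogui.size()
    x = int(width * float(x_percent.rstrip("%")) / 100)
    y = int(height * float(y_percent.rstrip("%")) / 100)
    pyautogui.click(x, y)

# Using the example values from the prompt above:
click_at_percent("50%", "60%")
```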
@@ -178,10 +180,9 @@ def main(model):
     }
     messages = [assistant_message, user_message]
 
-    looping = True
     loop_count = 0
 
-    while looping:
+    while True:
         if DEBUG:
             print("[loop] messages before next action:\n\n\n", messages[1:])
         try:
@@ -194,25 +195,21 @@ def main(model):
             print(
                 f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
             )
-            looping = False
             break
         except Exception as e:
             print(
                 f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
             )
-            looping = False
             break
 
         if action_type == "DONE":
             print(
                 f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Objective complete {ANSI_RESET}"
             )
-            looping = False
             summary = summarize(messages, objective)
             print(
                 f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Summary\n{ANSI_RESET}{summary}"
             )
-
             break
 
         if action_type != "UNKNOWN":
@@ -234,8 +231,8 @@ def main(model):
             print(
                 f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response\n{ANSI_RESET}{response}"
             )
-            looping = False
             break
+
         print(
             f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA} [Act] {action_type} COMPLETE {ANSI_RESET}{function_response}"
         )
@@ -248,7 +245,7 @@ def main(model):
 
         loop_count += 1
         if loop_count > 10:
-            looping = False
+            break
 
 
 def format_summary_prompt(objective):
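
Taken together, the hunks above replace the `looping` flag with direct `break` statements. A simplified, self-contained sketch of the resulting control flow (the model call is stubbed out; not the project's actual code):

```python
def next_action(step):
    # Stand-in for the vision-model call; the real loop asks the model for an action.
    return "DONE" if step >= 3 else "CLICK"

loop_count = 0
while True:
    action_type = next_action(loop_count)
    if action_type == "DONE":
        break  # objective complete
    loop_count += 1
    if loop_count > 10:
        break  # safety cap on actions per objective, as in the diff
```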
@@ -561,12 +558,23 @@ def search(text):
 
 
 def capture_screen_with_cursor(file_path=os.path.join("screenshots", "screenshot.png")):
-    # Use the screencapture utility to capture the screen with the cursor
-    if platform.system() == "Windows":
+    user_platform = platform.system()
+
+    if user_platform == "Windows":
         screenshot = pyautogui.screenshot()
         screenshot.save(file_path)
-    else:
+    elif user_platform == "Linux":
+        # Use xlib to prevent scrot dependency for Linux
+        screen = Xlib.display.Display().screen()
+        size = screen.width_in_pixels, screen.height_in_pixels
+        screenshot = ImageGrab.grab(bbox=(0, 0, size[0], size[1]))
+        screenshot.save(file_path)
+    elif user_platform == "Darwin":  # (Mac OS)
+        # Use the screencapture utility to capture the screen with the cursor
         subprocess.run(["screencapture", "-C", file_path])
+    else:
+        print(f"The platform you're using ({user_platform}) is not currently supported")
+
 
 def extract_json_from_string(s):
     # print("extracting json from string", s)
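
A minimal standalone sketch of the new Linux branch (assumes an X server is running and the Pillow and python3-xlib packages from requirements.txt are installed):

```python
import Xlib.display
from PIL import ImageGrab

# Mirror of the Linux path above: ask the X server for the screen size,
# then grab that region with Pillow instead of shelling out to scrot.
screen = Xlib.display.Display().screen()
width, height = screen.width_in_pixels, screen.height_in_pixels
ImageGrab.grab(bbox=(0, 0, width, height)).save("screenshot.png")
```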

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ pyperclip==1.8.2
 PyRect==0.2.0
 pyscreenshot==3.1
 PyScreeze==0.1.29
+python3-xlib==0.15
 python-dateutil==2.8.2
 python-dotenv==1.0.0
 pytweening==1.0.7
