Overview
This guide shows how to get started with Google’s Gemini 2.5 Computer Use in minutes, using Orgo to control a virtual desktop environment.
Setup
Install the required packages with pip, then set your API keys either in a .env file or in the terminal.
Complete Example
The full working example, example.py, handles the complete agent loop.
Usage Examples
Basic Tasks
Complex Workflows
Key Concepts
System Prompt
The system prompt provides crucial context to Gemini about the Ubuntu environment:
- Double-click desktop icons to open applications
- Single-click menu items and buttons
- Use appropriate keyboard shortcuts
Getting Your Computer ID
Get your computer_id from the Orgo dashboard:
- Go to https://orgo.ai/projects
- Click on your project
- Find your computer ID in the computer list
- Use it in: Computer(computer_id="your-computer-id")
The Agent Loop
Gemini Computer Use works in a continuous loop:
- Request → Send task with screenshot to the model
- Action → Model suggests actions (click, type, etc.)
- Execute → Your code executes the actions
- Screenshot → Capture the result
- Repeat → Continue until task is complete
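The loop above can be sketched in a few lines. This is a minimal illustration, not the guide's example.py: `run_agent_loop` and its three callables (`request_actions`, `execute_action`, `take_screenshot`) are hypothetical names standing in for the Gemini request, the Orgo action calls, and the screenshot capture.

```python
from typing import Callable, List


def run_agent_loop(
    task: str,
    request_actions: Callable[[str, bytes], List[dict]],
    execute_action: Callable[[dict], None],
    take_screenshot: Callable[[], bytes],
    max_steps: int = 20,
) -> bool:
    """Run the request / execute / screenshot cycle.

    Returns True if the model stopped proposing actions, and False if the
    step limit was reached first.
    """
    screenshot = take_screenshot()
    for _ in range(max_steps):
        # Request: send the task and the latest screenshot to the model.
        actions = request_actions(task, screenshot)
        if not actions:
            return True  # No further actions: treat the task as complete.
        # Execute: run every action the model proposed in this turn.
        for action in actions:
            execute_action(action)
        # Screenshot: capture the result for the next request.
        screenshot = take_screenshot()
    return False
```

Capping the loop with `max_steps` is a design choice of this sketch, so that a task the model never finishes cannot run indefinitely.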
Image Format Conversion
Important: Orgo returns screenshots in JPEG format, but Gemini requires PNG, so convert each screenshot before sending it.
URL Tracking
Important: Gemini Computer Use requires the current URL in every function response.
Coordinate System
Gemini uses normalized coordinates (0-999) that must be converted to actual pixels.
Action Types
Action | Description | Example |
---|---|---|
open_web_browser | Opens the browser | Start Firefox |
click_at | Click at coordinates | Click button at (500, 300) |
type_text_at | Type text at location | Enter “hello” in search box |
scroll_document | Scroll page | Scroll down |
key_combination | Press key combos | Press ctrl+c |
navigate | Go to URL | Load https://example.com |
go_back | Browser back | Previous page |
wait_5_seconds | Pause execution | Wait for page load |
Tool Compatibility
Orgo provides methods corresponding to Gemini’s computer use tools:
Gemini Tool Action | Orgo Method | Description |
---|---|---|
click_at | computer.left_click(x, y) | Click at coordinates |
type_text_at | computer.type(text) | Type text |
key_combination | computer.key(keys) | Press keys (e.g., “ctrl+c”) |
scroll_document | computer.scroll(direction, amount) | Scroll page |
navigate | computer.bash('firefox "url" &') | Open URL |
Screenshot | computer.screenshot_base64() | Capture screen (JPEG) |
wait_5_seconds | computer.wait(5) | Wait 5 seconds |
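As a rough sketch, the mapping in this table can be written as a single dispatch function that also converts Gemini's normalized 0-999 coordinates to pixels. The Orgo method names come from the table; the function names `denormalize` and `execute_action`, the argument keys (`x`, `y`, `text`, `keys`, `direction`, `url`), and the default scroll amount of 3 are assumptions made for illustration and should be checked against the actual function calls the model returns.

```python
def denormalize(value: int, size: int) -> int:
    """Map a normalized 0-999 coordinate onto a screen dimension in pixels."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return value * size // 1000


def execute_action(computer, name: str, args: dict, width: int, height: int) -> None:
    """Run one Gemini action using the Orgo methods listed in the table."""
    if name == "click_at":
        computer.left_click(denormalize(args["x"], width), denormalize(args["y"], height))
    elif name == "type_text_at":
        # Click the target first so the typed text lands in the right field.
        computer.left_click(denormalize(args["x"], width), denormalize(args["y"], height))
        computer.type(args["text"])
    elif name == "key_combination":
        computer.key(args["keys"])
    elif name == "scroll_document":
        # The default amount of 3 is an assumption of this sketch.
        computer.scroll(args["direction"], args.get("amount", 3))
    elif name == "navigate":
        computer.bash(f'firefox "{args["url"]}" &')
    elif name == "wait_5_seconds":
        computer.wait(5)
    else:
        raise ValueError(f"Unsupported action: {name}")
```

Raising on unknown action names, rather than silently ignoring them, makes it obvious when the model uses a tool that has no Orgo counterpart yet.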
Best Practices
1. Clear Instructions
2. Use System Prompts
Always include a system prompt with Ubuntu-specific instructions.
3. Convert Coordinates
Always denormalize Gemini’s normalized coordinates (0-999) to actual pixels.
4. Handle Image Format
Always convert JPEG screenshots to PNG.
5. Include URL in Responses
Always include the current URL in every function response.
6. Add Delays
Comparison with Claude and OpenAI
Feature | Gemini Computer Use | Claude Computer Use | OpenAI Computer Use |
---|---|---|---|
API | Generate Content API | Messages API | Responses API |
Model | gemini-2.5-computer-use-preview | claude-4-sonnet | computer-use-preview |
System Prompt | Supported | Supported | Supported |
Coordinates | Normalized (0-999) | Actual pixels | Actual pixels |
Image Format | PNG required | JPEG/PNG | PNG |
URL Requirement | Required in response | Optional | Optional |
Parallel Actions | Yes | No | No |
Limitations
- Preview Status: Computer Use is in preview and may have unexpected behaviors
- Browser Focus: Optimized for browser-based tasks
- Coordinate System: Requires conversion from normalized to actual pixels
- Image Format: Requires PNG format (Orgo returns JPEG, must convert)
- URL Requirement: Must include URL in every function response
- Rate Limits: Subject to Gemini API rate limits
Troubleshooting
Model doesn’t double-click desktop icons
Make sure you’re including the system prompt with Ubuntu-specific instructions.
INVALID_ARGUMENT: Unable to process input image
This error occurs when Gemini receives a JPEG image instead of PNG. Make sure you’re using the get_screenshot_png() function.
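The body of get_screenshot_png() is not shown here, so the following is only a guess at what such a helper might look like. It assumes Pillow is installed and that computer.screenshot_base64() returns a base64-encoded JPEG string; `jpeg_base64_to_png` is a hypothetical helper name.

```python
import base64
import io

from PIL import Image

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # First eight bytes of every PNG file.


def jpeg_base64_to_png(jpeg_b64: str) -> bytes:
    """Decode a base64 JPEG and re-encode it as PNG bytes."""
    image = Image.open(io.BytesIO(base64.b64decode(jpeg_b64)))
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def get_screenshot_png(computer) -> bytes:
    """Capture a screenshot from Orgo and return it as PNG bytes."""
    return jpeg_base64_to_png(computer.screenshot_base64())
```

Checking that the returned bytes start with `PNG_SIGNATURE` before sending them is a quick way to confirm the conversion actually ran.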
INVALID_ARGUMENT: Requires URL in function response
Always include the url field in your response data.
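One way to keep the url field populated is to remember the last URL the agent navigated to and attach it to every response. The sketch below is illustrative only: `UrlTracker` and `build_response_data` are hypothetical names, the `about:blank` starting value is an assumption, and how the resulting dictionary is wrapped into a function response depends on the Gemini SDK you use.

```python
from typing import Optional


class UrlTracker:
    """Remember the most recent URL the agent navigated to."""

    def __init__(self, start_url: str = "about:blank") -> None:
        self.current_url = start_url

    def update(self, action_name: str, args: dict) -> None:
        # Only explicit navigate actions change the tracked URL in this sketch.
        if action_name == "navigate" and "url" in args:
            self.current_url = args["url"]


def build_response_data(current_url: str, error: Optional[str] = None) -> dict:
    """Build response data that always carries the url field."""
    data = {"url": current_url}
    if error is not None:
        data["error"] = error
    return data
```

Note that this tracker only sees navigate actions; clicks on links change the page without updating it, so a real agent may need a more reliable source for the current URL.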
Missing API Key
Ensure both environment variables are set in your .env file.
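A quick startup check can report which keys are missing before any API call is made. The variable names `ORGO_API_KEY` and `GEMINI_API_KEY` below are assumptions, as is the helper name `missing_env_vars`; substitute whatever names your .env file actually uses.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names; adjust them to match your .env file.
REQUIRED_VARS = ("ORGO_API_KEY", "GEMINI_API_KEY")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in names if not env.get(name)]
```

Call it after the .env file has been loaded (for example with python-dotenv's load_dotenv(), if you use that package) and stop with a clear message if the returned list is not empty.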