Skip to main content

Overview

This guide shows how to get started with Google’s Gemini 2.5 Computer Use in minutes using Orgo to control a virtual desktop environment.

Setup

Install the required packages:
pip
pip install orgo google-genai pillow python-dotenv
Set up your API keys in a .env file:
.env
ORGO_API_KEY=your_orgo_api_key
GEMINI_API_KEY=your_gemini_api_key
Or export them as environment variables:
terminal
export ORGO_API_KEY=your_orgo_api_key
export GEMINI_API_KEY=your_gemini_api_key

Complete Example

Here’s a full working example that handles the complete agent loop:
example.py
import os
import time
import base64
import io
from google import genai
from google.genai import types
from orgo import Computer
from PIL import Image
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize Gemini client
client = genai.Client(api_key=os.environ.get('GEMINI_API_KEY'))

# Connect to your Orgo computer
# Get your computer_id from https://orgo.ai/projects
computer = Computer(computer_id="your-computer-id")

# Screen resolution
SCREEN_WIDTH = 1024
SCREEN_HEIGHT = 768

# System prompt with Ubuntu-specific instructions
SYSTEM_PROMPT = f"""You are controlling an Ubuntu Linux virtual machine with a display resolution of {SCREEN_WIDTH}x{SCREEN_HEIGHT}.

<SYSTEM_CAPABILITY>
* You have access to a virtual Ubuntu desktop environment with standard applications
* You can see the current state through screenshots and control the computer through actions
* The environment has Firefox browser and standard Ubuntu applications pre-installed
</SYSTEM_CAPABILITY>

<UBUNTU_DESKTOP_GUIDELINES>
* CRITICAL: When opening applications or files on the Ubuntu desktop, you MUST USE DOUBLE-CLICK, not single-click
* Single-click only selects desktop icons but DOES NOT open them
* Desktop interactions:
  - Desktop icons (apps/folders): DOUBLE-CLICK to open
  - Menu items: SINGLE-CLICK to select
  - Taskbar/launcher icons: SINGLE-CLICK to open
  - Window buttons (close/minimize/maximize): SINGLE-CLICK
  - File browser items: DOUBLE-CLICK to open
* Always start by taking a screenshot to see the current state
* When you need to submit or confirm, use the 'Enter' key
</UBUNTU_DESKTOP_GUIDELINES>

<IMPORTANT_NOTES>
* Be efficient with screenshots - only take them when you need to see the current state
* Wait for pages/applications to load before taking another screenshot
* Batch multiple actions together when possible before checking the result
</IMPORTANT_NOTES>"""

def denormalize_x(x: int) -> int:
    """Convert normalized x coordinate (0-999) to actual pixel."""
    return int(x / 1000 * SCREEN_WIDTH)

def denormalize_y(y: int) -> int:
    """Convert normalized y coordinate (0-999) to actual pixel."""
    return int(y / 1000 * SCREEN_HEIGHT)

def get_screenshot_png() -> bytes:
    """Get screenshot as PNG bytes (Gemini requires PNG format)."""
    jpeg_data = base64.b64decode(computer.screenshot_base64())
    image = Image.open(io.BytesIO(jpeg_data))
    png_buffer = io.BytesIO()
    image.save(png_buffer, format='PNG')
    return png_buffer.getvalue()

def get_current_url() -> str:
    """Get the current URL from the browser."""
    try:
        result = computer.bash("xdotool getactivewindow getwindowname")
        return result if result else "about:blank"
    except:
        return "about:blank"

def execute_function_calls(candidate):
    """Execute function calls from Gemini's response."""
    results = []
    function_calls = [
        part.function_call 
        for part in candidate.content.parts 
        if part.function_call
    ]
    
    for function_call in function_calls:
        fname = function_call.name
        args = function_call.args
        action_result = {}
        
        print(f"  → {fname}")
        
        try:
            if fname == "open_web_browser":
                pass  # Browser already open
            elif fname == "click_at":
                computer.left_click(denormalize_x(args["x"]), denormalize_y(args["y"]))
            elif fname == "type_text_at":
                computer.left_click(denormalize_x(args["x"]), denormalize_y(args["y"]))
                computer.type(args["text"])
                if args.get("press_enter", False):
                    computer.key("Return")
            elif fname == "scroll_document":
                computer.scroll(args["direction"], 3)
            elif fname == "key_combination":
                computer.key(args["keys"])
            elif fname == "go_back":
                computer.key("alt+Left")
            elif fname == "navigate":
                url = args["url"]
                computer.bash(f'firefox "{url}" &')
                action_result["url"] = url
            elif fname == "wait_5_seconds":
                computer.wait(5)
            else:
                print(f"    Warning: Unimplemented function {fname}")
            
            time.sleep(1)  # Wait for UI to update
        except Exception as e:
            print(f"    Error: {e}")
            action_result = {"error": str(e)}
        
        results.append((fname, action_result))
    
    return results

def get_function_responses(results):
    """Create function responses with screenshot and URL."""
    screenshot_png = get_screenshot_png()
    current_url = get_current_url()
    function_responses = []
    
    for name, result in results:
        response_data = {
            "status": "completed",
            "url": result.get("url", current_url)
        }
        response_data.update(result)
        
        function_responses.append(
            types.FunctionResponse(
                name=name,
                response=response_data,
                parts=[
                    types.FunctionResponsePart(
                        inline_data=types.FunctionResponseBlob(
                            mime_type="image/png",
                            data=screenshot_png
                        )
                    )
                ]
            )
        )
    
    return function_responses

try:
    # Configure Computer Use tool with system instruction
    config = types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER
                )
            )
        ]
    )
    
    # Define task
    task = "Open Firefox and search for 'gemini ai'"
    print(f"Task: {task}\n")
    
    # Get initial screenshot
    initial_screenshot = get_screenshot_png()
    
    # Create initial request
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part(text=task),
                types.Part.from_bytes(
                    data=initial_screenshot,
                    mime_type='image/png'
                )
            ]
        )
    ]
    
    # Agent loop
    for iteration in range(20):
        print(f"\n--- Turn {iteration + 1} ---")
        
        # Get response from Gemini
        response = client.models.generate_content(
            model='gemini-2.5-computer-use-preview-10-2025',
            contents=contents,
            config=config
        )
        
        candidate = response.candidates[0]
        contents.append(candidate.content)
        
        # Display progress
        for part in candidate.content.parts:
            if part.text:
                print(f"💬 {part.text}")
        
        # Check for function calls
        has_function_calls = any(
            part.function_call 
            for part in candidate.content.parts
        )
        
        if not has_function_calls:
            print("\n✓ Task completed")
            break
        
        # Execute actions
        print("→ Executing actions...")
        results = execute_function_calls(candidate)
        
        # Get responses with screenshot and URL
        function_responses = get_function_responses(results)
        
        # Continue conversation
        contents.append(
            types.Content(
                role="user",
                parts=[
                    types.Part(function_response=fr) 
                    for fr in function_responses
                ]
            )
        )

except Exception as e:
    print(f"\n❌ Error: {e}")

finally:
    print("\nDone!")
    # Note: computer.destroy() not called to keep computer running
    # Call computer.destroy() if you want to clean up

Usage Examples

Basic Tasks

# Change the task variable to control what Gemini does
task = "Open Firefox and search for 'gemini ai'"

# Navigate to a website
task = "Go to github.com and search for 'orgo'"

# Fill a form
task = "Fill out the contact form with test data"

Complex Workflows

# Multi-step task
task = """
1. Open a text editor
2. Write a Python hello world program
3. Save it as hello.py
4. Open a terminal
5. Run the program
"""

Key Concepts

System Prompt

The system prompt provides crucial context to Gemini about the Ubuntu environment:
SYSTEM_PROMPT = f"""You are controlling an Ubuntu Linux virtual machine...

<UBUNTU_DESKTOP_GUIDELINES>
* CRITICAL: When opening applications or files on the Ubuntu desktop, 
  you MUST USE DOUBLE-CLICK, not single-click
* Single-click only selects desktop icons but DOES NOT open them
* Desktop icons (apps/folders): DOUBLE-CLICK to open
* Menu items: SINGLE-CLICK to select
</UBUNTU_DESKTOP_GUIDELINES>
"""
This ensures Gemini knows to:
  • Double-click desktop icons to open applications
  • Single-click menu items and buttons
  • Use appropriate keyboard shortcuts

Getting Your Computer ID

Get your computer_id from the Orgo dashboard:
  1. Go to https://orgo.ai/projects
  2. Click on your project
  3. Find your computer ID in the computer list
  4. Use it in: Computer(computer_id="your-computer-id")

The Agent Loop

Gemini Computer Use works in a continuous loop:
  1. Request → Send task with screenshot to the model
  2. Action → Model suggests actions (click, type, etc.)
  3. Execute → Your code executes the actions
  4. Screenshot → Capture the result
  5. Repeat → Continue until task is complete

Image Format Conversion

Important: Orgo returns screenshots in JPEG format, but Gemini requires PNG format:
def get_screenshot_png() -> bytes:
    """Get screenshot as PNG bytes (Gemini requires PNG format)."""
    jpeg_data = base64.b64decode(computer.screenshot_base64())
    image = Image.open(io.BytesIO(jpeg_data))
    png_buffer = io.BytesIO()
    image.save(png_buffer, format='PNG')
    return png_buffer.getvalue()

URL Tracking

Important: Gemini Computer Use requires the current URL in every function response:
response_data = {
    "status": "completed",
    "url": result.get("url", current_url)  # Always include URL
}

Coordinate System

Gemini uses normalized coordinates (0-999) that must be converted to actual pixels:
def denormalize_x(x: int) -> int:
    return int(x / 1000 * SCREEN_WIDTH)

def denormalize_y(y: int) -> int:
    return int(y / 1000 * SCREEN_HEIGHT)
Orgo’s default screen resolution is 1024x768.

Action Types

ActionDescriptionExample
open_web_browserOpens the browserStart Firefox
click_atClick at coordinatesClick button at (500, 300)
type_text_atType text at locationEnter “hello” in search box
scroll_documentScroll pageScroll down
key_combinationPress key combosPress ctrl+c
navigateGo to URLLoad https://example.com
go_backBrowser backPrevious page
wait_5_secondsPause executionWait for page load

Tool Compatibility

Orgo provides methods corresponding to Gemini’s computer use tools:
Gemini Tool ActionOrgo MethodDescription
click_atcomputer.left_click(x, y)Click at coordinates
type_text_atcomputer.type(text)Type text
key_combinationcomputer.key(keys)Press keys (e.g., “ctrl+c”)
scroll_documentcomputer.scroll(direction, amount)Scroll page
navigatecomputer.bash('firefox "url" &')Open URL
Screenshotcomputer.screenshot_base64()Capture screen (JPEG)
wait_5_secondscomputer.wait(5)Wait 5 seconds

Best Practices

1. Clear Instructions

# ✅ Good - Specific and clear
task = "Go to amazon.com and find the top 3 rated laptops under $1000"

# ❌ Avoid - Too vague
task = "Find some laptops"

2. Use System Prompts

Always include a system prompt with Ubuntu-specific instructions:
config = types.GenerateContentConfig(
    system_instruction=SYSTEM_PROMPT,  # Include OS-specific guidance
    tools=[...]
)

3. Convert Coordinates

Always denormalize Gemini’s normalized coordinates (0-999):
actual_x = denormalize_x(args["x"])
actual_y = denormalize_y(args["y"])

4. Handle Image Format

Always convert JPEG screenshots to PNG:
screenshot_png = get_screenshot_png()

5. Include URL in Responses

Always include the current URL:
response_data = {
    "status": "completed",
    "url": current_url
}

6. Add Delays

time.sleep(1)  # Wait for UI to update after actions

Comparison with Claude and OpenAI

FeatureGemini Computer UseClaude Computer UseOpenAI Computer Use
APIGenerate Content APIMessages APIResponses API
Modelgemini-2.5-computer-use-previewclaude-4-sonnetcomputer-use-preview
System PromptSupportedSupportedSupported
CoordinatesNormalized (0-999)Actual pixelsActual pixels
Image FormatPNG requiredJPEG/PNGPNG
URL RequirementRequired in responseOptionalOptional
Parallel ActionsYesNoNo

Limitations

  • Preview Status: Computer Use is in preview and may have unexpected behaviors
  • Browser Focus: Optimized for browser-based tasks
  • Coordinate System: Requires conversion from normalized to actual pixels
  • Image Format: Requires PNG format (Orgo returns JPEG, must convert)
  • URL Requirement: Must include URL in every function response
  • Rate Limits: Subject to Gemini API rate limits

Troubleshooting

Model doesn’t double-click desktop icons

Make sure you’re including the system prompt with Ubuntu-specific instructions:
config = types.GenerateContentConfig(
    system_instruction=SYSTEM_PROMPT,  # This is critical!
    tools=[...]
)

INVALID_ARGUMENT: Unable to process input image

This error occurs when Gemini receives a JPEG image instead of PNG. Make sure you’re using the get_screenshot_png() function.

INVALID_ARGUMENT: Requires URL in function response

Always include the url field in your response data:
response_data = {
    "status": "completed",
    "url": result.get("url", current_url)
}

Missing API Key

Ensure both environment variables are set in your .env file:
ORGO_API_KEY=your_orgo_api_key
GEMINI_API_KEY=your_gemini_api_key

Next Steps

I