OmniParser Integration

Fixes: #732 /claim #732

Features Added

  • Added hybrid extraction method to BrowserContext
  • Introduced OmniParser configuration in BrowserContextConfig
  • Implemented fallback mechanism for DOM extraction
  • Added screenshot capture method for OmniParser analysis
  • Added comprehensive documentation for OmniParser feature
  • Updated README with hybrid extraction and CAPTCHA detection details

Demonstration Examples

The PR includes three example scripts demonstrating different capabilities:

1. CAPTCHA Detection (omniparser_captcha.py)

This example demonstrates how the OmniParser integration can detect CAPTCHA elements on webpages:

INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://docs.browser-use.com/development/telemetry for more information.
INFO     [__main__] Navigating to page with CAPTCHA...
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Merging 18 OmniParser elements with DOM results
INFO     [__main__] Detected elements:
INFO     [__main__] Screenshot saved to captcha_detected.png

Screenshot: captcha_detected.png image

2. Complex UI Handling (omniparser_complex_ui.py)

This example shows how OmniParser helps navigate complex UI elements that traditional DOM-based methods struggle with:

Part 1: Complex Form Interaction

  • Demonstrates fallback mechanism when direct selectors fail
  • Successfully interacts with Airbnb search form using visual recognition
INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://docs.browser-use.com/development/telemetry for more information.
INFO     [__main__] PART 1: Handling complex form interaction...
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Merging 87 OmniParser elements with DOM results
INFO     [__main__] Initial state has 224 elements
INFO     [__main__] Saved initial screenshot to airbnb_initial.png
INFO     [__main__] Looking for the search/location input field...
INFO     [__main__] Could not interact with search field using direct selectors: Page.click: Timeout 30000ms exceeded.
Call log:
  - waiting for locator("text=Where")
  -     - locator resolved to 2 elements. Proceeding with the first one: <div class="f16sug5q atm_c8_1cw7z3g atm_g3_qslrf5 atm_cs_10d11i2 atm_l8_1mni9fk atm_sq_1l2sidv atm_vv_1q9ccgz atm_ks_15vqwwr atm_am_ggq5uc atm_jb_1xtcb10 dir dir-ltr">Anywhere</div>
  -   - attempting click action
  -     2 × waiting for element to be visible, enabled and stable
  -       - element is not visible
  -     - retrying click action
  -     - waiting 20ms
  -     2 × waiting for element to be visible, enabled and stable
  -       - element is not visible
  -     - retrying click action
  -       - waiting 100ms
  -     56 × waiting for element to be visible, enabled and stable
  -        - element is not visible
  -      - retrying click action
  -        - waiting 500ms

INFO     [__main__] Trying alternative search approach...
INFO     [__main__] Found 1 potential search elements
INFO     [__main__] Clicked on search element with xpath: html/body/div[5]/div/div/div/div/div[3]/div[2]/div/div/div/header/div/div[2]/div[2]/div/div/div/form/div[2]/div/div/div/label/div/input
INFO     [__main__] Entered 'London' as search text
INFO     [__main__] Saved fallback search screenshot to airbnb_search_fallback.png
INFO     [__main__] Complex form interaction complete

Screenshots:

  • airbnb_initial.png
  • airbnb_search_fallback.png image image

Part 2: Dynamic UI Components

  • Successfully identifies and interacts with carousel controls
INFO     [__main__] PART 2: Handling dynamic UI components...
INFO     [__main__] Saved initial carousel screenshot to carousel_initial.png
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Merging 39 OmniParser elements with DOM results
INFO     [__main__] Found 5 potential carousel controls
INFO     [__main__] Clicking carousel 'Next' button...
INFO     [__main__] Saved carousel 'next' screenshot to carousel_next.png
INFO     [__main__] Saved final carousel screenshot to carousel_final.png
INFO     [__main__] Dynamic component interaction complete

Screenshots:

  • carousel_initial.png
  • carousel_next.png
  • carousel_final.png image image image

Part 3: Visual Information Extraction

  • Extracts structured data from GitHub trending page
  • Demonstrates enhanced ability to interpret visual layout and content
INFO     [__main__] PART 3: Extracting visual information...
INFO     [__main__] Saved GitHub trending page screenshot to github_trending.png
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Using hosted OmniParser API (local installation not available)
INFO     [omniparser] Merging 168 OmniParser elements with DOM results
INFO     [__main__] Extracted information about 10 trending repositories:
INFO     [__main__] 1. sponsors/vllm-project
INFO     [__main__]    Description: Cost-efficient and pluggable Infrastructure components for GenAI inference
INFO     [__main__] 2. al1abb/invoify
INFO     [__main__]    Description: An invoice generator app built using Next.js, Typescript, and Shadcn
INFO     [__main__] 3. sponsors/NirDiamant
INFO     [__main__]    Description: This repository provides tutorials and implementations for various Generative AI Agent techniques...
INFO     [__main__] 4. deepseek-ai/awesome-deepseek-integration
INFO     [__main__] 5. mishushakov/llm-scraper
INFO     [__main__]    Description: Turn any webpage into structured data using LLMs
INFO     [__main__] Saved extracted repository data to trending_repos.json
INFO     [__main__] Visual information extraction complete

Screenshot: github_trending.png Screenshot 2025-02-27 at 03 25 44

Technical Implementation

  • Integration with hosted OmniParser API with local fallback option
  • Merging of OmniParser visual elements with traditional DOM results for robust extraction
  • Intelligent fallback mechanisms when traditional selectors fail
  • Automatic screenshot capture at critical interaction points for debugging and verification

Key Benefits

  • More reliable interaction with complex web UIs
  • Better handling of dynamic, JavaScript-heavy websites
  • Enhanced capabilities for detecting and handling CAPTCHAs
  • Fallback mechanisms for improved automation reliability
  • Improved extraction of structured data from visually complex pages

Example Usage

The integration simplifies working with complex web UIs by automatically:

  1. Attempting traditional DOM-based interaction first
  2. Falling back to visual recognition when traditional methods fail
  3. Merging results from both approaches for more comprehensive data extraction

Testing

The three example scripts demonstrate functionality across different use cases:

  • CAPTCHA detection and handling
  • Complex UI navigation (Airbnb example)
  • Dynamic component interaction (carousel example)
  • Structured data extraction (GitHub trending repositories)

Documentation

  • Added comprehensive documentation for all new OmniParser features
  • Updated README with integration details
  • Included example scripts demonstrating real-world usage

Claim

Total prize pool $100
Total paid $0
Status Pending
Submitted February 27, 2025
Last updated February 27, 2025

Contributors

DA

David Anyatonwu

@onyedikachi-david

100%

Sponsors

BR

Browser Use (YC W25)

@browser-use

$100