Description
This PR adds a GitHub Pages connector as a new connector type in the Onyx platform. The connector lets users index and search content from GitHub Pages websites by connecting to the underlying GitHub repositories and processing their files.
New Feature: GitHub Pages Connector
The GitHub Pages connector provides the following capabilities:
Core Functionality:
- Repository Integration: Connects to GitHub repositories via the GitHub API
- Multi-format Support: Indexes HTML, Markdown, reStructuredText, AsciiDoc, and plain text files
- Smart Filtering: Filters by file type, directory depth, and file size
- Incremental Updates: Supports polling based on file modification dates
- Rate Limiting: Handles GitHub API rate limits with exponential backoff
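The rate-limit handling listed above can be illustrated with a minimal sketch of exponential backoff. This is not the connector's actual code: `RateLimitedError`, `backoff_delay`, `call_with_backoff`, and the default values (1 second base, 60 second cap, 5 retries, 10% jitter) are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on RateLimitedError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            delay = backoff_delay(attempt, base, cap)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting; a production version would more likely wait until the reset time that GitHub reports in its rate-limit response headers.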
Configuration Options:
- Repository Owner: GitHub username or organization
- Repository Name: Name of the repository containing GitHub Pages
- Branch: Branch to scan (default: gh-pages)
- Root Directory: Optional subdirectory to index
- Max Files: Maximum number of files to index (default: 1000)
- Max Depth: Maximum directory depth for crawling
- Timeout: Request timeout in seconds
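As a rough sketch, the options above could be modeled as a validated settings object. The class name `GitHubPagesConfig`, its field names, and the `in_scope` helper are hypothetical. Only the `gh-pages` and `1000` defaults come from this description; the description gives no defaults for max depth or timeout, so they are left as `None` here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GitHubPagesConfig:
    """Hypothetical container for the connector options listed above."""

    repo_owner: str
    repo_name: str
    branch: str = "gh-pages"               # default stated in the description
    root_directory: Optional[str] = None   # None means the repository root
    max_files: int = 1000                  # default stated in the description
    max_depth: Optional[int] = None        # assumption: None means unlimited
    timeout_seconds: Optional[float] = None  # assumption: None means client default

    def __post_init__(self) -> None:
        if not self.repo_owner or not self.repo_name:
            raise ValueError("repo_owner and repo_name are required")
        if self.max_files <= 0:
            raise ValueError("max_files must be positive")
        if self.max_depth is not None and self.max_depth < 0:
            raise ValueError("max_depth must be non-negative")
        if self.timeout_seconds is not None and self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")

    def in_scope(self, path: str) -> bool:
        """True if `path` is under root_directory and within max_depth."""
        path = path.strip("/")
        root = (self.root_directory or "").strip("/")
        if root:
            if path != root and not path.startswith(root + "/"):
                return False
            path = path[len(root):].strip("/")
        # Depth counts directories between the root and the file itself.
        depth = path.count("/")
        return self.max_depth is None or depth <= self.max_depth
```

For example, with `root_directory="docs"` and `max_depth=1`, the path `docs/index.html` is in scope while `docs/a/b/c.md` and `src/index.html` are not.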
Supported File Types:
- .html, .htm - HTML files (processed with BeautifulSoup)
- .md, .markdown - Markdown files (converted to HTML then processed)
- .txt - Plain text files
- .rst - reStructuredText files
- .asciidoc, .adoc - AsciiDoc files
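One plausible way to apply this list is an extension-to-format lookup that runs before any file is downloaded. The sketch below uses only the standard library; the names `EXTENSION_FORMATS`, `detect_format`, and `select_indexable` are illustrative and not taken from the PR.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

# Extensions from the list above, mapped to a format label (labels are assumptions).
EXTENSION_FORMATS = {
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
    ".rst": "rst",
    ".asciidoc": "asciidoc",
    ".adoc": "asciidoc",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format label for a repository path, or None if unsupported."""
    suffix = PurePosixPath(path).suffix.lower()  # case-insensitive match
    return EXTENSION_FORMATS.get(suffix)


def select_indexable(paths: Iterable[str], max_files: int = 1000) -> List[str]:
    """Keep supported paths in their original order, up to max_files."""
    selected = [p for p in paths if detect_format(p) is not None]
    return selected[:max_files]
```

Here `select_indexable(["a.md", "b.png", "c.rst"])` keeps `a.md` and `c.rst` and drops the image.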
fixes https://github.com/onyx-dot-app/onyx/issues/2282
/claim https://github.com/onyx-dot-app/onyx/issues/2282
Summary by cubic
Added a new GitHub Pages connector that lets users index and search content from GitHub Pages sites by connecting to GitHub repositories and processing their files. This addresses the requirements in issue #2282.
- New Features
  - Supports indexing HTML, Markdown, reStructuredText, and text files from a specified repository and branch.
  - Allows filtering by file type, directory depth, and file size.
  - Handles incremental updates using file modification dates and manages GitHub API rate limits.
  - Includes configuration options for repository owner, name, branch, root directory, max files, max depth, and timeout.
  - Added UI and type support for the new connector in the web app.