Summary
Implements a complete GitHub Pages connector that indexes GitHub Pages sites via the GitHub API. Supports public and private repositories, recursive file traversal, configurable limits and incremental updates.
Changes Made
- New GitHubPagesConnector implementing LoadConnector and PollConnector interfaces.
- GitHub API integration using PyGithub for authentication and file retrieval.
- Content processing for HTML, Markdown, TXT, RST and AsciiDoc files, using BeautifulSoup and the markdown library to extract text.
- Factory registration for automatic connector discovery.
- Comprehensive error handling including rate limiting, repository validation and size/depth limits.
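As a rough illustration of the load/poll split described above, here is a minimal sketch. It does not use the project's real interfaces: the class name, the `PageFile` record, and the method names `load_from_state` and `poll_source` are assumptions chosen for illustration, and the in-memory file list stands in for data that would come from the GitHub API.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PageFile:
    """Hypothetical record for one file found in the repository."""
    path: str
    text: str
    modified_at: float  # seconds since the Unix epoch


class GitHubPagesConnectorSketch:
    """Illustrative stand-in; not the connector added by this PR."""

    def __init__(self, files: List[PageFile], batch_size: int = 2) -> None:
        self.files = files
        self.batch_size = batch_size

    def _batches(self, files: List[PageFile]) -> Iterator[List[PageFile]]:
        # Yield fixed-size batches rather than one large list.
        for i in range(0, len(files), self.batch_size):
            yield files[i:i + self.batch_size]

    def load_from_state(self) -> Iterator[List[PageFile]]:
        # Full load: every known file, regardless of age.
        return self._batches(self.files)

    def poll_source(self, start: float, end: float) -> Iterator[List[PageFile]]:
        # Incremental poll: only files modified inside [start, end].
        changed = [f for f in self.files if start <= f.modified_at <= end]
        return self._batches(changed)
```

A full load walks everything once; later polls pass the previous run's end time as `start`, so unchanged files are not reprocessed.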
Features
- Recursive file tree analysis of the gh-pages branch or specified branch/directory.
- Support for public and private repositories via personal access token.
- Filters files by type (HTML, MD, TXT, RST, AsciiDoc) and skips non-content or binary files.
- Configurable limits on file size, maximum files and directory depth.
- Incremental polling based on file modification timestamps.
- GitHub Pages URL generation for proper document linking.
- Robust error handling and logging.
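The traversal, filtering and limit features can be sketched roughly as follows. This is a simplified model, not the PR's code: a nested dict stands in for the repository tree, and the extension list, default values and function names are assumptions made for the example.

```python
from itertools import islice
from typing import Any, Dict, Iterator, Tuple

# Assumed set of content extensions; the PR may use a different list.
CONTENT_EXTENSIONS = (".html", ".htm", ".md", ".txt", ".rst", ".adoc")


def _walk(
    tree: Dict[str, Any],
    max_depth: int,
    max_size_bytes: int,
    prefix: str = "",
    depth: int = 0,
) -> Iterator[Tuple[str, int]]:
    # Directories are dicts; files are integers giving their size in bytes.
    for name in sorted(tree):
        node = tree[name]
        path = f"{prefix}{name}"
        if isinstance(node, dict):
            # Descend only while the depth limit has not been reached.
            if depth < max_depth:
                yield from _walk(node, max_depth, max_size_bytes, path + "/", depth + 1)
        elif path.lower().endswith(CONTENT_EXTENSIONS) and node <= max_size_bytes:
            # Keep content files that are within the size limit.
            yield path, node


def iter_content_files(
    tree: Dict[str, Any],
    max_files: int = 1000,
    max_depth: int = 10,
    max_size_bytes: int = 1_000_000,
) -> Iterator[Tuple[str, int]]:
    # islice stops the walk as soon as max_files entries have been produced.
    return islice(_walk(tree, max_depth, max_size_bytes), max_files)
```

Because the walk is a generator, the file-count limit stops traversal early instead of collecting every path first and truncating afterwards.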
Testing
- Validated against the microsoft/vscode-docs repository using a personal access token.
- Confirmed correct retrieval and parsing of HTML, Markdown and text files.
- Confirmed correct generation of GitHub Pages URLs and extraction of metadata.
- Tested both public and private repository access and rate limiting behaviour.
Configuration
Required: repo_owner, repo_name
Optional: branch (default gh-pages), root_directory, max_files, max_depth, github_access_token
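To make the options concrete, here is a hedged sketch of how they might be grouped and used to build a page URL. The field names and the gh-pages default come from the list above; the `GitHubPagesConfig` class, the `pages_url` helper and the numeric defaults are hypothetical. The URL rule assumes the standard github.io layout and ignores custom domains and any renaming of Markdown files to HTML by the site generator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GitHubPagesConfig:
    repo_owner: str
    repo_name: str
    branch: str = "gh-pages"  # default stated in the description
    root_directory: str = ""
    max_files: int = 1000  # assumed default, not taken from the PR
    max_depth: int = 10  # assumed default, not taken from the PR
    github_access_token: Optional[str] = None  # needed for private repositories


def pages_url(config: GitHubPagesConfig, file_path: str) -> str:
    """Map a repository file path to its assumed github.io URL."""
    root = config.root_directory.strip("/")
    rel = file_path.lstrip("/")
    # Drop the configured root directory so the URL is relative to the site root.
    if root and rel.startswith(root + "/"):
        rel = rel[len(root) + 1:]
    base = f"https://{config.repo_owner}.github.io"
    # A repository named <owner>.github.io is served from the domain root;
    # any other repository is served under /<repo_name>/.
    if config.repo_name.lower() != f"{config.repo_owner.lower()}.github.io":
        base = f"{base}/{config.repo_name}"
    return f"{base}/{rel}"
```

For example, with owner `octocat`, repository `demo` and `root_directory="docs"`, the file `docs/guide/index.html` maps to `https://octocat.github.io/demo/guide/index.html`.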
fixes #2282
/claim #2282
Summary by cubic
Added a GitHub Pages connector to index content from GitHub Pages sites using the GitHub API, supporting both public and private repositories with configurable limits and incremental updates. This enables automatic discovery and processing of HTML, Markdown, and text files for improved document coverage.
- New Features
  - Supports recursive file traversal, file type filtering, and size/depth limits.
  - Handles authentication, error cases, and incremental polling based on file modification times.
  - Generates correct GitHub Pages URLs for indexed documents.