Summary

Adds a GitHub Pages connector that indexes GitHub Pages sites through the GitHub API. It supports public and private repositories, recursive file traversal, configurable limits and incremental updates.

Changes Made

  • New GitHubPagesConnector implementing LoadConnector and PollConnector interfaces.
  • GitHub API integration using PyGithub for authentication and file retrieval.
  • Content processing for HTML, Markdown, TXT, RST and AsciiDoc files, using the BeautifulSoup and markdown libraries.
  • Factory registration for automatic connector discovery.
  • Comprehensive error handling including rate limiting, repository validation and size/depth limits.
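The size, depth and file-count limits listed above could be enforced during traversal roughly as in the sketch below. It is not the connector's actual code: `Entry`, `walk_tree`, the `list_dir` callable and the default limit values are all hypothetical names and numbers chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for one item of a directory listing."""
    path: str
    type: str  # "file" or "dir", mirroring the GitHub contents API
    size: int  # size in bytes (0 for directories)


def walk_tree(
    list_dir: Callable[[str], Iterable[Entry]],
    root: str = "",
    max_depth: int = 5,          # assumed default, not from the PR
    max_files: int = 1000,       # assumed default, not from the PR
    max_size: int = 1_000_000,   # assumed default, not from the PR
) -> Iterator[Entry]:
    """Yield file entries under `root`, honouring depth, count and size limits."""
    emitted = 0
    stack = [(root, 0)]  # (directory path, depth of that directory)
    while stack and emitted < max_files:
        path, depth = stack.pop()
        for entry in list_dir(path):
            if entry.type == "dir":
                # Only descend while the child directory stays within max_depth.
                if depth + 1 <= max_depth:
                    stack.append((entry.path, depth + 1))
            elif entry.size <= max_size:
                yield entry
                emitted += 1
                if emitted >= max_files:
                    return
```

With PyGithub, `list_dir` might be `lambda p: repo.get_contents(p, ref=branch)`, since directory listings come back as `ContentFile` objects exposing `path`, `type` and `size`. That wiring is an assumption about how the connector is built, not something stated in this PR.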

Features

  • Recursive file tree analysis of the gh-pages branch or specified branch/directory.
  • Support for public and private repositories via personal access token.
  • Filtering to content files (HTML, MD, TXT, RST, AsciiDoc); non-content and binary files are skipped.
  • Configurable limits on file size, maximum files and directory depth.
  • Incremental polling based on file modification timestamps.
  • GitHub Pages URL generation for proper document linking.
  • Robust error handling and logging.
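As a rough illustration of three of the features above (file filtering, Pages URL generation and timestamp-based polling), here is a minimal sketch. The function names, the exact extension set, the `index.html` handling and the inclusive polling window are assumptions for illustration; the PR does not spell out these details.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

# Assumed extension set; the PR names the formats but not the exact suffixes.
CONTENT_EXTENSIONS = {".html", ".htm", ".md", ".markdown", ".txt", ".rst", ".adoc"}


def is_content_file(path: str) -> bool:
    """Return True if the file suffix is one of the indexed content types."""
    return PurePosixPath(path).suffix.lower() in CONTENT_EXTENSIONS


def pages_url(owner: str, repo: str, path: str, root_directory: str = "") -> str:
    """Map a repository file path to its GitHub Pages URL."""
    base = f"https://{owner}.github.io"
    # Repos named <owner>.github.io are served from the domain root;
    # every other repo is served under /<repo>/.
    if repo.lower() != f"{owner.lower()}.github.io":
        base += f"/{repo}"
    rel = path
    root = root_directory.strip("/")
    if root and (path == root or path.startswith(root + "/")):
        rel = path[len(root):].lstrip("/")
    # Assumption: link index.html pages by their directory URL.
    if rel == "index.html" or rel.endswith("/index.html"):
        rel = rel[: -len("index.html")]
    return f"{base}/{rel}"


def modified_in_window(
    files: Iterable[Tuple[str, datetime]], start_epoch: float, end_epoch: float
) -> List[str]:
    """Return paths whose (timezone-aware) modification time is in [start, end]."""
    start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    return [path for path, modified in files if start <= modified <= end]
```

For example, `pages_url("octocat", "docs", "guide/intro.html")` gives `https://octocat.github.io/docs/guide/intro.html`. Sites built with Jekyll render Markdown to HTML, so a real connector would likely need extra mapping for `.md` sources; that is left out here.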

Testing

  • Validated against the microsoft/vscode-docs repository using a personal access token.
  • Confirmed correct retrieval and parsing of HTML, Markdown and text files.
  • Confirmed correct generation of GitHub Pages URLs and extraction of metadata.
  • Tested both public and private repository access and rate limiting behaviour.

Configuration

  • Required: repo_owner, repo_name
  • Optional: branch (default gh-pages), root_directory, max_files, max_depth, github_access_token
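The configuration options could be modelled along these lines. This is a sketch only: the class name `GitHubPagesConfig` is hypothetical, and apart from `branch`, every default (empty root directory, `None` meaning "no limit" or "no token") is an assumption, since the PR gives only the `gh-pages` default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GitHubPagesConfig:
    # Required fields
    repo_owner: str
    repo_name: str
    # Optional fields; only the branch default comes from the PR description
    branch: str = "gh-pages"
    root_directory: str = ""                   # assumed: empty means repo root
    max_files: Optional[int] = None            # assumed: None means no limit
    max_depth: Optional[int] = None            # assumed: None means no limit
    github_access_token: Optional[str] = None  # assumed: None means unauthenticated

    def __post_init__(self) -> None:
        if not self.repo_owner or not self.repo_name:
            raise ValueError("repo_owner and repo_name are required")
        for name in ("max_files", "max_depth"):
            value = getattr(self, name)
            if value is not None and value < 0:
                raise ValueError(f"{name} must be non-negative")


# Usage with the repository named in the Testing section.
config = GitHubPagesConfig(repo_owner="microsoft", repo_name="vscode-docs")
```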

fixes #2282
/claim #2282


Summary by cubic

Added a GitHub Pages connector to index content from GitHub Pages sites using the GitHub API, supporting both public and private repositories with configurable limits and incremental updates. This enables automatic discovery and processing of HTML, Markdown, and text files for improved document coverage.

  • New Features
    • Supports recursive file traversal, file type filtering, and size/depth limits.
    • Handles authentication, error cases, and incremental polling based on file modification times.
    • Generates correct GitHub Pages URLs for indexed documents.

Claim

Total prize pool $250
Total paid $0
Status Pending
Submitted August 04, 2025
Last updated August 04, 2025

Contributors

hoklims

@hoklims

100%

Sponsors

Onyx (YC W24)

@onyx-dot-app

$250