Overview

This pull request implements the enhancements outlined in issue #3, “Extend Support for Multiple File Types in OCR/Extraction Service”. It extends file type support beyond the existing PDF, DOCX, and PPTX formats to nine additional formats: DOC, XLSX, XLS, ODT, ODS, ODP, TXT, RTF, and EPUB.

Changes

Added Support for New File Types:

  1. Microsoft Office Formats:

    • Added support for .DOC (older Word format) using docx2txt
    • Added support for .XLSX (Excel) using openpyxl
    • Added support for .XLS (older Excel format) using xlrd
  2. OpenDocument Formats:

    • Added support for .ODT (OpenDocument Text) using odfpy
    • Added support for .ODS (OpenDocument Spreadsheet) using odfpy
    • Added support for .ODP (OpenDocument Presentation) using odfpy
  3. Text Formats:

    • Added support for plain text .TXT files
    • Added support for rich text .RTF files using striprtf
  4. E-book Format:

    • Added support for .EPUB format using ebooklib and BeautifulSoup4
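As a rough illustration of the format-to-library mapping above, the sketch below checks whether the packages needed for a given extension are importable. `REQUIRED_PACKAGES` and `missing_packages()` are hypothetical names that are not part of this PR; only the pip package names come from the list above, paired with their usual import names (`odf` for odfpy, `bs4` for beautifulsoup4).

```python
from importlib.util import find_spec

# Extension -> list of (pip package, importable module), following the list above.
# Plain .txt files need no third-party package.
# Hypothetical table for illustration; it is not taken from the PR's code.
REQUIRED_PACKAGES = {
    ".doc": [("docx2txt", "docx2txt")],
    ".xlsx": [("openpyxl", "openpyxl")],
    ".xls": [("xlrd", "xlrd")],
    ".odt": [("odfpy", "odf")],
    ".ods": [("odfpy", "odf")],
    ".odp": [("odfpy", "odf")],
    ".txt": [],
    ".rtf": [("striprtf", "striprtf")],
    ".epub": [("ebooklib", "ebooklib"), ("beautifulsoup4", "bs4")],
}


def missing_packages(extension: str) -> list[str]:
    """Return the pip packages required for `extension` that cannot be imported.

    Unknown extensions yield an empty list; this helper only reports on
    dependencies and does not decide whether a format is supported.
    """
    return [
        pip_name
        for pip_name, module in REQUIRED_PACKAGES.get(extension.lower(), [])
        if find_spec(module) is None
    ]
```

A caller could run `missing_packages(".epub")` before attempting extraction and report any missing package by name instead of failing on import.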

Implementation Details:

  1. New Extraction Functions:

    • Added extract_text_from_doc() for DOC files
    • Added extract_text_from_xlsx() and extract_text_from_xls() for Excel files
    • Added extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
    • Added extract_text_from_txt() and extract_text_from_rtf() for text formats
    • Added extract_text_from_epub() for EPUB files
  2. File Type Detection:

    • Updated file type detection in __init__.py for both URL and local file paths
    • Updated file type detection in chunking_routes.py for API uploads
  3. Processing Logic:

    • Updated process_document_chunking() in services/chunking.py to handle all new file types
  4. Dependencies:

    • Added necessary dependencies to requirements.txt:
      • docx2txt
      • openpyxl
      • xlrd
      • odfpy
      • ebooklib
      • striprtf
      • beautifulsoup4
    • Updated setup.py with the same dependencies
  5. Documentation:

    • Updated README.md to reflect the new supported file types
    • Added information about new dependencies
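The file type detection and processing steps above could be sketched roughly as follows. This is a minimal illustration, not the code in `__init__.py`, `chunking_routes.py`, or `services/chunking.py`: `detect_file_type()`, `SUPPORTED_EXTENSIONS`, and `NEW_EXTRACTORS` are hypothetical names, and only the `extract_text_from_*` function names are taken from this PR.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Previously supported extensions plus the ones added in this PR.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx",   # existing formats
    ".doc", ".xlsx", ".xls",    # Microsoft Office
    ".odt", ".ods", ".odp",     # OpenDocument
    ".txt", ".rtf",             # text
    ".epub",                    # e-book
}

# File type -> name of the new extraction function listed in this PR.
NEW_EXTRACTORS = {
    ext: f"extract_text_from_{ext}"
    for ext in ("doc", "xlsx", "xls", "odt", "ods", "odp", "txt", "rtf", "epub")
}


def detect_file_type(source: str) -> str:
    """Return the lowercase extension (without the dot) of a URL or local path.

    Hypothetical helper: raises ValueError when the extension is unsupported.
    """
    parsed = urlparse(source)
    # For HTTP(S) URLs, look only at the path so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    # Normalize Windows separators so the suffix is found on any platform.
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return suffix.lstrip(".")
```

With a table like `NEW_EXTRACTORS`, the processing step can look up the extractor by detected type instead of growing an if/elif chain for each new format.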

Testing

Each new extraction function includes error handling and logging, following the same patterns as the existing extraction functions for PDF, DOCX, and PPTX.
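To illustrate the error handling and logging pattern, here is a minimal sketch of what a function like `extract_text_from_txt()` might look like. The body is an assumption rather than the PR's implementation: returning an empty string on failure and falling back from UTF-8 to Latin-1 are illustrative choices only.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain-text file and return its contents.

    Assumed convention for this sketch: return "" if the file cannot be read.
    """
    try:
        raw = Path(file_path).read_bytes()
    except OSError as exc:
        # Covers missing files, permission errors, and similar I/O failures.
        logger.error("Failed to extract text from %s: %s", file_path, exc)
        return ""

    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        logger.warning("%s is not valid UTF-8; decoding as Latin-1", file_path)
        return raw.decode("latin-1")
```

A quick way to exercise a function like this is to write small fixture files to a temporary directory and compare the returned text, including one missing path to confirm the failure case is logged rather than raised.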

Resolves

This PR resolves issue #3: “Extend Support for Multiple File Types in OCR/Extraction Service”.

/claim #3

Claim

Total prize pool $50
Total paid $0
Status Pending
Submitted May 14, 2025
Last updated May 14, 2025

Contributors

Kunal Darekar

@Kunal-Darekar

100%

Sponsors

Unsiloed AI

@unsiloed-ai

$50