Feature: Extend Support for Multiple File Types in OCR/Extraction Service

Overview

This pull request implements the enhancements outlined in issue #3, “Extend Support for Multiple File Types in OCR/Extraction Service”. It expands file type support beyond the existing PDF, DOCX, and PPTX formats.

Changes

Added Support for New File Types:

Microsoft Office Formats:
- Added support for .DOC (older Word format) using docx2txt
- Added support for .XLSX (Excel) using openpyxl
- Added support for .XLS (older Excel format) using xlrd
OpenDocument Formats:
- Added support for .ODT (OpenDocument Text) using odfpy
- Added support for .ODS (OpenDocument Spreadsheet) using odfpy
- Added support for .ODP (OpenDocument Presentation) using odfpy
Text Formats:
- Added support for plain text .TXT files
- Added support for rich text .RTF files using striprtf
E-book Format:
- Added support for .EPUB format using ebooklib and BeautifulSoup4

Implementation Details:

New Extraction Functions:
- Added extract_text_from_doc() for DOC files
- Added extract_text_from_xlsx() and extract_text_from_xls() for Excel files
- Added extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
- Added extract_text_from_txt() and extract_text_from_rtf() for text formats
- Added extract_text_from_epub() for EPUB files
File Type Detection:
- Updated file type detection in __init__.py for both URL and local file paths
- Updated file type detection in chunking_routes.py for API uploads
Processing Logic:
- Updated process_document_chunking() in services/chunking.py to handle all new file types
Dependencies:
- Added necessary dependencies to requirements.txt:
  - docx2txt
  - openpyxl
  - xlrd
  - odfpy
  - ebooklib
  - striprtf
  - beautifulsoup4
- Updated setup.py with the same dependencies
Documentation:
- Updated README.md to reflect the new supported file types
- Added information about new dependencies
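
The file type detection and dispatch described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `SUPPORTED_EXTENSIONS`, `detect_file_type()`, and `extract_text()` are hypothetical names, and detecting the type from the file extension alone is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of supported extensions, taken from the formats listed in this PR.
SUPPORTED_EXTENSIONS = {
    "pdf", "docx", "pptx",   # previously supported
    "doc", "xlsx", "xls",    # Microsoft Office
    "odt", "ods", "odp",     # OpenDocument
    "txt", "rtf",            # text
    "epub",                  # e-book
}


def detect_file_type(source: str) -> str:
    """Return the lowercase extension for a URL or local path, or raise ValueError."""
    parsed = urlparse(source)
    # For URLs, look only at the path so query strings and fragments are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    # Normalize Windows separators so the suffix is found on any platform.
    ext = PurePosixPath(path.replace("\\", "/")).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or 'unknown'!r}")
    return ext


def extract_text(source: str, extractors: dict) -> str:
    """Dispatch to the extractor registered for the detected file type.

    `extractors` maps an extension (e.g. "xlsx") to a callable such as
    extract_text_from_xlsx; the mapping itself is an assumption of this sketch.
    """
    return extractors[detect_file_type(source)](source)
```

A table-driven dispatch like this keeps the URL, local path, and upload code paths on a single list of supported types, so a new format only needs one new entry and one new extractor.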

Testing

Each new extraction function includes error handling and logging, and follows the same patterns as the existing extraction functions for PDF, DOCX, and PPTX.
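
As an illustration of that error-handling and logging pattern, here is a minimal sketch of what a plain text extractor might look like. The function name matches the one listed in this PR, but the body is an assumption: the empty-string return on failure and the latin-1 fallback are illustrative choices, not confirmed behavior of the actual implementation.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain text file, returning an empty string if it cannot be read.

    Assumption: logging the error and returning "" mirrors the existing
    extractors; the real functions may raise or return None instead.
    """
    try:
        raw = Path(file_path).read_bytes()
    except OSError:
        # logger.exception records the traceback along with the message.
        logger.exception("Failed to read TXT file %s", file_path)
        return ""
    try:
        # Try UTF-8 first; utf-8-sig also strips a leading BOM if present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        logger.warning("%s is not valid UTF-8; falling back to latin-1", file_path)
        # latin-1 maps every byte to a code point, so this decode cannot fail.
        return raw.decode("latin-1")
```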

Resolves

This PR resolves issue #3: “Extend Support for Multiple File Types in OCR/Extraction Service”

/claim #3
