This pull request implements the enhancements outlined in issue #3: “Extend Support for Multiple File Types in OCR/Extraction Service”. It significantly expands the file type support beyond the current PDF, DOCX, and PPTX formats.
Microsoft Office Formats:
.DOC (older Word format) using docx2txt
.XLSX (Excel) using openpyxl
.XLS (older Excel format) using xlrd

OpenDocument Formats:
.ODT (OpenDocument Text) using odfpy
.ODS (OpenDocument Spreadsheet) using odfpy
.ODP (OpenDocument Presentation) using odfpy

Text Formats:
.TXT files
.RTF files using striprtf

E-book Format:
.EPUB format using ebooklib and BeautifulSoup4

New Extraction Functions:
extract_text_from_doc() for DOC files
extract_text_from_xlsx() and extract_text_from_xls() for Excel files
extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
extract_text_from_txt() and extract_text_from_rtf() for text formats
extract_text_from_epub() for EPUB files

File Type Detection:
__init__.py for both URL and local file paths
chunking_routes.py for API uploads

Processing Logic:
process_document_chunking() in services/chunking.py to handle all new file types

Dependencies:
requirements.txt:
setup.py with the same dependencies

Documentation:
All new file type extraction functions have been implemented with proper error handling and logging. The implementation follows the same patterns as the existing extraction functions for PDF, DOCX, and PPTX.
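For reviewers, here is a minimal sketch of how extension-based detection, dispatch, and one extractor with error handling and logging could fit together. It is not the code in this PR. The function name extract_text_from_txt() comes from the list above, but its body here is an assumption, and detect_file_type(), extract_text(), SUPPORTED_EXTENSIONS, and EXTRACTORS are hypothetical names used only for illustration.

```python
import logging
import os
from urllib.parse import urlparse

logger = logging.getLogger(__name__)

# Hypothetical mapping from extension to a file type key.
# The extensions are the ones listed in this PR plus the existing three.
SUPPORTED_EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx",
    ".doc": "doc", ".xlsx": "xlsx", ".xls": "xls",
    ".odt": "odt", ".ods": "ods", ".odp": "odp",
    ".txt": "txt", ".rtf": "rtf", ".epub": "epub",
}


def detect_file_type(source):
    """Return a file type key for a URL or local path, or None if unsupported.

    Hypothetical helper: for http(s) URLs only the path component is
    inspected, so query strings do not hide the extension.
    """
    parsed = urlparse(source)
    path = parsed.path if parsed.scheme in ("http", "https") else source
    ext = os.path.splitext(path)[1].lower()
    return SUPPORTED_EXTENSIONS.get(ext)


def extract_text_from_txt(file_path):
    """Read a plain-text file; the body is an assumption, not the PR's code."""
    try:
        # errors="replace" keeps extraction going on stray non-UTF-8 bytes.
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    except OSError as exc:
        logger.error("Error extracting text from TXT file %s: %s", file_path, exc)
        raise


# Hypothetical dispatch table; the other extractors would be registered
# here in the same way.
EXTRACTORS = {"txt": extract_text_from_txt}


def extract_text(file_path):
    """Pick an extractor from the detected type and run it."""
    file_type = detect_file_type(file_path)
    extractor = EXTRACTORS.get(file_type)
    if extractor is None:
        raise ValueError(f"Unsupported file type: {file_path}")
    return extractor(file_path)
```

A dispatch table like this keeps process_document_chunking() free of a long if/elif chain, although the actual implementation may be organized differently.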
This PR resolves issue #3: “Extend Support for Multiple File Types in OCR/Extraction Service”.
/claim #3
Kunal Darekar
@Kunal-Darekar
Unsiloed AI
@unsiloed-ai