Closes #3
This PR extends the document processing capabilities of Unsiloed by adding support for additional file formats. The service can now handle a wider range of document types while maintaining the existing chunking strategies and processing pipeline.
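As a rough illustration of how new formats can feed into the same pipeline, the sketch below routes a file to an extractor based on its extension. The `EXTRACTORS` registry and the `extract_text()` entry point are hypothetical names used only for this example and are not taken from the PR; only `extract_text_from_txt()` is named in this description, and its body here is an assumed minimal version.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain-text file, replacing undecodable bytes instead of failing."""
    return Path(file_path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping extension -> extractor. Only .txt is filled in;
# the other formats would register their own extract_text_from_*() helper.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": extract_text_from_txt,
}


def extract_text(file_path: str) -> str:
    """Pick an extractor from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        # Fail loudly for formats that have no registered extractor.
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(file_path)
```

With a registry like this, supporting another format means adding one entry, and the downstream chunking code keeps receiving a plain string whatever the input type was.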
Newly supported formats:
- .DOC (Word documents)
- .XLSX and .XLS (Excel spreadsheets)
- .ODT (OpenDocument Text)
- .ODS (OpenDocument Spreadsheet)
- .ODP (OpenDocument Presentation)
- .TXT (Plain text)
- .RTF (Rich text format)
- .EPUB (Electronic publication)

New extraction functions:
- extract_text_from_doc() for .DOC files
- extract_text_from_xlsx() and extract_text_from_xls() for Excel files
- extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
- extract_text_from_txt() for plain text files
- extract_text_from_rtf() for rich text files
- extract_text_from_epub() for ebook files

Libraries used:
- textract for .DOC and .RTF files
- pandas for .XLSX and .XLS files
- odfpy for OpenDocument formats
- ebooklib for .EPUB files

Please test the following scenarios:
New dependencies have been added to setup.py:
"textract", # For .DOC and .RTF
"pandas", # For .XLSX and .XLS
"odfpy", # For .ODT, .ODS, .ODP
"ebooklib", # For .EPUB
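Because these packages are only needed for specific formats, reviewers may find it useful to check which ones are importable in their environment before testing. The helper below is a minimal sketch; `OPTIONAL_MODULES` and `missing_dependencies()` are hypothetical names, not part of this PR. Note that the odfpy package is imported as `odf`.

```python
from importlib.util import find_spec
from typing import Dict, List

# PyPI package name -> top-level import name (odfpy installs the `odf` module).
OPTIONAL_MODULES: Dict[str, str] = {
    "textract": "textract",
    "pandas": "pandas",
    "odfpy": "odf",
    "ebooklib": "ebooklib",
}


def missing_dependencies(modules: Dict[str, str] = OPTIONAL_MODULES) -> List[str]:
    """Return the package names whose import module cannot be found."""
    return sorted(pkg for pkg, mod in modules.items() if find_spec(mod) is None)


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```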
/claim #3
Harsh
@harsh-791
Unsiloed AI
@unsiloed-ai