Closes #3
This PR extends the document processing capabilities of Unsiloed by adding support for additional file formats. The service can now handle a wider range of document types while maintaining the existing chunking strategies and processing pipeline.
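As a rough illustration of how extra formats can feed the same pipeline, the sketch below maps file extensions to extractor functions and returns plain text that downstream chunking could consume unchanged. It is not the code from this PR: the `EXTRACTORS` registry and the `extract_text()` entry point are hypothetical names, the function bodies are assumptions, and only two formats are shown. The `.xlsx` path assumes `pandas` plus an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain text file, replacing bytes that are not valid UTF-8."""
    return Path(file_path).read_text(encoding="utf-8", errors="replace")


def extract_text_from_xlsx(file_path: str) -> str:
    """Flatten every sheet of a workbook into tab-separated text."""
    # Imported lazily so that plain text handling works without pandas.
    import pandas as pd

    # sheet_name=None returns a dict mapping sheet names to DataFrames.
    sheets = pd.read_excel(file_path, sheet_name=None)
    parts = []
    for name, frame in sheets.items():
        parts.append(f"Sheet: {name}")
        parts.append(frame.to_csv(sep="\t", index=False))
    return "\n".join(parts)


# Hypothetical registry; the PR's real dispatch logic may be organised differently.
EXTRACTORS = {
    ".txt": extract_text_from_txt,
    ".xlsx": extract_text_from_xlsx,
}


def extract_text(file_path: str) -> str:
    """Pick an extractor by (case-insensitive) file extension and run it."""
    suffix = Path(file_path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(file_path)
```

Keeping the third-party imports inside each extractor means a missing optional package only breaks its own format rather than the whole module.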
Newly supported formats:
- .DOC (Word documents)
- .XLSX and .XLS (Excel spreadsheets)
- .ODT (OpenDocument Text)
- .ODS (OpenDocument Spreadsheet)
- .ODP (OpenDocument Presentation)
- .TXT (Plain text)
- .RTF (Rich text format)
- .EPUB (Electronic publication)

New extraction functions:
- extract_text_from_doc() for .DOC files
- extract_text_from_xlsx() and extract_text_from_xls() for Excel files
- extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
- extract_text_from_txt() for plain text files
- extract_text_from_rtf() for rich text files
- extract_text_from_epub() for ebook files

Libraries used:
- textract for .DOC and .RTF files
- pandas for .XLSX and .XLS files
- odfpy for OpenDocument formats
- ebooklib for .EPUB files

Please test the following scenarios:
New dependencies have been added to setup.py:
"textract", # For .DOC and .RTF
"pandas", # For .XLSX and .XLS
"odfpy", # For .ODT, .ODS, .ODP
"ebooklib", # For .EPUB
/claim #3
Harsh
@harsh-791
Unsiloed AI
@unsiloed-ai