Unsiloed-AI/Unsiloed-chunker#10

Extend Document Type Support

Closes #3

Description

This PR extends the document processing capabilities of Unsiloed by adding support for additional file formats. The service can now handle a wider range of document types while maintaining the existing chunking strategies and processing pipeline.

New Supported File Types

  • Microsoft Office formats:
    • .DOC (Word documents)
    • .XLSX and .XLS (Excel spreadsheets)
  • OpenDocument formats:
    • .ODT (OpenDocument Text)
    • .ODS (OpenDocument Spreadsheet)
    • .ODP (OpenDocument Presentation)
  • Text formats:
    • .TXT (Plain text)
    • .RTF (Rich text format)
  • Ebook format:
    • .EPUB (Electronic publication)
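
The list above can be expressed as a simple extension check. This is a minimal sketch, not the PR's actual validation code: the names NEW_EXTENSIONS and is_newly_supported are hypothetical, and the set covers only the formats added here, not those the service already handled.

```python
from pathlib import Path

# Extensions added by this PR (hypothetical constant name).
# Formats supported before this PR are intentionally not listed.
NEW_EXTENSIONS = frozenset({
    ".doc", ".xlsx", ".xls",   # Microsoft Office
    ".odt", ".ods", ".odp",    # OpenDocument
    ".txt", ".rtf",            # Text
    ".epub",                   # Ebook
})


def is_newly_supported(filename: str) -> bool:
    """Return True if the filename ends in an extension added by this PR."""
    # Compare case-insensitively so "Report.DOC" and "report.doc" match.
    return Path(filename).suffix.lower() in NEW_EXTENSIONS
```

Lowercasing the suffix before the lookup keeps the check independent of how the uploaded file was named.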

Changes

  • Added new text extraction functions for each file type:
    • extract_text_from_doc() for .DOC files
    • extract_text_from_xlsx() and extract_text_from_xls() for Excel files
    • extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
    • extract_text_from_txt() for plain text files
    • extract_text_from_rtf() for rich text files
    • extract_text_from_epub() for ebook files
  • Updated file type validation in routes
  • Added new dependencies:
    • textract for .DOC and .RTF files
    • pandas for .XLSX and .XLS files
    • odfpy for OpenDocument formats
    • ebooklib for .EPUB files
  • Updated API documentation and README
  • Enhanced error messages for unsupported file types
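
One way the per-format extractors and the improved error messages could fit together is a registry keyed by extension. The sketch below is an assumption about the shape of such a dispatcher, not the code in this PR: EXTRACTORS and extract_text are hypothetical names, only extract_text_from_txt() is filled in, and its UTF-8 decoding policy is a guess.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_text_from_txt(path: str) -> str:
    """Read a plain text file (UTF-8 with replacement is an assumed policy)."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry. The other extractors named in this PR
# (extract_text_from_doc, extract_text_from_xlsx, ...) would be added the same way.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": extract_text_from_txt,
}


def extract_text(path: str) -> str:
    """Dispatch to the extractor registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        # Name the rejected extension and list what is accepted.
        supported = ", ".join(sorted(EXTRACTORS))
        raise ValueError(
            f"Unsupported file type '{suffix}'. Supported types: {supported}"
        )
    return extractor(path)
```

With this layout, supporting another format means registering one more function, and the error message stays in sync with the registry automatically.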

Testing

Please test the following scenarios:

  1. Upload and process each new file type
  2. Verify that chunking strategies work correctly for each format
  3. Check error handling for unsupported file types
  4. Verify that the API documentation reflects the new supported formats

Dependencies

New dependencies have been added to setup.py:

"textract",  # For .DOC and .RTF
"pandas",    # For .XLSX and .XLS
"odfpy",     # For .ODT, .ODS, .ODP
"ebooklib",  # For .EPUB
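
Reviewers who want to confirm that the new packages are installed before testing could use a check along these lines. It is a sketch under assumptions: OPTIONAL_MODULES and missing_packages are hypothetical names, and the only non-obvious mapping is that the odfpy package is imported as odf.

```python
from importlib.util import find_spec
from typing import Dict, List

# PyPI package name -> top-level import name.
# odfpy installs a module named "odf"; the others match their package names.
OPTIONAL_MODULES: Dict[str, str] = {
    "textract": "textract",
    "pandas": "pandas",
    "odfpy": "odf",
    "ebooklib": "ebooklib",
}


def missing_packages(modules: Dict[str, str] = OPTIONAL_MODULES) -> List[str]:
    """Return the package names whose import module cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return sorted(pkg for pkg, mod in modules.items() if find_spec(mod) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```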

Documentation

  • Updated README.md with new supported file types
  • Updated API documentation in root routes
  • Added detailed docstrings for new functions

/claim #3

Claim

Total prize pool: $50
Total paid: $0
Status: Pending
Submitted: May 12, 2025
Last updated: May 12, 2025

Contributors

Harsh (@harsh-791): 100%

Sponsors

Unsiloed AI (@unsiloed-ai): $50