File Loader Configuration

DeepSearcher supports various file loaders to extract and process content from different file formats.

📝 Basic Configuration

config.set_provider_config("file_loader", "(FileLoaderName)", "(Arguments dict)")

📋 Available File Loaders

| Loader | Description | Supported Formats |
|--------|-------------|-------------------|
| UnstructuredLoader | General purpose document loader with broad format support | PDF, DOCX, PPT, HTML, etc. |
| DoclingLoader | Document processing library with extraction capabilities | See documentation |
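
As a minimal sketch of how the loader names listed above might be used, the helper below checks a name before passing it to `set_provider_config`. The helper `set_file_loader` and the `KNOWN_FILE_LOADERS` set are hypothetical and not part of DeepSearcher; the only call taken from this page is `config.set_provider_config("file_loader", name, args)`.

```python
# Hypothetical helper, not part of DeepSearcher: guards against typos in loader names.
KNOWN_FILE_LOADERS = {"UnstructuredLoader", "DoclingLoader"}  # names listed on this page


def set_file_loader(config, name, args=None):
    """Select a file loader on any object that exposes set_provider_config()."""
    if name not in KNOWN_FILE_LOADERS:
        raise ValueError(
            f"Unknown file loader {name!r}; expected one of {sorted(KNOWN_FILE_LOADERS)}"
        )
    # The examples on this page pass an empty dict when no arguments are needed.
    config.set_provider_config("file_loader", name, args or {})
```

Here `config` stands for whatever configuration object you already call `set_provider_config` on, for example `set_file_loader(config, "DoclingLoader")`.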

🔍 File Loader Options

Unstructured

Unstructured is a powerful library for extracting content from various document formats.

config.set_provider_config("file_loader", "UnstructuredLoader", {})

??? tip "Setup Instructions"

You can use Unstructured in two ways:

1. **With API** (recommended for production)
   - Set environment variables:
     - `UNSTRUCTURED_API_KEY`
     - `UNSTRUCTURED_API_URL`

2. **Local Processing**
   - Leave the API environment variables unset
   - Install required dependencies:
     ```bash
     # Install core dependencies
     pip install unstructured-ingest
     
     # For all document formats
     pip install "unstructured[all-docs]"
     
     # For specific formats (e.g., PDF only)
     pip install "unstructured[pdf]"
     ```

For more information:
- [Unstructured Documentation](https://docs.unstructured.io/ingestion/overview)
- [Installation Guide](https://docs.unstructured.io/open-source/installation/full-installation)
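
To see which of the two modes a given environment would likely select, a small check along these lines may help. It is only a sketch: the function `unstructured_mode` is hypothetical, and it assumes API mode requires both variables to be set and non-empty, which may differ from how the loader actually decides.

```python
import os


def unstructured_mode(env=None):
    """Return "api" if both API variables are set and non-empty, else "local"."""
    # Assumption: both variables are needed for API mode; the real loader may differ.
    env = os.environ if env is None else env
    has_key = bool(env.get("UNSTRUCTURED_API_KEY"))
    has_url = bool(env.get("UNSTRUCTURED_API_URL"))
    return "api" if has_key and has_url else "local"


if __name__ == "__main__":
    print(f"Unstructured would run in {unstructured_mode()} mode")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.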

Docling

Docling provides document processing capabilities with support for multiple formats.

config.set_provider_config("file_loader", "DoclingLoader", {})

??? tip "Setup Instructions"

1. Install Docling:
   ```bash
   pip install docling
   ```

2. For information on supported formats, see the [Docling documentation](https://docling-project.github.io/docling/usage/supported_formats/#supported-output-formats).
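
Before selecting a loader, it can be useful to confirm that its package is importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper `is_installed` is hypothetical, and it assumes the import names `docling` and `unstructured` match the pip package names shown in the setup instructions.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual module states; treat those as missing.
        return False


if __name__ == "__main__":
    # Assumed import names for the two loaders' underlying libraries.
    for module_name in ("docling", "unstructured"):
        status = "installed" if is_installed(module_name) else "missing"
        print(f"{module_name}: {status}")
```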