# Unstructured Integration Example

This example demonstrates how to use the Unstructured library with DeepSearcher for document parsing.

## Overview

Unstructured is a document processing library that extracts content from many document formats, such as PDF, Word, and HTML. This example shows:

1. Setting up Unstructured with DeepSearcher
2. Configuring the Unstructured API keys (optional)
3. Loading documents with Unstructured's parser
4. Querying the extracted content

## Code Example

```python
import logging
import os
from deepsearcher.offline_loading import load_from_local_files
from deepsearcher.online_query import query
from deepsearcher.configuration import Configuration, init_config

# Suppress unnecessary logging from third-party libraries
logging.getLogger("httpx").setLevel(logging.WARNING)

# (Optional) Set API keys to use the Unstructured cloud service.
# Remove these two lines to process documents locally.
# In real applications, set these securely instead of hard-coding them.
os.environ['UNSTRUCTURED_API_KEY'] = '***************'
os.environ['UNSTRUCTURED_API_URL'] = '***************'


def main():
    # Step 1: Initialize configuration
    config = Configuration()

    # Configure Vector Database (Milvus) and File Loader (UnstructuredLoader)
    config.set_provider_config("vector_db", "Milvus", {})
    config.set_provider_config("file_loader", "UnstructuredLoader", {})

    # Apply the configuration
    init_config(config)

    # Step 2: Load data from a local file or directory into Milvus
    input_file = "your_local_file_or_directory"  # Replace with your actual file path
    collection_name = "Unstructured"
    collection_description = "All Milvus Documents"

    load_from_local_files(
        paths_or_directory=input_file,
        collection_name=collection_name,
        collection_description=collection_description,
    )

    # Step 3: Query the loaded data
    question = "What is Milvus?"  # Replace with your actual question
    result = query(question)


if __name__ == "__main__":
    main()
```

## Running the Example

1. Install DeepSearcher with Unstructured support: `pip install deepsearcher "unstructured[all-docs]"`
2. (Optional) Sign up for the Unstructured API at [unstructured.io](https://unstructured.io) if you want to use their cloud service
3. Replace `your_local_file_or_directory` with your own document file path or directory
4. Run the script: `python load_local_file_using_unstructured.py`
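
Before running the script, it can help to confirm that the path from step 3 actually exists and to see which files sit under it. The following is a minimal sketch using only the Python standard library. `collect_input_files` is a hypothetical helper, not part of DeepSearcher or Unstructured, and it makes no claim about which of those files the loader will accept.

```python
from pathlib import Path


def collect_input_files(path_or_directory: str) -> list[str]:
    """Hypothetical pre-flight check: list the files found at a path."""
    path = Path(path_or_directory)
    if path.is_file():
        # A single file is returned as a one-element list.
        return [str(path)]
    if path.is_dir():
        # Walk the directory recursively and keep regular files only.
        return sorted(str(p) for p in path.rglob("*") if p.is_file())
    raise FileNotFoundError(f"No such file or directory: {path_or_directory}")
```

Calling `collect_input_files("your_local_file_or_directory")` before `load_from_local_files` raises a `FileNotFoundError` early if the placeholder path was never replaced.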

## Unstructured Options

You can use Unstructured in two modes:

1. **API Mode**: Set the environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` to use their cloud service
2. **Local Mode**: Don't set the environment variables, and Unstructured will process documents locally on your machine
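
The rule above can be written as a small check to run before loading documents. This is a sketch of the rule as described here, not the internal logic of DeepSearcher or Unstructured. `unstructured_mode` is a hypothetical helper, and treating a half-configured environment as an error is an assumption made for illustration.

```python
import os


def unstructured_mode(env=None) -> str:
    """Return "api" if both Unstructured variables are set, "local" if neither is."""
    env = os.environ if env is None else env
    key = env.get("UNSTRUCTURED_API_KEY")
    url = env.get("UNSTRUCTURED_API_URL")
    if key and url:
        return "api"
    if not key and not url:
        return "local"
    # Assumption: only one variable being set is most likely a configuration mistake.
    raise ValueError(
        "Set both UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, or neither."
    )
```

Printing `unstructured_mode()` at the start of a script makes it explicit whether documents are expected to go to the cloud service or be processed on the local machine.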

## Key Concepts

- **Document Processing**: Unstructured parses many document formats into content that DeepSearcher can load
- **API/Local Options**: Choose between Unstructured's cloud service and local processing, depending on your deployment needs
- **Integration**: Parsed content is loaded into DeepSearcher's vector database (Milvus in this example) and can then be queried