# Web Crawler Configuration
DeepSearcher supports various web crawlers to collect data from websites for processing and indexing.
## 📝 Basic Configuration
```python
config.set_provider_config("web_crawler", "(WebCrawlerName)", "(Arguments dict)")
```
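For example, a typical flow configures the crawler, initializes DeepSearcher, and then loads a site. The sketch below assumes the `Configuration`/`init_config` helpers and the `load_from_website` loader used in DeepSearcher's quick-start examples; names may differ slightly between versions.

```python
from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website  # assumed loader entry point

config = Configuration()
# Any crawler name from the table below works here; FireCrawlCrawler is just an example.
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
init_config(config=config)

# Crawl the page(s) and index the extracted content into the configured vector store.
load_from_website(urls="https://example.com", collection_name="example_docs")
```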
## 📋 Available Web Crawlers
| Crawler | Description | Key Feature |
| --- | --- | --- |
| FireCrawlCrawler | Cloud-based web crawling service | Simple API, managed service |
| Crawl4AICrawler | Browser automation crawler | Full JavaScript support |
| JinaCrawler | Content extraction service | High-accuracy parsing |
| DoclingCrawler | Document processing with crawling | Multiple format support |
## 🔍 Web Crawler Options

### FireCrawl
FireCrawl is a cloud-based web crawling service designed for AI applications.
Key features:

- Simple API
- Managed service
- Advanced parsing
```python
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
```
??? tip "Setup Instructions"

    1. Sign up for FireCrawl and get an API key
    2. Set the API key as an environment variable:

        ```bash
        export FIRECRAWL_API_KEY="your_api_key"
        ```

    3. For more information, see the [FireCrawl documentation](https://docs.firecrawl.dev/introduction)
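If you prefer to set the key from Python rather than the shell, something like the following works, since the crawler reads `FIRECRAWL_API_KEY` from the environment (a minimal sketch; `your_api_key` is a placeholder):

```python
import os

from deepsearcher.configuration import Configuration

# Equivalent to `export FIRECRAWL_API_KEY=...`; set it before init_config() runs.
os.environ["FIRECRAWL_API_KEY"] = "your_api_key"

config = Configuration()
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
```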
### Crawl4AI
Crawl4AI is a Python package for web crawling with browser automation capabilities.
```python
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
```
??? tip "Setup Instructions"

    1. Install Crawl4AI:

        ```bash
        pip install crawl4ai
        ```

    2. Run the setup command:

        ```bash
        crawl4ai-setup
        ```

    3. For more information, see the [Crawl4AI documentation](https://docs.crawl4ai.com/)
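The `browser_config` dict is presumably forwarded to Crawl4AI's browser settings, so additional Crawl4AI options can sit alongside `headless` and `verbose`. A hedged sketch; check the Crawl4AI documentation for the options your installed version actually supports:

```python
from deepsearcher.configuration import Configuration

config = Configuration()
config.set_provider_config(
    "web_crawler",
    "Crawl4AICrawler",
    {
        "browser_config": {
            "headless": True,            # run without a visible browser window
            "verbose": True,             # log browser activity
            "browser_type": "chromium",  # Crawl4AI browser choice; confirm against its docs
        }
    },
)
```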
### Jina Reader
Jina Reader is a service for extracting content from web pages with high accuracy.
```python
config.set_provider_config("web_crawler", "JinaCrawler", {})
```
??? tip "Setup Instructions"

    1. Get a Jina API key
    2. Set the API key as an environment variable:

        ```bash
        export JINA_API_TOKEN="your_api_key"
        # or
        export JINAAI_API_KEY="your_api_key"
        ```

    3. For more information, see the [Jina Reader documentation](https://jina.ai/reader/)
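As with FireCrawl, the token can also be set from Python before initialization. A minimal sketch; either variable name should be picked up, per the note above:

```python
import os

from deepsearcher.configuration import Configuration

# JinaCrawler checks either variable, per the setup instructions above.
os.environ["JINA_API_TOKEN"] = "your_api_key"  # or: os.environ["JINAAI_API_KEY"]

config = Configuration()
config.set_provider_config("web_crawler", "JinaCrawler", {})
```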
### Docling Crawler
Docling provides web crawling capabilities alongside its document processing features.
```python
config.set_provider_config("web_crawler", "DoclingCrawler", {})
```
??? tip "Setup Instructions"

    1. Install Docling:

        ```bash
        pip install docling
        ```

    2. For information on supported formats, see the [Docling documentation](https://docling-project.github.io/docling/usage/supported_formats/#supported-output-formats)
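Because Docling handles both fetching and document parsing, the same crawler configuration can index pages in any of Docling's supported formats. A hedged end-to-end sketch, reusing the assumed `load_from_website` loader from the basic example above:

```python
from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website  # assumed loader entry point

config = Configuration()
config.set_provider_config("web_crawler", "DoclingCrawler", {})
init_config(config=config)

# The URL can point at HTML, PDF, or another format Docling supports.
load_from_website(urls="https://example.com/whitepaper.pdf", collection_name="docling_docs")
```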