# Web Crawler Configuration

DeepSearcher supports various web crawlers to collect data from websites for processing and indexing.

## 📝 Basic Configuration

```python
config.set_provider_config("web_crawler", "(WebCrawlerName)", "(Arguments dict)")
```
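Putting it together, a minimal end-to-end setup might look like the sketch below. It assumes the `Configuration` / `init_config` pattern shown in the DeepSearcher README; `FireCrawlCrawler` stands in for any of the crawler names listed in the next section.

```python
from deepsearcher.configuration import Configuration, init_config

# Build a configuration and select a crawler by name; the third argument is
# the keyword-argument dict passed to that crawler.
config = Configuration()
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})

# Apply the configuration so later loading / querying uses this crawler.
init_config(config=config)
```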
## 📋 Available Web Crawlers

| Crawler | Description | Key Feature |
|---------|-------------|-------------|
| **FireCrawlCrawler** | Cloud-based web crawling service | Simple API, managed service |
| **Crawl4AICrawler** | Browser automation crawler | Full JavaScript support |
| **JinaCrawler** | Content extraction service | High-accuracy parsing |
| **DoclingCrawler** | Document processing with crawling | Multiple format support |
## 🔍 Web Crawler Options

### FireCrawl

[FireCrawl](https://docs.firecrawl.dev/introduction) is a cloud-based web crawling service designed for AI applications.

**Key features:**

- Simple API
- Managed service
- Advanced parsing

```python
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
```
??? tip "Setup Instructions"

    1. Sign up for FireCrawl and get an API key
    2. Set the API key as an environment variable:

       ```bash
       export FIRECRAWL_API_KEY="your_api_key"
       ```

    3. For more information, see the [FireCrawl documentation](https://docs.firecrawl.dev/introduction)
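If you prefer to set the key from Python (for example in a notebook) rather than in your shell, a minimal sketch — the key value is a placeholder:

```python
import os

# Placeholder key: prefer exporting FIRECRAWL_API_KEY in your shell or a
# .env file over hard-coding secrets in source.
os.environ["FIRECRAWL_API_KEY"] = "your_api_key"

# `config` is the Configuration object from the Basic Configuration section.
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
```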
### Crawl4AI

[Crawl4AI](https://docs.crawl4ai.com/) is a Python package for web crawling with browser automation capabilities.

```python
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
```
??? tip "Setup Instructions"

    1. Install Crawl4AI:

       ```bash
       pip install crawl4ai
       ```

    2. Run the setup command:

       ```bash
       crawl4ai-setup
       ```

    3. For more information, see the [Crawl4AI documentation](https://docs.crawl4ai.com/)
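To sanity-check the installation independently of DeepSearcher, you can run a small crawl through Crawl4AI's own `AsyncWebCrawler`. The sketch below assumes the `browser_config` dict above maps onto fields of Crawl4AI's `BrowserConfig` with the same names — the key names match, but that mapping is an inference, not something this page guarantees:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Same options as the "browser_config" dict passed to set_provider_config.
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # result.markdown holds the page converted to LLM-friendly markdown.
        print(str(result.markdown)[:500])

asyncio.run(main())
```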
### Jina Reader

[Jina Reader](https://jina.ai/reader/) is a service for extracting content from web pages with high accuracy.

```python
config.set_provider_config("web_crawler", "JinaCrawler", {})
```
??? tip "Setup Instructions"

    1. Get a Jina API key
    2. Set the API key as an environment variable:

       ```bash
       export JINA_API_TOKEN="your_api_key"
       # or
       export JINAAI_API_KEY="your_api_key"
       ```

    3. For more information, see the [Jina Reader documentation](https://jina.ai/reader/)
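Under the hood, Jina Reader is a plain HTTP service: prefixing a URL with `https://r.jina.ai/` returns that page as LLM-friendly text, with the API key passed in the `Authorization` header. A quick way to verify your key (assuming the `requests` package is installed):

```python
import os
import requests

# Jina Reader fetches and converts the target page; an API key raises the
# rate limits available to anonymous callers.
target = "https://r.jina.ai/https://example.com"
headers = {"Authorization": f"Bearer {os.environ['JINA_API_TOKEN']}"}

resp = requests.get(target, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.text[:500])
```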
### Docling Crawler

[Docling](https://docling-project.github.io/docling/) provides web crawling capabilities alongside its document processing features.

```python
config.set_provider_config("web_crawler", "DoclingCrawler", {})
```
??? tip "Setup Instructions"

    1. Install Docling:

       ```bash
       pip install docling
       ```

    2. For information on supported formats, see the [Docling documentation](https://docling-project.github.io/docling/usage/supported_formats/#supported-output-formats)
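To preview what Docling extracts from a page before routing it through DeepSearcher, you can call Docling's documented `DocumentConverter` directly; it accepts URLs as well as local file paths:

```python
from docling.document_converter import DocumentConverter

# DocumentConverter fetches the URL, parses it, and exposes the result as a
# structured document that can be exported to markdown.
converter = DocumentConverter()
result = converter.convert("https://example.com")
print(result.document.export_to_markdown()[:500])
```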