# Web Crawler Configuration

DeepSearcher supports various web crawlers to collect data from websites for processing and indexing.

## 📝 Basic Configuration

```python
config.set_provider_config("web_crawler", "(WebCrawlerName)", "(Arguments dict)")
```
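
For example, a minimal sketch of the surrounding setup, assuming DeepSearcher's usual `Configuration` / `init_config` flow (verify the names against your installed version):

```python
from deepsearcher.configuration import Configuration, init_config

config = Configuration()

# Pick one of the crawlers listed below; the third argument is the crawler's
# keyword-argument dict (an empty dict if no options are needed).
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})

# Apply the configuration so subsequent loading and indexing use this crawler.
init_config(config)
```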

## 📋 Available Web Crawlers

| Crawler | Description | Key Feature |
| --- | --- | --- |
| FireCrawlCrawler | Cloud-based web crawling service | Simple API, managed service |
| Crawl4AICrawler | Browser automation crawler | Full JavaScript support |
| JinaCrawler | Content extraction service | High accuracy parsing |
| DoclingCrawler | Doc processing with crawling | Multiple format support |

## 🔍 Web Crawler Options

### FireCrawl

FireCrawl is a cloud-based web crawling service designed for AI applications.

Key features:

- Simple API
- Managed Service
- Advanced Parsing

```python
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
```

??? tip "Setup Instructions"

    1. Sign up for FireCrawl and get an API key
    2. Set the API key as an environment variable:
       ```bash
       export FIRECRAWL_API_KEY="your_api_key"
       ```
    3. For more information, see the [FireCrawl documentation](https://docs.firecrawl.dev/introduction)
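
A hedged end-to-end sketch follows; the `load_from_website` helper and its `urls` parameter are assumptions about DeepSearcher's offline loading API, and the URL is a placeholder:

```python
import os

from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website  # assumed import path

# The crawler reads the key from the environment (see step 2 above).
os.environ["FIRECRAWL_API_KEY"] = "your_api_key"

config = Configuration()
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
init_config(config)

# Crawl the page and index its content using the configured crawler.
load_from_website(urls="https://example.com")
```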

### Crawl4AI

Crawl4AI is a Python package for web crawling with browser automation capabilities.

```python
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
```

??? tip "Setup Instructions"

    1. Install Crawl4AI:
       ```bash
       pip install crawl4ai
       ```
    2. Run the setup command:
       ```bash
       crawl4ai-setup
       ```
    3. For more information, see the [Crawl4AI documentation](https://docs.crawl4ai.com/)
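
As a sketch, the `browser_config` dict is passed through to Crawl4AI's browser settings, so JavaScript-heavy pages can be rendered headlessly; `load_from_website` is again an assumed helper and the URLs are placeholders:

```python
from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website  # assumed import path

config = Configuration()

# "headless" and "verbose" are the options shown above; any other keys
# depend on the Crawl4AI version you have installed.
config.set_provider_config(
    "web_crawler",
    "Crawl4AICrawler",
    {"browser_config": {"headless": True, "verbose": True}},
)
init_config(config)

# Several URLs can be crawled in one call (placeholder addresses).
load_from_website(urls=["https://example.com/docs", "https://example.com/blog"])
```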

### Jina Reader

Jina Reader is a service for extracting content from web pages with high accuracy.

```python
config.set_provider_config("web_crawler", "JinaCrawler", {})
```

??? tip "Setup Instructions"

    1. Get a Jina API key
    2. Set the API key as an environment variable:
       ```bash
       export JINA_API_TOKEN="your_api_key"
       # or
       export JINAAI_API_KEY="your_api_key"
       ```
    3. For more information, see the [Jina Reader documentation](https://jina.ai/reader/)
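
If you prefer to set the key from Python rather than the shell, a minimal sketch (either environment variable name from step 2 above should be read; verify against your version):

```python
import os

from deepsearcher.configuration import Configuration, init_config

# Set one of the accepted variable names before initializing the configuration.
os.environ["JINA_API_TOKEN"] = "your_api_key"

config = Configuration()
config.set_provider_config("web_crawler", "JinaCrawler", {})
init_config(config)
```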

### Docling Crawler

Docling provides web crawling capabilities alongside its document processing features.

```python
config.set_provider_config("web_crawler", "DoclingCrawler", {})
```

??? tip "Setup Instructions"

    1. Install Docling:
       ```bash
       pip install docling
       ```
    2. For information on supported formats, see the [Docling documentation](https://docling-project.github.io/docling/usage/supported_formats/#supported-output-formats)
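
A short sketch combining the Docling crawler with website loading; `load_from_website` is an assumed helper, the URL is a placeholder, and whether a directly linked document (e.g. a hosted PDF) is parsed depends on Docling's supported formats:

```python
from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website  # assumed import path

config = Configuration()
config.set_provider_config("web_crawler", "DoclingCrawler", {})
init_config(config)

# Crawl and index the page with Docling's document processing (placeholder URL).
load_from_website(urls="https://example.com/whitepaper")
```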