
FireCrawl Integration Example

This example demonstrates how to use FireCrawl with DeepSearcher to crawl and extract content from websites.

Overview

FireCrawl is a specialized web crawling service designed for AI applications. This example shows:

  1. Setting up FireCrawl with DeepSearcher
  2. Configuring API keys for the service
  3. Crawling a website and extracting content
  4. Querying the extracted content

Code Example

import logging
import os
from deepsearcher.offline_loading import load_from_website
from deepsearcher.online_query import query
from deepsearcher.configuration import Configuration, init_config

# Suppress unnecessary logging from third-party libraries
logging.getLogger("httpx").setLevel(logging.WARNING)

# Set API keys (ensure these are set securely in real applications)
os.environ['OPENAI_API_KEY'] = 'sk-***************'
os.environ['FIRECRAWL_API_KEY'] = 'fc-***************'


def main():
    # Step 1: Initialize configuration
    config = Configuration()

    # Set up Vector Database (Milvus) and Web Crawler (FireCrawlCrawler)
    config.set_provider_config("vector_db", "Milvus", {})
    config.set_provider_config("web_crawler", "FireCrawlCrawler", {})

    # Apply the configuration
    init_config(config)

    # Step 2: Load data from a website into Milvus
    website_url = "https://example.com"  # Replace with your target website
    collection_name = "FireCrawl"
    collection_description = "All Milvus Documents"

    # Crawl a single web page
    load_from_website(
        urls=website_url,
        collection_name=collection_name,
        collection_description=collection_description,
    )
    # FireCrawl only: DeepSearcher can crawl multiple pages when max_depth, limit,
    # and allow_backward_links are set
    # load_from_website(urls=website_url, max_depth=2, limit=20, allow_backward_links=True, collection_name=collection_name, collection_description=collection_description)

    # Step 3: Query the loaded data
    question = "What is Milvus?"  # Replace with your actual question
    result = query(question)
    # Print whatever query() returns so the script produces visible output
    print(result)


if __name__ == "__main__":
    main()
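
The example above hard-codes placeholder keys for brevity. As a minimal sketch of one alternative, the keys can be exported in the shell beforehand and checked at startup. The helpers `require_env` and `check_api_keys` below are hypothetical names introduced for illustration; they are not part of DeepSearcher or FireCrawl.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def check_api_keys(names=("OPENAI_API_KEY", "FIRECRAWL_API_KEY")) -> dict:
    """Collect the keys the example needs without hard-coding them (hypothetical helper)."""
    return {name: require_env(name) for name in names}
```

Calling `check_api_keys()` at the top of `main()` would stop the script early with a readable error instead of failing partway through a crawl.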

Running the Example

  1. Install DeepSearcher: pip install deepsearcher
  2. Sign up for a FireCrawl API key at firecrawl.dev
  3. Replace the placeholder API keys with your actual keys
  4. Change the website_url to the website you want to crawl
  5. Run the script: python load_website_using_firecrawl.py

Advanced Crawling Options

FireCrawl provides several advanced options for crawling, which the example passes as keyword arguments to load_from_website:

  • max_depth: Maximum number of links the crawler follows away from the starting URL
  • limit: Maximum number of pages to crawl
  • allow_backward_links: Allow the crawler to navigate to parent/sibling pages
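
To make the interaction between max_depth and limit concrete, here is a toy sketch that walks a small in-memory link graph breadth-first. It is only an illustration of the two budgets, not FireCrawl's actual algorithm: the `SITE` graph and the `toy_crawl` function are hypothetical, no network requests are made, and allow_backward_links is not modeled.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the pages it links to.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/install": [],
    "/docs/api": ["/docs/api/search"],
    "/blog/post-1": [],
    "/docs/api/search": [],
}


def toy_crawl(start, max_depth, limit):
    """Breadth-first walk that stops max_depth links from start or after limit pages."""
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < limit:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # depth budget reached: do not follow this page's links
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(toy_crawl("/", max_depth=1, limit=20))  # ['/', '/docs', '/blog']
print(toy_crawl("/", max_depth=2, limit=4))   # ['/', '/docs', '/blog', '/docs/install']
```

In the first call the depth budget ends the crawl; in the second the page limit does, even though deeper pages are still within max_depth.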

Key Concepts

  • Web Crawling: Extracting content from websites
  • Depth Control: Managing how deep the crawler navigates
  • URL Processing: Handling multiple pages from a single starting point
  • Vector Storage: Storing the crawled content in a vector database for search
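
As a rough illustration of the Vector Storage concept, the sketch below stores text as vectors and retrieves the closest match to a question. It is a deliberately simplified stand-in: `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, the "embedding" is a plain word-count vector rather than a learned model, and nothing here reflects how Milvus or DeepSearcher work internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (real systems use a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database collection."""

    def __init__(self):
        self.rows = []

    def insert(self, text):
        # Store the vector alongside the original text
        self.rows.append((embed(text), text))

    def search(self, question, top_k=1):
        # Rank stored rows by similarity to the question vector
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


store = ToyVectorStore()
store.insert("milvus is a vector database")
store.insert("firecrawl crawls websites for ai applications")
print(store.search("what is milvus"))  # ['milvus is a vector database']
```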