How to scrape the web for LLM in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai

Published on Aug 11, 2024. This response is partially generated with the help of AI. It may contain inaccuracies.

Introduction

In this tutorial, we will explore how to scrape the web effectively for large language models (LLMs) in 2024 using three powerful tools: Jina AI, Mendable's Firecrawl, and Scrapegraph-ai. Web scraping is essential for gathering data, and this guide will provide you with practical steps to get started, whether you're a developer or an AI enthusiast.

Step 1: Set Up Your Environment

Before diving into scraping, ensure you have the necessary tools and environment set up.

  • Install Python: Make sure you have Python installed on your machine. You can download it from the official website.
  • Install Jupyter Notebook: This is crucial for running the code examples provided. You can install it via pip:
    pip install notebook
    
  • Clone the Code Repository: Access the repository where the tutorial code is hosted. Use the following command in your terminal:
    git clone https://github.com/trancethehuman/ai-workshop-code.git
    
  • Navigate to the Notebook: Change directories to access the Jupyter notebook:
    cd ai-workshop-code/Web_scraping_for_LLM_in_2024
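
Before opening the notebook, it can help to confirm that the setup above actually worked. The following is a minimal sketch, not part of the original workshop code: the helper name missing_packages is hypothetical, and it only checks whether a top-level package can be found by the current Python interpreter.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Show which interpreter is in use, then check the package installed in this step.
# "notebook" is the import name of the package installed with `pip install notebook`.
print("Python", sys.version.split()[0])
print("Missing packages:", missing_packages(["notebook"]) or "none")
```

If the list is not empty, the package was most likely installed into a different Python environment than the one running the check.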
    

Step 2: Use Jina AI Reader API

Jina AI's Reader API converts a web page into clean, LLM-friendly Markdown. It is a hosted HTTP service, so no Jina library is required: you prefix the target URL with https://r.jina.ai/ and fetch the result. Follow these steps to use it.

  • Install an HTTP Client: The Reader API is called over plain HTTP, so the requests library is enough:
    pip install requests
    
  • Import the Library: Import requests in your Jupyter notebook:
    import requests
    
  • Build the Reader URL: Prefix the page you want to read with the Reader endpoint:
    reader_url = 'https://r.jina.ai/' + 'https://example.com'  # Replace with your target URL
    
  • Send a Request to the Reader: Fetch the page and read the Markdown from the response body:
    response = requests.get(reader_url, timeout=30)
    print(response.text)  # The page content as Markdown
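
The Reader call can be wrapped in a small reusable helper. The sketch below uses only the standard library and assumes the public https://r.jina.ai/ prefix; the function names are hypothetical, and passing an API key as a bearer token is an assumption to verify against Jina's documentation before relying on it.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url):
    """Build the Reader URL by prefixing the page URL with the Reader endpoint."""
    return READER_PREFIX + target_url


def build_reader_request(target_url, api_key=None):
    """Prepare (but do not send) a GET request for the Reader endpoint."""
    headers = {"User-Agent": "reader-sketch/0.1"}
    if api_key:
        # Assumption: an optional API key is sent as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(reader_url(target_url), headers=headers)


def fetch_markdown(target_url, api_key=None, timeout=30):
    """Send the request and return the response body decoded as text."""
    request = build_reader_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_markdown('https://example.com') in a notebook cell should return the page as text. Keeping the request builder separate from the network call makes the URL and headers easy to inspect without sending anything.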
    

Step 3: Implement Mendable's Firecrawl

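Firecrawl, built by Mendable, scrapes or crawls a site and returns the content as clean Markdown for LLM use. It is available as a hosted API and as an open-source project. Here's how to use it:

  • Visit Firecrawl's Site: Sign up on Firecrawl's official page and create an API key.
  • Set Up Your Firecrawl Project: Install the Python SDK and follow the documentation to configure your crawler settings:
    pip install firecrawl-py
    
  • Run a Basic Scrape: Call the SDK from Python to fetch a page (the class and method names below may change between SDK versions, so check the documentation for the version you install):
    from firecrawl import FirecrawlApp
    result = FirecrawlApp(api_key='fc-YOUR_API_KEY').scrape_url('https://example.com')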
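
Firecrawl's hosted service is also reachable over plain HTTP, which avoids depending on an SDK. The sketch below is written under assumptions: the /v1/scrape path, the formats field, and the data.markdown response key follow Firecrawl's public API documentation as best understood here and may differ in newer API versions. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: hosted endpoint path; check Firecrawl's API reference for the current version.
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(target_url, api_key):
    """Prepare (but do not send) a POST request asking for the page as Markdown."""
    payload = {"url": target_url, "formats": ["markdown"]}
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_markdown(target_url, api_key, timeout=60):
    """Send the request and return the Markdown field, or None if it is absent."""
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the Markdown sits under data.markdown in the JSON response.
    return body.get("data", {}).get("markdown")
```

With a valid key, scrape_markdown('https://example.com', 'fc-YOUR_API_KEY') would send one scrape request. Read the key from an environment variable rather than hard-coding it in a notebook.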
    

Step 4: Explore Scrapegraph-ai

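Scrapegraph-ai is an open-source Python library that uses an LLM to extract structured data from a page, guided by a natural-language prompt. Let's dive into its setup.

  • Clone the Scrapegraph-ai Repository: Access the GitHub repo and clone it to get the example scripts:
    git clone https://github.com/VinciGit00/Scrapegraph-ai.git
    
  • Install Required Dependencies: Install the library from PyPI, then download the browsers Playwright uses to fetch pages:
    pip install scrapegraphai
    playwright install
    
  • Run Scraping Scripts: Scrapegraph-ai is driven from Python code rather than a dedicated command-line tool. Pick a script from the repository's examples directory and run it with Python (the path below is a placeholder, not a real file):
    cd Scrapegraph-ai
    python path/to/example.py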
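
When used as a library, Scrapegraph-ai is typically driven through a graph object that takes a prompt, a source URL, and a configuration dictionary. The sketch below follows the pattern in the project's README as best understood here; the config keys and the "openai/gpt-4o-mini" model string are assumptions that vary between library versions, and the helper names are hypothetical.

```python
def build_graph_config(api_key, model="openai/gpt-4o-mini"):
    """Return a config dict in the shape the README examples use (assumed keys)."""
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": False,   # Keep logging quiet
        "headless": True,   # Run the browser without a visible window
    }


def scrape_with_prompt(url, prompt, api_key):
    """Run a single-page extraction and return whatever the graph produces."""
    # Imported lazily so this module loads even when scrapegraphai is not installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt=prompt,
        source=url,
        config=build_graph_config(api_key),
    )
    return graph.run()
```

A call such as scrape_with_prompt('https://example.com', 'List the page headings', 'YOUR_LLM_API_KEY') needs a valid key for the chosen LLM provider and the Playwright browsers installed.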
    

Conclusion

In this tutorial, we covered the essential steps to scrape the web for LLMs using Jina AI, Mendable's Firecrawl, and Scrapegraph-ai. You learned how to set up your environment, utilize each tool, and initiate web scraping tasks.

As a next step, experiment with different URLs and queries to gather data that can train or enhance your language models. Happy scraping!