Scrape paginated data with ZeroWork (or how to use nested loops)

Published on Aug 06, 2024

Introduction

This tutorial shows how to scrape paginated data with ZeroWork, using LinkedIn search results as the example. It walks through setting up nested loops, one for the profiles on a page and one for moving between pages, so that you collect results beyond the first page.

Step 1: Set Up Your Initial Workflow

  • Begin by creating a workflow in ZeroWork.
  • Enter the URL of the LinkedIn search results page for your search keywords.
  • Set up a first loop that iterates over the profiles displayed on the first page. This loop becomes the inner loop in Step 3.

Step 2: Create the Outer Loop for Pagination

  • Add a new loop to your workflow for pagination. This outer loop will be responsible for clicking the "Next" button to move through the pages.
  • Set the loop to repeat a specific number of times, depending on how many pages you want to scrape. For example, to scrape five pages, set it to repeat five times.

Step 3: Add the Inner Loop for Profile Collection

  • Inside the outer loop, place the inner loop that iterates over the profiles on the current page (the profile loop from Step 1).
  • Have this loop save each profile with the "Save Web Element" building block, 10 profiles per page.
  • Name your data table (e.g., "profiles") to store the scraped information.

Step 4: Integrate the Click Action for Pagination

  • After the inner loop, add a "Click Web Element" building block to simulate clicking the "Next" button.
  • Use the label of the "Next" button in your selector so that the TaskBot clicks the correct element.
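
The loop structure from Steps 2 to 4 can be sketched in plain Python. This only illustrates the control flow and is not ZeroWork's own syntax, since ZeroWork is configured visually. The `FakeSearchResults` class, its `visible_profiles` and `click_next` methods, and the `scrape` function are hypothetical stand-ins for the browser page, the "Save Web Element" block, and the "Click Web Element" block.

```python
from dataclasses import dataclass


@dataclass
class FakeSearchResults:
    """Hypothetical stand-in for a paginated search results page."""
    pages: list[list[str]]
    current: int = 0

    def visible_profiles(self) -> list[str]:
        # Profiles shown on the page that is currently open.
        return self.pages[self.current]

    def click_next(self) -> bool:
        # Simulates clicking "Next"; returns False when there is no next page.
        if self.current + 1 >= len(self.pages):
            return False
        self.current += 1
        return True


def scrape(site: FakeSearchResults, page_count: int, per_page: int = 10) -> list[dict]:
    table: list[dict] = []  # plays the role of the "profiles" data table
    for page_number in range(1, page_count + 1):         # outer loop: pagination
        for name in site.visible_profiles()[:per_page]:  # inner loop: profiles
            table.append({"page": page_number, "profile": name})
        if not site.click_next():  # runs once per page, after the inner loop
            break
    return table


if __name__ == "__main__":
    demo = FakeSearchResults(
        pages=[[f"profile-{p}-{i}" for i in range(10)] for p in range(1, 8)]
    )
    rows = scrape(demo, page_count=5)
    print(len(rows))  # 50
```

The point to notice is the placement of the click: it sits inside the outer loop but after the inner loop, so it runs once per page rather than once per profile.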

Step 5: Adjust Delays for Loading Times

  • Increase the delay between page loads to allow for LinkedIn's loading times, so that the profiles are fully loaded before the TaskBot tries to scrape them.
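
Why a longer delay helps can be illustrated with a generic wait loop in Python. This is a sketch of the general idea, not a description of how ZeroWork implements its delay setting. The `wait_until` helper and the `is_ready` callback are hypothetical names introduced for this example.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 10.0, poll: float = 0.5) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True   # page content is available, safe to scrape
        if time.monotonic() >= deadline:
            return False  # gave up: the delay budget was too short
        time.sleep(poll)


if __name__ == "__main__":
    checks = {"count": 0}

    def page_loaded() -> bool:
        # Pretend the page needs three checks before its results appear.
        checks["count"] += 1
        return checks["count"] >= 3

    print(wait_until(page_loaded, timeout=2.0, poll=0.1))  # True
```

If the timeout is shorter than the time the page needs, the function returns False, which corresponds to scraping a page whose profiles have not finished loading.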

Step 6: Handle Endless Scrolling (if applicable)

  • If the page uses endless scroll instead of pagination, ZeroWork handles it automatically, so you do not need to set up any additional loops.

Step 7: Run the TaskBot and Verify Results

  • Run your TaskBot to start scraping.
  • Monitor the progress as it clicks through the pages and collects the profiles.
  • After it finishes, check the data table to confirm that it contains the intended number of profiles.
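
The final check comes down to simple arithmetic: pages scraped times profiles per page. Below is a minimal Python sketch of that check, assuming the data table has been exported as a list of profile URLs. The `check_table` function and the example URLs are hypothetical.

```python
def check_table(rows: list[str], page_count: int, per_page: int = 10) -> dict:
    """Compare the collected rows against the expected total."""
    expected = page_count * per_page
    unique = set(rows)
    return {
        "expected": expected,
        "collected": len(rows),
        "duplicates": len(rows) - len(unique),  # the same profile saved twice
        "complete": len(unique) == expected,
    }


if __name__ == "__main__":
    # Hypothetical export: 5 pages with 10 profile URLs each.
    exported = [f"https://example.com/profile/{n}" for n in range(50)]
    report = check_table(exported, page_count=5)
    print(report["complete"])  # True
```

A shortfall does not always mean something went wrong, because the last page of a search can hold fewer results than the others. Duplicates are a more reliable warning sign: they suggest the "Next" click did not take effect and the same page was scraped twice.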

Step 8: Troubleshooting Visibility Issues

  • If the "Next" button is not visible because of the screen size, add a keyboard action that presses the space bar (Keyboard Action: Space Bar) before the click. Pressing the space bar scrolls the page down and brings the button into view.

Conclusion

With these steps in place, you have a working setup that scrapes paginated data from LinkedIn using ZeroWork. To scrape more or fewer pages, change the repeat count of the outer loop. From here, you can try other data sources or add further data processing steps. Happy scraping!