Web Scraping 101: Extracting Web Data with Python

Published on Aug 25, 2024


Introduction

This tutorial will guide you through the basics of web scraping using Python. By the end of this guide, you'll learn how to extract data from websites and save it into Excel files, enabling you to leverage online information for various applications. Web scraping can automate data collection processes, making it a valuable skill for data analysis and business intelligence.

Step 1: Setting Up Your Environment

Before you begin scraping, you need to set up your Python environment.

  • Install Python: Ensure you have Python installed on your computer. You can download it from the official Python website.
  • Install Required Libraries: You will need a few libraries for web scraping. Open your command line or terminal and run:
    pip install requests beautifulsoup4 pandas openpyxl
    
    • requests: sends HTTP requests to fetch web pages.
    • beautifulsoup4: parses HTML and extracts data (imported as bs4).
    • pandas: manages and manipulates tabular data.
    • openpyxl: lets pandas write data to Excel files.
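
After installing, a quick sanity check confirms that all four libraries import correctly. This is a minimal sketch; the version numbers printed will differ on your machine:

```python
# Verify the four scraping libraries installed and report their versions
import requests
import bs4
import pandas
import openpyxl

for lib in (requests, bs4, pandas, openpyxl):
    print(f"{lib.__name__} {lib.__version__}")
```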

Step 2: Understanding the Target Website

Before you scrape, familiarize yourself with the structure of the website you want to extract data from.

  • Inspect the Page: Right-click on the webpage and select “Inspect” to open the Developer Tools. Use the "Elements" tab to view the HTML structure.
  • Identify Data: Look for the specific HTML tags that contain the information you wish to scrape. Common tags include <div>, <span>, and <table>.
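
To see how these tags map to data, you can practice on a small HTML fragment before touching a live site. The fragment and the `product` class name below are invented for illustration, mimicking what you might see in the Elements tab:

```python
from bs4 import BeautifulSoup

# A hypothetical fragment of the kind you might find while inspecting a page
html = """
<div class="product">
  <h2>Blue Widget</h2>
  <a href="/items/blue-widget">Details</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', class_='product')   # locate the container tag
print(item.find('h2').text)                 # Blue Widget
print(item.find('a')['href'])               # /items/blue-widget
```

The same `find`/`find_all` calls work identically on a full page fetched with requests.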

Step 3: Writing the Scraping Script

Now, you can write a Python script to scrape the data.

  1. Import Libraries:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
  2. Send a Request: Use the requests library to get the content of the webpage.

    url = 'https://example.com'  # Replace with your target URL
    response = requests.get(url, timeout=10)  # Time out rather than hang on a slow server
    response.raise_for_status()  # Raise an error on 4xx/5xx responses
    
  3. Parse the HTML: Use BeautifulSoup to parse the webpage content.

    soup = BeautifulSoup(response.text, 'html.parser')
    
  4. Extract Data: Find the data you want using the appropriate selectors.

    data = []
    for item in soup.find_all('div', class_='your-class'):  # Adjust to your needs
        title = item.find('h2').text  # Replace with actual tags
        link = item.find('a')['href']
        data.append({'Title': title, 'Link': link})
    
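Putting the pieces of this step together, one way to sketch the whole script is below. The parsing is split into its own function so it can be tried without a network call; the URL, the `your-class` selector, and the tag names are placeholders to adapt to your target site:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def parse(html: str) -> list[dict]:
    """Extract title/link pairs from an HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all('div', class_='your-class'):  # placeholder selector
        title_tag = item.find('h2')                         # placeholder tags
        link_tag = item.find('a')
        if title_tag and link_tag:                          # skip incomplete items
            rows.append({'Title': title_tag.text.strip(),
                         'Link': link_tag['href']})
    return rows

def scrape(url: str) -> pd.DataFrame:
    """Fetch a page and return its extracted rows as a DataFrame."""
    response = requests.get(url, timeout=10)  # don't hang on a slow server
    response.raise_for_status()               # surface 4xx/5xx errors early
    return pd.DataFrame(parse(response.text))
```

Separating fetching from parsing also makes it easier to retry failed requests or cache downloaded pages later.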

Step 4: Saving Data to Excel

Once you have extracted the data, you can save it to an Excel file using pandas.

  1. Create a DataFrame:

    df = pd.DataFrame(data)
    
  2. Export to Excel: Use to_excel() to save the DataFrame.

    df.to_excel('output.xlsx', index=False)
    
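As a sanity check, you can read the file straight back with read_excel. The rows below are invented stand-ins for the data list built in Step 3:

```python
import pandas as pd

# Hypothetical scraped rows standing in for the real `data` list
data = [{'Title': 'Blue Widget', 'Link': '/items/blue-widget'},
        {'Title': 'Red Widget',  'Link': '/items/red-widget'}]

pd.DataFrame(data).to_excel('output.xlsx', index=False)  # openpyxl does the writing

# Round-trip: reload the file and confirm nothing was lost
check = pd.read_excel('output.xlsx')
print(check.shape)  # (2, 2)
```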

Conclusion

You have now learned how to set up your Python environment, inspect a webpage, write a web scraping script, and save the extracted data to an Excel file. This foundational knowledge allows you to automate data collection from various online sources. For your next steps, consider exploring more complex scraping scenarios, such as handling pagination, dealing with JavaScript-loaded content, or using APIs for structured data access. Happy scraping!