Web Scraping 101: Scraping Website Data with Python
Introduction
This tutorial will guide you through the basics of web scraping using Python. By the end of this guide, you'll learn how to extract data from websites and save it into Excel files, enabling you to leverage online information for various applications. Web scraping can automate data collection processes, making it a valuable skill for data analysis and business intelligence.
Step 1: Setting Up Your Environment
Before you begin scraping, you need to set up your Python environment.
- Install Python: Ensure you have Python installed on your computer. You can download it from the official Python website.
- Install Required Libraries: You will need a few libraries for web scraping. Open your command line or terminal and run:
```
pip install requests beautifulsoup4 pandas openpyxl
```

- `requests`: sends HTTP requests.
- `beautifulsoup4` (imported as `bs4`): parses HTML and extracts data.
- `pandas`: manages and manipulates tabular data.
- `openpyxl`: lets pandas write Excel files.
Step 2: Understanding the Target Website
Before you scrape, familiarize yourself with the structure of the website you want to extract data from.
- Inspect the Page: Right-click on the webpage and select “Inspect” to open the Developer Tools. Use the "Elements" tab to view the HTML structure.
- Identify Data: Look for the specific HTML tags that contain the information you wish to scrape. Common tags include `<div>`, `<span>`, and `<table>`.
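To see how tags spotted in the "Elements" tab translate into code, here is a minimal sketch using a made-up fragment; the `prices` id and `price` class are inventions for this example, not from any real site:

```python
from bs4 import BeautifulSoup

# A made-up fragment using the common tags listed above.
html = '<div id="prices"><span class="price">19.99</span><span class="price">4.50</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# The <span class="price"> elements seen in the inspector become a find_all call.
prices = [span.text for span in soup.find_all('span', class_='price')]
print(prices)  # ['19.99', '4.50']
```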
Step 3: Writing the Scraping Script
Now, you can write a Python script to scrape the data.
- Import Libraries:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
- Send a Request: Use the `requests` library to get the content of the webpage.

```python
url = 'https://example.com'  # Replace with your target URL
response = requests.get(url)
```
- Parse the HTML: Use BeautifulSoup to parse the webpage content.

```python
soup = BeautifulSoup(response.text, 'html.parser')
```
- Extract Data: Find the data you want using the appropriate selectors.

```python
data = []
for item in soup.find_all('div', class_='your-class'):  # Adjust to your needs
    title = item.find('h2').text  # Replace with actual tags
    link = item.find('a')['href']
    data.append({'Title': title, 'Link': link})
```
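You can check the extraction loop before pointing it at a live site by feeding BeautifulSoup a hand-written HTML string in place of `response.text`; the fragment below is invented to match the placeholder `your-class` selectors used above:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; structure mirrors the placeholder selectors above.
html = """
<div class="your-class"><h2>First post</h2><a href="https://example.com/1">read</a></div>
<div class="your-class"><h2>Second post</h2><a href="https://example.com/2">read</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

data = []
for item in soup.find_all('div', class_='your-class'):
    title = item.find('h2').text
    link = item.find('a')['href']
    data.append({'Title': title, 'Link': link})

print(data)  # one dict per matched <div>
```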
Step 4: Saving Data to Excel
Once you have extracted the data, you can save it to an Excel file using pandas.
- Create a DataFrame:

```python
df = pd.DataFrame(data)
```
- Export to Excel: Use `to_excel()` to save the DataFrame.

```python
df.to_excel('output.xlsx', index=False)
```
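As a sanity check, the file can be read back with `pd.read_excel` to confirm nothing was lost in the round trip; the sample record below is hypothetical, shaped like the `data` list built in Step 3, and this assumes `openpyxl` is installed as in Step 1:

```python
import pandas as pd

# Hypothetical scraped record, shaped like the `data` list from Step 3.
data = [{'Title': 'First post', 'Link': 'https://example.com/1'}]
df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)

# Read the file back to confirm the rows survived the round trip.
roundtrip = pd.read_excel('output.xlsx')
print(roundtrip.equals(df))  # True if rows and columns match exactly
```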
Conclusion
You have now learned how to set up your Python environment, inspect a webpage, write a web scraping script, and save the extracted data to an Excel file. This foundational knowledge allows you to automate data collection from various online sources. For your next steps, consider exploring more complex scraping scenarios, such as handling pagination, dealing with JavaScript-loaded content, or using APIs for structured data access. Happy scraping!
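For the pagination case mentioned above, one common pattern is to generate one URL per page and run the Step 3 request-and-extract cycle on each; the `?page=N` query parameter here is an assumption for illustration, so check how your target site actually paginates:

```python
def page_urls(base_url, last_page):
    """Build one URL per page, assuming the site paginates via a ?page=N parameter."""
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]

# Each of these URLs would then be fetched and parsed exactly as in Step 3.
print(page_urls('https://example.com/articles', 3))
```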