Web Scraping Instagram Reels and Pictures with Python
Introduction
In this tutorial, you'll learn how to scrape Instagram Reels and pictures using Python. We'll use Selenium to drive the browser (logging in and opening pages), BeautifulSoup to pull post links out of the page, and Requests to download the media files. This guide is ideal for anyone looking to enhance their Python skills while exploring web scraping techniques.
Step 1: Import Python Libraries
Start by importing the necessary libraries. Ensure you have the following installed:
- Selenium
- BeautifulSoup
- Requests
Use the following code to import them:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
Practical Tips
- If you haven't installed these libraries yet, run the following in your terminal:
pip install selenium beautifulsoup4 requests
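Selenium's Python API changed noticeably between major versions, so it can help to confirm which versions are actually installed before running the later steps. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Names as pip knows them (note: beautifulsoup4, not bs4).
    for name in ("selenium", "beautifulsoup4", "requests"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```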
Step 2: Setup Chromedriver
Chromedriver is essential for using Selenium with Chrome. Download the appropriate version for your Chrome browser from the Chromedriver website.
Steps to Set Up
- Place the Chromedriver executable in a directory accessible to your script.
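- Use the following code to initiate the driver. Selenium 4 takes the driver path through a Service object; recent releases no longer accept the older executable_path argument:
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))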
Step 3: Automate Login and Password
To scrape Instagram, you need to log in. Use Selenium to automate this process.
Steps to Automate Login
- Navigate to Instagram's login page:
driver.get('https://www.instagram.com/accounts/login/')
- Locate the username and password fields and input your credentials:
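# find_element_by_name was removed in Selenium 4.3; use find_element with By instead
from selenium.webdriver.common.by import By
username_input = driver.find_element(By.NAME, 'username')
password_input = driver.find_element(By.NAME, 'password')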
username_input.send_keys('your_username')
password_input.send_keys('your_password')
- Submit the login form:
password_input.submit()
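Typing your password directly into the script makes it easy to leak by accident. One alternative, sketched below, is to read the credentials from environment variables. The variable names IG_USERNAME and IG_PASSWORD and the helper load_credentials are hypothetical choices for this example, not anything Instagram or Selenium defines.

```python
import os


def load_credentials(env=None):
    """Read the Instagram username and password from environment variables.

    IG_USERNAME and IG_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    try:
        return env["IG_USERNAME"], env["IG_PASSWORD"]
    except KeyError as missing:
        # Report which variable is absent without echoing any secret value.
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None
```

You could then call username, password = load_credentials() and pass those values to send_keys in place of the hard-coded strings.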
Common Pitfalls
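- If your account uses two-factor authentication, Instagram asks for a verification code after the form is submitted, and the script above does not handle that step. Use an account without two-factor authentication, or be ready to enter the code by hand in the browser window.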
Step 4: Automate Search
After logging in, you can automate searches for specific content.
Steps to Perform a Search
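- Rather than typing into the search bar, open the page for a hashtag directly by URL (user profiles and locations have their own URLs and can be opened the same way):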
search_url = 'https://www.instagram.com/explore/tags/your_hashtag/'
driver.get(search_url)
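If the hashtag comes from user input, it may include a leading # or characters that are not URL-safe. Here is a small sketch of a helper that builds the same URL pattern as above; the name hashtag_url is an assumption for this example.

```python
from urllib.parse import quote

BASE_URL = "https://www.instagram.com"


def hashtag_url(tag):
    """Build the explore URL for a hashtag, dropping a leading '#' and whitespace."""
    cleaned = tag.strip().lstrip("#")
    if not cleaned:
        raise ValueError("hashtag must not be empty")
    # Percent-encode anything that is not URL-safe (for example non-ASCII tags).
    return f"{BASE_URL}/explore/tags/{quote(cleaned, safe='')}/"
```

The result can be passed straight to driver.get(), for example driver.get(hashtag_url('#sunset')).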
Step 5: Gather Post URLs with BeautifulSoup
Once on the desired page, use BeautifulSoup to extract post URLs.
Steps to Extract URLs
- Parse the page content:
soup = BeautifulSoup(driver.page_source, 'html.parser')
- Find all the relevant post links:
posts = soup.find_all('a', href=True)
post_urls = ['https://www.instagram.com' + post['href'] for post in posts if '/p/' in post['href'] or '/reel/' in post['href']]
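A page often links to the same post more than once, and some links carry query strings. The sketch below takes the raw href values (for example [a['href'] for a in posts]) and returns unique, absolute post and Reel URLs. It assumes that post links start with /p/ and Reel links with /reel/; the helper name filter_post_urls is made up for this example.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.instagram.com"
# Assumed link shapes: /p/<shortcode>/ for posts and /reel/<shortcode>/ for Reels.
POST_PATH = re.compile(r"^/(?:p|reel)/[A-Za-z0-9_-]+/?")


def filter_post_urls(hrefs):
    """Return absolute, de-duplicated post and Reel URLs in first-seen order."""
    seen = {}
    for href in hrefs:
        if POST_PATH.match(href):
            # Drop any query string so the same post is not counted twice.
            url = urljoin(BASE_URL, href.split("?", 1)[0])
            seen.setdefault(url, None)
    return list(seen)
```

Matching on the full path segment avoids picking up unrelated links that merely contain the letter p.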
Step 6: Access JSON to Collect URLs
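Instagram may serve a post's data as JSON when ?__a=1 is appended to the post URL. This behavior is not guaranteed: the endpoint may be restricted or changed, so check the response before relying on it. When it is available, the JSON contains the direct media URLs.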
Steps to Access JSON
- Use the URL of a specific post to access its JSON data:
json_url = 'https://www.instagram.com/p/your_post_id/?__a=1'
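response = requests.get(json_url)
response.raise_for_status()  # stop early on an HTTP error such as 404 or 429
data = response.json()  # raises an error if the body is not JSON (for example an HTML login page)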
- Extract the media URL from the JSON response and store it in a list for the download step:
media_urls = [data['graphql']['shortcode_media']['display_url']]
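The lookup above returns only the display image of a post. For Reels and multi-image posts you would also want the video file and each item of the carousel. The following sketch shows one way to collect them. The keys is_video, video_url, and edge_sidecar_to_children are assumptions about the same JSON layout and may not match what Instagram serves today; the helper name extract_media_urls is hypothetical.

```python
def extract_media_urls(data):
    """Collect image and video URLs from a post's JSON payload.

    Assumes the layout data['graphql']['shortcode_media']; the keys
    'is_video', 'video_url' and 'edge_sidecar_to_children' are assumptions
    about that layout and may differ in practice.
    """
    media = data["graphql"]["shortcode_media"]
    # A carousel post lists its items as child nodes; otherwise treat the
    # post itself as the only item.
    edges = media.get("edge_sidecar_to_children", {}).get("edges", [])
    nodes = [edge["node"] for edge in edges] or [media]
    urls = []
    for node in nodes:
        if node.get("is_video") and node.get("video_url"):
            urls.append(node["video_url"])
        else:
            urls.append(node["display_url"])
    return urls
```

Calling media_urls = extract_media_urls(data) would give a list that works with the download loop in the next step.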
Step 7: Download Files
Now that you have the media URLs, download the images or videos.
Steps to Download Media
- Create a function to handle file downloads:
def download_file(url, filename):
    response = requests.get(url)
    response.raise_for_status()  # do not save an error page as a media file
    with open(filename, 'wb') as file:
        file.write(response.content)
- Call this function for each media URL:
for index, url in enumerate(media_urls):
    download_file(url, f'image_{index}.jpg')
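The loop above saves every file with a .jpg extension, which is misleading when the URL points to a video. A minimal sketch of one way to pick the extension from the URL itself follows; the helper name filename_for and the media_ prefix are choices made for this example.

```python
import os
from urllib.parse import urlparse


def filename_for(url, index, default_ext=".jpg"):
    """Build a local filename whose extension matches the media URL's path."""
    # urlparse separates the query string, so URL parameters are ignored.
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return f"media_{index}{ext or default_ext}"
```

In the loop you could then call download_file(url, filename_for(url, index)).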
Conclusion
In this tutorial, you’ve learned how to set up a web scraper for Instagram using Python, Selenium, and BeautifulSoup. You can now automate the login process, search for content, gather URLs, and download media files. As a next step, consider exploring additional features such as filtering by date or user engagement. Happy scraping!