How to build an ETL pipeline with Python | Data pipeline | Export from SQL Server to PostgreSQL
Introduction
This tutorial will guide you through the process of building an ETL (Extract, Transform, Load) pipeline using Python. The primary objective is to extract data from SQL Server's AdventureWorks database, transform it as needed, and load it into a PostgreSQL database. This is particularly useful for data analytics and reporting projects, ensuring that data is optimized for queries.
Step 1: Set Up PostgreSQL
- Download PostgreSQL:
- Visit the PostgreSQL download page and follow the installation instructions for your operating system.
- Configure PostgreSQL:
- During installation, set up a password for the default user (postgres).
- Note the port number (usually 5432) and the database name you will create later.
- Create a Database:
- Open the PostgreSQL command line or pgAdmin.
- Create a new database using the command:
CREATE DATABASE your_database_name;
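To confirm the new database is reachable from Python, a minimal check along the following lines can be run once psycopg2 is installed (see Step 3); the host, port, and password values are placeholders for your own setup.
import psycopg2  # installed in Step 3

# Connect to the newly created database; adjust host, port, user, and password to match your installation.
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    dbname='your_database_name',
    user='postgres',
    password='your_password'
)
cur = conn.cursor()
cur.execute('SELECT version();')
print(cur.fetchone()[0])  # prints the PostgreSQL server version if the connection works
cur.close()
conn.close()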
Step 2: Set Up SQL Server
- Install SQL Server:
- Follow the installation guide in the SQL Server install video.
- Configure SQL Server:
- Ensure SQL Server is running and accessible.
- Make a note of your SQL Server instance name and authentication method (Windows or SQL Server Authentication).
- Create a Database:
- Use SQL Server Management Studio to create a database if you haven't already, or restore the AdventureWorks sample database used in this tutorial (it is a free download from Microsoft, not installed by default).
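As an optional sanity check, you can ask pyodbc (installed in Step 3) to list the ODBC drivers it can see; after completing Step 3, the SQL Server driver should appear in this list.
import pyodbc

# List the ODBC drivers visible to Python; you should see an entry such as
# 'ODBC Driver 17 for SQL Server' once the driver from Step 3 is installed.
for driver in pyodbc.drivers():
    print(driver)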
Step 3: Install Required Python Libraries
- Install Libraries:
- Use pip to install the following libraries (substitute psycopg2-binary for psycopg2 if you prefer a prebuilt wheel):
pip install pandas pyodbc sqlalchemy psycopg2
- Install ODBC Driver:
- Download and install the Microsoft ODBC Driver for SQL Server from the official page.
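A quick way to verify the installation is to import each library and print its version; this sketch assumes the pip command above completed without errors.
import pandas
import pyodbc
import sqlalchemy
import psycopg2

# Print the installed versions to confirm all ETL dependencies are available.
print('pandas:', pandas.__version__)
print('pyodbc:', pyodbc.version)
print('SQLAlchemy:', sqlalchemy.__version__)
print('psycopg2:', psycopg2.__version__)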
Step 4: Connect to SQL Server and Extract Data
- Create Connection String:
- Use the following code to connect to SQL Server:
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=server_name;'
    'DATABASE=database_name;'
    'UID=user;'
    'PWD=password'
)
- Extract Data:
- Write a SQL query to extract data from your source table:
query = "SELECT * FROM your_table_name" df = pd.read_sql(query, conn)
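If the source table is large, the extraction above can be done in chunks so the whole table never sits in memory at once. This is an optional variation; your_table_name remains a placeholder and conn is the pyodbc connection created earlier.
import pandas as pd

# Read the source table in chunks of 50,000 rows to keep memory usage bounded.
query = "SELECT * FROM your_table_name"
chunks = []
for chunk in pd.read_sql(query, conn, chunksize=50000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(f"Extracted {len(df)} rows")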
Step 5: Transform Data (if needed)
- Data Cleaning and Transformation:
- Use pandas to clean or transform your data as necessary:
# Example: Remove duplicates
df = df.drop_duplicates()
- Prepare Data for Loading:
- Ensure the dataframe structure matches the destination PostgreSQL table schema.
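The transformations you need depend entirely on your data. The sketch below shows a few common ones; the column names (customerid, middlename, modifieddate) are assumptions for illustration and should be replaced with columns from your own table.
# PostgreSQL convention: lowercase column names avoid having to quote
# mixed-case identifiers in later queries.
df.columns = [col.lower() for col in df.columns]

# Drop rows missing a key value and fill gaps in an optional column
# ('customerid' and 'middlename' are assumed column names).
df = df.dropna(subset=['customerid'])
df['middlename'] = df['middlename'].fillna('')

# Ensure date columns use a proper datetime type before loading ('modifieddate' is assumed).
df['modifieddate'] = pd.to_datetime(df['modifieddate'])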
Step 6: Load Data into PostgreSQL
- Create PostgreSQL Connection:
- Use SQLAlchemy to connect to PostgreSQL:
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/your_database_name')
- Load Data:
- Use the following code to load the data into PostgreSQL:
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
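For larger dataframes, an alternative to a single replace-style load is to append rows in batches; this sketch reuses the engine created above, and the table name is again a placeholder.
# Append in batches instead of replacing the whole table in one call.
df.to_sql(
    'your_table_name',
    engine,
    if_exists='append',   # 'replace' drops and recreates the table; 'append' adds rows
    index=False,
    chunksize=10000,      # send rows in batches of 10,000
    method='multi'        # bundle multiple rows into each INSERT statement
)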
Step 7: Test the ETL Pipeline
- Verify Data in PostgreSQL:
- Use queries to check if the data has been successfully loaded:
SELECT * FROM your_table_name;
- Check for Errors:
- Review any error messages during the process and adjust your code accordingly.
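One simple automated check is to compare row counts between the source and destination tables; this sketch assumes the pyodbc connection (conn) and SQLAlchemy engine (engine) from the earlier steps are still open, and the table names are placeholders.
import pandas as pd

# Count rows in the SQL Server source and the PostgreSQL destination.
source_count = pd.read_sql("SELECT COUNT(*) AS n FROM your_table_name", conn)['n'].iloc[0]
target_count = pd.read_sql("SELECT COUNT(*) AS n FROM your_table_name", engine)['n'].iloc[0]

print(f"Source rows: {source_count}, loaded rows: {target_count}")
if source_count != target_count:
    print("Row counts differ - investigate before relying on the loaded data.")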
Conclusion
Building an ETL pipeline with Python allows you to efficiently transfer data from SQL Server to PostgreSQL. By following these steps, you can set up your own pipeline for data analytics and reporting purposes. For further enhancements, consider implementing error handling and logging in your ETL process. Explore the provided resources and links for deeper insights and additional functionalities.
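As a starting point for the error handling and logging mentioned above, the whole pipeline can be collected into one function; this is a minimal sketch that reuses the placeholder connection strings and table name from the tutorial, not a production-ready implementation.
import logging

import pandas as pd
import pyodbc
from sqlalchemy import create_engine

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('etl')

def run_etl():
    """Extract from SQL Server, transform, and load into PostgreSQL."""
    try:
        log.info('Extracting from SQL Server')
        conn = pyodbc.connect(
            'DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=server_name;DATABASE=database_name;UID=user;PWD=password'
        )
        df = pd.read_sql("SELECT * FROM your_table_name", conn)

        log.info('Transforming %d rows', len(df))
        df = df.drop_duplicates()

        log.info('Loading into PostgreSQL')
        engine = create_engine('postgresql://user:password@localhost:5432/your_database_name')
        df.to_sql('your_table_name', engine, if_exists='replace', index=False)

        log.info('ETL run completed')
    except Exception:
        log.exception('ETL run failed')
        raise

if __name__ == '__main__':
    run_etl()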