How to build an ETL pipeline with Python | Data pipeline | Export from SQL Server to PostgreSQL

Published on Jun 04, 2025

Introduction

This tutorial will guide you through the process of building an ETL (Extract, Transform, Load) pipeline using Python. The primary objective is to extract data from SQL Server's AdventureWorks sample database, transform it as needed, and load it into a PostgreSQL database, making the data available in a form that suits analytics and reporting queries.

Step 1: Set Up PostgreSQL

  1. Download PostgreSQL:

    • Download the installer for your operating system from the official PostgreSQL website and run it.
  2. Configure PostgreSQL:

    • During installation, set up a password for the default user (postgres).
    • Note the port number (usually 5432) and the database name you will create later.
  3. Create a Database:

    • Open the PostgreSQL command line or pgAdmin.
    • Create a new database using the command below (a quick Python connection check is sketched after this step):
      CREATE DATABASE your_database_name;
      
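Once the libraries from Step 3 are installed, a quick connection check can confirm that the new database is reachable. This is a minimal sketch; the database name, user, password, host, and port are placeholders to replace with your own values:

      import psycopg2
      
      # Placeholder credentials -- replace with the values chosen during installation.
      conn = psycopg2.connect(dbname='your_database_name',
                              user='postgres',
                              password='your_password',
                              host='localhost',
                              port=5432)
      cur = conn.cursor()
      cur.execute('SELECT version();')
      print(cur.fetchone())  # Prints the PostgreSQL server version if the connection works
      conn.close()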

Step 2: Set Up SQL Server

  1. Install SQL Server:

    • Download and install SQL Server (the free Developer or Express edition is sufficient) from Microsoft's official download page.
  2. Configure SQL Server:

    • Ensure SQL Server is running and accessible.
    • Make a note of your SQL Server instance name and authentication method (Windows or SQL Server Authentication).
  3. Create a Database:

    • Use SQL Server Management Studio to create a database if you haven't already, or restore the AdventureWorks sample database, which can be downloaded from Microsoft's documentation (a quick connectivity check is sketched after this step).
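
To confirm that the instance is reachable and that your source database (for example, AdventureWorks) was restored, a short check like the following can be run once the libraries from Step 3 are installed. This is a sketch; the server name and credentials are placeholders:

      import pyodbc
      
      # Connect to the master database and list the databases visible on the instance.
      conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                            'SERVER=server_name;'
                            'DATABASE=master;'
                            'UID=user;'
                            'PWD=password')
      for row in conn.cursor().execute("SELECT name FROM sys.databases"):
          print(row.name)
      conn.close()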

Step 3: Install Required Python Libraries

  1. Install Libraries:

    • Use pip to install the following libraries (if psycopg2 fails to build on your machine, install the precompiled psycopg2-binary package instead):
      pip install pandas pyodbc sqlalchemy psycopg2
      
  2. Install ODBC Driver:

    • Download and install the Microsoft ODBC Driver for SQL Server from Microsoft's official download page; version 17 matches the connection string used below (a quick verification snippet follows this step).
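
A quick way to confirm the installation is to import the libraries and list the ODBC drivers pyodbc can see; the driver name used in the connection string in Step 4 should appear in the output:

      import pandas
      import sqlalchemy
      import psycopg2
      import pyodbc
      
      # 'ODBC Driver 17 for SQL Server' (or a newer version) should be in this list.
      print(pyodbc.drivers())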

Step 4: Connect to SQL Server and Extract Data

  1. Create Connection String:

    • Use the following code to connect to SQL Server (for Windows Authentication, replace the UID and PWD entries with 'Trusted_Connection=yes;'):
      import pyodbc
      
      conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                            'SERVER=server_name;'
                            'DATABASE=database_name;'
                            'UID=user;'
                            'PWD=password')
      
  2. Extract Data:

    • Write a SQL query to extract data from your source table into a pandas DataFrame (a variant that reads through a SQLAlchemy engine is sketched after this step):
      import pandas as pd
      
      query = "SELECT * FROM your_table_name"
      df = pd.read_sql(query, conn)
      
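Recent versions of pandas warn when read_sql is given a raw DBAPI connection, so you may prefer to read through a SQLAlchemy engine instead. The following is a sketch that assumes the same placeholder server, database, and credentials as above:

      import urllib.parse
      
      import pandas as pd
      from sqlalchemy import create_engine
      
      # URL-encode the ODBC connection string and wrap it in a SQLAlchemy engine.
      odbc_str = ('DRIVER={ODBC Driver 17 for SQL Server};'
                  'SERVER=server_name;DATABASE=database_name;'
                  'UID=user;PWD=password')
      mssql_engine = create_engine('mssql+pyodbc:///?odbc_connect=' +
                                   urllib.parse.quote_plus(odbc_str))
      
      df = pd.read_sql("SELECT * FROM your_table_name", mssql_engine)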

Step 5: Transform Data (if needed)

  1. Data Cleaning and Transformation:

    • Use pandas to clean or transform your data as necessary:
      # Example: Remove duplicates
      df = df.drop_duplicates()
      
  2. Prepare Data for Loading:

    • Ensure the DataFrame's column names and data types match the destination PostgreSQL table schema (a fuller cleaning sketch follows this step).
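
As an illustration, a slightly fuller cleaning pass might look like the sketch below. The column names are hypothetical; adapt them to whichever table you extracted in Step 4:

      # Remove exact duplicate rows.
      df = df.drop_duplicates()
      
      # Lowercase column names so they match typical PostgreSQL naming conventions.
      df.columns = [col.lower() for col in df.columns]
      
      # Hypothetical columns: parse a date column and fill missing numeric values.
      if 'orderdate' in df.columns:
          df['orderdate'] = pd.to_datetime(df['orderdate'], errors='coerce')
      if 'freight' in df.columns:
          df['freight'] = df['freight'].fillna(0)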

Step 6: Load Data into PostgreSQL

  1. Create PostgreSQL Connection:

    • Use SQLAlchemy to connect to PostgreSQL:
      from sqlalchemy import create_engine
      
      engine = create_engine('postgresql://user:password@localhost:5432/your_database_name')
      
  2. Load Data:

    • Use the following code to load the data into PostgreSQL. Note that if_exists='replace' drops and recreates the target table, while 'append' adds rows to an existing one; a chunked-load variant is sketched after this step:
      df.to_sql('your_table_name', engine, if_exists='replace', index=False)
      
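For larger tables, writing everything in a single batch can be slow or memory-hungry. Passing a chunk size is one common variation; the value below is an arbitrary example:

      # Write rows in batches of 10,000 to keep memory use and transaction size bounded.
      df.to_sql('your_table_name', engine, if_exists='replace', index=False, chunksize=10000)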

Step 7: Test the ETL Pipeline

  1. Verify Data in PostgreSQL:

    • Run a query to check whether the data has been loaded successfully (a row-count comparison is sketched after this step):
      SELECT * FROM your_table_name;
      
  2. Check for Errors:

    • Review any error messages during the process and adjust your code accordingly.
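
One simple automated check is to compare the row count in PostgreSQL with the DataFrame that was loaded. This sketch reuses the SQLAlchemy engine from Step 6 and the placeholder table name:

      # Compare the number of rows in PostgreSQL with the source DataFrame.
      loaded = pd.read_sql("SELECT COUNT(*) AS n FROM your_table_name", engine)
      print("Rows in PostgreSQL:", int(loaded['n'].iloc[0]))
      print("Rows in DataFrame: ", len(df))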

Conclusion

Building an ETL pipeline with Python allows you to efficiently transfer data from SQL Server to PostgreSQL. By following these steps, you can set up your own pipeline for data analytics and reporting. For further enhancements, consider adding error handling and logging to your ETL process; a minimal end-to-end sketch along those lines follows below.
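
This sketch assumes the same placeholder connection details, table name, and cleaning steps used throughout the tutorial; adapt them to your environment:

      import logging
      import urllib.parse
      
      import pandas as pd
      from sqlalchemy import create_engine
      
      logging.basicConfig(level=logging.INFO,
                          format='%(asctime)s %(levelname)s %(message)s')
      log = logging.getLogger('etl')
      
      # Placeholder connection details -- replace with your own.
      MSSQL_ODBC = ('DRIVER={ODBC Driver 17 for SQL Server};'
                    'SERVER=server_name;DATABASE=database_name;'
                    'UID=user;PWD=password')
      PG_URL = 'postgresql://user:password@localhost:5432/your_database_name'
      
      def extract(query):
          """Read the source data from SQL Server into a DataFrame."""
          engine = create_engine('mssql+pyodbc:///?odbc_connect=' +
                                 urllib.parse.quote_plus(MSSQL_ODBC))
          return pd.read_sql(query, engine)
      
      def transform(df):
          """Apply the cleaning steps from Step 5."""
          df = df.drop_duplicates()
          df.columns = [col.lower() for col in df.columns]
          return df
      
      def load(df, table):
          """Write the DataFrame to PostgreSQL in batches."""
          engine = create_engine(PG_URL)
          df.to_sql(table, engine, if_exists='replace', index=False, chunksize=10000)
      
      if __name__ == '__main__':
          try:
              log.info('Extracting data from SQL Server')
              data = extract('SELECT * FROM your_table_name')
              log.info('Extracted %d rows', len(data))
              data = transform(data)
              load(data, 'your_table_name')
              log.info('Load complete')
          except Exception:
              log.exception('ETL pipeline failed')
              raise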