Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)

Published on Jan 25, 2025

Introduction

This tutorial provides a comprehensive guide to using the Pandas library in Python for data science. You'll learn how to install Pandas, read and manipulate data from various file formats, and perform essential operations like filtering, sorting, and aggregating data. This tutorial is perfect for beginners looking to enhance their data analysis skills with Python.

Step 1: Install Pandas

To get started with Pandas, you need to install it. Use the following command in your terminal or command prompt:

pip install pandas

Step 2: Obtain the Data

You will need sample data to work with. You can download a dataset from Kaggle or use the provided GitHub repository.

Step 3: Load Data into Pandas

You can load data from various formats like CSV, Excel, and TXT. Here’s how to load a CSV file:

import pandas as pd

# Load a CSV file
data = pd.read_csv('path/to/your/file.csv')
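
As a small sketch of loading delimited text, the example below reads the same table from comma-separated and tab-separated content. The table contents and variable names are hypothetical, and in-memory buffers stand in for real file paths so the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical file contents; in practice pass a file path instead of a buffer.
csv_text = "name,team,score\nAnn,red,88\nBob,blue,72\n"
tsv_text = "name\tteam\tscore\nAnn\tred\t88\nBob\tblue\t72\n"

# Comma-separated values are the default for read_csv.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text uses the same reader with an explicit separator.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.head())               # first rows of the table
print(from_csv.equals(from_tsv))     # both readers produce the same DataFrame
```

Excel files are read with pd.read_excel, which needs an Excel engine such as openpyxl installed alongside Pandas.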

Step 4: Read Data

You can access data in different ways:

  • Get entire columns: data['column_name']
  • Get a row by integer position: data.iloc[row_index]
  • Access a single cell by index label and column name: data.at[row_index, 'column_name']
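
The access patterns above can be tried on a small table. The DataFrame below is made up for illustration and uses the default integer index, so positions and labels happen to coincide.

```python
import pandas as pd

# Hypothetical sample data with the default 0..n-1 index.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "score": [88, 72, 95],
})

scores = df["score"]             # one column, returned as a Series
subset = df[["name", "score"]]   # several columns, returned as a DataFrame
second_row = df.iloc[1]          # one row by integer position
first_two = df.iloc[0:2]         # a slice of rows by position
cell = df.at[2, "score"]         # one value by index label and column name

print(second_row["name"], cell)
```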

Step 5: Iterate Through Rows

You can iterate through DataFrame rows using the iterrows() method:

for index, row in data.iterrows():
    print(row['column_name'])
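
Here is a runnable version of the loop on a hypothetical three-row table. Each iteration yields the index label and the row as a Series, so values are looked up by column name.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "score": [88, 72, 95],
})

seen = []
for index, row in df.iterrows():
    # row is a Series; index is the row's label in the DataFrame.
    seen.append((index, row["name"]))
    print(index, row["name"], row["score"])
```

Row-by-row iteration tends to be slow on large tables, so column-wise operations are usually preferable when they can express the same logic.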

Step 6: Filter Rows Based on Conditions

To filter data based on specific conditions:

filtered_data = data[data['column_name'] > value]
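
To make the mechanics visible, the sketch below splits the filter into two steps: the comparison produces a True/False Series (a mask), and indexing with that mask keeps only the True rows. The data and the threshold of 80 are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "score": [88, 72, 95, 60],
})

mask = df["score"] > 80   # boolean Series, one entry per row
high = df[mask]           # keeps only the rows where the mask is True

print(mask.tolist())
print(high)
```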

Step 7: Get High-Level Statistics

You can get descriptive statistics for the numeric columns of your dataset:

stats = data.describe()  # Includes min, max, mean, std dev, etc.
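
As an illustration on made-up data, describe() returns a DataFrame whose rows are the statistics and whose columns are the numeric columns, so individual values can be looked up with loc.

```python
import pandas as pd

# Hypothetical sample data; the text column is skipped by default.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "score": [88, 72, 95, 60],
})

stats = df.describe()

print(stats)
print(stats.loc["mean", "score"])   # mean of the score column
print(stats.loc["max", "score"])    # largest score
```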

Step 8: Sort Values

Sorting data can be done alphabetically or numerically:

sorted_data = data.sort_values(by='column_name', ascending=True)
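
Sorting also works on several columns at once, with a separate direction for each. The sketch below uses a hypothetical table and sorts by team alphabetically, then by score from highest to lowest within each team.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "team": ["red", "blue", "red", "blue"],
    "score": [88, 72, 95, 60],
})

# Single column, highest score first.
by_score = df.sort_values(by="score", ascending=False)

# Two columns: team A to Z, then score descending within each team.
by_team_then_score = df.sort_values(by=["team", "score"], ascending=[True, False])

print(by_score)
print(by_team_then_score)
```

sort_values returns a new DataFrame and leaves the original row order untouched.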

Step 9: Modify the DataFrame

You can make changes to your DataFrame in various ways:

  • Add a new column:
    data['new_column'] = data['existing_column'] * 2
  • Delete a column:
    data.drop('column_name', axis=1, inplace=True)
  • Sum multiple columns to create a new one:
    data['total'] = data['column1'] + data['column2']
  • Rearrange columns:
    data = data[['new_order_col1', 'new_order_col2']]
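
The four modifications above can be chained on one small table. The column names here are hypothetical, and drop(columns=...) is used as an equivalent alternative to axis=1 that returns a new DataFrame instead of changing the original in place.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "score": [88, 72],
    "bonus": [5, 0],
    "notes": ["", ""],
})

df["double_score"] = df["score"] * 2     # add a column derived from another
df["total"] = df["score"] + df["bonus"]  # sum two columns into a new one
df = df.drop(columns=["notes"])          # delete a column
df = df[["name", "total", "score", "bonus", "double_score"]]  # reorder columns

print(df)
```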

Step 10: Save Data

To save your modified DataFrame back to a file:

data.to_csv('path/to/save/file.csv', index=False)
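
A quick way to check a save is to write the file and read it back. The sketch below does that in a temporary directory with a hypothetical file name, and also shows that index=False keeps the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [88, 72]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")

    # index=False leaves the 0, 1, ... row labels out of the file.
    df.to_csv(path, index=False)

    with open(path) as f:
        header = f.readline().strip()

    reloaded = pd.read_csv(path)

print(header)
print(reloaded.equals(df))
```

Passing sep='\t' to to_csv writes tab-separated text instead, and to_excel writes Excel files when an engine such as openpyxl is installed.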

Step 11: Filter Data with Multiple Conditions

You can filter data using multiple conditions by combining them with logical operators:

filtered_data = data[(data['column1'] > value1) & (data['column2'] < value2)]
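
The example below sketches the three logical operators on hypothetical data: & for "and", | for "or", and ~ for "not". Each condition is wrapped in parentheses because these operators bind more tightly than comparisons such as > and ==.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "team": ["red", "blue", "red", "blue"],
    "score": [88, 72, 95, 60],
})

both = df[(df["team"] == "red") & (df["score"] > 90)]     # both must hold
either = df[(df["team"] == "blue") | (df["score"] > 90)]  # at least one holds
not_red = df[~(df["team"] == "red")]                      # condition negated

print(both)
print(either)
print(not_red)
```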

Step 12: Reset Index

To reset the DataFrame index after filtering or modifying data:

data.reset_index(drop=True, inplace=True)
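
Filtering keeps the original row labels, which is why a reset is often useful afterwards. The sketch below, on hypothetical data, shows the labels before and after, and what happens when drop=True is left out.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "score": [88, 72, 95, 60],
})

high = df[df["score"] > 80]
print(high.index.tolist())   # the filtered rows keep their old labels

# drop=True discards the old labels and renumbers from 0.
renumbered = high.reset_index(drop=True)

# Without drop=True the old labels are kept as a column named "index".
kept = high.reset_index()

print(renumbered)
print(kept)
```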

Step 13: Use Regex for Filtering

You can filter data based on text patterns using regular expressions:

filtered_data = data[data['column_name'].str.contains('pattern', regex=True, na=False)]  # na=False treats missing values as non-matches
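
As a concrete sketch, the pattern below keeps names that start with "a" or "c", ignoring case. The data is hypothetical and includes a missing value to show the effect of na=False, which counts missing entries as non-matches.

```python
import re

import pandas as pd

# Hypothetical sample data with one missing name.
df = pd.DataFrame({"name": ["Ann", "bob", "Cara", None, "Dan"]})

# ^ anchors the match at the start of the string; [ac] matches either letter.
mask = df["name"].str.contains("^[ac]", regex=True, flags=re.IGNORECASE, na=False)
matches = df[mask]

print(matches)
```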

Step 14: Make Conditional Changes

Modify data conditionally using loc:

data.loc[data['column_name'] > value, 'column_name'] = new_value
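
The condition and the column being changed do not have to be the same. In the sketch below, on hypothetical data, one loc assignment sets a label column based on the score, and another caps the score itself.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "score": [88, 72, 95, 60],
})

df["grade"] = "pass"                       # default value for every row
df.loc[df["score"] < 70, "grade"] = "fail" # change one column based on another
df.loc[df["score"] > 90, "score"] = 90     # cap values within the same column

print(df)
```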

Step 15: Aggregate Statistics with Groupby

Use the groupby function to perform aggregate statistics:

grouped_data = data.groupby('column_name').agg({'another_column': ['sum', 'mean', 'count']})
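
Here is the same aggregation on a hypothetical table with two teams. Passing a list of functions produces two-level column names, so a single result is looked up with a (column, function) pair.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "team": ["red", "blue", "red", "blue"],
    "score": [88, 72, 95, 60],
})

summary = df.groupby("team").agg({"score": ["sum", "mean", "count"]})
print(summary)

# One value: the mean score of the red team.
red_mean = summary.loc["red", ("score", "mean")]

# A single statistic can also be computed directly on one column.
mean_by_team = df.groupby("team")["score"].mean()
print(mean_by_team)
```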

Step 16: Work with Large Datasets

For large datasets, use the chunksize parameter to read data in smaller portions:

for chunk in pd.read_csv('large_file.csv', chunksize=1000):
    process(chunk)  # Replace with your processing function
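
One possible shape for the processing step is to keep running totals, so only one chunk is in memory at a time. The sketch below uses a tiny in-memory table and a chunk size of 4 purely for illustration; a real file path and a much larger chunk size would take their place.

```python
import io

import pandas as pd

# Hypothetical data: a single column holding the numbers 0 through 9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
rows = 0
chunk_sizes = []

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()   # accumulate a running sum
    rows += len(chunk)              # count rows seen so far
    chunk_sizes.append(len(chunk))

print(total, rows, chunk_sizes)
```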

Conclusion

In this tutorial, you learned how to use the Pandas library for various data manipulation tasks. You now have the skills to install Pandas, load data from different formats, filter and sort data, and perform aggregate statistics. For further learning, explore more advanced features like the apply() function or consider taking online courses to deepen your knowledge of data science. Happy coding!