Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)
Introduction
This tutorial provides a comprehensive guide to using the Pandas library in Python for data science. You'll learn how to install Pandas, read and manipulate data from various file formats, and perform essential operations like filtering, sorting, and aggregating data. This tutorial is perfect for beginners looking to enhance their data analysis skills with Python.
Step 1: Install Pandas
To get started with Pandas, you need to install it. Use the following command in your terminal or command prompt:
pip install pandas
Step 2: Obtain the Data
You will need sample data to work with. You can download a dataset from Kaggle or use the provided GitHub repository:
- Kaggle dataset: Pokemon Dataset
- GitHub repository: Keith Galli's Pandas Code
Step 3: Load Data into Pandas
You can load data from various formats like CSV, Excel, and TXT. Here’s how to load a CSV file:
import pandas as pd
# Load a CSV file
data = pd.read_csv('path/to/your/file.csv')
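As a rough sketch of how the same call handles other delimited text, the example below reads comma-separated and tab-separated content with pd.read_csv. The file contents are a tiny hand-written stand-in held in memory (not the Kaggle file), so the snippet runs without anything on disk. Excel files go through pd.read_excel instead, which needs an extra engine package such as openpyxl and is not run here.

```python
import io
import pandas as pd

# Hypothetical file contents, kept in memory so no real file is needed.
csv_text = "Name,Type,HP\nBulbasaur,Grass,45\nCharmander,Fire,39\n"
tsv_text = csv_text.replace(",", "\t")  # same table, tab-separated

# Comma-separated text: the default separator works.
csv_data = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text (often saved as .txt): pass the separator explicitly.
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_data.head())            # first rows of the table
print(csv_data.equals(tsv_data))  # both loads give the same DataFrame
```

With a real file, replace the io.StringIO(...) argument with the file path.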
Step 4: Read Data
You can access data in different ways:
- Get entire columns:
data['column_name']
- Get specific rows by integer position:
data.iloc[row_index]
- Access a specific cell by index label and column name:
data.at[row_index, 'column_name']
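The access patterns above can be tried on a small table. This is a minimal sketch; the DataFrame and its column names (Name, Type, HP) are made up for illustration and stand in for whatever file you loaded.

```python
import pandas as pd

# Small illustrative table; values are hand-written, not read from a file.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "Type": ["Grass", "Fire", "Water"],
        "HP": [45, 39, 44],
    }
)

names = data["Name"]           # one column, returned as a Series
subset = data[["Name", "HP"]]  # several columns, returned as a DataFrame
second_row = data.iloc[1]      # row at position 1, returned as a Series
first_two = data.iloc[0:2]     # rows at positions 0 and 1
cell = data.at[2, "Name"]      # single value by index label and column name

print(second_row["Name"], cell)
```

Note that iloc counts positions from 0, while at uses the index label; the two only coincide while the DataFrame still has its default 0, 1, 2, ... index.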
Step 5: Iterate Through Rows
You can iterate through DataFrame rows using the iterrows() method:
for index, row in data.iterrows():
    print(row['column_name'])
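Here is a small runnable version of that loop, assuming a made-up three-row table. Each iteration yields the index label and the row as a Series, so individual values are read by column name.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "Type": ["Grass", "Fire", "Water"],
    }
)

labels = []
for index, row in data.iterrows():
    # index is the row label, row is a Series keyed by column name
    labels.append(f"{index}: {row['Name']} ({row['Type']})")

print(labels)
```

Row-by-row iteration is convenient for inspection but tends to be slow on large DataFrames, where column-wise operations are usually the better choice.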
Step 6: Filter Rows Based on Conditions
To filter data based on specific conditions:
filtered_data = data[data['column_name'] > value]
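To see what the filter expression is doing, it can help to split it into two steps: the comparison builds a boolean Series (a mask), and indexing with that mask keeps only the rows marked True. The table and the threshold of 40 below are made up for illustration.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "HP": [45, 39, 44],
    }
)

mask = data["HP"] > 40      # boolean Series, one True/False per row
filtered_data = data[mask]  # keeps rows where the mask is True

print(mask.tolist())
print(filtered_data)
```

The filtered rows keep their original index labels (0 and 2 here), which is why resetting the index is sometimes useful afterwards.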
Step 7: Get High-Level Statistics
You can get descriptive statistics of your dataset:
stats = data.describe() # Includes min, max, mean, std dev, etc.
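A short sketch of what describe() returns, using a hand-written table. By default it summarizes only the numeric columns, so the text column Name is left out of the result.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "HP": [45, 39, 44],
        "Attack": [49, 52, 48],
    }
)

stats = data.describe()  # one column of statistics per numeric column

print(stats)
print(stats.loc["mean", "HP"])  # pick out a single statistic
```

The result is itself a DataFrame, so individual statistics can be read with loc using the statistic name as the row label.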
Step 8: Sort Values
Sorting data can be done alphabetically or numerically:
sorted_data = data.sort_values(by='column_name', ascending=True)
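As an illustration, sort_values also accepts a list of columns with a matching list of directions. The sketch below, on made-up data, sorts alphabetically by one column and then by two columns with mixed directions.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Squirtle", "Bulbasaur", "Charmander", "Ivysaur"],
        "Type": ["Water", "Grass", "Fire", "Grass"],
        "HP": [44, 45, 39, 60],
    }
)

# Alphabetical sort on a text column.
by_name = data.sort_values(by="Name")

# Sort by Type ascending, then by HP descending within each Type.
by_type_then_hp = data.sort_values(by=["Type", "HP"], ascending=[True, False])

print(by_name)
print(by_type_then_hp)
```

sort_values returns a new DataFrame and leaves data unchanged unless the result is assigned back.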
Step 9: Modify the DataFrame
You can make changes to your DataFrame in various ways:
- Add a new column:
data['new_column'] = data['existing_column'] * 2
- Delete a column:
data.drop('column_name', axis=1, inplace=True)
- Sum multiple columns to create a new one:
data['total'] = data['column1'] + data['column2']
- Rearrange columns (any column left out of the list is dropped):
data = data[['new_order_col1', 'new_order_col2']]
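The modifications listed above can be combined into one short run. This is a sketch on a made-up table; the column names (HP, Attack, Legendary, Total) are chosen for illustration.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "HP": [45, 39, 44],
        "Attack": [49, 52, 48],
        "Legendary": [False, False, False],
    }
)

# Add a column by summing two existing columns.
data["Total"] = data["HP"] + data["Attack"]

# The same sum computed across a list of columns (axis=1 means "per row").
row_sum = data[["HP", "Attack"]].sum(axis=1)

# Delete a column; drop returns a new DataFrame here.
data = data.drop(columns=["Legendary"])

# Rearrange columns by selecting them in the desired order.
data = data[["Name", "Total", "HP", "Attack"]]

print(data)
```

Assigning the result of drop back to data is an alternative to inplace=True and has the same effect on the final table.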
Step 10: Save Data
To save your modified DataFrame back to a file:
data.to_csv('path/to/save/file.csv', index=False)
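A quick way to check a save is to write the file and read it straight back. The sketch below does this in a temporary directory (the file name modified.csv is arbitrary) and also reads the header line to show that index=False keeps the index out of the file.

```python
import os
import tempfile

import pandas as pd

# Illustrative data only.
data = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "HP": [45, 39]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modified.csv")

    # index=False leaves the row index out of the file.
    data.to_csv(path, index=False)

    # Read the first line to confirm only the column names were written.
    with open(path) as f:
        header = f.readline().strip()

    # Load the file back to confirm the round trip.
    reloaded = pd.read_csv(path)

print(header)
print(reloaded)
```

Without index=False, the file would gain an extra unnamed first column holding the index values.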
Step 11: Filter Data with Multiple Conditions
You can filter data using multiple conditions by combining them with the element-wise operators & (and), | (or), and ~ (not). Wrap each condition in parentheses:
filtered_data = data[(data['column1'] > value1) & (data['column2'] < value2)]
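For illustration, here are the three element-wise operators (&, | and ~) applied to a made-up table. Each condition sits in its own parentheses because these operators bind more tightly than comparisons such as > and ==.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle", "Ivysaur"],
        "Type": ["Grass", "Fire", "Water", "Grass"],
        "HP": [45, 39, 44, 60],
    }
)

# AND: both conditions must hold.
grass_and_tough = data[(data["Type"] == "Grass") & (data["HP"] > 50)]

# OR: either condition may hold.
grass_or_fire = data[(data["Type"] == "Grass") | (data["Type"] == "Fire")]

# NOT: invert a condition.
not_grass = data[~(data["Type"] == "Grass")]

print(grass_and_tough)
print(grass_or_fire)
print(not_grass)
```

Python's own and, or, and not keywords do not work element-wise on a Series, which is why the symbol operators are used here.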
Step 12: Reset Index
To reset the DataFrame index after filtering or modifying data:
data.reset_index(drop=True, inplace=True)
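The effect of drop=True is easiest to see next to the default behavior. In this sketch on made-up data, a filter leaves gaps in the index; reset_index(drop=True) renumbers the rows, while reset_index() without it keeps the old labels in a new column named index.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle", "Ivysaur"],
        "HP": [45, 39, 44, 60],
    }
)

filtered = data[data["HP"] > 40]  # index labels are now 0, 2, 3

# Renumber from 0 and discard the old labels.
renumbered = filtered.reset_index(drop=True)

# Renumber from 0 but keep the old labels as an 'index' column.
kept = filtered.reset_index()

print(renumbered)
print(kept)
```

Both calls return new DataFrames; passing inplace=True modifies the existing one instead.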
Step 13: Use Regex for Filtering
You can filter data based on text patterns using regular expressions:
filtered_data = data[data['column_name'].str.contains('pattern', regex=True)]
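Two small pattern examples may make this concrete: an alternation that matches either of two words, and an anchored, case-insensitive match on the start of a name. The table and the patterns are made up for illustration; flags is passed through to Python's re module.

```python
import re

import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Pikachu", "Pidgey", "Charizard", "Bulbasaur"],
        "Type": ["Electric", "Normal", "Fire", "Grass"],
    }
)

# Match rows whose Type contains either "Fire" or "Grass".
fire_or_grass = data[data["Type"].str.contains("Fire|Grass", regex=True)]

# Match names that start with "pi", ignoring case.
starts_with_pi = data[
    data["Name"].str.contains("^pi[a-z]*", flags=re.IGNORECASE, regex=True)
]

print(fire_or_grass)
print(starts_with_pi)
```

If the column contains missing values, passing na=False to str.contains treats them as non-matches so the result can still be used as a filter.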
Step 14: Make Conditional Changes
Modify data conditionally using loc:
data.loc[data['column_name'] > value, 'column_name'] = new_value
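As a sketch of the pattern, the first argument to loc selects the rows (a boolean condition) and the second selects the column to overwrite. The condition can test one column while the assignment changes another. The data, the replacement label "Flame", and the Sturdy column are made up for illustration.

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "Type": ["Grass", "Fire", "Water"],
        "HP": [45, 39, 44],
    }
)

# Rename a category: rows where Type is "Fire" get a new Type value.
data.loc[data["Type"] == "Fire", "Type"] = "Flame"

# Set one column based on a condition on another column.
data["Sturdy"] = False
data.loc[data["HP"] > 40, "Sturdy"] = True

print(data)
```

Rows that do not satisfy the condition are left untouched.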
Step 15: Aggregate Statistics with Groupby
Use the groupby method to compute aggregate statistics for each group:
grouped_data = data.groupby('column_name').agg({'another_column': ['sum', 'mean', 'count']})
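Here is a small worked example of that call on made-up data. Passing a dict of lists produces two-level column names such as ('HP', 'sum'); the second call shows named aggregation, an alternative form that gives flat column names (total_hp and mean_hp are names chosen for this sketch).

```python
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "Type": ["Grass", "Fire", "Grass", "Water"],
        "HP": [45, 39, 60, 44],
    }
)

# Dict form: several statistics for one column, two-level column names.
grouped = data.groupby("Type").agg({"HP": ["sum", "mean", "count"]})

# Named aggregation: output_name=(column, function), flat column names.
summary = data.groupby("Type").agg(
    total_hp=("HP", "sum"),
    mean_hp=("HP", "mean"),
)

print(grouped)
print(summary)
```

In both results the group keys become the index, sorted alphabetically by default.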
Step 16: Work with Large Datasets
For large datasets, use the chunksize parameter to read data in smaller portions:
for chunk in pd.read_csv('large_file.csv', chunksize=1000):
    process(chunk)  # Replace with your processing function
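One possible shape for the processing step is to aggregate each chunk separately and combine the partial results at the end. The sketch below assumes a tiny in-memory CSV and a chunk size of 2 so the chunking is visible; a real large file would use a path and a much larger chunksize.

```python
import io

import pandas as pd

# Hypothetical file contents, kept in memory so no real file is needed.
rows = [("Grass", 45), ("Fire", 39), ("Grass", 60), ("Water", 44), ("Fire", 58)]
csv_text = "Type,HP\n" + "\n".join(f"{t},{hp}" for t, hp in rows)

chunk_sizes = []
partial_sums = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is an ordinary DataFrame with at most 2 rows.
    chunk_sizes.append(len(chunk))
    partial_sums.append(chunk.groupby("Type")["HP"].sum())

# Combine the per-chunk sums: same Type labels are added together.
totals = pd.concat(partial_sums).groupby(level=0).sum()

print(chunk_sizes)
print(totals)
```

This works because sums can be combined across chunks; statistics such as the mean need the sum and the count tracked separately before the final division.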
Conclusion
In this tutorial, you learned how to use the Pandas library for various data manipulation tasks. You now have the skills to install Pandas, load data from different formats, filter and sort data, and perform aggregate statistics. For further learning, explore more advanced features like the apply() function or consider taking online courses to deepen your knowledge of data science. Happy coding!