Master Databricks and Apache Spark Step by Step: Lesson 22 - PySpark Using SQL
Introduction
In this tutorial, we will explore how to use PySpark with SQL to perform data analysis on large datasets within Databricks. By leveraging SQL alongside Python, you can boost your productivity and take advantage of the powerful features of Apache Spark. This guide walks you through loading Spark DataFrames from SQL queries, performing analyses, and understanding the benefits of integrating SQL with Python.
Step 1: Understanding Spark DataFrames
- Familiarize yourself with two types of DataFrames:
- Local DataFrame: Similar to a Pandas DataFrame, this operates in memory on a single machine.
- Distributed DataFrame: This allows Spark to split large datasets across multiple nodes, enabling parallel processing.
Practical Tip
When working with large datasets, always use Distributed DataFrames to avoid memory issues that can occur with Local DataFrames.
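To see the difference in practice, here is a minimal sketch; the small example data is made up for illustration, and spark is the SparkSession that Databricks notebooks provide automatically:
# A local pandas DataFrame lives entirely in the driver's memory on one machine.
import pandas as pd
pdf_local = pd.DataFrame({"product": ["bike", "helmet"], "sales_amount": [1500.0, 45.0]})
# Converting it with spark.createDataFrame() produces a distributed Spark DataFrame
# that Spark can partition across the cluster's worker nodes.
spdf_distributed = spark.createDataFrame(pdf_local)
spdf_distributed.printSchema()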
Step 2: Setting Up Your Environment
- Ensure you have created SQL tables in Databricks as referenced in Lesson 9 of the series.
- If you are using open-source Spark with Zeppelin notebooks, make sure you have the correct setup.
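If you want to confirm that the Lesson 9 tables are in place before continuing, a quick check such as the following works; the database name aw_project is the one used in the next step:
# List the tables in the project database to verify the setup from Lesson 9
spark.sql("SHOW TABLES IN aw_project").show()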
Step 3: Switching Context to Your Database
- Use the following code to switch to your database context:
spark.sql("USE aw_project")
Step 4: Executing SQL Queries
- Execute SQL queries using PySpark. For example, to select data from the internet_sales table and handle null values, use:
spdf_sales_info = spark.sql("SELECT * FROM internet_sales WHERE sales_info IS NOT NULL")
Practical Tip
Utilize the display() function in Databricks to visualize your DataFrame results without overwhelming the system:
display(spdf_sales_info)
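Note that display() is specific to Databricks notebooks. If you are following along in Zeppelin with open-source Spark, a comparable preview looks like this:
# Preview a handful of rows without pulling the whole DataFrame to the driver
spdf_sales_info.limit(10).show(truncate=False)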
Step 5: Creating New Columns
- To create a new column based on existing columns, use the withColumn method. For example, to calculate the tax rate:
spdf_sales_info = spdf_sales_info.withColumn("tax_rate", spdf_sales_info.tax_amount / spdf_sales_info.sales_amount)
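To spot-check the calculation, you can view the new column next to its inputs; this is a quick optional check using the column names from the example above:
# Compare the inputs with the derived tax_rate column
spdf_sales_info.select("sales_amount", "tax_amount", "tax_rate").show(5)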
Step 6: Creating SQL Tables from DataFrames
- To create a new SQL table from a DataFrame, use the following:
spdf_sales_info.write.format("delta").saveAsTable("t_sales_info_extract")
Note
Not every write option behaves the same with the Delta format, so review the Delta table documentation for the options you plan to use.
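One option that does work with Delta is the write mode. If you rerun the notebook and the table already exists, saveAsTable() will raise an error unless a mode is specified; the overwrite shown here is a sketch, not a required part of the lesson:
# Overwrite the table if it already exists from an earlier run
spdf_sales_info.write.format("delta").mode("overwrite").saveAsTable("t_sales_info_extract")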
Step 7: Working with Temporary Views
- Create a temporary view for your DataFrame to allow access from different languages (e.g., R or Scala):
spdf_sales_info.createOrReplaceTempView("tv_temp_view_sales_info")
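Once the temporary view exists, any language in the same Spark session can query it. For example, from PySpark:
# Query the temporary view just like a regular table
spark.sql("SELECT COUNT(*) AS row_count FROM tv_temp_view_sales_info").show()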
Step 8: Aggregating Data
- Use SQL to perform aggregations by sales territory key. For example, to get total and average sales:
aggregated_sales = spark.sql("""
SELECT sales_territory_key,
COUNT(*) as total_sales,
AVG(sales_amount) as average_sales
FROM internet_sales
GROUP BY sales_territory_key
""")
display(aggregated_sales)
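For comparison, the same aggregation can be written with the DataFrame API instead of SQL. This sketch runs against the filtered spdf_sales_info DataFrame from Step 4, so the counts may differ slightly from the SQL query over the full internet_sales table:
from pyspark.sql import functions as F

# Equivalent aggregation using the DataFrame API
aggregated_sales_df = (
    spdf_sales_info.groupBy("sales_territory_key")
    .agg(
        F.count("*").alias("total_sales"),
        F.avg("sales_amount").alias("average_sales"),
    )
)
display(aggregated_sales_df)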
Conclusion
In this tutorial, you learned how to use PySpark with SQL effectively to analyze large datasets in Databricks. By integrating SQL queries, you can streamline your data processing workflows, apply powerful aggregation functions, and persist results as tables and views. Moving forward, explore more advanced features of PySpark and consider diving deeper into data visualization and machine learning libraries compatible with pandas DataFrames. Happy coding!