Renaming columns with Apache Spark (PySpark)
Introduction
In this post, you will learn how to rename the columns of a DataFrame with PySpark. It follows on from the previous post, where we covered reading and writing CSV files with Apache Spark (PySpark).
Code so far to follow along
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, DoubleType
from pyspark.sql.functions import *
from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/adamrichardson14/PySpark-Tutorial/main/sales.csv'
spark.sparkContext.addFile(url)

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("est_ref", IntegerType(), True),
        StructField("capacity", IntegerType(), True),
        StructField("occupancy", DoubleType(), True),
        StructField("rooms_sold", IntegerType(), True),
        StructField("avg_rate_paid", DoubleType(), True),
        StructField("sales_value", DoubleType(), True),
    ]
)

path = SparkFiles.get('sales.csv')

# I need to prepend "file://" as I'm using Databricks.
# If you are running this locally, just use the path and drop the "file://" prefix.
raw = spark.read.csv(
    "file://" + path,
    header=True,
    schema=schema,
)
Renaming a single column
Let’s take a look at how we can rename a single column with PySpark. We’re going to change the est_ref column to EstKey.
raw = raw.withColumnRenamed("est_ref", "EstKey")
Notice that we overwrite the original raw DataFrame by reassigning the result of withColumnRenamed, which returns a new DataFrame, back to the raw variable. We will do this a lot, as DataFrames are immutable.
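To see that immutability in action, here’s a minimal sketch using the columns from our sales data, showing that the original DataFrame is untouched unless you reassign it:

# withColumnRenamed returns a brand-new DataFrame; `raw` itself is unchanged
renamed = raw.withColumnRenamed("capacity", "Capacity")

print("capacity" in raw.columns)      # True - the original still has the old name
print("Capacity" in renamed.columns)  # True - the rename only exists on the new DataFrame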
Renaming multiple columns
The most common way to rename multiple columns is simply to chain withColumnRenamed(). Let’s rename some more columns and change the names to follow PascalCase format.
raw = (
    raw.withColumnRenamed("est_ref", "EstKey")
    .withColumnRenamed("date", "Date")
    .withColumnRenamed("capacity", "Capacity")
    .withColumnRenamed("occupancy", "Occupancy")
    .withColumnRenamed("rooms_sold", "RoomsSold")
    .withColumnRenamed("avg_rate_paid", "AvgRatePaid")
    .withColumnRenamed("sales_value", "SalesValue")
)
Running raw.printSchema() now shows the following:
|-- Date: date (nullable = true)
|-- EstKey: integer (nullable = true)
|-- Capacity: integer (nullable = true)
|-- Occupancy: double (nullable = true)
|-- RoomsSold: integer (nullable = true)
|-- AvgRatePaid: double (nullable = true)
|-- SalesValue: double (nullable = true)
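As an aside, another idiom you’ll often see for bulk renames uses select with alias. Here’s an equivalent sketch (not from the original post), assuming a DataFrame df that still has the original lowercase names; any column not in the mapping keeps its current name:

from pyspark.sql.functions import col

mapping = {
    "date": "Date",
    "est_ref": "EstKey",
    "capacity": "Capacity",
    "occupancy": "Occupancy",
    "rooms_sold": "RoomsSold",
    "avg_rate_paid": "AvgRatePaid",
    "sales_value": "SalesValue",
}

# Alias every column, falling back to its current name if it isn't in the mapping
df = df.select([col(c).alias(mapping.get(c, c)) for c in df.columns])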
Renaming multiple columns with a helper function
In Pandas, we have a nice function that will rename columns for us; however, we don’t have that built in to PySpark (but see the note on newer versions at the end of this section). The beauty of programming is that we can just create one! I have this helper function imported into most notebooks and use it frequently, as it’s much cleaner than chaining withColumnRenamed().
from pyspark.sql import DataFrame

def rename_columns(df: DataFrame, columns_dict: dict) -> DataFrame:
    """Rename columns according to a {current_name: new_name} mapping."""
    for old_name, new_name in columns_dict.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df
You can see this function takes a DataFrame along with a dictionary that maps each current column name to the name you want to change it to. We loop over the dictionary, call withColumnRenamed for each pair, and simply return the new DataFrame once the loop has gone through all of the items.
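If you prefer a functional style, the same helper can be written as a fold over the dictionary items. This is just an equivalent sketch, not from the original post:

from functools import reduce
from pyspark.sql import DataFrame

def rename_columns(df: DataFrame, columns_dict: dict) -> DataFrame:
    # Fold over the (old, new) pairs, threading the DataFrame through each rename
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        columns_dict.items(),
        df,
    )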
You can use the function like this to achieve the same result as we did above. I think it’s a little cleaner:
raw = rename_columns(
    raw,
    {
        "est_ref": "EstKey",
        "date": "Date",
        "capacity": "Capacity",
        "occupancy": "Occupancy",
        "rooms_sold": "RoomsSold",
        "avg_rate_paid": "AvgRatePaid",
        "sales_value": "SalesValue",
    },
)
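One note before you roll this helper out everywhere: newer versions of PySpark (3.4 and later) ship a built-in DataFrame.withColumnsRenamed that accepts the same kind of dictionary, so on recent versions you may not need the helper at all:

# Built-in since PySpark 3.4: rename several columns from a single mapping
raw = raw.withColumnsRenamed(
    {
        "est_ref": "EstKey",
        "date": "Date",
        "capacity": "Capacity",
        "occupancy": "Occupancy",
        "rooms_sold": "RoomsSold",
        "avg_rate_paid": "AvgRatePaid",
        "sales_value": "SalesValue",
    }
)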
Renaming all columns with toDF
I would still recommend the approaches above, because this method is purely positional: if the order of your columns changes, the names will silently be applied to the wrong columns. If you’re just working with some static files, though, and you want to change the names of all of the columns, this is probably the fastest way.
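To make that risk concrete, here’s a small hypothetical example where the name list is in the wrong order; toDF happily assigns the names by position:

df = spark.createDataFrame([(1, "a")], ["id", "label"])

# Names are assigned purely by position, so this silently mislabels both columns
swapped = df.toDF("label", "id")
swapped.printSchema()
# |-- label: long (nullable = true)    <- actually the id column
# |-- id: string (nullable = true)     <- actually the label column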
Step 1
Create a list with your new column names in the correct order. We’re going to change the columns back to their original names now, after making the changes above.
new_column_names = ["date", "est_ref", "capacity", "occupancy", "rooms_sold", "avg_rate_paid", "sales_value"]
Step 2
Use toDF() to rename all of the columns
raw = raw.toDF(*new_column_names)
And now we are back to where we started… What a ride!
|-- date: date (nullable = true)
|-- est_ref: integer (nullable = true)
|-- capacity: integer (nullable = true)
|-- occupancy: double (nullable = true)
|-- rooms_sold: integer (nullable = true)
|-- avg_rate_paid: double (nullable = true)
|-- sales_value: double (nullable = true)
Conclusion
To continue the course, I’m going to remove that last toDF rename, so we keep our new names for the next tutorial.
Hopefully you can now rename any columns 😃