Renaming columns with Apache Spark (PySpark)
Introduction
In this post, you will learn how to rename the columns of a DataFrame with PySpark. It follows on from the previous post, where we covered reading and writing CSV files with Apache Spark (PySpark).
Code so far to follow along
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, DoubleType
from pyspark.sql.functions import *
from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/adamrichardson14/PySpark-Tutorial/main/sales.csv'
spark.sparkContext.addFile(url)

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("est_ref", IntegerType(), True),
        StructField("capacity", IntegerType(), True),
        StructField("occupancy", DoubleType(), True),
        StructField("rooms_sold", IntegerType(), True),
        StructField("avg_rate_paid", DoubleType(), True),
        StructField("sales_value", DoubleType(), True),
    ]
)

path = SparkFiles.get('sales.csv')

# I need to prepend "file://" as I'm using Databricks.
# If you are running this locally, just use the path and drop the "file://" prefix.
raw = spark.read.csv(
    "file://" + path,
    header=True,
    schema=schema,
)
Renaming a single column
Let’s take a look at how we can rename a single column with PySpark. We’re going to change the est_ref column to EstKey.
raw = raw.withColumnRenamed("est_ref", "EstKey")
Notice that we overwrite the original raw DataFrame by reassigning the result of withColumnRenamed, which returns a new DataFrame, back to the raw variable. We will do this a lot, as DataFrames are immutable.
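To see that immutability in action, here’s a minimal sketch using the columns from our sales data, showing that the original DataFrame is untouched unless you reassign it:

# withColumnRenamed returns a brand-new DataFrame; `raw` itself is unchanged
renamed = raw.withColumnRenamed("capacity", "Capacity")

print("capacity" in raw.columns)      # True - the original still has the old name
print("Capacity" in renamed.columns)  # True - the rename only exists on the new DataFrame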
Renaming multiple columns
The most common way to rename multiple columns is simply to chain withColumnRenamed(). Let’s rename some more columns and change the names to follow PascalCase format.
raw = (
    raw.withColumnRenamed("est_ref", "EstKey")
    .withColumnRenamed("date", "Date")
    .withColumnRenamed("capacity", "Capacity")
    .withColumnRenamed("occupancy", "Occupancy")
    .withColumnRenamed("rooms_sold", "RoomsSold")
    .withColumnRenamed("avg_rate_paid", "AvgRatePaid")
    .withColumnRenamed("sales_value", "SalesValue")
)
Running raw.printSchema() now shows the following:
|-- Date: date (nullable = true)
|-- EstKey: integer (nullable = true)
|-- Capacity: integer (nullable = true)
|-- Occupancy: double (nullable = true)
|-- RoomsSold: integer (nullable = true)
|-- AvgRatePaid: double (nullable = true)
|-- SalesValue: double (nullable = true)
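As an aside, another idiom you’ll often see for bulk renames uses select with alias. Here’s an equivalent sketch (not from the original post), assuming a DataFrame df that still has the original lowercase names; any column not in the mapping keeps its current name:

from pyspark.sql.functions import col

mapping = {
    "date": "Date",
    "est_ref": "EstKey",
    "capacity": "Capacity",
    "occupancy": "Occupancy",
    "rooms_sold": "RoomsSold",
    "avg_rate_paid": "AvgRatePaid",
    "sales_value": "SalesValue",
}

# Alias every column, falling back to its current name if it isn't in the mapping
df = df.select([col(c).alias(mapping.get(c, c)) for c in df.columns])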
Renaming multiple columns with a helper function
In Pandas, we have a nice function that will rename columns for us; however, we don’t have that built in to PySpark (but see the note on newer versions at the end of this section). The beauty of programming is that we can just create one! I have this helper function imported into most notebooks and use it frequently, as it’s much cleaner than chaining withColumnRenamed().
from pyspark.sql import DataFrame

def rename_columns(df: DataFrame, columns_dict: dict) -> DataFrame:
    """Rename columns according to a {current_name: new_name} mapping."""
    for old_name, new_name in columns_dict.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df
You can see this function takes a DataFrame along with a dictionary that maps each current column name to the name you want to change it to. We loop over the dictionary, call withColumnRenamed for each pair, and simply return the new DataFrame once the loop has gone through all of the items.
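If you prefer a functional style, the same helper can be written as a fold over the dictionary items. This is just an equivalent sketch, not from the original post:

from functools import reduce
from pyspark.sql import DataFrame

def rename_columns(df: DataFrame, columns_dict: dict) -> DataFrame:
    # Fold over the (old, new) pairs, threading the DataFrame through each rename
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        columns_dict.items(),
        df,
    )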
You can use the function like this to achieve the same result as we did above. I think it’s a little cleaner:
raw = rename_columns(
    raw,
    {
        "est_ref": "EstKey",
        "date": "Date",
        "capacity": "Capacity",
        "occupancy": "Occupancy",
        "rooms_sold": "RoomsSold",
        "avg_rate_paid": "AvgRatePaid",
        "sales_value": "SalesValue",
    },
)
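One note before you roll this helper out everywhere: newer versions of PySpark (3.4 and later) ship a built-in DataFrame.withColumnsRenamed that accepts the same kind of dictionary, so on recent versions you may not need the helper at all:

# Built-in since PySpark 3.4: rename several columns from a single mapping
raw = raw.withColumnsRenamed(
    {
        "est_ref": "EstKey",
        "date": "Date",
        "capacity": "Capacity",
        "occupancy": "Occupancy",
        "rooms_sold": "RoomsSold",
        "avg_rate_paid": "AvgRatePaid",
        "sales_value": "SalesValue",
    }
)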
Renaming all columns with toDF
I would still recommend the approaches above, because this method is purely positional: if the order of your columns changes, the names will silently be applied to the wrong columns. If you’re just working with some static files, though, and you want to change the names of all of the columns, this is probably the fastest way.
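To make that risk concrete, here’s a small hypothetical example where the name list is in the wrong order; toDF happily assigns the names by position:

df = spark.createDataFrame([(1, "a")], ["id", "label"])

# Names are assigned purely by position, so this silently mislabels both columns
swapped = df.toDF("label", "id")
swapped.printSchema()
# |-- label: long (nullable = true)    <- actually the id column
# |-- id: string (nullable = true)     <- actually the label column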
Step 1
Create a list with your new column names in the correct order. We’re going to change the columns back to their original names now, after making the changes above.
new_column_names = ["date", "est_ref", "capacity", "occupancy", "rooms_sold", "avg_rate_paid", "sales_value"]
Step 2
Use toDF() to rename all of the columns
raw = raw.toDF(*new_column_names)
And now we are back to where we started… What a ride!
|-- date: date (nullable = true)
|-- est_ref: integer (nullable = true)
|-- capacity: integer (nullable = true)
|-- occupancy: double (nullable = true)
|-- rooms_sold: integer (nullable = true)
|-- avg_rate_paid: double (nullable = true)
|-- sales_value: double (nullable = true)
Conclusion
To continue the course, I’m going to remove that last toDF rename, so we keep our new names for the next tutorial.
Hopefully you can now rename any columns 😃