Handling Null Values in Python Pandas

Intro

In this article we are now looking at how we handle null values in our data. Null or missing values can often be found in real-world datasets, and it is important to properly handle these values in order to accurately analyze and manipulate the data. Null values can arise for a variety of reasons, such as incomplete data collection or data corruption. If left unhandled, null values can introduce errors or biases into your analysis and lead to incorrect or misleading results. Therefore, it is essential to identify and address null values before proceeding with any data analysis tasks.

We just finished correcting all of our data types in this series. You can read about that here.

Working with Python Pandas Data Types

Learn about the different data types in Pandas, the powerful Python library for data analysis. Discover how to work with numerical, categorical, and textual data, as well as dates and times.

Checking For Null Values

We can quite simply count Null Values within a pandas dataframe to see what we’re working with.

nullCounts = raw.isnull().sum()
print(nullCounts)

I’ve created a new block within the Jupyter Notebook for this code. You should have something that looks like this now.

Pandas Null Values

You can see our result is great. We have 0 null values. Not great for our example, however we should be handling null values as a best practice anyway. Depending on the analysis that we’re doing, we might want to handle these in different ways, and we will go through them.

Result

Column	Null Count
date	0
estKey	0
capacity	0
occupancy	0
roomsSold	0
avgRate	0
salesValue	0

Removing Null Values

If we have columns that are extremely important, then we might want to remove rows of data where we have a null value in certain columns. In this example, we really can’t do anything with the data if it doesn’t have a date, or an estKey

Using `dropna()` to remove values

nullCounts = raw.isnull().sum()

clean = raw.dropna(subset=["date", "estKey"])

clean.head()

Notice I’ve assigned a new variable clean. Our data is not yet clean, however now that we are manipulating the data, removing rows etc it is no longer raw. Our raw data with the correct columns and types may be useful, and we will leave it so that it is available in the future.

An example of this being useful is that maybe these values are not useful for our analysis, but if you’re looking at raw sales data, and you can’t attribute sales to a date or establishment, it will still be good to see the volume of sales we’re talking about here, as there could be impact of this.

Stages of data cleanliness

The stages of cleaning and preparing data are extremely important, and actually we are going through them as part of this course. You can read more about them here

5 Steps to Clean and Normalize Your Data for Analysis

Effective data analysis relies on clean, consistent, and well-structured data. In this post, we explore the different stages of data cleanliness, from raw data to production-ready data

Calculating Null Values

There are times where maybe we can calculate our Null values by using other columns we have available. In our example, the avgRate can be calculated by taking salesValue / roomsSold. Occupancy can be calculated by taking capacity / roomsSold

It’s unlikely we would have null values like this and then also the columns to correct them in the same dataset. You will usually need to join other tables to make this happen, and we will cover joins soon. For this tutorial though, let’s implement the code to make this happen. The first thing I’m going to do is extend our dropna() to also include the columns, salesValue, roomsSold and capacity. If we don’t have these values then we can not calculate the others.

Adding to the `dropna()`

nullCounts = raw.isnull().sum()

clean = raw.dropna(subset=["date", "estKey", "roomsSold", "capacity", "salesValue"])

clean.head()

Calculating values

Now we can use the following code to replace Null values in average rate and occupancy. I’m ensuring to round the value to 2 decimal places so that it’s consistent with the rest of the rows of data.

nullCounts = raw.isnull().sum()

clean = raw.dropna(subset=["date", "estKey", "roomsSold", "capacity", "salesValue"])

if clean["avgRate"].isnull().any():
  clean["avgRate"] = clean["salesValue"] / clean["roomsSold"]
  clean["avgRate"] = clean["avgRate"].round(2)

if clean["occupancy"].isnull().any():
  clean["occupancy"] = clean["roomsSold"] / clean["capacity"]
  clean["occupancy"] = clean["occupancy"].round(2)

clean.head()

Breaking this down

The isnull function is called on the “occupancy” column of the “clean” dataframe. This returns a new series of boolean values, where each value is True if the corresponding element in the “occupancy” column is null, and False otherwise.
The any function is called on the resulting series of boolean values. This returns a single boolean value that is True if any of the values in the series are True, and False otherwise.
The if statement checks if the boolean value returned by the any function is True. If it is, the code inside the if statement will be executed. If it is False, the code inside the if statement will be skipped.

It’s OK to do this here, as we are calculating the avgRate exactly. Sometimes we can’t quite get the exact number like we can here and it’s important to understand the impact it could have on your analysis of using an estimate. You have to decide is it better to work with slightly incorrect data or an estimate, or to simply not include that data at all.

Conclusion

We have only covered a couple of methods here. There are other ways of handling null values however these are the most common ones.

Getting started / Introduction to Jupyter notebooks

In this guide, we will go over some key functionality of Jupyter notebooks that are essential to learn.