Replacing Values in a Pandas DataFrame with the replace() Function
Replacing values in a column
In pandas, the replace() function can be used to replace values in a column of a dataframe. When working with large datasets, it is common to come across null or missing values, and the replace() function can come in handy for dealing with them.
To replace a specific value in a column, we can use the following syntax:
df['column_name'] = df['column_name'].replace(old_value, new_value)
For example, suppose we have a dataframe about fruits:
import pandas as pd
data = {'fruits': ['banana', 'orange', 'apple', 'kiwi', 'mango', 'apple']}
df = pd.DataFrame(data)
If we want to replace all ‘apple’ values with ‘pear’, we can do it like this:
df['fruits'] = df['fruits'].replace('apple', 'pear')
After this assignment, every ‘apple’ value in the ‘fruits’ column has been replaced with ‘pear’.
We can also replace multiple values at once using the replace() function, by passing in a list of old values and a list of new values. The two lists must have the same length, and values are matched by position.
Here is an example:
df['fruits'] = df['fruits'].replace(['banana', 'mango'], ['pineapple', 'melon'])
This will replace ‘banana’ with ‘pineapple’, and ‘mango’ with ‘melon’.
It is important to note that replace() returns a new object with the values replaced and does not modify the original. That is why the examples above assign the result back to df['fruits'].
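To make that behaviour concrete, here is a minimal sketch using the fruit dataframe from above. The variable names `replaced` and `unchanged` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['banana', 'orange', 'apple', 'kiwi', 'mango', 'apple']})

# replace() hands back a new Series; the dataframe is not touched yet.
replaced = df['fruits'].replace('apple', 'pear')

# Snapshot of the column after the call: it still contains 'apple'.
unchanged = df['fruits'].tolist()

print(unchanged)
print(replaced.tolist())

# Assigning the result back is what actually updates the dataframe.
df['fruits'] = replaced
```

If the result is never assigned (or otherwise stored), the replacement is simply discarded.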
In conclusion, the replace() function of pandas is a powerful tool for replacing values in a column. It can substitute single or multiple values, making cleanup of data simple and efficient.
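Since missing values were mentioned at the start of this section, here is a small sketch of how replace() might be used with them. The data and the 'N/A' placeholder string are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'N/A' is a placeholder string standing in for a missing value.
df = pd.DataFrame({'fruits': ['banana', 'N/A', 'apple', 'N/A']})

# Turn the placeholder into a real missing value (NaN)...
df['fruits'] = df['fruits'].replace('N/A', np.nan)
print(df['fruits'].isna().sum())

# ...or go the other way and give missing values an explicit label.
labelled = df['fruits'].replace(np.nan, 'unknown')
print(labelled.tolist())
```

For the second direction, fillna() is the more common choice in pandas; replace() is shown here only to stay with the topic of this article.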
Replacing with dictionaries
In pandas, we can use Python dictionaries to replace values in a dataframe, which can be especially useful when replacing multiple values.
To replace values with dictionaries, we can use the following syntax:
df = df.replace({'column_name': {old_value_1: new_value_1, old_value_2: new_value_2}})
Let’s take the same example of our fruit dataframe from the previous section. Say we want to replace ‘banana’ with ‘pineapple’ and ‘apple’ with ‘peach’. We can use a dictionary to achieve this:
data = {'fruits': ['banana', 'orange', 'apple', 'kiwi', 'mango', 'apple']}
df = pd.DataFrame(data)
dict_fruits = {'banana': 'pineapple', 'apple': 'peach'}
df = df.replace({'fruits': dict_fruits})
After applying the replace() function with the dictionary, the resulting dataframe will have ‘banana’ replaced with ‘pineapple’, and ‘apple’ replaced with ‘peach’.
The dictionary can be created manually as in the example above, or it can be created dynamically based on the values to be replaced.
For example, suppose we have another dataframe about employees, and we want to set every employee's status to ‘inactive’. We can create a dictionary dynamically by finding all distinct values in the status column, and then mapping each of them to the new status value:
import pandas as pd
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'status': ['active', 'active', 'inactive', 'active']
}
df = pd.DataFrame(data)
status_dict = {k: 'inactive' for k in df.status.unique()}
df = df.replace({'status': status_dict})
Because every distinct status is mapped to ‘inactive’, every row in the status column ends up as ‘inactive’, including the rows that were ‘active’.
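The same comprehension idea also works when only some values should change. The sketch below assumes a hypothetical status column with inconsistent capitalisation and builds the mapping from the data itself, including only the values that actually need to change.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalisation in 'status'.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'status': ['Active', 'ACTIVE', 'inactive', 'active'],
})

# Map each distinct value to its lowercase form, skipping values
# that are already lowercase so the dictionary stays minimal.
status_dict = {k: k.lower() for k in df['status'].unique() if k != k.lower()}
print(status_dict)

df = df.replace({'status': status_dict})
print(df['status'].tolist())
```

Values that do not appear as keys in the dictionary, such as ‘inactive’ here, are left as they are.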
In conclusion, using dictionaries to replace values in pandas dataframes is an efficient solution when replacing multiple values. The mapping is easy to read, and it can be built dynamically from the data.
Replacing with regex
In pandas, we can use regular expressions (regex) to replace values in a dataframe that match a particular pattern. This can be useful for cleaning up data when there is a certain pattern in the values that needs to be changed.
We can use the replace() function with regular expressions to replace the desired values. The syntax for replacing with regex is as follows:
df['column_name'] = df['column_name'].replace({'regex_pattern': 'new_value'}, regex=True)
For example, let’s consider a dataframe with a ‘phone_number’ column, where the values are in the format ‘(XXX) XXX-XXXX’. If we want to replace all the brackets and hyphens with spaces, we can use the following regex:
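df['phone_number'] = df['phone_number'].replace({r'[()\-]': ' '}, regex=True)
The [()\-] character class matches every opening parenthesis, closing parenthesis, and hyphen, and each match is replaced with a space. Parentheses do not need escaping inside a character class, and writing the pattern as a raw string (r'...') keeps Python from treating the backslash as the start of an escape sequence.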
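As a quick check of what this produces, here is a sketch with two made-up phone numbers. Because each matched character becomes a space, the result has a leading space and a doubled space. The second step is an optional tidy-up and not part of the replacement itself.

```python
import pandas as pd

# Hypothetical sample data in the '(XXX) XXX-XXXX' format.
df = pd.DataFrame({'phone_number': ['(555) 123-4567', '(555) 987-6543']})

# Replace every parenthesis and hyphen with a single space.
df['phone_number'] = df['phone_number'].replace({r'[()\-]': ' '}, regex=True)
print(df['phone_number'].tolist())

# Optional: collapse runs of whitespace and trim the ends.
tidy = df['phone_number'].str.split().str.join(' ')
print(tidy.tolist())
```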
We can also use regex to replace a pattern with text captured from the original string. For example, suppose we have a column named ‘date’ with dates in the format ‘MM/DD/YYYY’, and we want to keep only the last two digits of the year. We can use the str.replace() method with capture groups to accomplish this as follows:
df['date'] = df['date'].str.replace(r'(\d{2}/\d{2}/)\d{2}(\d{2})', r'\1\2', regex=True)
In this example, the first group (\d{2}/\d{2}/) captures the MM/DD/ prefix, the following \d{2} matches the first two digits of the year, and the second group (\d{2}) captures the last two. The replacement \1\2 keeps both groups and drops the two digits in between, turning MM/DD/YYYY into MM/DD/YY.
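One way to keep the last two digits of the year is to capture them in a second group and refer to both groups in the replacement. The sketch below runs that idea on two made-up dates.

```python
import pandas as pd

# Hypothetical sample data in MM/DD/YYYY format.
df = pd.DataFrame({'date': ['01/15/2023', '12/31/1999']})

# Group 1: 'MM/DD/'. Group 2: last two digits of the year.
# The two digits between the groups (the century) are matched but not kept.
df['date'] = df['date'].str.replace(
    r'(\d{2}/\d{2}/)\d{2}(\d{2})', r'\1\2', regex=True
)
print(df['date'].tolist())
```

Values that do not match the pattern are left unchanged, so it is worth checking the column for other formats before relying on this.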
In conclusion, using regex to replace values in pandas dataframes offers a flexible way of replacing values based on patterns in the data. By using regex patterns, you can identify and replace data accurately and efficiently.
Summary
The replace() function in pandas is a powerful tool for replacing values in dataframes. It is especially useful when working with missing or null values, or when large-scale replacements are needed. This article covered three different ways to use the replace() function in pandas:
- Replacing values in a column
- Replacing with dictionaries
- Replacing with regex
By providing examples for each method, we have demonstrated the versatility and efficiency of this function. The next time you need to manipulate data in pandas, remember to use the replace() function to make cleaning and reformatting your data a breeze.