Eliminate Duplicate Rows with pandas drop_duplicates() Function
Removing duplicate rows from a pandas DataFrame
Using drop_duplicates() Function
The drop_duplicates() function in pandas is a very useful and popular method for removing duplicate rows in a DataFrame. It helps in data cleaning and analysis by eliminating redundant rows, which can significantly improve the accuracy of data analysis. Let’s dive into this concept in more detail.
To start with, let us create a sample dataset that we will use throughout this article. In this example, we have a DataFrame with multiple rows and a few columns.
import pandas as pd

# Sample data with intentionally repeated rows
df = pd.DataFrame({
    'Product': ['Mobile', 'Laptop', 'TV', 'TV', 'Mobile', 'Laptop'],
    'Brand': ['Samsung', 'Dell', 'LG', 'LG', 'Samsung', 'Dell']
})
print("Original DataFrame:")
print(df)
The output of the above code block should be a DataFrame with 6 rows and 2 columns:
Original DataFrame:
Product Brand
0 Mobile Samsung
1 Laptop Dell
2 TV LG
3 TV LG
4 Mobile Samsung
5 Laptop Dell
As you can see, this dataset contains duplicate rows: the "Mobile", "Laptop", and "TV" rows each appear twice. We can use the drop_duplicates() function to remove these duplicate rows.
df = df.drop_duplicates()
print("DataFrame after removing duplicates:")
print(df)
The output of the above code block should now give us a DataFrame without duplicate rows:
DataFrame after removing duplicates:
Product Brand
0 Mobile Samsung
1 Laptop Dell
2 TV LG
Here we can observe that the duplicate rows, rows 3, 4, and 5, have been removed, and we are left with only the unique rows. By default, drop_duplicates() treats a row as a duplicate when it matches another row in every column, and it keeps the first occurrence of each group.
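Before dropping anything, it can also help to see exactly which rows pandas would flag. The related duplicated() method returns a boolean mask marking every row that repeats an earlier one. Here is a small sketch, re-creating the original six-row DataFrame since we overwrote df above:

import pandas as pd

df = pd.DataFrame({
    'Product': ['Mobile', 'Laptop', 'TV', 'TV', 'Mobile', 'Laptop'],
    'Brand': ['Samsung', 'Dell', 'LG', 'LG', 'Samsung', 'Dell']
})

# duplicated() flags each row that repeats an earlier row
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# 5     True
# dtype: bool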
The drop_duplicates() function is a very handy method to use, and you can use it with its parameters to get more control over exactly what to drop. We can pass several parameters to this function depending on our requirements, such as:
- keep: With this parameter, we can specify whether to keep the first, last, or none of the duplicate rows. For example, if we set keep='first' (the default), the first occurrence of a duplicate row is retained and all subsequent duplicates are removed. Similarly, with keep='last', only the last occurrence of each duplicate row is retained, and keep=False removes every occurrence.
- subset: This parameter takes a list of column names and identifies duplicates based only on those columns. For example, if we set subset=['Product'], only the first row for each distinct Product value is kept.
We can also use the inplace parameter to modify the original DataFrame in place instead of creating a new one, as shown in the sketch below.
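To make these options concrete, here is a brief sketch, re-creating the sample DataFrame from above, that shows keep=False (which drops every member of a duplicate group) and inplace=True:

import pandas as pd

df = pd.DataFrame({
    'Product': ['Mobile', 'Laptop', 'TV', 'TV', 'Mobile', 'Laptop'],
    'Brand': ['Samsung', 'Dell', 'LG', 'LG', 'Samsung', 'Dell']
})

# keep=False keeps no member of any duplicate group; since every row
# here repeats, the result is an empty DataFrame
print(df.drop_duplicates(keep=False))

# inplace=True modifies df directly instead of returning a new copy
df.drop_duplicates(inplace=True)
print(df)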
Overall, the drop_duplicates() function is a simple and powerful method to remove duplicate rows in a pandas DataFrame. By using this function, we can easily clean our dataset and prepare it for accurate data analysis.
Example of drop_duplicates() function
The drop_duplicates() function in pandas is a useful method for cleaning up your dataset by removing any duplicate rows. Let’s dive into an example where we can apply this function to better understand how it works.
Let’s consider a dataset of employees in a company, where each row represents a unique employee. However, due to some system issues, there might be some duplicate entries that we want to remove. Here’s an example of such a dataset:
| Employee ID | First Name | Last Name | Age | Gender |
|---|---|---|---|---|
| 101 | John | Smith | 25 | Male |
| 101 | John | Smith | 25 | Male |
| 102 | Emily | Green | 35 | Female |
| 103 | John | Smith | 30 | Male |
As we can see from the table, rows 1 and 2 are identical, so row 2 is a true duplicate. Rows 1 and 4 share the same First Name, Last Name, and Gender, but their Employee ID and Age differ, so by default they are not considered duplicates. We can use the drop_duplicates() function to remove the fully identical rows.
Let’s write some code to load the above data into a pandas DataFrame and apply the drop_duplicates() method to remove the duplicates.
import pandas as pd

# Build the employee table directly (equivalently, it could be read
# from a CSV file with pd.read_csv("employee_data.csv"))
df = pd.DataFrame({
    'Employee ID': [101, 101, 102, 103],
    'First Name': ['John', 'John', 'Emily', 'John'],
    'Last Name': ['Smith', 'Smith', 'Green', 'Smith'],
    'Age': [25, 25, 35, 30],
    'Gender': ['Male', 'Male', 'Female', 'Male']
})
print("Original DataFrame:")
print(df)

df = df.drop_duplicates()
print("DataFrame after removing duplicates:")
print(df)
The above code should output:
Original DataFrame:
Employee ID First Name Last Name Age Gender
0 101 John Smith 25 Male
1 101 John Smith 25 Male
2 102 Emily Green 35 Female
3 103 John Smith 30 Male
DataFrame after removing duplicates:
Employee ID First Name Last Name Age Gender
0 101 John Smith 25 Male
2 102 Emily Green 35 Female
3 103 John Smith 30 Male
As we can see from the above outputs, the original DataFrame had duplicate rows, but after applying the drop_duplicates() function on the DataFrame, duplicate rows were removed, thereby resulting in a clean DataFrame.
We can see from this example that the drop_duplicates() function can come in handy when working with datasets with duplicate rows. With a few lines of code, we can easily have a dataset that is clean and free from redundancies.
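Building on the observation above: if our cleaning rule were that matching First Name, Last Name, and Gender is enough to call two records duplicates (an assumption about this data, not a pandas default), we could pass those columns as the subset. A sketch under that assumption:

# Treat rows as duplicates when name and gender match, ignoring ID and Age
print(df.drop_duplicates(subset=['First Name', 'Last Name', 'Gender']))
#    Employee ID First Name Last Name  Age  Gender
# 0          101       John     Smith   25    Male
# 2          102      Emily     Green   35  Female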
Customizing drop_duplicates() in pandas
The drop_duplicates() function in pandas is a very versatile function that provides several options to customize the removal of duplicates based on specific conditions. Let’s take a look at how we can customize the function to better suit our data cleaning needs.
In the simplest form, the drop_duplicates() function will remove all rows where all columns have the same values. However, there are several parameters that we can use to further customize the function.
One key parameter is the keep parameter. It determines which occurrence of the duplicate rows to keep. By default, the function will keep the first occurrence of the row and remove the rest. Here is an example where we set the keep parameter to ‘last’:
df.drop_duplicates(keep='last', inplace=True)
This will remove all duplicate rows but keep the last occurrence of each duplicate group, so the resulting DataFrame contains only unique rows, namely the last occurrences.
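For instance, applied to the six-row product DataFrame from the first section, keep='last' retains the rows at index labels 3, 4, and 5 instead of 0, 1, and 2:

import pandas as pd

df = pd.DataFrame({
    'Product': ['Mobile', 'Laptop', 'TV', 'TV', 'Mobile', 'Laptop'],
    'Brand': ['Samsung', 'Dell', 'LG', 'LG', 'Samsung', 'Dell']
})

# The last occurrence of each duplicate group survives
print(df.drop_duplicates(keep='last'))
#   Product    Brand
# 3      TV       LG
# 4  Mobile  Samsung
# 5  Laptop     Dell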
Another parameter that is often used is the subset parameter. It allows us to identify duplicates based only on specific column(s). For example, if some columns should be ignored when checking for duplicates, we can pass just the relevant columns to subset. Here is an example where we only consider the 'Product' column:
df.drop_duplicates(subset=['Product'], inplace=True)
This will remove all rows that have duplicate values in the 'Product' column, keeping the first occurrence of each product and ignoring the values in the other columns.
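To see the difference subset makes, consider a hypothetical variant of the product data where "TV" appears under two different brands; checking only the 'Product' column drops the second TV row even though the full rows differ:

import pandas as pd

# Hypothetical variant: the same product listed under two brands
df = pd.DataFrame({
    'Product': ['Mobile', 'Laptop', 'TV', 'TV'],
    'Brand': ['Samsung', 'Dell', 'LG', 'Sony']
})

# Only the 'Product' column is checked, so the Sony TV row is dropped
print(df.drop_duplicates(subset=['Product']))
#   Product    Brand
# 0  Mobile  Samsung
# 1  Laptop     Dell
# 2      TV       LG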
Finally, we can also write a small helper function around duplicate removal when the built-in parameters are not enough. For instance, suppose we want to treat rows as duplicates when their numeric values are merely close rather than exactly equal. One approach (a sketch, not a built-in pandas feature) is to round the values to a chosen threshold and then drop the rows that become identical:
def custom_duplicates(df, subset=None, threshold=0.01):
    # Consider only the requested columns (all columns by default)
    cols = list(subset) if subset is not None else list(df.columns)
    # Round each value to the nearest multiple of the threshold; rows whose
    # rounded values all match are treated as near-duplicates
    rounded = (df[cols] / threshold).round()
    # Keep the first occurrence of each near-duplicate group
    return df[~rounded.duplicated(keep='first')]
This custom_duplicates() function treats two rows as duplicates when their values in the chosen columns are identical after rounding to the threshold, keeping the first of each such group. We can use this function as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.random.normal(0, 1, 10),
    'B': np.random.normal(0, 1, 10),
    'C': np.random.normal(0, 1, 10),
})
print(df)

df = custom_duplicates(df, threshold=0.01)
print(df)
In this example, we generate a random DataFrame and then use our custom_duplicates() function to drop any row whose values all match an earlier row's values once rounded to the nearest 0.01. With only ten random rows, such near-matches are unlikely, so the output will usually be unchanged.
In conclusion, the drop_duplicates() function in pandas is a powerful method for removing duplicate rows from a DataFrame. By using the various parameters available, we can customize the function to cater to our specific data cleaning needs, resulting in a clean and accurate DataFrame for data analysis.
Summary
In this article, we discussed the drop_duplicates() function in pandas, which is an incredibly useful method for removing duplicate rows in a DataFrame. We showed how you can use this function to easily clean up your datasets and prepare them for accurate data analysis. We also explored some examples and different parameters we can pass to the function to customize its behavior.
Using the drop_duplicates() function can greatly improve the accuracy of your data analysis and save you precious time. By taking advantage of the various options available with this function, you can ensure that your dataset is clean and free of duplicates, allowing you to perform better data analysis.