Transform Your Data into Actionable Insights with These Handy Pandas Techniques
In today’s post, we’ll be discussing how to use the powerful Python library, Pandas, to transform your data into actionable insights. Pandas is a widely used tool for data manipulation and analysis, and is particularly useful for working with large datasets. Whether you’re a beginner or an experienced developer, this post will provide you with some handy techniques for using Pandas to make the most of your data. So let’s get started!
Selecting columns
Pandas provides several ways to select columns of a DataFrame. The most common is square bracket notation, which lets us specify the names of the columns we want to select. Here’s an example of selecting the “Name” and “Age” columns of a DataFrame called df:
import pandas as pd

# Select the Name and Age columns
selected_columns = df[["Name", "Age"]]
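One detail worth noting: single brackets return a Series, while a list inside the brackets returns a DataFrame, even when it names only one column. A quick sketch using the same hypothetical df:
# Single brackets give a pandas Series
names_series = df["Name"]
# Double brackets (a list of column names) give a one-column DataFrame
names_frame = df[["Name"]]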
Another way to select columns is the .loc indexer, which lets us select columns using label-based indexing. Here’s an example selecting the same columns as above:
# Select the Name and Age columns using the .loc method
selected_columns = df.loc[:, ["Name", "Age"]]
Sorting data
Pandas provides several ways to sort the data in a DataFrame. One of the most common is the .sort_values() method, which lets us specify the column or columns to sort by, as well as the sort order (ascending or descending). Here’s an example of sorting a DataFrame called df by the “Name” column in ascending order:
# Sort the data by the Name column in ascending order
sorted_df = df.sort_values("Name")
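Since the prose above mentions both sort orders, here is a short sketch of sorting in descending order, and by multiple columns at once:
# Sort by the Name column in descending order
sorted_desc = df.sort_values("Name", ascending=False)
# Sort by Age first, then Name, with a different order per column
sorted_multi = df.sort_values(["Age", "Name"], ascending=[True, False])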
Another way to sort is the .sort_index() method, which sorts the data by the row labels (index) of the DataFrame. Here’s an example of sorting df by its row labels in ascending order:
# Sort the data by the row labels (index) in ascending order
sorted_df = df.sort_index()
Filtering Rows Based on Conditions
To filter rows in a pandas DataFrame based on one or more conditions, you can use the DataFrame.query() method.
Here is an example that filters the rows of a DataFrame based on the values in the “Name” and “Age” columns:
# create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Alice', 'Bob'],
        'Age': [20, 30, 40, 50]}
df = pd.DataFrame(data)
# filter rows based on conditions
df = df.query('Age > 25 and Name == "Jane"')
# display the filtered DataFrame
print(df)
This code filters the rows of the DataFrame by keeping only those rows where the value in the “Age” column is greater than 25 and the value in the “Name” column is “Jane”. The resulting DataFrame will have only one row:
   Name  Age
1  Jane   30
You can use any valid Python expression in the DataFrame.query() method, including logical operators, comparison operators, and functions. Column names are referenced directly by name in the expression, while the @ symbol references Python variables from the surrounding scope. For example, the following code is equivalent to the code above, with the age threshold held in a local variable:
# filter rows based on conditions, using @ to reference a local variable
min_age = 25
df = df.query('Age > @min_age and Name == "Jane"')
# display the filtered DataFrame
print(df)
You can also use boolean indexing with bracket notation to filter rows in a pandas DataFrame:
# filter rows based on conditions
mask = (df['Age'] > 25) & (df['Name'] == 'Jane')
df = df[mask]
Adding New Columns
In pandas, we can add new columns to a DataFrame using the assign() method, which lets us create new columns based on the values of existing ones.
Here is an example of adding a new column to a DataFrame:
# Create a DataFrame with some example data
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})
# Add a new column "C" from the values in columns "A" and "B"
df = df.assign(C=lambda x: x["A"] + x["B"])
In this example, we create a DataFrame with columns “A” and “B” containing numeric values, then use the assign() method to create a new column “C” that is the sum of “A” and “B”. The result is a new DataFrame with three columns: “A”, “B”, and “C”.
Changing Column Types
In pandas, we can change the data type of a column using the astype() method, which converts the values in a column from one data type to another.
Here is an example of changing the data type of a column in a DataFrame:
# Create a DataFrame with some example data
df = pd.DataFrame({
    "A": ["1", "2", "3"],
    "B": ["4", "5", "6"]
})
# Change the data type of column "A" from string to integer
df["A"] = df["A"].astype(int)
In this example, we create a DataFrame with columns “A” and “B” containing string values, then use the astype() method to convert column “A” from strings to integers. The DataFrame keeps the same columns, but the values in “A” are now of the integer data type.
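If a column might contain strings that are not valid numbers, astype(int) will raise an error. One alternative is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN instead:
# Convert to numbers; invalid strings become NaN rather than raising an error
df["A"] = pd.to_numeric(df["A"], errors="coerce")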
Convert String to Date and Back
Here is an example of converting a column of strings to dates and back again:
# Create a DataFrame with a column of strings in the "YYYY-MM-DD" format
df = pd.DataFrame({
    "date": ["2022-01-01", "2022-02-01", "2022-03-01"]
})
# Convert the strings to datetime values
df["date"] = pd.to_datetime(df["date"])
# Convert the datetime values back to strings in the "YYYY-MM-DD" format
df["date"] = df["date"].dt.strftime("%Y-%m-%d")
In this example, we create a DataFrame with a “date” column containing strings in the “YYYY-MM-DD” format. We use the to_datetime() function to convert these values to the datetime data type, then the dt.strftime() method to convert them back to strings in the same format. The column ends up with the same values, but its data type has gone from string to datetime and back again.
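When the format is known in advance, you can pass it explicitly, which makes the intent clear and can speed up parsing; adding errors="coerce" turns unparseable strings into NaT (the datetime equivalent of NaN):
# Parse with an explicit format; invalid strings become NaT instead of raising
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")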
Renaming Columns
In pandas, we can rename one or more columns in a DataFrame using the rename() method, which lets us specify new names for the columns.
Here is an example of renaming columns in a DataFrame:
# Create a DataFrame with some example data
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})
# Rename the columns
df = df.rename(columns={"A": "X", "B": "Y"})
In this example, we create a DataFrame with columns “A” and “B” containing numeric values, then use the rename() method to give them the new names “X” and “Y”. The result is a new DataFrame with the same data, but with the columns renamed.
Removing Duplicate Rows
In pandas, we can remove duplicate rows from a DataFrame using the drop_duplicates() method, which returns a new DataFrame with the duplicate rows removed.
Here is an example of removing duplicate rows from a DataFrame:
# Create a DataFrame with some example data
df = pd.DataFrame({
    "A": [1, 2, 2, 3, 4, 4, 5],
    "B": [1, 1, 2, 2, 3, 3, 4]
})
# Remove duplicate rows
df = df.drop_duplicates()
In this example, we create a DataFrame with columns “A” and “B” in which some rows are duplicates. We then use the drop_duplicates() method to remove them, leaving a new DataFrame with only the unique rows.
Removing duplicate rows with a subset of columns
To remove duplicate rows based on a subset of columns, we can pass the subset parameter to DataFrame.drop_duplicates(). The method then returns a new DataFrame with duplicates removed, judged only on the columns we specify. Here is an example that removes duplicates based on the values in the "column1" and "column2" columns of a DataFrame:
# Create a DataFrame with some duplicate rows
df = pd.DataFrame({
    "column1": ["a", "b", "c", "a", "b", "c"],
    "column2": [1, 2, 3, 1, 2, 3],
    "column3": [4, 5, 6, 4, 5, 6]
})
# Remove duplicate rows based on the values in columns "column1" and "column2"
df = df.drop_duplicates(subset=["column1", "column2"])
In this example, the original DataFrame had six rows, three of which were duplicates based on the values in "column1" and "column2". After drop_duplicates(), the resulting DataFrame has only three rows, each unique with respect to those two columns.
Note that we can also use the keep parameter of drop_duplicates() to specify which of the duplicate rows to keep. With keep="first", the method keeps the first occurrence of each duplicate row and removes the rest; with keep="last", it keeps the last occurrence instead. Here is an example of both:
# Create a DataFrame with some duplicate rows
df = pd.DataFrame({
    "column1": ["a", "b", "c", "a", "b", "c"],
    "column2": [1, 2, 3, 1, 2, 3],
    "column3": [4, 5, 6, 4, 5, 6]
})
# Remove duplicate rows, keeping the first occurrence of each duplicate
df1 = df.drop_duplicates(subset=["column1", "column2"], keep="first")
# Remove duplicate rows, keeping the last occurrence of each duplicate
df2 = df.drop_duplicates(subset=["column1", "column2"], keep="last")
In this example, df1 contains the first occurrence of each duplicate row, while df2 contains the last. The original df is not modified by drop_duplicates(); each result is a new DataFrame containing only the rows that were not removed as duplicates.
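There is also keep=False, which drops every occurrence of a duplicated row, keeping only the rows that were unique to begin with:
# Drop all occurrences of duplicated rows, keeping only rows that appear once
df3 = df.drop_duplicates(subset=["column1", "column2"], keep=False)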
Handling Missing Values in Pandas with fillna
To handle missing values in a pandas DataFrame, we can use the fillna() method, which lets us specify a value to use in place of missing data. For example, if we have a DataFrame with missing values in the “Age” column, we can use fillna() to replace them with the average of the non-missing values in that column:
# Create a sample dataframe with missing values in the "Age" column
data = {
    "Name": ["John", "Jane", "Jack", "Jill", "Jim"],
    "Age": [25, None, 28, None, 33]
}
df = pd.DataFrame(data)
# Calculate the average age of the non-missing values in the "Age" column
avg_age = df["Age"].mean()
# Use fillna to replace the missing values in the "Age" column with the average age
# (assigning back avoids the unreliable inplace=True on a selected column)
df["Age"] = df["Age"].fillna(avg_age)
# The resulting dataframe should have no missing values in the "Age" column
print(df)
This would output the following dataframe:
   Name        Age
0  John  25.000000
1  Jane  28.666667
2  Jack  28.000000
3  Jill  28.666667
4   Jim  33.000000
As you can see, the missing values in the “Age” column have been replaced with the average age of the non-missing values in the column.
If you do not want to fill missing data with the average, you can pass any other value to fillna(). For example, to use 0 in place of missing data:
# Create a sample dataframe with missing values in the "Age" column
data = {
    "Name": ["John", "Jane", "Jack", "Jill", "Jim"],
    "Age": [25, None, 28, None, 33]
}
df = pd.DataFrame(data)
# Use fillna to replace the missing values in the "Age" column with 0
df["Age"] = df["Age"].fillna(0)
Using dropna to remove missing values
Another way to handle missing values is the dropna() method, which removes rows or columns that contain missing values. For example, if we have a DataFrame with missing values in the “Age” column, we can use dropna() to remove every row whose “Age” is missing:
# Create a sample dataframe with missing values in the "Age" column
data = {
    "Name": ["John", "Jane", "Jack", "Jill", "Jim"],
    "Age": [25, None, 28, None, 33]
}
df = pd.DataFrame(data)
# Use dropna to remove all rows with a missing value in the "Age" column
df.dropna(subset=["Age"], inplace=True)
# The resulting dataframe should not have any rows with missing values in the "Age" column
Replacing String Values in a Column with Pandas
The following code shows an example of replacing string values in a column using pandas. In this example, we replace the values in the status column of a DataFrame with a new value based on the current one.
# create a sample dataframe
data = {'name': ['John', 'Jane', 'Jim', 'Sue', 'Bob'],
        'status': ['single', 'married', 'single', 'married', 'single']}
df = pd.DataFrame(data)
# replace the values in the 'status' column
df['status'] = df['status'].replace({'single': 'S', 'married': 'M'})
print(df)
This code will output the following:
   name status
0  John      S
1  Jane      M
2   Jim      S
3   Sue      M
4   Bob      S
In this example, we use the replace() method to replace the string values in the status column. We pass a dictionary mapping each old value to its replacement: 'single' becomes 'S' and 'married' becomes 'M'. The replace() method returns a new Series with the updated values, which we then assign back to the status column of the DataFrame.
Joining DataFrames with Pandas
When working with multiple data sets, it can be useful to combine them into a single DataFrame. One way to do this is the .join() method, which joins two DataFrames on their indices. Here is an example:
# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 2, 3])
# Use the .join() method to combine the DataFrames
# (how="outer" keeps indices that appear in either DataFrame)
df3 = df1.join(df2, how="outer")
This code produces the following output:
     A    B    C    D
0   A0   B0   C0   D0
1   A1   B1  NaN  NaN
2   A2   B2   C1   D1
3  NaN  NaN   C2   D2
As you can see, the .join() method has combined the two DataFrames by matching their indices. With how="outer", the result keeps every index that appears in either DataFrame and fills in NaN (not a number) where there was no match. Note that the default is a left join, which would keep only the indices of df1.
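If you only want the rows whose index appears in both DataFrames, an inner join drops the non-matching rows instead of filling them with NaN:
# Keep only indices present in both DataFrames (no NaN rows)
df_inner = df1.join(df2, how="inner")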
Set the index
To make a column the index of a DataFrame, you can use the set_index() method. Here is an example that makes the UserID column the index:
# Create a DataFrame
df = pd.DataFrame({'UserID': [1, 2, 3, 4],
                   'Name': ['John', 'Jane', 'Joe', 'Jill'],
                   'Age': [25, 30, 35, 40]})
# Use the .set_index() method to make the 'UserID' column the index
df = df.set_index('UserID')
Here is an example of using the .join() method to join two DataFrames on a shared UserID index:
# Create two DataFrames
df1 = pd.DataFrame({'UserID': [1, 2, 3],
                    'Name': ['John', 'Jane', 'Joe']})
df2 = pd.DataFrame({'UserID': [1, 2, 3],
                    'Age': [25, 30, 35]})
# Use the .set_index() method to make the 'UserID' column the index for both DataFrames
df1 = df1.set_index('UserID')
df2 = df2.set_index('UserID')
# Use the .join() method to join the DataFrames on their 'UserID' index
df3 = df1.join(df2)
Join on multiple columns
To join two DataFrames on multiple columns, you can use the .merge() method. Here is an example that joins two DataFrames on the Surname, Postcode, and DateOfBirth columns:
# Create two DataFrames
df1 = pd.DataFrame({'Surname': ['Smith', 'Jones', 'Brown'],
                    'Postcode': ['10001', '10002', '10003'],
                    'DateOfBirth': ['1990-01-01', '1991-02-01', '1992-03-01']})
df2 = pd.DataFrame({'Surname': ['Smith', 'Jones', 'Brown'],
                    'Postcode': ['10001', '10002', '10003'],
                    'DateOfBirth': ['1990-01-01', '1991-02-01', '1992-03-01'],
                    'Age': [30, 31, 32]})
# Use the .merge() method to join the DataFrames on the 'Surname', 'Postcode', and 'DateOfBirth' columns
df3 = df1.merge(df2, on=['Surname', 'Postcode', 'DateOfBirth'])
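By default, merge() performs an inner join, keeping only the rows whose key columns match in both DataFrames; the how parameter switches this to a left, right, or outer join. For example:
# Keep every row of df1, filling NaN where df2 has no matching keys
df3 = df1.merge(df2, on=['Surname', 'Postcode', 'DateOfBirth'], how='left')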
Finding the Minimum, Maximum, Sum, or Average of Data
Pandas columns come with built-in aggregation methods such as min(), max(), sum(), and mean() for computing summary statistics:
# Create a Pandas dataframe with sample data
df = pd.DataFrame({
    "Name": ["John", "Jane", "Jack", "Jill"],
    "Age": [25, 27, 32, 22],
    "Income": [50000, 60000, 70000, 40000]
})
# Find the minimum value in the "Age" column
min_age = df["Age"].min()
print(f"Minimum Age: {min_age}")
# Find the maximum value in the "Income" column
max_income = df["Income"].max()
print(f"Maximum Income: {max_income}")
# Find the sum of all values in the "Income" column
sum_income = df["Income"].sum()
print(f"Total Income: {sum_income}")
# Find the average value in the "Age" column
avg_age = df["Age"].mean()
print(f"Average Age: {avg_age:.2f}")
Output:
Minimum Age: 22
Maximum Income: 70000
Total Income: 210000
Average Age: 27.25
Filtering Out Outliers with Pandas
Filtering outliers is a technique used in data analysis to identify and remove unusually large or small values that may be skewing the data. This can be useful for improving the accuracy and interpretability of the analysis, as outliers can often have a disproportionate effect on the results.
To filter out outliers, we can use the mean and standard deviation of the data to identify values that are unusually far from the center of the distribution. In a normal distribution, about 95% of the values will be within two standard deviations of the mean. So, by removing values that are more than two standard deviations away from the mean, we can effectively filter out a small number of unusually large or small values that may be distorting the data.
Of course, this approach is not always appropriate, and it is important to carefully consider the data and the goals of the analysis before deciding how to filter outliers. In some cases, other techniques may be more appropriate, such as using the median and interquartile range, or using more advanced methods such as robust regression.
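As an illustration of the interquartile-range alternative mentioned above, here is a minimal sketch using the common 1.5 × IQR rule (the multiplier is a convention, not a requirement):
# Filter outliers using the interquartile range (IQR)
q1 = df['col1'].quantile(0.25)
q3 = df['col1'].quantile(0.75)
iqr = q3 - q1
df = df[(df['col1'] >= q1 - 1.5 * iqr) & (df['col1'] <= q3 + 1.5 * iqr)]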
Returning to the standard-deviation approach: to filter out outliers with pandas, we can use the DataFrame.loc method and specify conditions for which values to keep. For example, if we have a DataFrame df with a numeric column 'col1' and we want to keep only the values within two standard deviations of the mean, we can use the following code:
mean = df['col1'].mean()
std = df['col1'].std()
df = df.loc[(df['col1'] > mean - 2 * std) & (df['col1'] < mean + 2 * std)]
This filters out any values more than two standard deviations from the mean of col1, leaving a DataFrame containing only the values within that range.
Alternatively, we can use the DataFrame.query method to filter out outliers with the same conditions:
mean = df['col1'].mean()
std = df['col1'].std()
df = df.query('@mean - 2 * @std < col1 < @mean + 2 * @std')
Here, the @ symbol refers to the variables mean and std inside the query string, letting us use their values in the conditions. This approach can be useful when the conditions become too unwieldy to express with the DataFrame.loc method.
In both cases, the resulting DataFrame contains only the values within two standard deviations of the mean of col1; any outliers have been filtered out.
Visualizing data using plots with Pandas
Pandas includes a variety of tools for visualizing data. One of the most useful is the plot method, which can create a wide range of plot types from a DataFrame or a Series. Here is an example that uses the plot method to create a line plot from a DataFrame of monthly data:
# Create a DataFrame with some sample data
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
                   'sales': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]})
# Use the plot method to create a line plot
df.plot(x='month', y='sales', kind='line')
In this example, we create a DataFrame with some sample data, then use the plot method to create a line plot, specifying the x and y columns to use and the kind of plot to draw (in this case, a line plot).
The following code demonstrates how to create several other types of plots and charts:
import pandas as pd
import matplotlib.pyplot as plt
# load the data
data = pd.read_csv("data.csv")
# create a line plot
data.plot(x="Year", y="Sales", title="Company Sales over the Years")
plt.show()
# create a scatter plot
data.plot(x="Profit", y="Expenses", kind="scatter", title="Profit vs Expenses")
plt.show()
# create a histogram
data["Profit"].plot(kind="hist", bins=20, title="Profit Distribution")
plt.show()
# create a bar chart
data.plot(x="Year", y="Profit", kind="bar", title="Profit by Year")
plt.show()
Sorting values before plotting
For the line plot above, to make sure the year axis is in the correct order, the sort_values method can be used to sort the data by the year column before plotting. The updated code looks like this:
# sort the data by year
data = data.sort_values("Year")
# create a line plot
data.plot(x="Year", y="Sales", title="Company Sales over the Years")
plt.show()
Grouping data and applying aggregate functions with Pandas
To group data and apply aggregate functions, we can use the groupby() and agg() methods. The groupby() method groups the rows by the specified columns, and agg() applies aggregate functions to each group. For example, say we have the following DataFrame with information about different animals:
df = pd.DataFrame({'species': ['lion', 'tiger', 'elephant', 'giraffe', 'gazelle', 'monkey'],
                   'weight': [200, 250, 5000, 1000, 100, 50],
                   'height': [10, 12, 10, 15, 6, 4],
                   'age': [5, 8, 30, 12, 3, 4]})
To group the data by species and calculate the average weight and height for each species, we would use the following code:
df.groupby('species')[['weight', 'height']].agg('mean')
This would return the following dataframe:
          weight  height
species
elephant  5000.0    10.0
gazelle    100.0     6.0
giraffe   1000.0    15.0
lion       200.0    10.0
monkey      50.0     4.0
tiger      250.0    12.0
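agg() also accepts a list of functions, or a dictionary mapping columns to functions, if you want several statistics at once. A quick sketch:
# Apply several aggregate functions per group
df.groupby('species').agg({'weight': ['mean', 'max'], 'age': 'mean'})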
Transforming data using map with Pandas
In the following example, we use the map method to transform the values in a pandas DataFrame column. We start with a DataFrame containing information about different fruits, including their type, color, and quantity.
# create a dataframe with fruit information
fruit_df = pd.DataFrame({'type': ['apple', 'banana', 'strawberry', 'grape'],
                         'color': ['red', 'yellow', 'red', 'purple'],
                         'quantity': [10, 20, 30, 40]})
# display the dataframe
print(fruit_df)
# use map to transform the color values
fruit_df['color'] = fruit_df['color'].map({'red': 'green', 'yellow': 'blue', 'purple': 'orange'})
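One caveat: map replaces any value that is not a key in the dictionary with NaN. If you want unmapped values to pass through unchanged, replace is the safer choice:
# Unlike map, replace leaves values without a mapping untouched
fruit_df['color'] = fruit_df['color'].replace({'red': 'green', 'yellow': 'blue'})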
Using Lambda Functions with Pandas
Here is an example of using a lambda function with Pandas to create a new column in a DataFrame:
# Create a DataFrame with some sample data
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Use a lambda function to create a new column called 'C'
# that is the sum of columns 'A' and 'B'
df['C'] = df.apply(lambda x: x['A'] + x['B'], axis=1)
# Print the resulting DataFrame
print(df)
# Output:
# A B C
# 0 1 4 5
# 1 2 5 7
# 2 3 6 9
When using the apply() method on a pandas DataFrame, the axis parameter specifies which dimension of the DataFrame the function is applied to. Setting axis=1 applies the function to each row rather than each column. This is necessary here because our lambda uses the values from columns A and B of each row to compute the value for column C. Without axis=1, the function would be applied to each column instead of each row, and the results would not be correct.
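For simple arithmetic like this, a vectorized expression is both shorter and considerably faster than apply, since it operates on whole columns at once:
# Vectorized equivalent of the apply() example above
df['C'] = df['A'] + df['B']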
Conclusion
Congratulations on completing this guide on working with data in Pandas! You’ve learned how to select columns, sort and filter data, add new columns, join DataFrames, group and aggregate data, and handle missing values. These are essential skills for any data analyst or scientist, and with Pandas, you have the power to manipulate and analyze your data in a variety of ways. Keep practicing and exploring the capabilities of Pandas, and you’ll be able to tackle any data challenge that comes your way. Happy data wrangling!