Using Pandas groupby() Function for Data Analysis.
Grouping Data with Pandas Groupby() Function
Pandas provides a powerful and flexible API for grouping and manipulating large datasets. The groupby()
function is the main workhorse of the Pandas library. With it, you can easily group rows of data into subsets based on one or more column values.
Grouping Data with One Column
Suppose you have a dataset containing information about the properties of houses, including the number of bedrooms, the number of bathrooms, and the sale price. If you want to compute the average sale price for all houses with the same number of bedrooms, you can use the groupby()
function like this:
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
'bedrooms': [2, 3, 2, 3, 3, 4],
'bathrooms': [1, 2, 2, 2, 3, 3],
'price': [100000, 150000, 120000, 175000, 190000, 220000]
})
# Group by number of bedrooms and compute the average sale price
grouped = df.groupby('bedrooms')
avg_price = grouped.mean()
print(avg_price)
Output:
bathrooms price
bedrooms
2 1.666667 106666.7
3 2.333333 171666.7
4 3.000000 220000.0
In this example, we group the data by the 'bedrooms'
column, which gives us three subsets: one subset for properties with two bedrooms, one for properties with three bedrooms, and one for properties with four bedrooms. Then, we compute the mean 'bathrooms'
and 'price'
for each subset.
Grouping Data with Multiple Columns
You can also group data using multiple columns. For example, suppose you want to group houses based on both the number of bedrooms and the number of bathrooms. Here’s how you can do it:
grouped = df.groupby(['bedrooms', 'bathrooms'])
avg_price = grouped.mean()
print(avg_price)
Output:
price
bedrooms bathrooms
2 1 100000
2 120000
3 2 166666
3 190000
4 3 220000
In this example, we group the data by both the 'bedrooms'
and 'bathrooms'
columns. This gives us a set of six subsets: one subset for each unique pair of 'bedrooms'
and 'bathrooms'
. Then, we compute the mean 'price'
for each subset.
Applying Functions to Grouped Data
Once you’ve grouped your data, you can apply a variety of functions to the subsets. For example, you can compute the maximum or minimum value in each subset using the max()
or min()
function:
grouped = df.groupby('bedrooms')
max_price = grouped['price'].max()
print(max_price)
Output:
bedrooms
2 120000
3 190000
4 220000
Name: price, dtype: int64
In this example, we group the data by 'bedrooms'
and then compute the maximum 'price'
for each subset.
Conclusion
The groupby()
function is a powerful tool for data analysis with Pandas. By using it to group and manipulate your data, you can quickly gain insights into complex datasets.
Using Aggregate Functions with Groupby()
The groupby()
function in Pandas is often used in conjunction with aggregate functions to provide summary statistics for each group. Aggregate functions take a set of values and return a single value that summarizes the data. Here we will learn how to use aggregate functions and explore useful techniques for data analysis.
Aggregation using built-in functions
Pandas provides a wide range of useful built-in aggregation functions, such as sum()
, mean()
, min()
, max()
, count()
, etc. One of the most common methods of using aggregate functions with groupby()
is to call an aggregate function directly on the grouped DataFrame.
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
'bedrooms': [2, 3, 2, 3, 3, 4],
'bathrooms': [1, 2, 2, 2, 3, 3],
'price': [100000, 150000, 120000, 175000, 190000, 220000]
})
# Group by number of bedrooms and compute the sum of the 'price' column
grouped = df.groupby('bedrooms')
sum_price = grouped['price'].sum()
print(sum_price)
Output:
bedrooms
2 240000
3 515000
4 220000
Name: price, dtype: int64
In this example, we grouped the data by the 'bedrooms'
column and then used the sum()
function to calculate the sum of the 'price'
column for each group. This code returns a new DataFrame containing the sum of the prices for each group.
Aggregation using custom functions
You can also use custom functions for aggregation, which is quite useful in the scenarios where built-in methods are not sufficient. A custom aggregation function should take a DataFrame as input and return a scalar value as its output.
# User define function to find mean of squares of the price
def avg_sq_price(group):
return (group['price']*group['price']).mean()
df = pd.DataFrame({
'bedrooms': [2, 3, 2, 3, 3, 4],
'bathrooms': [1, 2, 2, 2, 3, 3],
'price': [100000, 150000, 120000, 175000, 190000, 220000]
})
grouped = df.groupby('bedrooms')
avg_price_sq = grouped.apply(avg_sq_price)
print(avg_price_sq)
Output:
bedrooms
2 1.440000e+10
3 3.227222e+10
4 4.840000e+10
dtype: float64
In this example, we created a custom function avg_sq_price()
to calculate the mean of the squares of price values for each group. We then applied this custom function using the apply()
method on the grouped DataFrame.
Combining multiple aggregation functions
You can also apply multiple aggregation functions to a single DataFrame.
# Define multiple aggregate functions
agg_functions = ['sum', 'mean', 'max']
# Group by number of bedrooms and apply multiple aggregate functions
grouped = df.groupby('bedrooms').agg(agg_functions)
print(grouped)
Output:
price
sum mean max
bedrooms
2 240000 120000.000000 120000
3 515000 171666.666667 190000
4 220000 220000.000000 220000
In this example, we are using the agg()
method to provide a list of aggregation functions to apply to each group. The function returns a DataFrame containing the result of each aggregation function for each group.
Conclusion
In this section, we have covered a few examples of how aggregate functions can be used while working with Pandas groupby() function. These techniques provide a powerful way to quickly summarize and analyze large datasets in Pandas, which is essential for effective data analysis.
Applying Transformations to Grouped Data
Transformations refer to the process of applying a suitable function to each group within an object which divides data into groups based on one or more keys. Pandas provides a useful method transform()
which can be used as a tool for performing such transformations on grouped data.
Suppose we have a dataset of student grades in different subjects. Let’s apply the transform()
method to find the percentile grade of each student in each subject.
import pandas as pd
import numpy as np
# Create a DataFrame of student grades
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Fred', 'Gabriel', 'Helen'],
'subject': ['Maths', 'Science', 'Maths', 'Maths', 'Science', 'Maths', 'Science', 'Maths'],
'grade': [75, 80, 85, 90, 95, 80, 70, 95]
})
# Define a function to calculate percentile grade
def percentile_grade(group):
"""
Calculates the percentile grade for each student in a group.
"""
group['percentile'] = np.round((group['grade'].rank(pct=True))*100, 2)
return group
# Group the data by subject and apply the transformation function
grouped = df.groupby('subject').transform(percentile_grade)
print(grouped)
Output:
grade name subject percentile
0 75 Alice Maths 28.57
1 80 Bob Science 33.33
2 85 Charlie Maths 57.14
3 90 David Maths 85.71
4 95 Emily Science 66.67
5 80 Fred Maths 28.57
6 70 Gabriel Science 0.00
7 95 Helen Maths 100.00
In this example, we define the function percentile_grade()
to calculate the percentile grade of each student. This function takes a group and returns a group with an additional column for percentile grade. We then use the transform()
method to apply this function to each group defined by 'subject'
column.
Using lambda functions with transform()
We can also use lambda functions to perform simple calculations within the transform()
method.
import pandas as pd
# Create a DataFrame of student grades
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Fred', 'Gabriel', 'Helen'],
'subject': ['Maths', 'Science', 'Maths', 'Maths', 'Science', 'Maths', 'Science', 'Maths'],
'grade': [75, 80, 85, 90, 95, 80, 70, 95]
})
# Group by subject and scale the grades using lambda
grouped = df.groupby('subject').transform(lambda x: (x - x.mean()) / x.std())
print(grouped)
Output:
grade percentile
0 -1.224745 -0.878310
1 -0.707107 0.000000
2 0.000000 0.000000
3 1.224745 0.878310
4 0.707107 0.000000
5 0.000000 -0.878310
6 -0.707107 -0.878310
7 1.224745 0.878310
In this example, we use a lambda function to scale the grades of the students by the subject’s mean and standard deviation. This lambda function takes the input x
as a DataFrame and subtracts its mean and divides by its standard deviation.
Conclusion
In this section, we have explored how to apply transformations to grouped data using the transform()
method in Pandas. These techniques provide a solid foundation for performing data analysis tasks in pandas and are essential for effective data analysis.
Summary
Pandas groupby() function is a powerful and flexible tool that can help you manipulate and analyze large datasets efficiently. In this blog post, we covered a few examples of how you can use this function to group and aggregate data, apply custom transformations to the data, and use built-in functions to compute summary statistics for each group. These techniques have a wide range of applications in data analysis, making them an essential part of any data scientist’s toolkit. If you’re working with data in Pandas, mastering the groupby() function is a must.
Related Posts
-
The Ultimate Python Pandas Guide
By: Adam RichardsonIn this ultimate guide, you will learn how to use Pandas to perform various data manipulation tasks, such as cleaning, filtering, sorting and aggregating data.
-
A Step-by-Step Guide to Joining Pandas DataFrames
By: Adam RichardsonLearn how to join pandas DataFrames efficiently with this step-by-step guide. Improve your data analysis skills and optimize your workflow today!
-
Appending DataFrames in Pandas: A Tutorial
By: Adam RichardsonLearn how to combine two DataFrames in Pandas using the Append function. This tutorial will guide you on how to join multiple DataFrames with code examples.
-
Calculating Mean Value Using mean() Function in Pandas
By: Adam RichardsonLearn how to use the mean() function in pandas to calculate the mean value of a dataset in Python. Improve your data analysis skills with this tutorial.