Using Pandas groupby() Function for Data Analysis.

Grouping Data with Pandas Groupby() Function

Pandas provides a powerful and flexible API for grouping and manipulating large datasets. The groupby() function is the main workhorse of the Pandas library. With it, you can easily group rows of data into subsets based on one or more column values.

Grouping Data with One Column

Suppose you have a dataset containing information about the properties of houses, including the number of bedrooms, the number of bathrooms, and the sale price. If you want to compute the average sale price for all houses with the same number of bedrooms, you can use the groupby() function like this:

import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'bedrooms': [2, 3, 2, 3, 3, 4],
    'bathrooms': [1, 2, 2, 2, 3, 3],
    'price': [100000, 150000, 120000, 175000, 190000, 220000]
})

# Group by number of bedrooms and compute the average sale price
grouped = df.groupby('bedrooms')
avg_price = grouped.mean()

print(avg_price)

Output:

                bathrooms     price
bedrooms
2                1.666667  106666.7
3                2.333333  171666.7
4                3.000000  220000.0

In this example, we group the data by the 'bedrooms' column, which gives us three subsets: one subset for properties with two bedrooms, one for properties with three bedrooms, and one for properties with four bedrooms. Then, we compute the mean 'bathrooms' and 'price' for each subset.

Grouping Data with Multiple Columns

You can also group data using multiple columns. For example, suppose you want to group houses based on both the number of bedrooms and the number of bathrooms. Here’s how you can do it:

grouped = df.groupby(['bedrooms', 'bathrooms'])
avg_price = grouped.mean()

print(avg_price)

Output:

                       price
bedrooms bathrooms
2        1           100000
         2           120000
3        2           166666
         3           190000
4        3           220000

In this example, we group the data by both the 'bedrooms' and 'bathrooms' columns. This gives us a set of six subsets: one subset for each unique pair of 'bedrooms' and 'bathrooms'. Then, we compute the mean 'price' for each subset.

Applying Functions to Grouped Data

Once you’ve grouped your data, you can apply a variety of functions to the subsets. For example, you can compute the maximum or minimum value in each subset using the max() or min() function:

grouped = df.groupby('bedrooms')
max_price = grouped['price'].max()

print(max_price)

Output:

bedrooms
2    120000
3    190000
4    220000
Name: price, dtype: int64

In this example, we group the data by 'bedrooms' and then compute the maximum 'price' for each subset.

Conclusion

The groupby() function is a powerful tool for data analysis with Pandas. By using it to group and manipulate your data, you can quickly gain insights into complex datasets.

Using Aggregate Functions with Groupby()

The groupby() function in Pandas is often used in conjunction with aggregate functions to provide summary statistics for each group. Aggregate functions take a set of values and return a single value that summarizes the data. Here we will learn how to use aggregate functions and explore useful techniques for data analysis.

Aggregation using built-in functions

Pandas provides a wide range of useful built-in aggregation functions, such as sum(), mean(), min(), max(), count(), etc. One of the most common methods of using aggregate functions with groupby() is to call an aggregate function directly on the grouped DataFrame.

import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'bedrooms': [2, 3, 2, 3, 3, 4],
    'bathrooms': [1, 2, 2, 2, 3, 3],
    'price': [100000, 150000, 120000, 175000, 190000, 220000]
})

# Group by number of bedrooms and compute the sum of the 'price' column
grouped = df.groupby('bedrooms')
sum_price = grouped['price'].sum()

print(sum_price)

Output:

bedrooms
2    240000
3    515000
4    220000
Name: price, dtype: int64

In this example, we grouped the data by the 'bedrooms' column and then used the sum() function to calculate the sum of the 'price' column for each group. This code returns a new DataFrame containing the sum of the prices for each group.

Aggregation using custom functions

You can also use custom functions for aggregation, which is quite useful in the scenarios where built-in methods are not sufficient. A custom aggregation function should take a DataFrame as input and return a scalar value as its output.

# User define function to find mean of squares of the price
def avg_sq_price(group):
    return (group['price']*group['price']).mean()

df = pd.DataFrame({
    'bedrooms': [2, 3, 2, 3, 3, 4],
    'bathrooms': [1, 2, 2, 2, 3, 3],
    'price': [100000, 150000, 120000, 175000, 190000, 220000]
})

grouped = df.groupby('bedrooms')
avg_price_sq = grouped.apply(avg_sq_price)

print(avg_price_sq)

Output:

bedrooms
2    1.440000e+10
3    3.227222e+10
4    4.840000e+10
dtype: float64

In this example, we created a custom function avg_sq_price() to calculate the mean of the squares of price values for each group. We then applied this custom function using the apply() method on the grouped DataFrame.

Combining multiple aggregation functions

You can also apply multiple aggregation functions to a single DataFrame.

# Define multiple aggregate functions
agg_functions = ['sum', 'mean', 'max']

# Group by number of bedrooms and apply multiple aggregate functions
grouped = df.groupby('bedrooms').agg(agg_functions)

print(grouped)

Output:

           price
             sum           mean     max
bedrooms
2         240000  120000.000000  120000
3         515000  171666.666667  190000
4         220000  220000.000000  220000

In this example, we are using the agg() method to provide a list of aggregation functions to apply to each group. The function returns a DataFrame containing the result of each aggregation function for each group.

Conclusion

In this section, we have covered a few examples of how aggregate functions can be used while working with Pandas groupby() function. These techniques provide a powerful way to quickly summarize and analyze large datasets in Pandas, which is essential for effective data analysis.

Applying Transformations to Grouped Data

Transformations refer to the process of applying a suitable function to each group within an object which divides data into groups based on one or more keys. Pandas provides a useful method transform() which can be used as a tool for performing such transformations on grouped data.

Suppose we have a dataset of student grades in different subjects. Let’s apply the transform() method to find the percentile grade of each student in each subject.

import pandas as pd
import numpy as np

# Create a DataFrame of student grades
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Fred', 'Gabriel', 'Helen'],
    'subject': ['Maths', 'Science', 'Maths', 'Maths', 'Science', 'Maths', 'Science', 'Maths'],
    'grade': [75, 80, 85, 90, 95, 80, 70, 95]
})

# Define a function to calculate percentile grade
def percentile_grade(group):
    """
    Calculates the percentile grade for each student in a group.
    """
    group['percentile'] = np.round((group['grade'].rank(pct=True))*100, 2)
    return group

# Group the data by subject and apply the transformation function
grouped = df.groupby('subject').transform(percentile_grade)

print(grouped)

Output:

      grade      name subject  percentile
0  75     Alice   Maths    28.57
1  80       Bob  Science   33.33
2  85   Charlie   Maths     57.14
3  90     David   Maths     85.71
4  95     Emily  Science   66.67
5  80      Fred   Maths     28.57
6  70   Gabriel  Science     0.00
7  95     Helen   Maths     100.00

In this example, we define the function percentile_grade() to calculate the percentile grade of each student. This function takes a group and returns a group with an additional column for percentile grade. We then use the transform() method to apply this function to each group defined by 'subject' column.

Using lambda functions with transform()

We can also use lambda functions to perform simple calculations within the transform() method.

import pandas as pd

# Create a DataFrame of student grades
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Fred', 'Gabriel', 'Helen'],
    'subject': ['Maths', 'Science', 'Maths', 'Maths', 'Science', 'Maths', 'Science', 'Maths'],
    'grade': [75, 80, 85, 90, 95, 80, 70, 95]
})

# Group by subject and scale the grades using lambda
grouped = df.groupby('subject').transform(lambda x: (x - x.mean()) / x.std())

print(grouped)

Output:

      grade  percentile
0 -1.224745   -0.878310
1 -0.707107    0.000000
2  0.000000    0.000000
3  1.224745    0.878310
4  0.707107    0.000000
5  0.000000   -0.878310
6 -0.707107   -0.878310
7  1.224745    0.878310

In this example, we use a lambda function to scale the grades of the students by the subject’s mean and standard deviation. This lambda function takes the input x as a DataFrame and subtracts its mean and divides by its standard deviation.

Conclusion

In this section, we have explored how to apply transformations to grouped data using the transform() method in Pandas. These techniques provide a solid foundation for performing data analysis tasks in pandas and are essential for effective data analysis.

Summary

Pandas groupby() function is a powerful and flexible tool that can help you manipulate and analyze large datasets efficiently. In this blog post, we covered a few examples of how you can use this function to group and aggregate data, apply custom transformations to the data, and use built-in functions to compute summary statistics for each group. These techniques have a wide range of applications in data analysis, making them an essential part of any data scientist’s toolkit. If you’re working with data in Pandas, mastering the groupby() function is a must.