How to Add a New Column to a Pandas DataFrame
Understanding Pandas DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). It is a powerful tool for data manipulation, analysis, and cleaning in Python.
The DataFrame consists of three main components:
- Index: Labels the rows and can be used to access the data in each row.
- Columns: Each column has a label, usually unique, that can be used to access the data in that column.
- Data: The values contained in the DataFrame.
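As a rough illustration of these three components, the sketch below builds a small, hypothetical two-row DataFrame (not the one used later in this article) and reads each part through the `index`, `columns`, and `to_numpy()` accessors.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the three components.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 32]})

index_labels = list(df.index)      # row labels (a default integer index here)
column_labels = list(df.columns)   # column labels
values = df.to_numpy()             # the underlying data as a NumPy array

print(index_labels)
print(column_labels)
print(values.shape)
```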
A DataFrame can be created in various ways, such as from a dictionary, a list of dictionaries, a list of lists, etc. Here is an example:
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 32, 18, 47],
'country': ['USA', 'Canada', 'UK', 'USA']}
df = pd.DataFrame(data)
In this example, we created a DataFrame with three columns: name, age, and country. Each column is a Pandas Series, which can be thought of as a single column of data with a label.
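The other construction routes mentioned above, a list of dictionaries and a list of lists, might look like the following sketch. The variable names `records` and `rows` are illustrative choices, not part of the original example.

```python
import pandas as pd

# A list of dictionaries: each dictionary becomes one row,
# and the keys become the column labels.
records = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 32}]
df_from_records = pd.DataFrame(records)

# A list of lists: each inner list is one row, so the column
# labels have to be supplied separately.
rows = [['Alice', 25], ['Bob', 32]]
df_from_rows = pd.DataFrame(rows, columns=['name', 'age'])

# Both routes should produce the same table.
print(df_from_records.equals(df_from_rows))
```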
To add a new column to an existing DataFrame, we can use the df['column_name'] syntax followed by an equal sign and the new column data. Here is an example:
# Adding a new column 'profession' to our DataFrame
df['profession'] = ['doctor', 'teacher', 'student', 'engineer']
We can also add a new column by performing some computation on the existing columns. Here is an example:
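# Adding a new column 'birth_year' to our DataFrame
# pd.Timestamp.now().year gives the current calendar year
df['birth_year'] = pd.Timestamp.now().year - df['age']
In this example, we estimated the birth year of each person by subtracting their age from the current year. The result can be off by one for anyone whose birthday has not yet occurred this year.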
In this section, we learned about what a Pandas DataFrame is and how to add a new column to an existing DataFrame. It is a fundamental concept in data analysis, and mastering it will allow you to perform a wide range of data-related tasks in Python.
Adding a New Column to a DataFrame
Adding a new column to a Pandas DataFrame means assigning to a new label with the df['column_name'] syntax. The new column can be created from a scalar value, which is repeated for every row, or from a list of values whose length equals the number of rows in the DataFrame.
Let’s say we have a DataFrame of student data and we want to add a new column for their grades. Here’s an example:
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 32, 18, 47],
'Country': ['USA', 'Canada', 'UK', 'USA']}
df = pd.DataFrame(data)
# Adding a new column 'Grade' with scalar value
df['Grade'] = 'B'
# Printing the modified DataFrame
print(df)
In this example, we added a new column ‘Grade’ to our DataFrame and assigned the scalar value ‘B’ to all rows. You can also add a new column with a list of values.
# Creating a list of grades
grades = ['A', 'B', 'C', 'D']
# Adding a new column 'Grade' with list of values
df['Grade'] = grades
# Printing the modified DataFrame
print(df)
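In this example, we assigned a list of values to the ‘Grade’ column. Because ‘Grade’ already existed from the previous step, the assignment replaced its values rather than creating a second column.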
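The length requirement for list assignment can be seen in a small sketch. It assumes a four-row frame like the one above and deliberately passes only three grades, which pandas rejects with a `ValueError`. The `raised` flag is just a helper for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David']})

raised = False
try:
    # Only 3 values for 4 rows, so the assignment should fail.
    df['Grade'] = ['A', 'B', 'C']
except ValueError as exc:
    raised = True
    print("Assignment rejected:", exc)

# The failed assignment leaves the DataFrame without a 'Grade' column.
print(list(df.columns))
```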
We can also add a new column by performing some computation on the existing columns. Here is an example:
# Adding a new column 'Birth Year' by computing the birth year from age
df['Birth Year'] = pd.Timestamp.now().year - df['Age']
# Printing the modified DataFrame
print(df)
In this example, we computed the birth year of each student by subtracting their age from the current year.
In this section, we learned how to add a new column to a Pandas DataFrame using scalar value, list of values or by performing computation on the existing columns. Understanding this concept is crucial in data analysis with Pandas, as it enables us to add new information to our data quickly and easily.
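Beyond bracket assignment, pandas also provides `DataFrame.assign` and `DataFrame.insert`. These are not covered in the examples above. The sketch below, on a hypothetical two-row frame, is only meant to show how they differ: `assign` returns a new DataFrame, while `insert` modifies the existing one and lets you choose the column position.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 32]})

# assign returns a new DataFrame and leaves df unchanged.
with_grade = df.assign(Grade=['A', 'B'])

# insert modifies df in place and puts the new column at position 1.
df.insert(1, 'Country', ['USA', 'Canada'])

print(list(with_grade.columns))
print(list(df.columns))
```

Bracket assignment remains the simplest choice when the new column should go at the end and modifying the DataFrame in place is acceptable.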
Performing Operations on New Column
Once a new column has been added to a Pandas DataFrame, we can perform various operations on it. These operations include arithmetic and logical operations, aggregation, and filtering based on column values.
Let’s say we have a DataFrame of employee data and we want to add a new column for their monthly salary based on their hourly rate and number of hours worked in a month. Here’s an example:
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Hourly Rate': [15.25, 20.50, 10.75, 12.00],
'Hours Worked': [160, 120, 140, 180]}
df = pd.DataFrame(data)
# Adding a new column 'Monthly Salary' by computing the product of 'Hourly Rate' and 'Hours Worked'
df['Monthly Salary'] = df['Hourly Rate'] * df['Hours Worked']
# Printing the modified DataFrame
print(df)
In this example, we added a new column ‘Monthly Salary’ to our DataFrame by computing the product of ‘Hourly Rate’ and ‘Hours Worked’ columns.
We can also perform aggregation on the new column such as finding the average, minimum, maximum, or median monthly salary of employees.
# Finding the average monthly salary
avg_salary = df['Monthly Salary'].mean()
# Finding the maximum monthly salary
max_salary = df['Monthly Salary'].max()
# Finding the median monthly salary
median_salary = df['Monthly Salary'].median()
# Printing the results
print("Average Monthly Salary:", avg_salary)
print("Maximum Monthly Salary:", max_salary)
print("Median Monthly Salary:", median_salary)
In this example, we performed aggregation on the ‘Monthly Salary’ column by finding the average, maximum and median monthly salary of employees.
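The minimum mentioned above can be computed the same way with `min()`. As a sketch, several statistics can also be requested in one call with `Series.agg`. The frame is rebuilt here so the snippet runs on its own, and the names `min_salary` and `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Hourly Rate': [15.25, 20.50, 10.75, 12.00],
                   'Hours Worked': [160, 120, 140, 180]})
df['Monthly Salary'] = df['Hourly Rate'] * df['Hours Worked']

# Finding the minimum monthly salary
min_salary = df['Monthly Salary'].min()

# Computing several statistics at once; the result is a Series
# indexed by the names of the requested statistics.
summary = df['Monthly Salary'].agg(['mean', 'min', 'max', 'median'])

print("Minimum Monthly Salary:", min_salary)
print(summary)
```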
We can also filter our DataFrame based on the new column values such as finding all employees with monthly salary greater than a certain value.
# Filtering based on monthly salary greater than $2000
filtered_df = df[df['Monthly Salary'] > 2000]
# Printing the filtered DataFrame
print(filtered_df)
In this example, we filtered our DataFrame to find all employees with monthly salary greater than $2000.
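A logical operation can also produce a new column of its own, and conditions can be combined with `&`. The sketch below rebuilds the employee frame so it runs standalone. The column name 'Above 2000' and the 160-hour threshold are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Hourly Rate': [15.25, 20.50, 10.75, 12.00],
                   'Hours Worked': [160, 120, 140, 180]})
df['Monthly Salary'] = df['Hourly Rate'] * df['Hours Worked']

# A comparison returns a boolean Series, which can be stored as a column.
df['Above 2000'] = df['Monthly Salary'] > 2000

# Combining two conditions; each one needs its own parentheses.
high_and_full = df[(df['Monthly Salary'] > 2000) & (df['Hours Worked'] >= 160)]

print(df[['Name', 'Above 2000']])
print(high_and_full['Name'].tolist())
```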
In this section, we learned how to perform various operations on a newly added column in a Pandas DataFrame such as arithmetic and logical operations, aggregation and filtering based on column values. These operations enable us to further analyze our data, derive new insights and make informed decisions.
Summary
Adding a new column to a Pandas DataFrame is a fundamental concept in data analysis with pandas. It enables us to add new information to our data quickly and easily. We can add a new column by assigning a scalar value or a list of values having the same length as that of the DataFrame. We can also add a new column by performing some computation on the existing columns. Once a new column has been added, we can perform various operations on it such as arithmetic and logical operations, aggregation, and filtering based on column values.
My personal advice on this topic is to always use meaningful column names and to keep the data type within each added column consistent (for example, all numbers or all strings). This helps maintain data integrity and avoid analysis errors. Overall, mastering the concept of adding a new column to a Pandas DataFrame is essential in data analysis, and it will enable you to perform a range of data-related tasks more efficiently.
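One way to check the data type of a newly added column is the `dtypes` attribute. The sketch below uses a hypothetical frame with one clean numeric column ('Score') and one deliberately mixed column ('Mixed') to show how mixed values fall back to the generic `object` type.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 32]})

# A column of floats gets a numeric dtype.
df['Score'] = [88.5, 92.0]

# Mixing an integer and a string forces the generic object dtype,
# which makes later arithmetic on the column error-prone.
df['Mixed'] = [1, 'two']

print(df.dtypes)

score_is_numeric = pd.api.types.is_numeric_dtype(df['Score'])
mixed_is_numeric = pd.api.types.is_numeric_dtype(df['Mixed'])
```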