· Pandas · 6 min read
Creating and adding columns with Python Pandas
Intro
One of the key features of Pandas is the ability to work with columns in a DataFrame, which are essentially series of data with a common index. In this tutorial, we will cover various methods for creating and adding columns to a Pandas DataFrame.
By the end of this post, you will have a solid understanding of how to work with columns in Pandas.
Learn how to effectively deal with null values in Tableau. In this tutorial, we'll cover strategies for handling nulls, including using the IS NOT NULL function.
Calculating a column with existing columns
We’re going to calculate a column for Revenue Per Available Room. The first course of action is reassigning our clean dataframe, so that we can reference this later if needs be. Insert a new cell and write the following code
df = clean
We are now going to use the df
variable, short for dataframe.
Using short hand bracket notation
The simplest way to add columns is by using the short hand bracket notation. Let’s take a look at an example, where we calculate the RevPAR figure (Revenue Per Available room) using this method
df = clean
df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df.head(5)
Output
date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue | revPAR |
---|---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 | 27.008616 |
2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 | 28.788227 |
2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 | 116.666570 |
2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 | 46.837037 |
2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 | 162.834500 |
Not that revPAR is rounding to 6 decimal places, and we have not specified the type. Let’s change our code to do those two things.
As a best practice, you should always specify the type when adding columns with Pandas.
df = clean
df['revPAR'] = (df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float)
df.head(5)
Our output is now rounded to two decimal places, and is the correct type float64
Using Pandas Assign function
df = clean
df = df.assign(revPAR=(df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float))
df.head()
This gives the same output. The assign()
function is a method of the DataFrame class in the pandas library that allows you to create a new DataFrame by adding one or more new columns to an existing DataFrame. It takes a dictionary of column names and column values as arguments, and returns a new DataFrame with the added columns.
Using Pandas Insert function
Using the Pandas insert function allows us to insert a column at a specific location within the dataframe. You can do this if you want to have your columns in a specific order.
The insert function will modify the existing dataframe in place. So you do not need to overwrite the
df
variable.
df = clean
df.insert(3, 'revPAR', (df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float))
df.head()
Output
date | estKey | capacity | revPAR | occupancy | roomsSold | avgRate | salesValue |
---|---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 27.01 | 0.75 | 217 | 35.97 | 7805.49 |
2022-12-27 | 1 | 203 | 28.79 | 0.35 | 71 | 82.31 | 5844.01 |
2022-12-27 | 2 | 207 | 116.67 | 0.51 | 106 | 227.83 | 24149.98 |
2022-12-27 | 3 | 27 | 46.84 | 0.37 | 10 | 126.46 | 1264.60 |
2022-12-27 | 4 | 20 | 162.83 | 0.87 | 17 | 191.57 | 3256.69 |
As you can see, we now have revPAR in a specific location (the 4th column) as we start counting columns at a 0 index. So we have date=0, estKey =1, capacity=2 and revPAR = 3
More examples
We can perform any Python Math operators to easily create new columns.
Python Operator Types
Operator | Operator Name | Description | Example Use |
---|---|---|---|
+ | Addition | Adds two values | x = 3 + 4 (x is now 7) |
- | Subtraction | Subtracts one value from another | y = 10 - 3 (y is now 7) |
* | Multiplication | Multiply two values | z = 2 * 5 (z is now 10) |
/ | Division | Divide one value by another | a = 10 / 3 (a is now 3.3333333333333335) |
% | Modulo | Returns the remainder of division | b = 7 % 4 (b is now 3) |
** | Exponentiation | Raises one value to the power of another | c = 2 ** 3 (c is now 8) |
Example of Math Operation
df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)
You can interchange these easily. Here is how you multiply two columns
df['Multiply'] = (df['roomsSold'] * df['capacity']).round(2).astype(float)
Concatenating Columns (Join two columns together)
Let’s have a look at how we can concatenate two columns in pandas. In our example, maybe we want a unique column which contains the date and estKey as a string. Let’s look at using the apply method for this, and a lambda function. There are string concatenation methods in Pandas which is simpler, but these are not strings. This will work regardless of type.
Python f strings - Side note
In the code below, we are using Python f strings to concatenate the date and estKey columns.
About f strings
Python f-strings allow you to embed expressions inside string literals, using the f prefix and curly braces . They provide a concise and convenient way to include the value of variables or expressions inside a string.
Here is a summary of some key points about Python f-strings:
- F-strings start with the f prefix and are followed by a string literal, which can contain expressions inside curly braces . The expressions inside the curly braces are evaluated and the results are included in the string.
- You can use any valid Python expression inside the curly braces, including variables, function calls, and arithmetic operations.
- F-strings are faster than the older string formatting methods, such as the % operator or the format method, because they do not require any additional string formatting.
Output
df = clean
df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)
df['dateEstKey'] = df.apply(lambda x: f"{x['date']}-{x['estKey']}", axis=1)
df.head()
Our column now looks like 2022-12-27 00:00:00-0
Let’s go one step further and only use the “date” part, and not the time. To do this, we are going to select on a part of the date column. The easiest way to do this is by using string slicing. But remember, our date is not a string!
df = clean
df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)
df['dateEstKey'] = df.apply(lambda x: f"{str(x['date'])[:10]}-{x['estKey']}", axis=1)
df.head()
I’m using the str()
python method to convert the date to a string, before using [:10]
to select the first 10 characters of the string.
Our dateEstKey column now looks like 2022-12-27-0
, which is exactly what we wanted.
Adding a column or columns with a single value
Sometimes we just want to apply a value to all columns within a dataset. For example we can add the number 1 so that we can SUM()
rows instead of counting them.
Here’s how you can do that
df = df.assign(literalOne=1).astype({'literalOne': 'int'})
Let’s say we also wanted to add a brand onto this dataset. We could amend it to add multiple values.
df = df.assign(literalOne=1, brand="Cojolt").astype({'literalOne': 'int', 'brand': 'object'})
df.head()
Conditional Columns
Conditional columns means that we can add a value depending on the result of a condition.
In this example. Let’s say we want to simply give a rating against revPAR of bad or good.
If revPAR is less that 100 it’s bad. If it’s more than or equal to 100 then it’s good.
Calculating with Apply & Lambda function
df['revPARStatus'] = df['revPAR'].apply(lambda x: 'bad' if x < 100 else 'good')
Extract Lambda to a Python function
A lambda function is basically an anonymous function that is run inline
. We can declare functions that we can run too. Here is how we could do that.
def get_revpar_status(revpar):
if revpar < 100:
return 'bad'
else:
return 'good'
df['revPARStatus'] = df['revPAR'].apply(get_revpar_status)
This is a little cleaner and easier to read.
We could add a third status to our if statement, if we wanted to include more categorisation.
def get_revpar_status(revpar):
if revpar < 100:
return 'bad'
elif revpar < 200:
return 'OK'
else:
return 'good'
Dataframe now
date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue | revPAR | occupancyNew | dateEstKey | revPARStatus |
---|---|---|---|---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 | 27.008616 | 0.75 | 2022-12-27-0 | bad |
2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 | 28.788227 | 0.35 | 2022-12-27-1 | bad |
2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 | 116.666570 | 0.51 | 2022-12-27-2 | OK |
2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 | 46.837037 | 0.37 | 2022-12-27-3 | bad |
2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 | 162.834500 | 0.85 | 2022-12-27-4 | OK |
I’ve deleted the additional columns brand and literalOne, you can learn how in the next post.
Let's look at how you can delete/remove columns in Python Pandas