Creating and adding columns with Python Pandas

Intro

One of the key features of Pandas is the ability to work with columns in a DataFrame, which are essentially series of data with a common index. In this tutorial, we will cover various methods for creating and adding columns to a Pandas DataFrame.

By the end of this post, you will have a solid understanding of how to work with columns in Pandas.

How to deal with Null handling in Tableau or Is Not Null

Learn how to effectively deal with null values in Tableau. In this tutorial, we'll cover strategies for handling nulls, including using the IS NOT NULL function.

Calculating a column with existing columns

We’re going to calculate a column for Revenue Per Available Room. The first course of action is reassigning our clean dataframe, so that we can reference this later if needs be. Insert a new cell and write the following code

df = clean

We are now going to use the df variable, short for dataframe.

Using short hand bracket notation

The simplest way to add columns is by using the short hand bracket notation. Let’s take a look at an example, where we calculate the RevPAR figure (Revenue Per Available room) using this method

df = clean

df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df.head(5)

Output

date	estKey	capacity	occupancy	roomsSold	avgRate	salesValue	revPAR
2022-12-27	0	289	0.75	217	35.97	7805.49	27.008616
2022-12-27	1	203	0.35	71	82.31	5844.01	28.788227
2022-12-27	2	207	0.51	106	227.83	24149.98	116.666570
2022-12-27	3	27	0.37	10	126.46	1264.60	46.837037
2022-12-27	4	20	0.87	17	191.57	3256.69	162.834500

Not that revPAR is rounding to 6 decimal places, and we have not specified the type. Let’s change our code to do those two things.

As a best practice, you should always specify the type when adding columns with Pandas.

df = clean

df['revPAR'] = (df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float)
df.head(5)

Our output is now rounded to two decimal places, and is the correct type float64

Using Pandas Assign function

df = clean

df = df.assign(revPAR=(df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float))
df.head()

This gives the same output. The assign() function is a method of the DataFrame class in the pandas library that allows you to create a new DataFrame by adding one or more new columns to an existing DataFrame. It takes a dictionary of column names and column values as arguments, and returns a new DataFrame with the added columns.

Using Pandas Insert function

Using the Pandas insert function allows us to insert a column at a specific location within the dataframe. You can do this if you want to have your columns in a specific order.

The insert function will modify the existing dataframe in place. So you do not need to overwrite the df variable.

df = clean

df.insert(3, 'revPAR', (df['roomsSold'] * df['avgRate'] / df['capacity']).round(2).astype(float))
df.head()

Output

date	estKey	capacity	revPAR	occupancy	roomsSold	avgRate	salesValue
2022-12-27	0	289	27.01	0.75	217	35.97	7805.49
2022-12-27	1	203	28.79	0.35	71	82.31	5844.01
2022-12-27	2	207	116.67	0.51	106	227.83	24149.98
2022-12-27	3	27	46.84	0.37	10	126.46	1264.60
2022-12-27	4	20	162.83	0.87	17	191.57	3256.69

As you can see, we now have revPAR in a specific location (the 4th column) as we start counting columns at a 0 index. So we have date=0, estKey =1, capacity=2 and revPAR = 3

More examples

We can perform any Python Math operators to easily create new columns.

Python Operator Types

Operator	Operator Name	Description	Example Use
+	Addition	Adds two values	`x = 3 + 4` (x is now 7)
-	Subtraction	Subtracts one value from another	`y = 10 - 3` (y is now 7)
*	Multiplication	Multiply two values	`z = 2 * 5` (z is now 10)
/	Division	Divide one value by another	`a = 10 / 3` (a is now 3.3333333333333335)
%	Modulo	Returns the remainder of division	`b = 7 % 4` (b is now 3)
**	Exponentiation	Raises one value to the power of another	`c = 2 ** 3` (c is now 8)

Example of Math Operation

df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)

You can interchange these easily. Here is how you multiply two columns

df['Multiply'] = (df['roomsSold'] * df['capacity']).round(2).astype(float)

Concatenating Columns (Join two columns together)

Let’s have a look at how we can concatenate two columns in pandas. In our example, maybe we want a unique column which contains the date and estKey as a string. Let’s look at using the apply method for this, and a lambda function. There are string concatenation methods in Pandas which is simpler, but these are not strings. This will work regardless of type.

Python f strings - Side note

In the code below, we are using Python f strings to concatenate the date and estKey columns.

About f strings

Python f-strings allow you to embed expressions inside string literals, using the f prefix and curly braces . They provide a concise and convenient way to include the value of variables or expressions inside a string.

Here is a summary of some key points about Python f-strings:

F-strings start with the f prefix and are followed by a string literal, which can contain expressions inside curly braces . The expressions inside the curly braces are evaluated and the results are included in the string.
You can use any valid Python expression inside the curly braces, including variables, function calls, and arithmetic operations.
F-strings are faster than the older string formatting methods, such as the % operator or the format method, because they do not require any additional string formatting.

Output

df = clean

df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)
df['dateEstKey'] = df.apply(lambda x: f"{x['date']}-{x['estKey']}", axis=1)

df.head()

Our column now looks like 2022-12-27 00:00:00-0

Let’s go one step further and only use the “date” part, and not the time. To do this, we are going to select on a part of the date column. The easiest way to do this is by using string slicing. But remember, our date is not a string!

df = clean

df['revPAR'] = df['roomsSold'] * df['avgRate'] / df['capacity']
df['occupancyNew'] = (df['roomsSold'] / df['capacity']).round(2).astype(float)
df['dateEstKey'] = df.apply(lambda x: f"{str(x['date'])[:10]}-{x['estKey']}", axis=1)

df.head()

I’m using the str() python method to convert the date to a string, before using [:10] to select the first 10 characters of the string.

Our dateEstKey column now looks like 2022-12-27-0, which is exactly what we wanted.

Adding a column or columns with a single value

Sometimes we just want to apply a value to all columns within a dataset. For example we can add the number 1 so that we can SUM() rows instead of counting them.

Here’s how you can do that

df = df.assign(literalOne=1).astype({'literalOne': 'int'})

Let’s say we also wanted to add a brand onto this dataset. We could amend it to add multiple values.

df = df.assign(literalOne=1, brand="Cojolt").astype({'literalOne': 'int', 'brand': 'object'})

df.head()

Conditional Columns

Conditional columns means that we can add a value depending on the result of a condition.

In this example. Let’s say we want to simply give a rating against revPAR of bad or good.

If revPAR is less that 100 it’s bad. If it’s more than or equal to 100 then it’s good.

Calculating with Apply & Lambda function

df['revPARStatus'] = df['revPAR'].apply(lambda x: 'bad' if x < 100 else 'good')

Extract Lambda to a Python function

A lambda function is basically an anonymous function that is run inline. We can declare functions that we can run too. Here is how we could do that.

def get_revpar_status(revpar):
  if revpar < 100:
    return 'bad'
  else:
    return 'good'

df['revPARStatus'] = df['revPAR'].apply(get_revpar_status)

This is a little cleaner and easier to read.

We could add a third status to our if statement, if we wanted to include more categorisation.

def get_revpar_status(revpar):
  if revpar < 100:
    return 'bad'
  elif revpar < 200:
    return 'OK'
  else:
    return 'good'

Dataframe now

date	estKey	capacity	occupancy	roomsSold	avgRate	salesValue	revPAR	occupancyNew	dateEstKey	revPARStatus
2022-12-27	0	289	0.75	217	35.97	7805.49	27.008616	0.75	2022-12-27-0	bad
2022-12-27	1	203	0.35	71	82.31	5844.01	28.788227	0.35	2022-12-27-1	bad
2022-12-27	2	207	0.51	106	227.83	24149.98	116.666570	0.51	2022-12-27-2	OK
2022-12-27	3	27	0.37	10	126.46	1264.60	46.837037	0.37	2022-12-27-3	bad
2022-12-27	4	20	0.87	17	191.57	3256.69	162.834500	0.85	2022-12-27-4	OK

I’ve deleted the additional columns brand and literalOne, you can learn how in the next post.

Remove Columns in Python Pandas

Let's look at how you can delete/remove columns in Python Pandas