· Pandas · 5 min read

A Step-by-Step Guide to Joining Pandas DataFrames

Concatenating DataFrames Horizontally

When dealing with large datasets, it is often necessary to combine multiple data sources. Concatenation is one way to combine DataFrames horizontally. It allows you to combine columns of two or more datasets.

To concatenate two DataFrames horizontally, use the pd.concat() function from the pandas library. Suppose we have two DataFrames: df1 and df2. To join these two DataFrames horizontally, we use the following code:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

frames = [df1, df2]
result = pd.concat(frames, axis=1)

print(result)

Output:

    A   B   C   D   A   B   C   D
0  A0  B0  C0  D0  A4  B4  C4  D4
1  A1  B1  C1  D1  A5  B5  C5  D5
2  A2  B2  C2  D2  A6  B6  C6  D6
3  A3  B3  C3  D3  A7  B7  C7  D7

As shown above, the pd.concat() function takes in a list of DataFrames and joins them horizontally along the axis specified with the axis= parameter. Here, we pass axis=1 to join the DataFrames based on columns.

Keep in mind that when joining DataFrames horizontally, both DataFrames must have the same index. Otherwise, joining the datasets can lead to issues, like data being overwritten, or duplicated index values.

Concatenating DataFrames Vertically

Concatenating DataFrames vertically allows you to stack multiple DataFrames on top of each other, resulting in a single DataFrame with all the rows from each of the DataFrames. One way to do this is by using the pd.concat() function from the pandas library.

Suppose we have two DataFrames: df1 and df2. To concatenate these two DataFrames vertically use the following code:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                      'B': ['B4', 'B5', 'B6', 'B7'],
                      'C': ['C4', 'C5', 'C6', 'C7'],
                      'D': ['D4', 'D5', 'D6', 'D7']})

frames = [df1, df2]
result = pd.concat(frames)

print(result)

Output:

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7

As shown above, the pd.concat() function by default concatenates vertically. This means that it appends the rows of the first DataFrame to the second DataFrame.

Keep in mind that when joining DataFrames vertically, both DataFrames must have the same columns. Otherwise, joining the datasets can lead to issues, like columns being ignored, or duplicated columns.

Merging DataFrames

Merging DataFrames is another way to combine datasets. Unlike concatenation, merging allows you to combine DataFrames based on a common column. Merging is useful when you have two or more datasets that share a common variable, such as a customer ID, product number, or date.

To merge two DataFrames in pandas, use the pd.merge() function. Suppose we have two DataFrames: df1 and df2. We want to merge these DataFrames based on the key column.

import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(df1, df2, on='key')

print(result)

Output:

  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

As shown above, the pd.merge() function takes the two DataFrames and joins them based on the common column key. Here, the on parameter specifies the column on which to merge the two datasets.

By default, pd.merge() performs an ‘inner’ join, which returns only the rows that have matching keys in both DataFrames. It can also perform ‘left’, ‘right’ and ‘outer’ joins, which return either only the rows present in the first DataFrame, only in the second DataFrame, or all rows from both DataFrames, respectively. This is controlled via the how= parameter.

Keep in mind that when merging DataFrames, you should ensure that the column(s) in which you want to merge the data share the same data type across both DataFrames. Otherwise, the merge may fail or produce unexpected results.

Summary

This article discussed various methods for combining DataFrames in pandas: concatenation and merging. We covered both horizontal and vertical concatenation, along with practical examples. We also explained how to merge DataFrames based on a common key. Knowing how to combine datasets will help you build more comprehensive data pipelines and ultimately contribute to better insights. When working with large datasets, it’s important to choose the right method for combining them. Choose concatenation when you want to stack datasets either vertically or horizontally. Choose merging when you want to join datasets based on a common variable. By understanding these techniques, you will have a stronger sense of how to combine your data efficiently for maximum efficacy.