Working with Python Pandas Data Types
Intro
We’re going to look at data types and almost everything you need to know about them in Python Pandas. This is part of a series, and the previous post was about renaming columns.
In this post, we will cover what data types are, how to check the types in a dataframe, and how to change a single column or set multiple column types at once.
Current dataframe
Our dataframe looks like this now.
date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |
2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 |
2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 |
2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 |
2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 |
What are data types?
Simple Explanation
Data types are like categories for the information in a table. Pandas can hold several different types of data, including:
- Numbers: These can be whole numbers (like 3 or 17) or numbers with decimals (like 2.5 or 10.9). These can be used to show amounts or measurements.
- Labels: Sometimes we want to group or filter information by categories. These categories can be written as words (like “red” or “big”) or numbers (like 1 or 2).
- Words: Sometimes we want to store information as words or sentences. We can do this with the text data type.
- Dates and times: Pandas has special ways to store and work with dates and times. This can be helpful when we want to see how things change over time.
- True or false: Sometimes we just need to know if something is true or false. We can use the boolean data type for this.
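To make this concrete, here is a minimal sketch (the column names and values are made up for illustration) that builds a small dataframe with one column of each kind and prints the type Pandas assigns to each column.
import pandas as pd
# Small, made-up dataframe with one column per kind of data described above
example = pd.DataFrame({
    'wholeNumbers': [3, 17, 100],                    # numbers without decimals
    'decimals': [2.5, 10.9, 3.14159],                # numbers with decimals
    'words': ['red', 'big', 'blue'],                 # text
    'trueFalse': [True, False, True],                # true or false
    'dates': pd.to_datetime(['2022-12-27', '2022-12-28', '2022-12-29']),  # dates and times
})
print(example.dtypes)  # shows the data type Pandas assigned to each column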
Pandas Data Types
It’s important to recognise that across different ways of storing data, these data types are referred to by different names and can behave slightly differently. Here are the Pandas data types we will be using.
Type Table
Data Type | Description |
---|---|
int64 | Integer data type. Can hold whole numbers such as 3, 17, or 100. |
float64 | Floating point data type. Can hold numbers with decimal points such as 2.5, 10.9, or 3.14159. |
object | Object data type. Can hold a variety of data types, including strings, lists, and dictionaries. |
bool | Boolean data type. Can hold the values True or False . |
datetime64[ns] | Datetime data type. Can hold date and time information, such as “2022-12-28” or “14:30:00”. |
category | Categorical data type. Can hold a fixed set of categorical values, such as “red”, “yellow”, or “green”. |
Type Use Cases
To aid understanding, here are some use cases for each of the data types.
Data Type | Use Case |
---|---|
int64 | An int64 data type can be used to represent quantitative data that does not contain decimal points. For example, you might use an int64 column to store the number of items in an order, the number of employees in a company, or the number of votes in an election. |
float64 | A float64 data type can be used to represent quantitative data that does contain decimal points. This data type is useful for storing data with a high level of precision, such as measurements, currency values, or scientific data. |
object | An object data type can be used to store a variety of data types, including strings, lists, and dictionaries. This data type is often used when the specific data type of the column is not known in advance, or when the column contains a mix of data types. |
bool | A bool data type can be used to store true/false values. This data type is often used to store binary data, such as whether a customer has opted in to a newsletter or whether a product is in stock. |
datetime64[ns] | A datetime64[ns] data type can be used to store date and time information. This data type is useful for working with time-series data, such as stock prices, weather data, or event logs. |
category | A category data type can be used to store a fixed set of categorical values. This data type is useful for storing data that can be grouped or filtered, such as product categories, geographical regions, or customer segments. |
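As a quick illustration of the category type described in the last row, here is a small sketch; the colour values are invented purely for the example.
import pandas as pd
# An invented column holding a small, fixed set of labels
colours = pd.Series(['red', 'yellow', 'green', 'red', 'green'])
print(colours.dtype)                # object
as_category = colours.astype('category')
print(as_category.dtype)            # category
print(as_category.cat.categories)   # the fixed set: ['green', 'red', 'yellow']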
Checking our types
Now that we understand the types we can use in Pandas, let’s check the types in our data. Types are inferred when using read_csv, which means a best guess is made. You should always check the data types when working with data.
You can check types on a dataframe by using the .info() method.
Let’s add that to our code
import pandas as pd
raw = pd.read_csv("sales.csv")
raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)
raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)
raw.info()
Here is the important information from the output
Column | Non-Null Count | Dtype |
---|---|---|
date | 1451274 | object |
estKey | 1451274 | int64 |
capacity | 1451274 | int64 |
occupancy | 1451274 | float64 |
roomsSold | 1451274 | int64 |
avgRate | 1451274 | float64 |
salesValue | 1451274 | float64 |
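As an aside, if you only want the types and not the non-null counts or memory summary, the .dtypes attribute gives a more compact view. A small sketch, assuming raw has been loaded and renamed as in the snippet above:
print(raw.dtypes)
# date           object
# estKey          int64
# capacity        int64
# occupancy     float64
# roomsSold       int64
# avgRate       float64
# salesValue    float64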
Changing a single type
The date column being stored as an object does not look correct; however, the rest of the data types do.
Changing a string to a date in Pandas
Let’s change the object to a date.
import pandas as pd
raw = pd.read_csv("sales.csv")
raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)
raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)
raw['date'] = pd.to_datetime(raw['date'])
raw.head()
The data has now changed from 27/12/2022 to 2022-12-27.
date | estKey | capacity | occupancy | roomsSold | avgRate | salesValue |
---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |
You can now also check the type by adding raw.info(), which will show datetime64[ns] for the date column, which is what we expect.
Column | Non-Null Count | Dtype |
---|---|---|
date | 1451274 | datetime64[ns] |
Explaining the code
We use the code raw['date'] = pd.to_datetime(raw['date']). Breaking it down:
- raw['date'] on the left-hand side assigns a new value to the “date” column of the raw dataframe.
- The new value is the result of calling pd.to_datetime() on the existing “date” column.
- pd.to_datetime() converts the values in the “date” column to a datetime data type.
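One hedged note: the raw file stores day-first dates like 27/12/2022, so it can be safer to tell pd.to_datetime() the exact format rather than letting it guess. The format string below is an assumption based on that sample value, not part of the original snippet.
# Explicit day/month/year format, assuming the file stores dates like 27/12/2022
raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y')
# Or, if some rows might not parse, errors='coerce' turns them into NaT for inspection
# raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y', errors='coerce')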
Setting multiple types
It’s a good idea to always explicitly set the data types you want. If the source data changes or a type is inferred differently, this acts as a failsafe. If data is in the wrong format and cannot be converted, you will get an error, which is actually helpful: you wouldn’t want to cause problems later in the process, so it’s good to find out early.
Code for setting the types
import pandas as pd
raw = pd.read_csv("sales.csv")
raw.rename(columns={'est_ref': 'est_key', 'avg_rate_paid': 'avg_rate'}, inplace=True)
raw.rename(columns=lambda x: x[0].lower() + x.strip().lower().replace('_', ' ').title().replace(' ', '')[1:], inplace=True)
raw['date'] = pd.to_datetime(raw['date'])
raw = raw.astype({
'date': 'datetime64[ns]',
'estKey': 'int64',
'capacity': 'int64',
'occupancy': 'float64',
'roomsSold': 'int64',
'avgRate': 'float64',
'salesValue': 'float64'
})
raw.info()
We don’t expect the output to change or any errors to appear at this point; this is just a best practice, as described above, to prevent problems occurring in the future.
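A possible variation, sketched under the assumption that the raw CSV uses the original snake_case column names seen in the rename step (date, est_ref, and so on): you can push the types into the read itself with read_csv’s parse_dates and dtype arguments.
import pandas as pd
# A sketch only: the dtype keys must match the column names as they appear in the CSV
raw = pd.read_csv(
    'sales.csv',
    parse_dates=['date'],     # parsed straight to datetime64[ns]
    dayfirst=True,            # because the file stores dates like 27/12/2022
    dtype={'est_ref': 'int64', 'capacity': 'int64', 'occupancy': 'float64'},
)
The rename steps would then run afterwards as before; the benefit is that an unexpected value in the file fails at read time rather than later in the process.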
Type checklist
This is best practice. If you’re just doing a quick one-off piece of ad hoc analysis that won’t be repeated, it’s not necessarily worth investing the time, but it’s a good habit to get into.
- All the types in our data are correct. If you’re not sure about a type, refer to the tables above.
- We have explicitly set types for all of the columns in our dataframe.
Next in the series: learn how to handle null or missing values in your data using the pandas library in Python.