· Pandas · 4 min read
Importing CSV files with Python Pandas
Introduction
In order to follow along with this course, we recommend you use the following data sets. It’s not necassary, you’re free to just read. You will get a lot more benefit if you code along with us though.
Data to use
We’ve created 3 tables that you should download, and put into the location of your Jupyter notebook, before you get started. Just click the links below to download.
Managers
Establishments
Sales
It should now look like this
So now that we have our data. Let’s think about what we want to do. We want to bring in the data to our Notebook, and create a Pandas dataframe. This will allow us to start using our data.
The most common method of moving/sharing data is by csv file.
CSV files
A CSV file is like a big list of information. Each piece of information is put in its own little box called a “cell.” Then, all the cells are organized into rows and columns like in a spreadsheet. Each row represents one person or thing and each column is a piece of information about that person or thing. CSV stands for “Comma Separated Values” because the cells are separated by a comma in the file. CSV files are often used to store data and they can be opened and edited in different computer programs.
CSV File headers
In a CSV file, the first row is often used to label the columns and is known as the header row. The header row contains the names of the different pieces of information that are being stored in the file. For example, if the CSV file contains a list of people and their contact information, the header row might include column names like “Name,” “Email,” “Phone,” and “Address.”
The header row is important because it helps to identify the purpose of each column in the file. It also makes it easier to understand the data when you are viewing or working with the file. It’s a good practice to include a header row in your CSV file, especially if the file will be used by others.
If you are working with a CSV file that does not have a header row, you can usually add one by simply creating a new row at the top of the file and typing in the names of the columns. Just be sure to separate each column name with a comma.
Reading the data
So let’s look at the code to read the csv file, and I will talk through it after.
import pandas as pd
raw = pd.read_csv("sales.csv")
raw.head(5)
So lets look line by line what this is doing.
import pandas as pd
This code is importing the pandas library and giving it an alias pd. The alias pd means that we can use the pandas library in our code by referencingpd
. You can give it almost any alias, however pd is a convention and I would stick to it.raw = pd.read_csv("sales.csv")
This is starting to get a little more complicated now but let’s break down a couple of things we’re doing.raw =
We are creating a variable raw, which will store the result ofpd.read_csv("sales.csv")
. This is our pandas dataframe. It’s called raw as this data has not been manipulated at all yet.pd.read_csv("sales.csv")
The pandas library includes a functionread_csv
that we can use to do exactly that, read CSV files. The function will return a pandas dataframe.
raw.head(5)
The variableraw
now holds our dataframe. With pandas dataframes we can use the head() function to show a certain amount of rows from the top of our data
Output
You should see an output that looks like this! Congratulations you now have your first pandas dataframe. You can repeat these steps for reading any CSV files into dataframes.
date | est_ref | capacity | occupancy | rooms_sold | avg_rate_paid | sales_value |
---|---|---|---|---|---|---|
2022-12-27 | 0 | 289 | 0.75 | 217 | 35.97 | 7805.49 |
2022-12-27 | 1 | 203 | 0.35 | 71 | 82.31 | 5844.01 |
2022-12-27 | 2 | 207 | 0.51 | 106 | 227.83 | 24149.98 |
2022-12-27 | 3 | 27 | 0.37 | 10 | 126.46 | 1264.60 |
2022-12-27 | 4 | 20 | 0.87 | 17 | 191.57 | 3256.69 |
In this post, we will cover how to rename a single or multiple columns in Python Pandas.