If you ever decide to become someone who is into big data, surely you can do it without having a clue about pandas. But that’s not the brightest solution, because why would you leave aside something that’s gonna make you a lot better? Pandas is a well-known library for manipulating datasets that contain numerical and table structures, which makes it a pretty good-to-know library for data engineers and data scientists. In Part 1 we’re gonna go through some of the basic stuff to introduce you to Pandas capabilities. For the purpose of this article, as an example dataset I’ve used a free dataset about GDP statistics for World countries (you can download it from here). You can also find a lot of other free datasets online, for example in the UCI Repository. For presentation I’m using Jupyter Notebook.
Let’s start with the basic stuff – importing data into our script, after first importing the pandas library.
import pandas as pd
data = pd.read_csv('gdp_country.csv', skiprows=2, delimiter=',')
The variable I’m gonna use from now on is called “data”, and I think this is pretty clear to everyone who has at least a little programming experience. The next thing is reading the data with pd.read_csv(), where the path to the data is the first parameter in the parentheses. In this case, the data is in the same folder as my Python (notebook) script. The “skiprows” parameter depends on the quality of your data. I have chosen this file on purpose, to make it more interesting, but to be honest this is one in 50 times that I need to use it. As the word says, you choose how many top rows you want to skip. The last parameter is really important. CSV is an abbreviation of comma-separated values, so it’s obvious that the delimiter, the character that separates the values, is a comma in this dataset.
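Just for illustration, the same call for a hypothetical semicolon-separated file with no junk rows at the top would look like this (the file name is made up):
# hypothetical file, so no skiprows needed and ';' as the delimiter
other_data = pd.read_csv('some_other_file.csv', delimiter=';')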
When it comes to displaying the dataset after reading it, you don’t always need to see the full set; it’s enough to look at a few top rows, and that is when we use the head() function. As a parameter you type the number of top rows you want to see.
data.head(2)
Big Data means that you are really often going to work with a huge amount of data, which usually includes some unwanted or unnecessary columns and rows, so the best way to optimize your memory usage is to drop them. If there is just one column, you can use a lighter command:
data.drop("column_to_drop", axis=1, inplace=True)
For more columns the best approach is to create a list of column names, and drop them all together:
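For example (the column names are just placeholders):
columns_to_drop = ['first_column_to_drop', 'second_column_to_drop']
data.drop(columns_to_drop, axis=1, inplace=True)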
Or by dropping all the columns whose names contain some string:
data[data.columns.drop(list(data.filter(regex='string_to_drop')))]
The drop() function takes a column name or a list of column names as the first parameter; the second one is axis, where 0 stands for rows and 1 for columns.
Inplace, with values True or False, gives you the opportunity to save the result permanently in the existing variable; but if you assign the result to a new variable, like in the second example, the “inplace” parameter should not be used.
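A small sketch of the difference, with placeholder column names:
# with inplace=True the existing "data" variable is modified and nothing is returned
data.drop('column_to_drop', axis=1, inplace=True)
# without inplace (it defaults to False) the original stays intact and a new DataFrame is returned
smaller_data = data.drop('another_column_to_drop', axis=1)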
In case you need to drop many columns and keep just a few, there is an easier way: just select the necessary columns.
data = data[['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '1965', '2009', '2011']]
If you want to make a new DataFrame with exactly the same shape and values, you can just do the following:
new_data = data
However, this will not let you work with the two variables independently, because both names point to the same underlying object.
The following line provides such a capability:
new_data = data.copy()
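A quick way to convince yourself of the difference:
new_data = data
new_data is data      # True, both names point to the same object
new_data = data.copy()
new_data is data      # False, the copy lives on its own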
The data.shape attribute gives a quick insight into the number of rows and columns in the DataFrame.
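For example, on a dataset with 266 rows and 7 columns (the numbers are just illustrative) you would get:
data.shape
Out: (266, 7)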
Filtering rows is not that different from filtering columns.
data = data[data['Country Name']=='Angola']
Filtering can be applied to multiple criteria by chaining conditions:
data = data[(data['Country Name']=='Angola') | (data['Country Name']=='Uruguay')]
Another way to achieve this:
my_list_of_values = ['Angola', 'Uruguay']
data.loc[data['Country Name'].isin(my_list_of_values)]
If you need the top rows you can use the head() function mentioned earlier, but suppose you need rows from somewhere in the middle of the dataset; then you can select them by index, starting from 0 for the first row.
data = data[134:142]
Also, a set of random rows can be retrieved from a Pandas DataFrame using the .sample() function:
data.sample(4)
You may have noticed that we have a NaN value for Monaco in 2011, and that’s classified as missing data. For now, we are going to skip that.
Now let’s play with column names and keys.
Something that is usually required is changing column names.
data.rename(columns={'Country Name' : 'Name of country'}, inplace=True)
Adding a prefix or suffix is a frequent but simple routine.
data = data.add_suffix('_added-suffix')
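The prefix counterpart works the same way (the prefix string is just an example):
data = data.add_prefix('gdp_')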
Since the data comes from various sources, spaces might occur within column names after reading it in. The best practice is to remove them, since they might cause trouble:
data.columns = [x.strip().replace(' ','_') for x in data.columns]
To split all the values in a column by a given character, we can use the next line of code (but keep in mind that no values should be missing):
data['Indicator Name'].str.split(',').str[1]
We pass the name of the DataFrame (“data”), then the column name (“Indicator Name”), then the character we’re gonna split by, and finally the ordinal number of the substring we are taking (starting from zero).
Splitting in Pandas works pretty much the same as in Python:
string = '2_under_score'
splitted = string.split('_')[1]
Out: 'under'
To set a desired value in one cell, we can pass the row index and the name of the column. The older set_value() method is deprecated, so the .at accessor is the way to go here:
data.at[7, 'Indicator Name'] = 'Changed value in cell'
Handling missing values and duplicates with basic grouping and sorting data
There is a new column Continent which I’m gonna use for grouping.
data.sort_values(['2011'],ascending=False).groupby('Continent').head(5)
Column “2011” is sorted in descending order, but only within each group, where a group represents a continent. That’s all you need to know about grouping to get in touch with missing (NaN) values and duplicates.
In this case, it’s obvious there are some missing values, but in large datasets they will sometimes be in the middle, so we need the .isnull() function. We can use it inside len() to count the rows with NaN values in some column. That’s pretty easy and we do it this way:
len(data[data.Column.isnull()==True])
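A slightly shorter equivalent, since isnull() returns booleans that can simply be summed (same placeholder column name):
data['Column'].isnull().sum()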
Also, we can count the unique values with the following code:
len(data['name_of_column'].unique())
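A shorter alternative is nunique(); just note that, unlike unique(), it ignores NaN values by default:
data['name_of_column'].nunique()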
The way to handle missing data depends on the requirements, whether to fill it with some constant value or drop rows or columns.
Dropping rows with at least one NaN value:
data.dropna(axis=0, inplace=True)
From the previous dataframe, only one row is now left.
Dropping columns:
data.dropna(axis=1, inplace=True)
Imputing a value instead of NaN:
data.fillna(1)
For example, if you plan to do some multiplication, the best way to fill missing values is with 1, but you can fill them with a string or anything else you need.
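To fill just one column, and with a string instead of a number, something like this would do (the column name and the placeholder value are only an example):
data['Continent'] = data['Continent'].fillna('Unknown')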
Dropping duplicates is pretty simple with this code “data.drop_duplicates()”, but this won’t work for you every time, so we need something a little bit more complex.
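For reference, drop_duplicates() also accepts a subset of columns and a keep strategy, which covers the simpler cases (the column choice here is just illustrative):
# keep only the first occurrence of each country
data.drop_duplicates(subset=['Country Name'], keep='first', inplace=True)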
I had one case with a huge number of duplicate rows, or rows with approximately the same values, so I will use it here to take the maximum value for each group. The columns you want to group by go inside the groupby parentheses.
There is a neat way to do this with Pandas:
data.groupby(['Continent'])[['1965', '2009', '2011']].max()
As you may have noticed, columns that are not mentioned in the groupby and not covered by any aggregate function won’t appear in the output.
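If you need such columns in the output anyway, one option is to aggregate them explicitly with agg(); the choice of functions below is only an example:
data.groupby('Continent').agg({'1965': 'max', '2009': 'mean', '2011': 'max'})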
“Continent” is now the index column, and to turn it from an index back into a regular column, we need to reset the index.
data.reset_index(drop=False)
The drop option controls what happens to the old index: drop=False keeps it as a regular column (in this case the “Continent” column), while drop=True discards it.
And to finish, saving the DataFrame into a CSV file:
data.to_csv('name_of_csv_file.csv', index=False)
That’s enough for Part 1 – we have gone over basic data import from CSV into a Pandas DataFrame, selecting, dropping and renaming columns, and concluded with handling missing data. Those operations are common for almost every dataset that needs to be prepared for further calculations. In the following posts, I’m going to write about joining, merging, advanced column calculations, applying map functions and more. The focus will be on challenges from my real projects, which I find will be more interesting. Any question is welcome, so do not hesitate. 🙂