Data Exploration with Pandas (Part 2)

In the previous article, I wrote about some introductory stuff and basic Pandas capabilities. In this part, the main focus will be on DateTime values. I am also going to introduce you to some grouping and merging possibilities in Pandas. For this purpose, here is another dataset downloaded from the UCI Repository, which contains date and time columns.

This time my data comes from an Excel file, so the way I read it is slightly different than in the previous post. After importing the pandas library (see the previous blog post), I’m going to use the read_excel() function and display a data sample with the head() function.
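A minimal sketch of those two steps (the file name is just a placeholder for wherever you saved the dataset):

import pandas as pd

# the file name is a placeholder - point it to your own Excel file
data = pd.read_excel('online_retail.xlsx')
data.head()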


As you can see, “PERIOD_START_TIME” is the first column, and this column is going to be the focus of this post. This dataset is also suitable for grouping and making aggregations on top of those groups.

To check the column types, the info() function is used. This helps you check whether “PERIOD_START_TIME” is a datetime object, so that datetime functions can be applied. If not, the column should be cast to datetime in the following way (after importing the datetime library):

import datetime as dt
data['PERIOD_START_TIME'] = pd.to_datetime(data.PERIOD_START_TIME)

Now, let’s see how to aggregate “PERIOD_START_TIME” to hourly values.

import datetime as dt

# floor each timestamp to the hour, group, and aggregate the numeric columns
data = data.groupby([data['PERIOD_START_TIME'].dt.floor('H'), 'CustomerID', 'InvoiceNo', 'StockCode', 'Country']).agg(['mean', 'sum', 'max'])
# flatten the MultiIndex columns, e.g. ('Quantity', 'sum') -> 'Quantity_sum'
data.columns = data.columns.map('_'.join)

What we have done here is aggregate “PERIOD_START_TIME” to the hourly level and find the mean, max and sum of the selected numeric columns, while grouping by the others. In this case, I have dropped the other aggregated columns and kept just the sum.
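That drop step isn’t shown above; a minimal sketch, assuming we want to keep only the *_sum columns and bring the grouping keys back as regular columns, could look like this:

# keep only the '_sum' aggregations (e.g. 'Quantity_sum', 'UnitPrice_sum') and drop the rest
data = data[[col for col in data.columns if col.endswith('_sum')]]
# turn the grouping keys (PERIOD_START_TIME, Country, ...) back into ordinary columns
data = data.reset_index()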


Now let’s find the busy hour for each country, to see at which hours people mostly do their shopping.

First, let’s find “Quantity_sum” for every hour grouped by country:

data['Quantity_sum_by_hour'] = data.groupby(['PERIOD_START_TIME', 'Country'])['Quantity_sum'].transform('sum')

For this, we need separate date and time columns.

We can achieve this with:

data['date'] = data['PERIOD_START_TIME'].dt.date
data['time'] = data['PERIOD_START_TIME'].dt.time

After the next line of code, it will be clearer why we need separate date and time columns:

busy_hour = data.sort_values(['Quantity_sum_by_hour'], ascending=False).groupby(['date', 'Country']).head(1)

Here we group by date and country, because our goal is to find, for every country and each day, the hour with the largest “Quantity” sum.

Now let’s sort by country to see who the biggest shoppers are:

busy_hour.sort_values(['date', 'Country'], ascending=False).groupby(['date', 'Country']).head(1)


Using the head() function with the value 1 returns the row with the maximum value for each day and country, since the data is sorted in descending order.

As you can see, I’ve selected just a few columns that I need.

The next step could be plotting some of the results from the previous steps.

Let’s take a look at how the values for the busy hour in the United Kingdom grew over a few consecutive days.

united_kingdom_data = busy_hour[busy_hour['Country']=='United Kingdom']

The simplest way of plotting is using the matplotlib library:

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 5))
plt.plot(united_kingdom_data['date'], united_kingdom_data['Quantity_sum_by_hour'])
plt.show()


The first line of code imports the library, the second one sets the figure size, and in the third we choose the x and y coordinates. Without the last line, plt.show(), nothing would be displayed.

That was a straightforward example of grouping and plotting data. Deeper plotting functionality will be the subject of my future posts.

Now let’s import one more dataset, this time a CSV (downloaded from here), to show joining and merging capabilities.
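Reading it works just like in Part 1 (the file name is a placeholder):

continents_data = pd.read_csv('continents_data.csv')
continents_data.head()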

In this data file called ‘continents_data’ we also have a ‘Country’ column, which is going to be used for joining, and new columns: ‘Continent’ and ‘Region’.

data_merged = data.merge(continents_data, how='inner', on='Country')

or, equivalently:

data_merged_2 = pd.merge(data, continents_data, how='inner', on='Country')

Also, you can append one dataframe to another with the append() and concat() functions, but I’m leaving that for you to explore.
Merging methods compared to SQL:

left  - LEFT OUTER JOIN
right - RIGHT OUTER JOIN
outer - FULL OUTER JOIN
inner - INNER JOIN

When performing an outer join, it is good practice to include an indicator column, which shows which dataset each row comes from.

pd.merge(data, continents_data, how='outer', on='Country', indicator='indicator_column')


As we can see, the “continents_data” dataset has no data for the United Kingdom, so the ‘Continent’ and ‘Region’ columns are empty, and we can clearly conclude that from the ‘indicator_column’ created above. When both dataframes contain columns with the same name, we should also add suffixes, so we know which dataframe each column comes from:

pd.merge(data, continents_data, on='Country', how='left', suffixes=('_left', '_right'))

We can also map a set of values into new values:

data['Continent'] = data.Continent.map({'Europe': 'Old Continent', 'Asia' : 'the East', 'Americas' : 'West'})

Now let’s calculate a rolling mean across our data:

data['Rolling_mean'] = data.groupby('Country')['UnitPrice'].apply(lambda x: x.rolling(window=2, center=False).mean())

The first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next value in the subset.
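A tiny illustration of how a window of size 2 moves over a series:

import pandas as pd

prices = pd.Series([10, 20, 30, 40])
print(prices.rolling(window=2).mean())
# 0     NaN   <- not enough values for a full window yet
# 1    15.0   <- (10 + 20) / 2
# 2    25.0   <- (20 + 30) / 2
# 3    35.0   <- (30 + 40) / 2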

This post covered the basics of data manipulation using merging, joining and mapping functionality, as well as working with time-stamped data. In the following posts, I’m going to write in more detail about data plotting and visualization, but make sure you have understood the basic pandas functionality presented in this and the previous post in order to follow along. 🙂

Data Exploration with Pandas (Part 1)

If you ever decide to become someone who is into big data, you can surely do it without having a clue about pandas. But that’s not the brightest solution, because why would you leave aside something that’s gonna make you a lot better? Pandas is a well-known library for manipulating datasets that contain numerical and tabular structures, which makes it a pretty good-to-know library for data engineers and data scientists. In Part 1 we’re gonna go through some of the basic stuff to introduce you to Pandas’ capabilities. For the purpose of this article, as an example dataset I’ve used a free dataset with GDP statistics for world countries (you can download it from here). You can also find a lot of other free datasets online at the UCI Repository. For presentation I’m using Jupyter Notebook.

Let’s start with basic stuff – importing data to our script, after importing the pandas library.

import pandas as pd
data = pd.read_csv('gdp_country.csv', skiprows=2, delimiter=',')

The variable I’m gonna use from now on is called “data”, which should be clear to everyone with at least a little programming experience. The next thing is reading the data with pd.read_csv(), where the path to the data is the first parameter in the parentheses. In this case, the data is in the same folder as my Python (notebook) script. The “skiprows” parameter depends on the quality of your data. I have chosen this dataset on purpose to make it more interesting, but to be honest this is one in 50 times that I need to use it. As the word says, you can choose how many top rows you want to skip. The last parameter is really important. CSV is an abbreviation of comma-separated values, so it’s obvious that the delimiter, i.e. the character that separates the values, is a comma in this dataset.

When it comes to displaying the dataset after reading it, you don’t always need to see the full set; it’s enough to look at the first few rows, and that is when we use the head() function. As a parameter you pass the number of top rows you want to see.

data.head(2)


Big data means that you are often going to work with a huge amount of data, which usually includes some unwanted or unnecessary columns and rows, so the best way to optimize your memory usage is to drop them. If there is just one column to drop, you can use a lighter command:

data.drop("column_to_drop", axis=1, inplace=True)

For more columns the best approach is to create a list of column names, and drop them all together:
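A minimal sketch, with placeholder column names:

# placeholder column names - replace them with the columns you actually want to drop
columns_to_drop = ['Indicator Code', '1960', '1961']
data = data.drop(columns_to_drop, axis=1)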

Or by dropping all columns whose names match some string:

data[data.columns.drop(list(data.filter(regex='string_to_drop')))]

The drop() function takes a column name or a list of column names as the first parameter; the second one is axis, where 0 represents rows and 1 represents columns.

The inplace parameter, with values True or False, gives you the opportunity to save the change permanently in the existing variable; if you assign the result to a new dataframe instead, the “inplace” parameter should not be used.

In case you need to drop many columns and keep just a few, there is an easier way: just select the necessary columns.

data = data[['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '1965', '2009', '2011']]


If you want to make a new dataframe with exactly the same shape and values, you can just do the following:

new_data = data

However, this will not let you work with the two independently; both names refer to the same underlying object, so changing one changes the other.
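A quick illustration of the problem (the column used here is just an example):

new_data = data
new_data.drop('1965', axis=1, inplace=True)
# the '1965' column is now gone from data as well,
# because new_data and data are just two names for the same object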

The following line provides such a capability:

new_data = data.copy()

The data.shape attribute gives a quick insight into the number of rows and columns in the dataframe.
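For example, you can unpack it directly:

rows, columns = data.shape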

Filtering rows is not that distant from filtering columns.

data = data[data['Country Name']=='Angola']


Filtering can be applied to multiple criteria by chaining conditions:

data = data[(data['Country Name']=='Angola') | (data['Country Name']=='Uruguay')]


Another way for achieving this:

my_list_of_values = ['Angola', 'Uruguay']
data.loc[data['Country Name'].isin(my_list_of_values)]

If you need the top rows, you can use the head() function mentioned earlier, but if you need rows from somewhere in the middle of the dataset, you can select them by index, starting from 0 as the first row.

data = data[134:142]

Also, a set of random rows can be retrieved from a Pandas dataframe, using the .sample() function:

data.sample(4)


You may have noticed that we have a NaN value for Monaco in 2011, and that’s classified as missing data. For now, we are going to skip that.

Now let’s play with column names and keys.

Something that is usually required is changing column names.

data.rename(columns={'Country Name' : 'Name of country'}, inplace=True)


Adding a prefix or suffix is a frequent but simple routine.

data = data.add_suffix('_added-suffix')


Since the data comes from various sources, after reading it in, spaces might occur within column names. The best practice is to remove them, since they might cause trouble:

data.columns = [x.strip().replace(' ','_') for x in data.columns]


To split all values in a column on a given character, we can use the next line of code (but keep in mind that no values should be missing):

data['Indicator Name'].str.split(',').str[1]

We pass the name of the DataFrame (“data”), then the column name (“Indicator Name”), and proceed with the character we’re gonna split by and the ordinal number of the part we are taking (starting from zero).

Splitting in Pandas works pretty much the same as in Python:

string = '2_under_score'
splitted = string.split('_')[1]
Out:
'under'

To set desired value in one cell, we can pass the name of a column and row index:

data.set_value(7, 'Indicator Name', 'Changed value in cell')
# in newer pandas versions: data.at[7, 'Indicator Name'] = 'Changed value in cell'


Handling missing values and duplicates with basic grouping and sorting data

There is a new column Continent which I’m gonna use for grouping.

data.sort_values(['2011'], ascending=False).groupby('Continent').head(5)


The “2011” column is sorted in descending order, but only within each group, where the groups represent continents. That’s all you need to know about grouping before getting to missing NaN values and duplicates.

In this case it’s obvious that there are some missing values, but in large datasets they will sometimes be somewhere in the middle, so we need the .isnull() function. We can use it inside the len() function to count the rows with NaN values in a given column. That’s pretty easy and we do it this way:

len(data[data.Column.isnull()==True])

Also, we can count the unique values in a column with the next line of code:

len(data['name_of_column'].unique())

The way to handle missing data depends on the requirements: whether to fill it with some constant value, or to drop rows or columns.

Dropping rows with at least one NaN value:

data.dropna(axis=0, inplace=True)


From the previous dataframe, only one row is now left.

Dropping columns that contain NaN values:

data.dropna(axis=1, inplace=True)


Imputing a value instead of NaN:

data.fillna(1)


For example, if you plan to do some multiplication, the best way is to fill the missing values with 1, but you can fill them with a string or anything else you need.

Dropping duplicates is pretty simple with “data.drop_duplicates()”, but this won’t work for you every time, so sometimes we need something a little bit more complex.
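One step up is limiting the duplicate check to a subset of columns; a minimal sketch, with a hypothetical column name:

# rows count as duplicates if they match on the listed columns only; keep the first occurrence
data = data.drop_duplicates(subset=['Country Name'], keep='first')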

I had a case with a huge number of duplicate rows, or rows with approximately the same values, so I will use that approach here and take the maximum value for each group. The columns you want to group by are specified in the groupby parentheses.

There is a neat way to do this with Pandas:

data.groupby(['Continent'])[['1965', '2009', '2011']].max()


As you may have noticed, if you don’t mention a column in the groupby, or you don’t apply any aggregate function to it, it won’t appear in the output.

“Continent” is now the index column, and to turn it back from the index into a column label, we need to reset the index.

data.reset_index(drop=False)

With the option drop=False the index goes back to being a regular column (“Continent” in this case), while drop=True discards it.

And finally, saving the dataframe into a CSV file:

data.to_csv('name_of_csv_file.csv', index=False)

That’s enough for Part 1 – we have gone over basic data import from CSV into a Pandas DataFrame, selecting, dropping and renaming columns, and concluded with handling missing data. These operations are common when preparing almost any dataset for further calculations. In the following posts, I’m going to write about joining, merging, advanced column calculations, applying map functions and more. The focus will be on challenges from my real projects, which I find more interesting. Any question is welcome, so do not hesitate. 🙂