Data Exploration with Pandas (Part 2)

Sinisa Jovic Technology Leave a Comment

In the previous article, I wrote about some introductory stuff and basic Pandas capabilities. In this part, the main focus will be on DateTime values. I am also going to introduce you to some grouping and merging possibilities in Pandas. For this purpose here is another dataset downloaded from UCI Repository, which contains date and time columns.

This time my data comes from Excel file, so the way I read it is a slightly different than in the previous post. After importing pandas library (see the previous blog post) I’m going to use read_excel() function and display data sample with head() function.

Selection_023

As you can see, we have “PERIOD_START_TIME” as a first column, and this column is going to be the focus of this post. This dataset is also suitable for grouping and making aggregations on top of these groups.

To check the column types, the info() function is used. This helps you to look if “PERIOD_START_TIME” is a datetime object, so datetime functions can be applied. If not, the column should be cast to datetime in the following way (after importing the datetime library):

import datatime as dt
data['PERIOD_START_TIME'] = pd.to_datetime(data.PERIOD_START_TIME)

Now, let’s see how to aggregate “PERIOD_START_TIME” to hourly values.

import datetime as dt
data = data.groupby([data['PERIOD_START_TIME'].dt.floor('H'),'CustomerID', 'InvoiceNo', 'StockCode', 'Country']).agg(['mean','sum', 'max'])
data.columns = data.columns.map('_'.join)

What we have done here is aggregating “PERIOD_START_TIME” to the hourly level and finding mean, max and sum for on purpose selected columns while grouping by others. In this case, I have drop other aggregated column and keep just the sum.

Selection_024

Now let’s find a busy hour for each country to see hours when citizens mostly do shopping.

First, let’s find “Quantity_sum” for every hour grouped by country:

data['Quantity_sum_by_hour'] = data.groupby(['PERIOD_START_TIME' ,'Country'])['Quantity_sum'].transform('sum')

For this case we need to separate date and time columns.

We can achieve this with:

data['date'] = data['PERIOD_START_TIME'].dt.date
data['time'] = data['PERIOD_START_TIME'].dt.time

After next code line it will be more understandable why we need separated date and time:

busy_hour=data.sort_values(['Quantity_sum_by_hour'],ascending=False).groupby(['Date','Country']).head(1)

Here we have a grouping by data and country, because our goal is to find sum of columns “Quantity” for every country,  for each day.

Now let’s sort by country to see who are the biggest shoppers:

busy_hour.sort_values(['date', 'Country'],ascending=False).groupby(['date', 'Country']).head(1)

Selection_025

Using head() function with value 1 returns maximum value for each day and country.

As you can see, I’ve selected just a few columns that I need.

Next step could be plotting some results of our previous steps.

Let’s take a look at how values for a busy hour in the United Kingdom grew in a few consecutive days.

united_kingdom_data = data[data['Country']=='United Kingdom']

The simplest way for plotting is using matplotlib library:

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 5))
plt.plot(busy_hour['date'], busy_hour['Quantity_sum_by_hour'])
plt.show()

Selection_026

The first line of code is importing library, the second one is the assigning a size, in the third line we have to choose x and y coordinates. Without the last one where we said plt.show(), nothing would be printed.

That would be a straightforward example of grouping and plotting data. A bit deeper plotting functionalities will be the subject of my future posts.

Now let’s import one more dataset, this time a csv (downloaded from here),  to show joining and merging capabilities.

In this data file called ‘continents_data’ we also have a ‘Country’ column, which is going to be used for joining, and new columns: ‘Continent’ and ‘Region’.

data_merged = data.merge(continents_data, how='inner', on='Country')

, or

data_merged_2 = pd.merge(data, continents_data, how='inner', on='Country')

Selection_034Also, you can append one dataframe to another one, with append() and concat() function, but I’m leaving it to you to explore.
Merging methods compared to SQL:
Left   -  Left outer join
Right - Right outer join
Outer - Full outer join
Inner - Inner join

When performing an outer join, a good practice is to include indicator column, which shows information from which dataset that value comes.

pd.merge(data, continents_data, how='outer', on='Country', indicator='indicator_column')

Selection_035

We should add a suffix to know from which dataframe it comes. As we can see, in dataset  “continents_data” we have no data for United Kingdom, so columns ‘Continent’ and ‘Region’ are empty and we can clearly conclude that from ‘indicator_column’ which we create in this way:

pd.merge(data, continents_data, on='subject_id', how='left', suffixes=('_left', '_right'))

We can also map a set of values into new values:

data['Continent'] = data.Continent.map({'Europe': 'Old Continent', 'Asia' : 'the East', 'Americas' : 'West'})

Now let’s calculate a rolling mean across our data:

data['Rolling_mean'] = data.groupby('Country')['UnitPrice'].apply(lambda x:x.rolling(center=False,window=2).mean())

The first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next value in the subset.
Selection_039

This post provides the basics of data manipulation using merging, joining and mapping functionalities, and also catching up with the time-stamped data. In following posts, I’m going to write in more detail about data plotting and visualization, but make sure you have understood basic pandas functionalities presented in this and previous posts in order to follow up. 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *