Making a difficult data analysis question easy
A great strategy to use when faced with a tricky data analysis problem is to reshape the dataset into a format that turns it into an easy problem. In this article we will look at an example involving a simple calculation and extensive reshaping.
Introduction
One of the things that makes data exploration such a powerful skill to have is the ease with which we can find interesting datasets online. We can search for data that will help us to answer real world questions, or just browse for interesting collections of data (that's how I picked all the example datasets for the videos that accompany the Drawing from Data book).
One downside, however, of finding existing datasets to work with is that we have no control over the format of the data. When we hit the download button on a new dataset we might find all sorts of things that will make our analysis difficult. Some of these I've already written about - invalid data, missing data, data spread over multiple files or with obvious errors.
A more subtle problem is when the data are not in the structure that we would like. This issue is a bit more complex than the others. If a dataset has annotation errors then we could all agree that that is bad. But the specific way that a data file is structured isn't good or bad in itself - it's just that a given structure makes some problems easy and some problems hard.
Thankfully, pandas has very powerful tools for reshaping and restructuring data. In this article we'll try to answer a problem that is difficult given the structure of our data file, but becomes very easy when we reshape the data.
As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 3 for a discussion of calculations on series and chapter 14 for the reshaping tools.
If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.
Taking a look at the dataset
Our starting point for this analysis is a data file giving the highest and lowest mean daily temperature in Fahrenheit for each year for a bunch of different cities. Let's take a look at it:
import pandas as pd
pd.options.display.max_columns = 6
pd.options.display.max_rows = 10
cities = pd.read_csv('city_min_max.csv')
cities
Hopefully it's not too hard to see the structure. We have a year column, then a bunch of paired minimum and maximum columns for each city. Notice that this is a very wide dataframe, with many more columns than rows. We might call it a summary table.
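Since the file itself isn't reproduced here, here's a rough sketch of the layout. The numbers are invented purely to illustrate the structure - only the column arrangement matters, and the real file has many more city columns:

   Year  London_min  London_max  ...  Stockholm_min  Stockholm_max
0  1995        30.1        77.9  ...           -2.5           70.3
1  1996        28.4        79.2  ...           -4.1           68.9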
This format makes it very easy to answer some types of question. For example, in which years was the coldest day in Singapore warmer than the warmest day in Stockholm? To find out, we just have to filter on these two columns then take the year column:
cities[cities['Stockholm_max'] < cities['Singapore_min']]['Year']
The challenge
Our goal for this analysis is simply to calculate the difference between the minimum and maximum for each city in each year. Let's call this the temperature range. We would like to end up with a very simple table that has three columns: city, year and temperature range.
Now, in many ways the above dataframe looks like it will be very nice to work with. We have no missing data and no invalid values, and all the columns are floating point with the exception of year. So this should be an easy calculation to carry out.
Option one: working with the data in the existing form
Let's start with something simple - can we calculate the temperature range for a single city? Given the column names, that's easy - we just subtract the two series:
cities['London_max'] - cities['London_min']
These numbers aren't much use without the year column, so let's add that. The easiest way is to set year as the index, so that it gets propagated to the resulting series:
cities.set_index('Year')['London_max'] - cities.set_index('Year')['London_min']
OK, now let's try multiple cities. If we just manually pick a few cities and put the calculation in a loop, it's fairly straightforward to create the series. For each city name we can figure out the column names by appending _min and _max, then calculate the temperature range by subtracting them like before:
for city in ['Birmingham', 'Delhi', 'Osaka']:
    temp_range = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    print(city, temp_range)
Although the calculation is not too hard, in the above code we're just printing the series. Life gets a lot more complicated when we want to turn those numbers into a new dataframe. Now we have to turn each city series into a dataframe, add the city name as a column, store all the individual city dataframes in a list, and concatenate them at the end:
dfs = []
for city in ['Birmingham', 'Delhi', 'Osaka']:
    # calculate the series as before
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    # turn it into a dataframe and move the year into a column
    city_df = city_series.to_frame('temp range').reset_index()
    # add the city name column
    city_df['city'] = city
    dfs.append(city_df)
# after the loop, concatenate all of the individual city dataframes
pd.concat(dfs)
This is starting to look pretty complicated, but at least most of the work is now done. All that's left is to replace our manual list of city names with a list of all the city names. Unfortunately, that will involve quite a bit of processing as well. We need to take all the column names, get the part before the underscore, discard any duplicates (otherwise we will get each city name twice) and remember to exclude the year column. To keep it brief, let's just see the final expression (for an explanation of the string processing methods we are using, take a look at chapter 3 in the Drawing from Data book):
cities.columns.str.split('_').str.get(0).unique()[1:]
Technically the result of the above expression is an Index object, not a list, but it will behave like a list when we iterate over it in a loop.
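If it helps, here's the same expression unpacked one step at a time. The values in the comments are illustrative - they assume columns like Year, London_min, London_max and so on:

names = cities.columns          # Index(['Year', 'London_min', 'London_max', ...])
parts = names.str.split('_')    # [['Year'], ['London', 'min'], ['London', 'max'], ...]
firsts = parts.str.get(0)       # Index(['Year', 'London', 'London', ...])
unique_names = firsts.unique()  # Index(['Year', 'London', ...]) - duplicates dropped
city_names = unique_names[1:]   # drop the leading 'Year' entry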
Plugging this expression into our calculation loop gives us the following code:
dfs = []
for city in cities.columns.str.split('_').str.get(0).unique()[1:]:
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    city_df = city_series.to_frame('temp range').reset_index()
    city_df['city'] = city
    dfs.append(city_df)
pd.concat(dfs)
Even for experienced pandas users, this would take quite a bit of effort to understand! But notice that the actual calculation is very simple - we're just subtracting one column from another. All of the complexity is to deal with getting the information we want into the right format. The most important bit of the code - the subtraction - is hard to spot, as it's somewhat hidden inside our complicated for loop.
Option two: reshaping the data
Since all of the complexity in our above code is due to dealing with the format of the data, it's possible to imagine a different format that would make life much easier. Imagine that we had a dataframe with four columns - city, year, min and max. In that case, we would just subtract the min column from the max column and our work would be done.
We don't have a dataframe in that format, but let's see if it's possible to make one. The first step is to convert our existing dataframe into long format using melt:
tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities
Since this is only a temporary dataframe, we will just allow melt to use the default variable and value names.
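As a side note, if you prefer descriptive names from the start, melt takes var_name and value_name arguments - something like this sketch (we'll stick with the defaults below, since the rest of the code refers to the variable and value columns):

cities.melt(id_vars=['Year'], var_name='city_measurement', value_name='temperature')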
Next we can take the variable column and, using the same string processing tools as before, split it into two columns at the underscore:
tidy_cities['variable'].str.split('_', expand=True)
We can add these two new columns by just setting new columns on the existing dataframe. We'll also drop the old variable column as we don't need it anymore:
tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)
tidy_cities.drop(columns=['variable'], inplace=True)
tidy_cities
Now we just have to get the min and max as separate columns. The easiest way to do this is to set a multilevel index, use unstack to turn the measurement values into columns, then reset the index. In this step we'll also rename the columns:
city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary
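As a side note, the manual renaming works because unstack leaves us with two-level column names like ('value', 'max'). If assigning a flat list of names by position feels fragile, one alternative sketch is to drop the redundant level and rename explicitly:

unstacked = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack()
unstacked.columns = unstacked.columns.droplevel(0)  # keep just 'max' / 'min'
alt_summary = unstacked.reset_index().rename(
    columns={'Year': 'year', 'max': 'max temp', 'min': 'min temp'}
)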
Now we have our data in the exact same format that we imagined. And, as promised, the calculation we need to do becomes very easy:
city_summary['max temp'] - city_summary['min temp']
Adding this series as a new column gives us our final dataframe:
city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp']
city_summary
Let's put the two complete solutions side by side:
# looping
dfs = []
for city in cities.columns.str.split('_').str.get(0).unique()[1:]:
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    city_df = city_series.to_frame('temp range').reset_index()
    city_df['city'] = city
    dfs.append(city_df)
city_summary = pd.concat(dfs)
# reshaping
tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)
tidy_cities.drop(columns=['variable'], inplace=True)
city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp']
As an aside, the solution involving reshaping is about 10x faster on my machine, but let's ignore that for now - this article isn't about performance, and we haven't made any attempt to optimize either piece of code.
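If you do want to check the timing on your own machine, here's a minimal sketch using timeit from the standard library - the two functions just wrap the solutions above so they can be called repeatedly:

import timeit

def looping():
    dfs = []
    for city in cities.columns.str.split('_').str.get(0).unique()[1:]:
        s = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
        df = s.to_frame('temp range').reset_index()
        df['city'] = city
        dfs.append(df)
    return pd.concat(dfs)

def reshaping():
    tidy = cities.melt(id_vars=['Year'])
    tidy[['city', 'measurement']] = tidy['variable'].str.split('_', expand=True)
    summary = (
        tidy.drop(columns=['variable'])
        .set_index(['Year', 'city', 'measurement'])
        .unstack()
        .reset_index()
    )
    summary.columns = ['year', 'city', 'max temp', 'min temp']
    summary['temp range'] = summary['max temp'] - summary['min temp']
    return summary

# run each solution 100 times and print the total seconds
print(timeit.timeit(looping, number=100))
print(timeit.timeit(reshaping, number=100))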
Comparing the two solutions, we might note that they are about equal in complexity. If anything, the reshaping solution looks like more typing - though that's partly because it involves more column names. So why do I strongly prefer the second solution?
Think about it from the perspective of someone reading the code for the first time and trying to understand what the code is doing. In the first solution, you have to comprehend the whole for loop in one go. One can imagine adding some print statements to look at the intermediate variables in order to understand the whole thing.
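For example, a debugging sketch along these lines - the print calls are just illustrations of where you might peek at intermediate values:

dfs = []
for city in ['Birmingham', 'Delhi', 'Osaka']:
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    print(city, city_series.head())  # what does the raw range series look like?
    city_df = city_series.to_frame('temp range').reset_index()
    print(city_df.columns)           # did reset_index give us the columns we expect?
    city_df['city'] = city
    dfs.append(city_df)
print(pd.concat(dfs).shape)          # sanity-check the final shape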
The second solution has an interesting property, however - you can still understand the essential calculation even if you have no idea how the reshaping works. All the code involved in reshaping - everything up to this point:
tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)
tidy_cities.drop(columns=['variable'], inplace=True)
city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary
can be pretty much ignored as long as you understand the structure of the city_summary dataframe. And as we already mentioned, once the reshaping is done, the actual calculation is very straightforward:
city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp']
city_summary
As a side benefit, the complete reshaped dataframe is likely to be useful in answering other weather-related questions. Remember, though, that although the reshaped dataframe makes it easier to answer our particular question, it makes it much harder to answer different questions. Look how much more complicated the code is to find the years when the warmest day in Stockholm was colder than the coldest day in Singapore:
(
    city_summary
    .groupby('year')
    .filter(
        lambda x:
        x[x['city'] == 'Stockholm']['max temp'].mean()
        < x[x['city'] == 'Singapore']['min temp'].mean()
    )
    ['year']
    .unique()
)
This is so complicated it would make this article twice as long to explain it - recall that this was a single line for the data in the original format.
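Of course, the strategy this article advocates applies here too: rather than wrestling with groupby and filter, we could reshape city_summary back towards a wide layout and recover something close to the original one-liner. A sketch using pivot:

wide = city_summary.pivot(index='year', columns='city', values=['min temp', 'max temp'])
wide[wide[('max temp', 'Stockholm')] < wide[('min temp', 'Singapore')]].index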
Summary
In this article we have looked at two different approaches to solving a difficult data analysis problem in pandas. If we had to describe them at a high level, we might say that the first way is to write code to solve the difficult problem. The second is to write code to reshape the data so that the difficult problem becomes an easy problem, then solve the easy problem.
In the real world, this can be a very powerful problem-solving technique when working with datasets. The key is the imagination step: given a data analysis question, imagine what data structure would make that question easy to answer. Like any aspect of programming, this becomes easier with practice.
In my experience teaching data exploration with Python, the reshaping tools are the most under-used parts of pandas. Partly this is because they can be tricky to understand, but partly it's a matter of mindset. It can be difficult to remember that the format of a dataset when we receive it is not set in stone; it's simply how it happens to be stored at that time. And one of the big differences I notice between novice and experienced programmers is a willingness to rearrange the data to suit the question at hand.
So next time you are faced with a tricky data analysis problem, try this strategy. Before starting to write code, try to think of a structure for the data that would make the problem easier, and see if you can get your data into that form.
If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!