Introduction

One of the things that makes data exploration such a powerful skill to have is the ease with which we can find interesting datasets online. We can search for data that will help us answer real-world questions, or just browse for interesting collections of data (that's how I picked all the example datasets for the videos that accompany the Drawing from Data book).

One downside, however, of finding existing datasets to work with is that we have no control over the format of the data. When we hit the download button on a new dataset we might find all sorts of things that will make our analysis difficult. Some of these I've already written about - invalid data, missing data, data spread over multiple files or with obvious errors.

A more subtle problem is when the data are not in the structure that we would like. This issue is a bit more complex than the others. If a dataset has annotation errors then we can all agree that's bad. But the specific way that a data file is structured isn't good or bad in itself - it's just that a given structure makes some problems easy and some problems hard.

Thankfully, pandas has very powerful tools for reshaping and restructuring data. In this article we'll try to answer a problem that is difficult given the structure of our data file, but becomes very easy when we reshape the data.

As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 3 for a discussion of calculations on series and chapter 14 for the reshaping tools.

If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.

Taking a look at the dataset

Our starting point for this analysis is a data file giving the highest and lowest mean daily temperature in Fahrenheit for each year for a bunch of different cities. Let's take a look at it:

import pandas as pd

# keep the displayed output compact
pd.options.display.max_columns = 6
pd.options.display.max_rows = 10

cities = pd.read_csv('city_min_max.csv')
cities
Year Abidjan_max Abidjan_min ... Youngstown_min Zurich_max Zurich_min
0 1995 86.7 73.0 ... 4.8 77.6 17.1
1 1996 86.1 72.2 ... -2.1 74.0 10.7
2 1997 85.1 73.0 ... 2.5 73.4 14.2
3 1998 87.2 73.4 ... 13.2 78.6 19.7
4 1999 87.2 73.4 ... 3.3 75.8 16.3
... ... ... ... ... ... ... ...
21 2016 86.9 74.9 ... 7.0 76.4 20.0
22 2017 86.4 74.6 ... 7.9 80.3 17.4
23 2018 85.8 74.6 ... 2.0 80.5 14.5
24 2019 87.0 74.5 ... -4.0 81.7 26.1
25 2020 86.4 77.4 ... 15.4 67.1 27.0

26 rows × 547 columns

Hopefully the structure isn't too hard to see: we have a Year column, then a pair of maximum and minimum columns for each city. Notice that this is a very wide dataframe, with many more columns than rows. We might call it a summary table.
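
Since our display settings hide most of the columns, a quick way to confirm this pattern is to peek at the column names directly - based on the abbreviated output above, this should show something like:

list(cities.columns[:5])
['Year', 'Abidjan_max', 'Abidjan_min', 'Abilene_max', 'Abilene_min']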

This format makes it very easy to answer some types of question. For example, in which years was the coldest day in Singapore warmer than the warmest day in Stockholm? To find out, we just have to filter on these two columns then take the year column:

cities[cities['Stockholm_max'] < cities['Singapore_min']]['Year']
1     1996
2     1997
3     1998
5     2000
7     2002
      ... 
18    2013
20    2015
21    2016
22    2017
25    2020
Name: Year, Length: 16, dtype: int64

The challenge

Our goal for this analysis is simply to calculate the difference between the minimum and maximum temperature for each city in each year. Let's call this the temperature range. We would like to end up with a very simple table that has three columns: city, year and temperature range.

Now, in many ways the above dataframe looks like it will be very nice to work with. We have no missing data and no invalid values, and all the columns are floating point with the exception of Year. So this should be an easy calculation to carry out.

Option one: working with the data in the existing form

Let's start with something simple - can we calculate the temperature range for a single city? Given the column names, that's easy - we just subtract the two series:

cities['London_max'] - cities['London_min']
0     52.0
1     50.4
2     49.9
3     42.7
4     46.8
      ... 
21    45.5
22    47.3
23    55.3
24    52.0
25    28.0
Length: 26, dtype: float64

These numbers aren't much use without the year column, so let's add that. The easiest way is to set year as the index, so that it gets propagated to the resulting series:

cities.set_index('Year')['London_max'] - cities.set_index('Year')['London_min']
Year
1995    52.0
1996    50.4
1997    49.9
1998    42.7
1999    46.8
        ... 
2016    45.5
2017    47.3
2018    55.3
2019    52.0
2020    28.0
Length: 26, dtype: float64
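
As a small aside, calling set_index twice like this is a bit wasteful. If we were keeping this code we would probably set the index once and reuse the result - a minimal cleanup:

# set the year index once, then subtract the two columns
indexed_cities = cities.set_index('Year')
indexed_cities['London_max'] - indexed_cities['London_min']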

OK, now let's try multiple cities. If we just manually pick a few cities and put the calculation in a loop, it's fairly straightforward to create the series. For each city name we can figure out the column names by appending _min and _max, then calculate the temperature range by subtracting them like before:

for city in ['Birmingham', 'Delhi', 'Osaka']:
    temp_range = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    print(city, temp_range)
Birmingham Year
1995    65.8
1996    74.1
1997    64.8
1998    57.4
1999    69.0
        ... 
2016    61.1
2017    62.8
2018    65.1
2019    58.2
2020    46.3
Length: 26, dtype: float64
Delhi Year
1995    52.2
1996    47.5
1997    51.9
1998    54.6
1999    50.7
        ... 
2016    48.8
2017    52.1
2018    48.5
2019    58.1
2020    41.5
Length: 26, dtype: float64
Osaka Year
1995    55.1
1996    56.1
1997    52.3
1998    56.5
1999    57.0
        ... 
2016    61.0
2017    55.5
2018    61.2
2019    54.2
2020    38.8
Length: 26, dtype: float64

Although the calculation is not too hard, in the above code we're just printing the series. Life gets a lot more complicated when we want to turn those numbers into a new dataframe. Now we have to turn each city series into a dataframe, add the city name as a column, store all the individual city dataframes in a list, and concatenate them at the end:

dfs = []
for city in ['Birmingham', 'Delhi', 'Osaka']:
    
    # calculate the series as before
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    
    # turn it into a dataframe and move the year into a column
    city_df = city_series.to_frame('temp range').reset_index()
    
    # add the city name column
    city_df['city'] = city
      
    dfs.append(city_df)
    
# after the loop, concatenate all of the individual city dataframes
pd.concat(dfs)
Year temp range city
0 1995 65.8 Birmingham
1 1996 74.1 Birmingham
2 1997 64.8 Birmingham
3 1998 57.4 Birmingham
4 1999 69.0 Birmingham
... ... ... ...
21 2016 61.0 Osaka
22 2017 55.5 Osaka
23 2018 61.2 Osaka
24 2019 54.2 Osaka
25 2020 38.8 Osaka

78 rows × 3 columns

This is starting to look pretty complicated, but at least most of the work is now done. All that remains is to replace our manual list of city names with a list of all of them. Unfortunately, that will involve quite a bit of processing as well. We need to take all the column names, get the part before the underscore, discard any duplicates (otherwise we will get each city name twice) and remember to exclude the Year column. To keep it brief, let's just see the final expression (for an explanation of the string processing methods we are using, take a look at chapter 3 in the Drawing from Data book):

cities.columns.str.split('_').str.get(0).unique()[1:]
Index(['Abidjan', 'Abilene', 'Abu Dhabi', 'Addis Ababa', 'Akron Canton',
       'Albany', 'Albuquerque', 'Algiers', 'Allentown', 'Almaty',
       ...
       'Washington DC', 'Washington', 'West Palm Beach', 'Wichita Falls',
       'Wichita', 'Wilkes Barre', 'Windhoek', 'Yakima', 'Youngstown',
       'Zurich'],
      dtype='object', length=273)

Technically the result of the above expression is an Index object, not a list, but it will behave like a list when we iterate over it in a loop.
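
If the method chain above is hard to read, a plain Python set comprehension over the column names does the same job. Here's a sketch using rsplit, which splits on the last underscore (the one just before min or max):

# build the unique city names with ordinary string handling
city_names = sorted({col.rsplit('_', 1)[0] for col in cities.columns if col != 'Year'})
len(city_names)
273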

Plugging this expression into our calculation loop gives us the following code:

dfs = []
for city in cities.columns.str.split('_').str.get(0).unique()[1:]:
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    city_df = city_series.to_frame('temp range').reset_index()
    city_df['city'] = city
    dfs.append(city_df)
pd.concat(dfs)
Year temp range city
0 1995 13.7 Abidjan
1 1996 13.9 Abidjan
2 1997 12.1 Abidjan
3 1998 13.8 Abidjan
4 1999 13.8 Abidjan
... ... ... ...
21 2016 56.4 Zurich
22 2017 62.9 Zurich
23 2018 66.0 Zurich
24 2019 55.6 Zurich
25 2020 40.1 Zurich

7098 rows × 3 columns

Even for experienced pandas users, this would take quite a bit of effort to understand! But notice that the actual calculation is very simple - we're just subtracting one column from another. All of the complexity is to deal with getting the information we want into the right format. The most important bit of the code - the subtraction - is hard to spot, as it's somewhat hidden inside our complicated for loop.

Option two: reshaping the data

Since all of the complexity in our above code is due to dealing with the format of the data, it's possible to imagine a different format that would make life much easier. Imagine that we had a dataframe with four columns - city, year, min and max. In that case, we would just subtract the min column from the max column and our work would be done.

We don't have a dataframe in that format, but let's see if it's possible to make one. The first step is to convert our existing dataframe into long format using melt:

tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities
Year variable value
0 1995 Abidjan_max 86.7
1 1996 Abidjan_max 86.1
2 1997 Abidjan_max 85.1
3 1998 Abidjan_max 87.2
4 1999 Abidjan_max 87.2
... ... ... ...
14191 2016 Zurich_min 20.0
14192 2017 Zurich_min 17.4
14193 2018 Zurich_min 14.5
14194 2019 Zurich_min 26.1
14195 2020 Zurich_min 27.0

14196 rows × 3 columns

Since this is only a temporary dataframe, we will just allow melt to use the default variable and value names.
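
If we did want meaningful names straight away, melt accepts var_name and value_name arguments. A sketch of the same call with explicit names:

# same melt, but with explicit names rather than the defaults
cities.melt(id_vars=['Year'], var_name='city_measurement', value_name='temp')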

Next we can take the variable column and, using the same string processing tools as before, turn it into two columns based on the underscore:

tidy_cities['variable'].str.split('_', expand=True)
0 1
0 Abidjan max
1 Abidjan max
2 Abidjan max
3 Abidjan max
4 Abidjan max
... ... ...
14191 Zurich min
14192 Zurich min
14193 Zurich min
14194 Zurich min
14195 Zurich min

14196 rows × 2 columns

We can attach these two new columns by simply assigning to new column names on the existing dataframe. We'll also drop the old variable column, as we don't need it any more:

tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)

tidy_cities.drop(columns=['variable'], inplace=True)
tidy_cities
Year value city measurement
0 1995 86.7 Abidjan max
1 1996 86.1 Abidjan max
2 1997 85.1 Abidjan max
3 1998 87.2 Abidjan max
4 1999 87.2 Abidjan max
... ... ... ... ...
14191 2016 20.0 Zurich min
14192 2017 17.4 Zurich min
14193 2018 14.5 Zurich min
14194 2019 26.1 Zurich min
14195 2020 27.0 Zurich min

14196 rows × 4 columns

Now we just have to get the min and max as separate columns. The easiest way to do this is to set a multilevel index, use unstack to turn the measurement values into columns, then reset the index. In this step we'll also rename the columns:

city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary
year city max temp min temp
0 1995 Abidjan 86.7 73.0
1 1995 Abilene 89.9 24.6
2 1995 Abu Dhabi 104.4 62.4
3 1995 Addis Ababa 71.8 55.4
4 1995 Akron Canton 84.4 2.9
... ... ... ... ...
7093 2020 Wilkes Barre 67.2 15.6
7094 2020 Windhoek 86.3 58.0
7095 2020 Yakima 70.0 16.1
7096 2020 Youngstown 67.3 15.4
7097 2020 Zurich 67.1 27.0

7098 rows × 4 columns
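
As an aside, on recent versions of pandas (1.1 and later, which accept a list of columns for the index) the set_index/unstack/reset_index dance can be written in one step with pivot. A sketch that should produce an equivalent result:

# equivalent reshape using pivot rather than set_index and unstack
city_summary_alt = (
    tidy_cities
    .pivot(index=['Year', 'city'], columns='measurement', values='value')
    .reset_index()
)
city_summary_alt.columns = ['year', 'city', 'max temp', 'min temp']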

Now we have our data in exactly the format we imagined. And, as promised, the calculation we need to do becomes very easy:

city_summary['max temp'] - city_summary['min temp'] 
0       13.7
1       65.3
2       42.0
3       16.4
4       81.5
        ... 
7093    51.6
7094    28.3
7095    53.9
7096    51.9
7097    40.1
Length: 7098, dtype: float64

Adding this series as a new column gives us our final dataframe:

city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp'] 
city_summary
year city max temp min temp temp range
0 1995 Abidjan 86.7 73.0 13.7
1 1995 Abilene 89.9 24.6 65.3
2 1995 Abu Dhabi 104.4 62.4 42.0
3 1995 Addis Ababa 71.8 55.4 16.4
4 1995 Akron Canton 84.4 2.9 81.5
... ... ... ... ... ...
7093 2020 Wilkes Barre 67.2 15.6 51.6
7094 2020 Windhoek 86.3 58.0 28.3
7095 2020 Yakima 70.0 16.1 53.9
7096 2020 Youngstown 67.3 15.4 51.9
7097 2020 Zurich 67.1 27.0 40.1

7098 rows × 5 columns

Comparing the two solutions

For the sake of comparison, let's see the two sets of code together, without comments:

# looping

dfs = []
for city in cities.columns.str.split('_').str.get(0).unique()[1:]:
    city_series = cities.set_index('Year')[f'{city}_max'] - cities.set_index('Year')[f'{city}_min']
    city_df = city_series.to_frame('temp range').reset_index()
    city_df['city'] = city
    dfs.append(city_df)
city_summary = pd.concat(dfs)

# reshaping
tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)
tidy_cities.drop(columns=['variable'], inplace=True)
city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp'] 

As an aside, the solution involving reshaping is about 10x faster on my machine, but let's ignore that for now - this article isn't about performance, and we haven't made any attempt to optimize either piece of code.
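
If you'd like to check the timings on your own machine, the easiest approach in a notebook is the %%timeit cell magic. A sketch for the reshaping version (the looping version can be timed with a similar cell):

%%timeit
# time the reshaping solution end to end
tidy = cities.melt(id_vars=['Year'])
tidy[['city', 'measurement']] = tidy['variable'].str.split('_', expand=True)
summary = tidy.set_index(['Year', 'city', 'measurement'])['value'].unstack().reset_index()
summary['temp range'] = summary['max'] - summary['min']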

Comparing the two solutions, we might note that they are about equal in complexity. If anything, the reshaping solution looks like more typing - though that's partly because it involves more column names. So why do I strongly prefer the second solution?

Think about it from the perspective of someone reading the code for the first time and trying to understand what it does. In the first solution, you have to comprehend the whole for loop in one go - you can imagine having to add print statements to inspect the intermediate variables just to follow what's happening.

The second solution has an interesting property, however - you can still understand the essential calculation even if you have no idea how the reshaping works. All the code involved in reshaping - everything up to this point:

tidy_cities = cities.melt(id_vars=['Year'])
tidy_cities[['city', 'measurement']] = tidy_cities['variable'].str.split('_', expand=True)
tidy_cities.drop(columns=['variable'], inplace=True)
city_summary = tidy_cities.set_index(['Year', 'city', 'measurement']).unstack().reset_index()
city_summary.columns = ['year', 'city', 'max temp', 'min temp']
city_summary
year city max temp min temp
0 1995 Abidjan 86.7 73.0
1 1995 Abilene 89.9 24.6
2 1995 Abu Dhabi 104.4 62.4
3 1995 Addis Ababa 71.8 55.4
4 1995 Akron Canton 84.4 2.9
... ... ... ... ...
7093 2020 Wilkes Barre 67.2 15.6
7094 2020 Windhoek 86.3 58.0
7095 2020 Yakima 70.0 16.1
7096 2020 Youngstown 67.3 15.4
7097 2020 Zurich 67.1 27.0

7098 rows × 4 columns

can be pretty much ignored as long as you understand the structure of the city_summary dataframe. And as we already mentioned, once the reshaping is done, the actual calculation is very straightforward:

city_summary['temp range'] = city_summary['max temp'] - city_summary['min temp'] 
city_summary
year city max temp min temp temp range
0 1995 Abidjan 86.7 73.0 13.7
1 1995 Abilene 89.9 24.6 65.3
2 1995 Abu Dhabi 104.4 62.4 42.0
3 1995 Addis Ababa 71.8 55.4 16.4
4 1995 Akron Canton 84.4 2.9 81.5
... ... ... ... ... ...
7093 2020 Wilkes Barre 67.2 15.6 51.6
7094 2020 Windhoek 86.3 58.0 28.3
7095 2020 Yakima 70.0 16.1 53.9
7096 2020 Youngstown 67.3 15.4 51.9
7097 2020 Zurich 67.1 27.0 40.1

7098 rows × 5 columns

As a side benefit, the complete reshaped dataframe is likely to be useful in answering other weather-related questions. Remember, though, that although the reshaped dataframe makes it easier to answer our particular question, it makes it much harder to answer different questions. Look how much more complicated the code is to find the years when the warmest day in Stockholm was colder than the coldest day in Singapore:

(
    city_summary
    .groupby('year')
    .filter(
        lambda x : 
        x[x['city'] == 'Stockholm']['max temp'].mean() 
        < x[x['city'] == 'Singapore']['min temp'].mean()
    )
    ['year']
    .unique()
)
array([1996, 1997, 1998, 2000, 2002, 2004, 2007, 2008, 2009, 2011, 2012,
       2013, 2015, 2016, 2017, 2020])

This is so complicated that explaining it would make this article twice as long - recall that the same question needed only a single line of code with the data in its original format.
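
Of course, the lesson cuts both ways. If we found ourselves asking this kind of question often, we could simply reshape city_summary back into a wide form and recover the one-liner - a quick sketch:

# pivot back to one row per year, with a (measurement, city) column for each pair
wide = city_summary.pivot(index='year', columns='city', values=['min temp', 'max temp'])
wide.index[wide[('max temp', 'Stockholm')] < wide[('min temp', 'Singapore')]]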

Summary

In this article we have looked at two different approaches to solving a difficult data analysis problem in pandas. If we had to describe them at a high level, we might say that the first way is to write code to solve the difficult problem. The second is to write code to reshape the data so that the difficult problem becomes an easy problem, then solve the easy problem.

In the real world, this can be a very powerful problem-solving technique when working with datasets. The key is the imagination step: given a data analysis question, imagine what data structure would make that question easy to answer. Like any aspect of programming, this becomes easier with practice.

In my experience teaching data exploration with Python, the reshaping tools are the most under-used parts of pandas. Partly this is because they can be tricky to understand, but partly it's a matter of mindset. It can be difficult to remember that the format of a dataset when we receive it is not set in stone; it's simply how the data happen to be stored at that time. And one of the big differences I notice between novice and experienced programmers is a willingness to rearrange the data to suit the question at hand.

So next time you are faced with a tricky data analysis problem, try this strategy. Before starting to write code, try to think of a structure for the data that would make the problem easier, and see if you can get your data into that form.

If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!