tldr: concatenating categorical Series with non-identical categories gives an object dtype in the result, with severe memory implications.

Introduction

In a library as large and featureful as pandas, there are bound to be surprising behaviours. In this article we'll take a look at a memory issue that I've run into multiple times with real-life datasets: an unexpected increase in memory usage when concatenating multiple dataframes.

As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 2 for a discussion of data types and chapter 16 for a discussion of the memory implications.

If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.

Saving memory on a single data file

As an example, we'll be using a slightly modified version of this car accident dataset:

https://www.kaggle.com/sobhanmoosavi/us-accidents

This dataset contains records of car accidents in the USA, and for the purposes of this story we have one file per month. Our goal is to combine all of these to make one large dataframe that we can use for analysis. Let's load up the first file to see the structure:

import pandas as pd

# load just the data for January
df = pd.read_csv('January_car_accidents.csv')
df
City State month
0 El Cerrito CA January
1 Berkeley CA January
2 Oakley CA January
3 Richmond CA January
4 El Cerrito CA January
... ... ... ...
301919 Sun Valley CA January
301920 Costa Mesa CA January
301921 Costa Mesa CA January
301922 Madras OR January
301923 Jordan Valley OR January

301924 rows × 3 columns

This is pretty straightforward - we have columns for City, State and month, and around 300,000 rows. Each row represents a single car accident. In the original dataset, of course, there are many more columns. For this article we are interested in the State column.

Let's assume that because this is a large dataset, we're worried about running out of memory. We know that for columns with a small number of values, storing them as a categorical data type can save a lot of memory (see chapter 16 in the Drawing from Data book for a full discussion of memory issues). Let's check how much memory the State column uses when stored as object. Because memory_usage returns the result in bytes, we'll divide by 1e6 to get an answer in megabytes:

df['State'].memory_usage(deep=True) / 1e6
17.813644

Now the same column as a category:

df['State'].astype('category').memory_usage(deep=True) / 1e6
0.305956

As expected, a considerable saving. Once we've figured this out, it's probably a good habit to specify a categorical data type when reading in the file:

df = pd.read_csv('January_car_accidents.csv', dtype={'State' : 'category'})
df.memory_usage(deep=True) / 1e6
Index     0.000128
City     19.861495
State     0.305828
month    19.323136
dtype: float64

Incidentally, the City and month columns would also be better stored as categories, but we will ignore that for now.

Concatenating multiple data files

Let's now make a loop to read all 12 data files, and check the memory usage for each, remembering to keep the 'category' dtype:

import glob

for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype = {'State' : 'category'})
    print(filename, month_df['State'].memory_usage(deep=True) / 1e6)
March_car_accidents.csv 0.29748
April_car_accidents.csv 0.30353
September_car_accidents.csv 0.296511
February_car_accidents.csv 0.288488
May_car_accidents.csv 0.300636
January_car_accidents.csv 0.305956
July_car_accidents.csv 0.227059
June_car_accidents.csv 0.314354
November_car_accidents.csv 0.303147
December_car_accidents.csv 0.303714
October_car_accidents.csv 0.328636
August_car_accidents.csv 0.293021

Obviously the exact numbers differ, but each of our single-month data files takes around 0.3 Mb for the State column. Our estimate for the memory usage of the State column in our combined dataframe, therefore, is 0.3 * 12: around 3.6 Mb.
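
If we want to double-check that back-of-envelope figure, we can sum the measured sizes directly - just a quick sketch that re-reads each file to measure the State column:

# add up the per-month State column sizes to get a more precise estimate (in MB)
estimated_mb = sum(
    pd.read_csv(filename, dtype={'State' : 'category'})['State'].memory_usage(deep=True) / 1e6
    for filename in glob.glob('*car_accidents.csv')
)
estimated_mb
# roughly 3.6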

The best way to combine these data files is to make a list of dataframes then concatenate them at the end:

all_dfs = []

for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype = {'State' : 'category'})
    all_dfs.append(month_df)
    
big_df = pd.concat(all_dfs)
big_df
City State month
0 Columbus OH March
1 Miamisburg OH March
2 Dayton OH March
3 Columbus OH March
4 Columbus OH March
... ... ... ...
288925 Riverside CA August
288926 San Diego CA August
288927 Orange CA August
288928 Culver City CA August
288929 Highland CA August

3513617 rows × 3 columns

All looks fine here: for our combined dataset we have around 3.5 million rows. But take a look at the memory usage:

big_df['State'].memory_usage(deep=True) / 1e6
235.412339

Far from taking up less than 4 Mb as expected, our State column is taking up more than 200 Mb. What has happened? The clue is in the data type:

big_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3513617 entries, 0 to 288929
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   City    object
 1   State   object
 2   month   object
dtypes: object(3)
memory usage: 107.2+ MB

We have lost our categorical type; the State column in the big dataframe has been turned back into object. The reason: not every state is present in every single-month data file:

big_df.groupby('month')['State'].nunique()
month
April        48
August       49
December     49
February     49
January      48
July         49
June         48
March        49
May          49
November     49
October      49
September    49
Name: State, dtype: int64

Most months have records for 49 states, but a few have records for only 48. We can find the missing states quite easily with a mixture of pandas and Python's built-in set type:

april = big_df[big_df['month'] == 'April']
august = big_df[big_df['month'] == 'August']

# which state has records for August but not April
set(august['State']) - set(april['State']) 
{'ND'}

The missing states mean that the categories are slightly different between the individual month dataframes, and when we ask pandas to concatenate categorical columns with different categories, it sets the data type back to object. We can see this behaviour with a simple example:

cat_series_one = pd.Series(['apple', 'banana', 'orange']).astype('category')
cat_series_two = pd.Series(['apple', 'banana', 'banana']).astype('category')

combined_series = pd.concat([cat_series_one, cat_series_two])

# look at the dtype for the combined series
combined_series
0     apple
1    banana
2    orange
0     apple
1    banana
2    banana
dtype: object

We will encounter this problem any time we use pd.concat on categorical columns where the categories are not exactly identical. One such scenario is the one we've just looked at: when we need to combine multiple data files.
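
The flip side is worth seeing too. As a quick sketch using the same made-up fruit data: when the inputs end up with exactly the same categories, pd.concat keeps the categorical dtype:

series_one = pd.Series(['apple', 'banana']).astype('category')
series_two = pd.Series(['banana', 'apple']).astype('category')

# both series have the categories ['apple', 'banana'], so the dtype survives
pd.concat([series_one, series_two]).dtype
# a categorical dtype with categories ['apple', 'banana']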

Concatenation and chunking

Another is when we use chunking to read a data file one piece at a time. The canonical way to read a large CSV file in chunks is to append the chunks to a list then use pd.concat on the list:

all_dfs = []

# process 100,000 rows at a time
for chunk in pd.read_csv('January_car_accidents.csv', dtype={'State' : 'category'}, chunksize=100000):
    # possibly do some processing on chunk
    all_dfs.append(chunk)
    
pd.concat(all_dfs).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301924 entries, 0 to 301923
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   City    301914 non-null  object
 1   State   301924 non-null  object
 2   month   301924 non-null  object
dtypes: object(3)
memory usage: 6.9+ MB

Here we run into the same issue. Although we have specified a categorical data type for the State column of each chunk, when we combine them the State column gets turned back into object, since each chunk contains a slightly different set of unique states.

This version of the problem is especially harsh, as the most common reason for chunking input files in the first place is to reduce peak memory usage! So it's a nasty surprise when the memory usage of the State column turns out to be so much higher than our estimate.

Fixing the problem

We can get round this problem in a number of ways. If we have enough memory, we can simply take our combined dataframe and change the State column to a category after it's been assembled:

big_df['State'] = big_df['State'].astype('category')
big_df.memory_usage(deep=True) / 1e6
Index     28.108936
City     231.140089
State      3.517580
month    222.078650
dtype: float64

That gets the State column memory usage back down to our estimate of around 3.5 Mb.

If we haven't got enough memory to do this, we have to force all the single-month dataframe State columns to have identical categories. We can do this by explicitly creating a categorical data type and listing the categories that we want:

from pandas.api.types import CategoricalDtype

# just a list of Python strings
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

state_type = CategoricalDtype(categories=states)
state_type
CategoricalDtype(categories=['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
                  'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME',
                  'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH',
                  'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI',
                  'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI',
                  'WY'],
, ordered=False)

Now when we create a series with this type, it will always have all the states as its categories regardless of which states are actually in the data:

pd.Series(['KS', 'TX', 'VT']).astype(state_type)
0    KS
1    TX
2    VT
dtype: category
Categories (51, object): ['AL', 'AK', 'AZ', 'AR', ..., 'WA', 'WV', 'WI', 'WY']

Notice that the above series has all 51 categories (the 50 states plus DC) even though there are only 3 values in the data.
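
One caveat worth keeping in mind - here's a small sketch using a made-up abbreviation 'XX' - is that any value not in the category list gets silently converted to NaN rather than raising an error, so a typo in the states list would quietly turn real data into missing values:

# 'XX' is a made-up abbreviation that isn't in our states list
pd.Series(['KS', 'XX']).astype(state_type)
# 0     KS
# 1    NaN
# dtype: category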

Using this new type in our input loop results in the categorical data type being preserved after concatenation:

all_dfs = []

for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype = {'State' : state_type})
    all_dfs.append(month_df)
    
big_df = pd.concat(all_dfs)
big_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3513617 entries, 0 to 288929
Data columns (total 3 columns):
 #   Column  Dtype   
---  ------  -----   
 0   City    object  
 1   State   category
 2   month   object  
dtypes: category(1), object(2)
memory usage: 83.8+ MB
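
The same explicit dtype also fixes the chunked-read version of the problem from earlier. Here's a minimal sketch, assuming the state_type we defined above:

all_chunks = []

# read the January file 100,000 rows at a time, forcing the full category list on each chunk
for chunk in pd.read_csv('January_car_accidents.csv', dtype={'State' : state_type}, chunksize=100000):
    all_chunks.append(chunk)

# because every chunk shares identical categories, the dtype survives concatenation
pd.concat(all_chunks)['State'].dtype
# category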

In this case I've just copied an existing Python list of state abbreviations that I found online. In other situations it might be possible to generate the list by reading all the input files one line at a time:

all_states = set()

for filename in glob.glob('*car_accidents.csv'):
    for line in open(filename):
        state = line.split(',')[1]
        all_states.add(state)

# remove the column name
all_states.remove('State')
print(all_states)
{'WY', 'NY', 'MS', 'IL', 'NJ', 'MA', 'DC', 'DE', 'UT', 'WA', 'OK', 'RI', 'CO', 'IN', 'TX', 'AZ', 'WI', 'TN', 'FL', 'IA', 'SC', 'ME', 'ND', 'NC', 'MD', 'SD', 'NE', 'KS', 'PA', 'GA', 'OR', 'MT', 'KY', 'NH', 'ID', 'MN', 'OH', 'VA', 'MI', 'NM', 'CT', 'CA', 'MO', 'AL', 'WV', 'AR', 'NV', 'LA', 'VT'}

Although this takes a while to run, it's very memory-friendly as we only have to store a single line of each input file in memory at a time.
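
Note that splitting each line on commas by hand assumes there are no quoted commas in the City field. If that's a worry, an alternative sketch is to let pandas do the parsing but load only the State column with usecols - this holds one full column in memory at a time, which is still far less than a whole dataframe:

all_states = set()

# usecols restricts parsing to the State column, so we never load the full dataframe
for filename in glob.glob('*car_accidents.csv'):
    states_column = pd.read_csv(filename, usecols=['State'])['State']
    all_states.update(states_column.unique())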

If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!