Why does my memory usage explode when concatenating dataframes?
Storing string columns as categories can result in massive memory savings when working with large dataframes. However, those savings can surprisingly disappear when we start concatenating dataframes.
tl;dr: concatenating categorical Series with non-identical categories gives an object dtype in the result, with severe memory implications.
Introduction
In a library as large and featureful as pandas, there are bound to be surprising behaviours. In this article we will take a look at a memory issue that I've run into multiple times in real-life datasets - an unexpected increase in memory usage when concatenating multiple dataframes.
As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 2 for a discussion of data types and chapter 16 for a discussion of the memory implications.
If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.
Saving memory on a single data file
As an example, we'll be using a slightly modified version of this car accident dataset:
https://www.kaggle.com/sobhanmoosavi/us-accidents
This dataset contains records of car accidents in the USA, and for the purposes of this article we have one file per month. Our goal is to combine all of these to make one large dataframe that we can use for analysis. Let's load up the first file to see the structure:
import pandas as pd
# load just the data for January
df = pd.read_csv('January_car_accidents.csv')
df
This is pretty straightforward - we have columns for City, State and Month, and around 300,000 rows. Each row represents a single car accident. In the original dataset, of course, there are many more columns. For this article we are interested in the State column.
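If you don't have the Kaggle file to hand, a small made-up dataframe with the same kind of columns is enough to follow the memory checks below; the city and state values here are invented placeholders rather than real accident records:
import numpy as np
import pandas as pd

# hypothetical stand-in for the January file: 300,000 rows of made-up values
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'City': rng.choice(['Springfield', 'Riverside', 'Franklin'], size=300_000),
    'State': rng.choice(['CA', 'TX', 'NY', 'FL', 'OH'], size=300_000),
    'Month': 'January',
})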
Let's assume that because this is a large dataset, we're worried about running out of memory. We know that for columns with a small number of distinct values, storing them as a categorical data type can save a lot of memory (see chapter 16 in the Drawing from Data book for a full discussion of memory issues). Let's check how much memory the State column uses when stored as object. Because memory_usage returns the result in bytes, we'll divide by 1e6 to get an answer in megabytes:
df['State'].memory_usage(deep=True) / 1e6
Now the same column as a category:
df['State'].astype('category').memory_usage(deep=True) / 1e6
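The reason for the saving is that a categorical column stores each distinct string only once, plus one small integer code per row, instead of a separate Python string object for every row. A minimal sketch with made-up values (not the accident data) shows the two pieces:
s = pd.Series(['CA', 'TX', 'CA', 'NY', 'CA']).astype('category')
# one small integer code per row...
s.cat.codes
# ...and each distinct string stored just once
s.cat.categories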
As expected, a considerable saving. Once we've figured this out, it's probably a good habit to specify a categorical data type when reading in the file:
df = pd.read_csv('January_car_accidents.csv', dtype={'State' : 'category'})
df.memory_usage(deep=True) / 1e6
Let's repeat this check for each of the monthly data files:
import glob
for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype={'State': 'category'})
    print(filename, month_df['State'].memory_usage(deep=True) / 1e6)
Obviously the exact numbers differ, but each of our single-month data files takes around 0.3 Mb for the State column. Our estimate for the memory usage of the State column in our combined dataframe, therefore, is 0.3 * 12: around 3.6 Mb.
The best way to combine these data files is to make a list of dataframes then concatenate them at the end:
all_dfs = []
for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype={'State': 'category'})
    all_dfs.append(month_df)
big_df = pd.concat(all_dfs)
big_df
All looks fine here: for our combined dataset we have around 3.5 million rows. But take a look at the memory usage:
big_df['State'].memory_usage(deep=True) / 1e6
Far from taking up less than 4 Mb as expected, our State column is taking up more than 200 Mb. What has happened? The clue is in the data type:
big_df.info()
We have lost our categorical type; the State column in the big dataframe has been turned back into object. The reason: not every state is present in every single-month data file:
big_df.groupby('month')['State'].nunique()
Most months have records for 49 states, but a few have records for only 48. We can find the missing states quite easily with a mixture of pandas and Python's built-in set type:
april = big_df[big_df['month'] == 'April']
august = big_df[big_df['month'] == 'August']
# which state has records for August but not April
set(august['State']) - set(april['State'])
The missing states mean that the categories are slightly different between the individual month dataframes, and when we ask pandas to concatenate categorical datasets with different categories, it sets the data type back to object. We can see this behaviour with a simple example:
cat_series_one = pd.Series(['apple', 'banana', 'orange']).astype('category')
cat_series_two = pd.Series(['apple', 'banana', 'banana']).astype('category')
combined_series = pd.concat([cat_series_one, cat_series_two])
# look at the dtype for the combined series
combined_series
We will encounter this problem any time we use pd.concat on categorical columns where the categories are not exactly identical. One such scenario is the one we've just looked at: when we need to combine multiple data files.
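For contrast, when the categories do match exactly, the categorical dtype survives the concatenation, as a quick check with the same made-up fruit values shows:
same_one = pd.Series(['apple', 'banana', 'orange']).astype('category')
same_two = pd.Series(['orange', 'apple', 'banana']).astype('category')
# identical category sets, so the result keeps its category dtype
pd.concat([same_one, same_two]).dtype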
Concatenation and chunking
Another is when we use chunking to read a data file one piece at a time. The canonical way to read a large CSV file in chunks is to append the chunks to a list then use pd.concat on the list:
all_dfs = []
# process 100,000 rows at a time
for chunk in pd.read_csv('January_car_accidents.csv', dtype={'State' : 'category'}, chunksize=100000):
    # possibly do some processing on chunk
    all_dfs.append(chunk)
pd.concat(all_dfs).info()
Here we run into the same issue. Although we have specified a categorical data type for the State column for each chunk, when we combine them the State column gets turned back into an object, since each chunk has a slightly different set of unique states in it.
This version of the problem is especially harsh, as the most common reason for chunking input files in the first place is to reduce peak memory usage! So it's an especially nasty surprise when the memory usage of the State column turns out to be so much higher than our estimate.
Fixing the problem
We can get round this problem in a number of ways. If we have enough memory, we can simply take our combined dataframe and change the State column to a category after it's been assembled:
big_df['State'] = big_df['State'].astype('category')
big_df.memory_usage(deep=True) / 1e6
That gets the State column memory usage back down to our estimate of around 3.6 Mb.
If we haven't got enough memory to do this, we have to force all the single-month dataframe State columns to have identical categories. We can do this by explicitly creating a categorical data type and listing the categories that we want:
from pandas.api.types import CategoricalDtype
# just a list of Python strings
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
state_type = CategoricalDtype(categories=states)
state_type
Now when we create a series with this type, it will always have all the states as its categories regardless of which states are actually in the data:
pd.Series(['KS', 'TX', 'VT']).astype(state_type)
Notice that the above series has all 51 states in its categories even though there are only 3 in the data.
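If we want to confirm that, a quick look at the categories attached to the series we just built is enough:
small_series = pd.Series(['KS', 'TX', 'VT']).astype(state_type)
# all 51 categories are attached, even though only 3 values appear in the data
len(small_series.cat.categories)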
Using this new type in our input loop results in the categorical data type being preserved after concatenation:
all_dfs = []
for filename in glob.glob('*car_accidents.csv'):
    month_df = pd.read_csv(filename, dtype={'State': state_type})
    all_dfs.append(month_df)
big_df = pd.concat(all_dfs)
big_df.info()
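As a final check, we can repeat the earlier measurement on the combined State column; and since the shared dtype gives every chunk identical categories, the same trick also fixes the chunked reader from the previous section (a quick sketch reusing the January file from above):
# memory usage of the combined State column, in megabytes
big_df['State'].memory_usage(deep=True) / 1e6

# the shared dtype also keeps chunked reads categorical after concatenation
chunks = pd.read_csv('January_car_accidents.csv', dtype={'State': state_type}, chunksize=100000)
pd.concat(chunks).info()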
In this case I've just copied an existing Python list of state abbreviations that I found online. In other situations it might be possible to generate the list by reading all the input files one line at a time:
all_states = set()
for filename in glob.glob('*car_accidents.csv'):
    for line in open(filename):
        state = line.split(',')[1]
        all_states.add(state)
# remove the column name
all_states.remove('State')
print(all_states)
Although this takes a while to run, it's very memory-friendly as we only have to store a single row of each input file in memory at a time.
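Once we have the set of observed states, we can feed it straight into a CategoricalDtype, just as we did with the hand-written list above (a one-line sketch, reusing the all_states set built by the loop):
# build the shared dtype from the states we actually observed in the files
state_type = CategoricalDtype(categories=sorted(all_states))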
If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!