Finding invalid values in numerical columns
Real life datasets, especially ones that have been manually curated, often contain mixed data types. The first step to fixing invalid values is to get an idea of their distribution.
- Introduction
- Which values can't be converted?
- Dealing with many invalid values
- Extending to other data types
- Conclusion
Introduction
In an ideal world, every time we load a data file into pandas we will end up with a nice neat dataframe with well defined columns. In the real world, however, we often end up with columns that contain values in a mixture of different formats. This makes it hard to carry out our analyses. In this article we'll look at a simple way to quickly identify invalid values so that we can fix them.
As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 2 for a discussion of data types and chapter 16 for a discussion of the memory implications.
If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.
The MMA dataset
For our example, we'll look at a real life data file taken from the Metropolitan Museum of Art in New York. The museum has published a dataset describing around half a million different artworks. The data are available as a CSV file in this GitHub repo.
This is a good dataset to explore when looking for messy formatting, for a number of reasons:
- it's large, so even rare errors are likely to be well represented.
- it's curated by humans, rather than being automatically generated, so it's likely to contain typos and other human errors
- furthermore, it's large enough to have multiple human contributors, so the data will likely reflect inconsistencies between different ideas about data entry
- it's very heterogeneous - the individual artworks are of very different types. We have data for paintings, sculptures, photography, ceramics, fabrics, etc. Any time we have multiple different types of objects being represented by a single data structure, we have the opportunity for messy data.
Because the data are hosted on GitHub, we should be able to read them directly and take a look at the dataframe:
import pandas as pd
pd.options.display.max_columns=4
met = pd.read_csv('https://github.com/metmuseum/openaccess/raw/master/MetObjects.csv')
Notice that we immediately get a hint that something is weird in the data file when pandas complains about columns with mixed types. Let's ignore this for now and bravely press on with a look at the dataframe:
met
All looks OK here; we have a dataframe with just under half a million rows. There are too many columns to fit comfortably on the screen, so let's just check their names:
met.columns
Most of these seem pretty self-evident. We can see here a common pattern when representing multiple different types of objects in a single data structure: a large number of columns, many of which are mostly missing data. For example, all but 2000 values are missing from the River column:
met['River'].describe()
Presumably there are a small number of objects that are associated with a particular river, and because we want to store that information we need a River column even though most objects don't have one.
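To get a quick sense of how widespread this pattern of sparse columns is, we could count the non-missing values in every column and look at the sparsest ones. This is just a quick sketch, assuming the met dataframe loaded above:
# number of non-missing values per column, sparsest columns first
met.count().sort_values().head(10)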
One of the most well-populated columns is accession year, with more than 99% of the values present:
met['AccessionYear'].count() / met['AccessionYear'].size
This makes sense. We would expect an art museum to have pretty thorough records of when each object was acquired.
Let's remove any rows without an accession year:
met = met.dropna(subset=['AccessionYear'])
then take a look at the series:
met['AccessionYear']
This looks pretty straightforward, so let's try to find all the objects that were acquired before 1950:
met[met['AccessionYear'] < 1950]
Now we have run into a problem. We were assuming that the accession year column was numeric based on the values that we saw, but it is actually of type object, as we can easily check:
met['AccessionYear'].dtype
We might think to fix this by explicitly changing the data type:
met['AccessionYear'].astype('int')
But now we encounter the reason that this column ended up with an object data type rather than int or float. At least one of the values in the column is written as a complete date (2005-02-15) rather than just a year, so we get a ValueError when trying to convert it to an integer.
Which values can't be converted?
So here's where it gets annoying. Because Python crashes at the first value that it can't turn into an integer, we have no idea how many non-integer values are in the column. If it's just one out of half a million, then we can probably just delete that row and try again. But if we have lots of non-integer values, then that would be throwing away a lot of our data.
Furthermore, if there are lots of non-integer values, are they all written in the same format? If it turns out that we have many values written in year-month-day format, just like the one we already found, then fixing them might be a simple string processing job. But if we have lots of different formats, then the task will be much trickier.
Because we have nearly half a million values to check, obviously it's impossible to look at them all manually. So we will have to write a function to help us. The trick is to use try/except to prevent pandas from simply crashing out at the first invalid value. So here's the code:
import numpy as np
def check_int(value):
    try:
        int(value)
        return np.NaN
    except ValueError:
        return value
Our function will take a single value and attempt to convert it to an integer inside a try block. If the int call is successful - i.e. the value is a valid integer - then the function will return np.NaN to represent missing data. If the call to int causes a ValueError - i.e. the value that we passed in isn't a valid integer - then it will trigger the except block and return the value itself.
To see how this works, let's try it on a few different inputs. Passing in a valid integer results in missing data:
check_int('42'), check_int(1970.0)
Whereas passing in anything that can't be converted to an integer results in the original value:
check_int('banana'), check_int('1970.0'), check_int('2005-12-15')
Notice a subtlety here in the handling of strings vs. floats: in Python, calling int on a floating point number truncates it:
int(3.1415)
whereas calling int on a string containing a floating point number causes an error:
int('3.1415')
So how does this function help us find our invalid values? If we use apply to call it on each value of the accession year column:
met['AccessionYear'].apply(check_int)
We get back a series of the same length where values that were successfully converted to integers are missing, and invalid values are unchanged. So we can easily find all of the invalid values by dropping missing data:
met['AccessionYear'].apply(check_int).dropna()
Here we have our answer: there are two values in this column that have been written as complete dates.
Now that we have the invalid values we can figure out how to fix them. In this case, because both invalid values have the same date format, we can do this by splitting on - and taking the first element of the resulting list:
fixed_values = met['AccessionYear'].apply(check_int).dropna().str.split('-').str.get(0)
fixed_values
See chapter 3 of the Drawing from Data book for more about string methods and the str accessor.
Having fixed these two values, we can use the fixed values to update the original series:
met['AccessionYear'].update(fixed_values)
Which now allows us to convert the data type to integer:
met['AccessionYear'].astype(int)
and carry out our filter:
met[
    met['AccessionYear'].astype(int) < 1950
]
About 200,000 objects were acquired prior to 1950.
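Since we'll want to compare accession years again later, we could optionally make the conversion permanent at this point so we don't need to repeat the astype call. This is a small optional step, assuming the fixed values have already been applied with update as above:
# store the converted column back in the dataframe
met['AccessionYear'] = met['AccessionYear'].astype(int)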
Dealing with many invalid values
Let's try this technique on a more complicated column. Say that we want to find all objects that were acquired after the death of the artist. There's a column called Artist End Date that has values for about half of the rows:
met['Artist End Date'].dropna()
From a glance at the values it looks like this is a series of year integers, but looking at the dtype tells us that it has ended up as type object. Sure enough, when we try to convert it to an integer type:
met['Artist End Date'].dropna().astype(int)
we can see that it has some invalid values. To keep things simple let's remove any rows with a missing artist end date:
met = met.dropna(subset=['Artist End Date'])
Then use our function to find the invalid values as before:
met['Artist End Date'].apply(check_int).dropna()
Here we have a much more complicated situation than the accession date. Where previously we had only two invalid values to worry about, now we have nearly 70,000. However, since these values are in a pandas series, it's easy to summarize them using value_counts:
met['Artist End Date'].apply(check_int).dropna().value_counts()
The pattern is a bit tricky to see here. Let's just take the single most common invalid value:
met['Artist End Date'].apply(check_int).dropna().value_counts().index[0]
It turns out to be the string '1635 |1661 '. This looks like structured data taken from a table, with two values accidentally ending up in a single field. Let's get the first row that has a non-integer value in the artist end date column (row 34):
met.loc[34]
Now we can see what's going on - take a look at the artist roles and dates:
met.loc[34][['Artist Role', 'Artist Display Name', 'Artist End Date']]
This vase has two listed artists - the designer (Paulding Farnham) and the manufacturer (Tiffany & Co.). The designer died in 1927, but the manufacturer is still around. These data nicely illustrate two common features of messy curated data - multiple pieces of data being forced into a single field, and the use of arbitrary values (in this case the date 9999) to represent special cases.
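As a rough check on how common that second pattern is, we could count how many of the remaining artist end dates contain the 9999 placeholder. This is a quick sketch; it assumes the placeholder always appears as the literal digits 9999:
# count values containing the 9999 placeholder
met['Artist End Date'].astype(str).str.contains('9999').sum()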
How are we going to deal with these data? For this example, let's assume that for each object the most important contributor is listed first. So we will take our non-integer end dates:
met['Artist End Date'].apply(check_int).dropna()
and fix them by splitting on the pipe character and taking the first element of the resulting string:
fixed_values = met['Artist End Date'].apply(check_int).dropna().str.split('\|').str.get(0)
fixed_values
Just as before, we will update the series using these fixed values:
met['Artist End Date'].update(fixed_values)
Now we can try to change the series to an int:
met['Artist End Date'].astype(int)
But we get the same error as before! Just like with the accession date column, the artist end date column has some dates written in year-month-day format. We can find them with one more trip through our check_int function:
met['Artist End Date'].apply(check_int).dropna()
Then fix them like we did before:
fixed_values = met['Artist End Date'].apply(check_int).dropna().str.split('-').str.get(0)
fixed_values
update the series once more:
met['Artist End Date'].update(fixed_values)
and turn it into an integer:
met['Artist End Date'] = met['Artist End Date'].astype(int)
Now we can finally do our query: which objects were acquired after the death of the artist?
met[
    met['AccessionYear'].astype(int) > met['Artist End Date']
]
Extending to other data types
The same trick works for other data types. For example, to find values that can't be converted to floating point numbers, we can write a check_float function that uses float in place of int:
def check_float(value):
    try:
        float(value)
        return np.NaN
    except ValueError:
        return value
We can then see how it works, remembering that NaN in the output indicates values that were converted without any problem:
test = pd.Series([42, 3.1415, 'banana'])
test.apply(check_float)
With some pandas data types we have to be a bit careful. In particular, we need to remember that the nullable integer data type Int64 (with an uppercase I) is more strict than the regular int64. The regular int64 will happily convert a floating point number to an integer by truncating it:
pd.Series([3.1415]).astype('int64')
and will work with string representations of integers:
pd.Series(['42']).astype('int64')
In contrast, Int64 will raise an error when trying to convert either an integer represented as a string:
pd.Series(['42']).astype('Int64')
or when trying to convert a floating point number:
pd.Series([3.1415]).astype('Int64')
So if we want to write a function that will identify values that can't be changed to Int64, we have to construct a series inside the try block and force the Int64 dtype:
def check_nullable_int(value):
    try:
        pd.Series([value], dtype='Int64')
        return np.NaN
    except:
        return value
Testing it shows the behaviour:
pd.Series([42, '42', 3.1415, '3.1415', 'banana']).apply(check_nullable_int)
Unsurprisingly, this approach is much slower:
%timeit met['Artist End Date'].apply(check_int)
%timeit met['Artist End Date'].apply(check_nullable_int)
So for simple cases we should probably use the int and float functions.
Conclusion
This type of messy data, where we have multiple different representations in the same column, occurs very frequently in real world manually curated datasets. Writing a small helper function like our check_int can make it much easier to find the values that are causing problems, and figure out how to fix them. We do need to watch out for slight differences between data types, such as the stricter behaviour of the nullable Int64 type.
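If we find ourselves checking several different target types, one possible generalization (just a sketch, not part of the workflow above) is a helper that takes the conversion function as an argument:
def check_convert(value, converter=int):
    # return NaN if the value converts cleanly, otherwise return the original value
    try:
        converter(value)
        return np.NaN
    except (ValueError, TypeError):
        return value

# extra keyword arguments to apply are passed through to the function
pd.Series([42, '42', 3.1415, 'banana']).apply(check_convert, converter=float)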
If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!