Introduction

In an ideal world, every time we loaded a data file into pandas we would end up with a nice, neat dataframe with well-defined columns. In the real world, however, we often end up with columns that contain values in a mixture of different formats, which makes it hard to carry out our analyses. In this article we'll look at a simple way to quickly identify invalid values so that we can fix them.

As always, to keep this article to a reasonable size we won't be going into too much detail on the methods involved. All of the methods we'll use here are covered in detail in the Drawing from Data book. In particular, check out chapter 2 for a discussion of data types and chapter 16 for a discussion of the memory implications.

If you're interested in more articles and tips on data exploration with Python, you should subscribe to the Drawing from Data newsletter or follow me on Twitter.

The MMA dataset

For our example, we'll look at a real life data file taken from the Metropolitan Museum of Art in New York. The museum has published a dataset describing around half a million different artworks. The data are available as a CSV file in this GitHub repo.

This is a good dataset to explore when looking for messy formatting, for a number of reasons:

  • it's large, so even rare errors are likely to be well represented.
  • it's curated by humans, rather than being automatically generated, so it's likely to contain typos and other human errors
  • furthermore, it's large enough to have multiple human contributors, so the data will likely reflect inconsistencies between different ideas about data entry
  • it's very heterogeneous: the individual artworks are of very different types. We have data for paintings, sculptures, photographs, ceramics, fabrics, etc. Any time we have multiple different types of objects being represented by a single data structure, we have the opportunity for messy data.

Because the data are hosted on GitHub, we should be able to read them directly and take a look at the dataframe:

import pandas as pd
pd.options.display.max_columns=4
met = pd.read_csv('https://github.com/metmuseum/openaccess/raw/master/MetObjects.csv')
/home/martin/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (5,7,10,11,12,13,14,15,22,34,35,36,37,38,39,40,41,42,43,44,46) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

Notice that we immediately get a hint that something is weird in the data file when pandas complains about columns with mixed types. Let's ignore this for now and bravely press on with a look at the dataframe:

met
Object Number Is Highlight ... Tags AAT URL Tags Wikidata URL
0 1979.486.1 False ... NaN NaN
1 1980.264.5 False ... NaN NaN
2 67.265.9 False ... NaN NaN
3 67.265.10 False ... NaN NaN
4 67.265.11 False ... NaN NaN
... ... ... ... ... ...
474521 07.225.14b False ... http://vocab.getty.edu/page/aat/300054534|http... https://www.wikidata.org/wiki/Q333|https://www...
474522 69.292.12 False ... http://vocab.getty.edu/page/aat/300250049|http... https://www.wikidata.org/wiki/Q2092297|https:/...
474523 See attached list False ... NaN NaN
474524 See attached list - chairs False ... NaN NaN
474525 61.100.3c False ... http://vocab.getty.edu/page/aat/300266506 https://www.wikidata.org/wiki/Q5113

474526 rows × 54 columns

All looks OK here; we have a dataframe with just under half a million rows. There are too many columns to fit comfortably on the screen, so let's just check their names:

met.columns
Index(['Object Number', 'Is Highlight', 'Is Timeline Work', 'Is Public Domain',
       'Object ID', 'Gallery Number', 'Department', 'AccessionYear',
       'Object Name', 'Title', 'Culture', 'Period', 'Dynasty', 'Reign',
       'Portfolio', 'Constiuent ID', 'Artist Role', 'Artist Prefix',
       'Artist Display Name', 'Artist Display Bio', 'Artist Suffix',
       'Artist Alpha Sort', 'Artist Nationality', 'Artist Begin Date',
       'Artist End Date', 'Artist Gender', 'Artist ULAN URL',
       'Artist Wikidata URL', 'Object Date', 'Object Begin Date',
       'Object End Date', 'Medium', 'Dimensions', 'Credit Line',
       'Geography Type', 'City', 'State', 'County', 'Country', 'Region',
       'Subregion', 'Locale', 'Locus', 'Excavation', 'River', 'Classification',
       'Rights and Reproduction', 'Link Resource', 'Object Wikidata URL',
       'Metadata Date', 'Repository', 'Tags', 'Tags AAT URL',
       'Tags Wikidata URL'],
      dtype='object')

Most of these seem pretty self-evident. We can see here a common pattern when representing multiple different types of objects in a single data structure: a large number of columns, many of which are mostly missing data. For example, all but around 2,000 values are missing from the River column:

met['River'].describe()
count                  2097
unique                  229
top       Upper Sepik River
freq                    362
Name: River, dtype: object

Presumably a small number of objects are associated with a particular river, and because we want to store that information we need a River column, even though it's empty for most objects.

One of the most well-populated columns is accession year, with more than 99% of the values present:

met['AccessionYear'].count() / met['AccessionYear'].size
0.9916421860972845

This makes sense. We would expect an art museum to have pretty thorough records of when each object was acquired.
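As a quick aside, we can get the same proportion directly as the mean of a boolean mask, which some people find more readable:

met['AccessionYear'].notna().mean()
0.9916421860972845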

Let's remove any rows without an accession year:

met = met.dropna(subset=['AccessionYear'])

then take a look at the series:

met['AccessionYear']
0         1979.0
1         1980.0
2         1967.0
3         1967.0
4         1967.0
           ...  
474519    1964.0
474520    1978.0
474521    1907.0
474522    1969.0
474525    1961.0
Name: AccessionYear, Length: 470560, dtype: object

This looks pretty straightforward, so let's try to find all the objects that were acquired before 1950:

met[met['AccessionYear'] < 1950]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-ac1cd7060217> in <module>
----> 1 met[met['AccessionYear'] < 1950]

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
     63         other = item_from_zerodim(other)
     64 
---> 65         return method(self, other)
     66 
     67     return new_method

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arraylike.py in __lt__(self, other)
     35     @unpack_zerodim_and_defer("__lt__")
     36     def __lt__(self, other):
---> 37         return self._cmp_method(other, operator.lt)
     38 
     39     @unpack_zerodim_and_defer("__le__")

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/series.py in _cmp_method(self, other, op)
   4937         rvalues = extract_array(other, extract_numpy=True)
   4938 
-> 4939         res_values = ops.comparison_op(lvalues, rvalues, op)
   4940 
   4941         return self._construct_result(res_values, name=res_name)

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in comparison_op(left, right, op)
    241 
    242     elif is_object_dtype(lvalues.dtype):
--> 243         res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
    244 
    245     else:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in comp_method_OBJECT_ARRAY(op, x, y)
     53         result = libops.vec_compare(x.ravel(), y.ravel(), op)
     54     else:
---> 55         result = libops.scalar_compare(x.ravel(), y, op)
     56     return result.reshape(x.shape)
     57 

pandas/_libs/ops.pyx in pandas._libs.ops.scalar_compare()

TypeError: '<' not supported between instances of 'str' and 'int'

Now we have run into a problem. We were assuming that the accession year column was numeric based on the values that we saw, but it is actually of type object, as we can easily check:

met['AccessionYear'].dtype
dtype('O')

We might think to fix this by explicitly changing the data type:

met['AccessionYear'].astype('int')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-d07a2f803482> in <module>
----> 1 met['AccessionYear'].astype('int')

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1072         # work around NumPy brokenness, #1987
   1073         if np.issubdtype(dtype.type, np.integer):
-> 1074             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   1075 
   1076         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '2005-02-15'

But now we encounter the reason that this column ended up with an object data type rather than int or float. At least one of the values in the column is written as a complete date, 2005-02-15, rather than just a year, so we get a ValueError when trying to convert it to an integer.
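We can see how this happens with a tiny example (just an illustration, not part of the dataset): a single date string mixed in with numeric years is enough to push the whole series to the object dtype:

pd.Series([1979.0, 1980.0, '2005-02-15']).dtype
dtype('O')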

Which values can't be converted?

So here's where it gets annoying. Because the conversion fails at the first value that can't be turned into an integer, we have no idea how many non-integer values are in the column. If it's just one out of half a million, then we can probably just delete that row and try again. But if we have lots of non-integer values, then that would be throwing away a lot of our data.

Furthermore, if there are lots of non-integer values, are they all written in the same format? If it turns out that we have many values written in year-month-day format, just like the one we already found, then fixing them might be a simple string processing job. But if we have lots of different formats, then the task will be much trickier.

Because we have nearly half a million values to check, obviously it's impossible to look at them all manually. So we will have to write a function to help us. The trick is to use try/except to prevent pandas from simply crashing out at the first invalid value. So here's the code:

import numpy as np

def check_int(value):
    try:
        int(value)
        return np.NaN
    except ValueError:
        return value

Our function will take a single value and attempt to convert it to an integer inside a try block. If the int call is successful - i.e. the value is a valid integer - then the function will return np.NaN to represent missing data. If the call to int causes a ValueError - i.e. the value that we passed in isn't a valid integer - then it will trigger the except block and return the value itself.

To see how this works, let's try it on a few different inputs. Passing in a valid integer results in missing data:

check_int('42'), check_int(1970.0)
(nan, nan)

Whereas passing in anything that can't be converted to an integer results in the original value:

check_int('banana'), check_int('1970.0'), check_int('2005-12-15')
('banana', '1970.0', '2005-12-15')

Notice a subtlety here in the handling of strings vs. floats: in Python calling int on a floating point number truncates it:

int(3.1415)
3

whereas calling int on a string containing a floating point number causes an error:

int('3.1415')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-388b05e53588> in <module>
----> 1 int('3.1415')

ValueError: invalid literal for int() with base 10: '3.1415'
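As an aside, if we wanted strings like '1970.0' to count as valid integers too, we could write a more lenient variant (a hypothetical helper that we won't use here) that converts via float first:

def check_int_lenient(value):
    try:
        # going through float first means strings like '1970.0' are accepted
        int(float(value))
        return np.NaN
    except ValueError:
        return value

check_int_lenient('1970.0'), check_int_lenient('2005-12-15')
(nan, '2005-12-15')

For this dataset we'll stick with the stricter check_int.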

So how does this function help us find our invalid values? If we use apply to call it on each value of the accession year column:

met['AccessionYear'].apply(check_int)
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
474519    NaN
474520    NaN
474521    NaN
474522    NaN
474525    NaN
Name: AccessionYear, Length: 470560, dtype: object

We get back a series of the same length where values that were successfully converted to integers are missing, and invalid values are left unchanged. So we can easily find all of the invalid values by dropping missing data:

met['AccessionYear'].apply(check_int).dropna()
179190    2005-02-15
390850    2020-03-23
Name: AccessionYear, dtype: object

Here we have our answer: there are two values in this column that have been written as complete dates.

Now that we have the invalid values we can figure out how to fix them. In this case, because both invalid values have the same date format we can do this by splitting on - and taking the first element of the resulting list:

fixed_values = met['AccessionYear'].apply(check_int).dropna().str.split('-').str.get(0)
fixed_values
179190    2005
390850    2020
Name: AccessionYear, dtype: object

See chapter 3 of the Drawing from Data book for more about string methods and the str accessor.
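Since these two values are genuine dates, an alternative (a sketch, assuming we're happy for pandas to infer the date format) would be to parse them with pd.to_datetime and pull out the year:

pd.to_datetime(met['AccessionYear'].apply(check_int).dropna()).dt.year

which gives the same two years, 2005 and 2020.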

Having fixed these two values, we can use them to update the original series:

met['AccessionYear'].update(fixed_values)
/home/martin/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3418: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)
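The warning is just pandas being cautious about whether met here is a view or a copy; the update has still gone through, as the next step shows. If we wanted to avoid the warning, one option (an assumption about this particular case, rather than something we need) would have been to take an explicit copy when we dropped the missing rows earlier:

met = met.dropna(subset=['AccessionYear']).copy()  # explicit copy, so later updates are unambiguous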

Which now allows us to convert the data type to integer:

met['AccessionYear'].astype(int)
0         1979
1         1980
2         1967
3         1967
4         1967
          ... 
474519    1964
474520    1978
474521    1907
474522    1969
474525    1961
Name: AccessionYear, Length: 470560, dtype: int64

and carry out our filter:

met[
    met['AccessionYear'].astype(int) < 1950
]
Object Number Is Highlight ... Tags AAT URL Tags Wikidata URL
14 16.74.49 False ... NaN NaN
15 16.74.27 False ... NaN NaN
16 16.74.28 False ... NaN NaN
17 16.74.29 False ... NaN NaN
18 16.74.30 False ... NaN NaN
... ... ... ... ... ...
474498 91.16.59 False ... NaN NaN
474499 93.13.6 False ... NaN NaN
474500 93.13.92 False ... http://vocab.getty.edu/page/aat/300025943|http... https://www.wikidata.org/wiki/Q467|https://www...
474504 13.225.13–.14, .16–.28 False ... http://vocab.getty.edu/page/aat/300025928|http... https://www.wikidata.org/wiki/Q8441|https://ww...
474521 07.225.14b False ... http://vocab.getty.edu/page/aat/300054534|http... https://www.wikidata.org/wiki/Q333|https://www...

202888 rows × 54 columns

About 200,000 objects were acquired prior to 1950.
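Note that we've been converting on the fly; since we'll want the integer version again later, we could just as well write the converted values back to the column (an optional step, which depending on how met was created may trigger the same SettingWithCopyWarning we saw above):

met['AccessionYear'] = met['AccessionYear'].astype(int)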

Dealing with many invalid values

Let's try this technique on a more complicated column. Say that we want to find all objects that were acquired after the death of the artist. There's a column called Artist End Date that has values for about half of the rows:

met['Artist End Date'].dropna()
0         1869      
1         1844      
9         1917      
10        1844      
11        1869      
             ...    
474492    2047      
474503    1925      
474512    1900      
474519    1776      
474522    1839      
Name: Artist End Date, Length: 234232, dtype: object

From a glance at the values it looks like this is a series of year integers, but looking at the dtype tells us that it has ended up as type object. Sure enough, when we try to convert it to an integer type:

met['Artist End Date'].dropna().astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-77ad2746910f> in <module>
----> 1 met['Artist End Date'].dropna().astype(int)

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1072         # work around NumPy brokenness, #1987
   1073         if np.issubdtype(dtype.type, np.integer):
-> 1074             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   1075 
   1076         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '1927      |9999      '

we can see that it has some invalid values. To keep things simple let's remove any rows with a missing artist end date:

met = met.dropna(subset=['Artist End Date'])

Then use our function to find the invalid values as before:

met['Artist End Date'].apply(check_int).dropna()
34                              1927      |9999      
111                             1933      |1902      
112                             1933      |1902      
203                             1933      |1902      
204                             1933      |1902      
                             ...                     
474468                          1840      |1875      
474469                          1840      |1875      
474470                          1831      |1840      
474472    1890      |1831      |1842      |1864      
474473                          1900      |1842      
Name: Artist End Date, Length: 69630, dtype: object

Here we have a much more complicated situation than the accession year. Where previously we had only two invalid values to worry about, now we have nearly 70,000. However, since these values are in a pandas series, it's easy to summarize them using value_counts:

met['Artist End Date'].apply(check_int).dropna().value_counts()
1635      |1661                                           879
1682      |1716                                           589
1897      |1889                                           520
9999      |9999                                           426
1870      |1848      |1848                                379
                                                         ... 
1815      |1808      |1850      |1787                       1
1641      |1713                                             1
1780      |1699      |1766      |1695      |1789            1
1868      |1879                                             1
1917      |1660                                             1
Name: Artist End Date, Length: 24693, dtype: int64

The pattern is a bit tricky to see here. Let's just take the single most common invalid value:

met['Artist End Date'].apply(check_int).dropna().value_counts().index[0]
'1635      |1661      '

It turns out to be the string '1635      |1661      '. This looks like structured data: two separate values have been packed into a single field, separated by a pipe character and padded with spaces. Let's get the first row that has a non-integer value in the artist end date column (row 34):

met.loc[34]
Object Number                                                        04.1a–c
Is Highlight                                                            True
Is Timeline Work                                                        True
Is Public Domain                                                       False
Object ID                                                                 35
Gallery Number                                                           706
Department                                                 The American Wing
AccessionYear                                                         1904.0
Object Name                                                             Vase
Title                                                         The Adams Vase
Culture                                                             American
Period                                                                   NaN
Dynasty                                                                  NaN
Reign                                                                    NaN
Portfolio                                                                NaN
Constiuent ID                                                      108316253
Artist Role                                            Designer|Manufacturer
Artist Prefix                                    Designed by|Manufactured by
Artist Display Name                           Paulding Farnham|Tiffany & Co.
Artist Display Bio                                    1859–1927|1837–present
Artist Suffix                                                            NaN
Artist Alpha Sort                            Farnham, Paulding|Tiffany & Co.
Artist Nationality                                                       NaN
Artist Begin Date                                      1859      |1837      
Artist End Date                                        1927      |9999      
Artist Gender                                                              |
Artist ULAN URL            http://vocab.getty.edu/page/ulan/500336597|htt...
Artist Wikidata URL                                                        |
Object Date                                                          1893–95
Object Begin Date                                                       1893
Object End Date                                                         1895
Medium                     Gold, amethysts, spessartites, tourmalines, fr...
Dimensions                 Overall: 19 7/16 x 13 x 9 1/4 in. (49.4 x 33 x...
Credit Line                                    Gift of Edward D. Adams, 1904
Geography Type                                                       Made in
City                                                                New York
State                                                                    NaN
County                                                                   NaN
Country                                                        United States
Region                                                                   NaN
Subregion                                                                NaN
Locale                                                                   NaN
Locus                                                                    NaN
Excavation                                                               NaN
River                                                                    NaN
Classification                                                         Metal
Rights and Reproduction                                                  NaN
Link Resource              http://www.metmuseum.org/art/collection/search/35
Object Wikidata URL                  https://www.wikidata.org/wiki/Q83545838
Metadata Date                                                            NaN
Repository                          Metropolitan Museum of Art, New York, NY
Tags                                              Animals|Garlands|Birds|Men
Tags AAT URL               http://vocab.getty.edu/page/aat/300249525|http...
Tags Wikidata URL          https://www.wikidata.org/wiki/Q729|https://www...
Name: 34, dtype: object

Now we can see what's going on - take a look at the artist roles and dates:

met.loc[34][['Artist Role', 'Artist Display Name', 'Artist End Date']]
Artist Role                     Designer|Manufacturer
Artist Display Name    Paulding Farnham|Tiffany & Co.
Artist End Date                 1927      |9999      
Name: 34, dtype: object

This vase has two listed artists - the designer (Paulding Farnham) and the manufacturer (Tiffany & Co.). The designer died in 1927, but the manufacturer is still around. These data nicely illustrate two common features of messy curated data: multiple pieces of data being forced into a single field, and the use of arbitrary values (in this case the date 9999) to represent special cases.
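As an aside, if we wanted to keep every artist's end date rather than just one per object, one approach (a sketch, assuming the pipe-separated fields line up across the artist columns) would be to split on the pipe and explode each field into its own row:

all_end_dates = met['Artist End Date'].str.split('|').explode().str.strip()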

How are we going to deal with these data? For this example, let's assume that for each object the most important contributor is listed first. So we will take our non-integer end dates:

met['Artist End Date'].apply(check_int).dropna()
34                              1927      |9999      
111                             1933      |1902      
112                             1933      |1902      
203                             1933      |1902      
204                             1933      |1902      
                             ...                     
474468                          1840      |1875      
474469                          1840      |1875      
474470                          1831      |1840      
474472    1890      |1831      |1842      |1864      
474473                          1900      |1842      
Name: Artist End Date, Length: 69630, dtype: object

and fix them by splitting on the pipe character and taking the first element of the resulting list:

fixed_values = met['Artist End Date'].apply(check_int).dropna().str.split('\|').str.get(0)
fixed_values
34        1927      
111       1933      
112       1933      
203       1933      
204       1933      
             ...    
474468    1840      
474469    1840      
474470    1831      
474472    1890      
474473    1900      
Name: Artist End Date, Length: 69630, dtype: object

Just as before, we will update the series using these fixed values:

met['Artist End Date'].update(fixed_values)

Now we can try to change the series to an int:

met['Artist End Date'].astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-4702df368e1d> in <module>
----> 1 met['Artist End Date'].astype(int)

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1072         # work around NumPy brokenness, #1987
   1073         if np.issubdtype(dtype.type, np.integer):
-> 1074             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   1075 
   1076         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '2005-08-01'

But we get the same error as before! Just like the accession year column, the artist end date column has some dates written in year-month-day format. We can find them with one more trip through our check_int function:

met['Artist End Date'].apply(check_int).dropna()
65537     2005-08-01
65632     1972-03-23
65645     2004-09-28
65675     2002-06-12
65692     2004-09-28
             ...    
471268    1971-07-26
471269    1988-11-15
471270    1971-07-26
472106    1971-07-26
472186    2007-06-22
Name: Artist End Date, Length: 1775, dtype: object

Then fix them like we did before:

fixed_values = met['Artist End Date'].apply(check_int).dropna().str.split('-').str.get(0)
fixed_values
65537     2005
65632     1972
65645     2004
65675     2002
65692     2004
          ... 
471268    1971
471269    1988
471270    1971
472106    1971
472186    2007
Name: Artist End Date, Length: 1775, dtype: object

update the series once more:

met['Artist End Date'].update(fixed_values)

and turn it into an integer:

met['Artist End Date'] = met['Artist End Date'].astype(int)

Now we can finally do our query: which objects were acquired after the death of the artist?

met[
    met['AccessionYear'].astype(int) > met['Artist End Date']
]
Object Number Is Highlight ... Tags AAT URL Tags Wikidata URL
0 1979.486.1 False ... NaN NaN
1 1980.264.5 False ... NaN NaN
9 1979.486.3 False ... NaN NaN
10 1979.486.2 False ... NaN NaN
11 1979.486.7 False ... NaN NaN
... ... ... ... ... ...
474473 69.524.28 False ... http://vocab.getty.edu/page/aat/300025943 https://www.wikidata.org/wiki/Q467
474503 TR.124.2020 False ... NaN NaN
474512 2002.233.25 False ... http://vocab.getty.edu/page/aat/300164595 https://www.wikidata.org/wiki/Q335261
474519 64.101.433a, b False ... http://vocab.getty.edu/page/aat/300250047|http... https://www.wikidata.org/wiki/Q3736439|https:/...
474522 69.292.12 False ... http://vocab.getty.edu/page/aat/300250049|http... https://www.wikidata.org/wiki/Q2092297|https:/...

183125 rows × 54 columns

Extending to other data types

We can use the same kind of helper function to find values that can't be converted to other data types. All we have to do is change the conversion code inside the function. So, for example, to find values that can't be converted to floats, we would write this:

def check_float(value):
    try:
        float(value)
        return np.NaN
    except ValueError:
        return value

Then we can see how it works, remembering that NaN in the output indicates a value that was converted without any problem:

test = pd.Series([42, 3.1415, 'banana'])
test.apply(check_float)
0       NaN
1       NaN
2    banana
dtype: object

With some pandas data types we have to be a bit careful. In particular, we need to remember that the nullable integer data type Int64 (with an uppercase I) is more strict than the regular int64. For example, int64 will happily convert a floating point number to an integer by truncating it:

pd.Series([3.1415]).astype('int64')
0    3
dtype: int64

and will work with string representations of integers:

pd.Series(['42']).astype('int64')
0    42
dtype: int64

In contrast, Int64 will raise an error when trying to convert either an integer represented as a string:

pd.Series(['42']).astype('Int64')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-122-5bf16a32e0d4> in <module>
----> 1 pd.Series(['42']).astype('Int64')

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1017     # dispatch on extension dtype if needed
   1018     if is_extension_array_dtype(dtype):
-> 1019         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
   1020 
   1021     if not isinstance(dtype, np.dtype):

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    361         cls, scalars, *, dtype=None, copy: bool = False
    362     ) -> "IntegerArray":
--> 363         return integer_array(scalars, dtype=dtype, copy=copy)
    364 
    365     @classmethod

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    141     TypeError if incompatible types
    142     """
--> 143     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    144     return IntegerArray(values, mask)
    145 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    225             "mixed-integer-float",
    226         ]:
--> 227             raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    228 
    229     elif is_bool_dtype(values) and is_integer_dtype(dtype):

TypeError: object cannot be converted to an IntegerDtype

or when trying to convert a floating point number:

pd.Series([3.1415]).astype('Int64')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in safe_cast(values, dtype, copy)
    154     try:
--> 155         return values.astype(dtype, casting="safe", copy=copy)
    156     except TypeError as err:

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-123-3ee8009491ea> in <module>
----> 1 pd.Series([3.1415]).astype('Int64')

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1017     # dispatch on extension dtype if needed
   1018     if is_extension_array_dtype(dtype):
-> 1019         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
   1020 
   1021     if not isinstance(dtype, np.dtype):

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    361         cls, scalars, *, dtype=None, copy: bool = False
    362     ) -> "IntegerArray":
--> 363         return integer_array(scalars, dtype=dtype, copy=copy)
    364 
    365     @classmethod

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    141     TypeError if incompatible types
    142     """
--> 143     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    144     return IntegerArray(values, mask)
    145 

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    258         values = safe_cast(values, dtype, copy=False)
    259     else:
--> 260         values = safe_cast(values, dtype, copy=False)
    261 
    262     return values, mask

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/pandas/core/arrays/integer.py in safe_cast(values, dtype, copy)
    160             return casted
    161 
--> 162         raise TypeError(
    163             f"cannot safely cast non-equivalent {values.dtype} to {np.dtype(dtype)}"
    164         ) from err

TypeError: cannot safely cast non-equivalent float64 to int64

So if we want to write a function that will identify values that can't be changed to Int64, we have to construct a series inside the try block and force the Int64 dtype:

def check_nullable_int(value):
    try:
        # build a one-element series and force the nullable Int64 dtype
        pd.Series([value], dtype='Int64')
        return np.NaN
    except (TypeError, ValueError):
        return value

Testing it shows the behaviour:

pd.Series([42, '42', 3.1415, '3.1415', 'banana']).apply(check_nullable_int)
0         NaN
1          42
2    3.141500
3      3.1415
4      banana
dtype: object

Unsurprisingly, this approach is much slower:

%timeit met['Artist End Date'].apply(check_int)
%timeit met['Artist End Date'].apply(check_nullable_int)
45.1 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
17.7 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So for simple cases we should probably use the int and float functions.
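It's also worth knowing about a built-in shortcut for plain numeric checks: pd.to_numeric with errors='coerce' turns anything it can't parse into NaN, so the resulting missing values act as a mask for the invalid originals. A quick sketch on a toy series (note that this checks for numbers in general rather than integers specifically):

test = pd.Series(['1979', '2005-02-15', '1980.0'])
test[pd.to_numeric(test, errors='coerce').isna()]
1    2005-02-15
dtype: object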

Conclusion

This type of messy data, where we have multiple different representations in the same column, occurs very frequently in real-world, manually curated datasets. Writing a small helper function like our check_int can make it much easier to find the values that are causing problems and figure out how to fix them. We just need to watch out for the subtle differences between the data types involved.
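If you end up writing several of these checkers, one possible refinement (a sketch, not something we needed above) is to factor out the pattern into a helper that builds a checker from any conversion function:

def make_checker(convert):
    # return NaN for values the converter accepts,
    # and the original value for anything it rejects
    def check(value):
        try:
            convert(value)
            return np.NaN
        except (ValueError, TypeError):
            return value
    return check

check_float = make_checker(float)
check_int = make_checker(int)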

If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!