Working with string methods in pandas
Once we get used to pandas' ability to vectorize code, we want to use it all the time. Doing this with string columns requires a bit more ceremony.
One of the most useful features of pandas
is its ability to take a bit of code that looks like it operates on a single bit of data, and have it operate on a whole bunch of different values.
To see an example, let's load a dataset that shows fuel efficiency for different car models:
import pandas as pd
pd.options.display.max_rows = 10
df = pd.read_csv(
"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
)
df
Each row is a car, and for each car we have miles per gallon (mpg), plus a bunch of information about the engine and car, and finally the name. Say we want to convert the horsepower column into Kilowatts (kW). A quick trip to Google shows that one horsepower equals 0.7457 kW, so we might just iterate over the values and convert them like this:
kilowatts = []
for horsepower in df['horsepower']:
kilowatts.append(horsepower * 0.7457)
kilowatts[:10]
However, the magic of pandas
allows us to do this instead:
df['horsepower'] * 0.7457
We just write the expression as if we were trying to convert a single number, and pandas
takes care of applying it to the whole series. This is called vectorization, and it's generally faster and more convenient than writing for
loops.
However, we often run into problems when we try to use the same technique with text data. Let's get a list of the manufacturers for each car, which we can do just by getting the word before the first space in the name field.
First we'll use a loop:
manufacturers = []
for model in df['name']:
manufacturers.append(model.split(' ')[0])
manufacturers[:10]
So far so good. Now, can we use the same trick as we did earlier and write the expression as if we're calculating it for a single value?
df['name'].split(' ')[0]
The magic doesn't work. To put it in simple terms, a pandas Series
object "knows" about the *
operator, but it doesn't know anything about split()
. If we think about it, we can see why this must be the case. A Series
object has to be able to hold either numbers or text, so it doesn't make sense for it to have methods like split()
that only work on text.
One way round this is to switch to apply()
, which lets us run arbitrary code on values in a series. We can write a function that takes a single name and returns the manufacturer:
def get_manufacturer(name):
return name.split(' ')[0]
Then apply it to the name column:
df['name'].apply(get_manufacturer)
But a better way is to use the series str
attribute. This attribute has nearly all of the string methods that we might be used to, plus a bunch more functionality specific to pandas. Let's start with a simpler example - say we want to change all the names to upper case. Here it is with apply()
:
df['name'].apply(str.upper)
And here is is using the str
attribute:
df['name'].str.upper()
Going back to our original problem, we can split the name into a list of strings like this:
df['name'].str.split(' ')
Now there's one final complication before we can get just the first word - the result of the expression above is itself a series. So we have to access its str
attribute again before we can use square brackets to get just the first word:
df['name'].str.split(' ').str[0]
Now we have a series which contains the data we want. At this point we could store it in a new column:
df['manufacturer'] = df['name'].str.split(' ').str[0]
df
Or do any other type of series processing:
df['name'].str.split(' ').str[0].value_counts()
This technique is also very useful for filtering. Say we want to find just the cars made by Ford. We can use the startswith
method of the str
attribute to get a series of boolean values:
df['name'].str.startswith('ford')
Which we can then use as a filter mask:
df[df['name'].str.startswith('ford')]
There are many useful methods hiding in the str
attribute. Let's find all the cars with names longer than 30 characters, which we can do with the len()
method:
long_name_cars = df[df['name'].str.len() > 30]
long_name_cars
Here's a common annoyance when making charts with long axis labels:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,8))
long_name_cars.set_index('name')['horsepower'].plot(kind='barh')
None
The long labels take up a bunch of space on the left hand side of the chart. We can fix this by using the .wrap()
method to split the names over multiple lines:
plt.figure(figsize=(8,8))
long_name_cars['display_name'] = long_name_cars['name'].str.wrap(15)
long_name_cars.set_index('display_name')['horsepower'].plot(kind='barh')
None
One last example: a bunch of the car names end with the string "(sw)". Let's find them:
df[df['name'].str.endswith('(sw)')]
The "sw" stands for "Station Wagon". Let's replace the abbreviation with the full name:
df['name'] = df['name'].str.replace("\(sw\)", "(Station Wagon)")
df[40:50]
Notice that we have to escape the parentheses in the "(sw)" string, as the replace()
method works on regular expressions.
There are many other useful string processing methods in the pandas str
attribute. The documentation page has a bunch of examples, and a method summary at the end. So before you set out to process strings either by writing a loop, or by using apply()
, check to see if there's a method there that will do what you want!