# Does reducing numerical precision affect real world datasets?

Reducing numerical precision is a way to save memory in pandas, but does it make a difference to the conclusions that we might draw from real world datasets?

- Executive summary
- Slightly longer summary
- Introduction
- Part one: changing data types and measuring differences
- Part two: let's try it on a larger scale
- Part three: looking at the results
- Conclusion

## Executive summary

Switching from 64 bit to 32 bit precision for floating point data in pandas saves memory and has virtually no effect on the data, as long as the values aren't greater than around 10^{38} (the upper limit of the `float32` range).
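
For context, numpy can tell us the largest value a `float32` can represent; anything bigger silently becomes infinite when converted (a quick sketch):

```
import numpy as np

# the largest magnitude a float32 can hold is about 3.4e38
print(np.finfo(np.float32).max)

# values inside that range survive the conversion from float64...
print(np.float32(1e38))

# ...but anything bigger overflows to infinity
print(np.float32(1e39))
```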

## Slightly longer summary

- I wrote code to download ~1,000 real world datasets from Kaggle and look at summary statistics and linear regressions of floating point columns in both 64 and 32 bit precision
- The final analysis involved ~1,000 datasets, ~8,000 data files, ~500,000,000 individual floating point values and ~500,000 linear regressions.
- Changing to 32 bit precision introduced a mean proportional error of around 10^{-8} in the summary statistics, with the worst affected columns having a proportional error of around 10^{-6}
- For linear regressions, the mean proportional change in slope and intercept was on the order of 10^{-6}
- A small number of outliers with near-zero slope (representing uninteresting relationships) had larger proportional errors
- In ~400,000 valid linear regressions I was unable to find any relationships whose significance at the 5% level was changed by moving from 64 to 32 bit precision
- For real world datasets, switching from 64 to 32 bit floating point precision is overwhelmingly unlikely to alter our conclusions about the data, and is an easy way to reduce memory usage.

## Introduction

As you've probably noticed, this is a longish article, so grab a cup of coffee if you're planning to read it all. If you're already familiar with floating point precision and error in pandas you can probably skip part one. Most of the methods stuff is in part two, so if you just want to see the results, skip to part three. Also, just to head off the obvious criticism, I think that most of the conclusions of this article could have been arrived at by pure reasoning, without looking at any data at all. But hopefully some folks will, like me, be more satisfied by empirical data :-)

Obviously, I think that pandas is wonderful :-) But it has a number of quirks, among them a reputation for being memory hungry. There is no shortage of articles online discussing how to fit large dataframes in memory, and there are a number of well established techniques (there's a whole chapter on the subject in the Drawing from Data book, and also an extended discussion in the car accidents dataset video).

One technique that always gets mentioned is storing numbers in a reduced precision data type. This is usually proposed as one of the last things to try after other options have been exhausted. I think there are a couple of reasons for this. One is that messing about with data types is a bit harder to explain than simple things like ignoring unneeded columns. Another is that deliberately throwing away information - which is what reducing precision will do - feels very uncomfortable from a data analysis point of view!

One thing that I've rarely seen in such discussions is a measurement of how much of an effect reduced precision will have in real life datasets. This is surprising, since pandas makes it very easy to measure. So let's go ahead and try it...

## Part one: changing data types and measuring differences

Before we dive in, a quick reminder of the pandas tools we'll use. All of the methods etc. are described in detail in the Drawing from Data book, so we'll limit ourselves here to an overview.

As an example, let's load the planets dataset:

```
import pandas as pd
pd.options.display.max_columns=4
planets = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv')
planets
```

which stores information about planets that have been discovered around other stars. We are interested in the orbital period column, which we'll store in a variable for convenience:

```
op = planets['orbital_period']
op
```

which tells us how long each planet takes to revolve around its host star. Because it's comprised of floating point numbers, it gets the `float64` data type by default. To change the data type, we can just use the `astype` method. Let's try `float32`:

```
op.astype('float32')
```

All we need to know about these two data types is that, as the names suggest, a single `float64` value uses 64 bits of memory, whereas a `float32` uses, you guessed it, 32 bits. So it comes as no great surprise that the memory usage of the `float32` version is very close to half that of the `float64` version:

```
op.memory_usage(), op.astype('float32').memory_usage()
```

As well as a difference in memory usage, we can see from the output above that there are slight differences in the values themselves. This is perhaps easiest to see if we put the two series next to each other:

```
pd.concat([op, op.astype('float32')], axis=1)
```

To figure out how different the two series are, pandas makes it easy to just take the difference:

```
op.astype('float32') - op
```

It's probably more useful to view these as absolute differences, i.e. to ignore the sign:

```
(op.astype('float32') - op).abs()
```

Now we can look at a summary:

```
(op.astype('float32') - op).abs().describe()
```

The general picture here is that the change from `float64` to `float32` has made very little difference to the values. On average, it introduces an absolute error of around 4x10^{-6}, or 0.000004, and the single worst-case value has an absolute error of about 0.0002. These errors are very small relative to the range of the original values:

```
op.describe()
```

We can make this explicit by dividing each absolute difference by the corresponding original value:

```
((op.astype('float32') - op).abs() / op).describe()
```

This calculation is a bit more involved, but easier to interpret. It is telling us that the switch from `float64` to `float32` changed the values on average by about 10^{-8} - a truly minuscule change.

Another way to look at the difference is to compare the summaries of the two series. For any given aggregation we can figure out the proportional change by taking the absolute difference and dividing by the original:

```
import numpy as np
np.abs(op.mean() - op.astype('float32').mean()) / op.mean()
```

So changing the series to `float32` alters the mean by a factor of around 10^{-8}. We can do the same calculation with any other summary statistic - for instance, the standard deviation:

```
np.abs(op.std() - op.astype('float32').std()) / op.std()
```

### Comparing correlations

Another way to measure the effect of reduced precision is to see if it changes the correlation between floating point columns. In our planets dataset we have two other floating point columns: the mass and the distance. Using pandas' built in `corr` method we can calculate the pairwise Pearson correlation of all the floating point columns:

```
planets.select_dtypes('float64').corr()
```

Then do the same with a `float32` version and take the difference:

```
(
    planets.select_dtypes('float64').corr()
    - planets.select_dtypes('float64').astype('float32').corr()
)
```

As we can see, the differences are tiny, and would definitely not alter our conclusions about the relationships in the dataset. For a more sophisticated look at correlations, we could bring in a linear regression:

```
from scipy.stats import linregress

linregress(
    planets.dropna()['orbital_period'],
    planets.dropna()['mass']
)
```

and again take the difference between the 64 and 32 bit versions:

```
l_64 = linregress(
    planets.dropna()['orbital_period'],
    planets.dropna()['mass']
)
l_32 = linregress(
    planets.dropna()['orbital_period'].astype('float32'),
    planets.dropna()['mass'].astype('float32')
)
print(f'slope difference : {l_64[0] - l_32[0]}')
print(f'intercept difference : {l_64[1] - l_32[1]}')
print(f'R value difference : {l_64[2] - l_32[2]}')
```

Once again, the difference is incredibly minor. Given that the original values might be very small, a proportional change might be easier to interpret:

```
print(f'proportional slope difference : {(l_64[0] - l_32[0]) / l_64[0]} ')
print(f'proportional intercept difference : {(l_64[1] - l_32[1]) / l_64[1]} ')
print(f'proportional R value difference : {(l_64[2] - l_32[2]) / l_64[2]} ')
```

## Part two: let's try it on a larger scale

Attentive readers will have noticed that all of the calculations above were done by Python code :-) That means that it should be possible to automate this process and try it across a large number of different datasets. For a convenient source of real world datasets we turn to Kaggle.

The collection of datasets hosted at Kaggle is perfect for our purposes. It contains real world data from a massive range of different fields and sources, mostly stored in files that will be easy to get in to pandas. It also has a convenient API with a Python client library, which will make it easy to download a large collection of different data files.

### Downloading the data

To download our collection of datasets using the Kaggle API we have to go through a couple of steps. First we will connect to the API and authenticate:

```
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
```

Next we can get a list of datasets by using the `dataset_list` method. We will search for CSV files, since they will be the easiest to load into pandas, and use pagination to get a large collection of dataset objects. Here I'm using the `tqdm` package for a progress bar, since this code takes a long time to run. I'm not sure if it's necessary, but I'm also putting in a 1-second delay between requests in an attempt to be a good internet citizen:

```
from tqdm.notebook import trange, tqdm
import time

all_datasets = []
for page in trange(1, 1000):
    # each call returns one page of dataset objects
    all_datasets.extend(api.dataset_list(file_type='csv', page=page))
    time.sleep(1)
```

Now that we have a list of dataset objects, we can get on to the actual downloading. To keep things organized I'm using a separate folder for each dataset (which might end up containing multiple files). Once I've created the folder it's just a case of calling `dataset_download_files` with each dataset reference from the previous step and the new folder name as the destination. I'm also using `unzip=True` to make sure that we end up with extracted files. This will obviously take more disk space than leaving them compressed.

```
import os

for dat in tqdm(all_datasets):
    folder_name = dat.ref.replace('/', '-')
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
        print(dat.ref)
        api.dataset_download_files(str(dat.ref), path=folder_name, unzip=True)
```

```
def get_float_columns(filepath):
    try:
        df = pd.read_csv(filepath, encoding='latin-1').select_dtypes(['float64'])
    except Exception:
        print(f'cannot read {filepath}')
        return None
    # replace infinite values with missing data
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # get rid of columns that are only floats because they contain missing data
    # i.e. they should actually be integers
    keep = []
    for c in df.columns:
        if not all(df[c].dropna().astype(int) == df[c].dropna()):
            keep.append(c)
    if len(keep) == 0:
        return None
    return df[keep]
```

Some of the files will need special arguments to `read_csv` in order to open. Of course, since we have so many files, we can't go and manually figure out the correct arguments for every one. So to open the file we'll just call `read_csv` in a `try` block so that files requiring complex arguments won't crash our program. We are only interested in floating point columns, so we'll use `select_dtypes` to get them.

There's a slight complication in selecting the floating point columns: due to the way that pandas handles missing data, integer columns that contain missing data will end up as `float64` by default. So there's a chunk of code to identify floating point columns that are really integers, and to drop them from the dataset.
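
This upcasting is easy to demonstrate with a toy series (the values here are made up), and it also shows why the integer round-trip check in `get_float_columns` works:

```
import numpy as np
import pandas as pd

# an integer column is upcast to float64 as soon as it contains a NaN
s = pd.Series([1, 2, np.nan])
print(s.dtype)

# the check used above: after dropping NaNs, do the values survive a
# round trip through int unchanged? True means "really an integer column"
print(all(s.dropna().astype(int) == s.dropna()))

# a genuinely fractional column fails the same check
f = pd.Series([1.5, 2.0, np.nan])
print(all(f.dropna().astype(int) == f.dropna()))
```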

We can test out the function by running it on our planets dataset:

```
get_float_columns('planets.csv')
```

As expected, we just get the floating point columns.

### Comparing summary statistics

Now we can move on to the first question, calculating the summary statistics for each column in both 64 bit and 32 bit versions. Most of the work will be done by `describe`. We will add a `_64` and a `_32` suffix to the column names, then concatenate the two descriptions to give a wide summary table with one row per column in the original dataframe:

```
import numpy as np

def summarize_columns(df):
    summary_64 = df.describe().T
    summary_64.columns = [c + '_64' for c in summary_64.columns]
    # the summary of absolute values will be used as a denominator later
    summary_64_abs = df.abs().describe().T
    summary_64_abs.columns = [c + '_64_abs' for c in summary_64_abs.columns]
    summary_32 = df.astype('float32').describe().T
    summary_32.columns = [c + '_32' for c in summary_32.columns]
    result = pd.concat([summary_32, summary_64, summary_64_abs], axis=1)
    return result

planets = get_float_columns('planets.csv')
summarize_columns(planets)
```

There's one slight complication compared to our original planet code. Think about the case where we have a column of mixed negative and positive values. We might end up with a situation where the mean is very close to zero, even though the values themselves are not. This will give us an inflated estimate of the precision error when we divide by the mean. By working with a mean of the absolute values we can avoid this. The same goes for the other summary statistics - median, min, max, etc.
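
To see why this matters, here's a minimal made-up example: a two-value column whose entries are large but whose mean sits very close to zero:

```
import pandas as pd

# large values, but a mean very close to zero (0.05)
col = pd.Series([1000.0, -999.9])

err = abs(col.mean() - float(col.astype('float32').mean()))

# dividing by the near-zero mean wildly inflates the proportional error...
inflated = err / abs(col.mean())

# ...while dividing by the mean of absolute values (~999.95) keeps the
# error in proportion to the size of the underlying numbers
sensible = err / col.abs().mean()

print(inflated, sensible)
```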

This dataframe is a bit awkward to read on a web page, since it's so wide. But the column structure makes it very easy to check, for example, the difference in the means, either in absolute terms:

```
summary = summarize_columns(planets)
(summary['mean_64'] - summary['mean_32']).abs()
```

Or as a proportion of the original:

```
summary = summarize_columns(planets)
(summary['mean_64'] - summary['mean_32']).abs() / summary['mean_64_abs']
```

We can take a similar approach for the linear regressions, with a function that regresses every pair of floating point columns in both precisions:

```
import scipy.stats
import itertools

def summarize_regressions(df):
    df = df.dropna()
    # if fewer than two columns or fewer than ten rows, skip
    if len(df.columns) < 2 or len(df) < 10:
        return None
    df_32 = df.astype('float32')
    results = []
    for x, y in itertools.combinations(df.columns, r=2):
        regression_64 = scipy.stats.linregress(df[x], df[y])
        regression_32 = scipy.stats.linregress(df_32[x], df_32[y])
        results.append(
            (
                x,
                y,
                regression_64[0],  # slope
                regression_32[0],
                regression_64[1],  # intercept
                regression_32[1],
                regression_64[2],  # R value
                regression_32[2],
                regression_64[3],  # P value
                regression_32[3],
            )
        )
    return pd.DataFrame(
        results,
        columns=['x', 'y',
                 'slope_64', 'slope_32',
                 'intercept_64', 'intercept_32',
                 'rvalue_64', 'rvalue_32',
                 'pvalue_64', 'pvalue_32']
    )

summarize_regressions(planets)
```

The output is another summary table, this time showing us the slope, intercept, rvalue and pvalue for each pair of columns in both 64 bit and 32 bit form. As in the summary statistics, this makes it easy to find the differences in the various regression properties. Here's the difference in slope as a proportion of the original slope:

```
summary = summarize_regressions(planets)
(summary['slope_64'] - summary['slope_32']).abs() / summary['slope_64']
```

Now we can run `get_float_columns` and `summarize_columns` on every downloaded file, collecting the results:

```
import glob
from tqdm.notebook import tqdm

dfs = []
for filepath in tqdm(glob.glob('/home/martin/Downloads/kaggle_test/*-*/*.csv')):
    df = get_float_columns(filepath)
    if df is not None:
        summary = summarize_columns(df)
        summary['filepath'] = filepath
        dfs.append(summary)

summary_stats = pd.concat(dfs)
```

At the end of the process we have a big dataframe that contains the summary for all the data files:

```
summary_stats.head()
```

If you want to play with the data, you can download a compressed copy here.

Let's first take a look at the volume of data. We have processed 919 data files:

```
summary_stats['filepath'].nunique()
```

containing just over 8000 floating point columns:

```
len(summary_stats.reset_index().groupby(['index', 'filepath']))
```

and just over half a billion individual floating point values:

```
summary_stats['count_64'].sum()
```

One slight complication is that some of our data files contain very large values that are outside the range of 32 bit floating point numbers. These are easy to spot, because the maximum of the 32 bit version ends up infinite:

```
summary_stats[np.isinf(summary_stats['max_32'])]
```

It's also pretty easy to remove them from the results:

```
# .copy() lets us safely add columns to this subset later
results = summary_stats[~np.isinf(summary_stats['max_32'])].copy()
```

With those columns removed, we can look at the absolute difference in the means:

```
(results['mean_64'] - results['mean_32']).abs()
```

but since we have so many values, we need to summarize them. We can do this with summary statistics:

```
(results['mean_64'] - results['mean_32']).abs().describe()
```

Notice that we have a pretty wild distribution here. The mean absolute error is about three million, but the median (50% in the above table) is 0.0000004! This is why we need to normalize these by viewing them as a proportion of the original mean of absolute values:
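
This gap between mean and median is exactly what a handful of huge outliers produces; toy numbers (made up here to mimic the table above) reproduce the shape:

```
import pandas as pd

# 999 tiny errors plus one enormous outlier
errors = pd.Series([4e-7] * 999 + [3e9])

# the single outlier drags the mean up into the millions...
print(errors.mean())

# ...while the median still reflects the typical value
print(errors.median())
```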

```
prop_errors = (results['mean_64'] - results['mean_32']).abs() / results['mean_64_abs']
prop_errors.describe()
```

This is easier to look at. The mean and median proportional error are both on the order of 10^{-8}, and the worst affected data column has its mean altered by around 10^{-6}. In practical terms, these are incredibly minor effects. We can easily view them as a distribution to see the long tail:

```
import seaborn as sns

sns.displot(
    prop_errors,
    aspect=2
)
```

We can do the same thing for the other aggregations of the original data - let's add the median (50% in our table) and the max. Calculating the proportional error in the minimum is tricky, since for many data columns the minimum value will be zero, leading to an infinite proportional error, so we will skip it.
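
The problem with zero minimums is easy to see with a toy column (values made up here): the proportional error calculation needs a nonzero denominator.

```
import numpy as np
import pandas as pd

# a column whose minimum is exactly zero
col = pd.Series([0.0, 1.5, 2.5])
min_64 = np.float64(col.min())
min_32 = np.float64(col.astype('float32').min())

# err / 0 is infinite, and 0 / 0 is not a number, so neither
# gives a usable proportional error for the minimum
with np.errstate(divide='ignore', invalid='ignore'):
    nonzero_err = np.float64(1e-9) / min_64
    zero_err = np.abs(min_64 - min_32) / min_64

print(nonzero_err, zero_err)
```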

Adding these measurements as additional columns:

```
results['mean prop error'] = (results['mean_64'] - results['mean_32']).abs() / results['mean_64_abs']
results['median prop error'] = (results['50%_64'] - results['50%_32']).abs() / results['50%_64_abs']
results['max prop error'] = (results['max_64'] - results['max_32']).abs() / results['max_64_abs']
```

will allow us to make a summary table:

```
results[['mean prop error', 'median prop error', 'max prop error']].describe()
```

A little bit of rearrangement using `melt` will allow us to look at the distribution of proportional errors for all three summary statistics:

```
import matplotlib.pyplot as plt

g = sns.displot(
    data=results[['mean prop error', 'median prop error', 'max prop error']].melt(),
    x='value',
    col='variable',
)
plt.xlim((0, 2e-7))
```

Next we repeat the process with `summarize_regressions` to collect the regression results for every file:

```
import glob
from tqdm.notebook import tqdm

dfs = []
for filepath in tqdm(glob.glob('/home/martin/Downloads/kaggle_test/*-*/*.csv')):
    df = get_float_columns(filepath)
    if df is not None:
        summary = summarize_regressions(df)
        if summary is not None:
            summary['filepath'] = filepath
            dfs.append(summary)

regressions = pd.concat(dfs)
```

If you want to play with the data, you can download a compressed version here.

The resulting dataframe has the same wide shape as before, but this time each row represents a regression result between two columns from the same data file. We have just over half a million regression results in total:

```
regressions
```

Because the number of pairwise comparisons increases with the square of the number of columns, the results are dominated by a small number of data files with many floating point columns:

```
regressions['filepath'].value_counts()
```

Out of our 919 data files, only 289 had at least two floating point columns. The data file with the most floating point columns contributes nearly two hundred thousand regression results!
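
The quadratic growth is just the number of ways of choosing two columns out of k, which we can sketch with the standard library:

```
from itertools import combinations
from math import comb

# k columns give comb(k, 2) = k * (k - 1) / 2 pairwise regressions
for k in (2, 10, 100, 630):
    print(k, comb(k, 2))

# the pairs themselves, for a toy set of four column names
print(list(combinations(['a', 'b', 'c', 'd'], r=2)))
```

so a hypothetical file with around 630 floating point columns would contribute comb(630, 2) = 198,135 regressions all on its own.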

To get an overview of the difference for each property of the regression result, we can add new columns as we did before and summarize them:

```
regressions['slope error'] = ((regressions['slope_64'] - regressions['slope_32']) / regressions['slope_64']).abs()
regressions['intercept error'] = ((regressions['intercept_64'] - regressions['intercept_32']) / regressions['intercept_64']).abs()
regressions['rvalue error'] = ((regressions['rvalue_64'] - regressions['rvalue_32']) / regressions['rvalue_64']).abs()
regressions['pvalue error'] = ((regressions['pvalue_64'] - regressions['pvalue_32']) / regressions['pvalue_64']).abs()
regressions[['slope error', 'intercept error', 'rvalue error', 'pvalue error']].dropna().describe()
```

Notice that in the `count` row we can see how many of the comparisons gave meaningful (i.e. non-missing) output for the linear regressions on both the 64 and 32 bit versions.

Remember that these are proportional errors, i.e. the proportion by which the values change. Although the mean errors are a couple of orders of magnitude greater than for the summary statistics, they are still very small, and very unlikely to change any conclusions about the data.

We have a very long tailed distribution for all four properties: the mean is several orders of magnitude higher than the median (50%) and the maximum values are several orders of magnitude higher again, suggesting a small number of extreme outliers. One way to show this is with a boxen plot:

```
sns.catplot(
    data=regressions[['slope error', 'intercept error', 'rvalue error', 'pvalue error']].dropna().melt(),
    x='variable',
    y='value',
    kind='boxen',
    aspect=2
)
```

The presence of the outliers makes it impossible to see the distribution of the remaining values. If we plot the slope against the slope error, we can see that these outliers all represent regressions where the original slope was very near zero, suggesting no interesting relationship between the variables:

```
sns.relplot(
    data=regressions,
    x='slope_64',
    y='slope error'
)
plt.xlim(-1e-8, 1e-8)
```

The same is true for the intercept:

```
sns.relplot(
    data=regressions,
    x='intercept_64',
    y='intercept error'
)
plt.xlim(-1e-5, 1e-5)
```

An alternative way of looking at the pvalue error is to ask whether the change in pvalue would result in a difference in whether the relationship was considered significant. Setting aside the problems with interpreting pvalues in this way, we can use pandas to find any regressions where one version of the regression was significant at the 5% level and the other wasn't:

```
((regressions.dropna()['pvalue_64'] < 0.05) != (regressions.dropna()['pvalue_32'] < 0.05)).value_counts()
```

In the ~400,000 out of our ~500,000 comparisons that had no missing data, there were no such cases.

## Conclusion

It looks like in real world datasets, switching from 64 bit to 32 bit precision for floating point numbers has a vanishingly small chance of affecting our conclusions, and we should probably do it without fear when trying to minimize memory usage.

If you've made it this far, you should definitely subscribe to the Drawing from Data newsletter, follow me on Twitter, or buy the Drawing from Data book!