is it time to stop using sentinel values for null / “na” values?

Fri 12 October 2018

This blog post discusses the design and performance implications of using
bitmaps to mark null values instead of sentinel values (or special values like
NaN).

Depending on who you talk to, one controversial aspect of the Apache Arrow
columnar format is the fact that it uses a validity bitmap to mark values as
null or not null. Here’s the basic idea:

  • The number of nulls is a property of an array, generally computed once
    then stored
  • If an array has any null values at all, then a bitmap must be allocated
  • If the value at bit i is 0, then the value is null; otherwise it is not
    null (see the bit-check sketch below)
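As a concrete illustration, here is a minimal sketch of checking a value’s
null-ness against such a bitmap (the function name is mine, not part of the
Arrow API):

```cpp
#include <cstdint>

// Sketch: bit i of the validity bitmap is 1 when value i is non-null
static inline bool IsValid(const uint8_t* validity_bitmap, int64_t i) {
  return (validity_bitmap[i / 8] >> (i % 8)) & 1;
}
```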

There are some nice benefits to the bitmap approach:

  • If an array has no nulls, then the bitmap can be ignored or simply not
    allocated at all
  • Nulls can be propagated in algorithms by using word-wise AND or OR
    operations, processing 32 or 64 values at a time (see the sketch below)
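For example, a sketch of word-wise null propagation for an elementwise binary
operation might look like this (the function and parameter names are
hypothetical):

```cpp
#include <cstdint>

// Sketch: an output value is valid only where both inputs are valid, so
// AND-ing the validity words propagates nulls 64 values at a time
void PropagateNulls(const uint64_t* left_validity,
                    const uint64_t* right_validity, int64_t num_words,
                    uint64_t* out_validity) {
  for (int64_t i = 0; i < num_words; ++i) {
    out_validity[i] = left_validity[i] & right_validity[i];
  }
}
```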

Still, for systems like R which use special values to mark nulls, bitmaps can
be controversial. The argument sometimes centers on the performance
implications of having to check bits rather than comparing a value with
INT32_MIN or checking whether it is NaN. It is also noted that NaN values are
propagated automatically by the CPU in arithmetic operations, so propagating
nulls in the bitmap case will necessarily be slower.

From the perspective of databases and data warehousing, reserving certain
values to mark a null (or NA) is widely considered unacceptable. NaN is valid
data, as is INT32_MIN and other common values used as sentinels.

Most SQL systems store the null-ness of a value using an extra bit or byte.
File formats like Apache Avro and Parquet have a separate null/not-null
encoding.

To show that the performance argument against bitmaps is not persuasive, I
wrote some in-memory performance benchmarks (I would like to do some
experiments with large memory-mapped datasets). To keep things simple, let’s
consider the sum operation, which must aggregate all of the non-null values in
an array. We will also track the non-null count.

We are interested in a few typical cases:

  • Data with no nulls
  • Data with a small percentage of nulls, say 10%
  • Data with a high percentage of nulls, say 50%

To keep track of the results of a sum, I use a simple struct, templated on the
type of the values:
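A minimal sketch of such a struct (the names here are my own, not necessarily
those in the benchmark code):

```cpp
#include <cstdint>

// Sketch: accumulates the running total and the count of non-null values
template <typename T>
struct SumResult {
  T total = 0;
  int64_t valid_count = 0;
};
```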

The naive sum, with no null handling at all, looks like this:
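A sketch along those lines, reusing the hypothetical SumResult struct above:

```cpp
// Sketch: sum every value with no null checking at all
template <typename T>
SumResult<T> SumNoNulls(const T* values, int64_t length) {
  SumResult<T> result;
  for (int64_t i = 0; i < length; ++i) {
    result.total += values[i];
  }
  result.valid_count = length;
  return result;
}
```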

Let’s consider double precision floating point values, where NaN is a common
sentinel value. So a sum function for double is:
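Something along these lines (again a sketch, not the exact benchmark code):

```cpp
#include <cmath>

// Sketch: treat NaN as the null sentinel, skipping it in the sum
SumResult<double> SumWithNaNSentinel(const double* values, int64_t length) {
  SumResult<double> result;
  for (int64_t i = 0; i < length; ++i) {
    if (!std::isnan(values[i])) {
      result.total += values[i];
      ++result.valid_count;
    }
  }
  return result;
}
```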

If we represent nulls with bits, we could write the sum naively like so:
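A naive version might be (assuming the bit layout sketched earlier, where bit
i of the validity bitmap is 1 for non-null values):

```cpp
#include <cstdint>

// Sketch: check the validity bit for every value, one at a time
template <typename T>
SumResult<T> SumWithBitmapNaive(const uint8_t* validity_bitmap,
                                const T* values, int64_t length) {
  SumResult<T> result;
  for (int64_t i = 0; i < length; ++i) {
    if ((validity_bitmap[i / 8] >> (i % 8)) & 1) {
      result.total += values[i];
      ++result.valid_count;
    }
  }
  return result;
}
```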

It’s possible to go much faster using bitmaps with a few optimization
techniques:

  • Eliminate the “if” statements by using floating point operations to
    conditionally add the non-null values
  • “Unroll” part of the for loop to sum 8 values at a time
  • Skip null checking for groups of 8 values with no nulls. This is made
    faster still by generating a pre-populated table of bit popcounts for
    uint8_t values.

Here’s the meat of this new vectorized sum algorithm:
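Here is a minimal sketch of the approach (the real benchmark code differs in
details; this version assumes null slots contain finite values, so multiplying
by the validity bit safely zeroes them out):

```cpp
#include <cstdint>

// Sketch: precomputed popcount table for every possible bitmap byte
static uint8_t kBytePopcounts[256];

static void InitPopcountTable() {
  for (int i = 0; i < 256; ++i) {
    int count = 0;
    for (int b = i; b != 0; b >>= 1) {
      count += b & 1;
    }
    kBytePopcounts[i] = static_cast<uint8_t>(count);
  }
}

template <typename T>
SumResult<T> SumWithBitmapVectorized(const uint8_t* validity_bitmap,
                                     const T* values, int64_t length) {
  SumResult<T> result;
  const int64_t whole_bytes = length / 8;
  for (int64_t i = 0; i < whole_bytes; ++i) {
    const uint8_t valid_byte = validity_bitmap[i];
    const T* chunk = values + i * 8;
    if (valid_byte == 0xFF) {
      // All 8 values are non-null: skip null checking entirely
      result.total += chunk[0] + chunk[1] + chunk[2] + chunk[3] +
                      chunk[4] + chunk[5] + chunk[6] + chunk[7];
    } else {
      // Branch-free: multiply each value by its validity bit (0 or 1).
      // Assumes null slots hold finite values rather than NaN garbage.
      for (int j = 0; j < 8; ++j) {
        result.total += chunk[j] * static_cast<T>((valid_byte >> j) & 1);
      }
    }
    result.valid_count += kBytePopcounts[valid_byte];
  }
  // Handle trailing values not covered by a whole bitmap byte
  for (int64_t i = whole_bytes * 8; i < length; ++i) {
    if ((validity_bitmap[i / 8] >> (i % 8)) & 1) {
      result.total += values[i];
      ++result.valid_count;
    }
  }
  return result;
}
```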

To give a fair comparison with a sum without nulls, I also wrote similar
versions that sum 8 values at a time in the non-nullable case and the NaN
sentinel value case. I also added the same benchmark with int64 values using
INT64_MIN as the sentinel value, in case integer operations perform
differently from floating point operations.

Complete benchmarking code is here. Here are the performance results for the 3
cases where there are 0%, 10%, and 50% nulls, respectively. The size of the
array being summed is 10 million values. The machine is my Xeon E3-1505M Linux
laptop.

It’s definitely possible that I’m doing something suboptimal in my
implementations. I would love to hear from algorithms experts!

Some observations on these benchmark results:

  • The “no nulls” sum functions, both vectorized and not, perform the same in
    all 3 cases, since they do no null checking at all
  • When there are no nulls, there is no performance penalty with the
    optimized bitmap sum
  • When there are 10% nulls, the bitmap sum is about 30% faster than the
    sentinel-based sum. The performance gap widens when the percentage of
    nulls is higher.
  • The vectorized bitmap sum is faster when there are 50% nulls than when
    there are 10% nulls. I am not sure why this is; I would speculate that
    because on average many of the terms in the sum are 0, the processor is
    able to avoid some floating point operations. It might be possible to
    optimize further by reducing the number of possible FP operations using a
    “binary tree” sum (see the sketch after this list).
  • Vectorization / batching brings little benefit to the sentinel-based
    algorithm
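To illustrate the “binary tree” idea, here is a speculative sketch (not
benchmarked) of a pairwise reduction of one 8-value chunk:

```cpp
// Sketch: pairwise ("binary tree") sum of 8 values; the same 7 additions,
// but the dependency chain shrinks from 7 steps to 3, giving the CPU more
// opportunity to execute FP additions in parallel
template <typename T>
inline T PairwiseSum8(const T* chunk) {
  T s0 = chunk[0] + chunk[1];
  T s1 = chunk[2] + chunk[3];
  T s2 = chunk[4] + chunk[5];
  T s3 = chunk[6] + chunk[7];
  return (s0 + s1) + (s2 + s3);
}
```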

Using a bit- or byte-based null mask representation is a requirement for a
common memory format to be accepted by database systems as well as data science
libraries, so using sentinel values at all is not really an option in a project
like Apache Arrow. But, in many cases, using bitmaps also allows faster
algorithms to be developed than are possible with the sentinel-value-based null
representation used in pandas and R.

I’m not a perfect programmer, so I’ll be interested to see what other kinds of
optimizations others can cook up. Take a look at the benchmark code!