Some data types do not store nulls "physically" in a validity bitmap on the parent ArrayData; instead, nulls are represented through nulls in the child data. Union arrays are one example.
(sidenote: Dictionary arrays could be considered here as well, but they are a bit of a mixed bag: there is a top-level null count through nulls in the indices, but the dictionary itself can also contain nulls. So nulls can be encoded in two different ways)
The format specification has a "null_count" (in the IPC FieldNode in the RecordBatch message, and in the C Data Interface), and in those cases it refers to the "physical" null count. The C++ implementation follows this: the base Array::null_count() (implemented by ArrayData::GetNullCount()) looks at the validity buffer (typically the first buffer) and counts the unset bits, or directly returns 0 if there is no validity buffer.
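To make the "physical" rule concrete, here is a minimal sketch of the counting logic described above. It is not the Arrow C++ code: ToyArrayData and PhysicalNullCount are hypothetical names for illustration, and the bitmap is assumed to be packed least-significant-bit first with 1 meaning valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an array's validity information.
// This is NOT arrow::ArrayData; it only models the rule described above.
struct ToyArrayData {
  int64_t length = 0;
  bool has_validity = false;      // false models "no validity buffer"
  std::vector<uint8_t> validity;  // assumed LSB-first, 1 = valid, 0 = null
};

// Physical null count: number of unset validity bits, or 0 without a bitmap.
int64_t PhysicalNullCount(const ToyArrayData& data) {
  if (!data.has_validity) return 0;
  // The bitmap must cover every slot of the array.
  assert(static_cast<int64_t>(data.validity.size()) * 8 >= data.length);
  int64_t nulls = 0;
  for (int64_t i = 0; i < data.length; ++i) {
    const uint8_t byte = data.validity[static_cast<std::size_t>(i / 8)];
    const bool valid = ((byte >> (i % 8)) & 1) != 0;
    if (!valid) ++nulls;
  }
  return nulls;
}
```

An array type that has no validity buffer at all always reports 0 under this rule, regardless of what its children contain.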
However, in practice you often want to know if there are actual "logical" nulls (not considering those leads to bugs, for example #34315).
@felipecrv @westonpace and I had some discussion about this on Zulip (https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20count.20of.20a.20UnionArray/near/336538844), and I think our current idea would be:
- Add a GetLogicalNullCount to complement the existing Array::null_count() / ArrayData::GetNullCount() (changing null_count() itself might be too much of a breaking change? And would also create an inconsistency with where this is used in the specs)
- Change Array::IsNull(i) to consider logical nulls instead of just physical nulls
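
The difference between the two counts can be sketched with a toy sparse union. This is only an illustration under stated assumptions, not the proposed implementation: ToyChild, ToySparseUnion, IsLogicalNull and LogicalNullCount are hypothetical names, and the type id of a slot is assumed to equal the index of its child.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model of a sparse union; these are NOT the Arrow C++ classes.
struct ToyChild {
  std::vector<bool> valid;  // per-slot validity of this child (true = non-null)
};

struct ToySparseUnion {
  // Assumption for this sketch: the type id is directly the child index.
  std::vector<int8_t> type_ids;
  std::vector<ToyChild> children;  // each child is as long as the union
};

// Physical view: the union has no validity bitmap, so there is nothing to count.
int64_t PhysicalNullCount(const ToySparseUnion&) { return 0; }

// Logical view: slot i is null if the child selected by type_ids[i] is null at i.
bool IsLogicalNull(const ToySparseUnion& u, int64_t i) {
  const std::size_t slot = static_cast<std::size_t>(i);
  const std::size_t child_index = static_cast<std::size_t>(u.type_ids[slot]);
  assert(child_index < u.children.size());
  return !u.children[child_index].valid[slot];
}

// Logical null count: number of slots whose selected child value is null.
int64_t LogicalNullCount(const ToySparseUnion& u) {
  int64_t nulls = 0;
  const int64_t length = static_cast<int64_t>(u.type_ids.size());
  for (int64_t i = 0; i < length; ++i) {
    if (IsLogicalNull(u, i)) ++nulls;
  }
  return nulls;
}
```

In this model the physical count is always 0 while the logical count depends on the children, which is the kind of mismatch the two bullet points above are meant to address.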