-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
Describe the bug
DataFusion's COUNT and COUNT DISTINCT aggregate functions produce incorrect results when operating on dictionary arrays that contain null values. The functions appear to be counting dictionary keys rather than properly handling the null values referenced by those keys.
Two specific issues have been identified:
COUNT with dictionary arrays: When a dictionary array has keys that reference null values, COUNT incorrectly counts those null references as valid values instead of ignoring them.
COUNT DISTINCT with all-null dictionary arrays: When a dictionary array contains only keys that reference null values, COUNT DISTINCT should return 0 but may return incorrect results.
To Reproduce
Issue 1 - COUNT with mixed null/non-null dictionary values:
use arrow::array::{DictionaryArray, Int32Array, StringArray};
use arrow::datatypes::Int32Type;
use std::sync::Arc;
// Create dictionary with values ["a", null, "c"]
let values = StringArray::from(vec![Some("a"), None, Some("c")]);
// Keys [0, 1, 2, 0, 1] reference: "a", null, "c", "a", null
let keys = Int32Array::from(vec![0, 1, 2, 0, 1]);
let dict_array = DictionaryArray::<Int32Type>::try_new(keys,
Arc::new(values))?;
// COUNT should return 3 (only non-null values: "a", "c", "a")
// But may incorrectly count the null references Issue 2 - COUNT DISTINCT with all-null dictionary values:
// Create dictionary where all keys reference null values
let dict_values = StringArray::from(vec![None, Some("abc")]);
let dict_indices = Int32Array::from(vec![0; 5]); // All keys point to
null
let dict_array = DictionaryArray::<Int32Type>::try_new(dict_indices,
Arc::new(dict_values))?;
// COUNT DISTINCT should return 0 since all referenced values are null Expected behavior
COUNT: Should only count non-null values in dictionary arrays by properly dereferencing dictionary keys to their actual values and ignoring null references.
COUNT DISTINCT: Should return 0 when all dictionary keys reference null values, and should properly count only distinct non-null values when there's a mix of null and non-null references.
Both functions should handle dictionary arrays by:
Dereferencing dictionary keys to their actual values Applying null-checking logic to the dereferenced values, not the keys Following the same null-handling semantics as regular arrays
Additional context
This issue affects the correctness of aggregate queries on dictionary-encoded columns, which are commonly used in analytical workloads for memory efficiency. The bug could lead to incorrect query results in production environments.
The issue is present in the core aggregation logic for both regular COUNT and COUNT DISTINCT operations when processing DictionaryArray inputs. The functions need to properly handle the indirection layer that dictionary encoding introduces.
🔑 Root cause: Arrow’s current DictionaryArray::is_null implementation only inspects the index bitmap (i.e. it checks which keys are null) but does not consult the values buffer to see whether a key points at a null slot. As a result, any code (including DataFusion’s aggregates) that relies on is_null will mis-classify null references as non-null. Arrow mentioned that they took the approach for performance reasons.