Skip to content

COUNT and COUNT DISTINCT produce incorrect results for dictionary arrays with null values #16339

@kosiew

Description

@kosiew

Describe the bug

DataFusion's COUNT and COUNT DISTINCT aggregate functions produce incorrect results when operating on dictionary arrays that contain null values. The functions appear to be counting dictionary keys rather than properly handling the null values referenced by those keys.

Two specific issues have been identified:

COUNT with dictionary arrays: When a dictionary array has keys that reference null values, COUNT incorrectly counts those null references as valid values instead of ignoring them.

COUNT DISTINCT with all-null dictionary arrays: When a dictionary array contains only keys that reference null values, COUNT DISTINCT should return 0 but may return incorrect results.

To Reproduce

Issue 1 - COUNT with mixed null/non-null dictionary values:

use arrow::array::{DictionaryArray, Int32Array, StringArray};
use arrow::datatypes::Int32Type;
use std::sync::Arc;

// Create dictionary with values ["a", null, "c"]
let values = StringArray::from(vec![Some("a"), None, Some("c")]);
// Keys [0, 1, 2, 0, 1] reference: "a", null, "c", "a", null
let keys = Int32Array::from(vec![0, 1, 2, 0, 1]);
let dict_array = DictionaryArray::<Int32Type>::try_new(keys,
Arc::new(values))?;

// COUNT should return 3 (only non-null values: "a", "c", "a")
// But may incorrectly count the null references 

Issue 2 - COUNT DISTINCT with all-null dictionary values:

// Create dictionary where all keys reference null values
let dict_values = StringArray::from(vec![None, Some("abc")]);
let dict_indices = Int32Array::from(vec![0; 5]); // All keys point to
null
let dict_array = DictionaryArray::<Int32Type>::try_new(dict_indices,
Arc::new(dict_values))?;

// COUNT DISTINCT should return 0 since all referenced values are null   

Expected behavior

COUNT: Should only count non-null values in dictionary arrays by properly dereferencing dictionary keys to their actual values and ignoring null references.

COUNT DISTINCT: Should return 0 when all dictionary keys reference null values, and should properly count only distinct non-null values when there's a mix of null and non-null references.

Both functions should handle dictionary arrays by:

Dereferencing dictionary keys to their actual values Applying null-checking logic to the dereferenced values, not the keys Following the same null-handling semantics as regular arrays

Additional context

This issue affects the correctness of aggregate queries on dictionary-encoded columns, which are commonly used in analytical workloads for memory efficiency. The bug could lead to incorrect query results in production environments.

The issue is present in the core aggregation logic for both regular COUNT and COUNT DISTINCT operations when processing DictionaryArray inputs. The functions need to properly handle the indirection layer that dictionary encoding introduces.

🔑 Root cause: Arrow’s current DictionaryArray::is_null implementation only inspects the index bitmap (i.e. it checks which keys are null) but does not consult the values buffer to see whether a key points at a null slot. As a result, any code (including DataFusion’s aggregates) that relies on is_null will mis-classify null references as non-null. Arrow mentioned that they took the approach for performance reasons.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions