Skip to content

Allow using dictionary arrays as filters#12382

Merged
alamb merged 3 commits intoapache:mainfrom
adriangb:dict-null
Sep 10, 2024
Merged

Allow using dictionary arrays as filters#12382
alamb merged 3 commits intoapache:mainfrom
adriangb:dict-null

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Sep 8, 2024

Fixes #12380

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Sep 8, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that a DictionaryArray with Boolean values is likely to only ever perform worse than a native BooleanArray

The rationale is that each element of a DictionaryArray takes at least 1 byte (for the index into the dictionary when indexed by Int8/UInt8) where each element in a BooleanArray consumes a single bit

So while I think this code is fine and we could merge it, the feature seems suspicious to me 🤔

@adriangb
Copy link
Contributor Author

adriangb commented Sep 8, 2024

I agree, I was thinking the same thing. The context is a kernel that operates on arrays, the output of which may or may not be used in a filter. I guess the kernel could do the conversion but I figured if we can support it here Filter could do the conversion internally (if it makes sense, maybe it already is?)

@alamb
Copy link
Contributor

alamb commented Sep 8, 2024

I agree, I was thinking the same thing. The context is a kernel that operates on arrays, the output of which may or may not be used in a filter. I guess the kernel could do the conversion but I figured if we can support it here Filter could do the conversion internally (if it makes sense, maybe it already is?)

I think this conversion is typically done in coercion -- like in https://docs.rs/datafusion/latest/datafusion/logical_expr/type_coercion/binary/fn.comparison_coercion.html

(which basically means DataFusion would apply a cast internally to cast Dictionary(*, Boolean) --> Boolean

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to allow Dictionary as a filter (even though I think subsequent processing will effectively cast it to a BooleanArray)

@Dandandan
Copy link
Contributor

I think that functions like json_contains shouldn't create a DictionaryArray but rather a BooleanArray directly (so no conversion is needed).

@adriangb
Copy link
Contributor Author

adriangb commented Sep 8, 2024

Yeah I agree, that's probably the better way to do things but does it hurt to also add this support? I guess it could be a performance foot gun to allow this, although I don't know how bad it would be. The alternative of not implementing this is the foot gun of: you implement a kernel, write some tests for it but not this specific case and then discovering via a user reporting a bug that DataFusion has this limitation and have to go patch your kernel to protect against this case.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

Yeah I agree, that's probably the better way to do things but does it hurt to also add this support? I guess it could be a performance foot gun to allow this, although I don't know how bad it would be.

Yeah, I think the potential issue would be that your code creates these DictionaryArrays of Boolean that is quite a bit slower if you had created BooleanArray directly.

If DataFusion errors, then you know to go try and fix it, but if it doesn't error, then you may never know

That being said, I think this particular change is fine and that we could merge it for the reasons discussed above.

@alamb alamb merged commit 8d2b240 into apache:main Sep 10, 2024
@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

Thanks again @adriangb and @Dandandan

@samuelcolvin
Copy link
Contributor

See #12511 which is closely related

@adriangb adriangb deleted the dict-null branch September 17, 2024 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dictionary columns (with a Boolean values) can't be used directly as filters

4 participants