SQL logic tests for Run-End Encoded (REE) #16715
rich-t-kid-datadog wants to merge 3 commits into apache:main
TODO: Add Hash Joins
# LOWER function tests
query T
SELECT LOWER(name) FROM ree_test_two_columns WHERE name = 'Alice' LIMIT 1;
For these tests it could be nice to have a way to verify the output type is also Run End Encoded (as opposed to DataFusion implicitly casting to Utf8)
This makes sense. I tried looking through DuckDB's sqllogictest documentation, but there didn't seem to be any clean way to do this. To work around it, I generated a temporary table from the result of the query, TABLE (SELECT SUBSTR(name, 1, 3) AS name_prefix FROM ree_test_two_columns LIMIT 1), ran DESCRIBE on it, and validated the schema.
For example:
DESCRIBE TABLE (SELECT SUBSTR(name, 1, 3) AS name_prefix FROM ree_test_two_columns LIMIT 1);
----
name_prefix RunEndEncoded(Int32, Utf8) YES
Can use arrow_typeof(), for example:
datafusion/datafusion/sqllogictest/test_files/order.slt
Lines 196 to 200 in 602475f
gabotechs left a comment
Looking good! How about adding some tests that stress REE with more edge cases? For example, we could have tests for REE input containing NULLs, or for an input table that contains no duplicates.
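To make these two edge cases concrete, here is a minimal pure-Python sketch of run-end encoding. It is only an illustration of the layout (a `run_ends` list of cumulative end positions plus a `values` list), not DataFusion's or Arrow's implementation; the helper names `run_end_encode` and `run_end_decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def run_end_encode(column: List[Optional[str]]) -> Tuple[List[int], List[Optional[str]]]:
    """Collapse consecutive equal entries into (run_ends, values)."""
    run_ends: List[int] = []
    values: List[Optional[str]] = []
    for index, item in enumerate(column):
        if values and values[-1] == item:
            # Same value as the current run (None == None, so NULLs form runs too):
            # extend the run by moving its end forward.
            run_ends[-1] = index + 1
        else:
            # A new value starts a new run ending just after this position.
            values.append(item)
            run_ends.append(index + 1)
    return run_ends, values


def run_end_decode(run_ends: List[int], values: List[Optional[str]]) -> List[Optional[str]]:
    """Expand (run_ends, values) back into the flat column."""
    column: List[Optional[str]] = []
    start = 0
    for end, value in zip(run_ends, values):
        column.extend([value] * (end - start))
        start = end
    return column


# Edge case 1: NULLs in the input. Adjacent NULLs become a single run.
with_nulls = run_end_encode(["Alice", "Alice", None, None, "Bob"])
# -> ([2, 4, 5], ["Alice", None, "Bob"])

# Edge case 2: no duplicates. Every row is its own run, so nothing is compressed.
no_duplicates = run_end_encode(["a", "b", "c"])
# -> ([1, 2, 3], ["a", "b", "c"])
```

In the no-duplicates case the encoded form is as long as the input, which is exactly the kind of degenerate input worth covering in the SLT file.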
TBD: add the
Currently all tests are commented out so that CI can pass. As features are added to REE, the corresponding tests will be uncommented.
Add tests for NULL values and no-duplicate scenarios, plus DESCRIBE statements to validate REE type preservation through string operations.
Jefffrey left a comment
I think for this PR we should uncomment all the tests and expect them to be query failures, with comments on top with the expected results; that way as REE support is added in we will know to update these tests as they'll fail. As the PR currently is, I don't see much value in adding all these commented out tests that won't run.
Example of doing a query failure in SLT for reference:
datafusion/datafusion/sqllogictest/test_files/map.slt
Lines 158 to 159 in 602475f
Closing this as stale; feel free to reopen it when it becomes active again
This PR contributes towards the larger (RLE)/(REE) epic!
Rationale for this change
DataFusion currently supports REE encoding through Arrow's RunEndEncoded type, but lacks comprehensive testing to ensure this functionality works correctly across various SQL operations.
This PR adds comprehensive test coverage for REE-encoded data to ensure that:
What changes are included in this PR?
This PR adds a new test file run_end_encoding.slt that provides comprehensive testing for Run-End Encoded data:
- arrow_cast(column, 'RunEndEncoded(Int32, Utf8)')
- COUNT(*) and COUNT(DISTINCT) on REE columns
- LOWER() and UPPER() on REE columns
- CONCAT() with REE columns (including nested operations)
- SUBSTR()/SUBSTRING() on REE columns
- REPLACE() on REE columns
- REVERSE() on REE columns
- Nested calls such as UPPER(SUBSTR(...))

The set of functions included is deliberately minimal for now, focusing on the most commonly used operations based on insights from a private dataset. This foundation will be expanded over time as broader REE support is implemented.
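As a rough illustration of what "preserving REE through a string function" could mean, the sketch below applies a function such as LOWER once per run and reuses the run ends unchanged. This is an assumption-laden toy model in plain Python, not how DataFusion evaluates these functions; `map_run_values` and `decode_runs` are hypothetical helper names.

```python
from typing import Callable, List, Optional, Tuple


def map_run_values(
    run_ends: List[int],
    values: List[Optional[str]],
    fn: Callable[[str], str],
) -> Tuple[List[int], List[Optional[str]]]:
    """Apply fn to each run's value once; NULL runs stay NULL, run_ends are reused."""
    mapped = [None if value is None else fn(value) for value in values]
    # Note: adjacent runs may now hold equal values; this sketch does not re-merge them.
    return list(run_ends), mapped


def decode_runs(run_ends: List[int], values: List[Optional[str]]) -> List[Optional[str]]:
    """Expand (run_ends, values) into a flat column, for comparison only."""
    column: List[Optional[str]] = []
    start = 0
    for end, value in zip(run_ends, values):
        column.extend([value] * (end - start))
        start = end
    return column


# Five logical rows stored as three runs: Alice x2, NULL x2, Bob x1.
run_ends, values = [2, 4, 5], ["Alice", None, "Bob"]

# LOWER() touches three values instead of five rows.
lowered_ends, lowered_values = map_run_values(run_ends, values, str.lower)
```

The result is still a (run_ends, values) pair, which is the property the type-preservation checks in the test file are meant to catch if an implicit cast to Utf8 sneaks in.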
Are these changes tested?
The changes are tests.
Are there any user-facing changes?
As of now, no. Most of these tests won't pass yet due to the lack of REE support, but they give a guideline as to what our focus is.