Skip to content

Comments

SQL logic tests for Run-End Encoded (REE)#16715

Closed
rich-t-kid-datadog wants to merge 3 commits intoapache:mainfrom
rich-t-kid-datadog:RunEndEncoding-slt-test
Closed

SQL logic tests for Run-End Encoded (REE)#16715
rich-t-kid-datadog wants to merge 3 commits intoapache:mainfrom
rich-t-kid-datadog:RunEndEncoding-slt-test

Conversation

@rich-t-kid-datadog
Copy link

@rich-t-kid-datadog rich-t-kid-datadog commented Jul 8, 2025

This PR contributes towards the larger (RLE)/(REE) epic!

Rationale for this change

DataFusion currently supports REE encoding through Arrow's RunEndEncoded type, but lacks comprehensive testing to ensure this functionality works correctly across various SQL operations.
This PR adds comprehensive test coverage for REE-encoded data to ensure that:

  • Basic SQL operations (filtering, selection, aggregation) work correctly with REE columns
  • String functions (LOWER, UPPER, CONCAT, SUBSTR, REPLACE, REVERSE) properly handle REE-encoded data
  • Complex queries involving multiple REE columns function as expected
  • The encoding/decoding process is transparent to users and maintains data integrity

What changes are included in this PR?

This PR adds a new test file run_end_encoding.slt that provides comprehensive testing for Run-End Encoded data:

  1. Basic REE Functionality
  • Table creation with REE-encoded columns using arrow_cast(column, 'RunEndEncoded(Int32, Utf8)')
  • Verification of correct data types via DESCRIBE statements
  1. Filtering and Selection
  • Filtering on REE columns
  • Combined filtering on REE and non-REE columns
  1. Aggregation Functions
  • COUNT(*) and COUNT(DISTINCT) on REE columns
  1. String Function Testing
  • LOWER() and UPPER() on REE columns
  • CONCAT() with REE columns (including nested operations)
  • SUBSTR()/SUBSTRING() on REE columns
  • REPLACE() on REE columns
  • REVERSE() on REE columns
  • Combined string functions (e.g., UPPER(SUBSTR(...)))

The set of functions included is deliberately minimal for now, focusing on the most commonly used operations based on insights from a private dataset. This foundation will be expanded over time as broader REE support is implemented.

Are these changes tested?

The changes are test.

Are there any user-facing changes?

As of now, no. A majority of these test wont pass as of now due to the lack of support but it gives a guideline as to what our focus is.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 8, 2025
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the RunEndEncoding-slt-test branch from 147e058 to bef6ee3 Compare July 9, 2025 14:06
@rich-t-kid-datadog rich-t-kid-datadog changed the title draft of .slt file. Implemented the basics, need to test with cast ch… [Draft]Add SQL logic tests for Run-End Encoded (REE) Jul 9, 2025
@rich-t-kid-datadog
Copy link
Author

TODO: Add Hash Joins
Not a priority


# LOWER function tests
query T
SELECT LOWER(name) FROM ree_test_two_columns WHERE name = 'Alice' LIMIT 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these tests it could be nice to have a way to verify the output type is also Run End Encoded (as opposed to DataFusion implicitly casting to Utf8)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense, I tried looking looking through Duckdb's sqlogictest Documentation but there didnt seem to be any clean way to do this. To work around it, I generated a temporary table from the result of the query TABLE (SELECT SUBSTR(name, 1, 3) AS name_prefix FROM ree_test_two_columns LIMIT 1)and ran the DESCRIBE operator on it and validated the schema.
ex.
DESCRIBE TABLE (SELECT SUBSTR(name, 1, 3) AS name_prefix FROM ree_test_two_columns LIMIT 1);
----
name_prefix RunEndEncoded(Int32, Utf8) YES

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use arrow_typeof(), for example:

# Demonstrate types
query TTT
SELECT arrow_typeof(c1), arrow_typeof(c2), arrow_typeof(c3) FROM test LIMIT 1;
----
Int32 Int64 Boolean

Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! how about adding some test that try to stress REE with some more edge cases, for example, we could have tests that check for REE input containing NULLs, or when the input table contains no duplicates

@rich-t-kid-datadog rich-t-kid-datadog force-pushed the RunEndEncoding-slt-test branch from c8b4e96 to b90a1ac Compare July 10, 2025 21:27
@rich-t-kid-datadog
Copy link
Author

rich-t-kid-datadog commented Jul 10, 2025

TBD: add the In & length operator into the test, should fit in with the other string operations and its frequently used at DD

@rich-t-kid-datadog rich-t-kid-datadog changed the title [Draft]Add SQL logic tests for Run-End Encoded (REE) Add SQL logic tests for Run-End Encoded (REE) Jul 14, 2025
@rich-t-kid-datadog rich-t-kid-datadog changed the title Add SQL logic tests for Run-End Encoded (REE) SQL logic tests for Run-End Encoded (REE) Jul 14, 2025
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the RunEndEncoding-slt-test branch from b90a1ac to 8474840 Compare July 14, 2025 22:10
@rich-t-kid-datadog
Copy link
Author

Currently all test are commented out, This is to allow for the CI to pass. As features are added to REE, corresponding test will be uncommented.

Add tests for NULL values and no-duplicate scenarios, plus DESCRIBE statements
to validate REE type preservation through string operations.
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this PR we should uncomment all the tests and expect them to be query failures, with comments on top with the expected results; that way as REE support is added in we will know to update these tests as they'll fail. As the PR currently is, I don't see much value in adding all these commented out tests that won't run.

Example of doing a query failure in SLT for reference:

query error DataFusion error: Arrow error: Cast error: Cannot cast string 'ab' to value of Int64 type
SELECT MAKE_MAP('POST', 41, 'HEAD', 'ab', 'PATCH', 30);

@Jefffrey
Copy link
Contributor

Closing this as stale; feel free to reopen it when it becomes active again

@Jefffrey Jefffrey closed this Nov 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants