preserve Field metadata in first_value/last_value#19335
preserve Field metadata in first_value/last_value#19335adriangb merged 4 commits intoapache:mainfrom
Conversation
| @@ -137,6 +137,18 @@ impl AggregateUDFImpl for FirstValue { | |||
| Ok(arg_types[0].clone()) | |||
There was a problem hiding this comment.
Ideally we should have return_type return an internal error instead of leaving it implemented if we use return_field now
| @@ -1075,6 +1087,18 @@ impl AggregateUDFImpl for LastValue { | |||
| Ok(arg_types[0].clone()) | |||
| /// UDF to extract metadata from a field for testing purposes | ||
| /// Usage: get_metadata(expr, 'key') -> returns the metadata value or NULL | ||
| #[derive(Debug, PartialEq, Eq, Hash)] | ||
| struct GetMetadataUdf { |
There was a problem hiding this comment.
I wonder if this is useful enough to introduce as a function to datafusion itself, instead of being only in SLT? 🤔
There was a problem hiding this comment.
Happy to do so if you think it's worth it - I went with the conservative approach
There was a problem hiding this comment.
We don't need to do it in this PR, but worth filing a ticket for as followup
| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
I noticed for the existing regression tests in this file, they don't actually check metadata 🤔
datafusion/datafusion/sqllogictest/test_files/metadata.slt
Lines 63 to 67 in 58377bf
With this new UDF we can look into updating those tests to be more similar to the ones introduced here
There was a problem hiding this comment.
Will file a ticket before we merge this to do so
There was a problem hiding this comment.
the tests I think test that there is metadata on the input tables (rather than the output tables)
I do really like the idea of adding a UDF, simlarly to arrow_typeof that can show the metadata
``sql
select arrow_typeof('foo');
+---------------------------+
| arrow_typeof(Utf8("foo")) |
+---------------------------+
| Utf8 |
+---------------------------+
1 row(s) fetched.
Elapsed 0.024 seconds.
Possibilities: add a new argument
```sql
> select arrow_typeof('foo', true);
Possibility: add a new function
> select arrow_metadata('foo');There was a problem hiding this comment.
I think this would help debug various metadata issues more easily
I can file a ticket if you think this is reasonable
| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
the tests I think test that there is metadata on the input tables (rather than the output tables)
I do really like the idea of adding a UDF, simlarly to arrow_typeof that can show the metadata
``sql
select arrow_typeof('foo');
+---------------------------+
| arrow_typeof(Utf8("foo")) |
+---------------------------+
| Utf8 |
+---------------------------+
1 row(s) fetched.
Elapsed 0.024 seconds.
Possibilities: add a new argument
```sql
> select arrow_typeof('foo', true);
Possibility: add a new function
> select arrow_metadata('foo');| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
I think this would help debug various metadata issues more easily
I can file a ticket if you think this is reasonable
| } | ||
|
|
||
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { | ||
| // Get the metadata key from the second argument (must be a string literal) |
There was a problem hiding this comment.
I think it would also be nice if we supported a single column version that returned the metadata as a struct array too
| ---- | ||
| 1 the id field NULL the name field | ||
| 3 the id field baz the name field | ||
| NULL the id field bar the name field |
There was a problem hiding this comment.
What would happen if a non-existing column is passed as a first argument of the get_metadata() udf ? Or a scalar value.
There was a problem hiding this comment.
Return an empty map?
There was a problem hiding this comment.
Or NULL.
I mean it would be good to have some negative test cases too.
There was a problem hiding this comment.
What would happen if a non-existing column is passed as a first argument of the get_metadata() udf
I think it should probably error like any other query that tries to access a undefined column
|
I opened #19356 to track generalizing the function added in this PR. |
|
@erratic-pattern is looking at something similar for us upstream in InfluxDB |
Which issue does this PR close?
Closes #19336
Rationale for this change
The
first_valueandlast_valueaggregate functions were not preserving Field metadata from their input arguments. This caused metadata to be lost when using these functions, which affects downstream consumers that rely on metadata (e.g., for DISTINCT ON queries which usefirst_valueinternally).What changes are included in this PR?
return_field()forFirstValueto preserve input field metadatareturn_field()forLastValueto preserve input field metadataget_metadataUDF for testing metadata preservation in sqllogictestfirst_value,last_value,DISTINCT ON,DISTINCT, and grouped columnsAre these changes tested?
Yes, new sqllogictest tests are added in
metadata.sltthat verify metadata is preserved through various aggregate operations.Are there any user-facing changes?
Yes, Field metadata is now correctly preserved when using
first_value()andlast_value()aggregate functions. This is a bug fix that improves metadata propagation.🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 [email protected]