Fix name tracker by xanderbailey · Pull Request #19856 · apache/datafusion

xanderbailey · 2026-01-16T17:23:16Z

Which issue does this PR close?

Closes #17508 (Improve substrait NameTracker so it doesn't require uuids)

Rationale for this change

The previous implementation used UUID-based aliasing as a workaround to prevent duplicate names for literals in Substrait plans. This approach had several drawbacks:

- Non-deterministic plan names that made testing difficult (requiring UUID regex filters)
- Only addressed literal naming conflicts, not the broader issue of name deduplication
- Added an unnecessary dependency on the `uuid` crate
- Didn't properly handle cases where the same qualified name could appear with different schema representations

What changes are included in this PR?

1. Enhanced NameTracker: Refactored to detect two types of conflicts:
   - Duplicate schema names: tracked via `schema_name()` to prevent `validate_unique_names` failures (e.g., two `Utf8(NULL)` literals)
   - Ambiguous references: tracked via `qualified_name()` to prevent `DFSchema::check_names` failures when a qualified field (e.g., `left.Utf8(NULL)`) and an unqualified field (e.g., `Utf8(NULL)`) share the same column name
2. Removed UUID dependency: Eliminated the `uuid` crate from `datafusion/substrait`
3. Removed literal-specific aliasing: The UUID-based workaround in `project_rel.rs` is no longer needed, as the improved NameTracker handles all naming conflicts consistently
4. Deterministic naming: Name conflicts now use predictable `__temp__N` suffixes instead of random UUIDs

Note: This doesn't fully fix all the issues in #17508. Some of them involve special casing of CAST, which is not included here.
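The two conflict checks and the `__temp__N` fallback described above can be sketched without any DataFusion dependency. This is a simplified model under stated assumptions, not the code in this PR: `FieldName`, its `schema_name()` helper, and the three set names are hypothetical stand-ins for `Expr::schema_name()`, `Expr::qualified_name()`, and the tracker's internal state.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an expression's output name: an optional
/// table qualifier plus a column name. Not a DataFusion type.
#[derive(Clone, Debug, PartialEq)]
pub struct FieldName {
    pub qualifier: Option<String>,
    pub name: String,
}

impl FieldName {
    pub fn new(qualifier: Option<&str>, name: &str) -> Self {
        Self {
            qualifier: qualifier.map(str::to_string),
            name: name.to_string(),
        }
    }

    /// Flat display name, loosely modelled on what `schema_name()` prints.
    pub fn schema_name(&self) -> String {
        match &self.qualifier {
            Some(q) => format!("{q}.{}", self.name),
            None => self.name.clone(),
        }
    }
}

/// Simplified tracker (assumed layout): one set per kind of name it must remember.
#[derive(Default)]
pub struct NameTracker {
    seen_schema_names: HashSet<String>,
    qualified_columns: HashSet<String>,
    unqualified_columns: HashSet<String>,
}

impl NameTracker {
    /// True if adding `field` would create a duplicate schema name or an
    /// ambiguous qualified/unqualified pair with the same column name.
    fn conflicts(&self, field: &FieldName) -> bool {
        if self.seen_schema_names.contains(&field.schema_name()) {
            return true;
        }
        if field.qualifier.is_some() {
            self.unqualified_columns.contains(&field.name)
        } else {
            self.qualified_columns.contains(&field.name)
        }
    }

    /// Remember an accepted name in every set that later checks consult.
    fn record(&mut self, field: &FieldName) {
        self.seen_schema_names.insert(field.schema_name());
        if field.qualifier.is_some() {
            self.qualified_columns.insert(field.name.clone());
        } else {
            self.unqualified_columns.insert(field.name.clone());
        }
    }

    /// Return `field` unchanged if it is safe, otherwise an unqualified
    /// `<schema_name>__temp__N` alias with the smallest free N.
    pub fn get_uniquely_named(&mut self, field: FieldName) -> FieldName {
        if !self.conflicts(&field) {
            self.record(&field);
            return field;
        }
        let base = field.schema_name();
        let mut counter = 0;
        let renamed = loop {
            let candidate = FieldName::new(None, &format!("{base}__temp__{counter}"));
            if !self.conflicts(&candidate) {
                break candidate;
            }
            counter += 1;
        };
        self.record(&renamed);
        renamed
    }
}
```

In this model, feeding `Utf8(NULL)` three times yields `Utf8(NULL)`, `Utf8(NULL)__temp__0`, and `Utf8(NULL)__temp__1`, and feeding `left.Utf8(NULL)` followed by an unqualified `Utf8(NULL)` renames the second one even though the two schema names differ. Because the counter always restarts at 0, the same input order gives the same names on every run, which is what makes snapshot tests stable.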

Are these changes tested?

Yes:

- Updated snapshot tests to reflect the new deterministic naming (e.g., `Utf8("people")__temp__0` instead of UUID-based names)
- Modified some roundtrip tests to verify semantic equivalence (schema matching and execution) rather than exact string matching, which is more robust
- All existing integration tests pass with the new naming scheme

Are there any user-facing changes?

Minimal. The generated plan names are now deterministic and more readable (using __temp__N suffixes instead of UUIDs), but this is primarily an internal representation change. The functional behavior and query results remain unchanged.

dd-annarose

Makes sense. There's still the issue with CASTs that you mentioned in the PR description but this solution works. Handling the CAST issue seems to require a much much deeper rewrite; this solution is straightforward and enough for now.

xanderbailey · 2026-02-07T19:24:24Z

Thanks for the review, I have a couple of failing test cases here that I need to look into. Will take a look on Monday and report back.

hareshkh · 2026-02-16T11:28:52Z

datafusion/substrait/src/logical_plan/consumer/utils.rs

+        let mut counter = 0;
+        loop {
+            let candidate_name = format!("{schema_name}__temp__{counter}");
+            let candidate_expr = expr.clone().alias(candidate_name.clone());


This clone could be avoided by checking in the hashsets directly
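One way to read this suggestion: build only the candidate name inside the loop, test it against the set of seen names, and alias the expression once after the loop ends. The sketch below uses a hypothetical `Expr` struct and `dedupe` function (neither is DataFusion's) and assumes a single `HashSet<String>` of seen names.

```rust
use std::collections::HashSet;

/// Hypothetical expression type; cloning it is assumed to be costly.
#[derive(Clone, Debug)]
pub struct Expr {
    pub name: String,
    pub payload: Vec<u8>,
}

impl Expr {
    /// Consume the expression and give it a new output name.
    pub fn alias(self, name: String) -> Expr {
        Expr { name, ..self }
    }
}

/// Return `expr` with a name not yet in `seen`, without cloning `expr`.
pub fn dedupe(expr: Expr, seen: &mut HashSet<String>) -> Expr {
    // Fast path: `insert` returns true when the name was not present.
    if seen.insert(expr.name.clone()) {
        return expr;
    }
    let mut counter = 0;
    // Only the candidate string is built per iteration; the expression
    // itself is untouched until a free name has been found.
    let candidate_name = loop {
        let candidate = format!("{}__temp__{counter}", expr.name);
        if !seen.contains(&candidate) {
            break candidate;
        }
        counter += 1;
    };
    seen.insert(candidate_name.clone());
    expr.alias(candidate_name)
}
```

The loop-with-`break` shape mirrors the `let candidate_name = loop { ... }` form that appears in the later revision of this diff.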

dd-annarose

Thank you for the nice tests. This makes sense to me. Just left a small comment, it might be a little off as I haven't worked in the name tracker in a while.

dd-annarose · 2026-02-16T13:44:10Z

datafusion/substrait/src/logical_plan/consumer/utils.rs

+        let mut counter = 0;
+        let candidate_name = loop {
+            let candidate_name = format!("{schema_name}__temp__{counter}");
+            // .alias always produces an unqualified name so check for conflicts accordingly.


Could we use alias_qualified() instead of alias, or does that complicate things too much?

xanderbailey · 2026-02-16

Yeah, great question. I looked at that and I think it complicates things, and I can't find a reason to change it: I couldn't write a failing test that it would fix, so I thought it was best to keep it as is. WDYT?

dd-annarose · 2026-02-17

Let's keep it as is then.

xanderbailey · 2026-02-17T19:27:34Z

@LiaCastaneda are you able to give this a look? It seems like @dd-annarose and @hareshkh are good with it, but I know you're also invested in the substrait work.

Hoping this will fix a number of the ambiguous reference errors we're seeing.

LiaCastaneda

Thanks for looking into this! 🙇‍♀️ I think this is a neat and easy-to-follow solution.
cc @gabotechs or @alamb -- I think this PR makes sense. Would either of you be able to review whenever you have time?

LiaCastaneda · 2026-02-18T09:54:08Z

datafusion/substrait/src/logical_plan/consumer/utils.rs

+        let schema_name = expr.schema_name().to_string();
+        let mut counter = 0;
+        let candidate_name = loop {
+            let candidate_name = format!("{schema_name}__temp__{counter}");


There is also some logic to rename aliases to make them unique (used for avoiding duplicate names in join schemas here and here). This generates plans with :N suffixes like this, but it operates on Arrow Fields rather than Expr, so it can't be easily unified with the __temp__ mechanism. Maybe a future consistency improvement could standardize on one naming convention (using __temp__{N} everywhere), though the current distinction may be intentional (__temp__ = substrait conversion, :N = standard deduplication)?
(I'm not suggesting any change with this; it's an open question whether it makes sense.)
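For comparison with `__temp__N`, a counter-based `:N` scheme over field names might look roughly like the sketch below. It works on plain strings rather than Arrow `Field`s, the function name `suffix_duplicate_names` is hypothetical, and it is not meant to reproduce the existing DataFusion helper.

```rust
use std::collections::HashMap;

/// Hypothetical helper: keep the first occurrence of each name and append
/// `:N` to the N-th repeat, counting repeats from 1.
pub fn suffix_duplicate_names(names: &[&str]) -> Vec<String> {
    let mut occurrences: HashMap<&str, usize> = HashMap::new();
    names
        .iter()
        .map(|&name| {
            let count = occurrences.entry(name).or_insert(0);
            *count += 1;
            if *count > 1 {
                format!("{name}:{}", *count - 1)
            } else {
                name.to_string()
            }
        })
        .collect()
}
```

Unlike the tracker approach, this sketch never checks whether a generated name such as `a:1` already exists in the input, and it has no notion of qualifiers. That difference is part of why the two mechanisms are not trivially interchangeable.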

gabotechs

Looks good! thanks @xanderbailey

gabotechs · 2026-02-24T08:16:44Z

Thanks @xanderbailey for the PR, and @dd-annarose, @hareshkh and @LiaCastaneda for the reviews.


github-actions bot added the substrait Changes to the substrait crate label Jan 16, 2026

xanderbailey mentioned this pull request Jan 16, 2026

Allow flag to alias all projected substrait expressions with a UUID #19123

Closed

dd-annarose approved these changes Feb 7, 2026


alamb marked this pull request as draft February 8, 2026 12:25

xanderbailey added 3 commits February 15, 2026 20:49

Fix name tracker

13999b1

fix

0d8aeac

fix

12744d0

xanderbailey force-pushed the xb/fix_name_tracker branch from 8788a58 to 12744d0 Compare February 15, 2026 22:51

xanderbailey added 2 commits February 15, 2026 23:12

fmt

6f23a8c

maybe this works:

61a6621

xanderbailey marked this pull request as ready for review February 16, 2026 01:06

xanderbailey requested a review from dd-annarose February 16, 2026 01:15

Add tests to name tracker

e35f773

hareshkh reviewed Feb 16, 2026


xanderbailey added 3 commits February 16, 2026 11:29

Avoid clone on expr when creating candidate

6306a7d

avoid str clone

34351b8

typo

8416107

hareshkh approved these changes Feb 16, 2026


dd-annarose approved these changes Feb 16, 2026


Merge branch 'main' into xb/fix_name_tracker

cf0a531

LiaCastaneda approved these changes Feb 18, 2026


gabotechs approved these changes Feb 24, 2026


gabotechs added this pull request to the merge queue Feb 24, 2026

Merged via the queue into apache:main with commit d59cdfe Feb 24, 2026
31 checks passed

hareshkh mentioned this pull request Feb 24, 2026

Release DataFusion 52.2.0 (minor/) Release (Feb 2026) #20287

Open

20 tasks

LiaCastaneda mentioned this pull request Feb 25, 2026

Bring name tracker fix DataDog/datafusion#82

Merged

xanderbailey mentioned this pull request Feb 25, 2026

Expressions from left and right of a join can fail planning during substrait converstion if their name is the same #19330

Closed

gabotechs merged 10 commits into apache:main from xanderbailey:xb/fix_name_tracker
