…#348) The nameAndTypeRegex was incorrectly using RegexOptions.RightToLeft, which caused it to find the FIRST '(' scanning right-to-left rather than the LAST '(' from the left. This meant that a schema item like 'Revenue (USD)(float)' was parsed with name='Revenue ' and type='USD)(float' instead of name='Revenue (USD)', type='float'. The fix removes RightToLeft from nameAndTypeRegex only. The other two regexes (typeAndUnitRegex and overrideByNameRegex) correctly use RightToLeft because they need to handle cases like 'type<unit<sub>>' or 'name->newName=type' where the name itself can contain the delimiter characters. A regression test is added to cover this case. Co-authored-by: Copilot <[email protected]>
🤖 This is an automated PR from Repo Assist.
Fixes the regex bug in CSV schema parsing reported in #348.
Root Cause
nameAndTypeRegex in CsvInference.fs was incorrectly using RegexOptions.RightToLeft. With that option the engine processes the input from the end of the string, and the match split the schema item at the leftmost '(' rather than at the last '('.
For a schema item like "Revenue (USD)(float)", the intended parse is:
- name: "Revenue (USD)"
- type: "float"
But with RightToLeft, the regex matched:
- name: "Revenue " (stopped at the inner '(' in "(USD)")
- type: "USD)(float" (incorrect!)
Without RightToLeft, the standard greedy .+ in the name group correctly finds the last parenthesized group as the type, because it consumes as many characters as possible before backtracking.
Fix
Remove RegexOptions.RightToLeft from nameAndTypeRegex only (line 44). The other two regexes (typeAndUnitRegex and overrideByNameRegex) correctly retain RightToLeft because they need to handle cases like:
- "float(unit<sub)>": the unit string can contain '<'
- "Revenue->newName=float": column names can contain '='
Test Status
All 465 existing tests pass ✅. A new regression test is added to cover this case.
Trade-offs
None. The greedy (non-RightToLeft) behavior is the correct and intended behavior for nameAndTypeRegex. This is a pure bug fix with no behavior changes for existing valid schemas.
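Illustration
The greedy left-to-right behavior described under Root Cause can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern below is a hypothetical "name(type)" shape, since this PR does not quote the actual regex in CsvInference.fs, and the helper name parse_name_and_type is made up for the example. Python's re module has no counterpart to RegexOptions.RightToLeft, so only the fixed (left-to-right) case is shown.

```python
import re

# Assumed pattern shape: a name followed by a parenthesized type.
# The real nameAndTypeRegex is not quoted in this PR, so treat this as hypothetical.
NAME_AND_TYPE = re.compile(r"^(?P<name>.+)\((?P<type>.+)\)$")


def parse_name_and_type(item):
    """Split a schema item into (name, type); return None if no type is given."""
    match = NAME_AND_TYPE.match(item)
    if match is None:
        return None
    return match.group("name"), match.group("type")


# The greedy name group first consumes the whole string, then backtracks
# until the remainder matches "(type)", so the split lands on the LAST '('.
print(parse_name_and_type("Revenue (USD)(float)"))  # ('Revenue (USD)', 'float')
print(parse_name_and_type("Price(decimal)"))        # ('Price', 'decimal')
print(parse_name_and_type("Name"))                  # None
```

Because the name group is matched first and greedily, any parentheses inside the column name stay in the name, and only the final parenthesized group is treated as the type.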