[Repo Assist] Fix CSV schema parsing for column names containing parentheses (closes #348)#604

Merged
dsyme merged 3 commits into master from repo-assist/fix-issue-348-csv-schema-parens-4c86d7d41f6a1513
Mar 9, 2026

@github-actions github-actions Bot commented Mar 9, 2026

🤖 This is an automated PR from Repo Assist.

Fixes the regex bug in CSV schema parsing reported in #348.

Root Cause

nameAndTypeRegex in CsvInference.fs was incorrectly using RegexOptions.RightToLeft. In right-to-left mode the engine matches the pattern starting from the end of the string, so the greedy .+ in the type group expands leftward first, and the opening-parenthesis delimiter ends up on the first ( from the left rather than the last one.

For a schema item like "Revenue (USD)(float)", the intended parse is:

  • name = "Revenue (USD)"
  • type = "float"

But with RightToLeft, the regex matched:

  • name = "Revenue " (stopped at the inner ( in (USD))
  • type = "USD)(float" (incorrect!)

Without RightToLeft, the standard greedy .+ in the name group correctly finds the last parenthesized group as the type, because it consumes as many characters as possible before backtracking.
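The difference between the two modes can be reproduced in isolation. The sketch below is illustrative only: the pattern is an assumption about the general shape of nameAndTypeRegex (a name followed by a parenthesized type) and may differ from the one in CsvInference.fs, and `parseNameAndType` is a hypothetical helper, not part of the library.

```fsharp
open System.Text.RegularExpressions

// ASSUMPTION: a pattern of the general shape "name(type)". The real
// pattern in CsvInference.fs may differ in detail.
let nameAndTypePattern = @"^(?<name>.+)\((?<type>.+)\)$"

// Hypothetical helper: returns Some (name, type) on a match, None otherwise.
let parseNameAndType (options: RegexOptions) (input: string) =
    let m = Regex.Match(input, nameAndTypePattern, options)
    if m.Success then
        Some (m.Groups.["name"].Value, m.Groups.["type"].Value)
    else
        None

let schemaItem = "Revenue (USD)(float)"

// Left-to-right: the greedy name group backtracks only as far as the
// last "(", so the type is the final parenthesized group.
let leftToRight = parseNameAndType RegexOptions.None schemaItem

// Right-to-left: the greedy type group grows leftward instead, so the
// "(" delimiter lands on the first "(" in the string.
let rightToLeft = parseNameAndType RegexOptions.RightToLeft schemaItem

printfn "left-to-right: %A" leftToRight
printfn "right-to-left: %A" rightToLeft
```

Run as an F# script, the left-to-right result should be the intended name "Revenue (USD)" with type "float", while the right-to-left result should reproduce the incorrect split described above.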

Fix

Remove RegexOptions.RightToLeft from nameAndTypeRegex only (line 44). The other two regexes (typeAndUnitRegex and overrideByNameRegex) correctly retain RightToLeft because they need to handle cases like:

  • "float<unit<sub>>" — unit string can contain <
  • "Revenue->newName=float" — column names can contain =
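To see why right-to-left matching helps when the trailing part may contain the delimiter, here is a small sketch for the type-and-unit case. The pattern is an assumption about the general shape of typeAndUnitRegex (a type followed by a unit in angle brackets), not a copy of the one in CsvInference.fs, and `parseTypeAndUnit` is a hypothetical helper.

```fsharp
open System.Text.RegularExpressions

// ASSUMPTION: a pattern of the general shape "type<unit>". The real
// typeAndUnitRegex may differ in detail.
let typeAndUnitPattern = @"^(?<type>.+)<(?<unit>.+)>$"

// Hypothetical helper: returns Some (type, unit) on a match, None otherwise.
let parseTypeAndUnit (options: RegexOptions) (input: string) =
    let m = Regex.Match(input, typeAndUnitPattern, options)
    if m.Success then
        Some (m.Groups.["type"].Value, m.Groups.["unit"].Value)
    else
        None

let typeItem = "float<unit<sub>>"

// Right-to-left: the unit group grows leftward to the first "<", so a
// unit that itself contains "<" stays in one piece.
let unitRightToLeft = parseTypeAndUnit RegexOptions.RightToLeft typeItem

// Left-to-right: the greedy type group swallows the first "<", which
// splits the unit at its inner "<".
let unitLeftToRight = parseTypeAndUnit RegexOptions.None typeItem

printfn "right-to-left: %A" unitRightToLeft
printfn "left-to-right: %A" unitLeftToRight
```

Under these assumptions the situation is the mirror image of the name-and-type case: here the delimiter can appear inside the trailing group, so splitting at the first occurrence from the left is the desired behavior.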

Test Status

All 465 existing tests pass ✅. A new regression test is added:

[<Test>]
let ``Can read CSV with schema where column name contains parentheses``() =
  let csv = "Revenue (USD),Count\n100.0,1\n200.0,2"
  use reader = new System.IO.StringReader(csv)
  let df = Frame.ReadCsv(reader, schema="Revenue (USD)(float),Count(int)")
  List.ofSeq df.ColumnKeys |> shouldEqual ["Revenue (USD)"; "Count"]
  df.GetColumn<float>("Revenue (USD)") |> Series.values |> List.ofSeq |> shouldEqual [100.0; 200.0]

Trade-offs

None. The greedy (non-RightToLeft) behavior is the correct and intended behavior for nameAndTypeRegex. This is a pure bug fix with no behaviour changes for existing valid schemas.

Generated by Repo Assist

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@30f2254f2a7a944da1224df45d181a3f8faefd0d

…#348)

The nameAndTypeRegex was incorrectly using RegexOptions.RightToLeft. Matching
from the right, the greedy type group expands leftward, so the '(' delimiter
lands on the FIRST '(' in the string rather than the LAST one. This meant that
a schema item like 'Revenue (USD)(float)' was parsed with name='Revenue ' and
type='USD)(float' instead of name='Revenue (USD)', type='float'.

The fix removes RightToLeft from nameAndTypeRegex only. The other two regexes
(typeAndUnitRegex and overrideByNameRegex) correctly use RightToLeft because they
need to handle cases like 'type<unit<sub>>' or 'name->newName=type' where the
name itself can contain the delimiter characters.

A regression test is added to cover this case.

Co-authored-by: Copilot <[email protected]>
@dsyme dsyme marked this pull request as ready for review March 9, 2026 04:00
@dsyme dsyme merged commit 1e1a4c6 into master Mar 9, 2026
2 checks passed
@dsyme dsyme deleted the repo-assist/fix-issue-348-csv-schema-parens-4c86d7d41f6a1513 branch March 9, 2026 12:45