Fix bulk insert parsing of isolated quotes in tab-delimited data (#2792) #2795
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2795   +/-   ##
=========================================
  Coverage     52.20%   52.20%
- Complexity     4142     4144    +2
=========================================
  Files           149      149
  Lines         34306    34306
  Branches       5723     5723
=========================================
+ Hits          17908    17909    +1
+ Misses        13906    13905    -1
  Partials       2492     2492
This removes support for a regex delimiter. After digging through old versions of the code, it appears that the original support for a regex delimiter was an accidental feature, but it might not have worked properly in some cases (especially when […]). At any rate, a delimiter cannot be a regex as of this merge, so all I propose now is to remove the mention of regex in the following comment (it appears twice in SQLServerBulkCSVFileRecord.java) since it is no longer true:
    * Delimiter to used to separate each column. Regex characters must be escaped with double backslashes.

@abbrev - Thanks for the detailed context.
Overview
This PR fixes bulk insert parsing of isolated quotes in tab-delimited data by removing problematic global quote state tracking from the parseString method in SQLServerBulkCSVFileRecord. The fix ensures that isolated quote characters are treated as literal data rather than field boundary markers, resolving IndexOutOfBoundsException errors during bulk copy operations.
Problem Description
The current implementation uses a global quoted boolean state in the parseString method that toggles on every quote character encounter. This causes issues when tab-delimited data contains isolated quotes within fields:
    if (buffer.charAt(i) == doubleQuoteChar) {
        quoted = !quoted;
    } else if (!quoted && /* delimiter found */) {
        // Process delimiter
    }
When parsing data like "Do you wish to remove the product "\t22451\t1", the isolated quote incorrectly toggles the quoted state, causing subsequent tab delimiters to be ignored. This results in:
Expected: 5 fields parsed correctly
Actual: 3 fields parsed, causing IndexOutOfBoundsException
Root Cause
PR #2434 introduced quote handling logic to fix stack overflow issues in CSV parsing. While the fix successfully resolved the stack overflow problem for CSV files, it created a new issue where isolated quotes in tab-delimited data are treated as field boundary markers instead of literal characters.
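The quote-toggling behavior described above can be reproduced in isolation. The following is a minimal sketch, not the driver's actual parseString: the class QuoteToggleSketch, the method splitWithQuoteToggle, and the sample line are hypothetical and only mimic the quoted flag logic quoted in the problem description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the quote-toggling logic; not the driver's code.
class QuoteToggleSketch {

    // Splits on the delimiter, but ignores delimiters while "quoted" is true.
    // Every double quote flips the flag, so one isolated quote leaves the
    // flag set for the rest of the line.
    static List<String> splitWithQuoteToggle(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                quoted = !quoted;          // global state toggled on every quote
                current.append(c);
            } else if (!quoted && c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);         // delimiters inside "quotes" land here
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample row: four tab-separated fields, one isolated quote.
        String line = "1\tremove the product \"\t22451\t1";
        System.out.println("toggle parser: " + splitWithQuoteToggle(line, '\t').size());
        System.out.println("plain split:   " + line.split("\t", -1).length);
    }
}
```

With the assumed sample row, the toggling parser stops recognizing tabs after the isolated quote and returns fewer fields than a plain split, which is the kind of column-count mismatch that can surface later as an IndexOutOfBoundsException.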
Solution
Reverted to using currentLine.split(delimiter, -1) instead of parseString(currentLine, delimiter) for simple delimiter-based parsing.
Maintained the stack overflow fix from PR #2434 while fixing the quote parsing regression.
Added comprehensive test coverage with the exact problematic data patterns from issue #2792.
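To illustrate why the -1 limit matters in split(delimiter, -1), here is a small sketch using only java.lang.String and java.util.regex.Pattern. The class SplitLimitSketch and the helper splitLiteral are hypothetical. String.split treats its first argument as a regular expression, so the helper quotes the delimiter with Pattern.quote; whether the driver does the same is an assumption of this example and is not stated in this PR.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the driver.
class SplitLimitSketch {

    // Splits on a literal delimiter and keeps trailing empty fields.
    // Pattern.quote prevents characters such as '|' or '.' from being
    // interpreted as regex syntax (an assumption of this sketch).
    static String[] splitLiteral(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // Row with an isolated quote and two trailing empty columns.
        String row = "a\t\"b\t\t";

        // Default limit (0) silently drops trailing empty strings.
        System.out.println(row.split("\t").length);

        // A negative limit keeps every column, including empty ones.
        System.out.println(row.split("\t", -1).length);

        // The isolated quote stays in the field as literal data.
        System.out.println(splitLiteral(row, "\t")[1]);
    }
}
```

A negative limit keeps the column count stable for rows that end in empty values, so the number of parsed fields matches the number of delimiters plus one regardless of where quotes appear.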
Testing
Closes #2792