Issue 5051 - ignore duplicate header when load data#5275
Issue 5051 - ignore duplicate header when load data#5275frank-dong-ms-zz wants to merge 4 commits intodotnet:masterfrom
Conversation
| // Skip duplicate header string | ||
| if (_headerString == text) | ||
| { | ||
| _ch.Warning($"Ignore duplicate header string {text} at line {line}."); |
There was a problem hiding this comment.
Although I guess this is the only way of actually solving the issue, I'm concerned about the performance implications of having to do a full string comparison for every line we load. If the dataset is big (with many columns), would this performance implication be significant?
There was a problem hiding this comment.
After thinking about it, I guess the implementation of comparing strings will actually be optimized to return "false" as soon as it finds a character that doesn't match... and in general even if the dataset has many columns, this comparison would return false very soon for the vast majority of the cases. So I guess this isn't a problem...
In reply to: 448687279 [](ancestors = 448687279)
There was a problem hiding this comment.
I agree with your assessment. But I would still like to see some benchmark test that this is true, especially on a large file.
In reply to: 448701405 [](ancestors = 448701405,448687279)
There was a problem hiding this comment.
Are we comparing the first string read in the header file with the other strings as they are read? What happens if there is another header row in the text file, but has different spacing between the commas or has quotes?
In reply to: 448759220 [](ancestors = 448759220,448701405,448687279)
There was a problem hiding this comment.
I'm not sure that also checking for variations of the header (spacing, quotes, capitalization, culture variation, etc ...) would be a good idea, as it would necessarily increase the time spent while reading every line .
Besides, I'm not sure how much coverage we should give to fix #5051 ? I'd think it isn't a very common problem anyway, and so simply checking for the exact same string as found on the header of the file would be enough.
In reply to: 448759887 [](ancestors = 448759887,448759220,448701405,448687279)
|
|
||
| private void ThreadProc() | ||
| private void ReadLines() | ||
| { |
There was a problem hiding this comment.
suggestion: ThreadProc indicates that it is the main procedure of the of the thread. It might be better to retain that meaning and call it ReadLinesThreadProc
Codecov Report
@@ Coverage Diff @@
## master #5275 +/- ##
==========================================
- Coverage 73.74% 73.68% -0.06%
==========================================
Files 1022 1022
Lines 190200 190272 +72
Branches 20463 20465 +2
==========================================
- Hits 140263 140208 -55
- Misses 44418 44537 +119
- Partials 5519 5527 +8
|
| private TextLoader.Options GetLoaderArgs(string labelColumnName, string matrixColumnIndexColumnName, string matrixRowIndexColumnName) | ||
| { | ||
| return new TextLoader.Options() | ||
| { | ||
| Separator = "\t", | ||
| HasHeader = true, | ||
| Columns = new[] | ||
| { | ||
| new TextLoader.Column(labelColumnName, DataKind.Single, new [] { new TextLoader.Range(0) }), | ||
| new TextLoader.Column(matrixColumnIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(1) }, new KeyCount(20)), | ||
| new TextLoader.Column(matrixRowIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(2) }, new KeyCount(40)), | ||
| } | ||
| }; | ||
| } |
There was a problem hiding this comment.
If GetLoaderArgs is only used in your new TestLoadTextWithDuplicateHeader test, can you please simply move the content of this function inside the test itself? Thanks.
|
Put this PR on hold since AutoML not ignoring duplicate headers and we want to keep behavior consistent. |
|
Close this for now until we come to this issue later. |
address #5051