Skip to content

Issue 5051 - ignore duplicate header when load data#5275

Closed
frank-dong-ms-zz wants to merge 4 commits intodotnet:masterfrom
frank-dong-ms-zz:issue-5051
Closed

Issue 5051 - ignore duplicate header when load data#5275
frank-dong-ms-zz wants to merge 4 commits intodotnet:masterfrom
frank-dong-ms-zz:issue-5051

Conversation

@frank-dong-ms-zz
Copy link
Copy Markdown
Contributor

@frank-dong-ms-zz frank-dong-ms-zz commented Jul 1, 2020

address #5051

  1. ignore extra header in line reader of text loader
  2. enhance logging
  3. corresponding test
  4. fix existing text loader test bugs, data file "breast-cancer.txt" don't have header but many tests use this data file with hasHeader settting to true which is wrong.

@frank-dong-ms-zz frank-dong-ms-zz requested a review from a team as a code owner July 1, 2020 23:56
@frank-dong-ms-zz frank-dong-ms-zz requested a review from harishsk July 1, 2020 23:56
Comment thread test/Microsoft.ML.Tests/TrainerEstimators/MatrixFactorizationTests.cs Outdated
Comment on lines +778 to +781
// Skip duplicate header string
if (_headerString == text)
{
_ch.Warning($"Ignore duplicate header string {text} at line {line}.");
Copy link
Copy Markdown
Contributor

@antoniovs1029 antoniovs1029 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I guess this is the only way of actually solving the issue, I'm concerned about the performance implications of having to do a full string comparison for every line we load. If the dataset is big (with many columns), would this performance implication be significant?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about it, I guess the implementation of comparing strings will actually be optimized to return "false" as soon as it finds a character that doesn't match... and in general even if the dataset has many columns, this comparison would return false very soon for the vast majority of the cases. So I guess this isn't a problem...


In reply to: 448687279 [](ancestors = 448687279)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your assessment. But I would still like to see some benchmark test that this is true, especially on a large file.


In reply to: 448701405 [](ancestors = 448701405,448687279)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we comparing the first string read in the header file with the other strings as they are read? What happens if there is another header row in the text file, but has different spacing between the commas or has quotes?


In reply to: 448759220 [](ancestors = 448759220,448701405,448687279)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that also checking for variations of the header (spacing, quotes, capitalization, culture variation, etc ...) would be a good idea, as it would necessarily increase the time spent while reading every line .
Besides, I'm not sure how much coverage we should give to fix #5051 ? I'd think it isn't a very common problem anyway, and so simply checking for the exact same string as found on the header of the file would be enough.


In reply to: 448759887 [](ancestors = 448759887,448759220,448701405,448687279)


private void ThreadProc()
private void ReadLines()
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: ThreadProc indicates that it is the main procedure of the of the thread. It might be better to retain that meaning and call it ReadLinesThreadProc

@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 2, 2020

Codecov Report

Merging #5275 into master will decrease coverage by 0.05%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master    #5275      +/-   ##
==========================================
- Coverage   73.74%   73.68%   -0.06%     
==========================================
  Files        1022     1022              
  Lines      190200   190272      +72     
  Branches    20463    20465       +2     
==========================================
- Hits       140263   140208      -55     
- Misses      44418    44537     +119     
- Partials     5519     5527       +8     
Flag Coverage Δ
#Debug 73.68% <97.36%> (-0.06%) ⬇️
#production 69.42% <93.93%> (-0.09%) ⬇️
#test 87.65% <100.00%> (+0.01%) ⬆️
Impacted Files Coverage Δ
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.41% <66.66%> (ø)
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.62% <96.66%> (+0.16%) ⬆️
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.43% <100.00%> (ø)
test/Microsoft.ML.TestFrameworkCommon/Datasets.cs 97.42% <100.00%> (+0.03%) ⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs 96.57% <100.00%> (ø)
test/Microsoft.ML.Tests/TextLoaderTests.cs 97.09% <100.00%> (+0.06%) ⬆️
...oft.ML.Tests/TrainerEstimators/PriorRandomTests.cs 100.00% <100.00%> (ø)
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 83.60% <0.00%> (-7.27%) ⬇️
src/Microsoft.ML.FastTree/Training/StepSearch.cs 57.42% <0.00%> (-4.96%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 65.86% <0.00%> (-3.82%) ⬇️
... and 14 more

Comment on lines +155 to +168
private TextLoader.Options GetLoaderArgs(string labelColumnName, string matrixColumnIndexColumnName, string matrixRowIndexColumnName)
{
return new TextLoader.Options()
{
Separator = "\t",
HasHeader = true,
Columns = new[]
{
new TextLoader.Column(labelColumnName, DataKind.Single, new [] { new TextLoader.Range(0) }),
new TextLoader.Column(matrixColumnIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(1) }, new KeyCount(20)),
new TextLoader.Column(matrixRowIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(2) }, new KeyCount(40)),
}
};
}
Copy link
Copy Markdown
Contributor

@antoniovs1029 antoniovs1029 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If GetLoaderArgs is only used in your new TestLoadTextWithDuplicateHeader test, can you please simply move the content of this function inside the test itself? Thanks.

@frank-dong-ms-zz
Copy link
Copy Markdown
Contributor Author

Put this PR on hold since AutoML not ignoring duplicate headers and we want to keep behavior consistent.
Also we might want to do the compare and ignore logic at row paring since we can checking for variations of header(spacing, quotes, capialization etc...).

@frank-dong-ms-zz
Copy link
Copy Markdown
Contributor Author

Close this for now until we come to this issue later.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants