Issue 5051 - ignore duplicate header when load data by frank-dong-ms-zz · Pull Request #5275 · dotnet/machinelearning

frank-dong-ms-zz · 2020-07-01T23:56:23Z

address #5051

ignore extra header in line reader of text loader
enhance logging
corresponding test
fix existing text loader test bugs, data file "breast-cancer.txt" don't have header but many tests use this data file with hasHeader settting to true which is wrong.

antoniovs1029 · 2020-07-02T00:43:59Z

+                                        // Skip duplicate header string
+                                        if (_headerString == text)
+                                        {
+                                            _ch.Warning($"Ignore duplicate header string {text} at line {line}.");


Although I guess this is the only way of actually solving the issue, I'm concerned about the performance implications of having to do a full string comparison for every line we load. If the dataset is big (with many columns), would this performance implication be significant?

After thinking about it, I guess the implementation of comparing strings will actually be optimized to return "false" as soon as it finds a character that doesn't match... and in general even if the dataset has many columns, this comparison would return false very soon for the vast majority of the cases. So I guess this isn't a problem...

In reply to: 448687279 [](ancestors = 448687279)

I agree with your assessment. But I would still like to see some benchmark test that this is true, especially on a large file.

In reply to: 448701405 [](ancestors = 448701405,448687279)

Are we comparing the first string read in the header file with the other strings as they are read? What happens if there is another header row in the text file, but has different spacing between the commas or has quotes?

In reply to: 448759220 [](ancestors = 448759220,448701405,448687279)

I'm not sure that also checking for variations of the header (spacing, quotes, capitalization, culture variation, etc ...) would be a good idea, as it would necessarily increase the time spent while reading every line .
Besides, I'm not sure how much coverage we should give to fix #5051 ? I'd think it isn't a very common problem anyway, and so simply checking for the exact same string as found on the header of the file would be enough.

In reply to: 448759887 [](ancestors = 448759887,448759220,448701405,448687279)

harishsk · 2020-07-02T05:35:46Z


-                private void ThreadProc()
+                private void ReadLines()
                {


suggestion: ThreadProc indicates that it is the main procedure of the of the thread. It might be better to retain that meaning and call it ReadLinesThreadProc

codecov · 2020-07-02T06:51:04Z

Codecov Report

Merging #5275 into master will decrease coverage by 0.05%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master    #5275      +/-   ##
==========================================
- Coverage   73.74%   73.68%   -0.06%     
==========================================
  Files        1022     1022              
  Lines      190200   190272      +72     
  Branches    20463    20465       +2     
==========================================
- Hits       140263   140208      -55     
- Misses      44418    44537     +119     
- Partials     5519     5527       +8

Flag	Coverage Δ
#Debug	`73.68% <97.36%> (-0.06%)`	⬇️
#production	`69.42% <93.93%> (-0.09%)`	⬇️
#test	`87.65% <100.00%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.41% <66.66%> (ø)`
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`89.62% <96.66%> (+0.16%)`	⬆️
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs	`98.43% <100.00%> (ø)`
test/Microsoft.ML.TestFrameworkCommon/Datasets.cs	`97.42% <100.00%> (+0.03%)`	⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs	`96.57% <100.00%> (ø)`
test/Microsoft.ML.Tests/TextLoaderTests.cs	`97.09% <100.00%> (+0.06%)`	⬆️
...oft.ML.Tests/TrainerEstimators/PriorRandomTests.cs	`100.00% <100.00%> (ø)`
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs	`83.60% <0.00%> (-7.27%)`	⬇️
src/Microsoft.ML.FastTree/Training/StepSearch.cs	`57.42% <0.00%> (-4.96%)`	⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs	`65.86% <0.00%> (-3.82%)`	⬇️
... and 14 more

antoniovs1029 · 2020-07-02T13:32:25Z

+        private TextLoader.Options GetLoaderArgs(string labelColumnName, string matrixColumnIndexColumnName, string matrixRowIndexColumnName)
+        {
+            return new TextLoader.Options()
+            {
+                Separator = "\t",
+                HasHeader = true,
+                Columns = new[]
+                {
+                    new TextLoader.Column(labelColumnName, DataKind.Single, new [] { new TextLoader.Range(0) }),
+                    new TextLoader.Column(matrixColumnIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(1) }, new KeyCount(20)),
+                    new TextLoader.Column(matrixRowIndexColumnName, DataKind.UInt32, new [] { new TextLoader.Range(2) }, new KeyCount(40)),
+                }
+            };
+        }


If GetLoaderArgs is only used in your new TestLoadTextWithDuplicateHeader test, can you please simply move the content of this function inside the test itself? Thanks.

frank-dong-ms-zz · 2020-07-02T17:53:28Z

Put this PR on hold since AutoML not ignoring duplicate headers and we want to keep behavior consistent.
Also we might want to do the compare and ignore logic at row paring since we can checking for variations of header(spacing, quotes, capialization etc...).

…to issue-5051

frank-dong-ms-zz · 2020-07-06T18:12:41Z

Close this for now until we come to this issue later.

frank-dong-ms-zz requested a review from a team as a code owner July 1, 2020 23:56

frank-dong-ms-zz requested a review from harishsk July 1, 2020 23:56

antoniovs1029 reviewed Jul 2, 2020

View reviewed changes

Comment thread test/Microsoft.ML.Tests/TrainerEstimators/MatrixFactorizationTests.cs Outdated

antoniovs1029 reviewed Jul 2, 2020

View reviewed changes

harishsk reviewed Jul 2, 2020

View reviewed changes

antoniovs1029 reviewed Jul 2, 2020

View reviewed changes

frank-dong-ms added 4 commits July 3, 2020 11:49

ignore duplicate header

e94a5b7

add logging and test

dd4d9d3

Merge branch 'master' of https://github.com/dotnet/machinelearning in…

070b62c

…to issue-5051

take comments and fix related tests

f95c88a

frank-dong-ms-zz closed this Jul 6, 2020

frank-dong-ms-zz mentioned this pull request Jul 8, 2020

Test fix using breastcancel dataset and test cleanup #5292

Merged

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue 5051 - ignore duplicate header when load data#5275

Issue 5051 - ignore duplicate header when load data#5275
frank-dong-ms-zz wants to merge 4 commits intodotnet:masterfrom
frank-dong-ms-zz:issue-5051

frank-dong-ms-zz commented Jul 1, 2020 •

edited

Loading

Uh oh!

Uh oh!

antoniovs1029 Jul 2, 2020 •

edited by harishsk

Loading

Uh oh!

antoniovs1029 Jul 2, 2020

Uh oh!

harishsk Jul 2, 2020

Uh oh!

harishsk Jul 2, 2020

Uh oh!

antoniovs1029 Jul 2, 2020

Uh oh!

harishsk Jul 2, 2020

Uh oh!

codecov Bot commented Jul 2, 2020 •

edited

Loading

Uh oh!

antoniovs1029 Jul 2, 2020 •

edited

Loading

Uh oh!

frank-dong-ms-zz commented Jul 2, 2020

Uh oh!

frank-dong-ms-zz commented Jul 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

frank-dong-ms-zz commented Jul 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

antoniovs1029 Jul 2, 2020 • edited by harishsk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jul 2, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk Jul 2, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk Jul 2, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jul 2, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk Jul 2, 2020

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jul 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

antoniovs1029 Jul 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

frank-dong-ms-zz commented Jul 2, 2020

Uh oh!

frank-dong-ms-zz commented Jul 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

frank-dong-ms-zz commented Jul 1, 2020 •

edited

Loading

antoniovs1029 Jul 2, 2020 •

edited by harishsk

Loading

codecov Bot commented Jul 2, 2020 •

edited

Loading

antoniovs1029 Jul 2, 2020 •

edited

Loading