Samples template for ranking catalog #3338

artidoro · 2019-04-15T18:04:18Z

Tracked under #2522.

This PR adds samples for FastTree ranking.

This PR also rewrites the samples for LightGbm ranking using a template file which will make it easier to maintain going forward.

codecov · 2019-04-15T18:43:48Z

Codecov Report

Merging #3338 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3338      +/-   ##
==========================================
- Coverage   72.71%   72.69%   -0.02%     
==========================================
  Files         807      807              
  Lines      145172   145172              
  Branches    16225    16225              
==========================================
- Hits       105559   105537      -22     
- Misses      35195    35220      +25     
+ Partials     4418     4415       -3

Flag	Coverage Δ
#Debug	`72.69% <ø> (-0.02%)`	⬇️
#production	`68.23% <ø> (-0.02%)`	⬇️
#test	`88.97% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs	`94.18% <ø> (ø)`	⬆️
src/Microsoft.ML.Core/Data/ProgressReporter.cs	`70.95% <0%> (-6.99%)`	⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs	`89.26% <0%> (-0.63%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`84.9% <0%> (+0.2%)`	⬆️

shmoradims · 2019-04-15T22:36:34Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/FastTree.cs

+                    GroupId = (uint)(i / groupSize),
+                    // Create random features that are correlated with the label.
+                    // For data points with larger labels, the feature values are slightly increased by adding a constant.
+                    Features = Enumerable.Repeat(randomFloat() + label * 0.1f, 50).ToArray()


Repeat(randomFloat() + label * 0.1f, 50).ToArray() [](start = 42, length = 50)

is this correct? I think it repeats the same number 50 times which is not what you want.

in other samples we do it like this:
Enumerable.Repeat(label, 50).Select(x => x + randomFloat()).ToArray() #Resolved

rogancarr · 2019-04-16T17:22:37Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

+        /// <example>
+        /// <format type="text/markdown">
+        /// <![CDATA[
+        /// [!code-csharp[FastTreeBinaryClassification](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/FastTree.cs)]


FastTreeBinaryClassification [](start = 26, length = 28)

Please have the names match the destination file. #Resolved

Thank you for pointing out!

In reply to: 275908194 [](ancestors = 275908194)

rogancarr · 2019-04-16T17:25:15Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+            var mlContext = new MLContext(seed: 0);
+
+            // Create a list of training data points.
+            var dataPoints = GenerateRandomDataPoints(1000);


var dataPoints = GenerateRandomDataPoints(1000); [](start = 11, length = 49)

There is an asymmetry between the dataset loading for train and test.

I would prefer to have one call to GenerateRandomDataPoints, followed by a call to LoadFromEnumerable, and then followed by a TrainTestSplit that uses the GroupingColumn (or whatever we called it) option. This would be a good place to remind people that they need to be careful about the GroupId column. #Resolved

You are right, I like your suggested approach better, I am going to write an issue about it, and fix it for all the samples (not just ranking).
This template is pretty standard across the different samples at the moment.

In reply to: 275909222 [](ancestors = 275909222)

rogancarr · 2019-04-16T17:26:27Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+            var predictions = mlContext.Data.CreateEnumerable<Prediction>(transformedTestData, reuseRowObject: false).ToList();
+
+            // Look at 5 predictions
+            foreach (var p in predictions.Take(5))


predictions.Take(5) [](start = 30, length = 19)

Don't use Linq. Instead, show how to do this using mlContext.Data.TakeRows prior to creating an enumerable. This puts more of sample coverage over our codebase. #Resolved

Ok will do that.

In reply to: 275909687 [](ancestors = 275909687)

Ivanidzo4ka · 2019-04-16T18:11:00Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+        {
+            [KeyType(5)]
+            public uint Label { get; set; }
+            [KeyType(100)]            


[](start = 26, length = 12)

can you remove this extra spaces. #Resolved

now I can see them haha :) Thank you for pointing out.

In reply to: 275927787 [](ancestors = 275927787)

Ivanidzo4ka · 2019-04-16T18:12:03Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+            [KeyType(5)]
+            public uint Label { get; set; }
+            [KeyType(100)]            
+            public uint GroupId { get; set; }            


[](start = 45, length = 12)

and this one #Resolved

Ivanidzo4ka · 2019-04-16T18:13:30Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+            float randomFloat() => (float)random.NextDouble();
+            for (int i = 0; i < count; i++)
+            {
+                var label = random.Next(0, 5);


var label = random.Next(0, 5) [](start = 16, length = 29)

isn't label = 0 while it uint/key means this is a missing label? #Pending

So the underlying value can be 0. When reading as a KeyType there is mapping that will shift everything by +1. So that 0 maps to 1.
I looked at the IDataView that was generated and the label values go from 1 to 5.

In reply to: 275928728 [](ancestors = 275928728)

Ivanidzo4ka · 2019-04-16T18:19:20Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/LightGbmWithOptions.cs

-            Microsoft.ML.SamplesUtils.ConsoleUtils.PrintMetrics(metrics);
+                    FeatureFraction = 0.9
+                },
+                RowGroupColumnName = "GroupId"


RowGroupColumnName = "GroupId" [](start = 16, length = 30)

I find it strange what we use this line only in one example out of 4.
If it's important, why not in all 4, if it's not, why we even show it? #Resolved

So in the two samples without option, the RowGrouColumnName column defaults to 'GroupID'.
However, I just noticed that in the advanced options constructor, the RowGroupColumnName defaults to null. We should probably fix this.

In reply to: 275931195 [](ancestors = 275931195)

In the meantime I am going to include this line in the other sample with options.

In reply to: 275985563 [](ancestors = 275985563,275931195)

Ivanidzo4ka · 2019-04-16T22:11:32Z

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/Ranking.ttinclude

+            var transformedTestData = model.Transform(testData);
+
+            // Take the top 5 rows.
+            var topTransformedTestData = mlContext.Data.TakeRows(transformedTestData, 5);


what @wschin did in his PR, he just generate 5 rows of data. What's the point of having 500 test samples if you use only 5 of them?

Yes you are right, I don't like this approach anyways, I have filed an issue to start using TrainTestSplit instead of generating training and test data separately.

I will fix that for all samples at once.

Ivanidzo4ka

shmoradims

artidoro added the documentation Related to documentation of ML.NET label Apr 15, 2019

artidoro requested review from Ivanidzo4ka, rogancarr, shmoradims and wschin April 15, 2019 18:04

artidoro self-assigned this Apr 15, 2019

shmoradims mentioned this pull request Apr 15, 2019

Docs and samples for the API reference site (P0 & P1 Trainers) #2522

Closed

shmoradims reviewed Apr 15, 2019

View reviewed changes

rogancarr reviewed Apr 16, 2019

View reviewed changes

Ivanidzo4ka reviewed Apr 16, 2019

View reviewed changes

artidoro mentioned this pull request Apr 16, 2019

Samples templates could be improved and use TrainTestSplit #3360

Closed

artidoro force-pushed the rankingsamples branch from cced6b4 to 3aa7cd2 Compare April 16, 2019 20:54

Ivanidzo4ka reviewed Apr 16, 2019

View reviewed changes

Ivanidzo4ka approved these changes Apr 16, 2019

View reviewed changes

artidoro requested review from Ivanidzo4ka, rogancarr and shmoradims April 16, 2019 22:48

shmoradims approved these changes Apr 17, 2019

View reviewed changes

artidoro force-pushed the rankingsamples branch from 3aa7cd2 to 1877e7d Compare April 17, 2019 18:22

artidoro added 3 commits April 17, 2019 12:45

rankingsamples

79b0086

review comment

85b4aa5

review comments

76aa0d4

artidoro force-pushed the rankingsamples branch from 1877e7d to 76aa0d4 Compare April 17, 2019 19:50

artidoro merged commit cc40049 into dotnet:master Apr 17, 2019

ghost locked as resolved and limited conversation to collaborators Mar 22, 2022

Samples template for ranking catalog #3338

Samples template for ranking catalog #3338

Uh oh!

Conversation

artidoro commented Apr 15, 2019

Uh oh!

codecov bot commented Apr 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

shmoradims Apr 15, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rogancarr Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rogancarr Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rogancarr Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ivanidzo4ka Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ivanidzo4ka Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ivanidzo4ka Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ivanidzo4ka Apr 16, 2019 • edited by artidoro Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ivanidzo4ka left a comment

Choose a reason for hiding this comment

Uh oh!

shmoradims left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Apr 15, 2019 •

edited

Loading

shmoradims Apr 15, 2019 •

edited by artidoro

Loading

rogancarr Apr 16, 2019 •

edited by artidoro

Loading

rogancarr Apr 16, 2019 •

edited by artidoro

Loading

rogancarr Apr 16, 2019 •

edited by artidoro

Loading

Ivanidzo4ka Apr 16, 2019 •

edited by artidoro

Loading

Ivanidzo4ka Apr 16, 2019 •

edited by artidoro

Loading

Ivanidzo4ka Apr 16, 2019 •

edited by artidoro

Loading

Ivanidzo4ka Apr 16, 2019 •

edited by artidoro

Loading