Skip to content

Conversation

@artidoro
Copy link
Contributor

Tracked under #2522.

This PR adds samples for FastTree ranking.

This PR also rewrites the samples for LightGbm ranking using a template file which will make it easier to maintain going forward.

@artidoro artidoro added the documentation Related to documentation of ML.NET label Apr 15, 2019
@artidoro artidoro self-assigned this Apr 15, 2019
@codecov
Copy link

codecov bot commented Apr 15, 2019

Codecov Report

Merging #3338 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3338      +/-   ##
==========================================
- Coverage   72.71%   72.69%   -0.02%     
==========================================
  Files         807      807              
  Lines      145172   145172              
  Branches    16225    16225              
==========================================
- Hits       105559   105537      -22     
- Misses      35195    35220      +25     
+ Partials     4418     4415       -3
Flag Coverage Δ
#Debug 72.69% <ø> (-0.02%) ⬇️
#production 68.23% <ø> (-0.02%) ⬇️
#test 88.97% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs 94.18% <ø> (ø) ⬆️
src/Microsoft.ML.Core/Data/ProgressReporter.cs 70.95% <0%> (-6.99%) ⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.26% <0%> (-0.63%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (+0.2%) ⬆️

GroupId = (uint)(i / groupSize),
// Create random features that are correlated with the label.
// For data points with larger labels, the feature values are slightly increased by adding a constant.
Features = Enumerable.Repeat(randomFloat() + label * 0.1f, 50).ToArray()
Copy link

@shmoradims shmoradims Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeat(randomFloat() + label * 0.1f, 50).ToArray() [](start = 42, length = 50)

is this correct? I think it repeats the same number 50 times which is not what you want.

in other samples we do it like this:
Enumerable.Repeat(label, 50).Select(x => x + randomFloat()).ToArray() #Resolved

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[FastTreeBinaryClassification](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/FastTree.cs)]
Copy link
Contributor

@rogancarr rogancarr Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FastTreeBinaryClassification [](start = 26, length = 28)

Please have the names match the destination file. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing out!


In reply to: 275908194 [](ancestors = 275908194)

var mlContext = new MLContext(seed: 0);

// Create a list of training data points.
var dataPoints = GenerateRandomDataPoints(1000);
Copy link
Contributor

@rogancarr rogancarr Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var dataPoints = GenerateRandomDataPoints(1000); [](start = 11, length = 49)

There is an asymmetry between the dataset loading for train and test.

I would prefer to have one call to GenerateRandomDataPoints, followed by a call to LoadFromEnumerable, and then followed by a TrainTestSplit that uses the GroupingColumn (or whatever we called it) option. This would be a good place to remind people that they need to be careful about the GroupId column. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I like your suggested approach better, I am going to write an issue about it, and fix it for all the samples (not just ranking).
This template is pretty standard across the different samples at the moment.


In reply to: 275909222 [](ancestors = 275909222)

var predictions = mlContext.Data.CreateEnumerable<Prediction>(transformedTestData, reuseRowObject: false).ToList();

// Look at 5 predictions
foreach (var p in predictions.Take(5))
Copy link
Contributor

@rogancarr rogancarr Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predictions.Take(5) [](start = 30, length = 19)

Don't use Linq. Instead, show how to do this using mlContext.Data.TakeRows prior to creating an enumerable. This puts more of sample coverage over our codebase. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok will do that.


In reply to: 275909687 [](ancestors = 275909687)

{
[KeyType(5)]
public uint Label { get; set; }
[KeyType(100)]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        [](start = 26, length = 12)

can you remove this extra spaces. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I can see them haha :) Thank you for pointing out.


In reply to: 275927787 [](ancestors = 275927787)

[KeyType(5)]
public uint Label { get; set; }
[KeyType(100)]
public uint GroupId { get; set; }
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        [](start = 45, length = 12)

and this one #Resolved

float randomFloat() => (float)random.NextDouble();
for (int i = 0; i < count; i++)
{
var label = random.Next(0, 5);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var label = random.Next(0, 5) [](start = 16, length = 29)

isn't label = 0 while it uint/key means this is a missing label? #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the underlying value can be 0. When reading as a KeyType there is mapping that will shift everything by +1. So that 0 maps to 1.
I looked at the IDataView that was generated and the label values go from 1 to 5.


In reply to: 275928728 [](ancestors = 275928728)

Microsoft.ML.SamplesUtils.ConsoleUtils.PrintMetrics(metrics);
FeatureFraction = 0.9
},
RowGroupColumnName = "GroupId"
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RowGroupColumnName = "GroupId" [](start = 16, length = 30)

I find it strange what we use this line only in one example out of 4.
If it's important, why not in all 4, if it's not, why we even show it? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the two samples without option, the RowGrouColumnName column defaults to 'GroupID'.
However, I just noticed that in the advanced options constructor, the RowGroupColumnName defaults to null. We should probably fix this.


In reply to: 275931195 [](ancestors = 275931195)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meantime I am going to include this line in the other sample with options.


In reply to: 275985563 [](ancestors = 275985563,275931195)

var transformedTestData = model.Transform(testData);

// Take the top 5 rows.
var topTransformedTestData = mlContext.Data.TakeRows(transformedTestData, 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what @wschin did in his PR, he just generate 5 rows of data. What's the point of having 500 test samples if you use only 5 of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, I don't like this approach anyways, I have filed an issue to start using TrainTestSplit instead of generating training and test data separately.

I will fix that for all samples at once.

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link

@shmoradims shmoradims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@artidoro artidoro merged commit cc40049 into dotnet:master Apr 17, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

documentation Related to documentation of ML.NET

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants