-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Samples template for ranking catalog #3338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3338 +/- ##
==========================================
- Coverage 72.71% 72.69% -0.02%
==========================================
Files 807 807
Lines 145172 145172
Branches 16225 16225
==========================================
- Hits 105559 105537 -22
- Misses 35195 35220 +25
+ Partials 4418 4415 -3
|
| GroupId = (uint)(i / groupSize), | ||
| // Create random features that are correlated with the label. | ||
| // For data points with larger labels, the feature values are slightly increased by adding a constant. | ||
| Features = Enumerable.Repeat(randomFloat() + label * 0.1f, 50).ToArray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeat(randomFloat() + label * 0.1f, 50).ToArray() [](start = 42, length = 50)
is this correct? I think it repeats the same number 50 times which is not what you want.
in other samples we do it like this:
Enumerable.Repeat(label, 50).Select(x => x + randomFloat()).ToArray() #Resolved
| /// <example> | ||
| /// <format type="text/markdown"> | ||
| /// <] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FastTreeBinaryClassification [](start = 26, length = 28)
Please have the names match the destination file. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| var mlContext = new MLContext(seed: 0); | ||
|
|
||
| // Create a list of training data points. | ||
| var dataPoints = GenerateRandomDataPoints(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var dataPoints = GenerateRandomDataPoints(1000); [](start = 11, length = 49)
There is an asymmetry between the dataset loading for train and test.
I would prefer to have one call to GenerateRandomDataPoints, followed by a call to LoadFromEnumerable, and then followed by a TrainTestSplit that uses the GroupingColumn (or whatever we called it) option. This would be a good place to remind people that they need to be careful about the GroupId column. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I like your suggested approach better, I am going to write an issue about it, and fix it for all the samples (not just ranking).
This template is pretty standard across the different samples at the moment.
In reply to: 275909222 [](ancestors = 275909222)
| var predictions = mlContext.Data.CreateEnumerable<Prediction>(transformedTestData, reuseRowObject: false).ToList(); | ||
|
|
||
| // Look at 5 predictions | ||
| foreach (var p in predictions.Take(5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predictions.Take(5) [](start = 30, length = 19)
Don't use Linq. Instead, show how to do this using mlContext.Data.TakeRows prior to creating an enumerable. This puts more of sample coverage over our codebase. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | ||
| [KeyType(5)] | ||
| public uint Label { get; set; } | ||
| [KeyType(100)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 26, length = 12)
can you remove this extra spaces. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now I can see them haha :) Thank you for pointing out.
In reply to: 275927787 [](ancestors = 275927787)
| [KeyType(5)] | ||
| public uint Label { get; set; } | ||
| [KeyType(100)] | ||
| public uint GroupId { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 45, length = 12)
and this one #Resolved
| float randomFloat() => (float)random.NextDouble(); | ||
| for (int i = 0; i < count; i++) | ||
| { | ||
| var label = random.Next(0, 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var label = random.Next(0, 5) [](start = 16, length = 29)
isn't label = 0 while it uint/key means this is a missing label? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the underlying value can be 0. When reading as a KeyType there is mapping that will shift everything by +1. So that 0 maps to 1.
I looked at the IDataView that was generated and the label values go from 1 to 5.
In reply to: 275928728 [](ancestors = 275928728)
| Microsoft.ML.SamplesUtils.ConsoleUtils.PrintMetrics(metrics); | ||
| FeatureFraction = 0.9 | ||
| }, | ||
| RowGroupColumnName = "GroupId" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowGroupColumnName = "GroupId" [](start = 16, length = 30)
I find it strange what we use this line only in one example out of 4.
If it's important, why not in all 4, if it's not, why we even show it? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the two samples without option, the RowGrouColumnName column defaults to 'GroupID'.
However, I just noticed that in the advanced options constructor, the RowGroupColumnName defaults to null. We should probably fix this.
In reply to: 275931195 [](ancestors = 275931195)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meantime I am going to include this line in the other sample with options.
In reply to: 275985563 [](ancestors = 275985563,275931195)
cced6b4 to
3aa7cd2
Compare
| var transformedTestData = model.Transform(testData); | ||
|
|
||
| // Take the top 5 rows. | ||
| var topTransformedTestData = mlContext.Data.TakeRows(transformedTestData, 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what @wschin did in his PR, he just generate 5 rows of data. What's the point of having 500 test samples if you use only 5 of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right, I don't like this approach anyways, I have filed an issue to start using TrainTestSplit instead of generating training and test data separately.
I will fix that for all samples at once.
Ivanidzo4ka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
shmoradims
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
3aa7cd2 to
1877e7d
Compare
1877e7d to
76aa0d4
Compare
Tracked under #2522.
This PR adds samples for
FastTreeranking.This PR also rewrites the samples for
LightGbmranking using a template file which will make it easier to maintain going forward.