Skip to content

KFold cross-validation#8

Merged
Mec-iS merged 22 commits intosmartcorelib:developmentfrom
Mec-iS:issue-5-mod-selector-kfold
Oct 13, 2020
Merged

KFold cross-validation#8
Mec-iS merged 22 commits intosmartcorelib:developmentfrom
Mec-iS:issue-5-mod-selector-kfold

Conversation

@Mec-iS
Copy link
Copy Markdown
Collaborator

@Mec-iS Mec-iS commented Oct 7, 2020

#5

...

  • implement test_indices, test_matrices, split to accept a matrix and return a Vec
  • improve allocation (?)
  • make all three to return an iterator (?)

Copy link
Copy Markdown
Collaborator

@VolodymyrOrlov VolodymyrOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Mec-iS, this is a great start! I have couple of suggestions for you that will help to improve this PR:

  • Run cargo fmt every time before committing your code, otherwise automatic build will fail if code is incorrectly formatted.
  • Test how project's documentation looks by running cargo doc --no-deps --features "nalgebra-bindings ndarray-bindings". This command will generate html with documentation in the target/doc/ folder.

I am looking forward to see implementation of the function that splits dataset into k folds!

Mec-iS and others added 2 commits October 9, 2020 17:20
Co-authored-by: VolodymyrOrlov <[email protected]>
Co-authored-by: VolodymyrOrlov <[email protected]>
@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 9, 2020

Thanks. As a next step I will try to implement the interface with only the matrix as input (not considering shuffle and groups). Obviously all the documentation will be cleaned up and fixed after the implementation is satisfactory.

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 9, 2020

@VolodymyrOrlov can you specify the meaning of this assignment for me please?

@VolodymyrOrlov
Copy link
Copy Markdown
Collaborator

I think the idea is to make sure that total number of samples from all folds equals total number of samples. Here is an example
image
without this line we have only 2 + 2 + 2 = 6 samples in all folds. If you uncomment this line and run the code again total number of samples equal number of samples in X, 3 + 2 + 2 = 7:
image

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 10, 2020

Thanks for the explanation.

Please check that test_indices works as expected

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 11, 2020

Ready for review 👍

@Mec-iS Mec-iS requested a review from VolodymyrOrlov October 11, 2020 00:22
@VolodymyrOrlov
Copy link
Copy Markdown
Collaborator

VolodymyrOrlov commented Oct 11, 2020

@Mec-iS the code looks good! Couple of things left before I can proceed to merging this PR:

  1. Looks like our implementation does not split data the same way as Scikit-learn's does.
  2. Arguably flag shuffle is the most important feature of K-fold split. Let's add it and set it to true by default.

Mec-iS and others added 2 commits October 12, 2020 10:24
Co-authored-by: VolodymyrOrlov <[email protected]>
Co-authored-by: VolodymyrOrlov <[email protected]>
@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 12, 2020

I had problem understanding the shape of the matrix, in my understanding:

  • M::shape().0 returns the dimensions (columns) the samples
  • M::shape().1 returns the samples (rows) the dimensions

Please provide guidance. I couldn't find this information clearly stated in the documentation.

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 12, 2020

ahah, sorry I got the dimensions confused. I just had to switch which index in shapes I was using and now the tests seem to pass.

Lesson learned: the developer may not know (or have forgotten, like in my case) the conventions used by numpy, so having the basics explained in the documentation for Matrix would be nice.

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 12, 2020

Question: in my last commit I added this line to explicitly define return_values as immutable before returning ownership, the same for indices in test_indices after the optional shuffle; is this worth it or the compiler can do this kind of optimisation automatically? I couldn't find a reference to this in Rust doc so I am not sure. If not, let me know I will revert the commit.

@VolodymyrOrlov
Copy link
Copy Markdown
Collaborator

I had problem understanding the shape of the matrix, in my understanding:

  • M::shape().0 returns the dimensions (columns) the samples
  • M::shape().1 returns the samples (rows) the dimensions

Please provide guidance. I couldn't find this information clearly stated in the documentation.

Good catch, thanks for pointing it out! I'll fix documentation of the shape method

@VolodymyrOrlov
Copy link
Copy Markdown
Collaborator

Question: in my last commit I added this line to explicitly define return_values as immutable before returning ownership, the same for indices in test_indices after the optional shuffle; is this worth it or the compiler can do this kind of optimisation automatically? I couldn't find a reference to this in Rust doc so I am not sure. If not, let me know I will revert the commit.

No, it is not necessary since you pass ownership of the return_values to the caller. A function that calls the method test_indices will be able to modify return_values and I see no reason why we should not let it to do so.

I am sure you've read this book but will leave this link here in case you haven't: https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html.

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 12, 2020

since you pass ownership of the return_values to the caller. A function that calls the method test_indices will be able to modify return_values and I see no reason why we should not let it to do so.

yes but I dont see the necessity of passing the ownership of a mutable vector. It could be an interface leak to allow mutability on something that is not meant to be modified. Once the indices and the splits are defined, there is no reason for the developer-user to be able to modify it. That is why is conceptually more correct to return an immutable maybe.

Mec-iS and others added 3 commits October 12, 2020 16:58
Co-authored-by: VolodymyrOrlov <[email protected]>
Co-authored-by: VolodymyrOrlov <[email protected]>
Co-authored-by: VolodymyrOrlov <[email protected]>
@VolodymyrOrlov
Copy link
Copy Markdown
Collaborator

since you pass ownership of the return_values to the caller. A function that calls the method test_indices will be able to modify return_values and I see no reason why we should not let it to do so.

yes but I dont see the necessity of passing the ownership of a mutable vector. It could be an interface leak to allow mutability on something that is not meant to be modified. Once the indices and the splits are defined, there is no reason for the developer-user to be able to modify it. That is why is conceptually more correct to return an immutable maybe.

I agree with you conceptually but I don't know an easy way to do it without wrapping your return value into another struct that will prevent a caller from modifying it. https://users.rust-lang.org/t/how-to-return-an-immutable-reference-of-a-mutable-vector/30773/5

I think wrapping return value into protective shell is an overkill

@Mec-iS
Copy link
Copy Markdown
Collaborator Author

Mec-iS commented Oct 12, 2020

... REDACTED ...

EDIT: ok, forget it. I didn't consider that the accessibility is defined by the caller, not the callee 🥇

@Mec-iS Mec-iS requested a review from VolodymyrOrlov October 12, 2020 16:32
Copy link
Copy Markdown
Collaborator

@VolodymyrOrlov VolodymyrOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Mec-iS Mec-iS merged commit a2588f6 into smartcorelib:development Oct 13, 2020
@Mec-iS Mec-iS deleted the issue-5-mod-selector-kfold branch October 13, 2020 09:10
morenol pushed a commit to morenol/smartcore that referenced this pull request Oct 31, 2020
* Add documentation and API
* Add public keyword
* Implement test_indices (debug version)
* Return indices as Vec of Vec
* Consume vector using drain()
* Use shape() to return num of samples
* Implement test_masks
* Implement KFold.split()
* Make trait public
* Add test for split
* Fix samples in shape()
* Implement shuffle
* Simplify return values
* Use usize for n_splits
Co-authored-by: VolodymyrOrlov <[email protected]>
morenol pushed a commit to morenol/smartcore that referenced this pull request Nov 5, 2020
* Add documentation and API
* Add public keyword
* Implement test_indices (debug version)
* Return indices as Vec of Vec
* Consume vector using drain()
* Use shape() to return num of samples
* Implement test_masks
* Implement KFold.split()
* Make trait public
* Add test for split
* Fix samples in shape()
* Implement shuffle
* Simplify return values
* Use usize for n_splits
Co-authored-by: VolodymyrOrlov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants