Add documentation for `Split()` and `StratifiedSplit()` #3803

rcurtin · 2024-09-26T02:47:25Z

I wanted to record a video today about doing a simple data science workflow in C++ with mlpack today, and I was going to cross-reference it with the documentation. It was going well until I realized that data::Split() isn't documented! So I sat down to document some of the things in the core/data/ directory:

I restructured the load_save.md documentation a little bit---I moved the documentation for normalizing labels out into its own page, under the Preprocessing section of the documentation. link
Along with that, I added a page doc/user/core/dataset_splitting.md that documents the dataset splitting techniques mlpack has (right now there are only two, data::Split() and data::StratifiedSplit()). link
I noticed that there are version of data::Split() that return std::tuples. But they are only ever used in one place (in the preprocess_split binding), and I don't think there are any other mlpack functions that work like this: in general, mlpack functions take in their "output" arguments as references and will overwrite matrices or similar. So, for the sake of consistency, I deprecated that version and did not document it. Open to other opinions on this point---my goal was just to make writing mlpack code "feel" the same.

shrit

I wanted to write some code but I am surprised that reviewing this PR took me a couple of hours, as I have a lot of ideas now how we can restructure things. This is a long post, so you might need to go over it a couple of times to see my idea, but what I am proposing basically is to nuke the start page and refactor it so it looks better, because as the docs are evolving the main page is starting to make less and less sense to me.

Adding in the new pages in the Preprocessing is good but really misleading, it took me sometime to find them and when started playing around it did not make much sense.

The reason I wrote all of that is, I gave it a lot of thoughts. The best action to do in my opinion is to re-organize the sidebar to follow some kind of traditional pipeline, for instance, you do not need to start with core/ or trees/ as the user does not care how the internal code is organized and it is always subject to change, but instead, I would organize the sidebar to follow data science pipeline. Start with Data and Preprocessing put under these two everything related to loading and saving the datasets, or matrix manipulation, images, etc.. and then I would start with preprocessing. Followed by Transformation as it is highly used before classification or regression.

The main idea of the above proposal is not to read the docs, but instead just follow the steps discover new things and you are done, so I would also refactor the front page and remove the boilerplate of quickstart and then algorithm documentation, and replace it with the following:

Put a small figure to explain the dataset pipeline in general, library installation, data collection, processing, transformation, etc.. I think we already have something like that, just a simple jpg, nothing fancy, on the first page.
Then start with Installation, and put all articles related to installation under that section.
I would create another paragraph related to Data (give it a nice name) and put inside everything related to data manipulation, preprocessing, images, anything that we have, even if it was under core, I did not check all of them but it could be.
Also here, you can remove the words "core and math", and replace them with "basic data science functionalities -- Feature extraction" or something similar, as someone might want to just do some distance computation or kernel computations, on some part of the data, or maybe extract some features
Then I would put another paragraph about Data transformation & Feature extraction, and list all of these methods that we have (PCA, LMNN, etc.) there.
Eventually, then we need to list everything related to classification and regression, in the future, I might find a different approach for both of these, but it is still fine,
For now, I would keep modeling at the end since this is something that we would be interested to do especially the cross-validation when we have finished the training of the model.
After that, from the first page, I would move the Binding to other languages to the end of the pages.
I would also remove the mlpack on embedded system and would change that paragraph to embedded deployment of the library.
Maybe it would be interesting to add another type of deployment ? similar to what you did in the video ? I think it would make sense.

For me, the key for good documentation is to tell what the thing is doing / why I need it / and why it is good at something instead of telling what it is or how we call it. We are basically providing higher-level language to explain more low-level concept

shrit · 2024-09-29T13:41:57Z

src/mlpack/core/data/split_data.hpp

 void Split(const FieldType& input,
           const arma::field<T>& inputLabel,
           FieldType& trainData,
-           arma::field<T>& trainLabel,
           FieldType& testData,
+           arma::field<T>& trainLabel,
           arma::field<T>& testLabel,
           const double testRatio,
           const bool shuffleData = true)


Would not this change mean that we need to release a new major version? or there is an overload? we are changing the order of the parameters here

Ahh, you're right, sorry about that. I re-added this overload as deprecated in c237bd3 and we can remove it in mlpack 5.0.0. The issue is that the ordering here did not match any of the other overloads of data::Split().

shrit · 2024-09-29T15:54:23Z

I did not look at the code yet, I was just in the docs, and I think we can merge the pervious two PR, and then do the updates in this one, once this is done, we can then add new PR's

shrit

I reviewed the code modifications and it looks good to me, let me know if you would like to handle the above comments in here or in a different one. I think it is reasonable to do it in this one, but it is up to you, If not, I can approve this one, and then will wait on another PR

shrit · 2024-10-02T15:38:31Z

src/mlpack/tests/split_data_test.cpp

-  const auto value = Split(input, 0.2);
-  REQUIRE(std::get<0>(value).n_cols == 8); // Train data.
-  REQUIRE(std::get<1>(value).n_cols == 2); // Test data.
+  mat trainData, testData;


100 percent agree, better to use this

rcurtin · 2024-10-07T17:42:25Z

I definitely agree that the current main page has outgrown its usefulness and it's time to redo it. I would like to do it in a different PR though, although it's worth discussing here:

Adding in the new pages in the Preprocessing is good but really misleading, it took me sometime to find them and when started playing around it did not make much sense.

At least for now I can't think of a better place to put it. These utilities shouldn't be hidden down under "core" somewhere because they are very important for any data science pipeline. But I think your next point covers it:

The best action to do in my opinion is to re-organize the sidebar to follow some kind of traditional pipeline, for instance, you do not need to start with core/ or trees/ as the user does not care how the internal code is organized and it is always subject to change, but instead, I would organize the sidebar to follow data science pipeline.

I agree with the general organization. The only thing is, we do need to document various miscellaneous utility classes (like the RNG support to take just one example); that does need to go somewhere, but it doesn't fit into the "pipeline" idea. So there will need to be some place to store these kinds of things; I will have to think about that.

What's nice also about this organization is that the last step of any data science pipeline is "deployment", so it gives a natural place to store all of the information about embedded systems and so forth.

Still I need to think about it further, but I will come back with a PR that overhauls the main page and sidebar soon. (I may document some other things in the mean time before I do that.)

I think all of the other comments specific to the Split() and StratifiedSplit() documentation are handled---let me know if you see anything else and I'll get it taken care of, and then we can merge this one once you're good with it. 👍

shrit

Looks good to me, did a quick review again and all is good, let us merge this first as you mentioned and then open another one for the main page

github-actions

Second approval provided automatically after 24 hours. 👍

rcurtin added 3 commits September 25, 2024 15:56

Add first pass at documentation for data::Split().

d696137

Fixes for documentation.

69fc141

Fix links in documentation.

defe9df

rcurtin added c: documentation t: added feature labels Sep 26, 2024

rcurtin added 7 commits September 25, 2024 22:47

Deprecate the versions that return tuples.

cb95340

Move to a shorter filename.

3c08a06

Clarify usage with arma::field types.

409ed7c

Fix to use new API in tests.

f9f93e0

Oops, add missing overload of Split().

ca72547

Clarify documentation slightly.

1d5dd6b

Fix link for dataset splitting page.

a64caf1

shrit reviewed Sep 29, 2024

View reviewed changes

rcurtin added 2 commits September 30, 2024 16:11

Re-add deprecated overload.

c237bd3

Merge remote-tracking branch 'origin/master' into doc-split

ebbee41

shrit reviewed Oct 2, 2024

View reviewed changes

shrit approved these changes Oct 8, 2024

View reviewed changes

github-actions bot approved these changes Oct 9, 2024

View reviewed changes

rcurtin merged commit 6753194 into mlpack:master Oct 10, 2024

rcurtin deleted the doc-split branch October 10, 2024 13:51

rcurtin mentioned this pull request Nov 23, 2024

Overhaul documentation homepage #3836

Merged

rcurtin mentioned this pull request Dec 3, 2024

Release version 4.5.1 #3844

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add documentation for `Split()` and `StratifiedSplit()` #3803

Add documentation for `Split()` and `StratifiedSplit()` #3803

Uh oh!

rcurtin commented Sep 26, 2024 •

edited

Loading

Uh oh!

shrit left a comment

Uh oh!

shrit Sep 29, 2024

Uh oh!

rcurtin Sep 30, 2024

Uh oh!

shrit commented Sep 29, 2024

Uh oh!

shrit left a comment

Uh oh!

shrit Oct 2, 2024

Uh oh!

rcurtin commented Oct 7, 2024

Uh oh!

shrit left a comment

Uh oh!

github-actions bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Add documentation for Split() and StratifiedSplit() #3803

Add documentation for Split() and StratifiedSplit() #3803

Uh oh!

Conversation

rcurtin commented Sep 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shrit left a comment

Choose a reason for hiding this comment

Uh oh!

shrit Sep 29, 2024

Choose a reason for hiding this comment

Uh oh!

rcurtin Sep 30, 2024

Choose a reason for hiding this comment

Uh oh!

shrit commented Sep 29, 2024

Uh oh!

shrit left a comment

Choose a reason for hiding this comment

Uh oh!

shrit Oct 2, 2024

Choose a reason for hiding this comment

Uh oh!

rcurtin commented Oct 7, 2024

Uh oh!

shrit left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add documentation for `Split()` and `StratifiedSplit()` #3803

Add documentation for `Split()` and `StratifiedSplit()` #3803

rcurtin commented Sep 26, 2024 •

edited

Loading