Add documentation for Split() and StratifiedSplit() (#3803)
Conversation
shrit left a comment:
I wanted to write some code, but I am surprised that reviewing this PR took me a couple of hours, as I now have a lot of ideas about how we can restructure things. This is a long post, so you might need to go over it a couple of times to see my idea, but what I am basically proposing is to nuke the start page and refactor it so it looks better: as the docs evolve, the main page is making less and less sense to me.
- Adding the new pages under Preprocessing is good but really misleading; it took me some time to find them, and when I started playing around it did not make much sense.
The reason I wrote all of that is that I gave it a lot of thought. The best course of action, in my opinion, is to reorganize the sidebar to follow some kind of traditional pipeline. For instance, you do not need to start with core/ or trees/: the user does not care how the internal code is organized, and it is always subject to change. Instead, I would organize the sidebar to follow the data science pipeline. Start with Data and Preprocessing, and put under these two everything related to loading and saving datasets, matrix manipulation, images, etc. Then I would start with preprocessing, followed by Transformation, as it is heavily used before classification or regression.
- The main idea of the above proposal is that the user should not have to read the docs, but instead just follow the steps, discover new things, and be done. So I would also refactor the front page: remove the boilerplate of the quickstart and the algorithm documentation, and replace it with the following:
- Put a small figure on the first page to explain the data pipeline in general: library installation, data collection, processing, transformation, etc. I think we already have something like that; just a simple JPG, nothing fancy.
- Then start with Installation, and put all articles related to installation under that section.
- I would create another section related to Data (give it a nice name) and put inside it everything related to data manipulation, preprocessing, images, anything that we have, even if it is under core. I did not check all of them, but it could be.
- Also here, I would remove the words "core and math" and replace them with "basic data science functionality -- feature extraction" or something similar, as someone might want to just do some distance or kernel computations on part of the data, or maybe extract some features.
- Then I would add another section about Data transformation & Feature extraction, and list all of the methods that we have (PCA, LMNN, etc.) there.
- Eventually, we need to list everything related to classification and regression. In the future I might find a different approach for both of these, but it is still fine for now.
- For now, I would keep modeling at the end, since cross-validation in particular is something we would be interested in once we have finished training the model.
- After that, I would move Bindings to other languages from the first page to the end.
- I would also remove "mlpack on embedded systems" and change that paragraph to "embedded deployment of the library".
- Maybe it would be interesting to add another type of deployment, similar to what you did in the video? I think it would make sense.
For me, the key to good documentation is to tell what a thing is doing, why I need it, and why it is good at something, instead of telling what it is or how we call it. We are basically providing a higher-level language to explain lower-level concepts.
```c++
void Split(const FieldType& input,
           const arma::field<T>& inputLabel,
           FieldType& trainData,
           arma::field<T>& trainLabel,
           FieldType& testData,
           arma::field<T>& testLabel,
           const double testRatio,
           const bool shuffleData = true)
```
Wouldn't this change mean that we need to release a new major version? Or is there an overload? We are changing the order of the parameters here.
Ahh, you're right, sorry about that. I re-added this overload as deprecated in c237bd3 and we can remove it in mlpack 5.0.0. The issue is that the ordering here did not match any of the other overloads of data::Split().
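The ordering question above can be illustrated with a small standalone sketch. This is not mlpack's implementation: it uses plain `std::vector`s in place of Armadillo matrices and fields, just to show the `(trainData, trainLabel, testData, testLabel)` output ordering that the other `data::Split()` overloads follow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not mlpack's actual code): a splitter following the
// (trainData, trainLabel, testData, testLabel) output-parameter ordering.
// For simplicity this takes the first (1 - testRatio) fraction for training,
// with no shuffling.
void Split(const std::vector<double>& input,
           const std::vector<int>& inputLabel,
           std::vector<double>& trainData,
           std::vector<int>& trainLabel,
           std::vector<double>& testData,
           std::vector<int>& testLabel,
           const double testRatio)
{
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::size_t trainSize = input.size() - testSize;

  trainData.assign(input.begin(), input.begin() + trainSize);
  trainLabel.assign(inputLabel.begin(), inputLabel.begin() + trainSize);
  testData.assign(input.begin() + trainSize, input.end());
  testLabel.assign(inputLabel.begin() + trainSize, inputLabel.end());
}
```

Keeping every overload's outputs in this same order means callers never have to check which overload they are using to know where the test labels land.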
I have not looked at the code yet, only the docs. I think we can merge the previous two PRs and then do the updates in this one; once that is done, we can add new PRs.
shrit left a comment:
I reviewed the code modifications and they look good to me. Let me know if you would like to handle the above comments here or in a different PR. I think it is reasonable to do it in this one, but it is up to you; if not, I can approve this one and then wait on another PR.
```c++
const auto value = Split(input, 0.2);
REQUIRE(std::get<0>(value).n_cols == 8); // Train data.
REQUIRE(std::get<1>(value).n_cols == 2); // Test data.

mat trainData, testData;
```
100 percent agree, better to use this
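The two styles under discussion can be contrasted side by side. This is a hedged sketch in plain standard C++ (the function names `SplitTuple`/`SplitRef` are made up for illustration, and `std::vector<double>` stands in for an Armadillo matrix); it is not mlpack's actual API, only a demonstration of why the output-reference convention reads more uniformly.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Tuple-returning style (analogous to the overload being deprecated):
// the split results come back packed in a std::tuple.
std::tuple<std::vector<double>, std::vector<double>>
SplitTuple(const std::vector<double>& input, const double testRatio)
{
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::size_t trainSize = input.size() - testSize;
  return { std::vector<double>(input.begin(), input.begin() + trainSize),
           std::vector<double>(input.begin() + trainSize, input.end()) };
}

// Output-reference style, matching the convention used elsewhere in mlpack:
// the caller passes in output objects, which are overwritten.
void SplitRef(const std::vector<double>& input,
              std::vector<double>& trainData,
              std::vector<double>& testData,
              const double testRatio)
{
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::size_t trainSize = input.size() - testSize;
  trainData.assign(input.begin(), input.begin() + trainSize);
  testData.assign(input.begin() + trainSize, input.end());
}
```

With the reference style, `trainData` and `testData` have real names at the call site instead of `std::get<0>` and `std::get<1>`, which is the readability point made in the test snippet above.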
I definitely agree that the current main page has outgrown its usefulness and it's time to redo it. I would like to do it in a different PR though, although it's worth discussing here:
At least for now I can't think of a better place to put it. These utilities shouldn't be hidden down under "core" somewhere because they are very important for any data science pipeline. But I think your next point covers it:
I agree with the general organization. The only thing is, we do need to document various miscellaneous utility classes (like the RNG support, to take just one example); that does need to go somewhere, but it doesn't fit into the "pipeline" idea. So there will need to be some place to store these kinds of things; I will have to think about that.

What's also nice about this organization is that the last step of any data science pipeline is "deployment", so it gives a natural place to store all of the information about embedded systems and so forth. Still, I need to think about it further, but I will come back with a PR that overhauls the main page and sidebar soon. (I may document some other things in the meantime before I do that.) I think all of the other comments specific to the
shrit left a comment:
Looks good to me. I did a quick review again and all is good; let us merge this first, as you mentioned, and then open another PR for the main page.
Second approval provided automatically after 24 hours. 👍


I wanted to record a video today about doing a simple data science workflow in C++ with mlpack, and I was going to cross-reference it with the documentation. It was going well until I realized that `data::Split()` isn't documented! So I sat down to document some of the things in the `core/data/` directory:

- I restructured the `load_save.md` documentation a little bit: I moved the documentation for normalizing labels out into its own page, under the Preprocessing section of the documentation. link
- Along with that, I added a page `doc/user/core/dataset_splitting.md` that documents the dataset splitting techniques mlpack has (right now there are only two, `data::Split()` and `data::StratifiedSplit()`). link
- I noticed that there are versions of `data::Split()` that return `std::tuple`s. But they are only ever used in one place (in the `preprocess_split` binding), and I don't think there are any other mlpack functions that work like this: in general, mlpack functions take their "output" arguments as references and overwrite matrices or similar. So, for the sake of consistency, I deprecated that version and did not document it. Open to other opinions on this point; my goal was just to make writing mlpack code "feel" the same.
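Since stratified splitting comes up here, a conceptual sketch may help readers who haven't seen it: the idea is to split each label's points separately so the train/test sets preserve the full dataset's label proportions. This is a hypothetical illustration in plain standard C++ (the name `StratifiedSplitSketch` is made up), not mlpack's `data::StratifiedSplit()` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Conceptual stratified split (illustration only, not mlpack's code):
// partition each class's points separately, so each label appears in the
// train and test sets in roughly the same proportion as in the full data.
// For simplicity, the first (1 - testRatio) fraction of each class goes to
// training, with no shuffling.
void StratifiedSplitSketch(const std::vector<double>& data,
                           const std::vector<int>& labels,
                           std::vector<double>& trainData,
                           std::vector<int>& trainLabels,
                           std::vector<double>& testData,
                           std::vector<int>& testLabels,
                           const double testRatio)
{
  // Group point indices by label.
  std::map<int, std::vector<std::size_t>> byLabel;
  for (std::size_t i = 0; i < labels.size(); ++i)
    byLabel[labels[i]].push_back(i);

  trainData.clear(); trainLabels.clear();
  testData.clear(); testLabels.clear();

  // Split each class independently with the same ratio.
  for (const auto& [label, idx] : byLabel)
  {
    const std::size_t testSize =
        static_cast<std::size_t>(idx.size() * testRatio);
    const std::size_t trainSize = idx.size() - testSize;
    for (std::size_t j = 0; j < idx.size(); ++j)
    {
      if (j < trainSize)
      {
        trainData.push_back(data[idx[j]]);
        trainLabels.push_back(labels[idx[j]]);
      }
      else
      {
        testData.push_back(data[idx[j]]);
        testLabels.push_back(labels[idx[j]]);
      }
    }
  }
}
```

A plain split can leave a rare class entirely out of the test set; the per-class split above guarantees each class contributes to both sides (as long as it has enough points).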