Skip to content

Conversation

@rcurtin
Copy link
Member

@rcurtin rcurtin commented Sep 26, 2024

I wanted to record a video today about doing a simple data science workflow in C++ with mlpack today, and I was going to cross-reference it with the documentation. It was going well until I realized that data::Split() isn't documented! So I sat down to document some of the things in the core/data/ directory:

  • I restructured the load_save.md documentation a little bit---I moved the documentation for normalizing labels out into its own page, under the Preprocessing section of the documentation. link

  • Along with that, I added a page doc/user/core/dataset_splitting.md that documents the dataset splitting techniques mlpack has (right now there are only two, data::Split() and data::StratifiedSplit()). link

  • I noticed that there are version of data::Split() that return std::tuples. But they are only ever used in one place (in the preprocess_split binding), and I don't think there are any other mlpack functions that work like this: in general, mlpack functions take in their "output" arguments as references and will overwrite matrices or similar. So, for the sake of consistency, I deprecated that version and did not document it. Open to other opinions on this point---my goal was just to make writing mlpack code "feel" the same.

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to write some code but I am surprised that reviewing this PR took me a couple of hours, as I have a lot of ideas now how we can restructure things. This is a long post, so you might need to go over it a couple of times to see my idea, but what I am proposing basically is to nuke the start page and refactor it so it looks better, because as the docs are evolving the main page is starting to make less and less sense to me.

  1. Adding in the new pages in the Preprocessing is good but really misleading, it took me sometime to find them and when started playing around it did not make much sense.

Screenshot 2024-09-29 at 15-42-53 mlpack documentation

Screenshot 2024-09-29 at 17-48-36 mlpack documentation

The reason I wrote all of that is, I gave it a lot of thoughts. The best action to do in my opinion is to re-organize the sidebar to follow some kind of traditional pipeline, for instance, you do not need to start with core/ or trees/ as the user does not care how the internal code is organized and it is always subject to change, but instead, I would organize the sidebar to follow data science pipeline. Start with Data and Preprocessing put under these two everything related to loading and saving the datasets, or matrix manipulation, images, etc.. and then I would start with preprocessing. Followed by Transformation as it is highly used before classification or regression.

  1. The main idea of the above proposal is not to read the docs, but instead just follow the steps discover new things and you are done, so I would also refactor the front page and remove the boilerplate of quickstart and then algorithm documentation, and replace it with the following:
  • Put a small figure to explain the dataset pipeline in general, library installation, data collection, processing, transformation, etc.. I think we already have something like that, just a simple jpg, nothing fancy, on the first page.
  • Then start with Installation, and put all articles related to installation under that section.
  • I would create another paragraph related to Data (give it a nice name) and put inside everything related to data manipulation, preprocessing, images, anything that we have, even if it was under core, I did not check all of them but it could be.
  • Also here, you can remove the words "core and math", and replace them with "basic data science functionalities -- Feature extraction" or something similar, as someone might want to just do some distance computation or kernel computations, on some part of the data, or maybe extract some features
  • Then I would put another paragraph about Data transformation & Feature extraction, and list all of these methods that we have (PCA, LMNN, etc.) there.
  • Eventually, then we need to list everything related to classification and regression, in the future, I might find a different approach for both of these, but it is still fine,
  • For now, I would keep modeling at the end since this is something that we would be interested to do especially the cross-validation when we have finished the training of the model.
  • After that, from the first page, I would move the Binding to other languages to the end of the pages.
  • I would also remove the mlpack on embedded system and would change that paragraph to embedded deployment of the library.
  • Maybe it would be interesting to add another type of deployment ? similar to what you did in the video ? I think it would make sense.

For me, the key for good documentation is to tell what the thing is doing / why I need it / and why it is good at something instead of telling what it is or how we call it. We are basically providing higher-level language to explain more low-level concept

Comment on lines 496 to 503
void Split(const FieldType& input,
const arma::field<T>& inputLabel,
FieldType& trainData,
arma::field<T>& trainLabel,
FieldType& testData,
arma::field<T>& trainLabel,
arma::field<T>& testLabel,
const double testRatio,
const bool shuffleData = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not this change mean that we need to release a new major version? or there is an overload? we are changing the order of the parameters here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, you're right, sorry about that. I re-added this overload as deprecated in c237bd3 and we can remove it in mlpack 5.0.0. The issue is that the ordering here did not match any of the other overloads of data::Split().

@shrit
Copy link
Member

shrit commented Sep 29, 2024

I did not look at the code yet, I was just in the docs, and I think we can merge the pervious two PR, and then do the updates in this one, once this is done, we can then add new PR's

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code modifications and it looks good to me, let me know if you would like to handle the above comments in here or in a different one. I think it is reasonable to do it in this one, but it is up to you, If not, I can approve this one, and then will wait on another PR

const auto value = Split(input, 0.2);
REQUIRE(std::get<0>(value).n_cols == 8); // Train data.
REQUIRE(std::get<1>(value).n_cols == 2); // Test data.
mat trainData, testData;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 percent agree, better to use this

@rcurtin
Copy link
Member Author

rcurtin commented Oct 7, 2024

I definitely agree that the current main page has outgrown its usefulness and it's time to redo it. I would like to do it in a different PR though, although it's worth discussing here:

Adding in the new pages in the Preprocessing is good but really misleading, it took me sometime to find them and when started playing around it did not make much sense.

At least for now I can't think of a better place to put it. These utilities shouldn't be hidden down under "core" somewhere because they are very important for any data science pipeline. But I think your next point covers it:

The best action to do in my opinion is to re-organize the sidebar to follow some kind of traditional pipeline, for instance, you do not need to start with core/ or trees/ as the user does not care how the internal code is organized and it is always subject to change, but instead, I would organize the sidebar to follow data science pipeline.

I agree with the general organization. The only thing is, we do need to document various miscellaneous utility classes (like the RNG support to take just one example); that does need to go somewhere, but it doesn't fit into the "pipeline" idea. So there will need to be some place to store these kinds of things; I will have to think about that.

What's nice also about this organization is that the last step of any data science pipeline is "deployment", so it gives a natural place to store all of the information about embedded systems and so forth.

Still I need to think about it further, but I will come back with a PR that overhauls the main page and sidebar soon. (I may document some other things in the mean time before I do that.)

I think all of the other comments specific to the Split() and StratifiedSplit() documentation are handled---let me know if you see anything else and I'll get it taken care of, and then we can merge this one once you're good with it. 👍

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, did a quick review again and all is good, let us merge this first as you mentioned and then open another one for the main page

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@rcurtin rcurtin merged commit 6753194 into mlpack:master Oct 10, 2024
@rcurtin rcurtin deleted the doc-split branch October 10, 2024 13:51
@rcurtin rcurtin mentioned this pull request Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants