Skip to content

Conversation

@marlamb
Copy link
Contributor

@marlamb marlamb commented Sep 21, 2024

On version 4.4.0 an include of the kde_model.hpp header leads to compilation errors due to the missing symbol Log::Assert within multiple transitively included headers. This commit applies "include what you use" to all headers using Log::Assert in order to prevent dependencies on inclusion order within other headers.

Note that I did not simply change the include order in kde_model.hpp, such that mlpack/core.hpp is included first, as I would consider that solution as unstable, it could easily be changed again without noticing. Also would it solve the problem only in one inclusion path, whereas "include what you use" will solve it for each and every inclusion path.

@shrit
Copy link
Member

shrit commented Sep 23, 2024

While I fully understand why you are proposing this, but what I do not understand is the following:

  1. Why the error is happening ? did you include mlpack.hpp and you saw this?
  2. if not did you go directly and tried to core/ and started including files from there ?

Technically mlpack is fully header-only, but for now, we do not offer one single-file library and we are not thinking of doing so, not in the close of far future.

Also I am trying to understand what are the disadvantages of including something that you are not using, usually there is not any, it will not affect the performance or the binary size at all. However, from a practice point of view, yes it is useless and not necessary.

Could you please post part of the code that is causing this, including the compiler error just to have an idea what is happening ? I am surprised that our testing did not catch this

@marlamb
Copy link
Contributor Author

marlamb commented Sep 24, 2024

Hi @shrit

I got the compilation error within a translation unit which includes first

#include <mlpack/methods/kde/kde_model.hpp>

I have to admit I did not fully follow all inclusion paths, but trusting my compiler it and given the fact that within that header Log::Assert is used, but <mlpack/core/util/log.hpp> is not included, I concluded that the include is missing.
I will add a simple unit test to the code that should fail in case the patch is not applied. Would be great if you could try to reproduce (and hopefully confirm) my observation.


Edit: I looked up the compiler error, I got the same error from multiple files in multiple lines (all in all perhaps two dozens), e.g.

/<path_to_mlpack>/include/mlpack/core/tree/perform_split.hpp:78:3: error: 'Log' has not been declared

I probably don't get your second point fully. I did not include any other files before, as I don't want to rely on inclusion order (which is basically the major motivation of "include what you use").
From my point of view it is absolutely reasonable to not offer a single-header library. But I would expect that I can include every public header and all these headers are self-contained (i.e. include all transitive symbols they require themselves).

Including something that is not used will indeed not affect runtime (afaik), but it will affect compile time. In large codebases this is a concern and they therefore tend to strive to include only as much as required.

On version 4.4.0 an include of the `kde_model.hpp` header leads to
compilation errors due to the missing symbol `Log::Assert` within
multiple transitively included headers. This commit applies "include
what you use" to all headers using `Log::Assert` in order to prevent
dependencies on inclusion order within other headers.

Note that I did not simply change the include order in `kde_model.hpp`,
such that `mlpack/core.hpp` is included first, as I would consider that
solution as unstable, it could easily be changed again without noticing.
Also would it solve the problem only in one inclusion path, whereas
"include what you use" will solve it for each and every inclusion path.
@marlamb marlamb force-pushed the feature/include-what-you-use branch from dc896fe to 9d19795 Compare September 24, 2024 21:14
@marlamb
Copy link
Contributor Author

marlamb commented Sep 24, 2024

After having seen some other tests of mlpack I wonder why all files I had a look at always include <mlpack/core.hpp> first. Do they really use something out of the header? To me it looks like this pattern hides the issue from all the existing tests.

Might be a philosophical question how mlpack shall be consumed. I personally would strongly favour to have self-contained headers, such that users don't have to remember to include <mlpack/core.hpp> themselves all the time (or potentially also other headers?). But that is surely something you can judge a lot better than I can do for mlpack. It might also depend on which headers are considered "public". Should users even include single headers within methods?

@rcurtin
Copy link
Member

rcurtin commented Sep 24, 2024

The idea for users is that they just #include <mlpack.hpp> and that gets them everything in the library (documented in this section and also other places). Of course that can make the compiler do a lot of work and include many unnecessary headers, so we also suggest using method-specific includes to reduce compile time.

I think that most of the tests include mlpack/core.hpp for legacy reasons: that used to be the header that included everything except the machine learning methods themselves. So it would definitely be possible to go through and improve all of those---but of course it takes some time.

Each individual method's header should definitely be include-able on its own, and so the problem you encountered is definitely a bug we should fix (thank you for the PR by the way!). But beyond those "public" headers that we expect users to include (e.g. mlpack.hpp and method-specific includes), we don't have an internal policy or guidelines on specifically how to include things. If there are compile-time benefits that could be had by rearranging/reducing includes, I would definitely support doing that! But personally I think the gains would be relatively minor at this point. I did some ~halfway decent work on it a couple years back when we removed several dependencies, but haven't revisited it since then.

Glancing at the diff (I haven't had time to review in full yet), I can see a couple places where maybe the new includes aren't needed; one instance is ffn_impl.hpp, where log.hpp will already be included by ffn.hpp.

@marlamb
Copy link
Contributor Author

marlamb commented Sep 25, 2024

Hi @rcurtin ,
thanks for the detailed explanation.
My experience with "include what you use" is that it is usually worth trying to achieve it, but at the same time it is difficult to keep it clean. The most pragmatic solution I have seen so far is to try to slowly migrate to a clean state, i.e. improve on the fly, so new code or code touched again applies to the policy.
In fact there exist a few tricks to improve ones own capabilities to find missing includes. One is to use clang-format in order to automatically sort the includes that the "main" include (the header implemented or tested) always comes first. That way the tests would ensure consistency of the headers tested (but not that everything is included directly, one might still rely on transitive headers and get spurious errors in case the transitive header changes its includes - however the property of being self-contained would be tested and could be assured). Can of course also do that manually. Either way having one compilation unit (in mlpacks case a test file) for each public header would be desired.
Regarding your point of ffn_impl.hpp I totally agree, it is not required technically, as users are not meant to include the header directly. That might be something to be considered for the internal policy what should be included. I personally favor to have the includes in the same file, even if it might be duplicated, as it simplifies reviews and the second include has zero cost, as the include guards will make sure it will not be included twice. But that is very debatable -> up to your preferred include policy.

For me it would be fine to adapt this PR to your liking, you just should decide on a policy you want to follow. I could also add that policy to the documentation within this PR, if you want me to.
Looking forward to your feedback 😄 .


A small addition about how I would do it, knowing that I know almost nothing about mlpack compared to others:

  • Adapt the directory structure within test such that it mimics the directory structure of the library
    mlpack
    |_ bindings
    |  |_ ...
    |_ core
    |  |_ ...
    |_ methods
    |  |_ ...
    |_ tests
    |  |_ bindings
    |  |  |_ ...
    |  |_ core
    |  |  |_ ...
    |  |_ methods
    |  |  |_ ...
    |  |_ base_test.cpp
    |  |_ config_test.cpp
    |  |_ core_test.cpp
    |  |_ ...
    |_ base.hpp
    |_ config.hpp
    |_ core.hpp
    |_ ...
    
    This should simplify finding tests and checking if tests for headers exist. Larger tests that cannot be assigned to a single header could go into a dedicated directory or simply live aside the other tests. An even more aggressive way would be to place the test file aside the headers, which should be fine after installation (test files will not be present any more) and makes immediately visible if tests are missing. I would probably not do that from the start, but it could be considered in future.
  • Add a policy to always include the header under test first and add tests for every header file (except for *_impl.hpp, as they are not standalone headers).
  • Optional: Enforce inclusion order via clang-format. I know this is highly controversial, as it would mean to have some dependency on clang-format, which might have slightly different behavior on different versions, i.e. if checked in the CI it could lead to situations where the local setup of a contributor would conflict with the centrally used tool. However, I personally would be in favor of it (my current setup does lead to massive changes in files of mlpack, because no central .clang-format file is defined, so I currently do my edits on a simple editor compared to a full-fledged IDE). But that would also mean that the complete formatting needs to be forced one time into a consistent state (I cannot judge how far it is away from a state one could model with a clang-format config).

@shrit
Copy link
Member

shrit commented Sep 27, 2024

For your information, we are working on restructuring mlpack internals and header to make it look similar to other header-only libraries, so this is on our list, it will look more similar to the ensmallen project if you are interested.

This improvement might not directly respond to your question but will make the situation slightly better, I will try to find some time for it next week, but it is on our agenda.

@shrit
Copy link
Member

shrit commented Sep 27, 2024

I will put this PR on hold for now until we fix all the headers inside the library and then will see if this PR still make sense or not

@rcurtin
Copy link
Member

rcurtin commented Sep 27, 2024

@shrit I think you mean just moving the library's includes from src/ to include/, then I suppose leaving the tests and bindings .cpp files in-place?

@marlamb I don't have a problem with using clang-format, but I would prefer it to be as a CI job, not an auto-run post-commit hook or anything like that. Another important thing we have found over the years with style checkers and linters is that there has to be an easy way to whitelist things---there are always false positives. Ideally any system is very light-touch and so doesn't cause a huge amount of friction writing new code.

I agree with the rest of what you wrote, but the key here will be establishing a policy of some sort. I think it can be as simple as: any user-level include should include everything it needs. This would include all the method-level inclusion files like src/mlpack/methods/*.hpp, and the core directory inclusion files like src/mlpack/core/data/data.hpp and similar. Any _impl.hpp files don't need to be checked since users won't generally include them.

Some other little comments:

Adapt the directory structure within test such that it mimics the directory structure of the library

It's a nice idea but could be very complicated, as individual test files may test components not specific to a particular header file. Personally I don't see that kind of refactoring as worthwhile, but if you're up for it I suppose I don't have an issue with it.

Enforce inclusion order via clang-format. I know this is highly controversial, as it would mean to have some dependency on clang-format

If this was a road we went down, then controlling the version via specification in a CI job (like a Github Actions job or something we ran on Jenkins) would be the way to do it. However at least for now it would be ideal to do that with includes only and not the rest of the style guide.

Anyway, thanks again for pointing these issues out and the suggestions. I think for this PR, @shrit would it interfere with what you're planning to get this one in quickly for the user-level methods/kde.hpp header and then return to the other suggestions later?

@marlamb
Copy link
Contributor Author

marlamb commented Sep 28, 2024

Hi you two, thanks for all the nice replies!
If another restructuring is already scheduled, this activity can surely be postponed. It is not urgent.

Regarding the structure: yes indeed, it is sometimes difficult to then judge which tests should go where. I usually would try to have one test file per header file, which contain unit tests for functionality provided in that header file. In case the public header does only include transitive includes (convenience header for users), then the test file would only contain the include statement to ensure that the header is self-contained. Finally for all remaining tests, which cover functionality from multiple headers, an additional directory can be created (could call it "integration" or "component" tests or similar).

Regarding clang-format: in contrast to many linters (also e.g. clang-tidy) I am basically not aware of problems with clang-format at all. The only thing that might happen is that is formats stuff not optimal from a human point of view (e.g. a 2d-matrix, which a human would initialize accordingly, but clang-format does not understand the semantics and might break the natural structure), but for these cases you can switch it off locally. And applying it completely comes with the huge benefit that no style-guide needs to be provided any more, no manual adaptions to keep it consistent across the codebase, i.e. also no comments in reviews regarding style etc.

Regarding CI based formatting: I know multiple ways howto achieve this, e.g. creating an automated pull request pointing to the open pull request sorting the includes. Also directly pushing to the pull request is possible, but is a little more intrusive to the contributor. I assume simply failing the CI is probably not desired, as people without having clang-format in an appropriate version installed would not be able to apply the formatting? @rcurtin could you elaborate a little on what would be your desired behavior? Then I would offer to have a look howto implement it for header sorting only (formatting the rest could be discussed after a mechanism is in place and having some experience, if the selected mechanism feels good and worthwhile being used also for more).

My proposal how we could go forward:

  1. @shrit does the refactoring to become closer to the default layout of header-only libraries.
  2. I create a PR that sorts the includes. This should directly show multiple compilation errors for test files, which I can fix along and which should hopefully also solve the initial issue I tackled in this PR (as I said in the beginning: it is not urgent. I would say solving it before the next release should be fully sufficient. Or do you plan to apply it to already existing versions like 4.4.0 and 4.5.0 as well?).
  3. I create a PR which enforces the inclusion order via a mechanism we can agree on in this thread.

Does this sound reasonable to you? In case @shrit figures out that the activity creating an include directory starts later, we can also easily change the order. But we should be able to figure it out such, that no one has to face large merge conflicts 😃 .

@shrit
Copy link
Member

shrit commented Sep 29, 2024

We do not usually apply new features or code base modification to previous releases, the only reason to do so, is if there is a bug that is affecting previous releases. Otherwise, new features go to new releases.

You can but, I would wait until we have a better restructuring, it should more look like ensmallen (https://github.com/mlpack/ensmallen) or armadillo.

I will try to make it simpler, but you should know that it is very hard to include what you need, and our objective is simplicity comes first as long as it does not affect performance.

If you would like to help, we have so many things going on, we are more than happy to guide you, but this is not priority.

@rcurtin
Copy link
Member

rcurtin commented Sep 30, 2024

I assume simply failing the CI is probably not desired, as people without having clang-format in an appropriate version installed would not be able to apply the formatting?

Actually that would be the best way to do it, then contributors can just check what the failure was and go fix it. This is often how I fix CI issues; I tend to prefer manually fixing it as opposed to an automated fix (which could go wrong in its own ways, causing further confusion).

I think the rest of the plan you proposed is just fine, but yeah, like @shrit said, no need to consider anything other than the most current code.

@marlamb
Copy link
Contributor Author

marlamb commented Oct 9, 2024

Living on head sounds good to me 👍 .

Unfortunately it looks like llvm disabled sorting includes while keeping the rest unformatted (see https://github.com/llvm/llvm-project/blob/main/clang/lib/Format/Format.cpp#L3479 and llvm/llvm-project@61dc0f2). That means we would need to rely on an very old clang-format version (12), which apart from being the opposite of future-proof would be quite unattractive for contributors.
The alternative would be to apply a defined formatting onto the codebase. If this is fine for you I could prepare a config file which keeps the diff as small as possible. Just give me a "go", then I'll prepare it.

Edit:
To advertise auto-formatting a little more: this would also enable automated modernization of the codebase, e.g. replacing typename std::enable_if<...>::type with std::enable_if_t<...> without mixing it again with formatting or getting some random results, because the new line lengths would not be taken into account properly.

@marlamb
Copy link
Contributor Author

marlamb commented Oct 11, 2024

I had a little time for a first draft how clang-format could be used, see #3810 . But I have to admit, that I don't understand the currently running CI system and if the place I tried to hook in is appropriate.
However, the PR is currently not in a sane state, as it does not apply the formatting, yet. It is more meant to be a potential place for discussion if the .clang-format files suits your expectations (given the current style guidelines).

@rcurtin
Copy link
Member

rcurtin commented Oct 11, 2024

I don't really have time to review closely right now but auto-formatting code with clang-format is really not something we should do. The reason is that the rules for these formatting programs apply really badly to heavily templated code; some amount of "freedom" for people to format code in a way that is visually reasonable is really important. Especially with length limits of 80 characters it can be a huge problem and this is why we have avoided clang-format in the past. I don't mind a CI job that checks include orders and issues an error if there is a problem, or the same for style guides; but I don't want to set up a system that auto-formats all code to a very rigid style that has no room for subjectiveness. If you want to see what I mean take a look at classes like BinarySpaceTree or the SFINAE utility or other things with many template parameters.

@rcurtin
Copy link
Member

rcurtin commented Oct 11, 2024

(I will get back to this when I have a chance, I am just traveling right now so only have a few seconds for a quick response. Sorry about that)

@marlamb
Copy link
Contributor Author

marlamb commented Oct 11, 2024

No reason to apologize. To the contrary: I am quite impressed that you take the time to answer while traveling!
And I totally understand your argument. On the pro side of that kind of formatting is, that is simplifies actions like #3811 a lot, if there is no risk to kill the formatting and you don't get weird files checked in like here.

Perhaps I create a ticket on llvm side to enable pure sorting of includes again 😆 .

However, regarding the main discussion here: then I would not wait for a conceptional improvement of "include what you use", but only fix the now already apparent issue, which was the original intention of this PR. As far as I remember the main point in this more philosophical thread was this one here:

Glancing at the diff (I haven't had time to review in full yet), I can see a couple places where maybe the new includes aren't needed; one instance is ffn_impl.hpp, where log.hpp will already be included by ffn.hpp.

As it wouldn't matter to include it twice, but would IMO improve the maintenance (readers of one file don't have to look into another one to figure out, if the include is already done), I would prefer to keep the include. However, if you insist I will remove it.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the changes and that we should get this particular PR merged. Do you want to add a note to HISTORY.md? (I can do it before merge too.)

I'll respond to the other comments in the discussion next...

@rcurtin
Copy link
Member

rcurtin commented Oct 14, 2024

No reason to apologize. To the contrary: I am quite impressed that you take the time to answer while traveling!
And I totally understand your argument. On the pro side of that kind of formatting is, that is simplifies actions like #3811 a lot, if there is no risk to kill the formatting and you don't get weird files checked in like here.

Yeah---unfortunately the killing of the manual formatting is what has always kept us from using automated tools for this sort of thing. If it is possible to use a C++ style checker in a way where it leaves room for subjectivity, then this is okay; but clang-format's "my way or the highway" approach isn't suitable. As it stands, we use https://github.com/cpplint/cpplint to merely report some issues in the CI jobs. A better or more specific tool (or another one alongside it) could help enforce some of the more automatable points of the style guide.

As it wouldn't matter to include it twice, but would IMO improve the maintenance (readers of one file don't have to look into another one to figure out, if the include is already done), I would prefer to keep the include. However, if you insist I will remove it.

If possible I would actually prefer to remove these since they are unnecessary. The general format throughout has been that if something is needed in the definition file (e.g. ffn.hpp), then it will be already included in ffn_impl.hpp as part of the inclusion of ffn.hpp. If something is not needed in the definition file, then it can be included only in the _impl.hpp file. It is a minor point and I don't want to hold up the review on account of that alone, but I think it's cleaner to only include it once, whether that be in the definition or implementation file.

Note that we consider the wider include to `core` as being sufficient.
Although more files are touched in order to improve "include what you
use", the message focuses on the behavioral change that can be detected
by users for sure.
@marlamb
Copy link
Contributor Author

marlamb commented Oct 15, 2024

I agree with the changes and that we should get this particular PR merged. Do you want to add a note to HISTORY.md? (I can do it before merge too.)

I'll respond to the other comments in the discussion next...

I added a message, but keeping the scope very small to the change that can for sure be detected by users.

If possible I would actually prefer to remove these since they are unnecessary. The general format throughout has been that if something is needed in the definition file (e.g. ffn.hpp), then it will be already included in ffn_impl.hpp as part of the inclusion of ffn.hpp. If something is not needed in the definition file, then it can be included only in the _impl.hpp file. It is a minor point and I don't want to hold up the review on account of that alone, but I think it's cleaner to only include it once, whether that be in the definition or implementation file.

Should be done in all files where the pattern applied (there were a few more of them than ffn.hpp only).

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@rcurtin rcurtin merged commit 5b0167b into mlpack:master Oct 16, 2024
@rcurtin
Copy link
Member

rcurtin commented Oct 16, 2024

Thanks again @marlamb! Very much appreciated.

@rcurtin rcurtin mentioned this pull request Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants