Add maximum_common_ordered_subtree_embedding to algorithms.minors #4327
base: main
Conversation
Force-pushed from dd2dd00 to 2faec89
Force-pushed from d7bd227 to 13591d2
Force-pushed from 13591d2 to 9c4f3a5
Force-pushed from 9c4f3a5 to 05a8fcb
Rebased this on master and updated based on comments from #4294. This still depends on #4349, so that should get reviewed first. I'll probably also remove the example in the applications folder and put that into a separate PR to make this that much easier to review. There is also the issue of the cython backend and the …
Force-pushed from 9fafa63 to 51e245a
@rossbar @dschult Now that #4349 is merged, I've rebased on main, so the meat-and-potatoes contribution is finally ready for review!

I anticipate that the alternative Cython implementation, and how it should or should not hook into the main API, will be the most hotly contested issue. I really do want to push for its inclusion, as I think a reasonably fast backend is necessary for this algorithm to have any real-world usage; the pure Python version is just too slow for moderately sized graphs. I would love it if the "autojit" logic I included were kept, but I'm open to forcing users to compile the C extension themselves if they want to use it. I think the current state is the best tradeoff: it attempts to compile and use the C extension if possible, and falls back to the Python implementation if it can't.

There is also the issue of my use of "ubelt" in the autojit logic (which I use, among other things, to make sure the autojit is only attempted once). For now I haven't factored it out because I want to hear opinions on it. If we want to keep some form of the autojit functionality (I would vote for this), I can factor out ubelt. But if we decide to remove it, then ubelt would be removed with it.
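The "try the compiled backend, fall back to pure Python" behavior described above follows a common try/except import pattern. The sketch below is illustrative only: the module names are placeholders, not the PR's actual modules, and `difflib` merely stands in for a pure-Python fallback implementation.

```python
def load_backend():
    """Prefer a compiled extension; fall back to pure Python if unavailable."""
    try:
        # Placeholder name for an optional compiled (Cython) backend.
        import _hypothetical_cython_backend as backend
    except ImportError:
        # Pure-Python fallback; difflib stands in for the slower implementation.
        import difflib as backend
    return backend


backend = load_backend()
print(backend.__name__)  # prints "difflib" when no compiled backend exists
```

The autojit variant described in the comment would additionally attempt to build the extension before the import, caching the result so compilation is only attempted once per environment.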
Force-pushed from 51e245a to a066a58
@rossbar @dschult I took some time to remove the Cython backend to make this easier to review and less controversial to merge. So now there are only two implementation backends: "iter" and "recurse". I noted that in the future a "cython-iter" backend may be added, but that's the extent to which I mentioned it. The patch to add it back is fairly small, so it probably makes sense to discuss that in a separate thread, and this can be a more streamlined review of the pure Python implementations of the algorithm.

EDIT: I also fixed the doc errors that were causing the CI to fail, as well as some misc inconsistencies I noticed when reviewing the PR.
@rossbar pinging to see if there is still interest in merging this algorithm. Note that I recently won 3rd place in the PyTorch 2021 hackathon for an application of this algorithm: https://devpost.com/software/torchliberator-partial-weight-loading
Honestly I'm not sure - this PR is huge so it's tough to get any sense of what's in it. Some things that jumped out to me:
These are the impressions that I got from just skimming the diff & PR discussion - there's so much here that I really struggle to get a handle on the proposed changes. If you were amenable to it, I think one thing that might really help w/ review would be to pare this down as much as possible to the minimal proposed change, i.e. what is the absolute minimum amount of changes necessary to add the …
I've been thinking a lot about this, and I think this algorithm might be better served as an extension to networkx rather than as a core part of it, mainly because it relies on the C implementation to be efficient. To this end I published a standalone version of the algorithms on PyPI with both the pure Python and Cython implementations: https://github.com/Erotemic/networkx_algo_common_subtree

I think the algorithm is about as specialized as some of the other algorithms that come bundled with networkx. The key thing it does is, given two trees, find some common substructure; in this case that is a graph minor common to both trees. The new library I made also contains the variant of this problem where the common substructures are subtree isomorphisms. The algorithm is definitely not only relevant to PyTorch weights; that's just my motivating use case. The reason the algorithm isn't designed to work directly on the networkx data structures is that the algorithm I implemented is based on a reduction from an ordered tree to a "balance sequence", hence all the futzing with strings and lists.

Another reason I'm inclined to close this is to prevent bloat in networkx itself. While Lozano's 2003 algorithm solves a fairly fundamental graph problem, there are so many fundamental graph algorithms that it might not make sense for networkx to contain them all. However, whenever I come across a graph problem I always scan the networkx docs to make sure an algorithm for it isn't already implemented, because networkx does have a wide array of core algorithms. I'm wondering if there should be a mechanism for third-party implementations to be listed in the core networkx docs (perhaps in an extensions/contrib section) that could make it easier for someone searching for a specific graph algorithm to find an existing implementation of it?

I guess my thought is that it would be nice if there were a centralized location that documents and references a distributed set of implementations that compute solutions to the zoo of graph problems that exist.
I'm of two minds here. On the one hand it would be great if users had a resource that collated the full atlas of available graph algorithms across libraries! I think this would be extremely challenging to maintain, however. I'm hesitant to link to external projects for many reasons, but most notably: a) we don't have the bandwidth to keep track of the maintenance status of external projects, and b) API differences between projects might make references to them less useful for users. The lightest-weight way I can think to do this would be to add links to external Python projects to the …

Now, if someone were to make a knowledge graph (ha) of graph-theoretic/network analysis algorithms and tag the nodes with examples of concrete implementations across libraries, I think that would be an awesome project. We've got an open issue about doing this within networkx (see #5824), but if someone wanted to take that idea and expand it to include cross-library links, that'd be great.
New version of #4169 where embedding is reorganized under the minors package. Please see original PR for discussion.
This PR ~~depends on #4326, #4294, and #4349~~ no longer has any dependencies! This PR is depended on by #4350.
EDIT: After starting the isomorphism PR I realized this PR could use some slight tweaks.
Overview:
Summary
This PR is for a new algorithm which I don't believe exists in networkx: `maximum_common_ordered_subtree_embedding`.
This algorithm is restricted to ordered trees because it is APX-hard for any other relaxation of the input types.
Motivation
When working with pytorch-based neural networks I found that I often encounter a problem where I have some custom module that uses resnet50 as a component, and I would like to simply start from existing resnet50 pretrained weights. However, to do that the "state_dict" of the model must match the "state_dict" of the saved pytorch weights file. But because my resnet50 model is a component, the keys don't exactly line up.
For instance, the keys in the resnet file may look like this:
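The original key listing was not preserved in this page capture. A representative subset, based on torchvision's resnet50 `state_dict` layout (illustrative, not the PR's exact listing):

```python
# A few keys from a torchvision resnet50 checkpoint (representative subset).
resnet_keys = [
    "conv1.weight",
    "bn1.weight",
    "bn1.bias",
    "layer1.0.conv1.weight",
    "layer1.0.bn1.weight",
    "fc.weight",
    "fc.bias",
]
```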
And the "model-state" for the module may look like this:
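The original listing for the custom module was also lost in the capture. A plausible sketch, assuming the resnet50 sits under a hypothetical `backbone` attribute (names are illustrative):

```python
# The same parameters, but nested under the custom module's attribute names,
# so they no longer match the checkpoint's keys one-to-one.
model_keys = [
    "backbone.conv1.weight",
    "backbone.bn1.weight",
    "backbone.bn1.bias",
    "backbone.layer1.0.conv1.weight",
    "backbone.layer1.0.bn1.weight",
    "head.fc.weight",
    "head.fc.bias",
]
```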
Now, yes, I could do (and have done) a hacky solution that tries removing common prefixes, but I wanted something more general, something that would "just work" in almost all cases. After thinking about it for a while, I realized that this was a graph problem: I can break up the components by the "." and make a directory-like tree structure.
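The splitting idea can be sketched with a plain nested dict (a minimal illustration of the reduction, not the PR's implementation, which builds networkx graphs):

```python
def paths_to_tree(paths):
    """Split dotted keys into a nested-dict tree, preserving insertion order."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.split("."):
            # Descend, creating a child node the first time we see this part.
            node = node.setdefault(part, {})
    return tree


keys = ["conv1.weight", "layer1.0.conv1.weight", "fc.weight"]
print(paths_to_tree(keys))
# {'conv1': {'weight': {}}, 'layer1': {'0': {'conv1': {'weight': {}}}}, 'fc': {'weight': {}}}
```

Since Python dicts preserve insertion order, the resulting tree is an ordered tree, which is exactly the input class the algorithm requires.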
Now, I want to ask: what is the biggest subgraph that these two directory structures have in common? After searching for a while I found the Maximum Common Induced Subgraph problem, which is NP-hard for general graphs. But we have trees, so can we do better? I found Maximum Common Subtree Isomorphism and Maximum Common Subtree, but I wasn't able to follow any of these links to get an algorithm working (although perhaps the first one is worth revisiting).
Eventually I found the paper On the Maximum Common Embedded Subtree Problem for Ordered Trees, which outlines a polynomial-time algorithm for finding maximum common embedded subtrees. The paper was written in a way that I could follow, but unfortunately I missed the detail that an embedding, although similar, is not an isomorphic subgraph.
However, the algorithm for finding a Common Embedded Subtree does still work well for solving my problem. The paper also links to external resources where they do tackle the actual common subtree isomorphism problem, but unfortunately they were all behind paywalls. This PR only addresses the Embedded Subtree problem.
However, I do believe I was able to modify the recurrence for the maximum common embedding into one that does produce a maximum common isomorphism, and I've empirically verified that it works on a few thousand randomly generated trees. The PR for that function is in #4350
Current Summary of Modifications
I'll maintain a top-level summary of modifications in this comment.
- `networkx/algorithms/__init__.py` - expose string and embedding modules
- `networkx/algorithms/minors/__init__.py` - new algorithm submodule for embedding problems
- `networkx/algorithms/minors/tree_embedding.py` - ⭐ implements the reduction from the graph problem to the string problem. Defines `maximum_common_ordered_subtree_embedding`, the main API-level contribution of this PR.
- `networkx/algorithms/minors/tests/test_tree_embedding.py` - associated tests for tree embeddings
- `networkx/algorithms/string/__init__.py` - new algorithm submodule for string problems
- `networkx/algorithms/string/balanced_sequence.py` - helpers for decomposing balanced sequences
- `networkx/algorithms/string/balanced_embedding.py` - ⭐ core dynamic program to solve the maximum common balanced subsequence problem. This is the main algorithmic component of this PR.
- `networkx/algorithms/string/balanced_embedding_cython.pyx` - optional, but faster, Cython version of `balanced_embedding.py`
- `networkx/algorithms/string/tests/test_balanced_sequence.py` - associated tests for balanced sequences
- `examples/applications/path_correspondence.py` - demonstrates how to solve the path embedding problem using tree embedding. This file likely needs further reorganization, or possibly a separate PR; the rest of the PR stands alone without it.
- `setup.py` - registered string subpackages
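The tree-to-balanced-sequence reduction underlying the string submodule can be illustrated with a minimal encoder from an ordered tree to a parenthesis string (a sketch of the idea only; the PR's `balanced_sequence.py` helpers use a richer, invertible encoding):

```python
def tree_to_balanced_sequence(tree, root):
    """Encode an ordered tree (dict of ordered child lists) as a parenthesis string."""
    parts = []

    def visit(node):
        parts.append("(")  # open-bracket on entering the node
        for child in tree.get(node, []):
            visit(child)
        parts.append(")")  # close-bracket on leaving it

    visit(root)
    return "".join(parts)


# a has ordered children [b, c]; b has child [d]
tree = {"a": ["b", "c"], "b": ["d"]}
print(tree_to_balanced_sequence(tree, "a"))  # prints "((())())"
```

Once both trees are encoded this way, finding a maximum common embedded subtree reduces to finding a maximum common *balanced* subsequence of the two strings, which is what the dynamic program in `balanced_embedding.py` computes.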