Conversation
|
Discussion so far: puzzlef#7 |
|
@wolfram77 Here's my learning so far:
As an end user, I have to chose between algorithms that are integrated into the language (typically cypher, lately PGQ) and handle graphs bigger than memory vs algorithms that are highly parallel and fast. My hope is that the two communities can collaborate more via data structures that are a de facto industry standard (pandas, arrow). This is a step in that direction. There is a lot of room for optimization (like it takes 6 seconds to copy the graph, 2 seconds to run clustering and another 2 seconds to copy it back to python). |
|
@adsharma Thanks again for thsi pull request. However, this is a huge amount of code! Are you using some agent-based CLI for this? I am reviewing the code currently, and will update you in a few days. |
|
Yes - I've used an IDE based agent to write some of the code. |
wolfram77
left a comment
There was a problem hiding this comment.
@adsharma Hope you are having a good day.
I have reviewed some of the files, and have some comments/suggestions. My understanding from your comment earlier is that this pull request is a step in the direction of bridging graph databases and graph algorithm packages. Here, we are only allowing GVE-Louvain and Leiden to be tested on python independently. Thus, I feel we should cut down to a minimal set of files - less is better any day.
Ideally, the next step would be to push this algorithm to existing packages, such as NetworKit. Graph databases should thus use something like NetworKit internally.
What do you say?
There was a problem hiding this comment.
@adsharma Let us rename this file to README.md, the old readme is not relevant for this project (python bindings).
| uv sync | ||
| uv pip install -e . | ||
| source .venv/bin/activate | ||
| python3 demo.py |
|
|
||
| ```python | ||
| import pandas as pd | ||
| from leiden_communities import louvain, leiden |
There was a problem hiding this comment.
Should we call it leiden_communities_openmp, just for consistency?
| # Run Louvain algorithm | ||
| result = louvain(edges, directed=False, resolution=1.0) | ||
| print(f"Found {len(result['communities'])} communities") | ||
| print(result['vertices']) # Vertex community assignments |
There was a problem hiding this comment.
If this is community membership, can we call it membership?
| result = louvain(edges, directed=False, resolution=1.0) | ||
| print(f"Found {len(result['communities'])} communities") | ||
| print(result['vertices']) # Vertex community assignments | ||
| print(result['communities']) # Community information |
There was a problem hiding this comment.
What additional information does this contain?
| - **OpenMP Parallelization**: Automatic multi-threading | ||
| - **Zero-Copy Operations**: Direct numpy array access | ||
| - **Apache Arrow Integration**: Efficient columnar data transfer | ||
| - **Memory Efficient**: Minimal data copying |
There was a problem hiding this comment.
Repating what was said earlier
| directed=False, | ||
| resolution=0.8, # Lower = larger communities | ||
| beta=0.01, # Leiden refinement parameter | ||
| seed=42, # Reproducible results |
There was a problem hiding this comment.
Can remove the above two options
| - Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019). | ||
| - Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008. | ||
|
|
||
| For more details about the underlying GVE-Leiden implementation, see the main README.md. |
There was a problem hiding this comment.
Let it refer to the original repo instead.
| Run the Louvain community detection algorithm. | ||
|
|
||
| **Parameters:** | ||
| - `graph`: Input graph as pandas DataFrame, Arrow Table, or dict-like object |
There was a problem hiding this comment.
@adsharma It seems like an overkill to support all the possible formats. Should we focus on pyarrow only? We can rely on pyarrow for conversion from pandas or other formats instead.
| @@ -0,0 +1,122 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
@adsharma Lets drop this and keep only short examples in the readme.
Thank you for the pointer to NetworKit. I was briefly confused about it being the same as networkX, another python library. Even though they're solving the same problem, networkX is a pure python implementation and NetworKit uses C++/openmp. Further, NetworKit already has a parallel Leiden with a reference to your paper: and a performance claim:
I think the next step should be to evaluate NetworKit implementation. If it's equally performant or better, I don't see any point in improving this PR. Thank you for the review. I'll get back to you in a day or two. |
|
Here's the data comparing this code vs NetworKit. It's faster, but worse modularity. |
|
@adsharma I will try sending a pull request to NetworKit, updating their Leiden, and possibly Louvain implementation. You may have to wait for 2-3 weeks as I am currently busy with joining process after having completed my PhD. Are you a PhD student working on graph databases? |
|
I graduated many decades ago :) |
|
Folks are aware of bugs in the implementation (including the author who wrote the code). We're probably better off replacing the implementation instead of trying to fix it. I also benchmarked against Parallel Louvain Method: |
|
I found the bugs in networkit. Sending a PR now. Here's the summary after the fixes. |
People using python and pandas may benefit from having these bindings and a parallel implementation.
While the benchmarks are running and the tests are passing, they're not seeing a speedup due to single threaded runs.
I've spent some time debugging why louvainStaticOmp runs fast (< 2 secs) when run via main.cxx, but takes 20 secs when run via benchmark.py. No resolution yet.
Observation: there is only one thread (Thread 0), even when omp_get_num_threads() returns 64.
usage: