Skip to content

Python bindings#1

Closed
adsharma wants to merge 5 commits intonodef:mainfrom
adsharma:python2
Closed

Python bindings#1
adsharma wants to merge 5 commits intonodef:mainfrom
adsharma:python2

Conversation

@adsharma
Copy link
Copy Markdown

People using python and pandas may benefit from having these bindings and a parallel implementation.

While the benchmarks are running and the tests are passing, they're not seeing a speedup due to single threaded runs.

I've spent some time debugging why louvainStaticOmp runs fast (< 2 secs) when run via main.cxx, but takes 20 secs when run via benchmark.py. No resolution yet.

Observation: there is only one thread (Thread 0), even when omp_get_num_threads() returns 64.

usage:

uv sync
uv run examples/*.py
uv run pytest
uv run tests/benchmark.py --csv graph.csv

@adsharma
Copy link
Copy Markdown
Author

Discussion so far: puzzlef#7

@adsharma
Copy link
Copy Markdown
Author

@wolfram77 Here's my learning so far:

  • There is a communication gap between graph databases and graph algorithm communities
  • This repo and ParClusterers and cugraph are examples of graph algorithms
  • KuzuDB, memgraph and Neo4j are examples of Graph Databases. These databases also implement some graph algorithms.

As an end user, I have to chose between algorithms that are integrated into the language (typically cypher, lately PGQ) and handle graphs bigger than memory vs algorithms that are highly parallel and fast.

My hope is that the two communities can collaborate more via data structures that are a de facto industry standard (pandas, arrow). This is a step in that direction. There is a lot of room for optimization (like it takes 6 seconds to copy the graph, 2 seconds to run clustering and another 2 seconds to copy it back to python).

@wolfram77
Copy link
Copy Markdown
Member

@adsharma Thanks again for thsi pull request. However, this is a huge amount of code! Are you using some agent-based CLI for this?

I am reviewing the code currently, and will update you in a few days.

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 2, 2025

Yes - I've used an IDE based agent to write some of the code.

Copy link
Copy Markdown
Member

@wolfram77 wolfram77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsharma Hope you are having a good day.

I have reviewed some of the files, and have some comments/suggestions. My understanding from your comment earlier is that this pull request is a step in the direction of bridging graph databases and graph algorithm packages. Here, we are only allowing GVE-Louvain and Leiden to be tested on python independently. Thus, I feel we should cut down to a minimal set of files - less is better any day.

Ideally, the next step would be to push this algorithm to existing packages, such as NetworKit. Graph databases should thus use something like NetworKit internally.

What do you say?

Comment thread PYTHON_README.md
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsharma Let us rename this file to README.md, the old readme is not relevant for this project (python bindings).

Comment thread PYTHON_README.md
uv sync
uv pip install -e .
source .venv/bin/activate
python3 demo.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this examples/demo.py?

Comment thread PYTHON_README.md

```python
import pandas as pd
from leiden_communities import louvain, leiden
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it leiden_communities_openmp, just for consistency?

Comment thread PYTHON_README.md
# Run Louvain algorithm
result = louvain(edges, directed=False, resolution=1.0)
print(f"Found {len(result['communities'])} communities")
print(result['vertices']) # Vertex community assignments
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is community membership, can we call it membership?

Comment thread PYTHON_README.md
result = louvain(edges, directed=False, resolution=1.0)
print(f"Found {len(result['communities'])} communities")
print(result['vertices']) # Vertex community assignments
print(result['communities']) # Community information
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What additional information does this contain?

Comment thread PYTHON_README.md
- **OpenMP Parallelization**: Automatic multi-threading
- **Zero-Copy Operations**: Direct numpy array access
- **Apache Arrow Integration**: Efficient columnar data transfer
- **Memory Efficient**: Minimal data copying
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repating what was said earlier

Comment thread PYTHON_README.md
directed=False,
resolution=0.8, # Lower = larger communities
beta=0.01, # Leiden refinement parameter
seed=42, # Reproducible results
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove the above two options

Comment thread PYTHON_README.md
- Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019).
- Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.

For more details about the underlying GVE-Leiden implementation, see the main README.md.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let it refer to the original repo instead.

Comment thread PYTHON_README.md
Run the Louvain community detection algorithm.

**Parameters:**
- `graph`: Input graph as pandas DataFrame, Arrow Table, or dict-like object
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsharma It seems like an overkill to support all the possible formats. Should we focus on pyarrow only? We can rely on pyarrow for conversion from pandas or other formats instead.

Comment thread examples/demo.py
@@ -0,0 +1,122 @@
#!/usr/bin/env python3
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsharma Lets drop this and keep only short examples in the readme.

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 3, 2025

@adsharma Hope you are having a good day.

Ideally, the next step would be to push this algorithm to existing packages, such as NetworKit. Graph databases should thus use something like NetworKit internally.

Thank you for the pointer to NetworKit. I was briefly confused about it being the same as networkX, another python library. Even though they're solving the same problem, networkX is a pure python implementation and NetworKit uses C++/openmp.

Further, NetworKit already has a parallel Leiden with a reference to your paper:

https://github.com/networkit/networkit/blob/e025b97a270274e396752e23115847dc9b19037f/include/networkit/community/ParallelLeiden.hpp

and a performance claim:

For example, community detection in a 3 billion edge web graph can be performed on a 16-core server in a matter of a few minutes.

I think the next step should be to evaluate NetworKit implementation. If it's equally performant or better, I don't see any point in improving this PR.

Thank you for the review. I'll get back to you in a day or two.

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 4, 2025

Here's the data comparing this code vs NetworKit. It's faster, but worse modularity.
How do you feel about trying to fix the NetworKit implementation vs trying to merge this code?

Successfully loaded graph from soc-livejournal.csv
Using provided graph data...
Graph: 55866123 edges, 4033138 vertices

Preparing graphs...
NetworkIt graph: 55866123 edges, 4033137 nodes

Benchmarking leiden-communities-openmp Leiden...
  Run 1: 18.6783s
  Run 2: 10.2512s
  Run 3: 10.5043s
  Average time (excluding warmup): 10.3777s
  Communities found: 4033137.000000
  Modularity: 0.738021

Benchmarking NetworkIt Leiden...
/home/ubuntu/src/leiden-communities-openmp/test/benchmark_nk.py:214: UserWarning: The current implementation might produce results, which do not follow the guarantees from the Leiden algorithm. See documentation for more details: https://networkit.github.io/dev-docs/python_api/community.html#ParallelLeiden
  plm = nk.community.ParallelLeiden(nk_graph)
  Run 1: 3.2186s
  Run 2: 3.2056s
  Run 3: 3.1648s
  Average time (excluding warmup): 3.1852s
  Communities found: 338307.000000
  Modularity: 0.575227

================================================================================
RESULTS SUMMARY
================================================================================
Algorithm Performance:
  leiden-communities-openmp Leiden:  10.3777s
  NetworkIt Leiden:                  3.1852s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt Leiden:  0.31x

Community Detection Results:
  leiden-communities-openmp Leiden:  4033137 communities, modularity 0.738021
  NetworkIt Leiden:                  338307 communities, modularity 0.5752272623428052

Quality Comparison:
  Modularity difference (LC vs NK Leiden):  0.162794

@wolfram77
Copy link
Copy Markdown
Member

@adsharma I will try sending a pull request to NetworKit, updating their Leiden, and possibly Louvain implementation. You may have to wait for 2-3 weeks as I am currently busy with joining process after having completed my PhD. Are you a PhD student working on graph databases?

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 4, 2025

I graduated many decades ago :)

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 4, 2025

networkit/networkit#1244

Folks are aware of bugs in the implementation (including the author who wrote the code). We're probably better off replacing the implementation instead of trying to fix it.

I also benchmarked against Parallel Louvain Method:

Algorithm Performance:
  leiden-communities-openmp Leiden:  10.1437s
  NetworkIt PLM:                  12.3595s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt PLM:  1.22x

Community Detection Results:
  leiden-communities-openmp Leiden:  4033137 communities, modularity 0.742160
  NetworkIt PLM:                  3518 communities, modularity 0.7660565188495451

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 4, 2025

I found the bugs in networkit. Sending a PR now.

Here's the summary after the fixes.

================================================================================
RESULTS SUMMARY
================================================================================
Algorithm Performance:
  leiden-communities-openmp Leiden:  9.0327s
  NetworkIt Leiden:                  9.1313s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt Leiden:  1.01x

Community Detection Results:
  leiden-communities-openmp Leiden:  4504 communities, modularity 0.743539
  NetworkIt Leiden:                  2283 communities, modularity 0.788739

Quality Comparison:
  Modularity difference (LC vs NK Leiden):  0.045199

@adsharma
Copy link
Copy Markdown
Author

adsharma commented Jul 4, 2025

networkit/networkit#1328

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants