Python bindings by adsharma · Pull Request #1 · nodef/leiden-communities-openmp.py

adsharma · 2025-06-30T20:29:39Z

People using python and pandas may benefit from having these bindings and a parallel implementation.

While the benchmarks are running and the tests are passing, they're not seeing a speedup due to single threaded runs.

I've spent some time debugging why louvainStaticOmp runs fast (< 2 secs) when run via main.cxx, but takes 20 secs when run via benchmark.py. No resolution yet.

Observation: there is only one thread (Thread 0), even when omp_get_num_threads() returns 64.

usage:

uv sync
uv run examples/*.py
uv run pytest
uv run tests/benchmark.py --csv graph.csv

adsharma · 2025-06-30T20:32:54Z

Discussion so far: puzzlef#7

adsharma · 2025-06-30T20:40:33Z

@wolfram77 Here's my learning so far:

There is a communication gap between graph databases and graph algorithm communities
This repo and ParClusterers and cugraph are examples of graph algorithms
KuzuDB, memgraph and Neo4j are examples of Graph Databases. These databases also implement some graph algorithms.

As an end user, I have to chose between algorithms that are integrated into the language (typically cypher, lately PGQ) and handle graphs bigger than memory vs algorithms that are highly parallel and fast.

My hope is that the two communities can collaborate more via data structures that are a de facto industry standard (pandas, arrow). This is a step in that direction. There is a lot of room for optimization (like it takes 6 seconds to copy the graph, 2 seconds to run clustering and another 2 seconds to copy it back to python).

wolfram77 · 2025-07-01T15:21:14Z

@adsharma Thanks again for thsi pull request. However, this is a huge amount of code! Are you using some agent-based CLI for this?

I am reviewing the code currently, and will update you in a few days.

adsharma · 2025-07-02T05:40:21Z

Yes - I've used an IDE based agent to write some of the code.

wolfram77

@adsharma Hope you are having a good day.

I have reviewed some of the files, and have some comments/suggestions. My understanding from your comment earlier is that this pull request is a step in the direction of bridging graph databases and graph algorithm packages. Here, we are only allowing GVE-Louvain and Leiden to be tested on python independently. Thus, I feel we should cut down to a minimal set of files - less is better any day.

Ideally, the next step would be to push this algorithm to existing packages, such as NetworKit. Graph databases should thus use something like NetworKit internally.

What do you say?

wolfram77 · 2025-07-01T07:56:03Z

@adsharma Let us rename this file to README.md, the old readme is not relevant for this project (python bindings).

wolfram77 · 2025-07-01T07:59:03Z

+uv sync
+uv pip install -e .
+source .venv/bin/activate
+python3 demo.py


Is this examples/demo.py?

wolfram77 · 2025-07-01T08:01:50Z

+
+```python
+import pandas as pd
+from leiden_communities import louvain, leiden


Should we call it leiden_communities_openmp, just for consistency?

wolfram77 · 2025-07-01T08:03:52Z

+# Run Louvain algorithm
+result = louvain(edges, directed=False, resolution=1.0)
+print(f"Found {len(result['communities'])} communities")
+print(result['vertices'])  # Vertex community assignments


If this is community membership, can we call it membership?

wolfram77 · 2025-07-01T08:04:24Z

+result = louvain(edges, directed=False, resolution=1.0)
+print(f"Found {len(result['communities'])} communities")
+print(result['vertices'])  # Vertex community assignments
+print(result['communities'])  # Community information


What additional information does this contain?

wolfram77 · 2025-07-01T09:11:34Z

+- **OpenMP Parallelization**: Automatic multi-threading
+- **Zero-Copy Operations**: Direct numpy array access
+- **Apache Arrow Integration**: Efficient columnar data transfer
+- **Memory Efficient**: Minimal data copying


Repating what was said earlier

wolfram77 · 2025-07-01T09:14:10Z

+    directed=False,
+    resolution=0.8,          # Lower = larger communities
+    beta=0.01,              # Leiden refinement parameter
+    seed=42,                # Reproducible results


Can remove the above two options

wolfram77 · 2025-07-01T09:18:10Z

+- Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019).
+- Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.
+
+For more details about the underlying GVE-Leiden implementation, see the main README.md.


Let it refer to the original repo instead.

wolfram77 · 2025-07-03T12:31:36Z

+Run the Louvain community detection algorithm.
+
+**Parameters:**
+- `graph`: Input graph as pandas DataFrame, Arrow Table, or dict-like object


@adsharma It seems like an overkill to support all the possible formats. Should we focus on pyarrow only? We can rely on pyarrow for conversion from pandas or other formats instead.

wolfram77 · 2025-07-03T12:58:53Z

@@ -0,0 +1,122 @@
+#!/usr/bin/env python3


@adsharma Lets drop this and keep only short examples in the readme.

adsharma · 2025-07-03T21:47:04Z

@adsharma Hope you are having a good day.

Ideally, the next step would be to push this algorithm to existing packages, such as NetworKit. Graph databases should thus use something like NetworKit internally.

Thank you for the pointer to NetworKit. I was briefly confused about it being the same as networkX, another python library. Even though they're solving the same problem, networkX is a pure python implementation and NetworKit uses C++/openmp.

Further, NetworKit already has a parallel Leiden with a reference to your paper:

https://github.com/networkit/networkit/blob/e025b97a270274e396752e23115847dc9b19037f/include/networkit/community/ParallelLeiden.hpp

and a performance claim:

For example, community detection in a 3 billion edge web graph can be performed on a 16-core server in a matter of a few minutes.

I think the next step should be to evaluate NetworKit implementation. If it's equally performant or better, I don't see any point in improving this PR.

Thank you for the review. I'll get back to you in a day or two.

adsharma · 2025-07-04T00:12:28Z

Here's the data comparing this code vs NetworKit. It's faster, but worse modularity.
How do you feel about trying to fix the NetworKit implementation vs trying to merge this code?

Successfully loaded graph from soc-livejournal.csv
Using provided graph data...
Graph: 55866123 edges, 4033138 vertices

Preparing graphs...
NetworkIt graph: 55866123 edges, 4033137 nodes

Benchmarking leiden-communities-openmp Leiden...
  Run 1: 18.6783s
  Run 2: 10.2512s
  Run 3: 10.5043s
  Average time (excluding warmup): 10.3777s
  Communities found: 4033137.000000
  Modularity: 0.738021

Benchmarking NetworkIt Leiden...
/home/ubuntu/src/leiden-communities-openmp/test/benchmark_nk.py:214: UserWarning: The current implementation might produce results, which do not follow the guarantees from the Leiden algorithm. See documentation for more details: https://networkit.github.io/dev-docs/python_api/community.html#ParallelLeiden
  plm = nk.community.ParallelLeiden(nk_graph)
  Run 1: 3.2186s
  Run 2: 3.2056s
  Run 3: 3.1648s
  Average time (excluding warmup): 3.1852s
  Communities found: 338307.000000
  Modularity: 0.575227

================================================================================
RESULTS SUMMARY
================================================================================
Algorithm Performance:
  leiden-communities-openmp Leiden:  10.3777s
  NetworkIt Leiden:                  3.1852s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt Leiden:  0.31x

Community Detection Results:
  leiden-communities-openmp Leiden:  4033137 communities, modularity 0.738021
  NetworkIt Leiden:                  338307 communities, modularity 0.5752272623428052

Quality Comparison:
  Modularity difference (LC vs NK Leiden):  0.162794

wolfram77 · 2025-07-04T15:10:46Z

@adsharma I will try sending a pull request to NetworKit, updating their Leiden, and possibly Louvain implementation. You may have to wait for 2-3 weeks as I am currently busy with joining process after having completed my PhD. Are you a PhD student working on graph databases?

adsharma · 2025-07-04T17:29:49Z

I graduated many decades ago :)

adsharma · 2025-07-04T19:54:59Z

networkit/networkit#1244

Folks are aware of bugs in the implementation (including the author who wrote the code). We're probably better off replacing the implementation instead of trying to fix it.

I also benchmarked against Parallel Louvain Method:

Algorithm Performance:
  leiden-communities-openmp Leiden:  10.1437s
  NetworkIt PLM:                  12.3595s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt PLM:  1.22x

Community Detection Results:
  leiden-communities-openmp Leiden:  4033137 communities, modularity 0.742160
  NetworkIt PLM:                  3518 communities, modularity 0.7660565188495451

adsharma · 2025-07-04T23:11:54Z

I found the bugs in networkit. Sending a PR now.

Here's the summary after the fixes.

================================================================================
RESULTS SUMMARY
================================================================================
Algorithm Performance:
  leiden-communities-openmp Leiden:  9.0327s
  NetworkIt Leiden:                  9.1313s

Performance Comparison:
  leiden-communities-openmp vs NetworkIt Leiden:  1.01x

Community Detection Results:
  leiden-communities-openmp Leiden:  4504 communities, modularity 0.743539
  NetworkIt Leiden:                  2283 communities, modularity 0.788739

Quality Comparison:
  Modularity difference (LC vs NK Leiden):  0.045199

adsharma · 2025-07-04T23:20:07Z

networkit/networkit#1328

adsharma added 3 commits June 28, 2025 11:08

python bindings

f05b4cb

Add pytest

4497ebf

Parallelize python -> C++ -> python via OpenMP

504fe5c

gitignore

de53e4a

wolfram77 requested changes Jul 3, 2025

View reviewed changes

Add benchmark to compare this vs networkit

8ec3166

adsharma force-pushed the python2 branch from 2ef9d7b to 8ec3166 Compare July 4, 2025 00:15

adsharma mentioned this pull request Jul 4, 2025

Disconnected Communities in Parallel Leiden networkit/networkit#1244

Closed

adsharma closed this Jul 4, 2025

fabratu mentioned this pull request Jul 8, 2025

Fix bugs in ParallelLeiden networkit/networkit#1328

Merged

adsharma mentioned this pull request Jul 8, 2025

Optimize for Arrow in-memory format networkit/networkit#1331

Open

Conversation

adsharma commented Jun 30, 2025

Uh oh!

adsharma commented Jun 30, 2025

Uh oh!

adsharma commented Jun 30, 2025

Uh oh!

wolfram77 commented Jul 1, 2025

Uh oh!

adsharma commented Jul 2, 2025

Uh oh!

wolfram77 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adsharma commented Jul 3, 2025

Uh oh!

adsharma commented Jul 4, 2025

Uh oh!

wolfram77 commented Jul 4, 2025

Uh oh!

adsharma commented Jul 4, 2025

Uh oh!

adsharma commented Jul 4, 2025

Uh oh!

adsharma commented Jul 4, 2025

Uh oh!

adsharma commented Jul 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants