@SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

Initial top-level structures for the future PropertyGraphFrame module.

Let's imagine we have a property graph that represents a network of legal entities for anti-money-laundering (AML) purposes.

In that case our vertices can be:

  • Legal entities
  • Persons

And edges can be:

  • One entity pays another entity: a bank transaction, modeled as a directed weighted edge with weight equal to the amount
  • One entity pays a person: the same, but the vertex types differ
  • A person is a member of an entity's board or works for an entity: an undirected edge without weight

The core idea is relying on three classes:

  • PropertyGraphFrame
  • EdgePropertyGroup
  • VertexPropertyGroup

Each property group contains its own data as a DataFrame, plus metadata.

VertexPropertyGroup contains data related to a group of vertices. For example, for entities it can be metadata such as name, assets, legal form, history of AML violations, etc. For persons it may be something else, such as name and surname.

EdgePropertyGroup contains data related to a group of edges. For example, the edge group entity-pay-entity is a weighted group with directed edges, while the case where two entities share one board member (a person) is an undirected, unweighted edge group.

PropertyGraphFrame is just a sequence of edge and vertex groups with the ability to produce a GraphFrame object and call algorithms such as clustering, shortest paths, etc.
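
The relationship between the three classes can be illustrated with a minimal plain-Python sketch. This is not the GraphFrames API: the field names, the `PropertyGraph` container, the `to_edge_list` helper, and the choice to expand an undirected edge into two directed ones are all assumptions made for illustration, and lists of dicts stand in for DataFrames.

```python
from dataclasses import dataclass


@dataclass
class VertexPropertyGroup:
    name: str   # group name, e.g. "entity" or "person"
    rows: list  # stand-in for a DataFrame: dicts with an "id" key plus properties


@dataclass
class EdgePropertyGroup:
    name: str
    src_group: str  # name of the vertex group on the source side
    dst_group: str  # name of the vertex group on the destination side
    rows: list      # dicts with "src", "dst" and, if weighted, "weight"
    directed: bool = True
    weighted: bool = False


@dataclass
class PropertyGraph:
    vertex_groups: list
    edge_groups: list

    def to_edge_list(self):
        """Flatten all edge groups into (src, dst, weight) triples.

        Vertices are addressed as (group, id) pairs. Assumption: an
        undirected edge is emitted once in each direction, and an
        unweighted edge gets weight 1.0.
        """
        edges = []
        for group in self.edge_groups:
            for row in group.rows:
                weight = row["weight"] if group.weighted else 1.0
                src = (group.src_group, row["src"])
                dst = (group.dst_group, row["dst"])
                edges.append((src, dst, weight))
                if not group.directed:
                    edges.append((dst, src, weight))
        return edges


# The AML example from the description, with made-up rows.
entities = VertexPropertyGroup("entity", [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}])
persons = VertexPropertyGroup("person", [{"id": 1, "name": "Alice"}])
pays = EdgePropertyGroup(
    "entity_pays_entity", "entity", "entity",
    [{"src": 1, "dst": 2, "weight": 500.0}],
    directed=True, weighted=True,
)
board = EdgePropertyGroup(
    "board_member", "person", "entity",
    [{"src": 1, "dst": 1}],
    directed=False,
)

graph = PropertyGraph([entities, persons], [pays, board])
edges = graph.to_edge_list()
```

Keeping the direction and weight flags on the edge group, rather than on individual rows, is what lets a single conversion step treat every group uniformly.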

Why are the changes needed?

  1. At the moment the GraphFrame API is very low-level. It is hard to tell which edges are directed and which are undirected, how to construct the graph, etc. That is fine for library developers, but it can be really hard for end users such as analysts.
  2. We cannot fully support OpenCypher because GraphFrame does not have any properties. With a PropertyGraph abstraction we are open to adding support for OpenCypher / Gremlin.
  3. Users can just take multiple data tables, pass them to the PropertyGraph and be happy, instead of doing low-level hacking to create edges/vertices, avoid collisions, etc.
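
The collision problem in point 3 appears as soon as two source tables reuse the same IDs, for example entity 1 and person 1. One common way around it, sketched below in plain Python, is to derive a graph-wide ID from the group name together with the original ID. The `hashed_id` helper, the `:` separator, and the use of SHA-256 are assumptions for illustration, not the scheme used in this PR.

```python
import hashlib


def hashed_id(group: str, raw_id) -> str:
    """Derive a graph-wide vertex ID from the group name and the original ID.

    Hypothetical helper: the separator and the hash function are assumptions.
    """
    return hashlib.sha256(f"{group}:{raw_id}".encode("utf-8")).hexdigest()


# Two tables whose raw IDs overlap: entity 1 and person 1.
entity_ids = [1, 2]
person_ids = [1]

# Merging raw IDs silently collapses entity 1 and person 1 into one vertex.
naive = set(entity_ids) | set(person_ids)

# Merging hashed IDs keeps all three vertices distinct.
merged = {hashed_id("entity", i) for i in entity_ids} | {
    hashed_id("person", i) for i in person_ids
}
```

The mapping is deterministic, so the same (group, ID) pair always yields the same graph-wide ID, which is what makes it possible to translate results back to the original tables later.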

Overall: #602, #565 + some others.

IMPORTANT
Just as Spark structures are based on Parquet, I mostly based the proposed structure on Apache GraphAr (incubating), the only "open-table" format for property graphs known to me!

DISCLAIMER
At the same time, I am a PPMC member of GraphAr, so there is a potential conflict of interest!

I do not want to write more code before discussing the overall concept and idea. When we are done, I will continue the work by providing additional methods to the API.

@SemyonSinchenko
Collaborator Author

@james-willis It looks like I addressed all the comments, thanks a lot for the feedback. Could you take another look?

@SemyonSinchenko
Collaborator Author

Added:

  • projection method (transformation of a bipartite combination into a unipartite one)
  • tests
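
As an illustration of what such a projection does, here is a minimal plain-Python sketch under stated assumptions: the bipartite input is a list of (person, entity) membership pairs, and two entities become connected by an undirected, unweighted edge when they share at least one person. The `project` function and the sample data are hypothetical and do not reflect the method signature added in this PR.

```python
from collections import defaultdict
from itertools import combinations


def project(memberships):
    """Project a person-entity bipartite graph onto entities only.

    memberships: iterable of (person_id, entity_id) pairs.
    Returns a sorted list of undirected (entity_a, entity_b) pairs,
    with entity_a < entity_b so that each pair appears once.
    """
    entities_by_person = defaultdict(set)
    for person, entity in memberships:
        entities_by_person[person].add(entity)

    pairs = set()
    for entities in entities_by_person.values():
        # Every two entities sharing this person become connected.
        for a, b in combinations(sorted(entities), 2):
            pairs.add((a, b))
    return sorted(pairs)


memberships = [
    ("alice", "acme"),
    ("alice", "globex"),
    ("bob", "globex"),
    ("bob", "initech"),
    ("carol", "acme"),
]
projected = project(memberships)
```

Here "acme" and "globex" are linked through "alice", and "globex" and "initech" through "bob"; "carol" belongs to a single entity and therefore contributes no edge.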

@codecov-commenter

codecov-commenter commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 86.66667% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.83%. Comparing base (06ee372) to head (9d9be21).
⚠️ Report is 26 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...mes/propertygraph/property/EdgePropertyGroup.scala    83.87%    5 Missing ⚠️
...graphframes/propertygraph/PropertyGraphFrame.scala    90.69%    4 Missing ⚠️
...s/propertygraph/property/VertexPropertyGroup.scala    85.71%    2 Missing ⚠️
...hframes/propertygraph/property/PropertyGroup.scala    0.00%     1 Missing ⚠️
❗ There is a different number of reports uploaded between BASE (06ee372) and HEAD (9d9be21).

HEAD has 2 uploads less than BASE
Flag    BASE (06ee372)    HEAD (9d9be21)
        4                 2
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #613      +/-   ##
==========================================
- Coverage   87.82%   80.83%   -6.99%     
==========================================
  Files          22       30       +8     
  Lines        1092     1320     +228     
  Branches      124      166      +42     
==========================================
+ Hits          959     1067     +108     
- Misses        133      253     +120     


@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review July 28, 2025 09:19
@SemyonSinchenko
Collaborator Author

Changes from the latest review round

  • Optional function to compute a new weight in projection
  • Back-join to easily get data from GraphFrame algorithms with original ID
  • Preserving "group" column in toGraphFrame for more efficient back-join
  • Option to avoid hashing IDs if a user is sure that it is not needed for a specific group
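
To show why keeping a "group" column helps, here is a minimal plain-Python sketch of a back-join, under stated assumptions: algorithm output is keyed by a hashed graph-wide ID and carries the group name, so the rows of one group can be selected first and then matched to the original table. The `hashed_id` and `back_join` helpers, the SHA-256 scheme, and the `component` column are hypothetical and are not the API added in this PR.

```python
import hashlib


def hashed_id(group, raw_id):
    """Hypothetical graph-wide ID: hash of the group name plus the original ID."""
    return hashlib.sha256(f"{group}:{raw_id}".encode("utf-8")).hexdigest()


def back_join(results, group, original_rows):
    """Attach algorithm output to the original rows of one vertex group.

    results: dicts with "id" (graph-wide ID), "group", and algorithm columns.
    original_rows: dicts with the original "id" and properties.
    """
    # Filtering on "group" first means only this group's results are indexed.
    lookup = {r["id"]: r for r in results if r["group"] == group}
    joined = []
    for row in original_rows:
        hit = lookup.get(hashed_id(group, row["id"]))
        if hit is not None:
            extra = {k: v for k, v in hit.items() if k not in ("id", "group")}
            joined.append({**row, **extra})
    return joined


entities = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
persons = [{"id": 1, "name": "Alice"}]

# Made-up output of a graph algorithm that assigns a component to each vertex.
results = [
    {"id": hashed_id("entity", 1), "group": "entity", "component": 0},
    {"id": hashed_id("entity", 2), "group": "entity", "component": 0},
    {"id": hashed_id("person", 1), "group": "person", "component": 0},
]

entity_components = back_join(results, "entity", entities)
person_components = back_join(results, "person", persons)
```

Without the group column, every result row would have to be compared against every group's table to find its origin.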

@SemyonSinchenko
Collaborator Author

cc: @SauronShepherd , @Kimahriman
Hello! If you can give a review it would be very cool!

@SemyonSinchenko
Collaborator Author

Just for simplicity / context: I wrote a human-readable description, with a diagram, of what a PropertyGraph is in my recent blog post. This PR provides an implementation that follows that description of the concept exactly.

@SemyonSinchenko SemyonSinchenko merged commit 4503055 into graphframes:master Aug 11, 2025
5 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 602-property-graphs branch August 11, 2025 14:31

3 participants