Contributor

@ikysil ikysil commented Sep 11, 2025

What does this PR do?
Force embedded RID bags to avoid replication errors in distributed mode.
See https://orientdb.dev/docs/3.2.x/general/Concurrency.html#concurrency-when-adding-edges
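To illustrate what "forcing embedded" means, here is a toy model (not OrientDB code) of a RID bag that stays embedded in the record until it grows past a threshold and then switches to an external tree. As far as I recall, the real switch is governed by the `ridBag.embeddedToSbtreeBonsaiThreshold` setting; treat that name and the default of 40 used below as assumptions to verify against the documentation. `ToyRidBag` and `fill` are hypothetical names.

```python
import sys


class ToyRidBag:
    """Toy RID bag: embedded until it grows past a threshold, then tree-based."""

    def __init__(self, embedded_to_tree_threshold):
        self.threshold = embedded_to_tree_threshold
        self.rids = []
        self.mode = "embedded"

    def add(self, rid):
        self.rids.append(rid)
        # Once the bag holds more entries than the threshold, switch to the tree form.
        if self.mode == "embedded" and len(self.rids) > self.threshold:
            self.mode = "tree"


def fill(bag, count):
    """Add `count` made-up RIDs and report which representation the bag ends in."""
    for i in range(count):
        bag.add(f"#12:{i}")
    return bag.mode


# Assumed default threshold of 40: a vertex with 100 edges ends up tree-based.
default_mode = fill(ToyRidBag(40), 100)

# A threshold that is never reached keeps the bag embedded regardless of size.
forced_mode = fill(ToyRidBag(sys.maxsize), 100)
```

With the threshold effectively unreachable, every edge change is stored inside the vertex record itself, which is the behaviour this PR aims for in distributed mode.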

Motivation
Minimize surprises when updating large graphs in distributed mode.

Related issues
N/A

Additional Notes
N/A

Checklist

  • I have run the build using mvn clean package command
  • My unit tests cover both failure and success scenarios

Member

tglman commented Sep 11, 2025

@ikysil

What issue do you have with the tree ridbag in distributed mode? Do you use lightweight edges?

@ikysil ikysil force-pushed the feat-force-embedded-ridbags-when-distributed branch from 182b64f to caab146 Compare December 8, 2025 20:28
Contributor Author

ikysil commented Dec 8, 2025

Hi @tglman

The number of full syncs dropped to almost zero in a distributed cluster of 3 MASTER and 2 REPLICA nodes after applying this configuration.

We don't use lightweight edges (AFAIR).

Scenario before applying these properties:

  • We use the round-robin connection strategy on the client, so the write load is distributed between the masters.
  • We observed that the vertex version is not modified when only edges are added and/or removed.
  • One master commits a change with edge modifications only, so the vertex version is not modified.
  • Another master commits a change with a property modification, which modifies the vertex version.
  • If the edge modification is applied after the property modification, the cluster cannot agree during the second phase of the transaction and complains about a stale version.
  • A node retries a couple of times, then asks for a delta sync, retries that a few times, fails, and then asks for a full sync.
  • Occasionally, we observe the behavior described in "【BUG】Three masters in cluster mode, any master node re-pulled up leads to cluster synchronization data jamming" (#10427). We have to pull the affected node from the cluster and add it back later, during quiet time.
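The ordering problem in the scenario above can be sketched with a toy model. This is not OrientDB's replication protocol; it only assumes that every transaction carries the vertex version it was based on, that a property change bumps the version, and that an edge-only change (tree ridbag) does not. `ToyVertex`, `replay`, and the record IDs are hypothetical.

```python
class StaleVersionError(Exception):
    """Raised when a transaction was based on an outdated vertex version."""


class ToyVertex:
    def __init__(self):
        self.version = 5
        self.props = {}
        self.edges = set()

    def apply(self, tx):
        kind, expected_version, payload = tx
        if expected_version != self.version:
            raise StaleVersionError(
                f"{kind}: expected v{expected_version}, found v{self.version}"
            )
        if kind == "property":
            self.props.update(payload)
            self.version += 1  # property changes bump the vertex version
        else:
            self.edges.add(payload)  # edge-only change: version stays the same


def replay(order):
    """Apply transactions in the given order on a fresh node; record each outcome."""
    vertex = ToyVertex()
    outcome = []
    for tx in order:
        try:
            vertex.apply(tx)
            outcome.append("ok")
        except StaleVersionError:
            outcome.append("stale")
    return outcome


# Both transactions were prepared against version 5 on different masters.
edge_tx = ("edge", 5, "#20:1")
prop_tx = ("property", 5, {"name": "x"})

edge_first = replay([edge_tx, prop_tx])  # both succeed
prop_first = replay([prop_tx, edge_tx])  # edge change is rejected as stale
```

A node that sees the edge change first accepts both transactions, while a node that sees the property change first rejects the edge change, so the nodes cannot agree on the outcome.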

The full sync is very expensive for us and takes more than 30 minutes.
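The escalation we see (retry, then delta sync, then full sync) roughly follows the shape below. This is a sketch of the observed behaviour only, not the server's implementation; `recover`, its callbacks, and the retry counts are hypothetical placeholders.

```python
def recover(apply_tx, delta_sync, full_sync, tx_retries=2, delta_retries=3):
    """Try progressively more expensive recovery steps; return the steps taken."""
    steps = []
    # Cheapest option first: retry applying the transaction.
    for _ in range(tx_retries):
        steps.append("retry")
        if apply_tx():
            return steps
    # Next, ask for a delta sync a few times.
    for _ in range(delta_retries):
        steps.append("delta_sync")
        if delta_sync():
            return steps
    # Last resort: the full sync (more than 30 minutes in our case).
    steps.append("full_sync")
    full_sync()
    return steps


# Worst case: nothing short of a full sync helps.
worst_case = recover(lambda: False, lambda: False, lambda: None)

# Best case: the first retry succeeds and nothing else is attempted.
best_case = recover(lambda: True, lambda: False, lambda: None)
```

Every stale-version disagreement that is not resolved by the cheap steps ends at the bottom of this ladder, which is why avoiding the disagreement matters so much for us.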

Member

tglman commented Dec 9, 2025

Hi,

This sounds like a bug in the integration of ridbag trees with the distributed flow. I will try to write a test case for this scenario. It may be caused by the fact that changes in the ridbag tree do not change the version of the document, which allows them to run concurrently with other write operations. We do have some distributed scheduling to avoid concurrent writes to the same document, but maybe this check is skipped for changes in ridbag trees.
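As a rough illustration of that hypothesis (a toy model, not the actual scheduler), assume transactions are serialized only when the sets of record IDs they lock overlap. If an edge-only transaction locks the edge record but not the vertex that owns the ridbag tree, the overlap is never detected. `locked_rids`, `conflicts`, and the record IDs are hypothetical.

```python
def locked_rids(changed_records, ridbag_tree_owners, include_tree_owners):
    """Record IDs a transaction locks; tree-ridbag owners are optional."""
    rids = set(changed_records)
    if include_tree_owners:
        rids |= set(ridbag_tree_owners)
    return rids


def conflicts(rids_a, rids_b):
    """Two transactions must be serialized if they lock a common record."""
    return bool(rids_a & rids_b)


vertex = "#10:1"
edge = "#30:7"

# The property transaction changes the vertex record directly.
prop_locks = locked_rids([vertex], [], include_tree_owners=True)

# The edge-only transaction changes the edge record and the vertex's ridbag tree.
edge_locks_skipped = locked_rids([edge], [vertex], include_tree_owners=False)
edge_locks_checked = locked_rids([edge], [vertex], include_tree_owners=True)

skipped_check = conflicts(prop_locks, edge_locks_skipped)  # no overlap detected
with_check = conflicts(prop_locks, edge_locks_checked)     # overlap detected
```

In the first case the two transactions could be applied in a different order on different nodes; in the second they would be scheduled one after the other.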

Any support on reproducing the case is welcome!

tglman added a commit that referenced this pull request Dec 22, 2025
Member

tglman commented Dec 22, 2025

Hi,

I was able to write a somewhat minimal test case that reproduces the case, and I fixed it, so from the next hotfix the issue with distributed transactions being applied in inverted order due to ridbag trees is solved.

The commit that fixes it references this PR.

Bye

tglman added a commit that referenced this pull request Dec 22, 2025
Member

tglman commented Dec 24, 2025

Hi,

Version 3.2.48 has been released and should fix this case.

Contributor Author

ikysil commented Dec 31, 2025

Hi @tglman

Thank you for the fix and the release.
It will take some time to verify, as we don't want to risk production data integrity.

Let's close this PR as it is not needed by itself.
I will add a comment after the verification.

2 participants