
rgw: add support bucket replication between zonegroups#59911

Closed
clwluvw wants to merge 29 commits into ceph:main from clwluvw:rgw-zonegroup-replication

Conversation

@clwluvw
Member

@clwluvw clwluvw commented Sep 20, 2024

The current implementation of bucket replication is limited to replication within a zonegroup and does not account for bucket location constraints. To align with AWS's model, this proposal introduces cross-zonegroup bucket replication, respecting location constraints and only replicating based on user requests through the PutBucketReplication API (https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketReplication.html).

In addition, the existing DataLogChanges and BiLogs systems are inefficient for multi-zonegroup scenarios, as they require all zones to process every entry. To improve this, a new property, log_zones, has been introduced for both DataLogChanges and BiLogs. For BiLogs, log_zones will either include all target_zones from the sync policy or select the zones from the bucket's available rules that match the criteria (prefix and tagset). For DataLogChanges, it will always include all target zones from the bucket's sync policy.
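As a rough illustration of the log_zones selection described above, here is a small Python sketch (the Rule structure and function names are hypothetical simplifications, not the actual Ceph types):

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    dest_zones: set                            # zones this rule replicates to
    prefix: str = ""                           # object-key prefix filter
    tags: dict = field(default_factory=dict)   # required object tags

def bilog_zones(rules, obj_key, obj_tags):
    """BiLog log_zones: union of dest zones of every rule whose
    prefix/tagset filters match this particular object."""
    zones = set()
    for rule in rules:
        if not obj_key.startswith(rule.prefix):
            continue
        if any(obj_tags.get(k) != v for k, v in rule.tags.items()):
            continue
        zones |= rule.dest_zones
    return zones

def datalog_zones(rules):
    """DataLogChanges log_zones: always all target zones of the policy."""
    return set().union(*(r.dest_zones for r in rules)) if rules else set()
```

With this shape, a zone polling the logs can skip entries whose log_zones do not mention it.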

Both BiLogs Listing and DataLogChanges Listing APIs now report the last processed marker, allowing zones to report the actual last marker so that the source can trim logs efficiently. DataLogChanges are global across zones, so all zones must sync to the last marker for trimming. BiLogs, however, are specific to the bucket’s interest, and zones listing BiLogs must report the correct marker, even in cases where some zones may miss entries (due to the rules filters).
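The trimming rule this implies can be sketched in a few lines of Python (a simplification, assuming markers are totally ordered per shard): the source can trim only up to the minimum marker reported across all zones that must consume the log, which is why a zone that skipped irrelevant entries still has to report the last marker it listed.

```python
def trim_position(reported_markers):
    """Return the highest log position that is safe to trim: the
    minimum of the last-processed markers reported by all consuming
    zones. A zone that filtered out entries still reports the last
    marker it listed, so it doesn't hold trimming back."""
    if not reported_markers:
        return None  # nothing reported yet; nothing is safe to trim
    return min(reported_markers.values())
```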

The rgwx-zonegroup property has been deprecated in favor of rgwx-zone, allowing log filtering based on the requesting zone. If a zonegroup reference is still required, it can be derived from the zone.

Additionally, logging has been optimized to occur only when an active sync pipe exists for the corresponding object. To prevent performance issues caused by circular replication, a new configuration option, rgw_data_sync_allow_chain_replication, has been introduced, allowing control over chain replication and reducing redundant logging.

Buckets created in other zonegroups will now operate in an indexless mode, preventing unnecessary index operations and status reporting for buckets the zone does not own.

Finally, since sync pipes can now encompass all zones across zonegroups, the wildcard (*) configuration for pipes is no longer effective. When pipes are bucket-specific, the wildcard is automatically translated into the available zones from the bucket’s zonegroup, while still allowing updates to the policy if the bucket is recreated in a different zonegroup.
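The wildcard translation could look roughly like this (a hypothetical helper, not the actual RGW code):

```python
def expand_wildcard_zones(pipe_zones, bucket_zonegroup_zones):
    """For bucket-specific pipes, '*' is no longer meaningful across
    zonegroups, so it is translated into the zones of the bucket's
    current zonegroup; explicit zone lists pass through unchanged."""
    if pipe_zones == ["*"]:
        return sorted(bucket_zonegroup_zones)
    return pipe_zones
```

If the bucket is later recreated in a different zonegroup, re-running the translation against the new zonegroup's zones yields the updated policy.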

Fixes: https://tracker.ceph.com/issues/66649

Related PRs:

@clwluvw clwluvw requested review from a team as code owners September 20, 2024 20:34
@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch from 165e38c to 98f9710 Compare September 20, 2024 20:47
@clwluvw clwluvw marked this pull request as draft September 23, 2024 23:04
@clwluvw
Member Author

clwluvw commented Sep 24, 2024

If I understood correctly, the only current overhead is that every zone will process all logged buckets from the source zone, but only at the shard-entry level, and they will stop at:

if (pipes.empty()) {
  ldpp_dout(dpp, 20) << __func__ << "(): no relevant sync pipes found" << dendl;
  return set_cr_done();
}

I'm not sure how expensive this could be.

@clwluvw clwluvw marked this pull request as ready for review September 24, 2024 19:41
@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch from b67a763 to fc8efc8 Compare September 24, 2024 21:41
@adamemerson adamemerson requested a review from cbodley September 27, 2024 21:55
@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch from c46029d to 0059779 Compare September 30, 2024 20:53
@github-actions

github-actions bot commented Oct 3, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch from 2281f42 to 8f87e0a Compare October 3, 2024 21:44
@clwluvw
Member Author

clwluvw commented Oct 4, 2024

I believe the write path is now fully optimized, as it only logs when an enabled pipe is associated with the object.

For the polling phase, I propose two possible approaches:

  1. Utilizing RADOS Namespacing: We can create a FIFO queue for each zone within its own RADOS namespace. This approach allows us to write logs directly to the corresponding zone's queue, ensuring that each zone only pulls the relevant logs.
  2. Maintaining a FIFO Queue Index: An alternative would be to maintain an index that records the offset and size for each log entry in the queue. When a zone pulls data, it would read only from the specified offset. However, this could complicate the FIFO design, especially with respect to trimming and reading, as it might diverge from the simplicity of traditional FIFO mechanics.
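Option 2 above could be sketched as follows (a toy in-memory model; the real design would live on RADOS objects and still have to solve trimming):

```python
class IndexedLog:
    """Toy model of a flat log plus a per-zone index of (offset, size)
    pairs, so each zone reads only the entries addressed to it."""

    def __init__(self):
        self.buf = b""
        self.index = {}  # zone -> list of (offset, size)

    def append(self, zones, payload: bytes):
        off = len(self.buf)
        self.buf += payload
        for z in zones:
            self.index.setdefault(z, []).append((off, len(payload)))

    def read(self, zone):
        return [self.buf[o:o + n] for o, n in self.index.get(zone, [])]
```

As noted, trimming is where this gets awkward: freeing the prefix of the flat log requires rebasing every zone's offsets, which is exactly the divergence from simple FIFO mechanics mentioned above.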

@clwluvw
Member Author

clwluvw commented Oct 6, 2024

I think I misunderstood the purpose of RGWDataChangesLog. It seems that it only logs the bi log marker, so the ideas mentioned earlier are not suitable for this implementation. I think the best approach would be to enable rgw_data_change to store the destination zone, and when other zones are polling for the logs, they can filter the entries by the dest_zone. This filtering operation should be inexpensive for the source zone.
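The dest_zone filtering described here should be cheap because it is a linear scan on the source side; a minimal Python sketch (with a hypothetical entry format):

```python
def list_datalog_for_zone(entries, requesting_zone):
    """Return only the entries whose dest_zones include the requesting
    zone, plus the marker of the last entry examined so the caller can
    still advance its position past the skipped entries."""
    relevant = [e for e in entries if requesting_zone in e["dest_zones"]]
    last_marker = entries[-1]["marker"] if entries else None
    return relevant, last_marker
```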

@smanjara
Contributor

smanjara commented Oct 7, 2024

thanks @clwluvw. does replication between zonegroups work with the changes from df1867f alone? if it does, we might want to do some testing to see how expensive the polling would be under a reasonable load and a reasonable number of zones before we optimize for its efficiency.
two zonegroup testing was disabled in our multisite testing suite due to failing test cases. now we run a subset of the whole suite and most of them should pass. I have enabled two zonegroup yaml here: #60172

@clwluvw
Member Author

clwluvw commented Oct 7, 2024

Hi @smanjara. Thanks for the interest. I'll break down my commits as below:

  • df1867f will enable having zonegroups in the conn map, but replication would still be blocked, as log_data is set to false when we have only one zone in the zonegroup (so this alone needs a manual modification to set log_data: true in the zonegroup config).
  • 9fb96a9 - this will eliminate the need for manual activation by checking whether the bucket has any active pipes rather than checking that boolean. But this adds the load of loading the bucket sync pipes on every operation that was previously gated by only a const bool. (I checked, and there is a sync pipe cache already in place, so hopefully this shouldn't introduce any significant perf drop.)
  • f980c03 - this is the same as above, but I forgot to fix it for renew_entries(), so you can assume these two are squashed.
  • 8f87e0a - this will check per object rather than per bucket whether to log or not. It should have roughly the same cost as the per-bucket check, I believe, as the per-object check is just a loop over the rules checking prefix and object tags (sometimes it loads object tags, but only for APIs where a few extra milliseconds of response time shouldn't matter).

So to summarise: the first three commits should let you replicate without any manual work, while the last one optimizes logging based on the rules' prefix/tag filters.

@clwluvw
Member Author

clwluvw commented Oct 7, 2024

how expensive the polling would be under a reasonable load and a reasonable number of zones before we optimize for its efficiency.

With the current implementation, as I understand from (#59911 (comment)), other zones in different zonegroups will just list the datalog and ignore entries when they are not interested in a particular bucket. This would likely affect the sync status more, though it's unclear how much of the destination zone's resources would be consumed, which might lead to issues. Your load test could help determine this need.

There’s one incompatibility with AWS BucketReplication: the priority field in the ReplicationRule. AWS only replicates to one bucket per object, and if there are collisions with the rules defined on the source bucket, it selects the highest priority. Currently, RGW does not respect this priority, which might confuse users who rely on it.
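AWS resolves such collisions by priority: among the matching rules, the one with the largest Priority value wins, and the object is replicated according to that rule only. A sketch of that resolution (simplified rule dicts, prefix-only filters):

```python
def winning_rule(rules, obj_key):
    """Pick the single replication rule AWS would apply: of all rules
    whose filter matches the object, the one with the highest Priority
    value (larger number = higher priority) takes precedence."""
    matching = [r for r in rules if obj_key.startswith(r["prefix"])]
    if not matching:
        return None
    return max(matching, key=lambda r: r["priority"])
```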

However, with the proposed approach outlined here (#59911 (comment)), I believe we can address this issue while also reducing log processing by destination zones. In the logging phase, we can define where the object should be replicated. If I'm not mistaken, this might also replace the need for the zones_trace concept currently implemented.

@adamemerson adamemerson requested a review from smanjara October 12, 2024 02:29
@smanjara
Contributor

@clwluvw sorry it took me so long to get back. the commit df1867f establishes connection objects for zones across zonegroups when the zonegroup policy is set. this alone, along with the bucket location constraint fixes #59305 and #59960, should be sufficient to set up replication between two zonegroups.

along with the zonegroup policy set to allowed, we need to enable a bucket sync policy to either sync within the zonegroup or between zones across zonegroups. although I am not sure how sync between the zones on a non-master zonegroup will behave and what the sync status will look like. this will need some testing and investigation.

documenting configuration example here for reference.

at the zonegroup level:

$ radosgw-admin sync group create --group-id=group1 --status=allowed
$ radosgw-admin sync group flow create --group-id=group1 --flow-id=flow-mirror --flow-type=symmetrical --zones=zg1-1,zg1-2,zg2-3,zg2-4
$ radosgw-admin sync group pipe create --group-id=group1 --pipe-id=pipe1 --source-zones='*' --source-bucket='*' --dest-zones='*' --dest-bucket='*'
$ radosgw-admin period update --commit

enable bucket1 replication between zg1-1 and zg1-2, belonging to zonegroup zg1:

$ radosgw-admin sync group create --bucket=bucket1 --group-id=bucket1-default --status=enabled
$ radosgw-admin sync group flow create --bucket=bucket1 --group-id=bucket1-default --flow-id=bucket1-flow --flow-type=symmetrical --zones="zg1-1, zg1-2"
$ radosgw-admin sync group pipe create --bucket=bucket1 --group-id=bucket1-default --pipe-id=bucket1-pipe --source-zones='*' --dest-zones='*'

enable bucket3 replication to sync between the zonegroups zg1 and zg2, involving all zones:

$ radosgw-admin sync group create --bucket=bucket3 --group-id=bucket3-default --status=enabled
$ radosgw-admin sync group flow create --bucket=bucket3 --group-id=bucket3-default --flow-id=bucket3-flow --flow-type=symmetrical --zones="zg1-1, zg1-2, zg2-3, zg2-4"
$ radosgw-admin sync group pipe create --bucket=bucket3 --group-id=bucket3-default --pipe-id=bucket3-pipe --source-zones='*' --dest-zones='*'

please note that setting zonegroup sync policy allowed forces you to set sync policy on each bucket individually.
we can further reduce the number of zones participating probably by adding only one of the zones from zonegroup zg2 in the bucket sync flow, if you don't need redundancy within the zonegroup and you only care about data locality.

there is a commit in #60018 that sets multiple zonegroups for you to test with.

the other commits deal with changing the way we log data, which I am not very comfortable with. the most common multisite configuration is the one where we sync to/from all zones within a zonegroup. adding conditional checks for sync pipes or for specific objects may not work and adds overhead for configurations that do not care about sync policies.

@clwluvw
Member Author

clwluvw commented Oct 17, 2024

Thank you for your time on this @smanjara.

the commit df1867f establishes connection objects for zones across zonegroups when zonegroup policy is set. this alone along with bucket location constraint fixes #59305 and #59960 should be sufficient to setup replication between two zonegroups.

I'm not sure if this would be enough. Basically, as long as log_data is not true in the zone config, RGW should not log anything:

bool add_log = log_op && store->svc.zone->need_to_log_data();
ret = store->cls_obj_complete_add(*bs, obj, optag, poolid, epoch, ent,
                                  category, remove_objs, bilog_flags,
                                  zones_trace, add_log);
if (add_log) {
  add_datalog_entry(dpp, store->svc.datalog_rados,
                    target->bucket_info, bs->shard_id, y);
}

Currently log_data will be enabled when we have more than one zone within the RGW's zonegroup:

bool log_data = zones.size() > 1;

So I guess in your test case it accidentally worked because you always had more than one zone per zonegroup(?). A zonegroup with only one zone wouldn't have log_data set to true; that's why the other two commits are needed, I believe.

along with zonegroup policy set to allowed, we need to enable bucket sync policy to either sync within zonegroup or between zones across zonegroups. although I am not sure how sync between the zones on non-master zonegroup will behave and what the sync status will look like. this will need some testing and investigation.

Right; conceptually I was thinking of the logging and polling mechanisms independently, replaying them based on the filters and the bucket's zonegroup availability. So I guess it should not matter whether the source or dest is the master.

please note that setting zonegroup sync policy allowed forces you to set sync policy on each bucket individually. we can further reduce the number of zones participating probably by adding only one of the zones from zonegroup zg2 in the bucket sync flow, if you don't need redundancy within the zonegroup and you only care about data locality.

Right; currently AWS is also designed so that you can only replicate an object to a single bucket (zonegroup), not concurrently to more. But still, that is not enough here: as you said before, the entry processing would be done by all zones no matter what the policy says, and only one will do the actual data replication.

the other commits deal with changing the way we log data, that I am not very comfortable with. the most common multisite configuration is the one where we sync to/from all zones within a zonegroup. adding conditional checks for sync pipes or for specific objects may not work and adds an overhead for configurations that does not care about sync policies.

Currently, the same check per object based on the sync policies available is done here:

if (!filter_bucket(dpp, bucket, y)) {
  return 0;
}

The only difference I'm making now is to limit the filter to the object's properties (prefix and tags) and only log if some pipe has an interest in that particular object (which can be an enabled pipe on the bucket itself or at the zonegroup level). So the load of processing the sync pipes is already being paid (and the pipe loading has a cache already); I'm just adding a minor check based on the filters, which I guess shouldn't cause a significant performance drop. Do you see it another way?
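The per-object check being added amounts to something like this (a simplified filter dict, not the actual RGW pipe filter type):

```python
def object_matches_pipe(pipe_filter, obj_key, obj_tags):
    """A pipe is interested in an object only if its prefix and tag
    filters both match; with no filters every object matches, so the
    cost degenerates to the existing per-bucket check."""
    prefix = pipe_filter.get("prefix", "")
    tags = pipe_filter.get("tags", {})
    if not obj_key.startswith(prefix):
        return False
    return all(obj_tags.get(k) == v for k, v in tags.items())
```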

@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch 2 times, most recently from c8f23e8 to 7fe1b75 Compare October 17, 2024 19:14
@github-actions github-actions bot added the tests label Oct 17, 2024
@smanjara
Contributor

Thank you for your time on this @smanjara.

the commit df1867f establishes connection objects for zones across zonegroups when zonegroup policy is set. this alone along with bucket location constraint fixes #59305 and #59960 should be sufficient to setup replication between two zonegroups.

I'm not sure if this would be enough. Basically, as long as log_data is not true in the zone config, RGW should not log anything:

bool add_log = log_op && store->svc.zone->need_to_log_data();
ret = store->cls_obj_complete_add(*bs, obj, optag, poolid, epoch, ent,
                                  category, remove_objs, bilog_flags,
                                  zones_trace, add_log);
if (add_log) {
  add_datalog_entry(dpp, store->svc.datalog_rados,
                    target->bucket_info, bs->shard_id, y);
}

Currently log_data will be enabled when we have more than one zone within the RGW's zonegroup:

bool log_data = zones.size() > 1;

So I guess in your test case it accidentally worked because you always had more than one zone per zonegroup(?). A zonegroup with only one zone wouldn't have log_data set to true; that's why the other two commits are needed, I believe.

yeah, I am looking at it as an extension of multisite where we will simply add new zones from other zonegroups seamlessly while log_data is true on all zones. I have added your PR as a topic of discussion here: https://pad.ceph.com/p/rgw-weekly. hopefully you can make it.
cc @yehudasa

@clwluvw clwluvw force-pushed the rgw-zonegroup-replication branch from c7ea980 to 16262e4 Compare October 17, 2024 22:36
@clwluvw
Member Author

clwluvw commented Oct 18, 2024

There is also another challenge regarding replicating an already replicated object. Currently, we log and replicate even when an object is being replicated again, which increases the load and adds complexity in managing zones_trace.

The main issue is that when an object is replicated via RGWBucketSyncSingleEntryCR(), we log it again because another active sync process matches this operation. If that sync points back to the original zone, it becomes redundant. I haven't found a clean way (even with the log_zonegroup implementation, which is insufficient since it might involve another zone) to stop logging when it's a circular replication. For now, the zones_trace logic simply ignores it during polling, but that still results in an unnecessary load.

AWS S3 doesn't support this type of replication, as described here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-what-is-isnot-replicated.html

Objects in the source bucket that are replicas that were created by another replication rule. For example, suppose you configure replication where bucket A is the source and bucket B is the destination. Now suppose that you add another replication configuration where bucket B is the source and bucket C is the destination. In this case, objects in bucket B that are replicas of objects in bucket A are not replicated to bucket C.

For performance and efficiency, could we consider dropping this replication? Or, at least, we could introduce a configuration option to disable this functionality.
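To make the effect of such an option concrete, here is a toy Python model (hypothetical, not the actual implementation) of how a flag like rgw_data_sync_allow_chain_replication would change propagation across a chain of sync pipes:

```python
def replicate(topology, origin, allow_chain):
    """Propagate an object along (source, dest) sync edges. A copy
    that is itself a replica is re-replicated onward only when chain
    replication is allowed (AWS never re-replicates replicas)."""
    have = {origin}       # zones holding the object
    replicas = set()      # zones holding it only as a replica
    changed = True
    while changed:
        changed = False
        for src, dst in topology:
            if src in have and dst not in have:
                if src in replicas and not allow_chain:
                    continue  # source copy is a replica; don't chain it
                have.add(dst)
                replicas.add(dst)
                changed = True
    return have
```

With A -> B -> C pipes, disabling chain replication stops the object at B, matching the AWS behavior quoted above.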

This bucket sync state is used for buckets that cannot be replicated
to the zonegroup, or that have been deleted in the middle of the
sync, where we don't want to keep the bilogs but the zonegroup still
needs to report a bilog status for them, as the pipe for the source
bucket still points to that zone.
On bilog trimming, the peers that got this status will be excluded
from the min generation and min position calculation, as they are
not interested in replicating that bucket.

Signed-off-by: Seena Fallah <[email protected]>
Indexless buckets do not have bilogs.

Signed-off-by: Seena Fallah <[email protected]>
The RGWOp_BILog_List API now reports the last processed marker. This
allows destination zones that have no bilog entries to process to
update their bucket sync status to the last marker, preventing them
from reporting an obsolete marker and blocking trimming.

This scenario can occur when multiple rules with filters exist on the
source bucket, where some zones may not receive all entries due to
log_zones limiting entry processing. These zones still need to follow
the markers to stay up-to-date, even if they don't process the actual
entries.

Signed-off-by: Seena Fallah <[email protected]>
With the introduction of log_zones, a zone might not receive certain
datalog entries if they are irrelevant to that zone. To support
proper trimming, we now return the last processed marker, allowing
zones to report this marker even when they haven't processed those
irrelevant entries. This ensures that the source zone can proceed
with trimming.

Signed-off-by: Seena Fallah <[email protected]>
Consider all sources (including resolved sources) in sync info, as
some sources, such as those in another zonegroup, are only included
in the resolved sources.

Signed-off-by: Seena Fallah <[email protected]>
The Rule ID in ReplicationConfiguration is not required, therefore
the pipe id can be empty.
This happens mostly with the PutBucketReplication API, as the user
may not provide an ID, so a sync pipe is created with an empty ID,
and radosgw-admin doesn't allow modifying the pipe because of the
check.

Signed-off-by: Seena Fallah <[email protected]>
With zonegroup replication, buckets can have zones from other
zonegroups in the sync as well. This allows considering all
available zones defined in the sync pipe rather than only the ones
in the zonegroup.

Signed-off-by: Seena Fallah <[email protected]>
When running `radosgw-admin bucket sync run` with only the target
bucket specified and no source bucket, RGWGetBucketPeersCR doesn't
account for resolved sources from the sync pipe, resulting in no
pipes being returned and causing the command to fail. This change
ensures that hint sources are considered to avoid this issue.

Signed-off-by: Seena Fallah <[email protected]>
As per the destination bucket existence check before sync policy
creation, we can ensure that there are no policies pointing to my
zonegroup from other zonegroups, so we can safely skip this bucket
instance if it's not in my zonegroup for full sync.

Signed-off-by: Seena Fallah <[email protected]>

@github-actions

github-actions bot commented Jun 8, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jun 8, 2025
@clwluvw clwluvw removed the stale label Jun 16, 2025

@github-actions github-actions bot added the stale label Aug 15, 2025
@clwluvw clwluvw removed the stale label Sep 4, 2025

@github-actions github-actions bot added the stale label Nov 3, 2025
e->exists = true;
e->meta = *m;
e->tag = "tag";
e->log_zones = { rgw_zone_id("1588bb2c-439a-4b75-91ef-f0b31d02563b"), rgw_zone_id("1f9654c2-3a66-4407-b07c-0f6727c9df17") };
Contributor

Is it intentional to hardcode these UUIDs?

Member Author

This is a function that generates test data.

type: bool
level: advanced
default: true
desc: This option controls whether replication of already replicated objects (chain replication)
Contributor

suggest

already-replicated

@github-actions github-actions bot removed the stale label Nov 3, 2025

@github-actions github-actions bot added the stale label Jan 2, 2026
@github-actions

github-actions bot commented Feb 1, 2026

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 1, 2026
@github-actions

Config Diff Tool Output

+ added: rgw_data_sync_allow_chain_replication (rgw.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

