Skip to content

Conversation

@etschannen
Copy link
Contributor

The satellites serve no purpose when we only are replication to one region, and we have observed out of memory errors when they are recruiting in this situation.

@etschannen
Copy link
Contributor Author

closes #2801

Copy link
Contributor

@ajbeamon ajbeamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a variety of other places we are checking this sateliteTLogReplicationFactor, and I'm not well-versed enough to know whether they to be changed as well:

SimulatedCluster.actor.cpp
ManagementAPI.actor.cpp
DatabaseConfiguration.h (besides what's already changed)
DatabaseConfiguration.cpp

There's also satelliteTLogReplicationFactorFallback, not sure if it's relevant.

@alexmiller-apple
Copy link
Contributor

I ended up going down a weird rabbithole when trying to figure out exactly how and where to put an assert to make sure there's no other case of this happening in simulation. Instead, I think I've figured out that you could get to this same problem via configuring more satellites than primary logs.

Consider a multi-region cluster with one primary log and two satellite logs. It looks like we round log_routers, and thus logRouterTags, up to be a multiple of the number of logs in the primary:

    auto newTLogs = getWorkersForTlogs(db.config, db.config.tLogReplicationFactor, db.config.getDesiredLogs(), db.config.tLogPolicy, id_used, true, primaryDC);
...
    int newRouterCount = newTLogFit.count * std::max<int>(1, db.config.desiredLogRouterCount / std::max(1,newTLogFit.count));

And we statically choose one satellite log to hold the one log router tag that we'll have by default. This means that even in a usable_regions=2 config, we'll have a satellite tlog that will only receive empty commits. This is less of a realistic problem, because with triple replication, satellite logs / primary logs would have to be greater than 3, and hopefully no one would configure a cluster that way.

If in checkSatelliteTagLocations, we keep used around, then it wouldn't be too hard to just not initialize or recruit a satellite tlog in that generation if there's no log router tag its supposed to store.

@etschannen
Copy link
Contributor Author

The reason to have log router tags a multiple of tlogs is so that each log has equal work. However, now that log routers peek from satellites, it probably makes sense to just make the log router tags a max of primary tlogs and satellite tlogs.

…telliteTLogs to avoid having satellites with no data.
@ajbeamon
Copy link
Contributor

@etschannen I think there was a thing or two you were going to double check, such as taking a quick look at the fallback value to confirm it's ok as is. Let me know if you are now happy with this or if you are still checking.

@ajbeamon ajbeamon merged commit fe19f30 into apple:release-6.2 Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants