@xumengpanda
Contributor

This PR resolves issue #1761.

The serverTeamRemover() is similar to the machine team remover: it periodically picks a server team to remove until the total number of teams is no larger than the desired number.

To give each server a similar number of teams, serverTeamRemover() first removes the server team whose members are on the largest number of server teams. In addition, when TeamCollection builds server teams, it builds more server teams than the desired number so that serverTeamRemover() can better balance the number of teams per server.
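The removal loop described above can be sketched roughly as follows. This is an illustrative Python model, not the Flow actor from the PR: the function names are hypothetical, and scoring a team by the sum of its members' team counts is an assumption about what "on the largest number of server teams" means.

```python
from collections import Counter


def teams_per_server(teams):
    """Count how many teams each server belongs to."""
    counts = Counter()
    for team in teams:
        counts.update(team)
    return counts


def remove_redundant_teams(teams, desired_total):
    """Remove one team per round until at most desired_total teams remain."""
    teams = [frozenset(t) for t in teams]
    removed = []
    while len(teams) > desired_total:
        counts = teams_per_server(teams)
        # Assumption: a team's score is the total team count of its members.
        victim = max(teams, key=lambda t: sum(counts[s] for s in t))
        teams.remove(victim)
        removed.append(victim)
    return teams, removed


# Servers a..g, four teams, three desired: a team containing the busy
# server "a" is removed, and the team on lightly loaded servers is kept.
kept, removed = remove_redundant_teams(["abc", "abd", "acd", "efg"], 3)
```

In this example the team on servers e, f, g survives because its members each belong to only one team.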

We build more teams than we finally want so that we can use the serverTeamRemover() actor to remove the teams
whose members belong to too many teams. This allows us to get a more balanced number of teams per server.
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from 98c6abf to 61c1138 Compare July 4, 2019 06:26
Each server has the maximum of DESIRED_TEAMS_PER_SERVER and
(DESIRED_TEAMS_PER_SERVER * storageTeamSize) / 2 teams.
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from 61c1138 to e39c9d1 Compare July 5, 2019 23:53
Also change state variable to variable.
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from e39c9d1 to c7a9962 Compare July 5, 2019 23:54
Pick for removal the team whose least-loaded member (the server on the fewest teams) is on the largest number of teams.
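A minimal Python sketch of this selection rule, assuming teams are given as tuples of server names. The function name is hypothetical and the code only models the rule; it is not the PR's implementation.

```python
from collections import Counter


def pick_team_to_remove(teams):
    """Return the team whose least-loaded member is on the most teams."""
    counts = Counter(server for team in teams for server in team)
    # For each team, look at the member that is on the fewest teams,
    # then choose the team for which that minimum is largest.
    return max(teams, key=lambda team: min(counts[s] for s in team))


teams = [("a", "c"), ("a", "b"), ("b", "d")]
victim = pick_team_to_remove(teams)
```

Here servers a and b are each on two teams while c and d are on one, so the rule selects ("a", "b"): removing it still leaves every server with a team.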

AddTeamsBestOf should keep building teams until each server has at least the
target number of teams.
Otherwise, simulation may time out when team remover needs to
remove hundreds of teams.
Also further speed up serverTeamRemover in simulation, and
add comments.
…false

Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
@xumengpanda xumengpanda requested a review from etschannen July 9, 2019 21:27
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from 2f88c88 to 600f16c Compare July 11, 2019 07:29
When a teamTracker is cancelled, e.g., by redundant teamRemover or badTeamRemover,
we should decrease the optimalTeamCount if the team is considered as an
optimal team, i.e., all members' machine fitness is no worse than unset, and
the team is healthy.
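The bookkeeping described in this commit message could look roughly like the Python model below. All names (`Fitness`, `Team`, `TeamCollection`, `cancel_tracking`) and the fitness ordering are hypothetical stand-ins chosen for illustration; only the rule "decrement when a cancelled team was counted as optimal" comes from the message.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fitness(IntEnum):
    # Hypothetical ordering: a smaller value means a better fit.
    BEST = 0
    GOOD = 1
    UNSET = 2
    WORST = 3


@dataclass
class Team:
    member_fitness: list
    healthy: bool = True
    counted_as_optimal: bool = False

    def is_optimal(self):
        # Optimal: healthy, and no member's fitness is worse than UNSET.
        return self.healthy and all(f <= Fitness.UNSET for f in self.member_fitness)


class TeamCollection:
    def __init__(self):
        self.optimal_team_count = 0

    def start_tracking(self, team):
        if team.is_optimal():
            team.counted_as_optimal = True
            self.optimal_team_count += 1

    def cancel_tracking(self, team):
        # A cancelled tracker must give back the count it contributed.
        if team.counted_as_optimal:
            team.counted_as_optimal = False
            self.optimal_team_count -= 1
```

Keeping a per-team flag is a design choice of this sketch: it ensures a cancelled tracker subtracts exactly what it added, even if the team's health changed in between.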
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from 600f16c to cf935ff Compare July 12, 2019 05:46
…move

Before the serverTeamRemover tries to pick a team to remove,
it waits for all data movement to finish, which means all teams are healthy.

When the serverTeamRemover starts to pick a team to remove,
we believe all servers are healthy.
Also change some code format in self review
@xumengpanda xumengpanda added this to the 6.2 milestone Jul 13, 2019
@xumengpanda
Contributor Author

The commit 5c5e883745595a459ec7f10347de06104ba6d936 passes the 100K random tests.

Change to remove the machine team whose machines are on the most machine teams, using the same
logic as the serverTeamRemover.

The feature is guarded by the TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
Do not overbuild teams because we may oscillate between building more teams and
removing the redundant teams. The oscillation happens when the machines are not
evenly distributed across availability zones.
For example, in three_data_hall mode, two data halls have 1 machine each and
the 3rd data hall has 3 machines. To build enough (and more) teams for servers
in the 3rd data hall, we will overbuild teams. However,
the teamRemover will remove those newly built teams.
If the minimum number of teams of servers in a team is less than the
target value (desired_team_number_per_server * (teamSize + 1) / 2),
the team remover should not remove it. Otherwise, DD will oscillate in
building more teams and removing redundant teams.
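The guard against oscillation can be sketched as a small predicate. This is a hedged Python illustration: the function names are hypothetical, and the use of integer division in the target formula is an assumption.

```python
def target_teams_per_server(desired_per_server, team_size):
    """Target value from the commit message (integer division assumed)."""
    return desired_per_server * (team_size + 1) // 2


def may_remove(team, counts, desired_per_server, team_size):
    """A team is removable only if every member stays at or above the target.

    counts maps each server to the number of teams it is currently on.
    """
    target = target_teams_per_server(desired_per_server, team_size)
    return min(counts[s] for s in team) >= target


# With 5 desired teams per server and a team size of 3, the target is 10.
counts = {"a": 12, "b": 10, "c": 9}
ok_to_remove_ab = may_remove(("a", "b"), counts, 5, 3)        # min is 10
ok_to_remove_abc = may_remove(("a", "b", "c"), counts, 5, 3)  # min is 9
```

A team that includes server c is left alone because c is already below the target, so removing the team would only trigger another round of team building.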

Do not do consistency check for three_data_hall mode because when
machines are not evenly distributed across data halls, we will
need to build more teams than the total desired number to make sure
the number of teams per server is no less than the target value.
The team remover does not remove a team if doing so would leave a server with 0 teams,
so we currently disable the check until we have a better strategy to enforce the
desired number of teams.

This will not cause much of a problem in real deployments, whereas having 0 teams on a server
will make the server unable to host data, which is bad.
@xumengpanda xumengpanda requested a review from etschannen July 17, 2019 04:25
@xumengpanda
Contributor Author

xumengpanda commented Jul 17, 2019

Commit 915732ce24a925949c79c93cac2dd44a21a29b0f passed 100K random tests.
Ready for review.

1) No need to check server with only one team when teamRemover finds
a server team or machine team to remove

2) Fix optimalTeamCount counting in teamTracker
@xumengpanda xumengpanda force-pushed the mengxu/server-team-remover-PR branch from d56bfdd to 64bee63 Compare July 19, 2019 01:47
@xumengpanda
Contributor Author

The latest commit 64bee63dbc25186fea4ac987496a8da37663e4b7 passed 100K random tests.
Good to merge now since it has been reviewed.
@etschannen

If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnecessary extra data movement.
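The ordering implied by this commit message can be modelled with a toy Python function. It is only a sketch under the assumption that server team removal is held back until the machine team count has reached the desired number; the function and parameter names are hypothetical.

```python
def removal_order(machine_teams, desired_machine_teams,
                  server_teams, desired_server_teams):
    """Return the sequence of removals when machine teams are trimmed first."""
    steps = []
    # The machine team remover runs until the machine team count is at the target.
    while machine_teams > desired_machine_teams:
        machine_teams -= 1
        steps.append("machine")
    # Only then does the server team remover start removing server teams.
    while server_teams > desired_server_teams:
        server_teams -= 1
        steps.append("server")
    return steps


steps = removal_order(5, 3, 4, 3)
```

Trimming machine teams first avoids the case where a server team rebuilt after a removal lands on a machine team that is itself about to be removed.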
@etschannen etschannen merged commit c70e762 into apple:master Jul 20, 2019
4 participants