*: overcome postgres sync repl limit causing lost transactions under some events. #514
Merged
…some events.

Postgres synchronous replication has a downside explained in the docs:
https://www.postgresql.org/docs/current/static/warm-standby.html

`If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers. There is no way to be certain that all standbys have received all outstanding WAL data at time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys.`

Under some sequences of events this will cause lost transactions. For example:

* The sync standby goes down.
* A client commits a transaction; the commit blocks waiting for acknowledgement.
* The primary restarts and marks the above transaction as fully committed. All the clients will now see that transaction.
* The primary dies.
* The standby comes back.
* The sentinel elects the standby as the new master since it's in the synchronous_standby_names list.
* The above transaction is lost despite synchronous replication being enabled.

So there are conditions under which a sync standby can be elected even though it is missing the latest transactions, because it was down at commit time.

This issue is not easy to fix in the sentinel: when the master is down, the sentinel cannot know whether a sync standby is really in sync (we cannot query its last WAL position, and the reporting from the keeper is asynchronous).

But with stolon we can overcome this issue: since we control the primary, we can notice when it restarts and allow only "internal" connections until all the defined synchronous standbys are really in sync.
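As a rough illustration of the gating idea described above (not stolon's actual code, which is written in Go), the decision can be sketched in Python. The function names, the `awaiting_sync_check` flag, and the use of plain integers for WAL positions are all assumptions made for this example.

```python
def sync_standbys_in_sync(primary_wal, sync_standbys, reported_wal):
    """Return True if every defined sync standby has reached primary_wal.

    primary_wal   -- hypothetical integer WAL position of the primary
    sync_standbys -- names listed in synchronous_standby_names
    reported_wal  -- mapping standby name -> last known WAL position
    """
    # A standby that has never reported (e.g. it is down) counts as behind.
    return all(reported_wal.get(name, -1) >= primary_wal for name in sync_standbys)


def client_connections_allowed(awaiting_sync_check, primary_wal, sync_standbys, reported_wal):
    """Decide whether non-internal (client) connections may be accepted.

    awaiting_sync_check is True from the moment the primary restarts until
    the sync standbys have been verified to be in sync (assumed flag).
    """
    if not awaiting_sync_check:
        return True
    return sync_standbys_in_sync(primary_wal, sync_standbys, reported_wal)


# The scenario from the list above: the standby stopped at position 4,
# the primary restarted with a transaction at position 5.
print(client_connections_allowed(True, 5, ["standby1"], {"standby1": 4}))
# Once the standby has caught up, clients are let back in.
print(client_connections_allowed(True, 5, ["standby1"], {"standby1": 5}))
```

With this gate in place, clients cannot observe the transaction that was marked committed at restart until the standby actually holds it, so electing that standby later would not lose a transaction any client has seen.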
Allowing only "internal" connections means not adding the default rules or the user defined pgHBA rules, but only the rules needed for replication (and local communication from the keeper). Since the "internal" rules accept the defined superuser and replication users, clients should not use these roles for normal operation or the above solution won't work (they shouldn't do so anyway, since this could exhaust the reserved superuser connections needed by the keeper to check the instance).
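The rule selection described above could look roughly like the following Python sketch. The function `build_hba_rules`, the role names `stolon_su` and `repluser`, the addresses, and the choice that user defined rules replace the default ones are assumptions for illustration; they are not taken from stolon's code.

```python
def build_hba_rules(internal_rules, default_rules, user_rules, internal_only):
    """Return the pg_hba.conf lines to write, internal rules first."""
    rules = list(internal_rules)
    if internal_only:
        # Only replication and local keeper access: no client rules at all.
        return rules
    # Assumption: user defined rules, when present, replace the defaults.
    rules.extend(user_rules if user_rules else default_rules)
    return rules


# Hypothetical rule sets in pg_hba.conf syntax.
internal = [
    "local postgres stolon_su md5",
    "host replication repluser 10.0.0.2/32 md5",
]
defaults = ["host all all 0.0.0.0/0 md5", "host all all ::0/0 md5"]
user_defined = ["host mydb appuser 10.0.0.0/24 md5"]

# After a primary restart, while the sync standbys are not yet verified:
for line in build_hba_rules(internal, defaults, user_defined, internal_only=True):
    print(line)
```

Note that a client connecting as `stolon_su` or `repluser` would still match the internal rules, which is why those roles should not be used for normal operation.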
sgotti added a commit that referenced this pull request on Sep 10, 2018
…_causing_lost_transactions_under_some_events *: overcome postgres sync repl limit causing lost transactions under some events.