
Conversation


@ashb ashb commented Sep 14, 2020

As we switch over to using row-level locks more in the scheduler, it's
better if we don't have transactions hanging around open forever,
otherwise an errant process may end up holding on to a lock for a lot
longer than we intend.

For example, on MySQL the default lock wait timeout is 50s. We could
have set innodb_lock_wait_timeout to a shorter value -- but that only
affects the waiting side, not the locking side. So instead we set an
idle timeout -- if the transaction is not doing anything we ask the DB
server to kill it, ending the transaction and freeing any held row
locks.

Part of #9630.
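
For illustration, the approach could look roughly like the following - a minimal sketch only, where the setting name idle_session_timeout, the connection URL and the exact SET statement are assumptions rather than the code in this PR:

from sqlalchemy import create_engine, event

# Hypothetical config value (seconds); in Airflow this would come from airflow.cfg.
idle_session_timeout = 10.0

engine = create_engine("postgresql://airflow@localhost/airflow")

@event.listens_for(engine, "connect")
def set_idle_in_transaction_timeout(dbapi_connection, connection_record):
    # Ask Postgres to kill any session that sits idle inside an open transaction
    # for longer than the timeout, which rolls it back and releases its row locks.
    cursor = dbapi_connection.cursor()
    cursor.execute(
        "SET idle_in_transaction_session_timeout = %d" % int(idle_session_timeout * 1000)
    )
    cursor.close()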



@ashb ashb changed the title Ensure database connections don't hang around for ever Ensure idle database connections don't hang around for ever Sep 14, 2020
ashb (Member Author):

I didn't add this to airflow.cfg as I was thinking this was a hidden/advanced setting. WDYT?

Member:

So I will only know about it if I know the code or stumble upon it in some Stack Overflow post 😄 I would be in favor of adding it to airflow.cfg.

ashb (Member Author):

Fair.

potiuk (Member), Sep 14, 2020:

See my other comment - I believe this might have a profound impact on how the MySQL server behaves, especially because effectively each client will - in some cases - be continuously closing and opening connections, where previously they were kept open for up to 8 hours by default. So we REALLY need to describe it, and I think we have to understand the impact and describe it for the MySQL users who would like to enable it. I guess setting it by default (as it currently is) to a low value is rather a bad idea in those cases, so quite a bit more complex logic should be implemented here - for example, it could be set to a low value only when the HA scheduler is enabled (when we get to take a look at it).


ashb commented Sep 14, 2020

Note: this setting isn't required, but it gives us a bit more ability to reason about how long a lock might be held should something go wrong.


ashb commented Sep 14, 2020

Oh, Postgres 9.6 doesn't accept the "10s" form, that came later. (It wants 10000 -- the value in milliseconds.)
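
In concrete terms - a small sketch assuming a DBAPI cursor named cursor - the portable form is the bare integer:

# Accepted on newer Postgres servers, but rejected by 9.6 (per the note above):
cursor.execute("SET idle_in_transaction_session_timeout = '10s'")
# Accepted on 9.6 as well - a bare integer, interpreted as milliseconds:
cursor.execute("SET idle_in_transaction_session_timeout = 10000")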

cursor.close()

is_mariadb = engine.dialect.server_version_info and "MariaDB" in engine.dialect.server_version_info
if idle_session_timeout > 0:
Member:

Not sure if this is the best idea, but how about:

if idle_session_timeout <= 0:
    return

this way we can remove one level of indentation.

Comment on lines +81 to +82
if engine.dialect.name == "postgresql":
    if idle_session_timeout > 0:
turbaszek (Member), Sep 14, 2020:

Those ifs can be combined into one
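
For illustration, the combined form would look roughly like this (a sketch of the shape only, assuming engine and idle_session_timeout from the diff above):

if engine.dialect.name == "postgresql" and idle_session_timeout > 0:
    # register the idle-in-transaction timeout handler here
    ...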

@turbaszek turbaszek added the area:MetaDB (Meta Database related issues) and area:Scheduler (including HA (high availability) scheduler) labels Sep 14, 2020
Sets up event handlers.
"""

idle_session_timeout = conf.getfloat("core", "sql_alchemy_idle_transaction_timeout", fallback=10)
Member:

I guess we should add it to the default config/template?


# MySQL doesn't have the equivalent of postgres' idle_in_transaction_session_timeout -- the
# best we can do is just set the timeout for any idle connection
@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
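
For reference, a minimal sketch of what the body of such a handler might do - assuming the session-level wait_timeout variable is the knob being set, and engine / idle_session_timeout from the surrounding code; the actual body in the PR may differ:

from sqlalchemy import event

@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
    # MySQL closes a connection that has been idle for longer than wait_timeout
    # seconds, whether or not it has an open transaction, so this caps how long
    # an idle session can hold row locks.
    cursor = dbapi_connection.cursor()
    cursor.execute("SET SESSION wait_timeout = %d" % int(idle_session_timeout))
    cursor.close()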
Member:

Do you think (from observation) it might have an undesirable effect for MySQL? From what I understand, the side effect of this change is that when workers are idle, they will close and re-open connections continuously and frequently. From what I know this is much more problematic for Postgres (because closing/reopening the connection spawns a new process and is a costly event). For Postgres this is mitigated by PgBouncer, and with Postgres' transaction idle timeout it will not really happen anyway - it's only some long, hanging transaction that will be impacted.

For MySQL, I also understand that such a wait timeout will close any connection that has no activity rather quickly (including those that have an open transaction)?

Is HA introducing some long hanging transactions in the long term that could impact this? Are you aware of some long transactions already present in Airflow?

Since the default value for MySQL is 28,800 seconds (8 hours), I believe this one might have a tremendous impact on the behavior of MySQL in some cases.

What is the effect on HA if we do not set this timeout to a long value? It's not yet visible, as we see only this PR for now, but I would love to understand what kind of problems we can expect.

ashb (Member Author), Sep 14, 2020:

Yeah, I was of two minds on this whole PR. As I say up above, this is not about the performance of HA, but "what happens if a transaction goes wild and hangs on to a lock -- how long should we wait?"

(HA PR coming -- I am doing the work to pull out as much as possible into separate PRs, as that one is already going to be huge enough.)

The default for both Postgres and MySQL(-likes) is that the locking process/connection can wait forever (I think MySQL's default is 1 year. I'm not kidding).

10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.

Member:

> 10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.

Have you tested HA with MySQL at large scale?

potiuk (Member), Sep 14, 2020:

Yeah. We will definitely have to run some tests on MySQL.

I am also a bit afraid about the performance impact of this when the database is remote and SSL is enabled, for example - the effect of this change might be that pretty much every transaction will start paying the handshake penalty. This is probably negligible for Postgres/PgBouncer, but for MySQL, and Cloud SQL in particular, it might become a real problem. For now, setting the default to the original value when there is no HA should be OK I think, and I'd love to see the locking mechanisms in the HA scheduler later and see how big an impact it will have.

ashb (Member Author):

I have done some testing on MySQL, but not enough (and not with SSL) to answer this question.

I will come back to this PR later (hopefully pre-2.0, but if that doesn't look possible I will change this to default to off), as this isn't on the critical path for AIP-15.

ashb (Member Author):

> 10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.
>
> Have you tested HA with MySQL at large scale?

Largest test I did was with 30 workers, 4 schedulers and 100k TIs to run. Not sure if that counts as large or not.

ashb (Member Author):

Though thinking about it -- it likely wouldn't have hit this case very much.

Member:

I think the only way to properly test it is to engage our users with some test/staging systems where they can simulate and test with production-like loads. I think the nature of this makes it super difficult to test. The problem is that this is low-level behaviour that we have no way to control from the "higher-layer" logic, and it depends on factors that are pretty much random (i.e. how many transactions we run in parallel, whether some other threads were using the connections recently and returned them to the pool, and how much load we have on the database in general - how responsive the DB server is).

I can easily imagine scenarios where - for example - some workers will get periodic bursts of timed-out connections for all their threads when other workers start to do a lot of database transactions, because those workers will just cross the timeout limit and all of them start failing. Or a periodic failure of random tasks without a reason we could track or mitigate. Those are the most difficult things to track.

I do not have a very concrete scenario that could be easily simulated - it's more from intuition and experiencing similar problems in the telecom industry years ago. We can try to mitigate it, for example by adding some randomness around the timeout values and other things, but I think there is a good reason why the default timeout is 8 hrs and not, say, 2 minutes. That makes me a little worried.

We might ask our customers to try this out once the whole thing is merged (including the HA scheduler). But I think during the process I would like to understand how critical it is to have this timeout. Maybe we can try some other approaches if all we want is to protect against stale locks. It seems to me it is a bit risky to potentially change the behaviour of all transactions to protect against something that will happen very rarely.

For example, I could easily imagine running a periodic monitor that checks for stale locks and selectively kills only those sessions. That might call for a solution which is much more database-specific, but it might be a less risky, more efficient approach.
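
To make that last idea concrete, a rough Postgres-specific sketch of such a monitor - the threshold, function name and query are illustrative assumptions, not an agreed design:

from sqlalchemy import text

STALE_LOCK_SECONDS = 300  # hypothetical threshold

def kill_stale_idle_transactions(engine):
    # Terminate only sessions that have sat "idle in transaction" past the
    # threshold, instead of putting a timeout on every connection.
    # Postgres-specific; needs a role allowed to call pg_terminate_backend.
    with engine.connect() as conn:
        conn.execute(
            text(
                "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
                "WHERE state = 'idle in transaction' "
                "AND state_change < now() - make_interval(secs => :secs)"
            ),
            {"secs": STALE_LOCK_SECONDS},
        )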

@ashb ashb marked this pull request as draft September 14, 2020 15:48
@mik-laj mik-laj added the AIP-15 label Sep 14, 2020

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label (Stale PRs per the .github/workflows/stale.yml policy file) Nov 1, 2020
@stale stale bot closed this Nov 15, 2020
@ashb ashb deleted the txn-lock-timeout branch December 7, 2020 17:26