Ensure idle database connections don't hang around for ever #10933
Conversation
airflow/utils/orm_event_handlers.py
Outdated
I didn't add this to airflow.cfg as I was thinking this was a hidden/advanced setting. WDYT?
So I will know about it only if I know the code or stumble upon it in some Stackoverflow post 😄 I would be in favor of adding it to airflow.cfg
Fair.
See my other comment - I believe it might have a profound impact on how the MySQL server behaves, because effectively each client will - in some cases - be continuously closing and opening connections, where previously they were kept open for up to 8 hours by default. So we REALLY need to describe it, and I think we have to understand the impact and document it for the MySQL users who would like to enable it. I guess setting it by default (like it is currently) to a low value is rather a bad idea in those cases, so quite a bit more complex logic should be implemented here - for example, it could be set to a low value only when the HA scheduler is enabled (when we get to take a look at it).
Note, this setting isn't required, but it gives a bit more ability to reason about how long a lock might be held should something go wrong.
As we switch over to using row-level locks more in the scheduler, it is better if we don't have transactions hang around being open "for ever", otherwise an errant process may end up holding on to a lock for a lot longer than we intend. For example, on mysql the default lock wait timeout is 50s. We could have set `innodb_lock_wait_timeout` to a shorter value -- but that only affects the waiting side, not the locking side. So instead we set an idle timeout -- if the transaction is not doing anything we ask the DB server to kill it, ending the transaction and freeing any held row locks.
Oh, pg9.6 doesn't accept the "10s", that came later. (It wants 10000 -- value in milliseconds.)
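To make the above concrete, here is a minimal sketch (not the exact code from this PR) of how the Postgres side can be wired up with a SQLAlchemy connect listener, assuming the timeout is configured in seconds and converted to the millisecond form that pg 9.6 expects:

```python
from sqlalchemy import event


def setup_idle_transaction_timeout(engine, idle_session_timeout):
    """Ask Postgres to kill sessions that sit "idle in transaction" for too long."""
    if engine.dialect.name != "postgresql" or idle_session_timeout <= 0:
        return

    @event.listens_for(engine, "connect")
    def set_idle_in_transaction_timeout(dbapi_connection, connection_record):
        cursor = dbapi_connection.cursor()
        # Value is in milliseconds; killing the session also releases any row locks it held.
        cursor.execute(
            "SET SESSION idle_in_transaction_session_timeout = %d"
            % int(idle_session_timeout * 1000)
        )
        cursor.close()
```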
```python
cursor.close()

is_mariadb = engine.dialect.server_version_info and "MariaDB" in engine.dialect.server_version_info
if idle_session_timeout > 0:
```
Not sure if this is the best idea, but how about:

```python
if idle_session_timeout <= 0:
    return
```

This way we can remove one level of indentation.
```python
if engine.dialect.name == "postgresql":
    if idle_session_timeout > 0:
```
Those ifs can be combined into one
```python
    Setups event handlers.
    """

    idle_session_timeout = conf.getfloat("core", "sql_alchemy_idle_transaction_timeout", fallback=10)
```
I guess we should add it to the default config/template?
```python
# MySQL doesn't have the equivalent of postgres' idle_in_transaction_session_timeout -- the
# best we can do is set the timeout for any idle connection
@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
```
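The diff above only shows the decorator and signature; presumably the body just sets MySQL's `wait_timeout` session variable on each new connection. A rough, hypothetical sketch of that body (assuming `engine`, `event` and `idle_session_timeout` from the surrounding code; details are guessed, not taken from the PR):

```python
@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    # wait_timeout is MySQL's idle-connection timeout (default 28800 seconds = 8 hours).
    # Unlike Postgres' idle_in_transaction_session_timeout it applies to any idle
    # connection, not only those with an open transaction.
    cursor.execute("SET SESSION wait_timeout = %d" % int(idle_session_timeout))
    cursor.close()
```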
Do you think (from observation) it might have an undesirable effect for MySQL? From what I understand, the side effect of this change is that when workers are idle, they will close and re-open connections continuously and frequently. From what I know this is much more problematic for Postgres (because closing/reopening a connection spawns a new process and is a costly event).
For Postgres this is mitigated by PgBouncer (and with Postgres' transaction idle timeout this will not really happen - it's only some long, hanging transaction that will be impacted).
For MySQL, I also understand that such a wait timeout will close any connection that has no activity rather quickly (including those that have an open transaction)?
Is HA introducing some long-hanging transactions that could be affected by this? Are you aware of some long transactions already present in Airflow?
Since the default value for MySQL is 28,800 seconds (8 hours), I believe this one might have a tremendous impact on the behavior of MySQL in some cases.
What is the effect on HA if we do not set this timeout and leave it at the long default value? It's not yet visible as we see only this PR for now, but I would love to understand what kind of problems we can expect.
Yeah, I was of two minds on this whole PR. As I say up above, this is not about performance of HA, but "what happens if a transaction goes wild and hangs on to a lock -- how long should we wait."
(HA PR coming -- I am doing the work to pull out as much as possible into separate PRs, as that one is already going to be huge enough.)
The default for both postgres and mysql(-likes) is that the locking process/connection can wait for ever (I think mysql's default is 1 year. I'm not kidding).
10s is perhaps a low default, and 5 mins/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on mysql.
> 10s is perhaps a low default, and 5 mins/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on mysql.

Have you tested HA with mysql at large scale?
Yeah. We will definitely have to make some tests on MySQL.
I am also a bit afraid about the performance impact of this one when the database is remote and SSL is enabled, for example - the effect of this change might be that pretty much every transaction will start paying the handshake penalty. This is probably negligible for postgres/pgbouncer, but for MySQL, and Cloud SQL in particular, it might become a real problem. For now I think setting the default to the original value when HA is not enabled should be OK, and I'd love to see the locking mechanisms in the HA scheduler later and see how big of an impact it will have.
I have done some testing on MySQL, but not enough (and not with SSL) to answer this question.
I will come back to this PR later (hopefully pre 2.0, but if that doesn't look possible I will change this to default to off), as this isn't on the critical path for AIP-15.
> 10s is perhaps a low default, and 5 mins/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on mysql.

> Have you tested HA with mysql at large scale?

Largest test I did was with 30 workers, 4 schedulers and 100k TIs to run. Not sure if that counts as large or not.
Though thinking about it -- it likely wouldn't have hit this case very much.
I think the only way to properly test it is to engage our users with some test/staging systems where they can simulate and test with production-like loads. I think the nature of this makes it super difficult to test. The problem is that this is low-level behaviour that we have no way to control from the "higher-layer" logic, and it depends on factors that are pretty much random (i.e. how many transactions we run in parallel, whether some other thread used a connection recently and returned it to the pool, and how much load we have on the database in general - how responsive the DB server is).
I can easily imagine scenarios where - for example - some workers will get periodic bursts of timed-out connections for all their threads when other workers start to do a lot of database transactions - because those workers will just cross the timeout limit and all of them start failing. Or a periodic failure of random tasks without a reason we could track or mitigate. Those are the most difficult things to track.
I do not have a very concrete scenario that could be easily simulated - it's more from intuition and experiencing similar problems in the telecom industry years ago. We can try to mitigate it, for example by adding some randomness around the timeout values and other things, but I think there is a good reason why the default timeout is 8 hrs and not, say, 2 minutes. That makes me a little worried.
We might ask our customers to try this out once the whole thing is merged (including the HA Scheduler). But I think during the process I would like to understand how critical it is to have this timeout. Maybe we can try some other approaches if all that we want to protect against is stale locks. It seems to me a bit risky to potentially change the behaviour of all transactions when we want to protect against something that will happen very rarely.
For example, I could easily imagine we could run a periodic monitor to check if there are stale locks and selectively kill only those sessions. That might call for a solution which is much more database-specific, but it might be a less risky, more efficient approach.
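For illustration only - a rough sketch of the kind of periodic monitor suggested above, assuming a Postgres metadata DB; the connection URL, threshold and scheduling are hypothetical and not part of this PR:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://airflow@localhost/airflow")  # hypothetical connection


def kill_stale_idle_transactions(max_idle_seconds=300):
    """Terminate sessions that have sat 'idle in transaction' longer than the threshold."""
    with engine.connect() as conn:
        conn.execute(
            text(
                "SELECT pg_terminate_backend(pid) "
                "FROM pg_stat_activity "
                "WHERE state = 'idle in transaction' "
                "AND now() - state_change > make_interval(secs => :max_idle)"
            ),
            {"max_idle": max_idle_seconds},
        )
```

Run from a cron job or a small sidecar, something like this would kill only the offending sessions instead of changing the idle behaviour of every connection.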
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
As we switch over to using row-level locks more in the scheduler, it is better if we don't have transactions hang around being open "for ever", otherwise an errant process may end up holding on to a lock for a lot longer than we intend.
For example, on mysql the default lock wait timeout is 50s. We could have set `innodb_lock_wait_timeout` to a shorter value -- but that only affects the waiting side, not the locking side. So instead we set an idle timeout -- if the transaction is not doing anything we ask the DB server to kill it, ending the transaction and freeing any held row locks.
Part of #9630.