
Conversation


@ashb ashb commented Sep 14, 2020

As we switch over to using row-level locks more in the scheduler, it's
better if we don't have transactions hanging around open forever,
otherwise an errant process may end up holding on to a lock for a lot
longer than we intend.

For example, on MySQL the default lock wait timeout is 50s. We could
have set innodb_lock_wait_timeout to a shorter value -- but that only
affects the waiting side, not the locking side. So instead we set an
idle timeout -- if the transaction is not doing anything we ask the DB
server to kill it, ending the transaction and freeing any held row
locks.

Part of #9630.
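
For illustration, the approach could look roughly like the following - a minimal sketch only, where the setting name idle_session_timeout, the connection URL and the exact SET statement are assumptions rather than the code in this PR:

from sqlalchemy import create_engine, event

# Hypothetical config value (seconds); in Airflow this would come from airflow.cfg.
idle_session_timeout = 10.0

engine = create_engine("postgresql://airflow@localhost/airflow")

@event.listens_for(engine, "connect")
def set_idle_in_transaction_timeout(dbapi_connection, connection_record):
    # Ask Postgres to kill any session that sits idle inside an open transaction
    # for longer than the timeout, which rolls it back and releases its row locks.
    cursor = dbapi_connection.cursor()
    cursor.execute(
        "SET idle_in_transaction_session_timeout = %d" % int(idle_session_timeout * 1000)
    )
    cursor.close()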



@ashb ashb changed the title Ensure database connections don't hang around for ever Ensure idle database connections don't hang around for ever Sep 14, 2020
ashb (Member Author):

I didn't add this to airflow.cfg as I was thinking this was a hidden/advanced setting. WDYT?

Member:

So I will only know about it if I know the code or stumble upon it in some Stack Overflow post 😄 I would be in favor of adding it to airflow.cfg.

ashb (Member Author):

Fair.

potiuk (Member), Sep 14, 2020:

See my other comment - I believe this might have a profound impact on how the MySQL server behaves, especially because effectively each client will - in some cases - be continuously closing and opening connections, where previously they were kept open for up to 8 hours by default. So we REALLY need to describe it, and I think we have to understand the impact and describe it for the MySQL users who would like to enable it. I guess setting it by default (as it currently is) to a low value is rather a bad idea in those cases, so quite a bit more complex logic should be implemented here - for example, it could be set to a low value only when the HA scheduler is enabled (when we get to take a look at it).


ashb commented Sep 14, 2020

Note: this setting isn't required, but it gives us a bit more ability to reason about how long a lock might be held should something go wrong.


ashb commented Sep 14, 2020

Oh, Postgres 9.6 doesn't accept the "10s" form, that came later. (It wants 10000 -- the value in milliseconds.)
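
In concrete terms - a small sketch assuming a DBAPI cursor named cursor - the portable form is the bare integer:

# Accepted on newer Postgres servers, but rejected by 9.6 (per the note above):
cursor.execute("SET idle_in_transaction_session_timeout = '10s'")
# Accepted on 9.6 as well - a bare integer, interpreted as milliseconds:
cursor.execute("SET idle_in_transaction_session_timeout = 10000")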

cursor.close()

is_mariadb = engine.dialect.server_version_info and "MariaDB" in engine.dialect.server_version_info
if idle_session_timeout > 0:
Member:

Not sure if this is the best idea, but how about:

if idle_session_timeout <= 0:
    return

this way we can remove one level of indentation.

Comment on lines +81 to +82
if engine.dialect.name == "postgresql":
    if idle_session_timeout > 0:
turbaszek (Member), Sep 14, 2020:

Those ifs can be combined into one
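
For illustration, the combined form would look roughly like this (a sketch of the shape only, assuming engine and idle_session_timeout from the diff above):

if engine.dialect.name == "postgresql" and idle_session_timeout > 0:
    # register the idle-in-transaction timeout handler here
    ...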

@turbaszek turbaszek added the area:MetaDB (Meta Database related issues) and area:Scheduler (including HA (high availability) scheduler) labels Sep 14, 2020
Sets up event handlers.
"""

idle_session_timeout = conf.getfloat("core", "sql_alchemy_idle_transaction_timeout", fallback=10)
Member:

I guess we should add it to the default config/template?


# MySQL doesn't have the equivalent of postgres' idle_in_transaction_session_timeout -- the
# best we can do is just set the timeout for any idle connection
@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
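
For reference, a minimal sketch of what the body of such a handler might do - assuming the session-level wait_timeout variable is the knob being set, and engine / idle_session_timeout from the surrounding code; the actual body in the PR may differ:

from sqlalchemy import event

@event.listens_for(engine, "connect")
def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
    # MySQL closes a connection that has been idle for longer than wait_timeout
    # seconds, whether or not it has an open transaction, so this caps how long
    # an idle session can hold row locks.
    cursor = dbapi_connection.cursor()
    cursor.execute("SET SESSION wait_timeout = %d" % int(idle_session_timeout))
    cursor.close()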
Member:

Do you think (from observation) it might have an undesirable effect for MySQL? From what I understand, the side effect of this change is that when workers are idle, they will close and re-open connections continuously and frequently. From what I know this is much more problematic for Postgres (because closing/reopening the connection spawns a new process and is a costly event). For Postgres this is mitigated by PgBouncer, and with Postgres' transaction idle timeout it will not really happen anyway - it's only some long, hanging transaction that will be impacted.

For MySQL, I also understand that such a wait timeout will close any connection that has no activity rather quickly (including those that have an open transaction)?

Is HA introducing some long hanging transactions in the long term that could impact this? Are you aware of some long transactions already present in Airflow?

Since the default value for MySQL is 28,800 seconds (8 hours), I believe this one might have a tremendous impact on the behavior of MySQL in some cases.

What is the effect on HA if we do not set this timeout to a long value? It's not yet visible, as we see only this PR for now, but I would love to understand what kind of problems we can expect.

ashb (Member Author), Sep 14, 2020:

Yeah, I was of two minds on this whole PR. As I say up above, this is not about the performance of HA, but "what happens if a transaction goes wild and hangs on to a lock -- how long should we wait?"

(HA PR coming -- I am doing the work to pull out as much as possible into separate PRs, as that one is already going to be huge enough.)

The default for both Postgres and MySQL(-likes) is that the locking process/connection can wait forever (I think MySQL's default is 1 year. I'm not kidding).

10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.

Member:

> 10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.

Have you tested HA with MySQL at large scale?

potiuk (Member), Sep 14, 2020:

Yeah. We will definitely have to run some tests on MySQL.

I am also a bit afraid about the performance impact of this when the database is remote and SSL is enabled, for example - the effect of this change might be that pretty much every transaction will start paying the handshake penalty. This is probably negligible for Postgres/PgBouncer, but for MySQL, and Cloud SQL in particular, it might become a real problem. For now, setting the default to the original value when there is no HA should be OK I think, and I'd love to see the locking mechanisms in the HA scheduler later and see how big an impact it will have.

ashb (Member Author):

I have done some testing on MySQL, but not enough (and not with SSL) to answer this question.

I will come back to this PR later (hopefully pre-2.0, but if that doesn't look possible I will change this to default to off), as this isn't on the critical path for AIP-15.

ashb (Member Author):

> 10s is perhaps a low default, and 5 minutes/1 hour might be a better default, certainly until we have the time to do more testing of the impact of this on MySQL.
>
> Have you tested HA with MySQL at large scale?

Largest test I did was with 30 workers, 4 schedulers and 100k TIs to run. Not sure if that counts as large or not.

ashb (Member Author):

Though thinking about it -- it likely wouldn't have hit this case very much.

Member:

I think the only way to properly test it is to engage our users with some test/staging systems where they can simulate and test with production-like loads. I think the nature of this makes it super difficult to test. The problem is that this is low-level behaviour that we have no way to control from the "higher-layer" logic, and it depends on factors that are pretty much random (i.e. how many transactions we run in parallel, whether some other threads were using the connections recently and returned them to the pool, and how much load we have on the database in general - how responsive the DB server is).

I can easily imagine scenarios where - for example - some workers will get periodic bursts of timed-out connections for all their threads when other workers start to do a lot of database transactions, because those workers will just cross the timeout limit and all of them start failing. Or a periodic failure of random tasks without a reason we could track or mitigate. Those are the most difficult things to track.

I do not have a very concrete scenario that could be easily simulated - it's more from intuition and experiencing similar problems in the telecom industry years ago. We can try to mitigate it, for example by adding some randomness around the timeout values and other things, but I think there is a good reason why the default timeout is 8 hrs and not, say, 2 minutes. That makes me a little worried.

We might ask our customers to try this out once the whole thing is merged (including the HA scheduler). But I think during the process I would like to understand how critical it is to have this timeout. Maybe we can try some other approaches if all we want is to protect against stale locks. It seems to me it is a bit risky to potentially change the behaviour of all transactions to protect against something that will happen very rarely.

For example, I could easily imagine running a periodic monitor that checks for stale locks and selectively kills only those sessions. That might call for a solution which is much more database-specific, but it might be a less risky, more efficient approach.
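
To make that last idea concrete, a rough Postgres-specific sketch of such a monitor - the threshold, function name and query are illustrative assumptions, not an agreed design:

from sqlalchemy import text

STALE_LOCK_SECONDS = 300  # hypothetical threshold

def kill_stale_idle_transactions(engine):
    # Terminate only sessions that have sat "idle in transaction" past the
    # threshold, instead of putting a timeout on every connection.
    # Postgres-specific; needs a role allowed to call pg_terminate_backend.
    with engine.connect() as conn:
        conn.execute(
            text(
                "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
                "WHERE state = 'idle in transaction' "
                "AND state_change < now() - make_interval(secs => :secs)"
            ),
            {"secs": STALE_LOCK_SECONDS},
        )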

@ashb ashb marked this pull request as draft September 14, 2020 15:48
@mik-laj mik-laj added the AIP-15 label Sep 14, 2020

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label (Stale PRs per the .github/workflows/stale.yml policy file) Nov 1, 2020
@stale stale bot closed this Nov 15, 2020
@ashb ashb deleted the txn-lock-timeout branch December 7, 2020 17:26