Fix Timer thread classloader leak#197
Conversation
Rename cancelTimer to cancelTimerTask to clarify usage. There is already an actual Timer object named cancelTimer in Driver. Also rename the killTimer() method in AbstractJdbc2Statement to killTimerTask() to clarify it's usage.
Change all references to addTimerTask(...) and purgeTimerTasks() to go through the Connection object. This removes any direct references to the TimerTask related methods in Driver from the rest of the driver. The new addTimerTask(...) and purgeTimerTask() methods in AbstractJdbc2Connection still use the shared static Timer in Driver.
Previously, when the driver is loaded a static Timer is allocated in the Driver class itself. That Timer is shared by any TimerTask that gets scheduled (ex: for Statement timeout handling). If the driver is later unloaded (e.g. in a containerized environment) the Timer thread is never stopped. Additionally, the Timer thread itself may maintain references to other driver classes that will never get unloaded. The changed classes now use a SharedTimer that maintains a reference count of requests for Timer. A Connection requests a Timer via the getTimer() method. This increments a reference count to keep track of the number of outstanding references. The Timer reference is marked as returned by calling releaseTimer() in Connection.close(). If there are no outstanding references left, the internal shared Timer object is closed so it's resources can be GCed. Well behaved JDBC code that calls close() on all Connections should be able to unload the Driver without leaving dangling references.
|
On 10/05/2014 11:36 PM, Sehrope Sarkuni wrote:
Thanks for keeping it well structured.
I like this approach a lot more, though I'm still concerned about direct
It'll be important to test how this behaves when an exception is thrown
It may prove easier to use Arquillian to do this against an embedded
I'll try, time permitting. Things are busy. Craig Ringer http://www.2ndQuadrant.com/ |
Yes I don't particularly like manually managing Timers like this but The Timer API usage was already there. The difference now is that we're cleaning up behind ourselves when we're done.
Looks like you're right on this one, an exception causes the Timer to stop running. See the line "If f the timer's task execution thread terminates unexpectedly..." in the Javadocs: http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html Maybe we should use a ScheduledThreadPoolExecutor. It's Java 1.5+ only but I suppose that's okay as at this point even 1.6 is out of support. Anybody that wants to use this new resource cleanup feature better be running something supported. Alternatively (or in addition to that), we could have an Executor for running the actual cancellation tasks so they happen in separate Threads. This goes down the rabbit hole of doing more thread management directly in the driver but I think it may be a better approach. With a single thread handling cancellations, you can only cancel one query at a time. Cancellations are meant to be quick but if something slows it down (e.g. server down or networking issue) then it'll hang and delay all other query cancellations. In the common case where an app is connecting to a single database this isn't much of an issue; i.e. if one can't connect, the rest probably can't either. However, if you're connecting to separate databases with a single shared Timer thread for cancellation then one bad connection will starve all the rest.
No rush on this for me. My involvement in helping fix this is because I want it done right, not because I want it done soon. |
In one of my previous attempts, I had a simple test that had such custom classloader. It tried to load classes of the driver several times (using this custom classloader that always loaded classes from org.postgresql regardless of whether the parent classloader had them or not). I then naively tried to create so much churn on the heap so as to force the GC to collect the drivers and then checked if indeed they all disappeared (which would not be the case with the eagerly created Timer thread as it is in the current codebase). If you find it useful, you can reuse/adapt it to your needs. I ressurected the code to: https://gist.github.com/metlos/59c9cf891482f14d1784 |
+1. The number of threads in that thread pool could be configured using a sys prop with some conservative default. Is there any consequence of 2 cancellations running in parallel on a single connection? |
There was a problem hiding this comment.
Why is this field volatile? Since it is only ever accessed from synchronized methods, there should be no need for that?
|
Yes if everything is marked synchronized then the |
|
Any further updates on this? |
|
I haven't had a chance yet to add betters tests for this as it's a bit more complicated then other tests (because of the external classloader). The small test I have there is better than nothing but something more extensive would be nice. @metlos: If you're still interested in getting this merged do you think you could come up with something that simulates your use case? |
|
My usecase was quite well described by https://gist.github.com/metlos/59c9cf891482f14d1784 as mentioned in one of my previous comments. We discovered this issue in http://rhq.jboss.org where we have a plugin for monitoring Postgres. This plugin lives in a sort of container with a child-first classloader so that individual plugins are isolated. On certain occasions the container stops, reloads and restarts all the plugins (without actually stopping the JVM). After a couple of such occasions we've noticed the increased permgen due to classes not being unloaded and tracked it down to this issue. |
|
I added another test based upon the one in the linked gist. I changed it a bit to:
... the problem is it doesn't quite work. The Driver instances themselves are getting GCed so I'm fairly certain the new shared timer code is working fine. The problem is the Driver classes themselves don't get GCed. I tried running the test stand alone (rather than with everything else on the classpath) and I get the same result. I think it may be because the JVM doesn't actually GC classes (vs objects)... they stay in PermGen. I've been testing it with v1.7 on OSX and have been fiddiling with some JVM -XX flags but no luck. Any ideas? I haven't updated the PR with the new test as the bulk of it is debug statements from me experimenting with it. You can find it here: https://gist.github.com/sehrope/3b4e11124f6e27d9e680 You can run it via: (might need to replace the port or add a host if your're not testing locally) |
|
I commented on your gist with my theory for why the GC of classes is not working for you. |
|
Creating extra objects doesn't make a difference. I get the same results when I add the for loop. |
Fix Timer thread classloader leak Thanks @sehrope !
This patch is an alternative to #188. It fixes the ClassLoader leak when the driver is repeatedly loaded/unloaded (e.g. in a containerized environment).
The PR is split into three commits. The first two don't have any functional changes to the driver, they just rename some variables to clarify their usage and move TimerTask related method calls from Statement to go through the Connection rather than statically against the Driver.
The third commit adds a reference counting SharedTimer with two methods,
getTimer()andreleaseTimer(). The former creates a Timer on demand and the latter calls.cancel()on the underlying Timer if the reference count goes to zero (to allow it to be GCed) before clearing it to null.One nice side effect of this patch is that if a cancellation task is never scheduled, then no Timer or associated Timer thread is ever created.
I added one basic test that creates two connections, two statements, and issues two cancels to test that only a single Timer is created and subsequently deallocated. It and the the rest of the driver tests run successfully.
I think some additional tests are needed, particularly one that simulates loading/unloading the driver via a custom ClassLoader. It would need to run separately from the rest of the driver unit tests as it'll need to run without the driver on the classpath (as it'll load it dynamically). If nobody else gets around to this I'll see if I can write one some time next week.
I'd also like if someone could walk through the Timer allocation/deallocation steps to verify the happens-before and memory semantics of it. Again, all the tests pass but another pair of eyes would be appreciated.