Avoid a classloader leak through a dangling statement cancel timer thread.#188
Avoid a classloader leak through a dangling statement cancel timer thread.#188metlos wants to merge 2 commits intopgjdbc:masterfrom
Conversation
There was a problem hiding this comment.
I just pulled this into my private repo and ran the tests they all failed with an NPE here.
I'm very curious why Travis-CI passes this but that is another issue
There was a problem hiding this comment.
The NPE is happening because the timer is lazy initialized when a timer task is added via addTimerTask(...). If no task is added, for example just open a connection then close it, it'll still be null when cancelTimer.cancel() is called on line 675.
There was a problem hiding this comment.
Yes, I did see the problem, wanted to verify that checking for a null
would fix it?
Dave Cramer
On 1 October 2014 12:55, Sehrope Sarkuni [email protected] wrote:
In org/postgresql/jdbc2/AbstractJdbc2Connection.java:
@@ -670,6 +672,7 @@ private void initObjectTypes(Properties info) throws SQLException
*/
public void close()
{
cancelTimer.cancel();The NPE is happening because the timer is lazy initialized when a timer
task is added via addTimerTask(...). If no task is added, for example
just open a connection then close it, it'll still be null when
cancelTimer.cancel() is called on line 675.—
Reply to this email directly or view it on GitHub
https://github.com/pgjdbc/pgjdbc/pull/188/files#r18291598.
|
Separate from the NPE issue, I'd like this a lot better if it was a single Timer registered for the entire use of the Driver. Otherwise each connection that registers a timer task will be creating a separate timer and an associated thread. Timers are thread safe and can/should be re-used. Why not have the Driver still maintain a single static Timer shared by all Connections and have it keep track of how many Connections are using it? The outline for it would be something like:
Both getTimer() and releaseTimer() would need to be synchronized as well though I don't think it'll impact performance much as it'll only be at connection creation/close. |
|
Seems cleaner to me. Do you have time to write the code ? Dave Cramer On 1 October 2014 13:06, Sehrope Sarkuni [email protected] wrote:
|
|
Yes I should be able to code it this week. If anybody else wants to take a stab at it first that's fine too. |
…read. When the driver class was loaded, it started a Timer thread that was used to cancel statements on timeout. This timer thread was never stopped. When the class is loaded in containerized environment (web container, ee container, ...) this might lead to eventual permgen depletion when the deployment with the driver is merely reloaded (which may involve reloading the classes of the deployment) repeatedly. The timer thread holds on to the protection domain of the code that created it and that protection domain holds on to the classloader that is used to load the classes. Through this chain of references, the dangling thread cause the classloader of the "previous" deployment to never be GC'ed and its classes unloaded. The fix is to track the usage of the timer by connections using reference counting. Connections acquire a "loan" of the timer when they're created and release it when they're closed. Note that exactly the same issue existed in MySQL JDBC driver where it was fixed in 2009: http://bugs.mysql.com/bug.php?id=36565
|
I reworked my patch to use a shared reference-counted timer. Thanks for the suggestions and feel free to take it or leave it. The testsuite still has 1 failure and 1 error but those seem totally unrelated to my changes. The error probably is just caused by my improper test setup - it errors out on auth in |
|
Travis, while succeeding, reports 3 completely different failures than the ones I saw locally. But those were reported for previous commits, too, so this at least doesn't introduce any new failure in Travis ;) |
|
While I'm not averse to the refcounting based approach, I wonder if it's needed, and I'm concerned that it may not be reliable. Java offers no guarantees that anything will be finalized. There are no hard guarantees that Usually it's better to solve issues with classloader leaks by identifying the reference cycle and breaking it by introducing a suitable weak reference. In this case, I think what we should probably be doing is using the JDBC4 service unload hook to terminate the timer, as the container should be invoking that reliably. You might find this stack overflow post I raised some time ago interesting, as it outlines the issues in depth and has some very useful responses. |
|
The issue here was that the cancellation thread lived LONGER than the lifetime of the driver. While all you say is true, I think that cleaning up after the driver when everything is released/closed properly is at least a great start. Also, the fact that there are no guaranties of calling I'm not sure how weak references would help us in this case, because there needs to be an explicit call to IMHO, the reference-counted approach is a good compromise. It minimizes the amount of resources needed to keep that timer and also prevents the classloader leak when the driver is used in accordance to the guidelines. Note that this fix does NOT rely on finalization. |
outside of the SharedTimer class. This is to prevent misuse and possible NPEs resulting from that misuse.
|
After reading Craig's message and the SO post (nice post btw!) I've got a slightly different idea. Why not allow the lifecycle of the Timer to be controlled by the caller? Since the Timer is statically initialized when the Driver class is loaded, the only "hook" I can think of is a JVM system property. Something like I don't particularly like using system properties like this for hooks but I don't see much other choice if this is going to be managed at the Driver level. There's no parameters that can be passed to the Driver itself. The other alternative would be to remove all reference to Timers from the Driver class, have the logic pushed in the Connection class, then add a Connection property (ex: What do you guys think? |
|
The system property approach would not be favorable in EE environments, IMHO. The former is dangerous because system properties are modifiable by anyone (unless forbidden by a security manager, using which is not that common), the latter is not favorable because the connection properties are specified by the user during datasource creation. Relying on the user to remember to put in the right property with the right value is not the right thing to do (tm) ;) Nor it is to have a special logic just for postgres jdbc driver inside the EE container to try and handle this property specially. I look at this from a slightly different perspective. JDBC drivers can leak. These leaks happen when you forget to close your connections or results sets or whatever. Everybody is aware of that and great care is usually taken to avoid those leaks. The default behavior of the driver, when used correctly, should not cause any leaks (and I am not sure requiring specific (system|connection) properties to avoid a leak can be considered a default behavior). Both of the solutions I proposed satisfy that (with the latter being largely nicer, thanks to your suggestions). When you forget to close your connections, you kinda expect there to be leaks. The situation would be no different with this fix in place but instead of just the normal connection leak, you would also have a classloader leak. Nevertheless, both of them would disappear the moment you clean up correctly. That's why I favor this approach. It provides a clean default behavior under the assumption that the driver is used correctly. When used incorrectly, bad things can happen, but hardly anyone can be surprised by that ;) |
|
Couple things I don't like in this patch:
I wrote an alternative implementation that handles these. A PR for it is available here: #197 |
Fair enough, I don't know what's public/private in posgres.
It is made so that it is not possible to have unbalanced loan/release calls (or rather more release() than loan() calls). This is possible with #197 but since this is internal API, it might not be that important to be that defensive.
True, that might be improved.
I'm very glad that you had a look :) I suggest closing this PR and continue on #197 if for nothing else then for the fact that it has better test coverage (with which I honestly struggled, not knowing the codebase). |
|
Per discussion, closing in favour of #197. Thanks for looking into this. |
When the driver class was loaded, it started a Timer thread that was used
to cancel statements on timeout. This timer thread was never stopped.
When the class is loaded in containerized environment (web container,
ee container, ...) this might lead to eventual permgen depletion when the
deployment with the driver is merely reloaded (which may involve reloading
the classes of the deployment) repeatedly.
The timer thread holds on to the protection domain of the code that created
it and that protection domain holds on to the classloader that is used to
load the classes. Through this chain of references, the dangling threads
cause the classloader of the "previous" deployments to never be GC'ed and
their classes unloaded.
The fix is to have the cancel timer thread per-connection and tear it down
when closing the connection.
Note that exactly the same issue existed in MySQL JDBC driver where it was
fixed in 2009:
http://bugs.mysql.com/bug.php?id=36565