Skip to content

Avoid a classloader leak through a dangling statement cancel timer thread.#188

Closed
metlos wants to merge 2 commits intopgjdbc:masterfrom
metlos:cl-leak
Closed

Avoid a classloader leak through a dangling statement cancel timer thread.#188
metlos wants to merge 2 commits intopgjdbc:masterfrom
metlos:cl-leak

Conversation

@metlos
Copy link
Copy Markdown

@metlos metlos commented Sep 18, 2014

When the driver class was loaded, it started a Timer thread that was used
to cancel statements on timeout. This timer thread was never stopped.

When the class is loaded in containerized environment (web container,
ee container, ...) this might lead to eventual permgen depletion when the
deployment with the driver is merely reloaded (which may involve reloading
the classes of the deployment) repeatedly.

The timer thread holds on to the protection domain of the code that created
it and that protection domain holds on to the classloader that is used to
load the classes. Through this chain of references, the dangling threads
cause the classloader of the "previous" deployments to never be GC'ed and
their classes unloaded.

The fix is to have the cancel timer thread per-connection and tear it down
when closing the connection.

Note that exactly the same issue existed in MySQL JDBC driver where it was
fixed in 2009:
http://bugs.mysql.com/bug.php?id=36565

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pulled this into my private repo and ran the tests they all failed with an NPE here.

I'm very curious why Travis-CI passes this but that is another issue

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NPE is happening because the timer is lazy initialized when a timer task is added via addTimerTask(...). If no task is added, for example just open a connection then close it, it'll still be null when cancelTimer.cancel() is called on line 675.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did see the problem, wanted to verify that checking for a null
would fix it?

Dave Cramer

On 1 October 2014 12:55, Sehrope Sarkuni [email protected] wrote:

In org/postgresql/jdbc2/AbstractJdbc2Connection.java:

@@ -670,6 +672,7 @@ private void initObjectTypes(Properties info) throws SQLException
*/
public void close()
{

  •    cancelTimer.cancel();
    

The NPE is happening because the timer is lazy initialized when a timer
task is added via addTimerTask(...). If no task is added, for example
just open a connection then close it, it'll still be null when
cancelTimer.cancel() is called on line 675.


Reply to this email directly or view it on GitHub
https://github.com/pgjdbc/pgjdbc/pull/188/files#r18291598.

@sehrope
Copy link
Copy Markdown
Member

sehrope commented Oct 1, 2014

Separate from the NPE issue, I'd like this a lot better if it was a single Timer registered for the entire use of the Driver. Otherwise each connection that registers a timer task will be creating a separate timer and an associated thread. Timers are thread safe and can/should be re-used.

Why not have the Driver still maintain a single static Timer shared by all Connections and have it keep track of how many Connections are using it?

The outline for it would be something like:

  • Add a static getTimer() function to the Driver that returns the existing Timer or creates a new one. In either case it bumps a reference counter.
  • Add a static releaseTimer() function to the Driver that decrements the reference count. If it drops to zero, then free the underlying Timer object.
  • On connection close, call releaseTimer()

Both getTimer() and releaseTimer() would need to be synchronized as well though I don't think it'll impact performance much as it'll only be at connection creation/close.

@davecramer
Copy link
Copy Markdown
Member

Seems cleaner to me.

Do you have time to write the code ?

Dave Cramer

On 1 October 2014 13:06, Sehrope Sarkuni [email protected] wrote:

Separate from the NPE issue, I'd like this a lot better if it was a single
Timer registered for the entire use of the Driver. Otherwise each
connection that registers a timer task will be creating a separate timer
and an associated thread. Timers are thread safe and can/should be re-used.

Why not have the Driver still maintain a single static Timer shared by all
Connections and have it keep track of how many Connections are using it?

The outline for it would be something like:

  • Add a static getTimer() function to the Driver that returns the
    existing Timer or creates a new one. In either case it bumps a reference
    counter.
  • Add a static releaseTimer() function to the Driver that decrements
    the reference count. If it drops to zero, then free the underlying Timer
    object.
  • On connection close, call releaseTimer()

Both getTimer() and releaseTimer() would need to be synchronized as well
though I don't think it'll impact performance much as it'll only be at
connection creation/close.


Reply to this email directly or view it on GitHub
#188 (comment).

@sehrope
Copy link
Copy Markdown
Member

sehrope commented Oct 1, 2014

Yes I should be able to code it this week.

If anybody else wants to take a stab at it first that's fine too.

…read.

When the driver class was loaded, it started a Timer thread that was used
to cancel statements on timeout. This timer thread was never stopped.

When the class is loaded in containerized environment (web container,
ee container, ...) this might lead to eventual permgen depletion when the
deployment with the driver is merely reloaded (which may involve reloading
the classes of the deployment) repeatedly.

The timer thread holds on to the protection domain of the code that created
it and that protection domain holds on to the classloader that is used to
load the classes. Through this chain of references, the dangling thread
cause the classloader of the "previous" deployment to never be GC'ed and
its classes unloaded.

The fix is to track the usage of the timer by connections using reference
counting. Connections acquire a "loan" of the timer when they're created
and release it when they're closed.

Note that exactly the same issue existed in MySQL JDBC driver where it was
fixed in 2009:
http://bugs.mysql.com/bug.php?id=36565
@metlos
Copy link
Copy Markdown
Author

metlos commented Oct 2, 2014

I reworked my patch to use a shared reference-counted timer. Thanks for the suggestions and feel free to take it or leave it.

The testsuite still has 1 failure and 1 error but those seem totally unrelated to my changes. The error probably is just caused by my improper test setup - it errors out on auth in testConcurrentReplace. The failing test is testTimezoneWithSeconds for which I fail to see how it could be caused by the changes. But then I am no expert in anything JDBC ;)

@metlos
Copy link
Copy Markdown
Author

metlos commented Oct 2, 2014

Travis, while succeeding, reports 3 completely different failures than the ones I saw locally. But those were reported for previous commits, too, so this at least doesn't introduce any new failure in Travis ;)

@ringerc
Copy link
Copy Markdown
Member

ringerc commented Oct 2, 2014

While I'm not averse to the refcounting based approach, I wonder if it's needed, and I'm concerned that it may not be reliable.

Java offers no guarantees that anything will be finalized. There are no hard guarantees that Connection.close() will get called on connections if an app that bundles PgJDBC is re-deployed. So we can't rely on the refcount reaching zero and triggering the driver to terminate the timer thread.

Usually it's better to solve issues with classloader leaks by identifying the reference cycle and breaking it by introducing a suitable weak reference.

In this case, I think what we should probably be doing is using the JDBC4 service unload hook to terminate the timer, as the container should be invoking that reliably.

You might find this stack overflow post I raised some time ago interesting, as it outlines the issues in depth and has some very useful responses.

@metlos
Copy link
Copy Markdown
Author

metlos commented Oct 2, 2014

The issue here was that the cancellation thread lived LONGER than the lifetime of the driver.

While all you say is true, I think that cleaning up after the driver when everything is released/closed properly is at least a great start.

Also, the fact that there are no guaranties of calling Connection.close() during re-deploy can be considered a bug and a leak in and of itself. This bug is different, the leak was always there, you didn't even have to establish any connection, just loading the driver was enough.

I'm not sure how weak references would help us in this case, because there needs to be an explicit call to Timer.cancel() somewhere in the codebase so that the timer is cleaned up. If it were (indirectly) used through a weak reference, it would just be GC'ed but the thread would live on - the timer is not finalizable and relying on finalization is a bad idea anyway, as you also correctly point out.

IMHO, the reference-counted approach is a good compromise. It minimizes the amount of resources needed to keep that timer and also prevents the classloader leak when the driver is used in accordance to the guidelines. Note that this fix does NOT rely on finalization.

outside of the SharedTimer class.

This is to prevent misuse and possible NPEs
resulting from that misuse.
@sehrope
Copy link
Copy Markdown
Member

sehrope commented Oct 2, 2014

After reading Craig's message and the SO post (nice post btw!) I've got a slightly different idea.

Why not allow the lifecycle of the Timer to be controlled by the caller?

Since the Timer is statically initialized when the Driver class is loaded, the only "hook" I can think of is a JVM system property. Something like PGJDBC_TIMER_FACTORY that, if set, would be used to create the Timer. Otherwise, the existing behavior would happen (i.e. new Timer(true)). If someone wants to pull it from JNDI or have a single Timer shared across multiple reloads of the Driver, they would deal with that themselves.

I don't particularly like using system properties like this for hooks but I don't see much other choice if this is going to be managed at the Driver level. There's no parameters that can be passed to the Driver itself.

The other alternative would be to remove all reference to Timers from the Driver class, have the logic pushed in the Connection class, then add a Connection property (ex: timerfactoryclass) that does the same thing. Even though this doesn't require a system property, it doesn't feel as right to me as solving this at the Driver level.

What do you guys think?

@metlos
Copy link
Copy Markdown
Author

metlos commented Oct 3, 2014

The system property approach would not be favorable in EE environments, IMHO.
And thinking of it, neither would be the connection properties approach.

The former is dangerous because system properties are modifiable by anyone (unless forbidden by a security manager, using which is not that common), the latter is not favorable because the connection properties are specified by the user during datasource creation. Relying on the user to remember to put in the right property with the right value is not the right thing to do (tm) ;) Nor it is to have a special logic just for postgres jdbc driver inside the EE container to try and handle this property specially.

I look at this from a slightly different perspective. JDBC drivers can leak. These leaks happen when you forget to close your connections or results sets or whatever. Everybody is aware of that and great care is usually taken to avoid those leaks.

The default behavior of the driver, when used correctly, should not cause any leaks (and I am not sure requiring specific (system|connection) properties to avoid a leak can be considered a default behavior).

Both of the solutions I proposed satisfy that (with the latter being largely nicer, thanks to your suggestions). When you forget to close your connections, you kinda expect there to be leaks. The situation would be no different with this fix in place but instead of just the normal connection leak, you would also have a classloader leak. Nevertheless, both of them would disappear the moment you clean up correctly.

That's why I favor this approach. It provides a clean default behavior under the assumption that the driver is used correctly. When used incorrectly, bad things can happen, but hardly anyone can be surprised by that ;)

@sehrope
Copy link
Copy Markdown
Member

sehrope commented Oct 5, 2014

Couple things I don't like in this patch:

  • PGConnection is the public interface for the API. The Timer related methods are internal and should not be exposed.
  • The memory semantics of the loan/release are a bit hard to follow.
  • The connection class calls .loan() when it is instantiated so it'll get a Timer regardless of whether it'll actually use it. It'd be nice if the Timer (and associated Thread) are not created at all if no TimerTask is ever scheduled.

I wrote an alternative implementation that handles these. A PR for it is available here: #197

@metlos
Copy link
Copy Markdown
Author

metlos commented Oct 6, 2014

PGConnection is the public interface for the API. The Timer related methods are internal and should not be exposed.

Fair enough, I don't know what's public/private in posgres.

The memory semantics of the loan/release are a bit hard to follow.

It is made so that it is not possible to have unbalanced loan/release calls (or rather more release() than loan() calls). This is possible with #197 but since this is internal API, it might not be that important to be that defensive.

The connection class calls .loan() when it is instantiated so it'll get a Timer regardless of whether it'll actually use it. It'd be nice if the Timer (and associated Thread) are not created at all if no TimerTask is ever scheduled.

True, that might be improved.

I wrote an alternative implementation that handles these. A PR for it is available here: #197

I'm very glad that you had a look :) I suggest closing this PR and continue on #197 if for nothing else then for the fact that it has better test coverage (with which I honestly struggled, not knowing the codebase).

@ringerc
Copy link
Copy Markdown
Member

ringerc commented Dec 1, 2014

Per discussion, closing in favour of #197. Thanks for looking into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants