Fix for Remote port forwarding buffers can grow without limits (issue #658) #913
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR addresses #658.
Problem
The
Bufferclass can grow out of control when its consumer is unable to read data quickly enough to keep up with the producer for an extended period of time. This results in an out of memory (OOM) condition during ssh port forwarding. This issue has been hit in production at Delphix many times.Diagnosis
When
ChannelInputStreamreceives new data from its producer, it callsBuffer.putRawByteswhich will double its internal array if there is not enough room to store data starting at its current write positionwpos. Meanwhile, when the consumer reads data fromChannelInputStream, the buffer will advance its read positionrposto the next unread byte. If, and only if, the read position catches up with the write position,ChannelInputStreamwill callBuffer.clearto reset both positions to0.However, if for an extended amount of time the consumer is slower than the producer and the latter produces a large amount of data, the buffer will continue to double itself until it cannot fit in memory anymore, causing an OOM.
Solution
The key issue with this behavior of
Bufferis that the amount of data betweenrposandwposcan be relatively small compared to the space in the array beforerpos, which is being wasted. Yet, as long asrposdoesn't catch up withwpos(i.e. as long asBuffer.cleardoes not get called),Bufferwill continue to add new data afterwposand grow the array past its current end -- no matter how much space beforerposis available.This PR proposes to use a circular buffer instead, so that the space before
rposis no longer wasted and the buffer no longer needs to grow beyond any limits while utilizing only a fraction of the necessary space. Also, the circular buffer can grow just likeBufferwhen more capacity is needed, but now it throws an exception when a configurable size limit is reached instead of potentially triggering an OOM (a scenario that is a lot less likely anyway because the space beforerposis now being utilized).At first, I thought of refactoring the entire
Bufferclass to make it circular, but there are too many uses ofBufferin the code base, too many operations in the interface ofBufferwhose implementation would need to change, and perhaps more critically too many instances of the internaldata,rposandwposfields being exposed via getters to outside manipulation (e.g.PacketReader.readPacketwhich callsBuffer.arrayto get the internaldataarray) to make this practical and safe from side effects.My proposed solution instead is to focus on the
ChannelInputStreamuse case ofBuffer, which implements only a subset of allBuffermethods, and create a newCircularBufferclass that implements only whatChannelInputStreamneeds, while solving the race condition that causes this OOM. As far as I can tell,ChannelInputStreamappears to be the only use ofBufferwhere this race condition can happen. We use SSHJ at Delphix for all SSH needs and we have only hit this issue in that use case. (If other parts of SSHJ turn out to be affected, we can always expandCircularBufferlater to be used there as well.)Testing in Controlled Environment
In a controlled environment, the issue can be reproduced by placing a debugger breakpoint at
ChannelInputStream.readto temporarily stop the consumption of data (which movesrposcloser towpos) whileChannelInputStream.receivecontinues to receive new data (which moveswposfarther fromrposand eventually forces buffer expansions).After this fix, we can see with this breakpoint that the underlying array is being fully utilized and that it doesn't get expanded until there is truly no more space in it.
Just for reference, I include in this diff the test
RemotePFPerformanceTestwhich allowed me to reproduce the issue locally. It's disabled, however, and I might just remove it in the end, because of its heavy impact on the performance of the system running it.Testing in Production
In production, we hit this issue by using port forwarding to read a very large file on one end and write it to disk on the other end, usually when disk writes are slowed down by other processes also writing to it.
After applying this patch on top of SSHJ, we have been running the fix for almost three years at Delphix (including re-applying it on top of each new release of SSHJ). We haven't hit this issue again and we haven't experienced any negative side effects.