
Conversation

@shelhamer
Member

Share the columnation buffer for im2col / col2im transformations across all Caffe convolution layers. The memory usage is now equal to the maximum buffer size instead of the sum over all layers. In particular, this is useful for many-layered architectures like the VGG ILSVRC14 19-layer model.
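For intuition, here is a minimal stand-in (hypothetical class and names, not the actual Caffe code) for how a single static buffer shared by all instances keeps the footprint at the largest single request:

#include <cstddef>
#include <vector>

// Hypothetical sketch: every layer instance reuses one static buffer, so the
// total memory equals the maximum request rather than the sum over layers.
// Sharing one buffer is also why concurrent convolutions are ruled out.
class ConvLayerSketch {
 public:
  // Grow the shared buffer only if this layer needs more than any previous one.
  float* col_buffer(std::size_t count) {
    if (count > buffer_.size()) buffer_.resize(count);
    return buffer_.data();
  }

 private:
  static std::vector<float> buffer_;  // one allocation shared by all instances
};

std::vector<float> ConvLayerSketch::buffer_;

The patch itself applies the same idea to the layers' Blob-backed column buffer by making it a static member.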

Advice and Cautions:

  • This is worth it for fully-convolutional models where Caffe convolution is faster than cuDNN.
  • No parallelism. Only a single net can do forward / backward at a time: since the buffer is shared by all layers within and across nets, no two convolutions can run in parallel. (A fix for parallel nets is to make the buffer a member of Net. DAG parallelism within a net is still out in that case, but that isn't currently parallelized anyway.)
  • This has no effect on cuDNN convolution, but that consumes less memory anyway.

All credit to @longjon who reshaped our world in #594 and suggested this patch in #520 (comment).

Do not merge.

@sguada
Contributor

sguada commented Oct 16, 2014

@longjon sweet! Very nice and small change compared to #520.

@shelhamer
Member Author

@sguada this simplification to sharing could still be paired with a buffer owned by Net to solve the parallel execution issue. It will be a much simpler patch thanks to @longjon.

@longjon
Contributor

longjon commented Oct 16, 2014

Re: moving the shared buffer to Net, do note that this is a tradeoff; it allows net concurrency at the cost of memory when nets are used sequentially. For example, this patch currently shares column buffer memory between all train and test nets during solving. (Also note that this will still break layer concurrency, when added in the future, either way.)

@shelhamer
Member Author

This might cause the following post-optimization crash

I1020 06:44:01.849261  3396 caffe.cpp:121] Optimization Done.
F1020 06:44:02.442028  3396 syncedmem.cpp:16] Check failed: error == cudaSuccess (29 vs. 0)  driver shutting down
*** Check failure stack trace: ***
    @     0x7fc406272daa  (unknown)
    @     0x7fc406272ce4  (unknown)
    @     0x7fc4062726e6  (unknown)
    @     0x7fc406275687  (unknown)
    @           0x499d99  caffe::SyncedMemory::~SyncedMemory()
    @           0x4aa862  boost::detail::sp_counted_impl_p<>::dispose()
    @           0x4b3499  caffe::Blob<>::~Blob()
    @     0x7fc401b5f149  (unknown)
    @     0x7fc401b5f195  (unknown)
    @     0x7fc401b44ecc  (unknown)
    @           0x4167a7  (unknown)
    @              (nil)  (unknown)

but I don't have time to investigate at the moment so I'm just noting it here.

@longjon
Contributor

longjon commented Oct 20, 2014

Yes, I've noticed that as well. I don't immediately see anything in our code that would cause that, so my guess is that CUDA has some static variables that are being destroyed before the shared buffer.

@futurely

There is almost no documentation about the behavior of static variables in CUDA. Another solution is to create a singleton class containing the shared buffer.

// http://stackoverflow.com/questions/1008019/c-singleton-design-pattern
template <typename Dtype>
class SharedBuffer {
 public:
  static SharedBuffer& instance() {
    static SharedBuffer instance;  // Guaranteed to be destroyed.
                                   // Instantiated on first use.
    return instance;
  }

  shared_ptr<Blob<Dtype> > buffer() { return buffer_; }

 private:
  SharedBuffer() : buffer_(new Blob<Dtype>()) {}
  // Copy and assignment are disabled so no accidental copies of the
  // singleton can appear.
  DISABLE_COPY_AND_ASSIGN(SharedBuffer);

  shared_ptr<Blob<Dtype> > buffer_;
};
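A layer could then fetch and size the shared buffer through the singleton along these lines (hypothetical helper; the reshape arguments mirror how a per-layer column buffer is sized):

// Hypothetical call site, assuming the SharedBuffer class above.
template <typename Dtype>
Dtype* shared_col_data(int channels, int kernel_h, int kernel_w,
                       int height_out, int width_out) {
  shared_ptr<Blob<Dtype> > col = SharedBuffer<Dtype>::instance().buffer();
  // Reshape to this layer's needs; the underlying allocation only ever grows.
  col->Reshape(1, channels * kernel_h * kernel_w, height_out, width_out);
  return col->mutable_cpu_data();
}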

@longjon
Contributor

longjon commented Oct 22, 2014

@futurely I don't think that solves the problem; when would the static SharedBuffer instance be destroyed (thus destroying buffer_)?

One solution is to explicitly destroy the shared buffer with a clean-up function. But that's somewhat irritating, and requires an extra line to be added to all Caffe invocations.
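For example (a hedged sketch with made-up names, using a plain vector as a stand-in for the column Blob), the clean-up could just reset a shared_ptr before exit so nothing GPU-related is left to static destruction:

#include <vector>
#include <boost/shared_ptr.hpp>

// Hypothetical stand-in for the statically shared column buffer.
static boost::shared_ptr<std::vector<float> > shared_col_buffer(
    new std::vector<float>());

// Explicit tear-down: release the buffer while the CUDA runtime is still
// alive. The irritation is that every Caffe invocation must remember to
// call this before returning from main().
void CleanupSharedColBuffer() {
  shared_col_buffer.reset();
}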

@futurely

The static instance itself should also be a shared_ptr.

// http://stackoverflow.com/questions/1008019/c-singleton-design-pattern
class SharedBuffer {
 public:
  static shared_ptr<SharedBuffer> instance() {
    // Guaranteed to be destroyed. Instantiated on first use.
    static shared_ptr<SharedBuffer> instance(new SharedBuffer());
    return instance;
  }
  ...
};

@longjon
Contributor

longjon commented Oct 24, 2014

@futurely, I don't see how using a shared_ptr is any different... the lifetime of the SharedBuffer object is the same as the lifetime of the shared_ptr, right?

However, I think I see the point of the original code now, which might actually address the issue... if the static variable is method-local, it doesn't get allocated until the method is called, and therefore gets destroyed before any global static variables. (If I understand correctly, then, there's no need for a singleton class, just for an accessor function.)
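A sketch of that accessor-function idea (hypothetical function name; assumes Caffe's Blob):

#include "caffe/blob.hpp"

// Hypothetical accessor: the function-local static is not constructed until
// the first call, and statics are destroyed in reverse order of construction,
// so this buffer is torn down before statics that were initialized earlier at
// program start. No singleton class needed, just this function.
template <typename Dtype>
caffe::Blob<Dtype>& SharedColBuffer() {
  static caffe::Blob<Dtype> buffer;
  return buffer;
}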

I'm not going to bother updating an unmergeable PR to fix a post-optimization crash, but we can keep that trick in mind for the future.

@futurely

There is still a chance to save this PR. It's all about the lifetime of static variables.

@longjon
Contributor

longjon commented Oct 24, 2014

Yeah... that's what we're discussing here.

To be clear, the shared column buffer is still planned for merge; it just needs to be an option to avoid breaking concurrency. Until that's done, there's no point in fixing the static destruction order issue, but let's keep it in mind for when it's ready (if it's still relevant then).

@futurely

futurely commented Dec 5, 2014

ArrayFire simply ignored this error.
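Presumably that means treating the shutdown error as benign. A hedged sketch of that idea, matched to the log above (not what Caffe's SyncedMemory actually does):

#include <cuda_runtime.h>
#include <glog/logging.h>

// Hypothetical variant of the destructor-time check: accept the "driver
// shutting down" code (cudaErrorCudartUnloading, the 29 in the log above)
// instead of treating it as fatal, which is roughly what "ignored this
// error" amounts to.
inline void SafeCudaFree(void* gpu_ptr) {
  cudaError_t err = cudaFree(gpu_ptr);
  CHECK(err == cudaSuccess || err == cudaErrorCudartUnloading)
      << "cudaFree failed: " << cudaGetErrorString(err);
}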

Anatoly Baksheev and others added 21 commits February 22, 2015 18:58

  • … was overwritten with symlink created at build time and installed with install(DIRECTORY ...)
  • … systems).
  • This commit specifies Python2 with which cpp_lint.py works :-)
  • APPLE was misspelled in Line 27
  • fixes: cpp_lint.py fails silently with Python3
  • Making python3 work with cmake and the new python wrapper
  • Commands, such as $(error ...), are not allowed to be indented with tabs outside of targets, throwing an error instead of outputting the actual error. The solution is to use innocuous spaces instead. Ideally, spaces should be used everywhere outside targets, but since make does not mind it if variable assignments are tab-indented outside targets, a complete overhaul is not necessary. However, if more errors are added, it might make more sense to be consistent. Also, make will already add a period so I removed it.
  • fix accelerate / veclib path for OS X 10.10
  • Replaced illegal tab in Makefile with spaces.
  • The sample was missing some additional spaces to be correctly rendered on the HTML. The mistake was mine.
  • Decoding the datum before feeding it into the reshaping data layer
  • Small fix (visualization) on SLICE layer's documentation
@shelhamer shelhamer force-pushed the share-col-buffer branch 2 times, most recently from 9aea056 to 0d052dd on March 3, 2015 02:10
share the im2col / col2im buffers among convolution + deconvolution
layers by making the buffer a static member.

@longjon deserves all the credit for the reshaping in BVLC#594 and this patch.
@shelhamer
Member Author

Replaced by #2016.

@shelhamer shelhamer closed this Mar 3, 2015
@naibaf7 naibaf7 mentioned this pull request Jun 26, 2015