Conversation

@jjlilley jjlilley commented Oct 11, 2019

Summary:
This change uses a small buffer in the Unpickler to avoid
calling reader_() byte-by-byte. In particular, the unpickler has a
tight loop reading 1-byte opcodes.

This can be more efficient because we avoid the variable-sized
memcpy (due to templating) and std::function indirection for the
common fast path.

This improves the unpickle-1m-ints benchmark by ~20%.

This change requires changing the std::function<> interface
to Unpickler to return size_t rather than bool, but there are
only a few uses of this API.
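To illustrate the idea, here is a minimal sketch of such a buffered reader (names, buffer size, and the `BufferedReader` class are illustrative, not the actual PyTorch code). The point is that the 1-byte opcode fast path becomes a buffer index bump, and `reader_()` is invoked once per buffer refill rather than once per byte; the reader also returns `size_t` (bytes read) rather than `bool`, matching the interface change described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of the small-buffer technique, assuming a
// reader_() callback that returns the number of bytes it produced.
class BufferedReader {
 public:
  explicit BufferedReader(std::function<std::size_t(char*, std::size_t)> reader)
      : reader_(std::move(reader)) {}

  // Fast path for the tight 1-byte opcode loop: usually just an index bump,
  // with no memcpy and no std::function call.
  char read1() {
    if (buffer_pos_ == buffer_remaining_) {
      // Slow path: refill the whole buffer with a single bulk reader_() call.
      buffer_remaining_ = reader_(buffer_, kBufferSize);
      buffer_pos_ = 0;
      assert(buffer_remaining_ > 0 && "unexpected end of input");
    }
    return buffer_[buffer_pos_++];
  }

 private:
  static constexpr std::size_t kBufferSize = 256;  // illustrative size
  std::function<std::size_t(char*, std::size_t)> reader_;
  char buffer_[kBufferSize];
  std::size_t buffer_pos_ = 0;        // next unread byte in buffer_
  std::size_t buffer_remaining_ = 0;  // valid bytes currently in buffer_
};
```

With a 256-byte buffer, a run of one million 1-byte opcode reads costs on the order of four thousand `reader_()` calls instead of one million, which is consistent with the benchmark improvement reported above.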

Test Plan: tests

Differential Revision: D17869980

fbshipit-source-id: fdfc6acea613ad8d7ec2dd70ceafa4cb79c0367a
@jjlilley jjlilley requested a review from apaszke as a code owner October 11, 2019 02:03
@pytorchbot pytorchbot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Oct 11, 2019
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D17869980

@jjlilley jjlilley requested review from resistor, suo and zdevito October 11, 2019 17:20
memcpy(&data[0], buffer_.data() + buffer_pos_, from_old_buf);
}
const size_t needed = length - from_old_buf;
size_t nread = reader_(&data[from_old_buf], needed);
Contributor
Shouldn't we do an over-read here? Otherwise won't the buffer usually be empty?

Contributor
This is because needed might be larger than the buffer size (the value can be a string of any length), so we cannot always read into the buffer; there may not be enough room. There is also a tradeoff: once the size is large, reading into the buffer and then copying into the string adds cost, while the relative overhead of the reader_ call is smaller.

Author
Right, the reason is the arbitrary length.
There seems to be a threshold (maybe a cache line or two?) below which the double-buffering cost is effectively negligible, but once the read grows too large, reading directly into the destination makes more sense.
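A sketch of the large-read path being discussed (struct and field names here are illustrative, not the exact PyTorch implementation): small reads drain the scratch buffer, and the remainder, which may exceed the buffer size, is read straight into the caller's storage, as in the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical buffered-unpickler state, assuming a reader_ callback
// that returns the number of bytes read.
struct UnpicklerBuf {
  std::function<std::size_t(char*, std::size_t)> reader_;
  char buffer_[64];                     // small scratch buffer
  std::size_t buffer_pos_ = 0;          // next unread byte in buffer_
  std::size_t buffer_remaining_ = 0;    // valid bytes in buffer_
};

// Copy whatever is left in the buffer, then read the (possibly large)
// remainder directly into the destination: `needed` can be arbitrarily
// large (e.g. a string payload), so it may not fit in the scratch buffer.
void readBytes(UnpicklerBuf& u, char* data, std::size_t length) {
  const std::size_t from_old_buf =
      std::min(length, u.buffer_remaining_ - u.buffer_pos_);
  if (from_old_buf > 0) {
    std::memcpy(data, u.buffer_ + u.buffer_pos_, from_old_buf);
    u.buffer_pos_ += from_old_buf;
  }
  const std::size_t needed = length - from_old_buf;
  if (needed > 0) {
    const std::size_t nread = u.reader_(data + from_old_buf, needed);
    if (nread != needed) {
      throw std::runtime_error("unexpected end of input");
    }
  }
}
```

This avoids the extra buffer-then-copy memcpy for large reads, while small reads still amortize the `reader_` call through the buffer.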

Contributor

@zdevito zdevito left a comment
Looks good, thanks!

@jjlilley
Author

Thanks for looking at this!

@facebook-github-bot
Contributor

This pull request has been merged in 7e8420b.

thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#27727

This change uses a small buffer in the Unpickler to avoid
calling reader_() byte-by-byte. Particularly, the unpickler has a
tight loop reading 1-byte opcodes.

This can be more efficient because we avoid the variable-sized
memcpy (due to templating) and std::function indirection for the
common fast path.

This improves the unpickle-1m-ints benchmark by ~20%.

This change requires changing the std::function<> interface
to Unpickler to return size_t rather than bool, but there are
only a few uses of this api.

Test Plan:
buck test caffe2/test/...
benchmark in experimental/jeremyl/c2/SerializationBench

Differential Revision: D17869980

fbshipit-source-id: 37e752744d19e12b7282252c8963355970bd4feb