reduced allocation v04 span representation#598
Conversation
93c4665 to
c7e268b
Compare
BenchmarksComparisonBenchmark execution time: 2024-09-19 15:25:19 Comparing candidate commit 7fc6dcf in PR branch Found 8 performance improvements and 34 performance regressions! Performance is the same for 9 metrics, 2 unstable metrics. scenario:benching deserializing traces from msgpack to their internal representation
scenario:credit_card/is_card_number/
scenario:credit_card/is_card_number/ 3782-8224-6310-005
scenario:credit_card/is_card_number/ 378282246310005
scenario:credit_card/is_card_number/37828224631
scenario:credit_card/is_card_number/378282246310005
scenario:credit_card/is_card_number/37828224631000521389798
scenario:credit_card/is_card_number/x371413321323331
scenario:credit_card/is_card_number_no_luhn/
scenario:credit_card/is_card_number_no_luhn/ 3782-8224-6310-005
scenario:credit_card/is_card_number_no_luhn/ 378282246310005
scenario:credit_card/is_card_number_no_luhn/37828224631
scenario:credit_card/is_card_number_no_luhn/378282246310005
scenario:credit_card/is_card_number_no_luhn/37828224631000521389798
scenario:credit_card/is_card_number_no_luhn/x371413321323331
scenario:normalization/normalize_name/normalize_name/bad-name
scenario:normalization/normalize_name/normalize_name/good
scenario:normalization/normalize_service/normalize_service/A0000000000000000000000000000000000000000000000000...
scenario:normalization/normalize_service/normalize_service/Test Conversion 0f Weird !@#$%^&**() Characters
scenario:normalization/normalize_service/normalize_service/[empty string]
scenario:redis/obfuscate_redis_string
scenario:sql/obfuscate_sql_string
scenario:tags/replace_trace_tags
CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
BaselineOmitted due to size. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #598 +/- ##
==========================================
+ Coverage 73.05% 73.19% +0.13%
==========================================
Files 252 254 +2
Lines 36072 36312 +240
==========================================
+ Hits 26352 26578 +226
- Misses 9720 9734 +14
|
b3b1fdc to
2626b37
Compare
|
I wonder whether we should put |
a3e38c1 to
f48e785
Compare
@bwoebi - That's a good idea. The NoAllocString is sufficiently coupled to tinybytes at the moment that it doesn't make sense to separate them. |
f48e785 to
bd5d7bb
Compare
bd5d7bb to
3caff9e
Compare
1a18a73 to
1de65da
Compare
|
@ekump (and reviewers) Please consider 386a684 first. It is quite a refactor on the ByteWrapper approach, which hopefully increases soundness and is probably also a bit more straightforward. (see also commit description) To view the full diff of this PR plus my commit, please look at https://github.com/DataDog/libdatadog/compare/bob/refactor-bytes-wrapper-away. |
|
There is a reason why |
|
@bantonsson What actually needs to be immutable is the underlying content, which it still is. This commit just allows manipulating the range into the underlying slice. Which in itself doesn't violate Sync/Send constraints (as you still need a |
|
@bwoebi You're indeed right about |
|
Okay, brought my commit to this branch after talking to Edmund :-) |
| pub struct Bytes { | ||
| ptr: *const u8, | ||
| len: usize, | ||
| slice: &'static [u8], |
There was a problem hiding this comment.
I think the {ptr, len} is better, since the slice is not really static.
I think it'd be pretty easy to misuse the definition.
On a sidenote, NonNull<u8> is better than *const since the type can then benefit from non null pointer optimization
There was a problem hiding this comment.
I do prefer (ptr, len), but it doesn't really work with as_mut_slice(). But I think 'static is fine - it's ultimately no different to a bare pointer in terms of lifetime, whenever you reassemble it into a slice.
| pub const fn len(&self) -> usize { | ||
| self.len | ||
| self.slice.len() | ||
| } | ||
|
|
||
| /// Returns `true` if the `Bytes` is empty. | ||
| #[inline] | ||
| pub const fn is_empty(&self) -> bool { | ||
| self.len == 0 | ||
| self.slice.is_empty() | ||
| } |
There was a problem hiding this comment.
Nit but theses functions don't actually need to be implemented on Bytes since already present the type from impl Deref<[u8]>
There was a problem hiding this comment.
Except they are not const through deref, so that would be one reason to keep them (doubt that it helps here, this is merely an educational mention).
| })?; | ||
|
|
||
| let mut trace: Vec<Span> = Default::default(); | ||
| (0..trace_count).try_fold( |
There was a problem hiding this comment.
Really a small nit, but I think the for loops was clearer than the nested try_fold
| pub unsafe fn as_mut_slice(&mut self) -> &mut &'static [u8] { | ||
| &mut self.slice | ||
| } |
There was a problem hiding this comment.
I guess you added this function because you want to pass en &mut &[u8] to the rpm decode functions, but really there is a simpler and safer way to do that.
rpm reads from types implementing RmpRead, which is implemented by default on all types implementing io::Read.
So IMO the correct way to use Bytes with rpm is to just impl Read on it and advance the base pointer from how many bytes were read.
impl io::Read for Bytes {
fn read(&mut self, &mut buff) -> Result<u8> {
let bytes_read = min(buff.len, self.len);
std::slice::copy_from_slice(&self[..bytes_read], buff[..bytes_read]);
self.ptr += bytes_read;
self.len -= bytes_read;
Ok(bytes_read)
}
}Then you could just do rmp::decode::read_array_len(&mut data)
There was a problem hiding this comment.
It's not just as simple as that, In particular with the string reading:
read_string_ref_nomut(unsafe { buf.as_mut_slice() }).map(|(str, newbuf)| {
let string = BytesString::from_bytes_slice(buf, str);
*unsafe { buf.as_mut_slice() } = newbuf;
string
})
there's a dance around the ordering of executions: first, read the string. Then fetch the string slice. Then adjust the size.
Because Bytes asserts that any read from it is within its bounds. So we'd then have to craft an unsafe method for that instead.
Also, your Read impl is actually copying every single byte. Passing a slice directly does not incur that overhead. It's maybe a bit more elegant to use Read, but actually not as performant. And this PR here is supposed to solve ... performance problems.
There was a problem hiding this comment.
I've just been profiling the whole patch. There's currently still one copy happening (the whole data buffer at once) in SidecarServer::send_trace_v04. That single copy accounts for 35% of the whole overhead. So copying here is most likely unacceptable.
There was a problem hiding this comment.
Yeah for sure we don't want to copy the strings we read, but for that you can make cleaner abstractions than using read_string_ref_nomut.
We could very well rewrite it to use the following primitives to encapsulate the unsafe manipulation in the Bytes implementation
impl Bytes {
// make the bytes instance point to original_buffer[n..]
// returns None if the buffer is smaller than n
fn advance((&mut self, n: usize) -> Option<()> {
if n > self.len { return None }
self.ptr += n;
self.len -= n;
Ok(())
}
// make the bytes instance point to original_buffer[..n]
// returns None if the buffer is smaller than n
fn shrink((&mut self, n: usize) -> Option<()> {
if n > self.len { return None }
self.len = n;
Ok(())
}
// Yield the first n bytes and advance the offset from as much
fn split_front_ref(&mut self, n: usize) -> Option<&[u8]> {
let start = self.ptr;
self.advance(n)?
// Safe because the lifetime of the slice is smaller than the Bytes
// and n smaller than length
Ok(unsafe { slice:: from_raw_parts(start, n) })
}
fn split_front(&mut self, n: usize) -> Option<Bytes> {
if n > self.len { return None }
let mut front = self.clone();
self.advance(n)?;
front.shrink(n)?;
Ok(front)
}
}
fn read_string_ref(bytes: &'a mut Bytes) -> Result<&'a str, DecodeError> {
let len = rpm::decode::read_str_len(bytes)?;
let str_ref = self.split_front_ref(len).map_err(|| DecodeStringError::BufferSizeTooSmall(len))?;
str::from_utf8(str_ref).map_err(|_| DecodeStringError::InvalidUtf8())
}
fn read_string_bytes(bytes: &'a mut Bytes) -> Result<&'a str, DecodeError> {
let len = rpm::decode::read_str_len(bytes)?;
let str_bytes = self.split_front(len).map_err(|| DecodeStringError::BufferSizeTooSmall(len))?;
BytesString::from_bytes(str_bytes).map_err(...)
}There was a problem hiding this comment.
By the way, I'm curious were we spend time other than the serialization, can your share the profile?
There was a problem hiding this comment.
@bwoebi and @paullegranddc - Is this conversation still ongoing? Any changes either of you require here? I don't have a strong opinion, so happy to implement whatever the group agrees on.
There was a problem hiding this comment.
Feel free to implement @paullegranddc suggestion with io::Read and the additional methods to Bytes. Should probably also fix the miri failure I think.
There was a problem hiding this comment.
On my end, I don't have a strong opinion. The code currently works, so I think we can merge it, but it'd be nice if you have time to refactor it. Either in this PR or latter
There was a problem hiding this comment.
@paullegranddc @bwoebi - I do like the impl io::Read solution as it seems to contain less sharp edges, but benchmarks wind up being around 30% slower compared to the as_mut_slice implementation. It's possible I misunderstood the suggestion and implemented something incorrectly here. But if neither of you see any glaring errors I will revert to the as_mut_slice implementation for now and create a ticket to follow up on this in another PR.
There was a problem hiding this comment.
This does not surprise me too much, at least it would need much more optimization by the compiler. So, if it's actually 30% slower, then please revert it.
c814f9b to
7ca377c
Compare
Bytes itself is already a structure with a slice into some underlying bytes.
Thus, we are adding a cheap way to directly modify the slice of the Bytes struct.
To facilitate this, we now store the tuple (ptr, len) as a &[] slice within the Bytes struct. The representation of these is exactly identical to before, but it allows trivial manipulation of the slice via direct assignment or &mut &[u8] reference.
The way the underlying field was exposed on the BytesWrapper struct was also unsound. Access to it was not wrapped within an unsafe method.
The decoder code can now trivially clone() the given Bytes struct and carry it around, directly shrinking the slice size of the underlying Bytes instance as the data is being processed.
Additionally, there now exist two helper methods for numbers and strings (read_string_bytes and read_number_bytes) which handle a bit of the boilerplate.
Sadly it is necessary to always write "unsafe { buf.as_mut_slice() }", but this is necessary for soundness, we only can actually guarantee safety when calling the actual decode functions, as long as we want to carry Bytes around.
Signed-off-by: Bob Weinand <[email protected]>
Signed-off-by: Bob Weinand <[email protected]>
a9211b7 to
e0f127e
Compare
bwoebi
left a comment
There was a problem hiding this comment.
Looks good now, thanks a lot for the effort in optimizing this!
Leverage tinybytes to perform zero-copy msgpack decoding of v04 spans within trace-utils for use by data-pipeline and the sidecar. This was driven by the need to reduce memory footprint of the sidecar in PHP. --------- Signed-off-by: Bob Weinand <[email protected]> Co-authored-by: Bob Weinand <[email protected]>
What does this PR do?
Deserialize V.04 Spans into a new internal representation that attempts to minimize new String allocations. Using the existing protobuf definitions leads to a level of performance that tracer libraries are/will be uncomfortable with.
Introduce a
BytesStringthat borrows from a tinybytes::Bytes buffer slice and aBufferWrapperfor keeping track of where in the bytes buffer we are slicing from.This will impact only v04 Spans at the moment.
Motivation
The PHP Language Platform team observed an unacceptable level of increased memory footprint when testing the use of the Sidecar for sending traces to the agent.
Additional Notes
The use of non-static
&strorCowtypes was not possible as once spans are decoded they may need to be passed to async code paths inSendDataon the way to the agent and this caused several lifetime issues. The use oftinybytesis necessary asBytesdoesn't support&[u8]input, which is necessary when we can't own the data (i.e. shared memory from the tracer using IPC).How to test the change?
Unit and integration tests have been updated.