[DSM] Set DSM checkpoints for S3 put / get operations (#6859)
Conversation
if (key != null && bucket != null && awsOperation != null) {
  if ("GetObject".equalsIgnoreCase(awsOperation.toString())) {
    LinkedHashMap<String, String> sortedTags = new LinkedHashMap<>();
LinkedHashMap can be replaced with a fixed size array. This will be done in a followup PR in Q2.
I think a simple struct-like object would be better than an array.
CharSequence awsRequestName = AwsNameCache.getQualifiedName(request);
span.setResourceName(awsRequestName, RPC_COMMAND_NAME);
if ("Amazon S3".equalsIgnoreCase(awsServiceName) && span.traceConfig().isDataStreamsEnabled()) {
Basing this on service name seems brittle to me; it would make more sense to base this check on the raw information, e.g. the type of Request.
I've switched to "s3" service name for both versions of AWS SDK. I didn't use the request type to avoid adding new dependencies.
Object key = span.getTag(InstrumentationTags.AWS_OBJECT_KEY);
Object bucket = span.getTag(InstrumentationTags.AWS_BUCKET_NAME);
Object awsOperation = span.getTag(InstrumentationTags.AWS_OPERATION);
There are a lot of repetitive toString calls here.
If these are Strings or CharSequences, I would type check and cast them sooner.
Or, just create variables to hold the String values immediately after calling getTag.
Something like the line below would be fine:
String keyStr = (key == null) ? null : key.toString();
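A hedged sketch of that suggestion; the `asString` helper and the simulated tag values are illustrative, not from the PR:

```java
// Illustrative sketch: convert each tag to a String once, immediately after
// the lookup, so later checks don't repeat toString(). The asString helper
// and the simulated tag values below are assumptions for this example.
final class TagExtraction {
    static String asString(Object tag) {
        return (tag == null) ? null : tag.toString();
    }

    public static void main(String[] args) {
        // Simulated span.getTag(...) results.
        Object key = "reports/2024.csv";
        Object bucket = "my-bucket";
        Object awsOperation = "GetObject";

        String keyStr = asString(key);
        String bucketStr = asString(bucket);
        String operationStr = asString(awsOperation);

        // Later logic can use plain String comparisons without further casts.
        System.out.println(keyStr != null && bucketStr != null
            && "GetObject".equalsIgnoreCase(operationStr)); // true
    }
}
```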
payloadSize = (long) requestSize;
}
LinkedHashMap<String, String> sortedTags = new LinkedHashMap<>();
I would like to see DSM replace the use of LinkedHashMap with a simple struct-like object or builder.
LinkedHashMap is creating a lot of unnecessary allocation.
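As a rough sketch of the struct-like alternative (the field names are assumptions for illustration, not the actual DSM tag keys):

```java
// Illustrative sketch: a fixed, ordered set of checkpoint tags held in plain
// fields instead of a LinkedHashMap, avoiding a map allocation per checkpoint.
// Field names here are assumptions, not the real DSM tag keys.
final class CheckpointTags {
    final String direction; // e.g. "in" for GetObject, "out" for PutObject
    final String topic;     // e.g. the bucket name
    final String type;      // e.g. "s3"

    CheckpointTags(String direction, String topic, String type) {
        this.direction = direction;
        this.topic = topic;
        this.type = type;
    }

    // Tags in a stable order, as LinkedHashMap iteration would provide them.
    String[] ordered() {
        return new String[] {direction, topic, type};
    }
}
```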
// S3
request.getValueForField("Bucket", String.class).ifPresent(name -> setBucketName(span, name));
if ("s3".equalsIgnoreCase(awsServiceName) && span.traceConfig().isDataStreamsEnabled()) {
Why is service name different here?
SDK v1 and v2 have different naming for services. I'll check if this can be unified.
Both implementations now use 's3'.
private final StringBuilder builder;
public DataSetHashBuilder() {
  builder = new StringBuilder();
Builder should be given an initial size if possible.
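For illustration (the 64 below is an assumed typical length, not a measured value), a presized builder avoids internal array growth while appending:

```java
// Minimal illustration: a presized StringBuilder starts with enough capacity
// to hold the expected content, so appends don't trigger reallocation, while
// the default constructor starts at capacity 16 and must grow for longer input.
final class PresizedBuilder {
    public static void main(String[] args) {
        StringBuilder presized = new StringBuilder(64); // assumed typical size
        StringBuilder defaulted = new StringBuilder();

        System.out.println(presized.capacity());  // 64
        System.out.println(defaulted.capacity()); // 16
    }
}
```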
public long generateDataSourceHash(long parentHash) {
  builder.append(parentHash);
  return FNV64Hash.generateHash(builder.toString(), FNV64Hash.Version.v1);
It seems like the builder & string aren't actually needed for the hash calculation.
I presume the FNV64Hash computation could be done in a streaming fashion as the tags are added.
Makes sense, I'll update the code.
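A hedged sketch of the streaming idea: FNV-1 folds in one byte at a time, so the hash state can be updated as each tag is appended and the intermediate string is never built. The constants below are the standard 64-bit FNV-1 parameters; the real FNV64Hash class may hash UTF-8 bytes, so this sketch assumes ASCII tag values.

```java
// Illustrative streaming FNV-1 (64-bit). Appending chunks incrementally yields
// the same hash as hashing the concatenated string, because FNV-1 consumes
// input byte by byte. Assumes ASCII input; a real port should match the byte
// encoding used by the existing FNV64Hash implementation.
final class StreamingFnv1 {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    private long hash = FNV_OFFSET_BASIS;

    void add(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            hash *= FNV_PRIME;            // FNV-1: multiply first...
            hash ^= (s.charAt(i) & 0xff); // ...then XOR in the next byte
        }
    }

    long value() {
        return hash;
    }

    public static void main(String[] args) {
        StreamingFnv1 chunked = new StreamingFnv1();
        chunked.add("bucket:");
        chunked.add("my-bucket");

        StreamingFnv1 oneShot = new StreamingFnv1();
        oneShot.add("bucket:my-bucket");

        // Same input bytes in the same order -> same hash.
        System.out.println(chunked.value() == oneShot.value()); // true
    }
}
```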
dougqh left a comment
I think there's some opportunity for clean-up and optimization.
What Does This Do
Sets DSM checkpoints for S3 put/get operations.
Motivation
Setting "data source" checkpoints unlocks end-to-end pipeline lineage. For instance, this particular change allows visualizing data flow between streaming and batch processing. The same mechanism can be used to support more types of data sources.
Additional Notes
Jira ticket