[Spark] Instrument YARN application end to not drop traces #5267
paul-laffon-dd merged 3 commits into master
Conversation
Force-pushed from 3004bfe to 556c6b0
Benchmarks
Parameters: see matching parameters.
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 22 cases. See unchanged results.
Force-pushed from 556c6b0 to 597a83a
Is YarnFinishAdvice run on a separate thread from the SparkListener? If so, finishApplication needs to be thread-safe.
Yes, it will run in the main thread of the application while the SparkListener will run on another thread, so this method has to be thread-safe.
All methods are now synchronized, which could lead to performance issues.
What are you protecting against in the first place?
If only finishApplication() can be called from another thread, it might be worth handling the thread safety at that level only, for example with a volatile or AtomicBoolean for applicationEnded. The AgentSpan instance should be fine.
Yes, only the finishApplication method can be called from another thread, but it references multiple fields of the class (applicationEnded, lastJobFailed, liveExecutors, applicationMetrics), so synchronized abstracts away the complexity of making each of them thread-safe.
Knowing that all methods are called from the same thread, except finishApplication() which can be called once per spark application from another thread, the performance impact of synchronized should be minimal, right?
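To make the trade-off concrete, here is a minimal sketch of the two options discussed; the class and method names are illustrative, not the actual listener code:

```scala
import java.util.concurrent.atomic.AtomicBoolean

class AppLifecycle {
  // Option A (taken in this PR): synchronize every method so a single lock
  // guards all shared state (applicationEnded, lastJobFailed, liveExecutors, ...).
  private var applicationEnded = false

  def onApplicationEnd(): Unit = synchronized { finishIfNeeded() }

  // Called from the YARN advice, potentially on a different thread.
  def finishApplication(): Unit = synchronized { finishIfNeeded() }

  private def finishIfNeeded(): Unit =
    if (!applicationEnded) {
      applicationEnded = true
      // finish the spark.application span exactly once
    }

  // Option B (suggested above): a lock-free guard, which would be enough
  // only if applicationEnded were the sole state shared across threads.
  private val endedFlag = new AtomicBoolean(false)
  def finishApplicationLockFree(): Unit =
    if (endedFlag.compareAndSet(false, true)) {
      // finish the span; other shared fields would remain unprotected
    }
}
```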
Biased locking (which I think you are alluding to) has been removed from newer versions of the JDK, since the maintenance overhead outweighed the performance gains. That being said, from what I gather, these synchronized blocks are not in a performance-critical path, so I think it should be fine for the sake of simplicity.
I don't think this will do what you expect:
- We already flush the traces in the shutdown hook.
- You don't know whether your hook will run before or after the tracer's one, so the tracer might already be dead when you call sleep.

Moreover, please avoid using Thread.sleep(). It should be added to the forbidden APIs, by the way.
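To illustrate the ordering problem: the JVM starts registered shutdown hooks in an unspecified order, so a hook added by the instrumentation has no guarantee of running before the tracer's own hook. A generic sketch, not the actual tracer code:

```scala
// Hooks registered this way run concurrently, in no guaranteed order.
Runtime.getRuntime.addShutdownHook(new Thread(() => {
  // If the tracer's shutdown hook has already stopped the trace processor,
  // any span finished (or sleep performed) here cannot save the trace.
}))
```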
I removed the sleep since, as you said, it was not needed.
The issue with the dropped trace was that onApplicationEnd could be triggered after the trace processor had exited, with the log `Datadog trace processor exited. Publishing traces stopped.`
By manually triggering the event before the finish function, we can be sure that the trace is written before the tracer exits.
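Sketched, the manual trigger amounts to delivering the end event ourselves from the instrumentation, before ApplicationMaster.finish() proceeds; `listener` here stands for the registered SparkListener and is an assumed handle, not the actual field name:

```scala
import org.apache.spark.scheduler.SparkListenerApplicationEnd

// Deliver the application-end event synchronously, while the trace
// processor is still running, instead of waiting for Spark's shutdown hook.
listener.onApplicationEnd(SparkListenerApplicationEnd(System.currentTimeMillis()))
```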
> I removed the sleep since, as you said, it was not needed.

I did not say it was not needed, rather that it might not do what you expect.
Hence, we would like some tests to back the feature you introduce. That would ensure it works properly and that we can maintain it in the future.
The issue with this change is that it involves interactions between Spark and YARN, which makes it very complicated to cover in tests.
In the current tests, spark is created with `config("spark.master", "local")` so that all spark components are launched in the same JVM. In order to test this change, it would have to be launched with `config("spark.master", "yarn")`, where spark would expect to connect to a YARN cluster: https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn
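For reference, a minimal sketch of the in-process setup the current tests rely on (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// The "local" master runs the driver and executors inside the test JVM;
// switching to "yarn" would require a real cluster to connect to.
val spark = SparkSession.builder()
  .config("spark.master", "local")
  .appName("spark-instrumentation-test")
  .getOrCreate()
```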
To validate this change, I manually launched multiple SparkPi applications on an EMR cluster with a build of this branch.
I agree with @PerfectSlayer that it would be good to have at least one test. I know that orchestrating a spark test with yarn is out of the question, but a test that checks that the method gets instrumented and the tags added should be possible anyway. I quickly looked at the ApplicationMaster and I think it would be possible, even if it's a horrible hack, to force its use without having it orchestrate the whole spark application. Basically, just create a SparkSession, then create an ApplicationMaster and call finish(...) on it. The spark/yarn conf supplied to it would need to be faked to make sure it doesn't try to do anything.
You are right, an ApplicationMaster can be instantiated without orchestrating the whole spark application (and it is definitely easier than I would have expected).
I added a test calling finish(...) to end the application span.
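A rough sketch of that test idea follows; the ApplicationMaster constructor is private[spark] and its exact signature differs between Spark versions, so the instantiation below is an assumption for a Spark 3.x-style constructor, not the actual test code:

```scala
// Must live under an org.apache.spark package to reach private[spark] members.
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.{ApplicationMaster, ApplicationMasterArguments}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.master", "local")
  .getOrCreate()

// Faked conf: the ApplicationMaster is never started, so it should not
// try to talk to a real YARN cluster.
val am = new ApplicationMaster(
  new ApplicationMasterArguments(Array.empty[String]),
  new SparkConf(),
  new YarnConfiguration())

// Calling finish(...) should hit the instrumentation, record the exit
// status and code, and end the spark.application span.
am.finish(FinalApplicationStatus.SUCCEEDED, 0)
```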
Force-pushed from 33e8587 to 37fdecc
Force-pushed from 37fdecc to 07d63db
Force-pushed from eebbc8e to a2467b9
…ener interface (#5505)
Instrument the SparkSubmit.runMain(...) function to capture all errors occurring in a spark application. This SparkSubmit class is used by spark to launch the application in non-YARN/Mesos environments. The YARN case is already covered by instrumenting the ApplicationMaster.finish() function (from this PR #5267).

## Motivation
We are currently using the SparkListener interface to receive events from spark. However, this interface only captures errors that occur during the spark computations, not in arbitrary user code:

```scala
def main() = {
  val spark = SparkSession.builder.getOrCreate()
  val df = spark.read.parquet("...")

  // Exception will be captured because it happens during spark computations
  df.map(x => throw new RuntimeException("Some exception"))

  // Exception currently not captured because it is thrown outside of spark computations
  throw new RuntimeException("Some exception")
}
```
## What Does This Do
Instrument the spark `yarn.ApplicationMaster.finish()` function in order to:
- capture the application exit code
- make sure the `spark.application` span is finished before the trace processor exits

### Capturing exit code
An application can fail during spark computations (when performing operations on the data) or in the user code (like trying to select a column that does not exist).
Failures during spark computations are caught by the SparkListener, but errors in the user code are not, so instrumenting the finish function allows capturing those as well.

### Finish the `spark.application` span
The `spark.application` span is the local root span and is finished when the `onApplicationEnd` event is received. However, if the sparkContext is not stopped explicitly by the user (using `spark.stop()`), spark is stopped in a shutdown hook. In this case, the `onApplicationEnd` event can be received after the trace processor has exited, leading to the trace being dropped with the log `Datadog trace processor exited. Publishing traces stopped.`
By manually calling the function behind `onApplicationEnd`, we can make sure that the trace is always finished before the tracer exits.
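For illustration, a minimal application exhibiting the race described above; this is a hypothetical example that simply omits `spark.stop()`, not a reproduction from the PR:

```scala
import org.apache.spark.sql.SparkSession

object NoExplicitStop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    spark.range(100).count()
    // No spark.stop() here: Spark only stops via its shutdown hook, so
    // onApplicationEnd may fire after the trace processor has exited and
    // the spark.application span would be dropped without this change.
  }
}
```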