Skip to content

Conversation

@lancelui-amzn
Copy link
Contributor

@lancelui-amzn lancelui-amzn commented Jun 20, 2024

Description of changes:

  • Use async-profiler to profile java virtual machines and generate flamegraphs. The flamegraph tab is modified to allow the selection between perf and async-profiler flamegraphs.
  • --profile-java option added to enable and pass args for JVM profiling, specify JVMs by comma separated PIDs or names
  • Datatypes have new is_profile_option bool field that flags whether the datatype can be enabled and disabled by the profile subcommand (applicable to perf_profile, flamegraph, and java_profile datatypes).
  • InitParams and CollectorParams now also contain profiler arguments.

Test Cases:
14 test cases were run with varying options and JVM setups. Tests were evaluated visually through the aperf report.

  1. record
  2. record with java profile, 2 JVMs in background
  3. record with java profile, specify 1 of 2 JVMs in background
  4. record with java profile, specify 2 of 2 JVMs in background
  5. record with java profile, 2 JVMs in background and 1 started at 7 seconds
  6. record with profile
  7. record with profile and java profile, 2 JVMs in background
  8. record with java profile, specify 1 of 2 JVMs in background by name
  9. record with java profile, specify 2 of 2 JVMs in background by name and pid
  10. record with java profile, specify 2 JVMs in background and 1 started at 2 seconds (should not record last JVM)
  11. record with java profile, no JVMs on system
  12. record with java profile, 2 JVMs with disabled perf shared mem
  13. record with perf and java profile, 0 JVMs
  14. record with perf and java profile, 3 JVMS started during and 1 ended early

Example of output:
Screenshot 2024-07-11 at 11 08 14 AM

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@lancelui-amzn lancelui-amzn requested a review from a team as a code owner June 20, 2024 14:20
@lancelui-amzn lancelui-amzn requested a review from janaknat June 20, 2024 18:20

impl CollectData for JavaProfileRaw {
fn prepare_data_collector(&mut self, params: CollectorParams) -> Result<()> {
let jps_out = Command::new("jps").output().expect("'jps' command failed.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we propagate the error?

let jps: Vec<&str> = jps_str.split_whitespace().collect();
let mut process_map: HashMap<String, Vec<String>> = HashMap::new();
for i in (0..jps.len()).step_by(2) {
if jps[i + 1] != "Jps" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the output of the jps command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default jps lists the PID and name for each JVM separated by a newline:

<PID> <NAME>
<PID> <NAME>
...
<PID> <NAME>

}

let mut jids: Vec<String> = Vec::new();
let jprofile = params.profile.get(JAVA_PROFILE_FILE_NAME).unwrap().as_str();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the get fails and we blindly unwrap()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java profile is only enabled when params.profile has JAVA_PROFILE_NAME so it is not possible for unwrap to fail.

if !jps.contains(&arg) {
error!("No JVM with name/PID '{}'.", arg);
continue;
} else if jps.iter().position(|&r| r == arg).unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this so the user can provide either a pid or a name? Would it make sense to simplify it and only allow pids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes right now the user can specify either pids or names. We could simplify but that would diverge from the functionality of async-profiler, which allows this.


process_map
.entry(process)
.and_modify(|v| v.push(html.len().to_string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This saves the filesize with the PID and Name for here.

<div id="flamegraphs" class="tabcontent">
<div class="extra">
<button class="flamegraphs-select" id="flamegraphs"> perf </button>
<button class="flamegraphs-select" id="javaprofile"> java </button>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These buttons exists even if there are no javaprofiles? What happens if the user clicks it when no javaprofiles exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It displays "No data collected." similar to the perf flamegraph.

function display_tab(name) {
let datatype = DataTypes.get(name);
if (datatype.hideClass != "") {
if(datatype.hideClass == "flamegraphsSelection"){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we special case the flamegraphsSelection? This can go inside the hideclass != "" check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flamegraph datatype callback takes a string, while the other ones take a bool or nothing.

jps_str = String::from_utf8(jps_out.stdout).unwrap_or_default();
let jps: Vec<&str> = jps_str.split_whitespace().collect();
for i in (0..jps.len()).step_by(2) {
if jps[i + 1] != "Jps" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a print for the new JVM with name and JID. Both this and the print at the start of the function should be debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New JVM/JID info is printed at the trace level when async-profiler is successfully launched. Should that be changed to debug?

Copy link
Contributor

@wash-amzn wash-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help the review if the PR description included the cases tested and the corresponding outputs (at least snippets of them that are relevant to the case).

README.md Outdated
- Network stats
- Meminfo
- Profile data (if enabled with `--profile` and `perf` binary present)
- JVM profile data with `jps` command and [async-profiler](https://github.com/async-profiler/async-profiler/tree/master) binary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 'JVM profile data with async profiler', no need to mention the jps detail

src/data.rs Outdated
pub fn collect_data(&mut self) -> Result<()> {
trace!("Collecting Data...");
self.data.collect_data()?;
self.data.collect_data(self.collector_params.clone())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be passable as a reference instead of cloning them every time.

Definitely would help avoid having so many .clone() calls all over the place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, this is matched with the other stages of record (prepare_data_collector, finish_data_collection, after_data_collection). Should we change those too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data types should be able to figure out all of this kind of thing at prepare time, and then just store it in their own fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataTypes store this in their own fields, they are passed to the respective Data enums when their functions are called.

Comment on lines 387 to 389
let params = CollectorParams::new();

stat.collect_data(params).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to do this in places suggests that this is probably the wrong approach (re-passing params down to collectors on every collect_data call). There should be a way to get this to the collectors just once at initialization/setup time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does get passed to the DataType, but not Data where these functions are called. Adding params to the Data itself kind of breaks the Data enum pattern. Passable by reference seems like a better approach?

Comment on lines 42 to 44
if(datatype.hideClass == "flamegraphsSelection") {
datatype.callback(name);
} else if (datatype.hideClass != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely looking like we need a different kind of change, instead of abusing an existing attribute and having to start special-casing certain values of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will use union type to remove this special case in index.

continue;
}

datatype.collector_params.elapsed_time = start.elapsed().as_secs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the part of this diff referencing times actually new, or is it an artifact of merging Janak's recent changes?

If it's new (well, even if it's merged), this appears to be doing things in seconds, (at least the new field it's stored in us u64, not floating-point), which can't be right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to Janak's changes. start.elapsed() is a Duration that measures the seconds and we convert it to u64 with as_secs(). This is originally used to keep track of the time elapsed for the record and we are passing it to the collector_params so java profile knows the remaining time to launch new async-profilers.

Comment on lines 159 to 160
error!("No JVM with name/PID '{}'.", arg);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to figure out what the behavior is here. I would say specifying a pid that doesn't exist should be fatal, but names are not as clear-cut. We may want to keep looking for only that name to appear during recording.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the names, anyway? Can you provide examples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names are either the classname or JAR file name that the JVM was launched on. For example, if I compile and run program.java, I can specify it with the java profile flag: ./aperf record --profile-java program.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When providing multiple PIDs, what if some are valid but some aren't?

Comment on lines 164 to 165
error!("Multiple JVMs with the name '{}', please provide PID.", arg);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to just profile all of them. However, that depends on what name actually is, I might change my view if I saw them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to this example with program.java.
Running,

java program &
java program &
./aperf record --profile-java program

will result in this error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to profile all of them. Some workloads will be multiple JVMs (typically to have separate, smaller heaps for shorter GC pauses) running the same thing.

Comment on lines 167 to 161
if let Some(jid) = self.get_jid(arg) {
jids.push(jid);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it be possible this is None, given the if statement a couple lines up? We need to be careful about the various data structures we have and how they can be inconsistent with respect to eachother.

And if this legitimately can happen, it should be reflected to the user somehow, not silently dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it is not possible to be none. Changed to handle inside of get_jid function.


fn launch_asprof(&self, jids: Vec<String>, params: CollectorParams) -> Result<()> {
let data_dir = PathBuf::from(params.data_dir.clone());
for jid in &jids {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jid is simply the pid of a (presumably) java process, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Comment on lines +211 to +202
Err(e) => {
error!("'asprof' did not exit successfully: {}", e);
return Ok(());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens if process is owned by another user, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, and any other error when async-profiler is started properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does async-profiler's stderr output go in cases like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is inherited from aperf process.

])
.spawn()
{
Err(e) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to fail here? The user java profiling but we are unable to do it. This might be also applicable to perf. Maybe something common could be done. It shouldn't be a blocker for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we print this message for n JVMs? Async-profiler is unavailable and we have 10 JVMs, this is printed 10 times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we do want to fail here. If not, then we should break out and remove the datatype on the first error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the output on a panic? We want to fail gracefully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the output when asprof is not downloaded:

thread 'main' panicked at src/data/java_profile.rs:53:25:
'asprof' command not found.
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return an Error defined in PDError? That should force aperf to remove the java collector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would allow aperf to continue right? Should we keep the panics on missing jps/pgrep?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I see. Yeah. We want to not continue.

src/lib.rs Outdated
match datatype.prepare_data_collector() {
Err(e) => {
if name == data::java_profile::JAVA_PROFILE_FILE_NAME
|| name == data::perf_profile::PERF_PROFILE_FILE_NAME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use data.is_profile_option which was added in this? Then we won't have to specify a new profiler if/when we add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that will work. 👍

interval: 1,
period: 2,
profile: false,
profile_java: None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add one more new test case to test Java profile ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot test the profile because of the asprof dependency and no java workload on the machine.

@janaknat janaknat merged commit ef2dd06 into main Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants