Status check based on counting the number of files #1342

@wyphan


While developing tests for HPCToolkit's CUDA support in the E4S deployment on Perlmutter (NERSC), we came up with several ideas for new buildtest features. These would help validate the collected HPCToolkit measurement data and the post-mortem analysis, and so determine whether a given HPCToolkit test passes or fails.

For reference, the HPCToolkit workflow for a GPU-accelerated application consists of 4 phases:

  1. Performance measurement phase: using hpcrun to collect profiles and/or traces. This phase results in a "measurement directory".
  2. Binary analysis phase: using hpcstruct to analyze the program structure of executables and libraries loaded in the address space during application execution. This phase modifies the measurement directory.
  3. Attribution phase: using hpcprof or hpcprof-mpi to correlate the profiling samples and trace records in the measurement directory into a performance database. This phase takes the measurement directory as input and generates a "database directory" as output.
  4. Analysis phase: using the hpcviewer graphical user interface (GUI) to view the performance database and conduct performance analysis. This phase uses the database directory.

Our first feature request in support of these HPCToolkit CUDA tests was for file/directory existence checks (#1327). It was implemented in #1329 and further enhanced by #1331. The specific use cases are:

  • validating the performance measurement (hpcrun) phase via the existence of *.hpcrun profiles and *.hpctrace trace files in the measurement directory.
  • validating the binary analysis (hpcstruct) phase via the existence of one specific *.hpcstruct program structure file in the measurement directory.
  • validating the attribution (hpcprof/hpcprof-mpi) phase via the existence of performance database components in the database directory.
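For illustration, the existence checks above can be sketched in plain Python. This is not buildtest's implementation or buildspec syntax; the helper name `has_match` and the file names in the demo are made up, and a temporary directory stands in for a real measurement directory.

```python
import glob
import os
import tempfile


def has_match(directory, pattern):
    """Return True if at least one path in `directory` matches the glob `pattern`."""
    return len(glob.glob(os.path.join(directory, pattern))) > 0


# Demo: a throwaway directory standing in for an hpcrun measurement directory.
# The file names are hypothetical; real hpcrun output uses longer names.
with tempfile.TemporaryDirectory() as meas_dir:
    for name in ("rank0.hpcrun", "rank0.hpctrace"):
        open(os.path.join(meas_dir, name), "w").close()

    patterns = ("*.hpcrun", "*.hpctrace", "*.hpcstruct")
    checks = {p: has_match(meas_dir, p) for p in patterns}

print(checks)
```

In this demo the hpcstruct check fails because no *.hpcstruct file was created, which mirrors a run where the binary analysis phase was skipped.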

This issue contains our second feature request for a status check based on counting the number of files. The specific use case is validating the hpcrun phase by verifying the correct number of *.hpcrun profiles and *.hpctrace trace files in the measurement directory. The "correct" numbers depend on the command line flags to hpcrun as well as the Slurm srun parameters. At the moment the HPCToolkit CUDA tests contain two buildspec YAML files:

  • hpctoolkit_cuda_vecadd_perlmutter.yml for a coarse-grained profiling (-e gpu=nvidia) and tracing (-t) run using the cuda_vecadd test case.
    Correct = 2 *.hpcrun files and 2 *.hpctrace files.

  • hpctoolkit_cuda_vecadd_pc_perlmutter.yml for a fine-grained profiling run with -e gpu=nvidia,pc and no tracing, using the cuda_vecadd test case.
    Correct = 1 *.hpcrun file and 0 *.hpctrace files.
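The requested check amounts to comparing, per glob pattern, the number of matching files against an expected count. A minimal Python sketch of that logic follows, assuming the expected counts are supplied as a pattern-to-count mapping. The function names `count_files` and `check_file_counts` are hypothetical and do not describe an existing buildtest API, and the demo files are placeholders.

```python
import fnmatch
import os
import tempfile


def count_files(directory, pattern):
    """Count regular files directly inside `directory` whose name matches `pattern`."""
    return sum(
        1
        for entry in os.scandir(directory)
        if entry.is_file() and fnmatch.fnmatch(entry.name, pattern)
    )


def check_file_counts(directory, expected):
    """Compare actual per-pattern counts with `expected`; return (passed, actual)."""
    actual = {pattern: count_files(directory, pattern) for pattern in expected}
    return actual == expected, actual


# Expected counts taken from the two buildspecs described above.
EXPECTED_COARSE = {"*.hpcrun": 2, "*.hpctrace": 2}  # -e gpu=nvidia with -t
EXPECTED_PC = {"*.hpcrun": 1, "*.hpctrace": 0}      # -e gpu=nvidia,pc, no tracing

# Demo: fake measurement directory holding two profiles and two traces.
with tempfile.TemporaryDirectory() as meas_dir:
    for name in ("a.hpcrun", "b.hpcrun", "a.hpctrace", "b.hpctrace"):
        open(os.path.join(meas_dir, name), "w").close()

    coarse_ok, coarse_actual = check_file_counts(meas_dir, EXPECTED_COARSE)
    pc_ok, pc_actual = check_file_counts(meas_dir, EXPECTED_PC)

print(coarse_ok, coarse_actual)
print(pc_ok, pc_actual)
```

Requiring an exact match, rather than "at least N", means the check also catches stray profiles left over from an earlier run. A pattern with an expected count of 0 is still evaluated, which is what the no-tracing case needs.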

Our third feature request, for grepping log files, is detailed in #1343.

Labels: new feature