Survey™
The Survey collector/analytics framework is a new-generation, high-level, lightweight tool for collecting HPC application performance metrics.
Product Overview
Survey is a broad collection and reporting tool with lower overhead than more in-depth performance tools. It is a multi-platform Linux tool that collects and analyzes high-level performance metrics for applications running on both single-node and large-scale platforms, including Cray platforms.
The data collected can serve as input to your existing analysis tools and dashboards covering many use cases.
A product demo and free trial of the Survey tool can be arranged by contacting us.
Use Cases
-
Use Survey to support application development efforts and to help understand the resource capabilities of computer architectures and software environments. Integrate it into system and performance studies and into the planning process for system procurements.
-
Integrate Survey into your development framework (e.g. GitLab CI) to provide a continuous, real-time measure of performance impacts as development progresses. If you are developing for multiple architectures, compilers, etc., Survey can be used to proactively alert you to potential issues that may impact performance.
-
Integrate Survey into periodic system test suites that ensure that the system is healthy. Survey collects performance and system metadata that can be used to track and monitor identified performance levels and surface potential outliers.
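As an illustration of the continuous-integration and system-health use cases above, the following minimal sketch flags a run whose metric drifts from earlier runs. It assumes you have already extracted one numeric value per run (for example, a wall-clock time) from the collected data. The function name `flag_regression`, the 10% tolerance, and the sample values are hypothetical and are not part of Survey.

```python
from statistics import median


def flag_regression(history, latest, tolerance=0.10):
    """Return True if `latest` exceeds the median of `history` by more than `tolerance`.

    `history` is a list of earlier measurements of one metric (hypothetical data);
    `tolerance` is the allowed relative increase, 0.10 meaning 10%.
    """
    if not history:
        # Without a baseline there is nothing to compare against.
        return False
    baseline = median(history)
    return latest > baseline * (1.0 + tolerance)


# Hypothetical wall-clock times (seconds) from four earlier runs.
wall_times = [102.0, 99.5, 101.2, 100.4]

print(flag_regression(wall_times, 100.9))  # close to the baseline
print(flag_regression(wall_times, 118.0))  # well above the baseline
```

A CI job could exit with a non-zero status when such a check returns True. Using the median rather than the mean keeps a single noisy earlier run from shifting the baseline.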
Survey - Key Features
-
The Survey collector is designed to work on sequential, MPI, OpenMP, and hybrid codes. It directly leverages the tool interfaces available in current MPI implementations, including MPICH, MVAPICH, MPT, and Open MPI. It also supports multiple architectures and has been tested on machines based on Intel, AMD, ARM, and IBM POWER8/9 processors and integrated GPUs.
-
Is very lightweight, with a target overhead of 1%
Gathers multiple application performance metrics in one run
Gathers job metadata covering the job, hardware, and system
Gives a high-level performance overview (no mapping back to the source)
Identifies areas where you may want to use a more detailed tool such as Open|SpeedShop or HPCToolkit
Creates data files for application metrics (.csv and .json) and metadata (.json) for ingestion into local analysis frameworks
Pulls specific metrics and metadata using our extractor
Keeps all raw per-thread-of-execution .csv files available after the run (in a .dir directory that contains the per-thread files)
Reports min, max, and average values across threads of execution (including top-down metrics for Intel)
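The min, max, and average reduction across threads of execution described above can be sketched for a local analysis script as follows. Survey's actual file layout is not specified here, so the sketch assumes a hypothetical directory of per-thread CSV files, each with a header row and two columns named `metric` and `value`. The function name `summarize_threads` is also hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_threads(directory):
    """Return the min, max, and average of each metric across per-thread CSV files.

    Assumes (hypothetically) that every *.csv file in `directory` has the
    columns `metric` and `value`; this is not a documented Survey format.
    """
    values = defaultdict(list)
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                values[row["metric"]].append(float(row["value"]))
    # Reduce the per-thread samples of each metric to three summary numbers.
    return {
        name: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for name, v in values.items()
    }
```

Calling `summarize_threads("run.dir")` on such a directory (the name `run.dir` is a placeholder) returns one dictionary entry per metric. A large gap between `min` and `max` for a metric is a simple hint of load imbalance across threads.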
-
executable
cmd line
linked libraries
launch - start/end
number of threads/ranks
hardware
CPU
memory
file systems
HW components
system
operating system
resource limits
environment variables
Slurm info (resource manager)
-
Aggregate metrics across all MPI ranks and OpenMP threads
Memory information (e.g. high water mark, memory allocation and free calls, allocation sizes)
Hardware counter information and derived metrics
Input/output information, I/O time, read/write times and byte counts
MPI information, MPI time, and percent across the threads of execution
OpenMP information, serial time, and time spent in OpenMP regions
-
The Survey framework can incorporate external collection tools to build a data store that covers additional aspects of the machine and environment (e.g. nvidia-smi integration). An application output collection process is also being integrated to collect and monitor application-specific data.
-
Survey can extract specific data from your collected results and can generate reports that compare collection sets.