Intel® VTune™ Profiler is an application profiling and tuning tool for serial and threaded codes. In this tutorial we show how to use VTune on the Tetralith system.
We first create a run directory for our test and then load the module:
$ export VTUNETEST=/home/${USER}/intel_perf_tool/vtune/$(date +%d%m%y)
$ mkdir -p $VTUNETEST
$ module load buildenv-intel/2021.3.0-oneapi
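The Intel build environments available on the system can be listed with the module command, e.g.:
$ module avail buildenv-intel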
A different directory name can of course be chosen instead of the one above. Next, we check the compute accounting information:
$ projinfo $USER
This shows the account name of your compute project. We use it to set the following variable:
$ export ACCOUNT_NAME=project_account_name ## Add your project's account name here
Now we download the sample code used in this tutorial:
$ cd $VTUNETEST
$ wget -N https://www.nsc.liu.se/support/tutorials/Intel_Perf_Tools/nqueens_fortran_vtune_amp_xe.tgz
$ tar -zxvf nqueens_fortran_vtune_amp_xe.tgz
$ cd nqueens_fortran/linux
To compile the code and verify that the binary nqueens_serial was built, run:
$ make clean
$ make
$ test -f nqueens_serial
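Optionally, the binary can be test-run directly on the login node before profiling. This is only a quick sanity check, assuming the sample's built-in default problem size:
$ ./nqueens_serial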
Next, download the run script for VTune profiling:
$ wget -N https://www.nsc.liu.se/support/tutorials/Intel_Perf_Tools/run.sh
$ cat run.sh
#!/bin/bash
## hotspots analysis ##
mkdir hotspots
cd hotspots
amplxe-cl -collect hotspots -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## hpc-performance analysis ##
mkdir hpc_perf
cd hpc_perf
amplxe-cl -collect hpc-performance -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## memory-access analysis ##
mkdir mem_access
cd mem_access
amplxe-cl -collect memory-access -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## memory-consumption analysis ##
mkdir mem_consump
cd mem_consump
amplxe-cl -collect memory-consumption -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## io analysis ##
mkdir io
cd io
amplxe-cl -collect io -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
As can be seen, the script runs several types of VTune analyses: hotspots, hpc-performance, memory-access, memory-consumption, and io, each collecting into its own result directory. A more compact loop-based variant is sketched below.
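The same five collections can also be written as a loop. This is only a sketch equivalent to the run.sh above; here the result directories are simply named after each analysis type rather than as in the downloaded script:
#!/bin/bash
## run all five VTune collections in turn ##
for analysis in hotspots hpc-performance memory-access memory-consumption io; do
    mkdir -p "$analysis"
    cd "$analysis"
    ## collect data, then print a hotspots report from the result ##
    amplxe-cl -collect "$analysis" -result-dir r001hs ../nqueens_serial
    amplxe-cl -report hotspots -r r001hs
    cd ..
done
To run the example do the following: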
$ sbatch -A $ACCOUNT_NAME -n1 run.sh
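The job status can be checked with:
$ squeue -u $USER
Once the job has finished, the text reports produced by amplxe-cl -report can be found in the SLURM output file. Each result directory can also be opened for interactive inspection in the matching VTune GUI, assuming a login session with X11 forwarding, e.g.:
$ amplxe-gui hotspots/r001hs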