Intel® VTune™ Profiler is an application profiling and tuning tool for serial and threaded codes. In this tutorial we show how to use VTune on the Tetralith system.
We first create a run directory for our test and then load the module:
$ export VTUNETEST=/home/${USER}/intel_perf_tool/vtune/$(date +%d%m%y)
$ mkdir -p $VTUNETEST
$ module load buildenv-intel/2021.3.0-oneapi
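The Intel build environments available on the system can be listed with the module command, e.g.:
$ module avail buildenv-intel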
A different directory name can of course be chosen instead of the one above. Next, we check the compute accounting information:
$ projinfo $USER
This shows the account name of your compute project. We use it to set the following variable:
$ export ACCOUNT_NAME=project_account_name ## Add your project's account name here
Now we download the sample code used in this tutorial:
$ cd $VTUNETEST
$ wget -N https://www.nsc.liu.se/support/tutorials/Intel_Perf_Tools/nqueens_fortran_vtune_amp_xe.tgz
$ tar -zxvf nqueens_fortran_vtune_amp_xe.tgz
$ cd nqueens_fortran/linux
To compile the code and verify that the binary nqueens_serial was built, run:
$ make clean
$ make
$ test -f nqueens_serial
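Optionally, the binary can be test-run directly on the login node before profiling. This is only a quick sanity check, assuming the sample's built-in default problem size:
$ ./nqueens_serial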
Next, download the run script for VTune profiling:
$ wget -N https://www.nsc.liu.se/support/tutorials/Intel_Perf_Tools/run.sh
$ cat run.sh
#!/bin/bash
## hotspots analysis ##
mkdir hotspots
cd hotspots
amplxe-cl -collect hotspots -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## hpc-performance analysis ##
mkdir hpc_perf
cd hpc_perf
amplxe-cl -collect hpc-performance -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## memory-access analysis ##
mkdir mem_access
cd mem_access
amplxe-cl -collect memory-access -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## memory-consumption analysis ##
mkdir mem_consump
cd mem_consump
amplxe-cl -collect memory-consumption -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
## io analysis ##
mkdir io
cd io
amplxe-cl -collect io -result-dir r001hs ../nqueens_serial
test -d r001hs
amplxe-cl -report hotspots -r r001hs
cd ../
As can be seen, the script runs several types of VTune analyses: hotspots, hpc-performance, memory-access, memory-consumption, and io, each collecting into its own result directory. A more compact loop-based variant is sketched below.
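The same five collections can also be written as a loop. This is only a sketch equivalent to the run.sh above; here the result directories are simply named after each analysis type rather than as in the downloaded script:
#!/bin/bash
## run all five VTune collections in turn ##
for analysis in hotspots hpc-performance memory-access memory-consumption io; do
    mkdir -p "$analysis"
    cd "$analysis"
    ## collect data, then print a hotspots report from the result ##
    amplxe-cl -collect "$analysis" -result-dir r001hs ../nqueens_serial
    amplxe-cl -report hotspots -r r001hs
    cd ..
done
To run the example do the following: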
$ sbatch -A $ACCOUNT_NAME -n1 run.sh
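The job status can be checked with:
$ squeue -u $USER
Once the job has finished, the text reports produced by amplxe-cl -report can be found in the SLURM output file. Each result directory can also be opened for interactive inspection in the matching VTune GUI, assuming a login session with X11 forwarding, e.g.:
$ amplxe-gui hotspots/r001hs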