A common cause for a job to fail is exhausting the memory on one or more compute nodes in the job.
Some signs of hitting an out-of-memory (OOM) condition are:
The job stops prematurely (or immediately). This happens if the application dies by itself when running out of memory, or if the Linux kernel kills the process. Often you will find signs of this in the application output (usually in the slurm-JOBID.out file if you have not redirected it elsewhere). The message "slurmstepd: Exceeded step memory limit at some point" is a typical sign of having run out of memory.
The job stops making progress, but continues running until it hits the walltime limit.
The queue system stops a running job. This happens when one or more of the nodes become so slow due to the lack of available memory that the scheduler takes them offline and kills the whole job. In this case, you might get an email from NSC informing you of what happened.
You can try following the suggestions below, or ask NSC for assistance in determining if your job ran out of memory. Please remember to include the job ID. We can check system logs on the compute nodes that are not available to you, and these logs will usually tell us if the node ran out of memory or not.
These are some of your options:
Check actual memory usage while the job is running. Use jobsh (see this page for details) to log in to one or more compute nodes in your job; you can then monitor memory usage in real time using e.g. the top command. Also see the jobstats tool described further down on this page.
Use the seff command when the job has ended. This will show you (among other things) how much memory was used. See below for an example.
Use nodes with more memory. Most clusters have “fat” nodes (see the hardware information) with more memory than normal nodes. You can use the --mem option to sbatch/interactive to request nodes with more memory. Please note, though, that most clusters only have a small number of fat nodes, so your job might need to wait for longer than usual in the queue when you request fat nodes.
Use fewer cores per compute node. If you run an MPI application, you can usually try running fewer ranks per node, and either run on more nodes or accept a longer runtime. On Tetralith, you could try e.g.:
sbatch --ntasks-per-node=16 --mem=0
or
sbatch --ntasks-per-node=16 --cpus-per-task=2
which will run 16 MPI ranks per node instead of 32 (note that 16x2=32 in the second example). The --mem=0 flag (a bit counterintuitively) ensures access to the full memory of a node. Each rank now gets twice the effective memory at the expense of a longer runtime. Many programs run a lot faster than you would expect with just half the number of cores, so this can be economical compared to e.g. the next option.
Use more compute nodes. Some MPI-parallel programs distribute their data (more or less) evenly over all compute nodes. In such a case, you might be able to fit your calculation within the available node memory by running on more nodes. The trade-off is that your program might not scale very well to many compute nodes, so you may run with low efficiency and use up to twice the amount of core hours for the job.
Limit application memory. If your application has a configuration option for how much memory to use per node, try lowering that. Gaussian, for example, has such a switch. Remember that even if the compute nodes have e.g. 32GiB RAM, you cannot use all of that for your application; some room must be left for the operating system, disk cache etc. A value of around 30GiB for a 32GiB node is usually OK. Some programs are also notoriously bad at estimating memory use, so you might need to set a large safety margin.
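The arithmetic behind the "fewer cores per node" and "limit application memory" options can be sketched in a few lines of Python. This is only an illustration: the function name memory_per_rank_gib and the default 2 GiB reserve are assumptions derived from the 30-of-32 GiB rule of thumb above, not an NSC tool, and the real node memory differs between clusters.

```python
def memory_per_rank_gib(node_memory_gib, ranks_per_node, reserved_gib=2.0):
    """Rough memory budget per MPI rank on one node, in GiB.

    reserved_gib is the room left for the operating system, disk cache etc.
    The 2 GiB default mirrors the "about 30 GiB on a 32 GiB node" rule of
    thumb; it is an assumption, not a measured value.
    """
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    usable = node_memory_gib - reserved_gib
    if usable <= 0:
        raise ValueError("reserved memory exceeds node memory")
    return usable / ranks_per_node


if __name__ == "__main__":
    # Hypothetical 32 GiB node: all 32 cores used versus only 16 ranks.
    for ranks in (32, 16):
        budget = memory_per_rank_gib(32, ranks)
        print(f"{ranks} ranks per node: {budget:.2f} GiB per rank")
```

Halving the number of ranks per node doubles the budget per rank, which is why the --ntasks-per-node=16 examples above can rescue a job that runs out of memory with 32 ranks per node.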
The slurm-[jobid].out file is created in the directory from where you submitted the job. Unless you have redirected the output somewhere else, this is where the output from your job script will end up.
Any log files written by your application.
The NSC accounting logs in /var/log/slurm/accounting/YYYY-MM-DD on the login node. All jobs that have ended are listed there. In this file you can look at the “jobstate”. Some common states are:
COMPLETED: the job script exited normally (i.e. with exit status 0). This does not necessarily mean that the job was successful, only that the job script did not return an error to the scheduler.
FAILED: the job script returned a non-zero exit status. This usually means that something went wrong, but it does not necessarily mean that the application itself failed; it might be e.g. a failed “cp” command that was run as the last command in the job script.
CANCELLED: you (or a system administrator) cancelled the job (using scancel).
NODE_FAIL: one or more of the compute nodes in the job failed in such a way that the scheduling system decided to take it offline.
TIMEOUT: the job ran until it had used all the walltime requested by it, and was terminated by the scheduler.
NODE_FAIL in particular is commonly associated with out-of-memory conditions.
On some NSC systems (e.g. Tetralith and Sigma) you can use jobstats to monitor the CPU and memory use of a job. It works for jobs that use a single full node (i.e. not shared) or several nodes.
Load the module and wrap your mpprun command in the batch script with jobstats start and jobstats stop. Example:
module load jobstats/0.8.1
jobstats start
mpprun /path/to/binary
jobstats stop
...
jobstats report
You will then get a summary in the form of a jobstats.txt file. Example:
CPU usage per node
-----------------------------
Minimum amount : 0.2 %
First quartile : 79.3 %
Median : 88.6 %
Third quartile : 98.6 %
Maximum amount : 100.0 %
Average(trimean): 88.8 %
Memory usage per node
-----------------------------
Minimum amount : 886 MB
First quartile : 9102 MB
Median : 9213 MB
Third quartile : 9238 MB
Maximum amount : 10131 MB
Average(trimean): 9191 MB
The most interesting data points are usually the average CPU usage and the maximum amount of memory. Average CPU may not always be 100% if your job spends a lot of time waiting for disk input/output or network traffic. In addition to the above, the measured free memory is also provided. Note that even without a job load, some part of the memory is used.
Technical info
The statistics are aggregated in the following way. First, samples from all nodes are averaged into a time series of average node values; then statistics over time are computed. The minimum and maximum values, however, are not node averages, but the actual min/max observed in the full node set. I chose the trimean as the main statistic, as it is more sensitive to biased distributions, which is usually what we are looking for.
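The aggregation described above can be sketched in Python as follows. This is not the jobstats source code: the function names are hypothetical, and the quartile method (statistics.quantiles with method="inclusive") is an assumption, since the text does not say how jobstats computes quartiles. The trimean itself is the standard Tukey definition, (Q1 + 2 x median + Q3) / 4.

```python
from statistics import mean, quantiles


def trimean(q1, median, q3):
    """Tukey's trimean: a weighted average of the three quartiles."""
    return (q1 + 2 * median + q3) / 4


def summarize(per_node_series):
    """Summarize samples given as one equally long list per node."""
    # Step 1: average over all nodes at each sample time.
    node_average = [mean(values) for values in zip(*per_node_series)]
    # Step 2: statistics over time on the averaged series.
    # The "inclusive" quartile method is an assumption.
    q1, median, q3 = quantiles(node_average, n=4, method="inclusive")
    # Min and max are taken over the raw samples of every node,
    # not over the node averages.
    raw = [value for series in per_node_series for value in series]
    return {
        "min": min(raw),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(raw),
        "trimean": trimean(q1, median, q3),
    }


if __name__ == "__main__":
    # Quartiles from the CPU usage example report above.
    print(round(trimean(79.3, 88.6, 98.6), 1))
```

Plugging in the quartiles from the CPU usage example (79.3, 88.6 and 98.6) gives 88.775, which agrees with the reported average of 88.8 %.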
The monitoring is done by the nmon program, which runs locally on each node and writes log data to /scratch/local every 10 seconds. A preliminary test indicates that the runtime overhead, as measured on a 32-node VASP job, is 0.5%. The logs consume ca. 300KB per hour per node uncompressed, and are currently copied to your current working directory when you stop sampling.
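To get a feeling for how much log data ends up in your working directory, the figures above can be turned into a rough estimate. The helper below is a hypothetical sketch that simply multiplies the quoted numbers (about 300KB per hour per node, one sample every 10 seconds); actual log sizes will vary.

```python
def jobstats_log_estimate(nodes, hours, kb_per_node_hour=300, sample_interval_s=10):
    """Estimate sample count and uncompressed log volume for a job.

    The defaults are the approximate figures quoted in the text above.
    """
    samples_per_node = int(hours * 3600 / sample_interval_s)
    log_kb = nodes * hours * kb_per_node_hour
    return {"samples_per_node": samples_per_node, "log_kb": log_kb}


if __name__ == "__main__":
    # Hypothetical job: 32 nodes for 24 hours.
    estimate = jobstats_log_estimate(nodes=32, hours=24)
    print(estimate["samples_per_node"], "samples per node")
    print(estimate["log_kb"], "KB of uncompressed logs in total")
```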
The seff tool
The seff command displays data that the resource manager (Slurm) collected while the job was running. Please note that the data is sampled at regular intervals and might miss short peaks in memory usage.
If your job failed and seff shows memory utilization close to 100%, you can assume that the job ran out of memory. If you need to know for sure, contact NSC Support and ask us to analyze the logs for the job in question.
Example (in this case job 12345678 ran out of memory and was killed, and seff shows very close to 100% memory utilization):
$ seff 12345678
Job ID: 12345678
Cluster: tetralith
User/Group: x_makro/x_makro
State: FAILED (exit code 9)
Nodes: 1
Cores per node: 32
CPU Utilized: 00:23:37
CPU Efficiency: 11.84% of 03:19:28 core-walltime
Job Wall-clock time: 00:06:14
Memory Utilized: 88.20 GB
Memory Efficiency: 97.19% of 90.75 GB
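If you want to check many finished jobs, the rule of thumb above (job did not end normally and memory utilization is close to 100%) can be applied to saved seff output with a small script. The sketch below is an illustration only: the function names and the 95% threshold are assumptions, the list of suspicious states is taken from the signs and job states described on this page, and the parser relies on the output layout shown in the example, which may differ between Slurm versions. A positive result is a hint, not proof; NSC Support can check the node logs.

```python
import re

# States that match the out-of-memory symptoms described on this page.
SUSPICIOUS_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL"}


def parse_seff(text):
    """Extract the job state and memory efficiency from seff output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    match = re.match(r"([\d.]+)% of ([\d.]+) (\w+)",
                     fields.get("Memory Efficiency", ""))
    if match is None:
        raise ValueError("no 'Memory Efficiency' line found")
    state = (fields.get("State") or "UNKNOWN").split()[0]
    return {
        "state": state,
        "memory_efficiency_percent": float(match.group(1)),
        "memory_available": f"{match.group(2)} {match.group(3)}",
    }


def looks_like_oom(info, threshold_percent=95.0):
    """Heuristic only: abnormal end state plus memory use near 100%."""
    return (info["state"] in SUSPICIOUS_STATES
            and info["memory_efficiency_percent"] >= threshold_percent)


if __name__ == "__main__":
    sample = """State: FAILED (exit code 9)
Memory Utilized: 88.20 GB
Memory Efficiency: 97.19% of 90.75 GB"""
    info = parse_seff(sample)
    print(info["state"], info["memory_efficiency_percent"], looks_like_oom(info))
```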