A batch job is a non-interactive way of running an application: no user input is possible, and everything that happens during the job is controlled by the job script (sometimes known as a “submit script”). When a batch job is submitted to the system, it is placed in a queue and started at a later time (sometimes immediately). An obvious advantage of this approach is that you can queue many batch jobs at the same time; they will start automatically as resources become available, i.e. you do not need to sit in front of the computer to start your calculations.
Preparing a batch job: write a job script that requests the resources your job needs and contains the commands to run (see the examples below), and load any modules needed to run your job. The environment in the shell where you run “sbatch” will be saved and recreated when the job starts. This includes the current working directory. You can also place the “module load” commands in your job script, and they will then be run automatically when the job starts.
Submitting a batch job: pass the job script to “sbatch”, which places the job in the queue.
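For example, a typical prepare-and-submit session might look like this (the module name “somemodule”, the file name “myjob.sh” and the job ID are placeholders chosen for illustration, not real NSC names):
[x_makro@tetralith1 ~]$ module load somemodule
[x_makro@tetralith1 ~]$ sbatch myjob.sh
Submitted batch job 1817150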
Once in the queue, the job might be started immediately (if enough idle compute resources are available) or it might have to wait in the queue for a while (minutes, hours, days or in extreme cases even longer).
Different NSC systems have very different scheduling policies and utilization, so queue times vary significantly between systems and projects. See the system documentation for more details.
If you don’t understand why your job won’t start, please contact NSC Support.
You can monitor all your jobs, both batch and interactive, using the “squeue” command (e.g “squeue -u $USER” to see your own jobs).
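Some useful squeue variants (these are standard Slurm options; the job ID is illustrative):
squeue -u $USER              # all your pending and running jobs
squeue -j 12345              # status of one specific job
squeue -u $USER -t RUNNING   # only your currently running jobs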
When the job has started, the standard output and standard error from the job script (which will contain output from your application if you have not redirected it elsewhere) will be written to a file named slurm-NNNNN.out in the directory where you submitted the job (NNNNN is replaced with the job ID).
If you need all the details about a pending or running job, use “scontrol show job JOBID”. Use “squeue” to find the job ID you need.
If you want to cancel (end) a queued or running job, use the “scancel” command and provide the job ID (e.g “scancel 12345”).
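Putting these commands together (the job ID 12345 is illustrative):
squeue -u $USER           # find the job ID
scontrol show job 12345   # show all details for that job
scancel 12345             # cancel the job if needed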
The environment (current working directory and environment variables such as $PATH) that was set when you submitted the job is recreated on the node where the job will be started.
The job script starts executing on the first node allocated to the job. If you have requested more than one node, your job script is responsible for starting your processes on all nodes in the job, e.g by using srun, ssh or an MPI launcher.
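As a minimal sketch (not an NSC-specific recipe), here is a job script that uses srun to start one task on each allocated node:
#!/bin/bash
#SBATCH -J hostnames
#SBATCH -t 00:05:00
#SBATCH -N 2
#
# Start one task per allocated node; each task prints the
# hostname of the node it runs on.
srun --ntasks-per-node=1 hostname
# Script ends here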
If the job is still running when its wall time limit is reached (set e.g with “sbatch -t HH:MM:SS”), the job will be killed automatically.
Sample job script: run an MPI application “mympiapp” on two “exclusive” (not shared with others) nodes:
#!/bin/bash
#
# Job name (shown in e.g squeue output).
#SBATCH -J myjobname
# Wall time limit (30 minutes).
#SBATCH -t 00:30:00
# Number of nodes.
#SBATCH -N 2
# Do not share the nodes with other jobs.
#SBATCH --exclusive
#
# mpprun is NSC's MPI job launcher.
mpprun ./mympiapp
# Script ends here
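Assuming the script above is saved as “mpijob.sh” (a file name chosen here for illustration), it is submitted with:
sbatch mpijob.sh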
Sample job script: run a single-threaded application on a single core and allocate 2000 MB (about 2 GB) of RAM (the node might be shared with other jobs). Also send an email when the job starts and ends. Note! Replace the string “put-your-email-address-here” with the real email address you want the notifications to be sent to.
#!/bin/bash
#
# Job name.
#SBATCH -J myjobname
# Wall time limit (30 minutes).
#SBATCH -t 00:30:00
# Memory required, in megabytes.
#SBATCH --mem=2000
# A single task (one core).
#SBATCH -n 1
# Send email when the job starts, ends or fails.
#SBATCH --mail-type=ALL
#SBATCH --mail-user=put-your-email-address-here
#
# Run a single task in the foreground.
./myapp --flags
#
# Script ends here
Hint: most of our clusters have a few nodes reserved for test and development (see the system documentation for details). Use these nodes to quickly check your job script before submitting it to the normal queue (where you might need to wait for hours or days before your job starts, only to find out that you made a simple error in the job script).
The reservation name can vary: on Tetralith it is “now”, on most other clusters it is “devel”.
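For example, to submit a short test job to the Tetralith reservation (“--reservation” is a standard sbatch/interactive option; the script name is a placeholder):
sbatch -t 00:10:00 -N 1 --reservation=now myjob.sh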
You can also use the “interactive” command to get an interactive login session on a compute node. From there you can test your application and job script interactively in an environment that is almost 100% identical to the environment the real batch job will run in. “interactive” takes the same command line options (e.g “-t”, “-N”) as “sbatch”.
The advantage of testing batch jobs in an interactive session is that you can quickly fix a bug, re-run the script, find another bug, fix it, and so on. This can speed up the process of debugging job scripts significantly compared to submitting them normally.
Example:
[x_makro@tetralith1 ~]$ interactive -t 00:10:00 -n2 --reservation=now
Waiting for JOBID 1817147 to start
[x_makro@n1 ~]$ bash myjob.sh
myjob.sh: line 2: badspell: command not found
Now I edit myjob.sh and fix the problem, and run it again:
[x_makro@n1 ~]$ bash myjob.sh
1
2
3
Here I press Control-C to stop the job, as it seems to be working now.
[x_makro@n1 ~]$
Great, now all that remains is to end the interactive session (type “exit”) and submit the job normally:
[x_makro@n1 ~]$ exit
[x_makro@tetralith1 ~]$ sbatch -t 3-00:00:00 -N 128 --exclusive myjob.sh
Submitted batch job 1817151
[x_makro@tetralith1 ~]$
The “wall time” limit (set with the “-t D-HH:MM:SS” option to sbatch/interactive) determines how long your job may run (in actual hours, not core hours) before it is terminated by the system.
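The time limit can be given in several standard Slurm formats, for example (the “...” stands for the rest of your command line):
sbatch -t 30 ...           # 30 minutes
sbatch -t 00:30:00 ...     # 30 minutes
sbatch -t 3-00:00:00 ...   # 3 days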
If your job ends before the time limit is up, your project will only be charged for the actual time used.
However, there are a few reasons for not always asking for the maximum allowed time: for example, a job with a shorter wall time limit can often start earlier, since the scheduler can use it to fill gaps between larger jobs (“backfilling”).
We recommend adding a margin to the wall time setting to prevent jobs from failing if they for some reason run slightly slower than expected (e.g due to high load in the disk storage system).
Please read the man pages (e.g run “man sbatch”) on the cluster or read them online.