This page tries to describe the job scheduling policy on Tetralith, i.e how the system decides which jobs should be started and when.
It also describes how users and projects can interact with the system and adapt the job scheduling to their workflow (or in some cases, adapt their workflow to the job scheduling).
If you don’t have time to read all of this, I recommend that you at least read “How you can adapt the job scheduling to your workflow, or vice versa” (see the list of related articles at the bottom of this page) and decide which model works best for your project. Then run your jobs, and come back to this page if you need to.
The policy is complicated, which is why there are so many footnotes1 on this page. Read them for more in-depth information. There are also a number of short articles on various related subjects which you will find at the bottom of this page.
The basic goals for the scheduling policy are:
All projects should be able to use the core hours they have been allocated.2
The system should be kept as busy as possible.3
Projects with different needs (e.g maximum throughput vs quick access to compute nodes) should all be able to use the system effectively.4
A key decision we have made is to never terminate/preempt a running job to let a higher priority job run.5
On Tetralith we use Slurm for job scheduling.6
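For readers who are new to Slurm, a minimal job script might look like the sketch below. The account name, node count and wall time are placeholders used for illustration; use your own project ID (e.g as shown by projinfo) and the resources your job actually needs.

```bash
#!/bin/bash
# Minimal Slurm job script sketch. The account name below is a placeholder;
# replace it with your own project ID.
#SBATCH --account=your-project-id   # project to charge the core hours to
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --time=02:00:00             # requested wall time (HH:MM:SS)
#SBATCH --job-name=example

# The actual work of the job goes here.
srun ./my_program
```

Submit the script with sbatch and inspect your queued and running jobs with squeue -u $USER.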
The scheduling policy and its various limits, tools etc are complex. We are very aware of this. Unfortunately, all simple scheduling policies (e.g a plain FIFO queue) have problems, and no matter which one you start with, you tend to end up with many added limitations, exceptions and external tools, and then the simple policy is no longer simple… ↩
There are of course situations in which this is not possible. E.g if a project has run nothing for the first 29 days of the month, it’s very difficult to let it run all its core hours on the last day. ↩
Even if idle compute nodes use less power, the depreciation cost is a very large part of the total system cost, so it makes sense to use the system as much as possible. ↩
Some projects want nothing except to run as many jobs as possible but do not care about how long an individual job has to wait in the queue. Another project might need to get access to compute nodes quickly, but cares less about the total throughput. ↩
In many ways, terminating running low-priority jobs to allow new high-priority jobs to start would simplify job scheduling and provide a lot of flexibility, but it has a high cost to users - all jobs must be made restartable (and many applications cannot even easily be restarted). Making jobs preemptable would be unfair to the many users whose applications cannot be restarted - their jobs would then randomly fail if a higher-priority job was submitted. ↩
As far as possible, we use Slurm’s built-in features to realize the scheduling policy, but we have also developed tools (e.g boost-tools and bonus job scheduling) that work outside Slurm. ↩
Backfilling is the process of scheduling jobs on compute nodes that would otherwise be idle while waiting for enough nodes to become available to start a large job. If there are idle nodes available and a lower priority job can be started without affecting the estimated start time of the highest priority job, the lower priority job is started. If more than one low-priority job could be started using backfill, the highest priority one is selected. In general, jobs shorter than a few hours have a good chance to be started using backfill. However, please don’t make your jobs too short; see the article on short jobs and job steps (listed at the bottom of this page) for why. ↩
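If you are curious about when the scheduler currently estimates that your queued jobs will start (by backfill or otherwise), Slurm can show its estimate. A minimal sketch; note that the estimates are recalculated continuously and can move in both directions as other jobs are submitted, finish early or are cancelled:

```bash
# Show the scheduler's current start-time estimates for your pending jobs.
# The estimates are not promises; they change as the queue changes.
squeue -u $USER --start
```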
To use the development nodes, add “--reservation=now” to your sbatch/interactive options and request one hour or less of walltime. Example: interactive --reservation=now -N1 -t 01:00:00. A single user can only use a total of two devel nodes (64 cores) at any one time. We encourage you to use the devel nodes to test new jobs before submitting them to the normal queue. That way you can quickly find simple problems like syntax errors in the job script, etc. It is not possible to move an existing job into the “now” reservation; if you do that, the job will be cancelled. ↩
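The reservation can also be used for batch jobs, as long as the requested walltime stays within the one-hour limit. A sketch (the job script name is an assumption for illustration):

```bash
# Run a short test job on the devel nodes; the requested time must be one hour or less.
sbatch --reservation=now -N1 -t 00:30:00 test_jobscript.sh
```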
This limit (sometimes referred to as the “MAXPS” limit) prevents a project from having more remaining core hours in running jobs than 1.0x its monthly allocation. Without this limit, if the queue was empty and there were many idle nodes, a single project (even a very small one) could fill the entire system with 7-day jobs, running much more than its allocated time and making it difficult for other projects to get any work done in the next 7 days. Example: if a project with an allocation of 1000 core hours/month already has one job running that uses 600 cores and has one hour left to run, the biggest job that can be started using that project is 400 core hours (e.g a 40-core job for 10 hours). Note that this also puts an upper limit on how large a job a project can run, as a job larger than 1.0x the monthly allocation violates this rule and will never start. Jobs blocked by this rule have the Reason “AssocGrpCPURunMinutesLimit” set (shown by squeue). ↩
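To check whether any of your own jobs are currently blocked by this limit, you can ask squeue to print the Reason column. A minimal sketch using standard squeue format codes:

```bash
# List your jobs with job ID, name, state and Reason. Jobs blocked by the MAXPS
# limit show the Reason "AssocGrpCPURunMinutesLimit".
squeue -u $USER -o "%.18i %.30j %.8T %.30r"
```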
When your project has ended, any jobs still left in the queue will be shown with the Reason “QOSUsageThreshold” set (shown by squeue). This is a little confusing, but it is due to the fact that your project size is now zero, and you have used more than 1.5x zero core hours recently. If you do not remove such jobs from the queue, NSC might do so. ↩
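If you want to clean up such leftover jobs yourself, scancel can be used. A minimal sketch (replace the placeholder with an actual job ID):

```bash
# Remove a single job from the queue (replace <jobid> with the ID shown by squeue),
scancel <jobid>
# or cancel all of your own queued and running jobs in one go.
scancel -u $USER
```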
You can under some conditions run longer jobs, see boost-tools. If you ask for a much larger Timelimit than your job actually needs, it becomes harder to predict the start time for queued jobs (which is bad for both you and other users) and your job might start later than it otherwise would (due to not being eligible for backfill or bonus). ↩
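If a job is already queued with a larger Timelimit than it needs, you can usually lower (but not raise) the limit yourself with scontrol; whether this is permitted can depend on the site configuration, so treat this as a sketch rather than a guarantee:

```bash
# Lower the time limit of an already queued job to make it a better backfill candidate.
# Replace <jobid> with your job ID; users can normally only decrease the limit.
scontrol update JobId=<jobid> TimeLimit=02:00:00
```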
The reason why we don’t prioritize interactive jobs over batch jobs is that many batch jobs are urgent, and that many interactive jobs aren’t (e.g starting a Matlab GUI in an interactive session and then running it for 7 days). We also believe there would be some abuse of the system if we prioritized interactive jobs. ↩
Bonus jobs run “for free” (they don’t affect the project’s future fair-share priority), but they show up in e.g projinfo and SUPR. Bonus jobs are only started when too many nodes are idle, and we never fill all idle nodes with bonus jobs (as that would prevent new high-priority jobs from being started quickly). Bonus jobs are started in priority order (i.e a project that has run 1.5x its allocation will get its jobs started as bonus jobs before a project that has run 1.7x its allocation). The 24 hour limit is set to limit how far into the future bonus jobs can affect normal jobs. It’s common to see bonus jobs start during weekends when demand is lower, and allowing them to run for e.g 7 days would cause them to prevent higher-priority jobs from starting on Monday. We will sometimes allow bonus jobs longer than 24h if there is more than 24h remaining until the start of the next workday. Bonus jobs should be considered an extra bonus, not a right! Remember, you can still run up to 1.5x your allocation before you’re blocked from running normal jobs! The command projinfo --qos=bonus can be used to see how much CPU time has been used by bonus jobs. Jobs may still be shown as “QOSUsageThreshold” even if they might be started as bonus jobs. ↩
How you can adapt the job scheduling to your workflow, or vice versa
A more detailed description of fair-share scheduling and settings
Flat jobs and their hidden cost
Tips and tricks for monitoring queued and running jobs
Short jobs and job steps and their hidden cost
Why it is not possible to determine when a job will start
The hard limit for how much a project may run. Also known as the QOSUsageThreshold limit
Some reasons for why your job won't start, and what you can do about it