The storage/file system software (GPFS/Scale Storage) used on NSC Centre Storage (i.e. /proj and /home on Tetralith and Sigma) will sometimes prefetch data. If your application reads a small amount of data from a file, GPFS might read more than that from our storage servers, since it expects you to continue reading from the file.
This is not always a good idea, and GPFS tries to be smart about when it should prefetch data and when it should not. But sometimes it gets it wrong. In extreme cases a compute node might read many times more data from the storage system than the application actually uses.
GPFS will typically get the easy cases right. An application doing e.g. seek(random location)-read(4MB)-seek(random location)-read(4MB)-… will not trigger prefetch. An application reading a file sequentially from beginning to end will trigger prefetch.
An example of an I/O pattern that GPFS does not handle well is seek(random location)-read(4MB)-read(4MB)-seek(random location)-read(4MB)-read(4MB)-… GPFS will then do a large prefetch read as soon as it sees the second 4MB read from the application, but all of that data except the first 4MB will never be used.
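The problematic pattern above can be reproduced with a few lines of shell, which may help when testing whether a job type is affected. This is only a sketch under stated assumptions: the function name burst_read and its arguments are made up for this illustration, and it relies on dd seeking to the requested block and then issuing count sequential reads of bs bytes. Keep the number of bursts small so the test itself does not load the shared storage.

```shell
# burst_read FILE BLOCK_SIZE BLOCKS_PER_BURST BURSTS   (hypothetical helper)
# Repeats: seek to a random block, then read BLOCKS_PER_BURST blocks in a row.
# The data read is written to stdout. The body runs in a subshell, so its
# variables do not leak into the calling shell.
burst_read() (
    file=$1 bs=$2 per_burst=$3 bursts=$4
    size=$(wc -c < "$file")
    max_start=$(( size / bs - per_burst ))   # last block a burst may start at
    if [ "$max_start" -lt 0 ]; then
        echo "burst_read: $file is smaller than one burst" >&2
        exit 1
    fi
    i=0
    while [ "$i" -lt "$bursts" ]; do
        # 32-bit random number from the kernel, reduced to a block index
        r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
        start=$(( r % (max_start + 1) ))
        # dd seeks to block "start", then does per_burst sequential reads
        dd if="$file" bs="$bs" skip="$start" count="$per_burst" 2>/dev/null
        i=$(( i + 1 ))
    done
)
```

For example, burst_read "$TESTFILE" 4194304 2 10 > /dev/null performs ten rounds of seek(random location)-read(4MB)-read(4MB) on the file named by the (hypothetical) variable TESTFILE.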
Excessive prefetch will usually make your application slightly slower, but more importantly it puts a lot of extra load on the storage system, making disk I/O slower for everyone.
When we detect unusually high storage load we will track down the source, so you might be contacted by us and asked to modify your jobs or run fewer jobs in parallel.
But we also have another option: disabling prefetch completely for the compute node(s) that run your job.
To do this, submit your job with the option -C gpfsnoprefetch.
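As an illustration, the option can be given either on the sbatch command line or as a directive in the job script (-C is the short form of Slurm's --constraint). The sketch below writes a minimal job script; the file name job_noprefetch.sh, the time limit and the program my_application are placeholders invented for this example, not NSC defaults.

```shell
# Write a minimal job script that requests a node with prefetch disabled.
# my_application and the time limit are placeholders; replace with your own.
cat > job_noprefetch.sh <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -C gpfsnoprefetch

./my_application
EOF

# Submit it with:
#   sbatch job_noprefetch.sh
# or add the option to an existing, unmodified job script at submit time:
#   sbatch -C gpfsnoprefetch existing_job.sh
```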
If you believe you have an application that could benefit from turning off prefetch, we suggest that you run the same job both with and without -C gpfsnoprefetch. If you see a clear improvement in performance, feel free to use it. If performance drops or is unchanged, do not use it.
There is also a script, check_read_amplification, that you can use to measure the approximate “read amplification” that happens on a node, i.e. how much more data is read from the storage system than is read by the application.
If the script shows a high read amplification and a lot of network bandwidth used, that job type is a good candidate for trying the gpfsnoprefetch feature.
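In order to get meaningful data from the script, the job must run on a single node and must be the only job on that node (i.e. it is submitted with -N1 and --exclusive).
Example of how to use the script: log in to the compute node running your job using jobsh, then run check_read_amplification. The script has some options; run check_read_amplification --help to see them.
Here is an example where we measure a job that reads the first block of a file, then seeks to a random position in the file, reads 10 blocks, and repeats.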
[kronberg@tetralith1 ~]$ jobsh n1428
[kronberg@n1428 ~]$ check_read_amplification --interval=10
2024-09-24 16:38:32.309197: read amplification is 1.8 X network: 181.93 MiB/s, applications: 101.60 MiB/s
2024-09-24 16:38:42.349928: read amplification is 1.8 X network: 203.84 MiB/s, applications: 115.20 MiB/s
[...]
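The reported amplification is simply the network rate divided by the application rate; in the first sample above, 181.93 / 101.60 ≈ 1.8. When the script runs for a long time it can be convenient to condense its output into one number. The following is a minimal sketch, assuming the output format shown above stays unchanged; the function name summarize_amplification is made up for this example.

```shell
# summarize_amplification (hypothetical helper): reads the output of
# check_read_amplification on stdin and prints the number of samples and
# the overall ratio of network traffic to application reads.
summarize_amplification() {
    awk '/read amplification is/ {
             net += $9     # network rate in MiB/s
             app += $12    # application read rate in MiB/s
             n++
         }
         END {
             if (n > 0 && app > 0)
                 printf "samples=%d amplification=%.2f\n", n, net / app
         }'
}
```

If the output was saved to a file, e.g. with check_read_amplification --interval=10 > amp.log (the file name is just an example), it can be summarized with summarize_amplification < amp.log. Lines that are not samples, such as shell prompts, are ignored.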