How can I transfer data to and from Centre Storage?

See this page for a basic guide on how to copy files to and from Centre Storage using scp, sftp, rsync, etc.
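
For example, a directory tree can be copied from your local computer with scp like this (a sketch; the hostname and paths are illustrative, substitute your own login node, username and project directory):

scp -r dataset x_abcde@tetralith.nsc.liu.se:/proj/someproject/users/x_abcde/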

Note: the advanced multiple-pass rsync method described below might be useful when moving large amounts of data to and from Centre Storage.

Moving data within a Centre Storage project directory

When moving data within a project directory, you can use the normal Linux "mv" command.

Example:

mv -i /proj/someproject/users/x_abcde/dataset /proj/someproject/shared_datasets/

The -i option ensures that you won't accidentally overwrite existing files.

The above works regardless of whether dataset is a file, a directory, or a whole directory tree.

The destination directory /proj/someproject/shared_datasets must exist, or mv will complain and refuse to move the files.
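
If it does not already exist, you can create it first:

mkdir -p /proj/someproject/shared_datasets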

Inside a project directory, mv is atomic and near-instantaneous (i.e. it will not copy the files and then delete the originals).

Moving data between Centre Storage project directories

The technology[1] used to implement quota limits for project directories causes each project directory to appear to Linux as a separate file system. This means that mv will not be atomic and near-instantaneous; instead, mv will copy each file and delete the original copy (just as it would when moving files between physically separate disks).

This is significant when you move large amounts of data. If mv is interrupted while running, any files not yet moved will remain in the original directory and any files already moved will be in the destination directory. Restarting mv after such an interruption is usually not possible, and you will have to recover manually, usually using cp+rm or rsync.

Due to this behavior, we recommend always using rsync when moving a large amount of data between two project directories. The example below will safely move (and rename to dataset42) the directory tree /proj/someproject/users/x_abcde/dataset to /proj/anotherproject/shared/dataset42.

Please note that you will need read access to all data that is being copied. If this is not the case, rsync will complain, but continue and copy the files that it can. So please check the rsync output carefully. If you do not have read access, or want to preserve file ownership, NSC will need to do the transfer for you. In this case, contact NSC Support to discuss your options.
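
One way to check for unreadable files in advance is GNU find (a sketch; the source directory is the one from the example below):

find /proj/someproject/users/x_abcde/dataset ! -readable

Any path this prints is one you cannot read.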

In the example below, we will ask rsync to preserve as many properties of the moved files as possible. Please note that some things (e.g. file ownership) cannot be preserved unless you run rsync as root (which NSC would then have to do for you).

If you know that some of these options are not needed (e.g. if you know you have no hard links or sparse files), you can omit them, which will speed up the transfer. You can find the full definition of these options in the rsync man page (run man rsync to read it). Please note that -a is shorthand for -rlptgoD.

    -a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
    -r, --recursive             recurse into directories
    -l, --links                 copy symlinks as symlinks
    -p, --perms                 preserve permissions
    -t, --times                 preserve modification times
    -g, --group                 preserve group
    -H, --hard-links            preserve hard links
    -S, --sparse                handle sparse files efficiently
    -v, --verbose               increase verbosity
    -n, --dry-run               perform a trial run with no changes made
        --delete                delete extraneous files from dest dirs

Simple version: copy data, then delete

Check that the destination directory does not already exist:

ls /proj/anotherproject/shared

First, create our rsync command and test it using the --dry-run option (which will not actually copy anything):

rsync -aHSv --dry-run /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42

Please note the trailing slash ("/") in ".../dataset/". This is important when using rsync (as opposed to many other Linux commands): with the trailing slash, rsync copies the contents of dataset into dataset42; without it, rsync would instead create dataset42/dataset. The details are explained in the rsync man page.

If you see nothing strange in the output, run it without the --dry-run option to actually copy the files:

rsync -aHSv /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42

Now, have a quick look to see that the copied files are actually present in the destination directory. We also do a quick sanity check by comparing the total size of the directories.

ls -lR /proj/anotherproject/shared/dataset42
du -sh --apparent-size /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
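
If you want a more thorough (but much slower) check, you can ask rsync to compare the actual file contents. A dry run with the --checksum option should report no differing files:

rsync -aHSvn --checksum /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42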

If you feel happy with the result, now is the time to remove the original files:

rm -rvf /proj/someproject/users/x_abcde/dataset

Advanced version: two passes, useful if the data is in active use

If you are moving a large volume of data, it might take hours or days to copy it. If the directory tree is in active use and being written to, this can be a problem.

This method allows you to make an initial copy of the data, then stop all accesses to the original files and run one final rsync (which will be much faster since it only needs to copy data that has changed since the first rsync).

Step 1: perform an initial rsync as described above.

Step 2: stop all writes to the original directory tree.

If you're the only person with access to the files, you can simply stop writing to them. :) If telling other people to stop writing is not an option, you can change the permissions of the top-level directory so that only you can access it:

chmod go= /proj/someproject/users/x_abcde/dataset

Another option is to rename the top-level directory:

mv /proj/someproject/users/x_abcde/dataset /proj/someproject/users/x_abcde/dataset.hidden

Step 3: run the final rsync.

You can simply re-run the same rsync command you used for the initial copy. However, if files have been deleted from the original directory tree, those will not be deleted on the destination side. You can ask rsync to also delete files on the destination side that are not present in the original directory. THIS CAN BE DANGEROUS, so be very careful when using --delete.

To safely use --delete in our example, first run once using --dry-run:

rsync -aHSv --delete --dry-run /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42

Then, if the output looks OK (i.e. no unexpected new files, updates or deletions), run again without --dry-run to update the destination copy and delete any files removed from the original:

rsync -aHSv --delete /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42

An example of how NOT to use --delete (the destination directory /proj/anotherproject/shared is all wrong; this command would delete EVERYTHING in /proj/anotherproject/shared except the files we are copying):

#DO NOT RUN THIS# rsync -aHSv --delete /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared

If this all sounds complicated and scary: NSC can help with large or complicated file transfers within Centre Storage. Contact NSC Support to discuss your options.

Parallel rsync

Rsync uses only a single thread to copy data. Sometimes this can be a bottleneck. Also, when copying data to and from the cluster over SSH, SSH will usually only use a single CPU core for encrypting the data, which can create a bottleneck if the network and the remote computer are fast enough.

The command fpsync tries to solve these two problems. It works by calling fpart to split a directory tree into chunks. Each chunk is then handed over to a separate rsync process for transfer.

Since each chunk must contain at least one file, the number of files limits the number of chunks, which limits the possible speedup if you have very few files.

However, fpsync has some issues you need to be aware of:

  • It is more difficult to use than rsync. If rsync gives you acceptable transfer performance, there is no need to use fpsync.
  • It is less mature than rsync (fewer people use it, so it is probably more buggy). To mitigate this, you can transfer your data once with fpsync, and then do a final rsync (as in "run the final rsync" above) to make sure fpsync didn't miss any files.
  • It can put significantly more load on the server where you run it (since it typically spawns multiple rsync processes, each using a full CPU core). While fpsync is available on the login nodes, we strongly suggest using them only for testing, and running the actual file transfers on a compute node.
  • It can put significantly higher load on the shared storage system (since it typically spawns multiple rsync processes that each cause as much load on the storage system as a single normal rsync). Please limit yourself to only as many concurrent jobs (-n option to fpsync) as you need to get acceptable performance.

If you want to try fpsync, start by reading the man page (man fpsync) and the example below.

If you're unsure about how to use fpsync or how many concurrent jobs are safe to run, feel free to contact NSC Support.

Example: basic fpsync

In this example, we will use a compute node to copy a large directory tree from one project directory to another.

Note: This directory tree is not very large (126 GiB, 4018 files). In real life most people would probably just use rsync to copy it, but it is large enough to show the performance boost you can get from fpsync.

For this particular directory tree, the optimum number of concurrent rsync processes turned out to be just four.

If you are going to copy the same or similar directory trees many times, it might pay off to do some tests to determine what gives the best result for that specific tree. If not, we recommend using a low number (e.g. 4).

First, allocate a compute node:

[kronberg@tetralith0 ~]$ interactive -N1 --exclusive -t 24:00:00
salloc: Granted job allocation 10409514
srun: Step created for job 10409514
[kronberg@n190 ~]$ 

Copy the data using rsync and see how long it takes:

[kronberg@n190 ~]$ time rsync -aHS /proj/nsc/users/kronberg/fpsynctest/s1/ /proj/nsc-guest/users/kronberg/fpsynctest/rsync1

real    12m10.337s

While this is running, we can log in to the node using jobsh and check CPU usage using top (see the example below). In this case we can see that significant amounts of CPU are used by two rsync processes (one reading, one writing, together using between 0.2 and 1.5 CPU cores) and by "mmfsd" (a system process that communicates with the storage system, using less than 0.5 CPU cores).

The CPU usage and the distribution between rsync and mmfsd will vary over time depending on e.g. the size of the files being copied.
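
For example, from a second terminal on the login node (a sketch; n190 is the node allocated in the example above):

jobsh n190
top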

Now, we run fpsync with 2, 4, and 8 workers/concurrent processes to see what the performance will be:

[kronberg@n190 ~]$ time fpsync -n 2 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-2

real    9m33.082s

[kronberg@n190 ~]$ time fpsync -n 4 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-4

real    6m4.310s

[kronberg@n190 ~]$ time fpsync -n 8 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-8

real    17m22.176s

Note: there are several rsync options that can have a significant performance impact.

If you know that you don't have any hard links in your data (links created with "ln" rather than "ln -s"), you can skip the '-H' option.

If you know that you have no sparse files in your data, you can skip the "-S" option.
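
If you are unsure, GNU find can check for both (a sketch; the path is the source directory from the example above):

# Files with more than one hard link
find /proj/nsc/users/kronberg/fpsynctest/s1 -type f -links +1

# Per-file "sparseness" (disk usage divided by apparent size); values well below 1.0 suggest sparse files
find /proj/nsc/users/kronberg/fpsynctest/s1 -type f -printf '%S\t%p\n' | sort -n | head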


  [1] IBM Spectrum Scale "filesets"

