Cluster Storage
Danger
There is no backup of the data stored on the cluster. Any removed file is lost forever. It is the user’s responsibility to keep a copy of the contents of their home directory in a safe place.
The FEDGEN HPC cluster is equipped with several file systems for storing files. These disk spaces have different properties, and each is designed for a different kind of use. They are listed in the table below.
Disk space | Scope | Environment variables | Lifetime | Filesystem
fedgenhome | cluster | $FEDGENHOME or $HOME | cluster or account lifetime | NFS
fedgenscratch | cluster | $FEDGENSCRATCH | 1 month | NFS
fedgenlocal | node | $FEDGENLOCAL | job lifetime | NFS
The ‘Scope’ column indicates where a storage space can be accessed from. A scope of ‘cluster’ means that all compute nodes, GPU nodes, and the login node of the cluster have access to that storage space. By contrast, a scope of ‘node’ means that the storage space is available only on that particular node.
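To see where each space lives for your account, you can print the environment variables from the table. This is a minimal sketch; the function name show_storage is hypothetical, and it assumes that a variable may be unset in some contexts (for example, $FEDGENLOCAL outside of a job), in which case a placeholder is printed.

```shell
# Print the path behind each storage variable, or a placeholder if it is unset.
# show_storage is a hypothetical helper name, not a cluster command.
show_storage() {
    printf 'home:    %s\n' "${FEDGENHOME:-<not set>}"
    printf 'scratch: %s\n' "${FEDGENSCRATCH:-<not set>}"
    printf 'local:   %s\n' "${FEDGENLOCAL:-<not set>}"
}

show_storage
```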
FEDGENHOME (User Home)
This is the file system for user home directories. Upon login on the login node, you will end up in your home directory. It is a 4 TB cluster-level storage space, accessible from the login node and from all the compute nodes of the cluster. Its full path can be shown with echo $FEDGENHOME or echo $HOME.
The fedgenhome filesystem is dedicated to source code (programs, scripts), configuration files, and small datasets (such as input files).
Note
Do not use this area for your main working activities; use the fedgenscratch directory instead (see next section).
FEDGENSCRATCH (Work Area)
FEDGENSCRATCH is a high-performance shared disk space common to all compute and GPU nodes and to the login node of the cluster. Its full path can be shown with echo $FEDGENSCRATCH. It is built on the Network File System (NFS).
The fedgenscratch space should be used to store the files generated by your batch jobs. You can also copy large input files there if required. It is customary for the job script to create a subdirectory named after the job ID and to write all temporary data there. When the job finishes, the script first copies the results to another location and then cleans that directory. The results can stay on the same filesystem if a later job will consume them, or go to another filesystem such as $HOME or a remote long-term storage.
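The workflow described above can be sketched as follows. This is an illustration under stated assumptions, not the cluster's official job template: JOB_ID stands in for whatever job ID variable your scheduler provides, the results directory is a hypothetical location on the same filesystem, and the two echo commands stand in for the real computation.

```shell
#!/bin/bash
# Assumption: JOB_ID is a placeholder for your scheduler's job ID variable;
# the shell's process ID is used when it is unset.
job_id="${JOB_ID:-$$}"
# Fall back to a temporary directory so the sketch also runs off-cluster.
scratch_root="${FEDGENSCRATCH:-$(mktemp -d)}"
workdir="$scratch_root/$job_id"
# Hypothetical destination on the same filesystem, for use by a later job.
results_dir="$scratch_root/results/$job_id"

mkdir -p "$workdir" "$results_dir"
cd "$workdir" || exit 1

# Stand-in for the real computation: temporary data plus one result file.
echo "intermediate data" > tmp.dat
echo "final result" > result.txt

# Copy the results first, then clean the per-job directory.
cp result.txt "$results_dir/"
cd / && rm -rf "$workdir"
```

To keep results on another filesystem instead, point results_dir at a directory under $HOME.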
To use FEDGENSCRATCH, you might first need to create a directory yourself. It is common to name it after your login.
The data in the FEDGENSCRATCH directory is not backed up and can be removed at any time, especially during maintenance periods.
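A minimal sketch of creating such a directory, assuming it sits directly under $FEDGENSCRATCH (the exact layout expected on the cluster is not specified here). The variable names scratch_root, login, and my_scratch are only illustrative.

```shell
# Fall back to a temporary directory so the sketch also runs off-cluster.
scratch_root="${FEDGENSCRATCH:-$(mktemp -d)}"
# Use the login name; id -un is the fallback when USER is not set.
login="${USER:-$(id -un)}"
my_scratch="$scratch_root/$login"

# -p makes the command safe to repeat if the directory already exists.
mkdir -p "$my_scratch"
echo "Scratch directory: $my_scratch"
```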
FEDGENLOCAL
The FEDGENLOCAL file system is the temporary disk space available on every compute node; it is visible only from within the compute node it belongs to. It is available through the $FEDGENLOCAL environment variable. Each one is built on top of a fast, redundant RAID-1 system.
There you can read and write temporary results during your job, copy the results of interest back to your home directory, and then delete the temporary files at the end of the job script.
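As a rough illustration of that pattern, the sketch below works in a per-job subdirectory of the local space and copies one result file back before deleting the rest. It rests on several assumptions: JOB_ID is a placeholder for your scheduler's job ID variable, the results directory under the home space is a hypothetical choice, and the echo commands stand in for real work.

```shell
# Assumption: JOB_ID is a placeholder for your scheduler's job ID variable.
job_id="${JOB_ID:-$$}"
# Fallbacks to temporary directories let the sketch run outside a job.
local_dir="${FEDGENLOCAL:-$(mktemp -d)}/$job_id"
dest_dir="${FEDGENHOME:-$(mktemp -d)}/results/$job_id"   # hypothetical location
mkdir -p "$local_dir" "$dest_dir"

# Stand-in for an I/O-heavy computation working on node-local disk.
echo "temporary data" > "$local_dir/tmp.dat"
echo "final result" > "$local_dir/result.txt"

# Copy back only what you want to keep, then free the local space.
cp "$local_dir/result.txt" "$dest_dir/"
rm -rf "$local_dir"
```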
Note
Files stored in the fedgenlocal directory on each node are removed immediately after the job terminates. You will not be able to access files in the fedgenlocal directory after your job has completed. Furthermore, files in the fedgenlocal directory are not accessible from any other node (compute nodes or the login node). Therefore, all files you want to save must be copied from the fedgenlocal directory to your home directory as part of your job.
Using the fedgenlocal directory when running a batch job is often more efficient if your job performs a lot of disk I/O on files that do not need to be shared between nodes.
Warning
There is no quota limit on the fedgenlocal disk space. Be careful not to fill the space; otherwise, the job will probably crash.
The size of this local scratch space depends on the node type.
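Since the available space varies and is not limited by a quota, a job can check how much room is left before writing large files. The sketch below uses the standard df command; the 1 GB threshold and the variable names are arbitrary examples, not cluster recommendations.

```shell
# Fall back to /tmp so the sketch also runs where FEDGENLOCAL is unset.
local_root="${FEDGENLOCAL:-/tmp}"

# df -Pk prints one line per file system in 1024-byte blocks;
# the fourth column of the second line is the available space.
avail_kb=$(df -Pk "$local_root" | awk 'NR==2 {print $4}')
echo "Available on $local_root: $avail_kb KB"

# Hypothetical threshold: warn when less than 1 GB is left.
min_kb=$((1024 * 1024))
if [ "$avail_kb" -lt "$min_kb" ]; then
    echo "Warning: less than 1 GB free on $local_root" >&2
fi
```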
Final remarks
As fedgenscratch is not meant to store data in the long term, you should expect its contents to be cleaned automatically after some time. You are nevertheless expected to always clean up after your job. This is especially important when a job crashes before the cleaning operations in the submission script have had a chance to run.
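One way to make the cleanup run even when the script stops early is a shell trap. The following is a sketch under assumptions: JOB_ID is a placeholder for your scheduler's job ID variable, and the per-job directory layout is only an example.

```shell
# Assumption: JOB_ID is a placeholder for your scheduler's job ID variable.
job_id="${JOB_ID:-$$}"
# Fall back to a temporary directory so the sketch also runs off-cluster.
workdir="${FEDGENSCRATCH:-$(mktemp -d)}/$job_id"
mkdir -p "$workdir"

cleanup() {
    # Remove the per-job directory; safe to call more than once.
    rm -rf "$workdir"
}

# Run cleanup whenever the script exits, normally or after an error.
trap cleanup EXIT
# Turn common termination signals into an exit so the EXIT trap still runs.
trap 'exit 1' INT TERM

# ... the job's commands would go here, writing into "$workdir" ...
```

A trap cannot catch SIGKILL, so a job that is killed that way leaves its directory behind and still needs to be cleaned up by hand.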