Update 202405: Difference between revisions
m (Bgo moved page Update 202405 - Rocky Linux and New Compute Nodes to Update 202405 without leaving a redirect) |
(22 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Recent node purchases and the discontinuation of CentOS have prompted the need for major updates on the Magnolia HPC cluster. | |||
__TOC__ | |||
The operating system of the Magnolia HPC cluster will be transitioned from CentOS 7 to [//www.rockylinux.org Rocky Linux 9]. Because the libraries in Rocky Linux 9 are more recent than those in CentOS 7, some custom packages will need to be recompiled. These include the packages listed in [[#Modules|Modules]], as well as user-installed packages, such as those installed with [//docs.anaconda.com/free/miniconda/index.html miniconda]. | |||
== Roadmap == | == Roadmap == | ||
Changes are required for the transition to Rocky Linux and the addition of new nodes; these include: | |||
{| class="table table-striped" | {| class="table table-striped" | ||
Line 10: | Line 13: | ||
! Topic !! Completed? !! Description | ! Topic !! Completed? !! Description | ||
|- | |- | ||
| Install new compute node hardware - Part I ||{{Yes}} | | Install new compute node hardware - Part I ||{{Yes}} | ||
| Vendor Installed | |||
* 6 Compute nodes, each with: | |||
** 2 Intel(R) Xeon(R) Gold 6448H Processors | |||
** 512 GiB RAM | |||
* 4 High memory compute nodes, each with: | |||
** 2 Intel(R) Xeon(R) Gold 6426Y Processors | |||
** 2 TiB RAM | |||
* 3 GPU compute nodes, each with: | |||
** 2 Intel(R) Xeon(R) Gold 6426Y Processors | |||
** 256 GiB RAM | |||
** 2 NVIDIA A100 80GB PCIe GPUs | |||
|- | |- | ||
| Install new compute node hardware - Part II|| {{No}} || | | Install new compute node hardware - Part II|| {{No}} || Hardware not yet shipped by vendor. | ||
|- | |- | ||
| Install operating system on new compute nodes || {{Yes}} | | Install operating system on new compute nodes || {{Yes}} | ||
| Using [//www.rockylinux.org Rocky Linux 9], since CentOS 7 reaches [//www.redhat.com/en/topics/linux/centos-linux-eol end of life (EOL)] on June 30, 2024. Integrating the new Rocky Linux compute nodes with the much older CentOS nodes took considerable time, but is now complete. | | Using [//www.rockylinux.org Rocky Linux 9], since CentOS 7 reaches [//www.redhat.com/en/topics/linux/centos-linux-eol end of life (EOL)] on June 30, 2024. Integrating the new Rocky Linux compute nodes with the much older CentOS nodes took considerable time, but is now complete. | ||
|- | |- | ||
| Update backup and cloning system || {{Yes}} || | | Update backup and cloning system || {{Yes}} || Significant programming used to enable backup and cloning software to work with both Rocky Linux and CentOS. | ||
|- | |- | ||
| Update Slurm || {{Yes}} | | Update Slurm || {{Yes}} | ||
| Updated to | | Updated to version that works on both Rocky Linux 9 and CentOS 7. Required very short downtime. | ||
|- | |- | ||
| Add new compute nodes to Slurm partitions || {{Yes}} | | Add new compute nodes to Slurm partitions || {{Yes}} | ||
| | | Two new partitions created: | ||
* | * kisame | ||
* suliaoma | * suliaoma | ||
These are currently only available to those who purchased the nodes, but will be available to all users after preemption (see below) is enabled. | |||
Four nodes were added to the {{C|himem}} partition, which is available to all users. | |||
|- | |- | ||
| Update module setup || {{No}} || WIP | | Update module setup || {{No}} || WIP. See the [[#New Modules Layout|New Modules Layout]] section for details on how to preview the new setup. | ||
|- | |- | ||
| Update modules for Rocky Linux 9 || {{No}} || WIP | | Update modules for Rocky Linux 9 || {{No}} || WIP | ||
Line 40: | Line 55: | ||
| Update login nodes to Rocky Linux 9 || {{No}} || TODO | | Update login nodes to Rocky Linux 9 || {{No}} || TODO | ||
|} | |} | ||
Efforts are being made to make these changes as transparent to the general user as possible; however, there will come a point where this is no longer possible. Ample notice will be given when it comes time for general users to change how they use the Magnolia HPC cluster. | |||
== Contacts == | |||
Upgrades may inadvertently cause some software packages to fail in ways that administration checks miss; if you suspect this has happened, please contact [mailto:Brian.Olson@usm.edu?Subject=Magnolia%20HPC%20Cluster Brian Olson] with details. | |||
== Slurm Partitions == | == Slurm Partitions == | ||
As mentioned in the [[#Roadmap|roadmap]], two new partitions have been created. The standard Slurm commands can be used to show current partition usage; alternatively, the {{C|coresavail}} command shows an overview of the cores and RAM available on each partition: | |||
{{Cmd|coresavail|output=<pre> | {{Cmd|coresavail|output=<pre> | ||
Line 50: | Line 73: | ||
3 suliaoma 32 257094 | 3 suliaoma 32 257094 | ||
4 himem 32 2063430 | 4 himem 32 2063430 | ||
24 node 20 128000 | |||
4 himem 20 512000 | 4 himem 20 512000 | ||
2 gpu 20 128000 | 2 gpu 20 128000 | ||
Line 56: | Line 79: | ||
</pre>}} | </pre>}} | ||
This example output shows that the {{C|himem}} partition has 4 compute nodes, each with 32 cores available and just under 2 TiB RAM free. Another 4 himem compute nodes are free, but each of these has 20 cores and about 0.5 TiB RAM free. | |||
The {{C|kisame}} and {{C|suliaoma}} partitions are currently only accessible to those who purchased the hardware (see [[#Roadmap|roadmap]] for details). | |||
== Modules == | == Modules == | ||
During the upgrade, the cluster will have some compute nodes running CentOS 7 and others running Rocky Linux 9. Because of library version mismatches, many of the software packages installed as modules on the cluster will not run on Rocky Linux, so a change in the layout of modules is required (see [[#Roadmap|roadmap]]). This change will make it easier to run software on the appropriate operating system and will also allow for easier upgrades in the future. | |||
=== New Modules Layout === | |||
The new module layout may be previewed by loading the {{C|newsetup}} module: | |||
{{Cmd|module load newsetup|module avail|output=<pre> | {{Cmd|module load newsetup|module avail|output=<pre> | ||
Line 102: | Line 129: | ||
</pre>}} | </pre>}} | ||
This new module layout will show modules in the {{Path|/modules/node/common}} path, which will run on both Rocky Linux 9 and CentOS 7. The last module path on the list is {{Path|/modules/node/Types}}, which, when loaded, will reveal packages that will run on that particular node type; the naming scheme for these modules is {{C|distribution name-distribution version-CPU model}}. | |||
The special {{C|thisnode}} module determines which type of node it is running on and loads the appropriate node module. After the upgrade, most users will likely load it in their Slurm submission scripts. To show which type of node you are currently on, use the {{C|module help thisnode}} command: | |||
{{Cmd|module help thisnode|output=<pre> | |||
----------- Module Specific Help for 'thisnode/0.1' --------------- | |||
Adds node specific modulefiles directory to MODULEPATH. | |||
Node Type: centos-7-79 | |||
</pre>}} | |||
=== Rocky Linux Modules === | === Rocky Linux Modules === | ||
To show packages that will run on Rocky Linux nodes, run the following commands: | |||
{{Cmd|module load rocky-9.3-143|module avail|output=<pre> | {{Cmd|module load rocky-9.3-143|module avail|output=<pre> | ||
Line 152: | Line 191: | ||
hdf5/1.12.1 hdf5/1.8.19 netcdf/4.4.1.1 netcdf/4.8.1 | hdf5/1.12.1 hdf5/1.8.19 netcdf/4.4.1.1 netcdf/4.8.1 | ||
</pre>}} | </pre>}} | ||
Along with the modules previously listed, additional modules for packages that will run on Rocky Linux nodes are now shown. These can be loaded and used like regular modules, but they will fail to run on CentOS 7 nodes. | |||
This list is current, as of May 28, 2024. | |||
=== Installation Priority === | |||
The bulk of my time is currently spent installing modules for the nodes running Rocky Linux. I am prioritizing modules that have been used 10 or more times over the last 90 days: | |||
{{GenericCmd|output=<pre> | |||
All nodes over last 90 days: | |||
Module Count | |||
hdf5 2089 | |||
netcdf 2079 | |||
intel 1381 | |||
mkl 1317 | |||
lammps 1294 | |||
mpfr 1225 | |||
gcc 1222 | |||
libpng 1109 | |||
isl 1057 | |||
x264 804 | |||
ffmpeg 792 | |||
esmf 749 | |||
MCT 744 | |||
scrip-coawst 742 | |||
fftw 737 | |||
nco 683 | |||
ncview 676 | |||
gaussian 645 | |||
python 433 | |||
cuda-toolkit 317 | |||
bedtools 258 | |||
samtools 258 | |||
openmpi 221 | |||
bowtie 203 | |||
R 165 | |||
orca 150 | |||
fastx-toolkit 146 | |||
digimat 109 | |||
icu 96 | |||
readline 93 | |||
ncurses 88 | |||
matlab 69 | |||
impi 56 | |||
cmake 54 | |||
openmpi-gcc 52 | |||
lapack 45 | |||
salmon 34 | |||
git 27 | |||
trimmomatic 20 | |||
newsetup 18 | |||
gmake 14 | |||
bowtie2 13 | |||
lammps-tools 13 | |||
mopac 13 | |||
gromacs 10 | |||
STAR 10 | |||
</pre>}} | |||
Every effort will be made to install the same versions of module packages as on CentOS 7; however, some older versions will simply not compile against the newer libraries of Rocky Linux. | |||
If you require a module not on this list, please make a request to those listed in the [[#Contacts|contacts]] section. | |||
== Using Rocky Linux Nodes == | |||
Users who wish to use the nodes running Rocky Linux should modify their Slurm submission scripts to load the {{C|newsetup}} and {{C|thisnode}} modules. A minimal Slurm batch script to run on the new {{C|suliaoma}} partition with 1 GPU is shown here: | |||
{{FileBox|filename=sample.sh|1= | |||
#!/bin/bash -l | |||
#SBATCH --partition=suliaoma | |||
#SBATCH --gres=gpu:1 | |||
#SBATCH --qos=bxmarg | |||
#SBATCH --nodes=1 | |||
#SBATCH --ntasks-per-node=5 | |||
#SBATCH --cpus-per-task=1 | |||
#SBATCH --mem=1000G | |||
#SBATCH --time=00:10:00 | |||
#SBATCH --job-name=example | |||
#SBATCH --output=myoutput-%j.txt | |||
cd "$SLURM_SUBMIT_DIR" | |||
module purge | |||
module load newsetup | |||
module load thisnode | |||
srun hostname | |||
}} |
Latest revision as of 16:02, 28 May 2024
Recent node purchases and the discontinuation of CentOS have prompted the need for major updates on the Magnolia HPC cluster.
The operating system of the Magnolia HPC cluster will be transitioned from CentOS 7 to Rocky Linux 9. Because the libraries in Rocky Linux 9 are more recent than those in CentOS 7, some custom packages will need to be recompiled. These include the packages listed in Modules, as well as user-installed packages, such as those installed with miniconda.
Roadmap
Changes are required for the transition to Rocky Linux and the addition of new nodes; these include:
Topic | Completed? | Description |
---|---|---|
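Install new compute node hardware - Part I | Yes | Vendor installed: 6 compute nodes, each with 2 Intel(R) Xeon(R) Gold 6448H processors and 512 GiB RAM; 4 high-memory compute nodes, each with 2 Intel(R) Xeon(R) Gold 6426Y processors and 2 TiB RAM; 3 GPU compute nodes, each with 2 Intel(R) Xeon(R) Gold 6426Y processors, 256 GiB RAM, and 2 NVIDIA A100 80GB PCIe GPUs. |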
Install new compute node hardware - Part II | No | Hardware not yet shipped by vendor. |
Install operating system on new compute nodes | Yes | Using Rocky Linux 9, since CentOS 7 reaches end of life (EOL) on June 30, 2024. Integrating the new Rocky Linux compute nodes with the much older CentOS nodes took considerable time, but is now complete. |
Update backup and cloning system | Yes | Significant programming used to enable backup and cloning software to work with both Rocky Linux and CentOS. |
Update Slurm | Yes | Updated to version that works on both Rocky Linux 9 and CentOS 7. Required very short downtime. |
Add new compute nodes to Slurm partitions | Yes | Two new partitions created: kisame and suliaoma. These are currently only available to those who purchased the nodes, but will be available to all users after preemption (see below) is enabled. Four nodes were added to the himem partition, which is available to all users. |
Update module setup | No | WIP. See the New Modules Layout section for details on how to preview the new setup. |
Update modules for Rocky Linux 9 | No | WIP |
Install CUDA drivers for new GPU compute nodes | Yes | |
Preemption partitions | No | TODO |
Update CentOS 7 compute nodes to Rocky Linux 9 | No | TODO |
Update login nodes to Rocky Linux 9 | No | TODO |
Efforts are being made to make these changes as transparent to the general user as possible; however, there will come a point where this is no longer possible. Ample notice will be given when it comes time for general users to change how they use the Magnolia HPC cluster.
Contacts
Upgrades may inadvertently cause some software packages to fail in ways that administration checks miss; if you suspect this has happened, please contact Brian Olson with details.
Slurm Partitions
As mentioned in the roadmap, two new partitions have been created. The standard Slurm commands can be used to show current partition usage; alternatively, the coresavail command shows an overview of the cores and RAM available on each partition:
user $ coresavail
Number of nodes in partition with N available cores and RAM.
Nodes  Partition  Available Cores  Available RAM (MiB)
6      kisame     64               515134
3      suliaoma   32               257094
4      himem      32               2063430
24     node       20               128000
4      himem      20               512000
2      gpu        20               128000
17     lomem      16               64000
This example output shows that the himem partition has 4 compute nodes, each with 32 cores available and just under 2 TiB RAM free. Another 4 himem compute nodes are free, but each of these has 20 cores and about 0.5 TiB RAM free.
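The RAM column of coresavail is given in MiB, so the TiB figures quoted above come from dividing by 1048576 (1024 × 1024). A minimal sketch of that arithmetic follows; the helper name `mib_to_tib` is made up for illustration and is not a command installed on the cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of the cluster tooling):
# convert a RAM figure in MiB, as printed by coresavail, to TiB.
# 1 TiB = 1024 * 1024 MiB = 1048576 MiB.
mib_to_tib() {
    # LC_ALL=C keeps the decimal separator a period regardless of locale.
    LC_ALL=C awk -v mib="$1" 'BEGIN { printf "%.2f\n", mib / 1048576 }'
}

mib_to_tib 2063430   # newer himem nodes: prints 1.97
mib_to_tib 512000    # older himem nodes: prints 0.49
```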
The kisame and suliaoma partitions are currently only accessible to those who purchased the hardware (see roadmap for details).
Modules
During the upgrade, the cluster will have some compute nodes running CentOS 7 and others running Rocky Linux 9. Because of library version mismatches, many of the software packages installed as modules on the cluster will not run on Rocky Linux, so a change in the layout of modules is required (see roadmap). This change will make it easier to run software on the appropriate operating system and will also allow for easier upgrades in the future.
New Modules Layout
The new module layout may be previewed by loading the newsetup module:
user $ module load newsetup
user $ module avail
------------------------ /usr/share/Modules/modulefiles ------------------------
dot  module-git  module-info  modules  null  use.own
--------------------------- /modules/node/common/MPI ---------------------------
impi/2017.4.196
------------------------ /modules/node/common/Programs -------------------------
bamtools/2.5.1          ffmpeg/4.4
bedtools/2.31.0         gaussian/16B.01-avx2
bowtie/1.2.2            gaussian/16C.01-avx2
bowtie2/2.3.4.3         gaussian/16C.01-LINDA-avx2
bowtie2/2.5.1           nco/4.7.6
cmake/3.15.3            nco/4.9.3
cmake/3.24.2(default)   openssl/1.0.2k
cmake/3.9.1             openssl/3.0.9
fastx-toolkit/0.0.14    salmon/0.12.0
ffmpeg/3.3.3            salmon/1.1.0
------------------------ /modules/node/common/Libraries ------------------------
isl/0.22                mpfr/4.0.1       trimmomatic/0.39
lapack/3.7.1            ncurses/5.9      x264/20171213
libjpeg-turbo/2.1.5.1   newsetup/0.1     zstd/1.5.5
libpng/1.5.30           openssl/1.0.2k
mkl/2017.0.3            openssl/3.0.9
------------------------ /modules/node/common/Languages ------------------------
cuda-toolkit/10.1.243(default)   gcc/6.4.0
cuda-toolkit/11.6.2              gcc/7.3.0
cuda-toolkit/8.0.61              gcc/8.3.0
gcc/11.4.0                       gcc/9.2.0
gcc/12.3.0                       intel/2017.4.196
gcc/13.2.0
----------------------------- /modules/node/Types ------------------------------
centos-7-63/0.1  centos-7-79/0.1  rocky-9.3-143/0.1  thisnode/0.1
This new module layout will show modules in the /modules/node/common path, which will run on both Rocky Linux 9 and CentOS 7. The last module path on the list is /modules/node/Types, which, when loaded, will reveal packages that will run on that particular node type; the naming scheme for these modules is distribution name-distribution version-CPU model.
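The naming scheme described above can be taken apart with plain shell parameter expansion. The following is only a sketch: the function name `parse_node_type` is invented here, and it assumes the name is passed without the trailing module version (for example `rocky-9.3-143`, not `rocky-9.3-143/0.1`) and that neither the distribution name nor the CPU model contains a hyphen.

```shell
#!/bin/bash
# Hypothetical helper (not part of the cluster tooling): split a node-type
# module name of the form
#   <distribution name>-<distribution version>-<CPU model>
# into its three parts, printed separated by spaces.
parse_node_type() {
    local name="$1"
    local distro="${name%%-*}"    # text before the first hyphen
    local cpu="${name##*-}"       # text after the last hyphen
    local rest="${name#*-}"       # name with the distribution removed
    local version="${rest%-*}"    # what remains once the CPU model is dropped
    printf '%s %s %s\n' "$distro" "$version" "$cpu"
}

parse_node_type rocky-9.3-143   # prints: rocky 9.3 143
parse_node_type centos-7-79     # prints: centos 7 79
```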
The special thisnode module determines which type of node it is running on and loads the appropriate node module. After the upgrade, most users will likely load it in their Slurm submission scripts. To show which type of node you are currently on, use the module help thisnode command:
user $ module help thisnode
----------- Module Specific Help for 'thisnode/0.1' ---------------
Adds node specific modulefiles directory to MODULEPATH.
Node Type: centos-7-79
Rocky Linux Modules
To show packages that will run on Rocky Linux nodes, run the following commands:
user $ module load rocky-9.3-143
user $ module avail
------------------------ /usr/share/Modules/modulefiles ------------------------
dot  module-git  module-info  modules  null  use.own
--------------------------- /modules/node/common/MPI ---------------------------
impi/2017.4.196
------------------------ /modules/node/common/Programs -------------------------
bamtools/2.5.1          ffmpeg/4.4
bedtools/2.31.0         gaussian/16B.01-avx2
bowtie/1.2.2            gaussian/16C.01-avx2
bowtie2/2.3.4.3         gaussian/16C.01-LINDA-avx2
bowtie2/2.5.1           nco/4.7.6
cmake/3.15.3            nco/4.9.3
cmake/3.24.2(default)   openssl/1.0.2k
cmake/3.9.1             openssl/3.0.9
fastx-toolkit/0.0.14    salmon/0.12.0
ffmpeg/3.3.3            salmon/1.1.0
------------------------ /modules/node/common/Libraries ------------------------
isl/0.22                mpfr/4.0.1       trimmomatic/0.39
lapack/3.7.1            ncurses/5.9      x264/20171213
libjpeg-turbo/2.1.5.1   newsetup/0.1     zstd/1.5.5
libpng/1.5.30           openssl/1.0.2k
mkl/2017.0.3            openssl/3.0.9
------------------------ /modules/node/common/Languages ------------------------
cuda-toolkit/10.1.243(default)   gcc/6.4.0
cuda-toolkit/11.6.2              gcc/7.3.0
cuda-toolkit/8.0.61              gcc/8.3.0
gcc/11.4.0                       gcc/9.2.0
gcc/12.3.0                       intel/2017.4.196
gcc/13.2.0
----------------------------- /modules/node/Types ------------------------------
centos-7-63/0.1  centos-7-79/0.1  rocky-9.3-143/0.1  thisnode/0.1
----------------------- /modules/node/rocky-9.3-143/MPI ------------------------
openmpi-gcc/4.1.5
--------------------- /modules/node/rocky-9.3-143/Programs ---------------------
lammps/20220623
-------------------- /modules/node/rocky-9.3-143/Libraries ---------------------
hdf5/1.12.1  hdf5/1.8.19  netcdf/4.4.1.1  netcdf/4.8.1
Along with the modules previously listed, additional modules for packages that will run on Rocky Linux nodes are now shown. These can be loaded and used like regular modules, but they will fail to run on CentOS 7 nodes.
This list is current, as of May 28, 2024.
Installation Priority
The bulk of my time is currently spent installing modules for the nodes running Rocky Linux. I am prioritizing modules that have been used 10 or more times over the last 90 days:
All nodes over last 90 days:
Module          Count
hdf5            2089
netcdf          2079
intel           1381
mkl             1317
lammps          1294
mpfr            1225
gcc             1222
libpng          1109
isl             1057
x264            804
ffmpeg          792
esmf            749
MCT             744
scrip-coawst    742
fftw            737
nco             683
ncview          676
gaussian        645
python          433
cuda-toolkit    317
bedtools        258
samtools        258
openmpi         221
bowtie          203
R               165
orca            150
fastx-toolkit   146
digimat         109
icu             96
readline        93
ncurses         88
matlab          69
impi            56
cmake           54
openmpi-gcc     52
lapack          45
salmon          34
git             27
trimmomatic     20
newsetup        18
gmake           14
bowtie2         13
lammps-tools    13
mopac           13
gromacs         10
STAR            10
Every effort will be made to install the same versions of module packages as on CentOS 7; however, some older versions will simply not compile against the newer libraries of Rocky Linux.
If you require a module not on this list, please make a request to those listed in the contacts section.
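The "10 or more uses" cutoff mentioned above amounts to a simple filter over a "Module Count" listing. The sketch below illustrates that rule only; how the usage counts are actually gathered is not described here, the function name `modules_at_or_above` is invented, and the `rare-tool` row is made-up sample data.

```shell
#!/bin/bash
# Hypothetical helper (not part of the cluster tooling): print the modules
# whose usage count is at or above a minimum.
# Input: lines of "<module> <count>" on standard input.
modules_at_or_above() {
    # The header line "Module Count" is skipped because "Count" + 0 is 0.
    awk -v min="$1" 'NF == 2 && $2 + 0 >= min + 0 { print $1 }'
}

# Sample rows in the same format as the listing; "rare-tool" is made up.
printf '%s\n' 'Module Count' 'hdf5 2089' 'gromacs 10' 'STAR 10' 'rare-tool 9' |
    modules_at_or_above 10
```

With the sample rows shown, only hdf5, gromacs, and STAR pass the filter.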
Using Rocky Linux Nodes
Users who wish to use the nodes running Rocky Linux should modify their Slurm submission scripts to load the newsetup and thisnode modules. A minimal Slurm batch script to run on the new suliaoma partition with 1 GPU is shown here:
#!/bin/bash -l
#SBATCH --partition=suliaoma
#SBATCH --gres=gpu:1
#SBATCH --qos=bxmarg
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000G
#SBATCH --time=00:10:00
#SBATCH --job-name=example
#SBATCH --output=myoutput-%j.txt
cd "$SLURM_SUBMIT_DIR"
module purge
module load newsetup
module load thisnode
srun hostname