Why a Job Scheduler?
A job scheduler is software that starts and manages compute jobs running on a computing cluster. The job scheduler also allocates and manages computing resources, controlling which resources are available to which users (login accounts). This example starts with a simple configuration, permitting all resources within one partition and without any constraints.
Although there are several open source job schedulers available, this example uses the Slurm Workload Manager. Slurm -- originally SLURM, for Simple Linux Utility for Resource Management, and not to be confused with the fictional Futurama beverage -- is used in many large HPC (High Performance Computing) clusters, but can also scale down to very small clusters, such as the one in this example configuration.
Installing Slurm
On the first (01) Master node, install the Slurm Job Scheduler / Login Server:
1. Install the Slurm Workload Manager
$ sudo apt-get install slurm-wlm
2. Install the sample Slurm configuration file on shared storage
$ sudo bash -c 'echo "include /data/slurm/slurm.conf" > /etc/slurm/slurm.conf'
$ sudo bash -c 'echo "include /data/slurm/cgroup.conf" > /etc/slurm/cgroup.conf'
$ sudo mkdir /data/slurm
$ sudo chown slurm:slurm /data/slurm
$ cd /data/slurm
$ sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz slurm.conf.gz
$ sudo gunzip slurm.conf.gz
$ sudo cp /etc/munge/munge.key .
3. Edit the Slurm daemon configuration file
Note: I am running node 01 as both a master (login and job controller) node and a compute node, given the small number of nodes in this example cluster; to prevent node 01 from being used as a compute node, simply change '[01-04]' to '[02-04]' in the PartitionName line below (see the example after the NodeName lines)
Note: Change names and IPs if following different node conventions than used in this example
$ sudo vi /data/slurm/slurm.conf and change the following lines:
SlurmctldHost=hydra01(192.168.4.111)
PartitionName=all Nodes=hydra[01-04] Default=YES MaxTime=INFINITE State=UP
then remove the single NodeName line and add the following to the end of the file:
NodeName=hydra01 NodeAddr=192.168.4.111 CPUs=4 State=UNKNOWN
NodeName=hydra02 NodeAddr=192.168.4.112 CPUs=4 State=UNKNOWN
NodeName=hydra03 NodeAddr=192.168.4.113 CPUs=4 State=UNKNOWN
NodeName=hydra04 NodeAddr=192.168.4.114 CPUs=4 State=UNKNOWN
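For example, if node 01 should act only as the master node (per the note above), the partition line would instead exclude it; this is just an illustration of that one change:
PartitionName=all Nodes=hydra[02-04] Default=YES MaxTime=INFINITE State=UP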
4. Edit the Slurm control group configuration file
Note: Initially no constraints are enforced; these settings may be tightened later (see the sketch below)
$ sudo vi /data/slurm/cgroup.conf and add
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
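If resource enforcement is tightened later (as mentioned in the note above), the same file might instead constrain cores and memory; this is only a sketch of a possible future configuration, not part of this setup:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes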
5. Add all of the cluster nodes to the hosts file
Note: Change names and IPs if following different node conventions than used in this example
$ sudo vi /etc/hosts
And add:
192.168.4.111 hydra01.colornetlabs.com hydra01
192.168.4.112 hydra02.colornetlabs.com hydra02
192.168.4.113 hydra03.colornetlabs.com hydra03
192.168.4.114 hydra04.colornetlabs.com hydra04
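To confirm that the new entries resolve, name lookups can be spot-checked (an optional sanity check; getent should echo back the address and names just added to /etc/hosts):
$ getent hosts hydra02
$ ping -c 1 hydra04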
On each of the remaining (02 -> 04) Compute nodes, install the Slurm compute node software:
1. Install the Slurm compute node daemon and client
$ sudo apt-get install slurmd slurm-client
2. Reference the Slurm configuration file on shared storage
$ sudo bash -c 'echo "include /data/slurm/slurm.conf" > /etc/slurm/slurm.conf'
$ sudo bash -c 'echo "include /data/slurm/cgroup.conf" > /etc/slurm/cgroup.conf'
$ sudo cp /data/slurm/munge.key /etc/munge/munge.key
3. Verify that the munge keys are in sync (if not, confirm that /etc/munge/munge.key is identical on every node; see the checksum check after the sample output below)
$ sudo systemctl restart munge
$ ssh pi@hydra01 munge -n | unmunge
which should display something like:
STATUS: Success (0)
ENCODE_HOST: hydra01.colornetlabs.com (192.168.4.111)
ENCODE_TIME: 2024-01-23 12:47:44 -0500 (1706032064)
DECODE_TIME: 2024-01-23 12:47:44 -0500 (1706032064)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: pi (1000)
GID: pi (1000)
LENGTH: 0
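If the decode fails, a quick way to confirm that the key files really match is to compare checksums between the local node and the master (an illustrative check, assuming the pi account can run sudo without a password):
$ sudo md5sum /etc/munge/munge.key
$ ssh pi@hydra01 sudo md5sum /etc/munge/munge.key
The two checksums should be identical.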
Slurm Startup
Reboot each of the four (4) nodes (master and compute):
$ sudo reboot
and once rebooted, check the status of the Slurm scheduler processes:
$ systemctl status slurmd (on all nodes 01 => 04)
$ systemctl status slurmctld (only on the 01 master node)
Because Slurm is installed as an enabled service (by default), it should start automatically. However, if it does not display as active (running), the cause may be that Slurm starts before the network is available. Others have reported this, and a suggested fix is to change network.target to network-online.target in both service files (slurmd.service and slurmctld.service, both found in /usr/lib/systemd/system). However, I have not found this to work.
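If either service fails to start, the systemd journal usually shows the reason (for example, an error about name resolution or network ordering):
$ sudo journalctl -u slurmd -b (on all nodes 01 => 04)
$ sudo journalctl -u slurmctld -b (only on the 01 master node)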
As a work-around, with the added benefit of restarting the Slurm processes if they ever crash, I have created the following script to be run as root via crontab:
$ sudo su -
# mkdir /data/slurm/bin
# vi /data/slurm/bin/start_slurm and add the following:
#!/bin/bash
# start_slurm
#
# Start the slurm daemon if it is not already running
#
# 23-Jan-2023 - DEF, original coding
#
DATA=/data/slurm
PATH=$PATH:/usr/sbin
SLURMD=`which slurmd`
SLURMCTLD=`which slurmctld`
# Check that slurmd is installed
if [ -z "SLURMD" ]
then
echo "`date` Cannot find slurmd in: $PATH"
exit 1
fi
# Check to make sure we have a slurm config file
if [ ! -f $DATA/slurm.conf ]
then
echo "`date` Slurm daemon configuration file not found!"
exit 2
fi
# Check to see if slurmd is running
running=`ps -aef | grep $SLURMD | grep -v grep | wc -l`
if [ $running -ge 1 ]
then
echo "`date` Slurm daemon is running"
else
echo "`date` Starting the slurm daemon"
systemctl start slurmd
if [ $? -ne 0 ]
then
echo "`date` Error starting slurm daemon!"
exit 3
fi
fi
# Check to see if we are the master controller
master=`cat $DATA/slurm.conf | grep '^SlurmctldHost=' | awk -F\= '{print $2}' | awk -F\( '{print $1}'`
if [ -z "$master" ]
then
echo "`date` Slurm master node is not defined!"
exit 4
fi
if [ "$master" != "`hostname -s`" ]
then
exit 0
fi
# Check to see if the slurm job controller is running
running=`ps -aef | grep $SLURMCTLD | grep -v grep | wc -l`
if [ $running -ge 1 ]
then
echo "`date` Slurm job controller is running"
exit 0
fi
echo "`date` Starting the slurm job controller"
systemctl start slurmctld
if [ $? -ne 0 ]
then
echo "`date` Error starting slurm job controller!"
exit 5
fi
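Before scheduling the script, make it executable and give it a quick manual run (still as root); it should report that the Slurm daemon and, on node 01, the job controller are running:
# chmod +x /data/slurm/bin/start_slurm
# /data/slurm/bin/start_slurm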
And then add that script to the root crontab:
# crontab -e and then add:
* * * * * /data/slurm/bin/start_slurm >> start_slurm.log
0 0 * * * mv start_slurm.log start_slurm.old && echo "`date`" > start_slurm.log
# exit
Running Jobs with Slurm
Now that the Slurm scheduler is running, the easiest way to test running commands across the cluster is with the srun command. Use the --nodes parameter to specify the number of cluster nodes on which to run the command. For example, to run the hostname command on all four (4) cluster nodes:
$ srun --nodes=4 hostname
hydra01
hydra03
hydra02
hydra04
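For longer-running work, jobs are more commonly submitted as batch scripts with the sbatch command; the following is a minimal sketch (the file name hello.sh, the job name, and the output pattern are arbitrary choices, not part of this configuration):
$ vi hello.sh and add:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=4
#SBATCH --output=hello-%j.out
srun hostname
$ sbatch hello.sh
$ squeue
sbatch prints the assigned job ID, squeue shows the job while it is queued or running, and the output lands in hello-<jobid>.out.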
To verify the cluster nodes that are available, use the sinfo command:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 4 idle hydra[01-04]
If a node is unavailable, it will display as down (e.g., if the 04 node has been shut down):
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 3 idle hydra[01-03]
all* up infinite 1 down hydra04
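Before resuming the node, the reason Slurm recorded for marking it down can be reviewed (an optional diagnostic step):
$ sinfo -R
$ scontrol show node hydra04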
Once the node is back up, it can be returned to service by using the scontrol command to set its state to resume:
$ sudo scontrol update nodename=hydra04 state=resume
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 4 idle hydra[01-04]
Open MPI
While a job scheduler such as Slurm is useful for running commands in parallel across a computing cluster, a Message Passing Interface (MPI) is needed for those running programs to exchange messages in a standardized manner.
Open MPI provides an open-source library for passing messages between parallel running programs. It is supported across a wide range of operating systems with APIs available across numerous programming languages, and it has been integrated with Slurm.
If developing programs that need to communicate with each other while running in parallel (beyond simply sharing data files), the next step -- Part 4: Open MPI -- provides an introduction to doing that with Open MPI.