Why a Job Scheduler?
A job scheduler is software that starts and manages compute jobs running on a computing cluster. The job scheduler also allocates and manages computing resources, controlling which resources are available to which users (login accounts). This example starts with a simple configuration, permitting all resources within one partition and without any constraints.
Although there are several open source job schedulers available, this example uses the Slurm Workload Manager. Slurm -- originally SLURM, for Simple Linux Utility for Resource Management, and not to be confused with the fictional Futurama beverage -- is used in many large HPC (High Performance Computing) clusters, but can also scale down to very small clusters, such as the one in this example configuration.
Installing Slurm
On the first (01) Master node, install the Slurm Job Scheduler / Login Server:
1. Install the Slurm Workload Manager
$ sudo apt-get install slurm-wlm
2. Install the sample Slurm configuration file on shared storage
$ sudo bash -c 'echo "include /data/slurm/slurm.conf" > /etc/slurm/slurm.conf'
$ sudo bash -c 'echo "include /data/slurm/cgroup.conf" > /etc/slurm/cgroup.conf'
$ sudo mkdir /data/slurm
$ sudo chown slurm:slurm /data/slurm
$ cd /data/slurm
$ sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz slurm.conf.gz
$ sudo gunzip slurm.conf.gz
$ sudo cp /etc/munge/munge.key .
3. Edit the Slurm daemon configuration file
Note: I am running node 01 as both a master (login and job controller) node and a compute node, given the small number of nodes in this example cluster; to prevent node 01 from being used as a compute node, simply change '[01-04]' to '[02-04]' in the PartitionName line below (see the example after the NodeName lines)
Note: Change names and IPs if following different node conventions than used in this example
$ sudo vi /data/slurm/slurm.conf and change the following lines:
SlurmctldHost=hydra01(192.168.4.111)
PartitionName=all Nodes=hydra[01-04] Default=YES MaxTime=INFINITE State=UP
then remove the single NodeName line and add the following to the end of the file:
NodeName=hydra01 NodeAddr=192.168.4.111 CPUs=4 State=UNKNOWN
NodeName=hydra02 NodeAddr=192.168.4.112 CPUs=4 State=UNKNOWN
NodeName=hydra03 NodeAddr=192.168.4.113 CPUs=4 State=UNKNOWN
NodeName=hydra04 NodeAddr=192.168.4.114 CPUs=4 State=UNKNOWN
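For example, if node 01 should act only as the master node (per the note above), the partition line would instead exclude it; this is just an illustration of that one change:
PartitionName=all Nodes=hydra[02-04] Default=YES MaxTime=INFINITE State=UP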
4. Edit the Slurm control group configuration file
Note: Initially no constraints are enforced; these settings may be tightened later (see the sketch below)
$ sudo vi /data/slurm/cgroup.conf and add
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
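If resource enforcement is tightened later (as mentioned in the note above), the same file might instead constrain cores and memory; this is only a sketch of a possible future configuration, not part of this setup:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes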
5. Add all of the cluster nodes to the hosts file
Note: Change names and IPs if following different node conventions than used in this example
$ sudo vi /etc/hosts
And add:
192.168.4.111 hydra01.colornetlabs.com hydra01
192.168.4.112 hydra02.colornetlabs.com hydra02
192.168.4.113 hydra03.colornetlabs.com hydra03
192.168.4.114 hydra04.colornetlabs.com hydra04
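To confirm that the new entries resolve, name lookups can be spot-checked (an optional sanity check; getent should echo back the address and names just added to /etc/hosts):
$ getent hosts hydra02
$ ping -c 1 hydra04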
On each of the remaining (02 -> 04) Compute nodes, install the Slurm compute node software:
1. Install the Slurm compute node daemon and client
$ sudo apt-get install slurmd slurm-client
2. Reference the Slurm configuration file on shared storage
$ sudo bash -c 'echo "include /data/slurm/slurm.conf" > /etc/slurm/slurm.conf'
$ sudo bash -c 'echo "include /data/slurm/cgroup.conf" > /etc/slurm/cgroup.conf'
$ sudo cp /data/slurm/munge.key /etc/munge/munge.key
3. Verify that the munge keys are in sync (if not, confirm that /etc/munge/munge.key is identical on every node; see the checksum check after the sample output below)
$ sudo systemctl restart munge
$ ssh pi@hydra01 munge -n | unmunge
which should display something like:
STATUS: Success (0)
ENCODE_HOST: hydra01.colornetlabs.com (192.168.4.111)
ENCODE_TIME: 2024-01-23 12:47:44 -0500 (1706032064)
DECODE_TIME: 2024-01-23 12:47:44 -0500 (1706032064)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: pi (1000)
GID: pi (1000)
LENGTH: 0
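If the decode fails, a quick way to confirm that the key files really match is to compare checksums between the local node and the master (an illustrative check, assuming the pi account can run sudo without a password):
$ sudo md5sum /etc/munge/munge.key
$ ssh pi@hydra01 sudo md5sum /etc/munge/munge.key
The two checksums should be identical.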
Slurm Startup
Reboot each of the four (4) nodes (master and compute):
$ sudo reboot
and once rebooted, check the status of the Slurm scheduler processes:
$ systemctl status slurmd (on all nodes 01 => 04)
$ systemctl status slurmctld (only on the 01 master node)
Because Slurm is installed as an enabled service (by default), it should start automatically. However, if it does not display as active (running), the cause may be that Slurm starts before the network is available. Others have reported this, and a suggested fix is to change network.target to network-online.target in both service files (slurmd.service and slurmctld.service, both found in /usr/lib/systemd/system). However, I have not found this to work.
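If either service fails to start, the systemd journal usually shows the reason (for example, an error about name resolution or network ordering):
$ sudo journalctl -u slurmd -b (on all nodes 01 => 04)
$ sudo journalctl -u slurmctld -b (only on the 01 master node)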
As a work-around, with the added benefit of restarting the Slurm processes if they ever crash, I have created the following script to be run as root via crontab:
$ sudo su -
# mkdir /data/slurm/bin
# vi /data/slurm/bin/start_slurm and add the following:
#!/bin/bash
# start_slurm
#
# Start the slurm daemon if it is not already running
#
# 23-Jan-2023 - DEF, original coding
#
DATA=/data/slurm
PATH=$PATH:/usr/sbin
SLURMD=`which slurmd`
SLURMCTLD=`which slurmctld`
# Check that slurmd is installed
if [ -z "SLURMD" ]
then
echo "`date` Cannot find slurmd in: $PATH"
exit 1
fi
# Check to make sure we have a slurm config file
if [ ! -f $DATA/slurm.conf ]
then
echo "`date` Slurm daemon configuration file not found!"
exit 2
fi
# Check to see if slurmd is running
running=`ps -aef | grep $SLURMD | grep -v grep | wc -l`
if [ $running -ge 1 ]
then
echo "`date` Slurm daemon is running"
else
echo "`date` Starting the slurm daemon"
systemctl start slurmd
if [ $? -ne 0 ]
then
echo "`date` Error starting slurm daemon!"
exit 3
fi
fi
# Check to see if we are the master controller
master=`cat $DATA/slurm.conf | grep '^SlurmctldHost=' | awk -F\= '{print $2}' | awk -F\( '{print $1}'`
if [ -z "$master" ]
then
echo "`date` Slurm master node is not defined!"
exit 4
fi
if [ "$master" != "`hostname -s`" ]
then
exit 0
fi
# Check to see if the slurm job controller is running
running=`ps -aef | grep $SLURMCTLD | grep -v grep | wc -l`
if [ $running -ge 1 ]
then
echo "`date` Slurm job controller is running"
exit 0
fi
echo "`date` Starting the slurm job controller"
systemctl start slurmctld
if [ $? -ne 0 ]
then
echo "`date` Error starting slurm job controller!"
exit 5
fi
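Before scheduling the script, make it executable and give it a quick manual run (still as root); it should report that the Slurm daemon and, on node 01, the job controller are running:
# chmod +x /data/slurm/bin/start_slurm
# /data/slurm/bin/start_slurm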
And then add that script to the root crontab:
# crontab -e and then add:
* * * * * /data/slurm/bin/start_slurm >> start_slurm.log
0 0 * * * mv start_slurm.log start_slurm.old && echo "`date`" > start_slurm.log
# exit
Running Jobs with Slurm
Now that the Slurm scheduler is running, the easiest way to test running commands across the cluster is with the srun command. Use the --nodes parameter to specify the number of cluster nodes on which to run the command. For example, to run the hostname command on all four (4) cluster nodes:
$ srun --nodes=4 hostname
hydra01
hydra03
hydra02
hydra04
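For longer-running work, jobs are more commonly submitted as batch scripts with the sbatch command; the following is a minimal sketch (the file name hello.sh, the job name, and the output pattern are arbitrary choices, not part of this configuration):
$ vi hello.sh and add:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=4
#SBATCH --output=hello-%j.out
srun hostname
$ sbatch hello.sh
$ squeue
sbatch prints the assigned job ID, squeue shows the job while it is queued or running, and the output lands in hello-<jobid>.out.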
To verify the cluster nodes that are available, use the sinfo command:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 4 idle hydra[01-04]
If a node is unavailable, it will display as down (e.g., if the 04 node has been shut down):
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 3 idle hydra[01-03]
all* up infinite 1 down hydra04
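Before resuming the node, the reason Slurm recorded for marking it down can be reviewed (an optional diagnostic step):
$ sinfo -R
$ scontrol show node hydra04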
Once the node is back up, it can be returned to service by using the scontrol command to set its state to resume:
$ sudo scontrol update nodename=hydra04 state=resume
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 4 idle hydra[01-04]
Open MPI
While a job scheduler such as Slurm is useful for running commands in parallel across a computing cluster, a Message Passing Interface (MPI) is needed for those running programs to exchange messages in a standardized manner.
Open MPI provides an open-source library for passing messages between parallel running programs. It is supported across a wide range of operating systems with APIs available across numerous programming languages, and it has been integrated with Slurm.
If developing programs that need to communicate with each other while running in parallel (beyond simply sharing data files), the next step -- Part 4: Open MPI -- provides an introduction to doing that with Open MPI.