University of Pretoria
Operational / Internal Site

Batch Scheduling

The EECE clusters are only accessible via a batch scheduling system. Tasks are submitted on alpha1-1 (known as the head node) to a batch queue, and the system then runs them automatically as resources become available.

A complication, however, is that the AFS home directories are not accessible on the queue processing nodes due to the strict security model AFS uses (tickets/tokens with a finite lifetime). Jobs to be submitted to the SLURM system must therefore be copied into an NFS-based home directory. On alpha1-1 the “default” home directory is this NFS-based home directory. AFS is also available under the /afs/ee.up.ac.za/user/ area (use “echo $HOME” on pst or wiener to find your AFS home directory path).

Therefore, before tasks can be submitted all necessary files must be copied from the AFS home directory to the NFS home directory. Once the job has been completed the results should be copied back to the AFS home directory.
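The copy-in/copy-out pattern can be sketched as follows. This is an illustration only: the directory names, the file names and the exact AFS path layout are placeholders, and throwaway temporary directories stand in for the real AFS and NFS homes so the snippet can be tried anywhere.

```shell
# Throwaway directories standing in for the AFS and NFS homes
# (the real paths are site-specific, e.g. under /afs/ee.up.ac.za/user/).
AFS_HOME=$(mktemp -d)   # stands in for your AFS home directory
NFS_HOME=$(mktemp -d)   # stands in for $HOME on alpha1-1

# Pretend the simulation inputs already exist in the AFS home.
mkdir -p "$AFS_HOME/my_sim"
echo "input data" > "$AFS_HOME/my_sim/model.in"

# Before submitting: copy the inputs from the AFS home to the NFS home.
cp -r "$AFS_HOME/my_sim" "$NFS_HOME/"

# ... sbatch runs here, producing results in $NFS_HOME/my_sim ...
echo "results" > "$NFS_HOME/my_sim/model.out"

# After completion: copy the results back to the AFS home.
cp -r "$NFS_HOME/my_sim" "$AFS_HOME/"
```

The same effect can of course be achieved with WinSCP or rsync; the point is only that inputs move AFS-to-NFS before submission and results move back afterwards.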

Overview

Simple Linux Utility for Resource Management (SLURM) is an open-source resource manager and job scheduling system. The entities managed by SLURM include nodes, partitions (groups of nodes), jobs and job steps. The partitions can also be considered job queues, each of which has a set of constraints such as job size limit, time limit, etc. Submitting a job to the system requires you to specify a partition. Under some circumstances, a Quality of Service (QoS), a classification that determines what kind of resources your job can use, must also be specified. Jobs within a partition are then allocated to nodes according to the scheduling policy, until all resources within the partition are exhausted.

There are several basic commands you will need to know to submit jobs, cancel jobs, and check status. These are:

  • sbatch - Submit a job to the batch queue system, e.g., sbatch myjob.slurm
  • squeue - Check the current jobs in the batch queue system, e.g., squeue
  • sinfo - View the current status of the queues, e.g., sinfo
  • scancel - Cancel a job, e.g., scancel 123

For more detailed information on SLURM please consult the SLURM tutorials and documentation. Local copies can be found here: http://lftp.ee.up.ac.za/youtube.com/SLURM/tutorial_slurm_intro.avi, http://lftp.ee.up.ac.za/youtube.com/SLURM/Introduction_to_SLURM_Tools.mp4.

Information on particular commands can also be obtained using the standard Unix manual page system:

man sbatch

Usage

The general approach to using SLURM on the clusters is as follows:

  1. Log onto the queue management node, i.e. “ssh alpha1-1” from the pass-through servers (External / Off-campus Access), or connect on campus to “alpha1-1.ee.up.ac.za” using PuTTY (Windows SSH client).
  2. Copy any needed files from your AFS home directory to the NFS home directory. Alternatively you can copy any needed files directly from your computer using WinSCP or a similar program.
  3. Update the job specification/script file. In particular, take care that any output being generated on one node will not be overwritten by a job running concurrently on another node. There are a number of environment variables that are set when the job runs, and these can be used to create subdirectories during the job start-up sequence.
  4. Submit the job using the “sbatch <scriptname>” command.
  5. You can monitor your jobs using the “squeue -l -u <username>” command.
  6. You can obtain more information about a particular job using the “squeue -l -j <jobnumber>” command.
  7. Once the job has been completed, download the results or copy them back to your AFS home directory.
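Step 3 above can be sketched as follows. SLURM sets a number of environment variables inside a running job, of which SLURM_JOB_ID is the most useful for keeping outputs separate; in this sketch a fallback value is supplied so the snippet can also be tried outside the queue, and ./my_program is a placeholder for your actual binary.

```shell
# Use the job's unique ID to create a per-job output directory, so
# concurrent jobs on different nodes do not overwrite each other's
# results. SLURM_JOB_ID is set automatically inside a real job; the
# fallback value 12345 is only for interactive testing.
SLURM_JOB_ID=${SLURM_JOB_ID:-12345}

OUTDIR="$PWD/results/job_$SLURM_JOB_ID"
mkdir -p "$OUTDIR"

# Redirect the simulation's output into the unique directory.
# (echo stands in for "./my_program" here.)
echo "simulation output" > "$OUTDIR/output.log"
```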

SLURM requires the job specification/script file to be in Unix format. If your files were edited on a Windows system you can do the conversion with:

fromdos <scriptname>
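If fromdos is not available, the same conversion can be done with sed, which strips the carriage-return character that Windows editors append to each line. The file name below is just an example:

```shell
# Create a deliberately DOS-formatted (CRLF) script for demonstration.
printf '#!/bin/bash\r\necho hello\r\n' > myjob.slurm

# sed equivalent of "fromdos myjob.slurm": delete the trailing CR
# from every line, editing the file in place.
sed -i 's/\r$//' myjob.slurm
```

After conversion the script runs normally under bash; with the CRLF endings still present, the shell would have reported errors such as “bad interpreter”.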

To edit the job specification/script file on the head node you can use one of the installed terminal text editors: nano (recommended for Windows users), jed, vim. If logged in with X11 forwarding enabled, the following GUI-based text editors are available: geany, scite.

When the job is scheduled to run, by default SLURM changes to the directory from which you originally submitted the job file before running the job script. You can also change the working directory as necessary in your job script (using the standard Unix cd command).

If you need to cancel a job, for example if there is a problem with the simulation parameters, this can be accomplished using the following command:

scancel <jobnumber>

Batch Jobs and Job Steps

A SLURM batch job is started as indicated above with the sbatch command. The job specification/script file is actually a standard Unix shell script (e.g. a BASH script file) with additional comments containing special tags near the top of the file. These tags are identical to command-line options of the sbatch command; embedding them in the script file is simply more compact. Each job script therefore typically consists of: SBATCH command tags as comments, various shell setup operations, and the actual simulation commands (i.e. the execution of a software binary on an input model file that will run for some time).

To allow SLURM to track the progress of a simulation, and to allow correct cancellation of jobs, each of the simulation steps should be initiated using the srun command. Most simulations would therefore have the following form:

simple.slurm
#!/bin/bash
# Special tags that contain parameters which define the resources required
# by the simulation and which will be used by SLURM to allocate these
# resources.
#SBATCH --time=00:30:00
 
# Perform simulation setup. For example, create a temporary directory in the
# scratch area.
mkdir -p /tmp/$USER/my_sim
 
# Now run my program (simulation) as a separate job step.
srun ./my_program

In these examples it is assumed that the ./my_program binary is single-threaded and hence uses only one CPU core. Another possibility is a simulation with a number of sequential runs:

serial.slurm
#!/bin/bash
#SBATCH --time=00:30:00
 
# Perform simulation setup. For example, create a temporary directory in the
# scratch area.
mkdir -p /tmp/$USER/my_sim
 
# Now run my program (simulation) as a separate job step, one for each
# index (in this example 0 to 9).
for i in {0..9}
do
    srun -l ./my_program $i
done

And for the case of a parallel simulation:

parallel.slurm
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00
 
# Perform simulation setup. For example, create a temporary directory in the
# scratch area.
mkdir -p /tmp/$USER/my_sim
 
# Now run my program (simulation) as four concurrent job steps, each with
# a different parameter (index). Since only four CPUs have been allocated,
# only four simulations will be run concurrently.
for i in {0..3}
do
    srun -l ./my_program $i &
done
 
# Wait until all four concurrent simulations complete before the job ends.
wait

General SLURM Batch Parameters

To ensure your job starts as early as possible, request only the minimum amount of resources your job needs. In particular the number of CPUs and the memory per CPU must be set correctly; see Job Accounting for instructions on how to determine these from past and current jobs.

The following parameters are typically used in SLURM scripts:

  • --cpus-per-task=4: Advise the SLURM controller that ensuing job steps will require 4 processors. Without this option, the controller will just try to allocate one processor per task.
  • --time=<1-10:00:00>: the maximum time the job is expected to run. The job will automatically be cancelled after this time. Specified as DAYS-HOURS:MINUTES:SECONDS, for example 1-10:00:00 for 1 day 10 hours. Note that shorter jobs are given preference by the scheduler.
  • --mail-user=<username@domain>: set to your email address in order to receive notification of job status.
  • --mail-type=END: together with the previous option, sends notification on job completion. To receive notification on both job start and completion, use BEGIN,END. If you set this option you MUST also set a valid email address with the previous option.
  • --job-name=<MYJOB>: the name of the job as it will be shown in the queue list. Use a unique and descriptive name so that you can easily identify particular simulations.
  • --output=<filename.log>: the name of a log file into which all output and status messages for the job will be written. Note that any existing file will be truncated at job start.
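Taken together, the options above might appear at the top of a job script as follows. All values here (job name, time limit, email address) are placeholders to be replaced with your own:

```shell
#!/bin/bash
#SBATCH --job-name=patch_antenna        # unique, descriptive name (example)
#SBATCH --output=patch_antenna.log      # truncated at job start
#SBATCH --cpus-per-task=4               # CPUs needed by the job steps
#SBATCH --time=1-10:00:00               # 1 day 10 hours maximum run time
#SBATCH --mail-user=jsmith@example.com  # required when --mail-type is set
#SBATCH --mail-type=END                 # notify on completion
```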

For most users it is not necessary to set the following:

  • --partition=<partition>: request a specific partition; only possible for some users, as most users only have access to the default partition.
  • --account=<account>: request a specific account; only possible for some users, as most users only have access to their default account.

Similarly, the following do not need to be set as the defaults will be correct for most simulations:

  • --begin=now+5hours: queue the job but delay the start for 5 hours. See the sbatch man page for time format options.
  • --nodes=<min-max>: the default is 1 node, which is the maximum most users are currently allowed.
  • --ntasks=1: generally all jobs should consist of a single task (simulation), potentially running on multiple processors.
  • --workdir=<directory>: set the working directory for the batch script; the default is the directory from which the job was submitted.

SLURM Examples

The various commercial software packages available on the clusters have unique requirements in terms of licenses and resources. Below you can find pages where templates for specific types of simulations can be downloaded. To download a template, click on the link at the top of the embedded template.

Job Accounting

SLURM keeps track of various statistics for each job, and you can use sacct to extract this information (see its man page for usage instructions). The most useful statistics are available via the custom smemio command. Its output looks as follows:

       JobID    JobName  Timelimit    Elapsed   UserCPU  AllocCPUS     ReqMem  MaxVMSize     MaxRSS  MaxDiskRead MaxDiskWrite
------------ ---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ------------ ------------
3174             lambda   20:00:00   05:45:04  23:00:14          4     4000Mc
3174.batch        batch              05:45:04  23:00:14          4     4000Mc    263296K      9012K        0.40M        0.20M
3174.0          runfeko              05:45:03  23:00:13          4     4000Mc   5419024K   3948592K          51M          27M

Only completed jobs for the current user within the past week will be shown. For a currently running job, use “sjmemio <jobid>” to get the current statistics. The first line gives the overall job parameters for the particular job id. The line containing .batch represents the job script statistics, excluding any job steps initiated with srun. In this example there is one job step, indicated with .0, that was initiated with srun: a FEKO simulation. The fields are as follows:

  • Timelimit: The requested job time limit.
  • Elapsed: The actual elapsed time for the job.
  • UserCPU: The amount of user CPU time used by the job or job step.
  • AllocCPUS: Total number of CPU cores allocated to the job.
  • ReqMem: Minimum required memory for the job, in MB. A 'c' at the end of the number represents memory per CPU; an 'n' represents memory per node.
  • MaxVMSize: Maximum amount of virtual memory the simulation step requested (but may not actually have used).
  • MaxRSS: The maximum resident set size, the maximum amount of physical memory the simulation step actually used.
  • MaxDiskRead: Maximum number of bytes read from disk storage in the job step.
  • MaxDiskWrite: Maximum number of bytes written to disk storage in the job step.
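Since smemio is a local wrapper, similar output can also be obtained directly from sacct by naming the same fields explicitly. The job id below is the one from the example; adjust it to your own:

```shell
sacct -j 3174 --format=JobID,JobName,Timelimit,Elapsed,UserCPU,AllocCPUS,ReqMem,MaxVMSize,MaxRSS,MaxDiskRead,MaxDiskWrite
```

This only works on the head node, where the SLURM accounting database is reachable.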

The above example therefore illustrates a poorly specified batch job. A run time of 20 hours was requested whilst the job only ran for about 6 hours. Similarly, 16 GB of memory was requested (4 CPU cores with 4 GB per core) whilst the job only actually used about 4 GB in total (but did attempt to allocate about 6 GB total). For this job --mem-per-cpu should have been set to no more than 2000.
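A corrected memory request can be estimated from the accounting figures. The rule below is an informal rule of thumb, not an official SLURM formula: size the allocation to cover the largest MaxVMSize per CPU plus roughly 50% headroom.

```shell
# Figures taken from step 3174.0 in the example output above.
MAXVMSIZE_K=5419024   # MaxVMSize in KiB
ALLOC_CPUS=4          # AllocCPUS

# KiB -> MiB, spread over the allocated CPUs, then add ~50% headroom.
PER_CPU_MB=$(( MAXVMSIZE_K / 1024 / ALLOC_CPUS ))
SUGGESTED_MB=$(( PER_CPU_MB * 3 / 2 ))
echo "--mem-per-cpu=${SUGGESTED_MB}"   # prints --mem-per-cpu=1984
```

This lands just under the 2000 MB per CPU suggested above, confirming the recommendation.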

High values in the last two columns indicate an I/O-bound job. Such jobs will most likely cause increased latency on the clusters if run from the NFS home directory; they should instead copy the required files to a newly created directory in the temporary scratch space area (/tmp). The following SLURM template shows how this could be done, for example, for the CST software (which is particularly problematic in this regard). The example assumes that the model has been uploaded to the user's NFS home directory as a single ZIP file, together with the associated SLURM template. The script automatically creates a temporary directory for the simulation and unpacks the ZIP file. Once the simulation completes, the ZIP file in the user's home directory is updated to include all changed and new files. The user can then download the resultant ZIP file directly for processing on their desktop.

cst_tmp.slurm
#!/bin/bash
#SBATCH --output=<CHANGETHIS>.log
#SBATCH --job-name=<CHANGETHIS>
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=<CHANGETHIS>
#SBATCH --licenses=cst:1
#SBATCH --time=<CHANGETHIS>
#SBATCH --mail-type=END
#SBATCH --mail-user=<CHANGETHIS>
 
# Define simulation base name
export SIMNAME=<CHANGETHIS>
 
# Create temporary directory for CST simulation in scratch space
mkdir -p /tmp/$USER/$SIMNAME
cd /tmp/$USER/$SIMNAME
echo "Work directory: $PWD"
 
# Unpack CST simulation
ls -alp $HOME/$SIMNAME.zip
srun unzip $HOME/$SIMNAME.zip
 
# Run CST model simulation
echo
srun /usr/local/CST/CST_STUDIO_SUITE/cst_design_environment --m --q --numthreads 12 $SIMNAME.cst
 
# Update the home directory ZIP file with the results
srun zip -9r $HOME/$SIMNAME.zip $SIMNAME.cst $SIMNAME