Batch Jobs
Contents
- LoadLeveler
- Class Overview
- Accounting
- Job scheduling policy
- Checkpointing
- LoadLeveler Keywords
- Consumable Resources
- Large memory nodes
- General Batch Scripts
- Application Specific Sample Jobscripts
- Batch Job Status
- Recommended Environment Variable Settings
LoadLeveler
Executable programs should run in the LoadLeveler batch system. Submitting and running your jobs through the batch system avoids system overload and enforces a fair share of the resources among the users. Before you can submit a job or perform any other job related tasks, you need to build a job command file. A job command file describes the job you want to submit and includes LoadLeveler keyword statements.
If your job needs to read and write large amounts of data (in the GiB range), create a directory on /work/$USER for intermediate storage. Files on /work expire after a period of 1-3 weeks and are deleted automatically if not used. Files reside on /work for a minimum of 1 week. The exact time of expiration depends on the total amount of data stored on the file system. Copy important files for future work back to your home directory ($HOME) on job exit. Files on /home are backed up, while files on /work are lost on expiration. Consider using the local file system /scratch for small-scale, frequent I/O, as /work and /home are parallel global file systems optimized for large accesses (1 MiB blocks and beyond). Note that /scratch is a small 32 GiB file system, for low-volume I/O only.
Commands
Command    Description
llsubmit   Submit a job
llclass    Returns information about classes
llq        Query information about jobs in the queues
llcancel   Cancel job from the queue
llstatus   Returns status information about nodes in the cluster
References
For more information about LoadLeveler, see IBM's redbook: Workload Management with LoadLeveler by Subramanian Kannan, Peter Mayes, Mark Roberts, Dave Brelsford, Joseph F Skovira, ISBN: 0738422096.
Class overview
Class     Max nodes  Min nodes  Max nodes/job  Max runtime  Description
forecast  180        1          160            unlimited    Top priority class dedicated to forecast jobs
bigmem    6          1          4              7 days       High priority 120 GB memory class
large     180        4          128            21 days      High priority class for jobs of 64 processors or more
normal    52         1          42             21 days      Default class
express   186        1          4              1 hour       High priority class for debugging and test runs
small     0.5        -          0.5            14 days      Low priority class for serial or small SMP jobs
Job accounting
Except for jobs in the small class, nodes are allocated for exclusive use. Jobs are charged for the wall clock time used to finish the job, multiplied by 16 times the number of nodes reserved. Notur accounting is offset with respect to a calendar year. In calendar year 2010, the accounting period 2010.1 starts April 1 and runs through September. The second accounting period, 2010.2, starts October 1 and ends March 31 2011. The application deadline is usually 1 month before the current period expires. Users applying for quota after the deadline, or applying for extra time in the current period, are usually granted non-priority CPU-time only.
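As a worked example of the charging rule above, the following sketch computes the CPU hours charged for an exclusive-node job. The function name charged_cpu_hours is made up for this illustration and is not a command on Njord; the factor 16 is the per-node multiplier stated above.

```shell
#!/bin/sh
# charged_cpu_hours NODES WALL_SECONDS   (hypothetical helper, not a Njord command)
# Charge = wall clock time in hours x 16 x number of nodes reserved.
# Does not apply to the small class, where nodes are shared.
charged_cpu_hours() {
    nodes=$1
    wall_seconds=$2
    awk -v n="$nodes" -v s="$wall_seconds" \
        'BEGIN { printf "%.2f\n", s / 3600 * 16 * n }'
}

charged_cpu_hours 2 3600    # 2 nodes for 1 hour      -> 32.00
charged_cpu_hours 4 5400    # 4 nodes for 1.5 hours   -> 96.00
```

Note that the charge depends on the nodes reserved, not on how many of the 16 cores per node the program actually keeps busy.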
Type
$ cost
for an overview of CPU-time charged on account(s) you are a member of.
To retrieve CPU-time charged to any account, type cost -k account_number, for instance
$ cost -k nn0666k
Accounting report for Njord IBM P5-575 AIX 5L
Last updated on Fri Mar 5 00:00:03 2010
===========================================================================
Account User Login Used CPU
name id name [hours]
===========================================================================
nn0666k 167671 bigtime 858.86
nn0666k 716167 bouncer 0.00
nn0666k 80085 *mabeagle 141.14
---------------------------------------------------------------------------
Usage : 1000.00
Quota : 20000.00
---------------------------------------------------------------------------
Avail (pri) : 9000.00
Avail (nonpri) : 10000.00
===========================================================================
Any logins prefixed with * are currently not a member of the account.
In the above accounting, user bigtime is charged with 858.86 hours during this period, while bouncer has not used any time so far. The user mabeagle is charged with 141.14 hours, but she is currently not a member of this account, and is not allowed to submit new jobs. It is the responsibility of the project manager to keep the member list updated. This account received a CPU grant of 10000 hours of priority time, together with 10000 non-pri hours, for a total of 20000 hours. It is not possible to specify which of the two to charge on a particular job. Priority time is always charged first. When it is used up, usage switches over to non-pri time. You are allowed to start a new job with llsubmit provided there is CPU-time left on the account at the time of submission.
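The priority-first rule can be checked against the report above with a few lines of arithmetic. This is only a sketch of the rule as described here; remaining_quota is a hypothetical name, and the real cost command may compute its figures differently.

```shell
#!/bin/sh
# remaining_quota PRI_GRANT NONPRI_GRANT USED   (hypothetical helper)
# Prints "avail_pri avail_nonpri", charging priority time first and
# moving to non-priority time only once priority time is exhausted.
remaining_quota() {
    awk -v p="$1" -v np="$2" -v u="$3" 'BEGIN {
        from_pri = (u < p) ? u : p     # part of the usage covered by priority time
        from_np  = u - from_pri        # overflow charged to non-priority time
        printf "%.2f %.2f\n", p - from_pri, np - from_np
    }'
}

remaining_quota 10000 10000 1000     # figures from the report: 9000.00 10000.00
remaining_quota 10000 10000 12000    # priority time used up:   0.00 8000.00
```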
For information on other switches to cost, type
$ cost -h
Job scheduling policy
Priority
- jobs requesting more than 4 nodes are prioritized
- short (< 1 hour) express jobs (i.e. debugging and test runs) are prioritized
General job limitations (per user)
- the maximum number of running jobs is 40
- the maximum number of idle jobs is 5; jobs above this maximum are placed in the NotQueued (NQ) state
Jobs that are rejected
- the CPU account requested has run out of quota
- jobs running on non-priority quota when the number of idle jobs exceeds 25
- resources (ConsumableMemory and/or ConsumableCpus) are not specified
- the memory requested per node is higher than memory available
- total number of cores per node requested is more than available (16 without SMT specified)
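The last two rejection rules can be checked before submitting. The sketch below is an assumption-laden illustration, not a site tool: check_request is a hypothetical name, the default of 13312 MB is the 13 GB per node quoted in the Consumable resources section, and 16 cores per node assumes SMT is not requested.

```shell
#!/bin/sh
# check_request TASKS_PER_NODE CPUS_PER_TASK MEM_MB_PER_TASK [NODE_MEM_MB]
# Hypothetical pre-submission check. Returns 0 if the per-node totals fit,
# 1 (with a message) if the job would ask for more than a node offers.
check_request() {
    tasks=$1
    cpus=$2
    mem_mb=$3
    node_mem_mb=${4:-13312}     # assumed: 13 GB available per regular node
    cores_per_node=16           # assumed: no SMT

    if [ $((tasks * cpus)) -gt "$cores_per_node" ]; then
        echo "too many cores per node: $((tasks * cpus)) > $cores_per_node"
        return 1
    fi
    if [ $((tasks * mem_mb)) -gt "$node_mem_mb" ]; then
        echo "too much memory per node: $((tasks * mem_mb)) MB > $node_mem_mb MB"
        return 1
    fi
    echo "ok"
}

check_request 16 1 832      # 16 tasks x 1 core, 16 x 832 MB = 13312 MB
check_request 16 2 832      # 32 cores requested on a 16-core node
```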
Checkpointing
Checkpointing is a method of periodically saving the state of a job step so that if the step does not complete it can be restarted from the saved state.
When checkpointing is enabled, checkpoints can be initiated from within the application at major milestones, or by the user, administrator or LoadLeveler external to the application. Both serial and parallel job steps can be checkpointed.
Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job and setting the restart_from_ckpt = yes job command file keyword. When a job is terminated and remains on the LoadLeveler job queue, such as when a job step is vacated, the job step will automatically be restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing checkpoint and hold, stopping or recycling LoadLeveler or as the result of a node crash.
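A restart job might carry keywords along the following lines. This is a sketch only: restart_from_ckpt is the keyword named above, while checkpoint, ckpt_dir and ckpt_file are keyword names taken from IBM's LoadLeveler documentation as far as I recall them, and the directory and file names are placeholders. Whether checkpointing is enabled for a given class on Njord is not stated here, so check before relying on it.

```shell
# Fragment of a job command file (keyword lines are shell comments).
# Assumed keywords: checkpoint, ckpt_dir, ckpt_file. Placeholder values.
# @ job_name          = myjob_restart
# @ checkpoint        = yes
# @ ckpt_dir          = /work/myuser/ckpt
# @ ckpt_file         = myjob.ckpt
# @ restart_from_ckpt = yes
# @ queue
```

The original job would use the same ckpt_dir and ckpt_file values without the restart_from_ckpt line, so that the restart job can find the saved state.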
The following items cannot be checkpointed:
- Programs that are being run under dynamic probe class library (DPCL) or any debugger.
- MPI programs that are not compiled with mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, or mpxlf95_r.
- Processes that use extended shmat support, pinned shared memory segments or debug malloc tool (MALLOCTYPE=debug)
- Sets of processes in which any process is running a setuid program when a checkpoint occurs.
- Sets of processes if any process is running a setgid program when a checkpoint occurs.
- Interactive parallel jobs for which POE input or output is a pipe.
- Interactive parallel jobs for which POE input or output is redirected, unless the job is submitted from a shell that had the CHECKPOINT environment variable set to yes before the shell was started. If POE is run from inside a shell script and is run in the background, the script must be started from a shell started in the same manner for the job to be checkpointable.
The node on which a process is restarted must have:
- The same operating system level (including PTFs). In addition, a restarted process may not load a module that requires a system call from a kernel extension that was not present at checkpoint time.
LoadLeveler Keywords
Keyword               Description
#@ job_name           Specifies the name of the job.
#@ shell              Specifies the name of the shell to use for the job.
                      Notice: specifying a shell with #! on the first line in the batch script overrides the shell keyword.
#@ account_no         Specifies the account number to charge for the job.
#@ class              Specifies the name of the class where the job will be run, see Class overview. Use the command llclass to find out information on job classes.
#@ job_type           Specifies the type of job step to process. For serial and SMP (OpenMP) programs specify serial, for MPI programs specify parallel.
#@ node               Specifies the number of nodes requested by a job step.
#@ tasks_per_node     Specifies the number of tasks per node.
#@ node_usage         Specifies whether the job shares nodes with other jobs. On Njord, except for the 'small' class, this is by default set to 'not_shared'.
#@ resources          Specifies the resources consumed by each task of a job step, see Consumable resources below.
#@ output             Specifies the name of the file to use as standard output (stdout).
#@ error              Specifies the name of the file to use as standard error (stderr).
#@ wall_clock_limit   Sets the time limit for the job. It is important to specify the wall clock limit as close to the actual run-time as possible, since this makes it easier for LoadLeveler to find a time-slot for the job. A job with a short wall clock limit may therefore pass quicker through the queue than a job with a long wall clock limit. If you do not specify the wall clock limit at all, your job will most likely have to wait in the queue for a very long time.
#@ environment        Specifies the initial environment variables set by LoadLeveler. Set COPY_ALL to copy all environment variables from your shell.
#@ env_copy           Specify all to copy environment variables to all nodes.
#@ notification       Specifies when to notify the user by mail. Default is mail sent when the job ends.
#@ queue              Marks the end of the job step and is required.
Consumable resources
When running batch jobs use the resources keyword in the job command file to specify the resources to be consumed by each task of a job step.
MPI programs
Set the 'job_type' to parallel, specify 1 for the 'ConsumableCpus' and define the number of MPI tasks through the 'node' and 'tasks_per_node' keywords. The 'ConsumableMemory' should be set to the available amount of memory per node (i.e. 13 GB) divided by the number of MPI tasks per node ('tasks_per_node'):
# @ job_type = parallel
# @ resources = ConsumableCpus(1) ConsumableMemory(832 mb)
# @ node = 2
# @ tasks_per_node = 16
OpenMP programs
Set the 'job_type' to serial, and specify the number of threads for the 'ConsumableCpus' and the 'OMP_NUM_THREADS' variable. The 'ConsumableMemory' should be set to the available amount of memory per node (i.e. 13 GB):
# @ job_type = serial
# @ resources = ConsumableCpus(16) ConsumableMemory(13 gb)
# @ environment = OMP_NUM_THREADS=16
Hybrid MPI/OpenMP programs
Set the 'ConsumableCpus' to the number of threads each MPI task will spawn and 'ConsumableMemory' to the amount of memory each MPI task will use. The example specifies 8 MPI tasks with 2 threads each and a total of 12 GB memory for the job:
# @ job_type = parallel
# @ resources = ConsumableCpus(2) ConsumableMemory(1536 mb)
# @ total_tasks = 8
# @ environment = OMP_NUM_THREADS=2
Large memory nodes
Six of the Njord compute nodes are large memory nodes with 115 GB available memory. To request these nodes, specify 'bigmem' for the job class. For example, when running 16 MPI processes, specify up to 115 GB / 16 = 7360 MB for the 'ConsumableMemory':
# @ class = bigmem
# @ tasks_per_node = 16
# @ resources = ConsumableCpus(1) ConsumableMemory(7360 mb)
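The per-task memory figures used above (832 MB on regular nodes, 7360 MB on large memory nodes) follow from dividing the node memory by the number of tasks per node. A minimal sketch of that arithmetic, with hypothetical helper names mem_per_task_mb and resources_line:

```shell
#!/bin/sh
# mem_per_task_mb NODE_MEM_GB TASKS_PER_NODE   (hypothetical helper)
# Node memory in MB divided evenly among the tasks on the node.
mem_per_task_mb() {
    node_mem_gb=$1
    tasks_per_node=$2
    echo $(( node_mem_gb * 1024 / tasks_per_node ))
}

# resources_line CPUS_PER_TASK NODE_MEM_GB TASKS_PER_NODE   (hypothetical helper)
# Prints a ready-to-paste resources keyword line.
resources_line() {
    echo "# @ resources = ConsumableCpus($1) ConsumableMemory($(mem_per_task_mb "$2" "$3") mb)"
}

resources_line 1 13 16     # regular node, 16 MPI tasks
resources_line 1 115 16    # large memory node, 16 MPI tasks
resources_line 8 13 2      # hybrid: 2 MPI tasks with 8 threads each
```

The integer division rounds down, which keeps the total request at or below the node memory.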
General batch scripts
Sequential job
#!/bin/ksh
# @ job_name = myjob
# @ account_no = nn1234k
# @ class = small
# @ job_type = serial
# @ resources = ConsumableCpus(1) ConsumableMemory(1 gb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 01:00:00
# @ environment = COPY_ALL
# @ env_copy = all
#
# @ queue
#
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER
if [ ! -d $w ]; then mkdir $w; fi
cd $w
#
# Copy inputfiles to working directory
cp $HOME/what_ever_path/inputfiles .
#
# Run my program
$HOME/what_ever_path/my_program
#
# Move results back to home directory
mv outputfiles $HOME/what_ever_path
MPI job
#!/bin/ksh
# @ job_name = myjob
# @ account_no = nn1234k
# @ class = normal
# @ job_type = parallel
# @ node = 2
# @ tasks_per_node = 16
# @ node_usage = not_shared
# @ resources = ConsumableCpus(1) ConsumableMemory(832 mb)
# @ network.MPI = sn_all,,us
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 01:00:00
# @ environment = COPY_ALL
# @ env_copy = all
#
# @ queue
#
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER
if [ ! -d $w ]; then mkdir $w; fi
cd $w
#
# Copy inputfiles to working directory
cp $HOME/what_ever_path/inputfiles .
#
# Run my program
$HOME/what_ever_path/my_mpi_program
#
# Move results back to home directory
mv outputfiles $HOME/what_ever_path
OpenMP job
#!/bin/ksh
# @ job_name = myjob
# @ account_no = nn1234k
# @ class = normal
# @ job_type = serial
# @ resources = ConsumableCpus(16) ConsumableMemory(4 gb)
# @ node_usage = not_shared
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 01:00:00
# @ environment = COPY_ALL; OMP_NUM_THREADS=16
# @ env_copy = all
#
# @ queue
#
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER
if [ ! -d $w ]; then mkdir $w; fi
cd $w
#
# Copy inputfiles to working directory
cp $HOME/what_ever_path/inputfiles .
#
# Run my program
$HOME/what_ever_path/my_omp_program
#
# Move results back to home directory
mv outputfiles $HOME/what_ever_path
Hybrid MPI/OpenMP job
#!/bin/ksh
# @ job_name = myjob
# @ account_no = nn1234k
# @ class = normal
# @ job_type = parallel
# @ node = 4
# @ tasks_per_node = 2
# @ node_usage = not_shared
# @ resources = ConsumableCpus(8) ConsumableMemory(6656 mb)
# @ network.MPI = sn_all,,us
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 01:00:00
# @ environment = COPY_ALL; OMP_NUM_THREADS=8
# @ env_copy = all
#
# @ queue
#
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER
if [ ! -d $w ]; then mkdir $w; fi
cd $w
#
# Copy inputfiles to working directory
cp $HOME/what_ever_path/inputfiles .
#
# Run my program
$HOME/what_ever_path/my_mpi_openmp_program
#
# Move results back to home directory
mv outputfiles $HOME/what_ever_path
Chapel job
#!/bin/ksh
# @ job_name = myjob
# @ account_no = nn1234k
# @ class = normal
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 16
# @ node_usage = not_shared
# @ resources = ConsumableCpus(1) ConsumableMemory(832 mb)
# @ network.LAPI = sn_all,,us
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 01:00:00
# @ environment = COPY_ALL
# @ env_copy = all
#
# @ queue
#
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER
if [ ! -d $w ]; then mkdir $w; fi
cd $w
#
# Copy inputfiles to working directory
cp $HOME/what_ever_path/inputfiles .
#
# Run my program
module load chapel
$HOME/what_ever_path/my_chapel_program -nl 16
#
# Move results back to home directory
mv outputfiles $HOME/what_ever_path
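The scripts above create the working directory with an explicit -d test and carry on even if cd or cp fails, in which case the program would run in the wrong directory. A slightly more defensive variant of the staging step could look like the sketch below. The function name stage_in is hypothetical; it is a suggestion, not part of the site's recommended scripts.

```shell
#!/bin/sh
# stage_in WORKDIR FILE...   (hypothetical helper)
# Creates the working directory if needed, moves into it and copies the
# input files there. Returns non-zero at the first step that fails.
stage_in() {
    workdir=$1
    shift
    mkdir -p "$workdir" || return 1    # -p: no error if it already exists
    cd "$workdir" || return 1
    for f in "$@"; do
        cp "$f" . || return 1
    done
}

# Example call inside a job script (paths as in the scripts above):
#   stage_in "$WORKDIR/$USER" "$HOME/what_ever_path/inputfiles" || exit 1
```

Quoting the variables also keeps the script working if a path ever contains spaces.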
Application specific sample jobscripts
Enter commands on Njord:
OpenFOAM
$ tar zxf /usr/local/OpenFOAM/testing/damBreak-test.tgz
$ edit openfoam.sh and change the account number
$ llsubmit openfoam.sh
Fluent
$ tar zxf /usr/local/fluent/commonfiles/example/test_fluent.tgz
$ cd test_fluent
$ edit fluent.job and change the account number
$ llsubmit fluent.job
Batch Job Status
R Running
The job is running.
I Idle
The job is queued, but currently no machine has been selected for execution.
The job may remain in this state forever
if the command file requests resources that match no target.
E Preempted
A running job step can be preempted to become inactive and be held in the virtual memory.
A preempted job step can be resumed to continue running again. Using preemption,
resources are released from preempted jobs. These resources can then be used to run other jobs
which might otherwise not be able to run due to lack of resources.
EP Preemption pending
The job is in the process of being preempted.
This state applies only when LoadLeveler uses the suspend method to preempt the job.
NQ Not Queued
The job is not being considered to run on a machine.
A job can enter this state because the associated Schedd is down, the user
or group associated with the job is at its maximum maxqueued or maxidle value,
or because the job has a dependency which cannot be determined.
CA Cancelled
The job is cancelled by the owner or an administrator.
C Completed
The job has completed.
H Hold
The job was put on hold by the owner of the job.
S System Hold
The job is put in system hold by an administrator.
ST Starting
The job was received by the target machine, and LoadLeveler is setting up the environment
in which to run the job.
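When job states are handled in a script, the abbreviations listed above can be turned into readable names with a small lookup. This is only an illustration built from the list above; status_name is a hypothetical helper, and the sketch makes no assumption about the layout of llq output.

```shell
#!/bin/sh
# status_name CODE   (hypothetical helper)
# Maps a LoadLeveler job status abbreviation to the state name listed above.
status_name() {
    case "$1" in
        R)  echo "Running" ;;
        I)  echo "Idle" ;;
        E)  echo "Preempted" ;;
        EP) echo "Preemption pending" ;;
        NQ) echo "Not Queued" ;;
        CA) echo "Cancelled" ;;
        C)  echo "Completed" ;;
        H)  echo "Hold" ;;
        S)  echo "System Hold" ;;
        ST) echo "Starting" ;;
        *)  echo "Unknown"; return 1 ;;
    esac
}

status_name NQ
status_name ST
```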
Recommended Environment Variable Settings
The recommended settings for the regular shell environment variables are loaded automatically.
POE Environment Flags
There are a number of additional POE environment variables for monitoring and controlling program execution. You can read about them in IBM's POE documentation.
Output
The environment variables MP_STDOUTMODE and MP_LABELIO control the stdout and stderr output from the processes. With the following settings:
$ export MP_STDOUTMODE=ordered
$ export MP_LABELIO=yes
the output is sorted by increasing order of ranks, and the rank number is added in front of the output from each process.
Fortran Environment Flags
If your program is written in Fortran and has been compiled with IBM's Fortran compilers, then you can specify a number of run-time options with the XLFRTEOPTS environment flag.
Endianness
As described in Copying Binary Data to Njord, Njord is a big-endian computer, and programs that run on Njord usually cannot read little-endian binary data, as produced by programs that run on Linux PCs. However, this behaviour can be changed by setting the XLFRTEOPTS environment flag. To perform little-endian I/O on unit 2, type
$ export XLFRTEOPTS=ufmt_littleendian=2
in the shell or job script before running the program. A comma separated list of unit numbers and dash separated range of units is also accepted. To perform little endian I/O on units 2,5 and 10,11,12,...,20 the assignment should be written
$ export XLFRTEOPTS=ufmt_littleendian=2,5,10-20
Naming Scratch Files
To place all scratch files in a particular directory, set the TMPDIR environment variable to the name of the directory. The program then opens the scratch files in this directory. You might need to do this if your /tmp directory is too small to hold the scratch files.
To give a specific name to a scratch file, you must set the run-time option scratch_vars, and then set an environment variable with a name of the form XLFSCRATCH_unit for each scratch file. The association is between a unit number in the Fortran program and a path name in the file system. In this case, the location of the scratch file is not affected by the TMPDIR variable.
For example, suppose that the Fortran program contains the following statements:
OPEN (UNIT=1, STATUS='SCRATCH', &
FORM='FORMATTED', ACCESS='SEQUENTIAL', RECL=1024)
OPEN (UNIT=12, STATUS='SCRATCH', &
FORM='UNFORMATTED', ACCESS='DIRECT', RECL=131072)
OPEN (UNIT=123, STATUS='SCRATCH', &
FORM='UNFORMATTED', ACCESS='SEQUENTIAL', RECL=997)
Then your environment flags could look like this:
$ export XLFRTEOPTS="scratch_vars=yes" # Turn on scratch file naming.
$ export XLFSCRATCH_1="/tmp/molecules.dat" # Use this named file.
$ export XLFSCRATCH_12="../data/scratch" # Relative to current directory.
$ export XLFSCRATCH_123="$HOME/data" # Somewhere besides /tmp.
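Both the endianness setting and scratch_vars live in the same XLFRTEOPTS variable, so a second plain export replaces the first. To my knowledge XL Fortran accepts several run-time options in one value, separated by colons. The sketch below builds such a value; add_xlf_opt is a hypothetical helper, not an IBM tool.

```shell
#!/bin/sh
# add_xlf_opt OPTION=VALUE   (hypothetical helper)
# Appends one run-time option to XLFRTEOPTS, using ":" as the separator,
# instead of overwriting what is already set.
add_xlf_opt() {
    if [ -n "${XLFRTEOPTS:-}" ]; then
        XLFRTEOPTS="${XLFRTEOPTS}:$1"
    else
        XLFRTEOPTS="$1"
    fi
    export XLFRTEOPTS
}

unset XLFRTEOPTS
add_xlf_opt "ufmt_littleendian=2,5,10-20"
add_xlf_opt "scratch_vars=yes"
echo "$XLFRTEOPTS"
```

Writing the combined value by hand works just as well; the helper only guards against losing an earlier option in a longer job script.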

