80 Scheduling policy on the machine

Priority

The scheduler is set up to

  1. prioritize large jobs, that is, jobs that request a large number of cpus.
  2. prioritize short jobs. The priority is proportional to the expansion factor: (queuetime+walltime)/walltime.
  3. use fairshare, so users with a lot of jobs running will get a decreased priority compared to other users.
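
As an illustration of the expansion factor in point 2, the following minimal sketch computes it for two hypothetical jobs that have waited equally long in the queue. The function name and the sample numbers are assumptions made for this example; the weights the scheduler actually applies to each priority component are not stated here.

```python
def expansion_factor(queue_hours: float, walltime_hours: float) -> float:
    """Expansion factor as given in the text: (queuetime + walltime) / walltime."""
    if walltime_hours <= 0:
        raise ValueError("walltime must be positive")
    return (queue_hours + walltime_hours) / walltime_hours


# Two hypothetical jobs that have both waited 12 hours in the queue.
short_job = expansion_factor(queue_hours=12, walltime_hours=4)    # (12 + 4) / 4
long_job = expansion_factor(queue_hours=12, walltime_hours=240)   # (12 + 240) / 240

# The job with the shorter requested walltime gets the larger factor.
print(short_job, long_job)
```

For the same waiting time, the factor grows faster the shorter the requested walltime, which is why short jobs gain priority more quickly.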

Job to node distribution

Due to a large increase in demand from users, we have changed how jobs are mapped to compute nodes. Up until April 2010 we ran in a free-for-all fashion, with very liberal policies as to which nodes a job could be mapped onto.

The Stallo architecture

Before we dive into the details, we need to say a few things about the Stallo architecture.

  • The Stallo cluster has 704 compute nodes with 8 cpu-cores each, totalling 5632 cpu-cores (hereafter denoted as cpus).
  • The Stallo cluster has two different memory configurations: 654 nodes have 16GB memory and 50 nodes have 32GB memory.
  • The Stallo cluster has two different networks, infiniband and gigabit ethernet. Infiniband is a high-speed and very expensive network, meant to serve parallel applications that need fast communication between processes. 448 nodes are connected to the infiniband network and 256 nodes are connected only to the gigabit ethernet network. (Infiniband nodes also have gigabit ethernet, and the highmem nodes are also infiniband nodes.)

See here for more details.
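
The node and cpu counts above can be cross-checked with a few lines of arithmetic. This is only a sketch that restates the figures in the list; the variable names are chosen for this example.

```python
CORES_PER_NODE = 8
TOTAL_NODES = 704

nodes_by_memory = {"16GB": 654, "32GB": 50}
nodes_by_network = {"infiniband": 448, "gigabit_ethernet_only": 256}

# Both breakdowns must add up to the full cluster.
assert sum(nodes_by_memory.values()) == TOTAL_NODES
assert sum(nodes_by_network.values()) == TOTAL_NODES

total_cpus = TOTAL_NODES * CORES_PER_NODE
cpus_by_network = {name: count * CORES_PER_NODE
                   for name, count in nodes_by_network.items()}

print(total_cpus)        # 5632
print(cpus_by_network)   # {'infiniband': 3584, 'gigabit_ethernet_only': 2048}
```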

Job to node mapping

The basic philosophy for the mapping is to run the job on the nodes best suited for the task.

  • Short jobs are allowed to run anywhere. Short jobs are defined as jobs with a walltime of up to 48 hours.
  • Single-node jobs, that is, jobs requesting no more than the 8 cpus of one node, are mapped onto the gigabit ethernet nodes only.
  • Multi-node jobs are allowed to run on both gigabit ethernet nodes and infiniband nodes. It is still up to the user to specify whether the job needs infiniband, using the :ib flag. The :gige flag will work as before too.
  • Large memory jobs with walltime > 48 hours should run in the highmem queue. Access to this queue is restricted, so the user will need to notify the support team if access to these nodes is needed. Memory usage in this queue will be monitored to check for misuse.
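
The mapping rules above can be sketched as a small decision function. This is an illustration only, not the scheduler's configuration: the function name, its arguments and the returned labels are hypothetical. It assumes that a job with walltime of exactly 48 hours counts as short (as the qsub example for short jobs suggests), that an explicit :ib or :gige flag on a short job is honoured, and that highmem access has already been granted.

```python
from typing import Optional

SHORT_LIMIT_HOURS = 48  # assumption: walltime=48:00:00 still counts as short


def allowed_nodes(nodes: int, walltime_hours: float,
                  network: Optional[str] = None,
                  queue: Optional[str] = None) -> str:
    """Return the node class a request may run on, or 'never starts'.

    network is None, 'ib' or 'gige' (the :ib / :gige flags);
    queue is None or 'highmem'. Access to highmem is assumed granted.
    """
    if queue == "highmem":
        # highmem nodes are infiniband nodes, so :gige can never be satisfied.
        return "never starts" if network == "gige" else "highmem"
    if walltime_hours <= SHORT_LIMIT_HOURS:
        # Short jobs may run anywhere; honouring an explicit flag is an assumption.
        return network or "any"
    if nodes == 1:
        # Long single-node jobs are restricted to the gigabit ethernet nodes.
        return "never starts" if network == "ib" else "gige"
    # Long multi-node jobs: the user's flag decides, otherwise either network.
    return network or "gige or ib"


print(allowed_nodes(nodes=1, walltime_hours=240))                # gige
print(allowed_nodes(nodes=1, walltime_hours=240, network="ib"))  # never starts
```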

Examples.

Short jobs:

qsub -lnodes=4,walltime=48:00:00 ........

Will be allowed to run anywhere.

Infiniband parallel job:

qsub -lnodes=8:ppn=8:ib,walltime=240:00:00 .........

Will be mapped onto the infiniband nodes.

Ethernet parallel job:

qsub -lnodes=8:ppn=8:gige,walltime=240:00:00 .........

Will be run on the ethernet-only nodes.

Single node jobs:

qsub -lnodes=1,walltime=240:00:00 .........
qsub -lnodes=1:ppn=8,walltime=240:00:00 .........

Will be mapped onto gigabit ethernet nodes. This is new behaviour; earlier, such jobs would be mapped onto any free node. Also note that trying to run single-node jobs on infiniband nodes will fail:

qsub -lnodes=1:ib,walltime=240:00:00 .........

This job will never be allowed to start.

Highmem jobs:

qsub -q highmem -lnodes=4,pmem=14gb,walltime=240:00:00 ........

This job will run on the highmem nodes if the user has been granted access by the administrators. Otherwise it will never start. Note that jobs that try to use both highmem and gigabit ethernet nodes will never start:

qsub -q highmem -lnodes=4:gige,pmem=14gb,walltime=240:00:00 ........

This job will never start.
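
To make the structure of the -l resource strings in the examples above explicit, here is a minimal parsing sketch. The function `parse_resource_list` is hypothetical and is not part of qsub. It assumes walltime is written as HH:MM:SS, as in the examples, and that omitted nodes and ppn values default to 1.

```python
def parse_resource_list(spec: str) -> dict:
    """Split a -l string such as 'nodes=8:ppn=8:ib,walltime=240:00:00'."""
    # Assumed defaults when a field is omitted from the request.
    request = {"nodes": 1, "ppn": 1, "flags": [], "walltime_hours": None}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        if key == "nodes":
            # nodes=<count>[:ppn=<n>][:<flag>...], e.g. 8:ppn=8:ib
            count, *properties = value.split(":")
            request["nodes"] = int(count)
            for prop in properties:
                if prop.startswith("ppn="):
                    request["ppn"] = int(prop[len("ppn="):])
                else:
                    request["flags"].append(prop)  # e.g. 'ib' or 'gige'
        elif key == "walltime":
            # Only the HH:MM:SS form used in the examples is handled.
            hours, minutes, seconds = (int(part) for part in value.split(":"))
            request["walltime_hours"] = hours + minutes / 60 + seconds / 3600
        else:
            request[key] = value  # other resources, e.g. pmem=14gb, kept as text
    return request


print(parse_resource_list("nodes=8:ppn=8:ib,walltime=240:00:00"))
```

Parsed this way, the infiniband example is a request for 8 nodes with 8 cpus each, the ib flag, and 240 hours of walltime.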

by Roy Einar Dragseth last modified Aug 26, 2010 12:00 PM Notur