Stallo User Guide (old)

Welcome to the User Guide for Stallo.

Running jobs on the system

Logging on the compute nodes

Information on how to log in on a compute node.

Sometimes you may want to log on to a compute node (for instance to check output files on the local work area on the node). This is also done using SSH. From stallo.uit.no you can log in to compute-x-y the following way:

ssh -Y compute-x-y		(for instance: ssh compute-5-8)

or, for short:

ssh -Y cx-y		(for instance: ssh c5-8)

If you don't need display forwarding you can omit the "-Y" option above.


If, for instance, you want to run a program interactively on a compute node with display forwarding to your desktop, you should instead do something like this:

  1. first log in to Stallo with display forwarding,
  2. then reserve a node, with display forwarding, through the queuing system.
Below is an example of how you can do this:
ssh -Y stallo.uit.no                       1) Log in on Stallo with display forwarding.
qsub -lnodes=1,walltime=1:0:0 -I -X        2) Reserve and log in on a compute node with display forwarding.
This example assumes that you are running an X server on your local desktop, which should be available for most users running Linux, Unix and Mac OS X. If you are using Windows you must install an X server on your local PC.

Transferring data to/from stallo using ftp.

The ssh protocol (or rather the openssh implementation) has some limitations that become noticeable on long-haul data transfers over high-speed networks like the one we have between the sites in Norway. The ftp protocol does not have these limitations and gives superior performance (about 10 times that of scp/sftp) when moving data between stallo and the other sites in Norway.

The basics

You need an ftp client that supports encrypted authentication.

Please note that only the authentication is encrypted. The data you copy will flow unencrypted over the network, so do not copy any sensitive information using ftp.

Also note that ftp access is only available from the university networks in Norway.

Here is a list of ftp clients that are reported to work with encrypted authentication:

Name        OS                   Type        Supports encryption
lftp        Linux/Unix           CLI         Yes
gftp        Linux/Unix           Graphical   Yes
kasablanca  Linux/Unix           Graphical   Yes
FileZilla   Windows/MacOS/Unix   Graphical   Yes

CLI: Command Line Interface

Clients that most probably will not work: the standard ftp client on your system (the one you get when you use the command ftp). ncftp also seems to have problems.

How to connect

The hostname of the ftp server is stallo-wgw.uit.no (this will change to stallo-ftp.uit.no soon).

Example using lftp on linux:

> lftp userA@stallo-wgw.uit.no
lftp userA@stallo-wgw.uit.no:~> ls
    ........ file listing .........
lftp userA@stallo-wgw.uit.no:~> get a-file-on-the-system
84291584 bytes transferred in 3 seconds (28.98M/s)

Problems

We seem to have some problems with the openssl library that takes care of the encryption. Newer versions seem to work better, but we cannot change the library without recompiling a lot of other software, so we have to live with it until we upgrade stallo this fall.

The problem gives the following error message when transferring a file using lftp:

lftp userA@stallo-wgw.uit.no:~> get filename
get: Fatal error: SSL_read: wrong version number
lftp userA@stallo-wgw.uit.no:~> get filename
84291584 bytes transferred in 3 seconds (29.00M/s)

As the transcript shows, simply retrying the transfer appears to fix the problem.
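Since a plain retry seems to be enough, the retry can be automated. The following is a minimal sketch, not a tested recipe for Stallo: the `retry` function is a hypothetical helper, and the commented usage line assumes that lftp exits with a non-zero status when a transfer fails.

```shell
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
retry() {
  local max=$1 attempt=1
  shift
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1   # short pause before the next attempt
  done
}

# Hypothetical usage; user name and file name are placeholders:
# retry 3 lftp -c "open -u userA stallo-wgw.uit.no; get a-file-on-the-system"
```

Keeping the number of attempts small avoids hammering the server if the failure turns out to be something other than the SSL error.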

The PBS queuing system and job submission

About the queuing system and job submission on Stallo.

To learn about the PBS queuing system, PBS commands and job submission, check out this Metasenter page.


Sample job script(s) for Stallo.

  • General job script
  • ADF job script (not available yet)
  • Amber job script (not available yet)
  • Gaussian 03 job script (not available yet)

Job script example (Stallo)

This is an example of how a job script could be built on Stallo.

The script is available as a text file you can download. You have to edit it to fit your own needs.

NB! If you are using Windows: be careful to use an editor that does not leave any "garbage" (often invisible in the editor itself) in the file!

Prioritizing of jobs and resource limits.

How is the priority of the jobs calculated, and how many resources can a user expect to be allowed to allocate at once?

April 30th 2010: changes to the scheduling policies. See the section on Job to node distribution.

Priority

The scheduler is set up to

  1. prioritize large jobs, that is, jobs that request a large number of cpus.
  2. prioritize short jobs. The priority is calculated as proportional to the expansion factor: (queuetime+walltime)/walltime.
  3. use fairshare, so a user with a lot of jobs running will get a decreased priority compared to other users.
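To see why the expansion factor favours short jobs, it helps to compute it for two jobs that have waited equally long. The sketch below only evaluates the formula from item 2; the function name is hypothetical, and the weights the scheduler actually applies to this factor are not given here.

```shell
# expansion_factor QUEUETIME WALLTIME (both in hours)
# Prints (queuetime + walltime) / walltime with two decimals.
expansion_factor() {
  awk -v q="$1" -v w="$2" 'BEGIN { printf "%.2f\n", (q + w) / w }'
}

expansion_factor 10 2     # 2 hour job that has waited 10 hours
expansion_factor 10 100   # 100 hour job that has waited 10 hours
```

The 2 hour job gets a factor of 6.00 and the 100 hour job a factor of 1.10, so for the same queue time the short job's priority grows much faster.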

Resource Limits

No user will be allowed to have more than 168 000 cpu-hours allocated for running jobs at any time. This means that a user at most can allocate 1000 cpus for a week for concurrently running jobs (or 500 cpus for two weeks or 2000 cpus for half a week).
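The arithmetic behind these combinations is simply cpus times walltime in hours. A small sketch, with hypothetical function names, that checks a single request against the limit quoted above:

```shell
MAX_CPU_HOURS=168000   # limit quoted in the text above

# cpu_hours CPUS WALLTIME_HOURS: cpu-hours a request would allocate.
cpu_hours() { echo $(( $1 * $2 )); }

# within_limit CPUS WALLTIME_HOURS: succeeds if the request alone fits the limit.
within_limit() { [ "$(cpu_hours "$1" "$2")" -le "$MAX_CPU_HOURS" ]; }

cpu_hours 1000 168   # 1000 cpus for one week
cpu_hours 500 336    # 500 cpus for two weeks
cpu_hours 2000 84    # 2000 cpus for half a week
```

All three requests come to 168000 cpu-hours. Note that the limit applies to the sum over all of a user's running jobs, which this sketch does not track.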

No single user will be allowed to have more than 200 jobs running at any time. (You may submit more, but no more than 200 of them will run at the same time.)

Users can apply for exceptions to these rules by contacting support-uit@notur.no.

Job to node distribution

Due to a large increase in demand from users we have made some changes to the job to compute node mappings. Up until April 2010 we have been running in a free for all fashion with very liberal policies as to which nodes a job would be mapped on.

The stallo architecture

Before we dive into the details we need to say a few things about the stallo architecture.

  • The Stallo cluster has 704 compute nodes with 8 cpu-cores each, totalling 5632 cpu-cores (hereafter denoted as cpus).
  • The Stallo cluster has two different memory configurations: 654 nodes have 16GB memory and 50 nodes have 32GB memory.
  • The Stallo cluster has two different networks, infiniband and gigabit ethernet. Infiniband is a high-speed, very expensive network meant to serve parallel applications that need fast communication between the processes. 448 nodes are connected to the infiniband network and 256 nodes are connected only to the gigabit ethernet network. (Infiniband nodes also have gigabit ethernet, and highmem nodes are also infiniband nodes.)

See here for more details.

Job to node mapping

The basic philosophy for the mapping is to run the job on the nodes best suited for the task.

  • Short jobs are allowed to run anywhere. Short jobs are defined as jobs with walltime < 48 hours.
  • Single node jobs, that is, jobs requesting less than 8 cpus, are mapped onto the gigabit ethernet nodes only.
  • Multi node jobs are allowed to run on both gigabit ethernet nodes and infiniband nodes. It is still up to the user to specify whether the job needs infiniband or not, using the :ib flag. The :gige flag will work as before too.
  • Large memory jobs with walltime > 48 hours should run in the highmem queue. Access to this queue is restricted, so the user will need to notify the support team if access to these nodes is needed. Memory usage in this queue will be monitored to check for misuse.

Examples.

Short jobs:

qsub -lnodes=4,walltime=48:00:00 ........

Will be allowed to run anywhere.

Infiniband parallel job:

qsub -lnodes=8:ppn=8:ib,walltime=240:00:00 .........

Will be mapped onto the infiniband nodes.

Ethernet parallel job:

qsub -lnodes=8:ppn=8:gige,walltime=240:00:00 .........

Will be run on the ethernet only nodes.

Single node jobs:

qsub -lnodes=1,walltime=240:00:00 .........
qsub -lnodes=1:ppn=8,walltime=240:00:00 .........

Will be mapped onto gigabit ethernet nodes. This is new behaviour; earlier such jobs would be mapped onto any free node. Also note that trying to run single node jobs on infiniband nodes will fail:

qsub -lnodes=1:ib,walltime=240:00:00 .........

This job will never be allowed to start.

Highmem jobs:

qsub -q highmem -lnodes=4,pmem=14gb,walltime=240:00:00 ........

This job will run on the highmem nodes if the user is granted access by the administrators. Otherwise it will never start. Note that jobs that try to use both highmem and gigabit ethernet nodes will never start:

qsub -q highmem -lnodes=4:gige,pmem=14gb,walltime=240:00:00 ........

This job will never start.

Large file considerations.

Some special care needs to be taken if you want to create very large files on the system. By large we mean file sizes over 200GB or so.

Storage architecture.

The /global/work file system (and /global/home too) is served by a number of storage arrays that each contain smaller pieces of the file system. The size of each chunk is 2TB (2000GB). In the default setup each file is contained within one storage array, so the default file size limit is 2TB. In practice the limit is considerably smaller, as each array contains a lot of files.

Increasing the file size limitation by striping.

Each user can change the default placement of the files they create by striping files over several storage arrays. This is done with the following command:

lfs setstripe . 0 -1 4

After this has been done, all new files created in the current directory will be spread over 4 storage arrays, each holding 1/4 of the file. In this positional form of the command, the arguments after the directory are the stripe size (0 means the default), the index of the first storage array (-1 lets the system choose) and the stripe count. The file can be accessed as normal; no special action needs to be taken. Striping set this way is defined on a per-directory basis, so different directories can have different stripe setups in the same file system. New subdirectories inherit the striping from their parent at the time of creation.

Stripe count recommendation.

We recommend that users set the stripe count so that each chunk will be approx. 200-300GB, for example:

File size     Stripe count   Command
500-1000GB    4              lfs setstripe . 0 -1 4
1TB-2TB       8              lfs setstripe . 0 -1 8

Changing stripe count for files.

Once a file is created the stripe count cannot be changed. This is because the physical bits of the data are already written to a certain subset of the storage arrays. However, the following trick can be used after one has changed the striping as described above:

# mv file file.bu
# cp -a file.bu file
# rm file.bu

Copying creates a new file, which picks up the new striping. The -a flag ensures that all permissions etc. are preserved.
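The three commands can be wrapped in a small function that stops at the first failure, so the backup copy is only removed once the new copy exists. This is a sketch under that assumption; the function name `restripe` is hypothetical, and it assumes the directory's striping has already been changed with lfs setstripe.

```shell
# restripe FILE: rewrite FILE so it is created anew with the directory's
# current stripe settings. Stops at the first step that fails.
restripe() {
  local file=$1
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  if [ -e "$file.bu" ]; then
    echo "backup already exists: $file.bu" >&2
    return 1
  fi
  mv -- "$file" "$file.bu" &&      # keep the old data as a backup
    cp -a -- "$file.bu" "$file" && # new copy, permissions preserved
    rm -- "$file.bu"               # remove the backup only after a successful copy
}

# Hypothetical usage:
# restripe bigfile.dat
```

Note that the copy temporarily needs free space for a second copy of the file.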

Running many short tasks.

Recommendations on how to run a lot of short tasks on the system. The overhead in job start and cleanup makes it impractical to run thousands of short tasks as individual jobs on Stallo.

Background

The queueing setup on stallo, or rather the accounting system, generates an overhead of about 1 second at the start and at the finish of each job. This overhead is insignificant when running large parallel jobs, but creates scaling issues when running a massive number of shorter jobs. One can consider a collection of independent tasks as one large parallel job, in which the aforementioned overhead becomes the serial, unparallelizable part of the job. This is because the queuing system can only start and account for one job at a time. This scaling problem is described by Amdahl's law.

Without going into any more details, let's look at the solution.

Running tasks in parallel within one job.

By using some shell trickery one can spawn and load-balance multiple independent tasks running in parallel within one node. Just run the tasks in the background and poll until some task has finished before you spawn the next:

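# Assumes $tasks (list of task arguments) and $maxpartasks (maximum
# number of concurrent tasks) are set earlier in the job script.
for t in $tasks; do
  ./dowork.sh "$t" &                   # start the task in the background
  activetasks=$(jobs -r | wc -l)       # count running background tasks (bash)
  # Poll once a second until a slot is free.
  while [ "$activetasks" -ge "$maxpartasks" ]; do
    sleep 1
    activetasks=$(jobs -r | wc -l)
  done
done
wait                                   # wait for the remaining tasks to finish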

Complete examples with descriptive comments can be found here: partasks.sh, dowork.sh.
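As an alternative to the polling loop, the same load balancing can be sketched with xargs and its -P option. This is a different technique from the one in partasks.sh, and -P is an extension found in GNU and BSD xargs, so check that it is available before relying on it. The task list and the limit below are hypothetical values.

```shell
# Hypothetical task list and concurrency limit; adjust to your job.
tasks="1 2 3 4 5 6 7 8"
maxpartasks=4

# run_tasks: feed one task per line to xargs.
#   -n 1  pass one task argument per invocation
#   -P N  keep at most N invocations running at a time
# The 'sh -c ...' command is a stand-in so the sketch runs on its own;
# replace it with ./dowork.sh in a real job.
run_tasks() {
  printf '%s\n' $tasks |
    xargs -n 1 -P "$maxpartasks" sh -c 'echo "task $1 done"' _
}

run_tasks
```

xargs starts a new task as soon as one finishes, so no sleep-based polling is needed. The order of the output lines is not guaranteed.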

Job accounting.

We charge for used resources, both cpu and memory.

CPU quota.

To use the batch system you must have a cpu quota, either local or national. For every job you submit we check that you have sufficient quota to run it, and you will get a warning if you do not have enough cpu-hours. The job will still be submitted to the queue, but will not start until you have enough cpu-hours to run it.

Resource charging.

The accounting system charges for used processor equivalents (PE) times used walltime. This means that if you ask for more than 2GB of memory per cpu, you will be charged for more cpus than you actually use.

Processor equivalents.

The best way to describe PE is perhaps by example. Assume that you have a node with 8 cpu-cores and 16GB memory (as most nodes on stallo have):

If you ask for 2GB memory per core or less, then PE will equal the cpu count.

If you ask for 4GB memory per core, then PE will be twice the cpu count.

If you ask for 16GB memory for one cpu, then PE=8, as only that one cpu can run on the compute node.
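The three cases are consistent with counting one PE per cpu or per 2GB of requested memory, whichever is larger. The sketch below encodes that reading. It is inferred from the examples, not taken from the accounting system itself, and the function names are hypothetical.

```shell
MEM_PER_PE_GB=2   # 16GB / 8 cores on a standard node

# processor_equivalents CPUS MEM_PER_CPU_GB (whole GB)
# Prints the larger of the cpu count and the requested memory in 2GB units.
processor_equivalents() {
  local cpus=$1 mem_per_cpu=$2
  local total_mem=$(( cpus * mem_per_cpu ))
  # Number of 2GB units needed, rounded up.
  local mem_pe=$(( (total_mem + MEM_PER_PE_GB - 1) / MEM_PER_PE_GB ))
  if [ "$mem_pe" -gt "$cpus" ]; then echo "$mem_pe"; else echo "$cpus"; fi
}

# charged_cpu_hours CPUS MEM_PER_CPU_GB WALLTIME_HOURS: PE times walltime.
charged_cpu_hours() {
  echo $(( $(processor_equivalents "$1" "$2") * $3 ))
}

processor_equivalents 8 1    # 8 cpus, 1GB each
processor_equivalents 8 4    # 8 cpus, 4GB each
processor_equivalents 1 16   # 1 cpu, 16GB
charged_cpu_hours 8 4 10     # 8 cpus, 4GB each, 10 hours
```

The calls print 8, 16 and 8 PE, and a charge of 160 cpu-hours for the last request.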

Express queue for testing job scripts and interactive jobs.

A high priority queue called express can be used for testing and interactive jobs.

The express queue

By submitting a job to the express queue you can get higher throughput for testing and a shorter start-up time for interactive jobs. Just use the -q express flag to submit to this queue:

qsub -q express jobscript.sh

or for an interactive job:

qsub -q express -I

This will give you faster access if you have special needs during development, testing of job script logic or interactive use.

Priority and limitations

Jobs in the express queue will get higher priority than any other jobs in the system and will thus have a shorter queue delay than regular jobs. To prevent misuse the express queue has the following limitations:

  • Only one running job per user.
  • Maximum 8 hours walltime.
  • Only one job queued at any time. Note that this applies to the whole queue. This is to prevent express jobs from delaying large regular jobs.

So it is more or less pointless to try to use the express queue to sneak regular production jobs past the other regular jobs. Submitting a large number of jobs to the express queue will most probably decrease the overall throughput of your jobs. Also note that large jobs get prioritized anyway, so they will most probably not benefit from using the express queue.

How can I submit many jobs in one command

Job arrays.

Use job arrays:

qsub -t 1-16 Myjob

This submits Myjob to the queue 16 times. The instances can be distinguished by the value of the environment variable

$PBS_ARRAYID
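
Inside the job script, this variable can be used to select the work for each instance. The following is a minimal sketch of what Myjob might contain; the input file naming scheme and the `input_name` helper are hypothetical.

```shell
#!/bin/bash
# Hypothetical contents of Myjob; file names are illustrative only.

# input_name ID: name of the input file for array instance ID.
input_name() { echo "input_$1.dat"; }

# PBS_ARRAYID is set by the queuing system for each instance.
# Default to 1 so the script can be tried outside the queue.
task_id=${PBS_ARRAYID:-1}

echo "instance $task_id would process $(input_name "$task_id")"
```

With qsub -t 1-16, the 16 instances would then work on input_1.dat to input_16.dat.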
by toj000 — last modified Jan 15, 2009 11:34 AM Notur