Stallo User Guide (old)
Welcome to the User Guide for Stallo.
Running jobs on the system
| Name | OS | Type | Supports encryption |
|---|---|---|---|
| lftp | linux/unix | CLI | YES |
| gftp | linux/unix | Graphical | YES |
| kasablanca | linux/unix | Graphical | YES |
| FileZilla | Windows/MacOs/UNIX | Graphical | YES |
CLI: Command Line Interface
Clients that most probably will not work: the standard ftp client on your system (the one you get when you run the command ftp). ncftp also seems to have problems.
The hostname of the ftp server is stallo-wgw.uit.no (this will change to stallo-ftp.uit.no soon).
Example using lftp on linux:
> lftp userA@stallo-wgw.uit.no
lftp userA@stallo-wgw.uit.no:~> ls
........ file listing .........
lftp userA@stallo-wgw.uit.no:~> get a-file-on-the-system
84291584 bytes transferred in 3 seconds (28.98M/s)
We seem to have some problems with the openssl library that takes care of the encryption. Newer versions seem to work better, but we cannot change the library without recompiling a lot of other software, so we have to live with it until we upgrade Stallo this fall.
The problem gives the following error message when transferring a file using lftp:
lftp userA@stallo-wgw.uit.no:~> get filename
get: Fatal error: SSL_read: wrong version number
lftp userA@stallo-wgw.uit.no:~> get filename
84291584 bytes transferred in 3 seconds (29.00M/s)
As the session shows, simply retrying the transfer seems to fix the problem.
About the queuing system and job submission on Snowstorm.
To learn about the PBS queuing system, PBS commands and job submission, check out this Metasenter page.
This is an example of how a job script could be built on Stallo.
The script is available as a text file you can download.
You have to edit it to fit your own needs.
NB! If you are using Windows: be careful, use an editor that doesn't leave any "garbage" (often invisible in the editor itself) in the file!
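For orientation, here is a minimal sketch of what such a job script might look like. It is not the downloadable script mentioned above; the job name, the resource request and the program name my_program are placeholders chosen for illustration, and the fallbacks only exist so the script can be tried outside the queue.

```shell
#!/bin/bash
# Resource requests read by qsub (values are illustrative assumptions)
#PBS -N example_job
#PBS -lnodes=2:ppn=8
#PBS -lwalltime=01:00:00

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory when run by hand.
workdir=${PBS_O_WORKDIR:-.}
cd "$workdir" || exit 1

# PBS_NODEFILE lists one line per allocated core; assume 1 core outside PBS.
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    ncores=$(wc -l < "$PBS_NODEFILE")
else
    ncores=1
fi

echo "Running in $workdir on $ncores core(s)"
# ./my_program    # hypothetical: start your own program here
```

Submit it with qsub followed by the script name, as in the qsub examples elsewhere in this guide.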
How is the priority of the jobs calculated and how much resources can a user expect to be allowed to allocate at once?
April 30th 2010, new changes to the scheduling policies. See the section on Job to node distribution.
The scheduler is set up to
No user will be allowed to have more than 168 000 cpu-hours allocated for running jobs at any time. This means that a user at most can allocate 1000 cpus for a week for concurrently running jobs (or 500 cpus for two weeks or 2000 cpus for half a week).
No single user will be allowed to have more than 200 jobs running at any time. (You may well submit more, but you cannot have more than 200 running at the same time.)
Users can apply for exceptions to these rules by contacting support-uit@notur.no.
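The cpu-hour limit is simple arithmetic: the number of cpus times the requested walltime in hours. The sketch below checks a single allocation against the 168 000 cpu-hour limit; the function name within_limit is made up for this example, and in practice the limit applies to the sum over all of a user's running jobs.

```shell
#!/bin/bash
# Limit on cpu-hours allocated to running jobs, taken from the text above
CPU_HOUR_LIMIT=168000

# Hypothetical helper: succeeds if cpus * hours fits within the limit
within_limit() {
    local cpus=$1 hours=$2
    [ $(( cpus * hours )) -le "$CPU_HOUR_LIMIT" ]
}

# 1000 cpus for one week (168 hours) is exactly at the limit
if within_limit 1000 168; then echo "1000 cpus x 168 h: within limit"; fi
# 2000 cpus for one week is twice the limit
if ! within_limit 2000 168; then echo "2000 cpus x 168 h: exceeds limit"; fi
```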
Due to a large increase in demand from users we have made some changes to the job to compute node mappings. Up until April 2010 we had been running in a free-for-all fashion, with very liberal policies as to which nodes a job would be mapped onto.
Before we dive into the detail we need to say a few things about the stallo architecture.
See here for more details.
The basic philosophy for the mapping is to run the job on the nodes best suited for the task.
Short jobs:
qsub -lnodes=4,walltime=48:00:00 ........
Will be allowed to run anywhere.
Infiniband parallel job:
qsub -lnodes=8:ppn=8:ib,walltime=240:00:00 .........
Will be mapped onto the infiniband nodes.
Ethernet parallel job:
qsub -lnodes=8:ppn=8:gige,walltime=240:00:00 .........
Will be run on the ethernet only nodes.
Single node jobs:
qsub -lnodes=1,walltime=240:00:00 .........
qsub -lnodes=1:ppn=8,walltime=240:00:00 .........
Will be mapped onto gigabit ethernet nodes. This is new behaviour; earlier such jobs would be mapped onto any free node. Also note that trying to run single node jobs on infiniband nodes will fail:
qsub -lnodes=1:ib,walltime=240:00:00 .........
This job will never be allowed to start.
Highmem jobs:
qsub -q highmem -lnodes=4,pmem=14gb,walltime=240:00:00 ........
This job will run on the highmem nodes if the user is granted access by the administrators. Otherwise it will never start. Note that jobs that try to use both highmem and gigabit ethernet nodes will never start:
qsub -q highmem -lnodes=4:gige,pmem=14gb,walltime=240:00:00 ........
This job will never start.
Some special care needs to be taken if you want to create very large files on the system. By large we mean file sizes over 200GB or so.
The /global/work file system (and /global/home too) is served by a number of storage arrays that each contain smaller pieces of the file system; the size of each piece is 2TB (2000GB). In the default setup each file is contained within one storage array, so the default file size limit is 2TB. In practice the limit is considerably smaller, as each array contains a lot of files.
Each user can change the default placement of the files they create by striping files over several storage arrays. This is done with the following command:
lfs setstripe . 0 -1 4
After this has been done, all new files created in the current directory will be spread over 4 storage arrays, each holding 1/4 of the file. The file can be accessed as normal; no special action needs to be taken. When the striping is set this way it is defined on a per-directory basis, so different directories can have different stripe setups in the same file system. New subdirectories inherit the striping from their parent at the time of creation.
We recommend that users set the stripe count so that each chunk will be approx. 200-300GB, for example:
| File size | Stripe count | Command |
|---|---|---|
| 500-1000GB | 4 | lfs setstripe . 0 -1 4 |
| 1TB - 2TB | 8 | lfs setstripe . 0 -1 8 |
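To see how a stripe count relates to the size of the chunk stored on each array, divide the file size by the stripe count. The helper below is only an illustration of that arithmetic; the name chunk_size_gb is hypothetical and it assumes the file is spread evenly over the arrays.

```shell
#!/bin/bash
# Hypothetical helper: approximate chunk size in GB per storage array
# for a file of file_gb GB striped evenly over the given number of arrays.
chunk_size_gb() {
    local file_gb=$1 stripes=$2
    echo $(( file_gb / stripes ))
}

chunk_size_gb 1000 4    # a 1000GB file over 4 arrays
chunk_size_gb 2000 8    # a 2000GB file over 8 arrays
```

Both calls give 250GB per array, which falls inside the recommended range.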
Once a file is created the stripe count cannot be changed. This is because the physical bits of the data are already written to a certain subset of the storage arrays. However, the following trick can be used after one has changed the striping as described above:
mv file file.bu
cp -a file.bu file
rm file.bu
The use of -a flag ensures that all permissions etc are preserved.
Recommendations on how to run a lot of short tasks on the system. The overhead in the job start and cleanup makes it impractical to run thousands of short tasks as individual jobs on Stallo.
The queueing setup on Stallo, or rather the accounting system, generates an overhead of about 1 second at the start and at the finish of each job. This overhead is insignificant when running large parallel jobs, but creates scaling issues when running a massive number of shorter jobs. One can consider a collection of independent tasks as one large parallel job, and the aforementioned overhead becomes the serial or unparallelizable part of the job. This is because the queuing system can only start and account one job at a time. This scaling problem is described by Amdahl's law.
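As a rough illustration of the scale of the problem, the sketch below multiplies the number of jobs by the approximate overhead of 2 seconds per job (1 second at each end). The function name and the job count of 10000 are assumptions made for this example only.

```shell
#!/bin/bash
# Approximate overhead per job in seconds (about 1 s at start, 1 s at finish)
OVERHEAD_PER_JOB=2

# Hypothetical helper: total serial overhead in seconds for a number of jobs
serial_overhead_seconds() {
    echo $(( $1 * OVERHEAD_PER_JOB ))
}

serial_overhead_seconds 10000    # overhead for 10000 individual jobs
```

For 10000 jobs this comes to 20000 seconds, roughly five and a half hours that cannot be shortened by adding more nodes.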
Without going into any more details, let's look at the solution.
By using some shell trickery one can spawn and load-balance multiple independent tasks running in parallel within one node: just background the tasks, and poll until some task has finished before you spawn the next one:
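# $tasks holds the list of tasks, $maxpartasks the number allowed to run at once
for t in $tasks; do
    ./dowork.sh "$t" &                # start the task in the background
    activetasks=$(jobs | wc -l)       # count the background tasks
    while [ $activetasks -ge $maxpartasks ]; do
        sleep 1                       # poll until a slot becomes free
        activetasks=$(jobs | wc -l)
    done
done
wait                                  # wait for the remaining tasks to finish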
Complete examples with descriptive comments can be found here: partasks.sh, dowork.sh.
We charge for used resources, both cpu and memory.
To use the batch system you have to have a cpu quota, either local or national. For every job you submit we check that you have sufficient quota to run it, and you will get a warning if you do not have sufficient cpu-hours to run the job. The job will be submitted to the queue, but will not start until you have enough cpu-hours to run it.
The accounting system charges for used processor equivalents (PE) times used walltime, so if you ask for more than 2GB of memory per cpu you will be charged for more than the actual cpus you use.
The best way to describe PE is maybe by example: Assume that you have a node with 8 cpu-cores and 16 GB memory (as most nodes on stallo are):
If you ask for less than 2GB memory per core then PE will equal the cpu count.
If you ask for 4GB memory per core then PE will be twice the cpu-count.
If you ask for 16GB memory then PE=8 as you only can run one cpu per compute node.
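The three cases above are consistent with taking PE as the larger of the cpu count and the total requested memory divided by 2GB. This formula is inferred from the examples and is not necessarily how the accounting system computes it; the function name pe is made up, and the sketch handles whole gigabytes only.

```shell
#!/bin/bash
# Assumed formula: PE = max(cores, ceil(total memory in GB / 2)),
# where 2GB per core comes from a node with 8 cores and 16GB memory.
pe() {
    local cores=$1 mem_per_core_gb=$2
    local mem_pe=$(( (cores * mem_per_core_gb + 1) / 2 ))   # rounded up
    if [ "$mem_pe" -gt "$cores" ]; then
        echo "$mem_pe"
    else
        echo "$cores"
    fi
}

pe 8 1     # 8 cores, 1GB per core
pe 8 4     # 8 cores, 4GB per core
pe 1 16    # 1 core, 16GB
```

The three calls reproduce the cases in the text: 8, 16 and 8 processor equivalents.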
A high priority queue called express can be used for testing and interactive jobs.
By submitting a job to the express queue you can get higher throughput for testing and shorter start up time for interactive jobs. Just use the -q express flag to submit to this queue:
qsub -q express jobscript.sh
or for an interactive job:
qsub -q express -I
This will give you faster access if you have special needs during development, testing of job script logic or interactive use.
Jobs in the express queue will get higher priority than any other jobs in the system and will thus have a shorter queue delay than regular jobs. To prevent misuse the express queue has the following limitations:
So, it is more or less pointless to try to use the express queue to sneak regular production jobs past the other regular jobs. Submitting a large number of jobs to the express queue will most probably decrease the overall throughput of your jobs. Also note that large jobs get prioritized anyway, so they will most probably not benefit from using the express queue.
job arrays
use job arrays:
qsub -t 1-16 Myjob
This will send Myjob into the queue 16 times. The instances can be distinguished by the value of the environment variable
$PBS_ARRAYID
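A common use of this variable is to let each instance pick its own input. The sketch below shows the idea inside a job script; the file naming scheme input_N.dat is an assumption for illustration, and the fallback value of 1 only exists so the script can be tried outside the queue.

```shell
#!/bin/bash
# PBS_ARRAYID is set by the queue system for each instance of the array;
# fall back to 1 when the script is run by hand.
task_id=${PBS_ARRAYID:-1}

# Hypothetical naming scheme: input_1.dat ... input_16.dat
input_file="input_${task_id}.dat"

echo "Array task ${task_id} would process ${input_file}"
```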
