PBS
From Peyton Hall Documentation
The Portable Batch System, or PBS, is a cluster management package which handles farming jobs out to execution nodes for processing. Here in Peyton, we also use the Maui scheduler, which ties in to PBS to determine what jobs should be run and when (making sure that all users get fair time for their jobs based on the job requirements and cluster availability).
Submitting jobs
Overview
There are two ways to submit jobs to PBS: using a graphical interface (xpbs) or using the command line. We'll go over the command line version for now, with other descriptions to follow later. First, you need a script to run. This is a simple shell script that contains extra information for PBS about the kind of job you're running. The first type we'll touch on is a serial job (which would be better served by Condor, but is here as an example):
    #!/bin/bash
    #PBS -q @hydra
    #PBS -m abe
    cd $PBS_O_WORKDIR
    your_command
The #PBS lines pass options to PBS. In this case, they tell it to submit to the default queue on hydra, and to mail the user when the job starts, finishes, or is aborted. The $PBS_O_WORKDIR variable is set by PBS to the directory from which you submit the job; this would likely be somewhere in /chimera, since that is mounted across all machines (so input and output files will be available to all jobs). Lastly, you run the command(s) that actually do the work of your job.
- Another good option to PBS is #PBS -j oe, which joins stdout and stderr into one output file.
All that's left is to chmod 755 the script file, and then type (from the directory where you want your job to actually run) 'qsub scriptname'. PBS will tell you the name of the job in the queue, and submit it. When resources are available, your job will execute.
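As a rough sketch of the steps above, the following writes the serial example to a file, makes it executable, and submits it. The file name serial_job.sh is made up for illustration, your_command is still a placeholder, and the qsub call is guarded so that the snippet only prints the command when it is run on a machine without PBS.

```shell
#!/bin/sh
# Hypothetical file name for the job script.
job_script="serial_job.sh"

# Write the job script. The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded,
# so it is evaluated later, when PBS runs the job.
cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -q @hydra
#PBS -m abe
#PBS -j oe
cd $PBS_O_WORKDIR
your_command
EOF

chmod 755 "$job_script"

# Submit only if qsub is available; otherwise show what would be run.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_script"
else
    echo "would run: qsub $job_script"
fi
```

Run it from the directory where you want the job to execute, since that directory becomes $PBS_O_WORKDIR.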
To use an MPI program on the cluster, you'll need to run it through the 'mpiexec' program. You simply need to tell PBS how many nodes (and processors on each node) you want. mpiexec will pull the information from PBS, including which machines to use for the job. For example:
    #!/bin/bash
    #PBS -l nodes=3:ppn=1
    /usr/local/bin/mpiexec /path/to/your/mpi/program
In this example, you ask for 3 nodes, with one processor per node.
PBS options
Note that options can be given in the script file (by starting the line with "#PBS" as shown above), or as an option on the command line to qsub.
- -j [oe|eo]
- Join output streams. 'oe' adds stderr to stdout, 'eo' adds stdout to stderr
- -m [a][b][e]
- Mail information to you. Options are 'a' to mail when the job is aborted, 'b' to mail when execution begins, and 'e' to mail when execution ends. They may be combined, as in '-m ae' to mail when the job ends or is aborted.
- -l <resource spec>
- Request resources. See below for some examples, and the man page 'man pbs_resources' for a list of all possible resources you can request and how.
node_specs are a little tricky. That's how you define (with the '-l nodes=' option) what kind of nodes you want. For example, nodes=3:ppn=1 means 3 nodes, one processor per node (the other will be made available to other jobs). Likewise, nodes=3:ppn=2 means three nodes, using both processors on each node.
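To make the arithmetic concrete, here is a small helper that works out how many processors a simple node spec asks for. The function name total_procs is invented for this example, it only understands the plain nodes=N:ppn=M form, and treating a missing ppn as 1 is an assumption.

```shell
# total_procs SPEC: print nodes * ppn for a spec such as "nodes=3:ppn=2".
# Hypothetical helper; a missing ppn is assumed to mean 1.
total_procs() (
    spec=$1
    nodes=$(printf '%s\n' "$spec" | sed -n 's/^nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$spec" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    [ -n "$nodes" ] || { echo "bad node spec: $spec" >&2; exit 1; }
    echo $(( $nodes * ${ppn:-1} ))
)

total_procs nodes=3:ppn=1   # prints 3
total_procs nodes=3:ppn=2   # prints 6
```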
Resource examples
Jobs should be submitted with certain information, so that the software knows what to expect from the job once it runs. For example, you should tell the cluster how much memory each process of the job will require, using the 'pmem' specification. -l pmem=1234MB will tell PBS and Maui that the job will use 1234MB of RAM PER PROCESS. Not only is this good practice, since it lets jobs share nodes properly (Maui will know whether two jobs it wishes to run on the same node will fit in the available memory), but it is REQUIRED in order to make sure a job will not run on a node that doesn't have enough memory. If you're running a single-process job, don't worry about the fact that this is memory "per process"; it works in just the same way.
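The "per process" part is simple multiplication, which the sketch below spells out. The helper name fits_on_node and the 4096 MB node size are assumptions made for illustration; the real decision is made by Maui and takes more into account than this.

```shell
# fits_on_node PMEM_MB PPN NODE_MB
# Succeeds if PPN processes of PMEM_MB each fit in NODE_MB of memory.
# Hypothetical helper, not part of PBS or Maui.
fits_on_node() {
    [ $(( $1 * $2 )) -le "$3" ]
}

# Two processes at pmem=1234MB need 2468 MB in total.
if fits_on_node 1234 2 4096; then
    echo "nodes=1:ppn=2,pmem=1234MB fits on a 4096 MB node"
fi

# Two processes at pmem=3000MB need 6000 MB in total.
if ! fits_on_node 3000 2 4096; then
    echo "nodes=1:ppn=2,pmem=3000MB does not fit on a 4096 MB node"
fi
```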
Another thing which should be specified when starting a job is how long it will run, in wall time (i.e., real time as opposed to CPU time). This helps with backfilling jobs (when many small jobs can run ahead of schedule, while a larger job completes and another larger job waits in queue), and also provides a safety net: if your job fails to exit properly, it will be killed when it exceeds the wall clock limit. To specify how long your job will need to run, add the resource specification -l walltime=HH:MM:SS, where HH is hours, MM is minutes, and SS is seconds. Note that if you specify a time above 10,000 hours, Maui will deem this "INFINITY" and display it as such in listings. A *very* low default may be set soon, to make sure that users define the wall time in their jobs (if you don't, the default will be used instead, which will end up killing the job well before it's finished).
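If you know your run time in seconds, a helper along these lines can produce the HH:MM:SS string. The name to_walltime is made up for this example.

```shell
# to_walltime SECONDS: print SECONDS as HH:MM:SS for use with -l walltime=
# Hypothetical helper, not part of PBS.
to_walltime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

to_walltime 5400      # prints 01:30:00
to_walltime 604800    # prints 168:00:00 (7 days)

# Building a resource list for qsub or a #PBS line:
echo "-l nodes=1:ppn=1,pmem=1234MB,walltime=$(to_walltime 5400)"
```

It is usually wise to request somewhat more time than you expect to need, since the job is killed once it exceeds the limit.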
Example job script
If you're not using an MPI program, then substitute the program's executable for the "mpiexec" line at the end.
    #!/bin/bash
    #
    # This line tells PBS to mail you when the job ends, or if it is aborted:
    #PBS -m ae
    #
    # This line requests a number of nodes. ppn means 'processors per node', for
    # MPI jobs you don't want to use more than one due to a problem with shared
    # memory + ethernet methods of IPC:
    #PBS -l nodes=8:ppn=1
    #
    # If you want to have STDOUT and STDERR joined into one file, then insert the
    # word PBS in between the hash mark and the -j here:
    # -j oe
    #
    # To pass variables to your program, define them as environment variables
    # before you call qsub, and add the -V option to qsub like this:
    # qsub -V /chimera/user/path/to/this/script
    #
    # Otherwise, you can pass variables with the -v option when you call it, like
    # this:
    # qsub /chimera/user/path/to/this/script -v variable1=value1,variable2=value2,...
    #
    # Also note that all options above that start with pound-PBS can be added to
    # the qsub commandline, like so:
    # qsub -l nodes=3:ppn=1 /chimera/user/path/to/this/script
    # This is handy if you write a superscript which calls qsub directly, so you
    # can tell the superscript how many nodes you want, instead of hard-coding it
    # into this job script.
    #
    # Now we call mpiexec, and give it the name of the MPI program:
    /usr/local/bin/mpiexec /chimera/user/path/to/program
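The "superscript" idea mentioned in the comments could look something like the sketch below. The function name submit_cmd is invented here, and it only prints the qsub command line rather than running it, so it can be tried without a cluster.

```shell
# submit_cmd NODES PPN SCRIPT: print the qsub command for the given node count.
# Hypothetical helper; it prints instead of submitting.
submit_cmd() {
    echo "qsub -l nodes=$1:ppn=$2 $3"
}

# The same job script at several sizes, without editing the script itself:
for n in 2 4 8; do
    submit_cmd "$n" 1 /chimera/user/path/to/this/script
done
```

To submit for real, you would call qsub directly inside the function instead of echoing the command.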
Yet More Examples *NEW*
As of April 2008, we've completely overhauled the OS and libraries on hydra. While most (if not all) of the information on this page should still be valid, below are a few quick examples from Bill Wichser on launching different types of jobs on the 'newer' hydra.
    Serial Job
    ===================================================
    module load intel
    ifort code.f

    PBS script
    ==========
    #PBS -l nodes=1:ppn=1,walltime=1:00:00
    module load intel
    cd $PBS_O_WORKDIR
    ./a.out

    Parallel Job
    ====================================================
    module load openmpi (or mpich) - assumes Intel compiler
    mpif90 parallel.f

    PBS script
    ==========
    #PBS -l nodes=2:ppn=2,walltime=1:00:00
    module load openmpi
    cd $PBS_O_WORKDIR
    mpiexec a.out

    #PBS -l nodes=2:ppn=2,walltime=1:00:00
    module load mpich
    cd $PBS_O_WORKDIR
    mpiexec -comm none a.out

    Please note:
    1) for standard mpich you'll need to disable shared memory using the
       -comm none flag for mpiexec
    2) module [list|avail|purge|load] are all useful commands in setting up
       your environment but you MUST add these to the PBS script
    3) openmpi is encouraged due to its better collective routines (if these
       are used). This one can also make use of IB, TCP, and SHMEM rather
       trivially, if available, such that with one compile you get it all.
       This is also the version supported on all the TIGRESS machines.
Scheduling
When a job is submitted on hydra to the PBS queuing system, it is assigned to a class (queue) according to its wall time parameter. There are currently three classes of jobs: debug (15 minute limit), short (4 hour limit), and long (7 day limit). These classes are used to assign a priority to each job, so that some types of jobs get preference over others.
In the current scheme, debug class jobs get highest priority. They get a 7 day bump in the job queue. This means that a debug class job will run before a long class job submitted 7 days ago. The short class jobs receive a 12 hour bump.
Priority is also determined by the wall time requested: within the same class (debug, short, long), shorter jobs receive higher priority and therefore run sooner.
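To see which class a given wall time would land in, the limits above can be written out as a small function. This is only an illustration: the name job_class is invented, the limits are assumed to be inclusive, and the actual assignment is done by the queuing system, not by anything you run.

```shell
# job_class HH:MM:SS: print the class for a job requesting that wall time.
# Hypothetical helper mirroring the limits described above.
job_class() (
    # awk reads the fields as decimal, so leading zeros such as "08" are safe.
    secs=$(printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    if   [ "$secs" -le 900 ];    then echo debug    # 15 minute limit
    elif [ "$secs" -le 14400 ];  then echo short    # 4 hour limit
    elif [ "$secs" -le 604800 ]; then echo long     # 7 day limit
    else echo "over the 7 day limit" >&2; exit 1
    fi
)

job_class 00:10:00   # prints debug
job_class 01:00:00   # prints short
job_class 24:00:00   # prints long
```

Since shorter requests get higher priority within a class, and debug and short jobs get a bump over long ones, it pays to request no more wall time than the job plausibly needs.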
Two tools can help in determining how best to fit your jobs into a heavily used machine. '/opt/maui/bin/showbf' will list the number of processors available for use right now with any time limit associated. The command '/opt/maui/bin/diagnose -p' will show the priorities of all jobs in the wait queue.
Limits
Due to jobs running away with too many resources, or a user taking up too much of the cluster for single-processor jobs, some limits were put in place.
- Users - maxproc=48,64
- 48 processors max when cluster is heavily used, 64 when/if the nodes are not in use, sitting idle.
Yes, this can be a problem. Typically some user will gain access to 64 processors at almost exactly the time when nodes are needed most. But it does enforce some limits. As for other limits, the only one is on class long, which is currently limited to 178 processors.
Documentation
Checking the man pages for PBS commands will yield quite a bit of information, but here are some other documentation links for PBS and Maui.
Useful commands
- '/opt/maui/bin/showres -n' will show what is currently reserving time on the nodes
- '/opt/pbs/bin/pbsnodes -n' shows more info
- '/opt/maui/bin/showbf -v' shows, in verbose form, the resources available for immediate use
- '/opt/pbs/bin/qstat -f' shows the full status of jobs in the queue
- '/opt/maui/bin/diagnose -n | egrep fourg' would show which 4GB nodes are available
Other documentation
The following PDFs are for "PBS Pro", while we run "OpenPBS" here. These files may serve as guides to supplement the manpages and such, but may not be 100% accurate.
The administrator's guide for Maui, the scheduler we run with PBS
More documentation for Maui can be found at http://supercluster.org/documentation/index.html. TeraGrid also has some useful information on PBS located at http://www.teragrid.org/userinfo/guide_jobs_pbs.html.
Tips & tricks
Redirecting output in job scripts
To redirect output, you have to put quote marks around the > symbol, like this:

    /usr/local/bin/mpiexec a.out ">" stdout_file
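What the quotes change can be seen with a stand-in for mpiexec that just prints its arguments. The function show_args is made up for this demonstration. It only shows the shell side of things; what mpiexec then does with the literal ">" argument depends on the mpiexec installed on the cluster and is not shown here.

```shell
# show_args: print each argument on its own line, in brackets.
# Hypothetical stand-in for a program such as mpiexec.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoted: the shell passes ">" through as an ordinary argument.
show_args a.out ">" stdout_file
# prints:
# [a.out]
# [>]
# [stdout_file]
```

Without the quotes, the shell running the job script would treat > as its own redirection and the program would only ever see a.out as an argument.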
