Condor

From Peyton Hall Documentation


Condor is a batch scheduling system well suited to single-processor "serial" jobs. It allows you to submit jobs to be run, and farms them out to idle processors in the department as it finds room. Here you'll find information on Condor and the department's Condor configuration.



Overview

What is Condor?

From http://www.cs.wisc.edu/condor/description.html:

Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

Condor can be used for serial or parallel jobs, such as those that use MPI. However, it's probably a good idea to run MPI jobs only on Hydra's nodes, not in the general pool. Doing otherwise would only "require" that MPI be installed on the network where all the machines can see it, but there are other reasons parallel jobs would not work as well - for example, not every machine is the same speed, nor do they all have the same speed ethernet interfaces (and even if they did, they would still be slower than Hydra's gigabit ethernet). Condor is, however, quite useful for serial jobs, for many reasons:

  • Condor is a batch scheduler.
    This means you can queue up a list of jobs to be completed, and Condor will work through them however it can - you can tell it that some jobs must complete before others can start, that they can all run simultaneously, or however else your jobs need to be executed (see the example submit file after this list).
  • All the machines that can participate in the "cluster" are set up to do so.
    This includes fast and powerful desktop machines. Desktop users might think, "But that means someone's job will make my machine slower when I want to do work!" That is not the case; Condor is designed with this in mind:
    • If you start doing work on the computer, the Condor jobs will be immediately suspended.
    • If you're still working and the job remains suspended, after a time the job will be told to vacate the machine - if it can, it will checkpoint itself, and then remove itself from the computer and go back to the queue.
    • If another machine is available, the job will be transferred there, otherwise it will wait in queue until another machine is ready for jobs.
    This means that in the worst case scenario, you'll have to wait a few minutes while the job is transferred off of your desktop and back to the queue, and its associated data is moved off the machine over the network.
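
As a concrete starting point, here's a sketch of a minimal submit file for a batch of serial jobs. The executable, argument, and file names are placeholders - substitute your own program:

 #####
 # myjob.sub - minimal serial submit file (all names are hypothetical)
 #####
 universe   = vanilla
 executable = myprog
 arguments  = input.dat
 output     = myjob.$(Process).out
 error      = myjob.$(Process).err
 log        = myjob.log
 # queue 10 copies of the job; $(Process) runs from 0 to 9
 queue 10

Submit it with 'condor_submit myjob.sub', then watch its progress with 'condor_q'.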


Condor documentation

Documentation for Condor is available at http://research.cs.wisc.edu/htcondor/manual. For a quick start guide, see http://research.cs.wisc.edu/htcondor/manual/quickstart.html which also has some quick job submission script examples.

Condor was not "fully installed" as per the documentation, because doing so would require replacing the linker (ld) on every machine with Condor's own linker. So if you see a part of the documentation that refers to running one of two commands - one if Condor was fully installed and the other if not - run the second one. This only pertains to compiling programs with "condor_compile".
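
For example, you use condor_compile by prepending it to your usual compile command; the program and source file names here are placeholders:

 # relink hello.c so it can checkpoint under Condor (names are hypothetical)
 condor_compile gcc -o hello hello.c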


Our Condor configuration

One no longer needs to source any files or modules to use Condor; if Condor is set up on the machine to which you're connected, the binaries will be in your $PATH automatically.
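
A quick sanity check from any shell on a pool machine:

 # confirm the Condor binaries are on your PATH and see which version is installed
 which condor_submit
 condor_version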


FAQ

What machines are in the pool?

Most department owned machines are in the pool, as are all the graduate student desktops. If your machine is not in the pool, but you plan on using Condor heavily, then your desktop should be added to the pool as well. Since jobs vacate off of computers fairly quickly, there's no harm in having your computer added to the pool, and everyone who uses Condor benefits from the addition.


How can I see the pool status?

Use the command 'condor_status' to view the members of the pool, and their current status. You can also see some graphs of Condor usage and availability at http://www.astro.princeton.edu/private/cv/ (only hosts within the building network can view this page).
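
For example:

 # list every machine in the pool with its state and activity
 condor_status
 # show only the machines currently willing to run jobs
 condor_status -avail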


How do I view the job queue(s)?

Using 'condor_q' you can see the queue of the scheduler on your local machine. Add '-name <hostname>' to view the scheduler of another machine, such as Hydra, or add '-global' to see all of the queues in the building.
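
For example ('<hostname>' is a placeholder for an actual scheduler host):

 # the queue on your local machine
 condor_q
 # the queue on another machine's scheduler, such as Hydra's
 condor_q -name <hostname>
 # every queue in the building
 condor_q -global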


How do I get jobs to run on both INTEL and X86_64?

  • NOTE: This is no longer needed, as there are no 32-bit machines left in the pool.

By default, when you submit a job to Condor, it will only run on the same architecture as the host from which you submitted it. This means if you use one of the machines with the "INTEL" arch, it will only run on those - and there aren't many of them! But since jobs compiled for 32-bit execution will run on both the "INTEL" and "X86_64" architectures, add this bit of magic to the top of your job file:

Requirements   = (Arch == "INTEL" && OpSys == "LINUX") || \
                 (Arch == "X86_64" && OpSys == "LINUX")

This will tell your job that it can run on either of the architectures just the same. Also have a look in section 2.5.6 of the manual for more information.


How do I run my jobs on only a specific major version?

Since the department (currently) has a heterogeneous computing environment, made up of more than one OS distribution, you may want to limit your jobs to a specific version of the OS. Using a Requirements line (as shown in the previous question), you can specify OpSysAndVer == "foo", where foo is the release you want to run on. For example, if you want to run on Springdale 6 systems only, add the following line to your job file:

 Requirements = "OpSysAndVer == RedHat6"

This will run your jobs on only machines running Springdale 6.x. Likewise use "RedHat7" if you prefer Springdale 7 machines.
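
If either release is acceptable, you can accept both in a single expression, using the same ClassAd syntax as above:

 Requirements = (OpSysAndVer == "RedHat6") || (OpSysAndVer == "RedHat7")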


Why does my job in the standard universe fail with "Job not properly linked for Condor" even though I compiled it with condor_compile?

While this can be related to a number of issues, so far the most common problem we've seen is compiling using the gcc -pg flag, which enables profiling support. Remove this flag from your compile options and recompile.


Some user is running condor_exec and eating all my CPU!

When you see a "condor_exec" process running on a computer, it means that a user's job was run there through Condor. Many of the machines on the network have multiple cores, so it's not uncommon to see 2-4 of these running at the same time. For example, if you're just browsing the web and doing some minor work, it's quite likely that three of the four cores on a quad-core machine are idle and available for Condor. Having other processes running doesn't mean your use of the computer is degraded - often you can't even tell they're there, since you weren't using that power yourself. If you start running processes that do need the CPU cores, Condor will automatically "vacate" the job off of your computer after a short delay (around five minutes or so).


How can I only run a certain number of jobs (such as IDL jobs) at a time?

In some cases, you might only want to run a certain number of your jobs at once, while at the same time you want to batch them all and not have to manually start groups of 10 or whatnot. Condor has two ways this can be handled: concurrency limits and DAGMan.


Using Concurrency Limits

In your submit file, you need only specify "concurrency_limits = <something>". If there's a defined limit for <something>, then your jobs can collectively use up to that limit; for example, IDL is limited to 30 concurrent licenses (we have around 41 available in total), so IDL jobs won't eat up all the licenses at once and there should be enough left for interactive users. Specifying "concurrency_limits = IDL" will "check out" one IDL license per job, and Condor will only let 30 of those jobs run at once. Note that the 30 is not just yours; other users may have "checked out" IDL licenses this way too.

  • Note: The default concurrency limit for anything is 10; so if you set "concurrency_limits = Foo" and Foo_LIMIT is not otherwise defined in the configuration, then you can run 10 such jobs at once.
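
For instance, a submit file for a batch of IDL jobs might look like the sketch below; everything except the concurrency_limits line is a placeholder:

 # submit file sketch - executable and file names are hypothetical
 universe           = vanilla
 executable         = run_idl.sh
 log                = idl.log
 output             = idl.$(Process).out
 error              = idl.$(Process).err
 concurrency_limits = IDL
 # queue 100 jobs; Condor will run at most 30 at once (the IDL limit)
 queue 100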


Using DAGMan

You can specify in a DAGMan job how many concurrent processes you want to run at a time. Please see the manual for details (and feel free to write those details in here as well). One way to do this is by setting the MAXJOBS keyword in the DAG input file, as explained in the DAGMan chapter of the Condor manual. Your DAG input file will look something like this:

 #####
 #test.dag
 #####
 #jobs, each with a condor_submit file
 JOB a a.condor
 JOB b b.condor
 JOB c c.condor
 ...
 #place the concurrent jobs you want to limit in the same category called limited
 CATEGORY a limited
 CATEGORY b limited
 CATEGORY c limited
 ...
 #set a max number of jobs of type limited
 MAXJOBS limited 2

Note that this limits the number of job clusters. If a.condor submits 100 IDL jobs, you'll still be using a lot of licenses.
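
You then submit the DAG itself rather than the individual jobs. The -maxjobs option offers a similar throttle from the command line, though it caps the total number of job clusters the DAG has queued at once rather than a particular category:

 # submit the DAG; MAXJOBS in the file caps "limited" jobs at 2
 condor_submit_dag test.dag
 # or cap the total number of queued job clusters from the command line
 condor_submit_dag -maxjobs 2 test.dag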

How do I ask for additional assistance with using the departmental Condor pool?

Contact us as you would for any other inquiry, but please provide us with the location of your source code, job submission file, and any output or error logs you can. This will help us immensely in trying to troubleshoot issues (and we'll probably ask you for it anyway -- so save yourself some time and provide as much detail as you can in your initial e-mail).
