Condor

From Peyton Hall Documentation


Condor is a batch scheduling system which is perfect for single processor "serial" jobs. It allows you to submit jobs to be run, and will farm them out to idle processors in the department as it finds room. Here you'll find information on Condor, and the department Condor configuration.



Overview

What is Condor?

From http://www.cs.wisc.edu/condor/description.html:

Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

Though Condor can be used for parallel jobs, such as those that use MPI, we have not configured our system to do so. Enabling it would only "require" that MPI be installed on the network where all the machines can see it; however, there are other reasons parallel jobs would not work well - for example, not every machine is the same speed, nor do they all have the same speed Ethernet interfaces (and even if they did, they would still be slower than the gigabit Ethernet the Beowulf cluster has). Condor is, however, quite useful for serial jobs, for many reasons:

  • Condor is a batch scheduler.
    This means you can queue up a list of jobs to be completed, and Condor will go through and do them however it can - you can tell it that some jobs must complete before others can start (see the DAGMan sketch after this list), or that they can all be done simultaneously, or however your jobs need to be executed.
  • All the machines that can participate in the "cluster" are set up to do so.
    This includes fast and powerful desktop machines. While those desktop users might be thinking, "But that means someone's job will be making my machine slower when I want to do work!" that is not the case. Condor is designed with this in mind:
    • If you start doing work on the computer, the Condor jobs will be immediately suspended.
    • If you're still working and the job remains suspended, after a time the job will be told to vacate the machine - if it can, it will checkpoint itself, and then remove itself from the computer and go back to the queue.
    • If another machine is available, the job will be transferred there, otherwise it will wait in queue until another machine is ready for jobs.
    This means that in the worst case scenario, you'll have to wait a few minutes while the job is transferred off of your desktop and back to the queue, and its associated data is moved off the machine over the network.
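
For instance, "some jobs must complete before others can start" is handled by Condor's DAGMan. A minimal sketch, assuming two submit files named stage_one.sub and stage_two.sub (the names are placeholders, not files that exist on our systems):

JOB A stage_one.sub
JOB B stage_two.sub
PARENT A CHILD B

Save those lines as (say) pipeline.dag and run 'condor_submit_dag pipeline.dag'; DAGMan will run job A to completion before it queues job B.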


Condor documentation

Documentation for Condor is available at http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html. For a quick start guide, see http://www.cs.wisc.edu/condor/quick-start.html which also has some quick job submission script examples.

Condor was not "fully installed" as per the documentation - this is because to do so we would have to replace the linker (ld) on all machines with their linker. So if you see a part in the documentation that refers to running one of two commands, one if Condor was fully installed and the other if not, run the second one. This only pertains to compiling programs using "condor_compile".


Our condor configuration

Due to changes in how Condor is set up in the building, you'll want to source either /u/condor/condor-setup.sh or /u/condor/condor-setup.csh in your shell startup scripts, depending on which shell you use (.bashrc for bash users, .cshrc for tcsh users). This will add the proper bits to your $PATH and such for the machine you're logged in to.
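
For example, using the setup scripts named above, bash users would add this line to ~/.bashrc:

source /u/condor/condor-setup.sh

and tcsh users would add this line to ~/.cshrc:

source /u/condor/condor-setup.csh

Afterwards, start a new shell and run 'which condor_submit' to confirm that the Condor binaries are on your $PATH.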


FAQ

What machines are in the pool?

Most department-owned machines (see here) are in the pool, as are all the graduate student desktops. If your machine is not in the pool but you plan on using Condor heavily, your desktop should be added to the pool as well. Since jobs vacate computers fairly quickly, there's no harm in having your computer added to the pool, and everyone who uses Condor benefits from the addition.


How can I see the pool status?

Use the command 'condor_status' to view the members of the pool, and their current status.
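
A couple of variations come in handy, for example restricting the listing to a particular architecture (the constraint expression is just an illustration):

condor_status -avail
condor_status -constraint 'Arch == "X86_64"'

The first form lists only machines currently willing to run jobs; the second lists only the X86_64 machines.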


How do I get jobs to run on both INTEL and X86_64?

By default, when you submit a job to Condor, it will only run on machines with the same architecture as the host from which you submitted it. This means if you submit from one of the machines with the "INTEL" arch, your job will only run on those - and there are not that many of them! But since jobs compiled for 32-bit execution will run on both the "INTEL" and "X86_64" architectures, add this bit of magic to the top of your job file:

Requirements   = (Arch == "INTEL" && OpSys == "LINUX") || \
                 (Arch == "X86_64" && OpSys == "LINUX")

This tells Condor that your job can run on either architecture just the same. Also have a look at section 2.5.6 of the manual for more information.
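
To see where that Requirements line fits, here is a minimal sketch of a complete submit file (the executable and file names are placeholders):

# myjob.sub - illustrative only
universe       = vanilla
executable     = myprog
output         = myprog.$(Process).out
error          = myprog.$(Process).err
log            = myprog.log
Requirements   = (Arch == "INTEL" && OpSys == "LINUX") || \
                 (Arch == "X86_64" && OpSys == "LINUX")
queue

Submit it with 'condor_submit myjob.sub' and check on it with 'condor_q'.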
