Cluster Computing
From Peyton Hall Documentation
Edit summary (minor edit): Removing irrelevant information about PBS, and updated "usage policies"
See also
- Condor, the scheduling system in use on Hydra and other machines on the network.
- MPICH, an implementation of the Message Passing Interface commonly used in large parallel jobs.
Category: Cluster Computing
Revision as of 21:47, 6 May 2009
No, we're not talking about star clusters - not this time, anyway. We're talking about Beowulf clusters or, to use a more general term, "computer clusters". These are not the same as what OIT calls its "computer labs".
Introduction
So what is a cluster? It's a set of machines, usually (but not necessarily) on a private network which is attached to a dual-homed "master node". Dual-homed means it sits on two networks at the same time, and may even act as a router between the two. This master node can allow logins, and is where you set up your large parallel jobs. Once a job is submitted, software on the master connects to the drones and runs the job there. This software is designed to execute programs fairly when there are available resources for them, and to make sure that nobody else starts a job on the nodes that you're using for your processes, so that everyone's programs get a fair share of the machine.
Hydra
Hydra is a 92 (72) node Beowulf cluster housed in the basement server room of Peyton Hall. Eight of the nodes have only 1GB of memory, another eight have 2GB of memory, and the remainder have 4GB. The nodes have dual processors ranging in speed from 2.2GHz to 3.06GHz.
Not all nodes are online presently. See the status message in the next section.
Status
This section will be updated with status information for Hydra as necessary.
- 2009-05-01: Hydra has been reimaged completely with CentOS 5, and is using Condor for scheduling instead of PBS. It no longer accepts user logins directly, as Condor jobs can be submitted from other machines without being logged into Hydra.
- 2008-04-24: We've done all the testing we can. Access has been restored to all users who previously had access to Hydra and who still have a valid Astro account.
- 2008-04-11: The entire cluster has been upgraded to PU_IAS Linux 5.1 (a Red Hat Enterprise variant). This distribution has much newer software and libraries than Hydra's old distro, which was painfully out of date. Given the latest generation of compilers and MPI libraries, we highly suggest recompiling your code before submitting new jobs. Hydra is currently in a limited test state. We will eventually (soon, hopefully) be reinstating full access to the cluster to all who had accounts, but for now, we are in a shakedown period.
- 2008-04-11: The current state of the hardware is:
- Hydra is currently running with only 72 nodes available. We had several hardware failures during the reinstall. We intend to try to get a few more nodes fixed and added to the pool, so stay tuned.
- Presently there are no 1GB nodes available, and our intent is to make the minimum memory on the whole cluster 2GB per node, if not 4.
- Hydra's hardware is incapable of running 64-bit, so the upgrade didn't include a move to a 64-bit OS.
- An additional RAID array has been installed on the Hydra cluster, named Chimera. Its scratch disk is mounted on Hydra as /chimera. To access it from the rest of the department, use the path /peyton/scr/chimera0. It is mounted to all the nodes, also as /chimera, and should be the preferred location for storing inputs and outputs from your programs.
- The Chimera disk (/chimera on Hydra, and /peyton/scr/chimera0 elsewhere) is truly a scratch disk, in that it is not and will not be backed up. It is a hardware RAID-5 array with one hot spare disk, and should handle most hardware failures without incident.
- There is another storage space on Cerberus, mounted as /work on Hydra. This disk used to be mounted on the nodes directly as well; however, NFS load problems and network overloading caused it to be unmounted.
- Some usage policies have been posted; please feel free to comment on them by sending mail to the cluster list. They are not strict policies that must be adhered to, but more like general guidelines to be kept in mind while submitting jobs to the systems. They are posted below.
- Also, a bug in PBS that caused every node to report having only 864MB of memory was recently fixed. Each node now properly reports the amount of RAM it has, so everyone should specify the amount of memory their jobs require through the PBS resource list ("#PBS -l mem=1380MB", for example). Maui will Do The Right Thing and only assign the job to a node with that much available RAM.
Getting access to Hydra
Anyone wishing to use Hydra for computations need only have their data stored somewhere that Hydra nodes can see. Currently that is on /scr/chimera0, but other scratch disks may be added to the pool in time. If you don't have a directory on Chimera and need one, please contact us.
Submitting jobs to Hydra
Hydra uses Condor for job management. You'll find information about how to use it in the Condor article.
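As a rough idea of what a submission looks like, here is a minimal sketch of a Condor submit description file, written out from the shell. Everything specific in it is an assumption for illustration only: the program name my_program, the "yourname" directory under /scr/chimera0, and the output file names are placeholders, and the vanilla universe is just the simplest choice for a serial program. The Condor article remains the place to look for what is actually appropriate on Hydra.

```shell
# Write a minimal Condor submit description file to a temporary location.
# All names below (my_program, yourname, run1) are hypothetical placeholders.
submit_file=$(mktemp "${TMPDIR:-/tmp}/myjob.XXXXXX")

cat > "$submit_file" <<'EOF'
universe   = vanilla
executable = /scr/chimera0/yourname/my_program
initialdir = /scr/chimera0/yourname/run1
output     = my_program.out
error      = my_program.err
log        = my_program.log
queue
EOF

# Print the command that would hand the file to Condor.
# It is printed rather than run, since condor_submit may not be installed here.
echo "condor_submit $submit_file"
```

Keeping the executable and the working directory on the scratch disk matches the requirement above that the data live somewhere the Hydra nodes can see.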
FAQ
Can I speed up my compiling?
Yes, you can speed things up with the '-j' option to make. From the man page:
-j jobs Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
So using 'make -j2' will cause make to run 2 jobs at the same time, which is good for a dual-processor machine. If you want it to just run wild, run 'make -j', and it will fork as many jobs as it can at the same time.
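If you'd rather not hard-code the number, one possible approach is to ask the system how many processors are online and pass that to make. This is only a sketch: the variable name njobs is made up for the example, and the fallback of 2 simply mirrors the dual-processor case mentioned above.

```shell
# Ask the system how many processors are online; fall back to 2
# (the dual-processor case) if the count isn't available.
njobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)

# Print the resulting command instead of running it, so this is safe
# to try in a directory that has no Makefile.
echo "make -j${njobs}"
```

Matching the job count to the processor count is usually a safer default than a bare 'make -j', which places no limit on how many jobs run at once.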