| Academic Computing and Communications Center | ||||||||||
||||||||||||||||||||||||||||||||||
ARGO: Running Jobs |
||||||||||||||||||||||||||||||||||
| Overview | ||||||||||||||||||||||||||||||||||
|
How you run a program on the argo cluster is VERY DIFFERENT from how you run a job on a single machine with one or multiple CPUs (for example, tigger) or, for that matter, how you would run a program on a PC or laptop under Windows or Linux. To begin with, you do not run your executable on the machine (the master) where you create it. The following point cannot be emphasized enough: do not run programs on the master.
There are monitors that alert systems to user programs running on the master. Running a program on the master is a violation of ACCC policy and can result in suspension and termination of your argo account. There are two types of programs that may be executed on the cluster: sequential jobs and parallel jobs.
For the purposes of the ACCC cluster, a sequential job is a single-instance program that runs on one and only one node. A parallel job is composed of:
Serial version of the classic hello_world program - source in C
#include <stdio.h>
/* main must return int in standard C; 0 reports success to the shell */
int main(int argc, char **argv) {
    printf("Hello-world\n");
    return 0;
}
Parallel version of the classic hello_world program using MPI - source in C
#include <stdio.h>
#include "mpi.h"
int main(int argc, char **argv) {
    int rank;   /* this process's id within MPI_COMM_WORLD */
    int size;   /* total number of processes in the job */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello-world, I'm rank %d; Size is %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
|
||||||||||||||||||||||||||||||||||
| Torque | ||||||||||||||||||||||||||||||||||
|
Torque is a networked subsystem for submitting, monitoring, and controlling a workload of jobs on the cluster.
Years ago, only batch jobs could execute on the cluster. THAT IS NOT THE CASE NOW: Torque does not restrict jobs to batch execution; interactive jobs, including ones where the user works with a GUI, may also be run. |
||||||||||||||||||||||||||||||||||
| Batch and interactive jobs | ||||||||||||||||||||||||||||||||||
|
Batch and interactive jobs can be run on argo.
A batch job is broadly defined as one where the job/program runs with no user interaction. Any input is provided by one or more files and there are no screens and menus like the kind you would see with a program running in a PC/Windows environment. You submit the program for execution and, upon its completion, you then review files containing results. You don't interact with the running program. An example of a batch program is Gaussian.
An interactive job is the opposite of a batch program. It is one where the user DOES interact with screens via a mouse and/or keyboard. The screens appear on your local computer monitor even though the executable is running remotely on argo. For a more detailed discussion about running interactive jobs on argo, see the document here. |
||||||||||||||||||||||||||||||||||
| Queues | ||||||||||||||||||||||||||||||||||
|
Queues are where jobs are submitted. Each queue is composed of compute nodes and has rules regarding who may use it and for how long. Jobs are submitted to a queue (via the qsub command) and the system decides which node or nodes assigned to that queue run the job. Queues are aptly named: some jobs execute immediately (running) while others wait to execute (queued) until the requested resources in the queue become available. There are five queues: three for jobs submitted by students and two for jobs submitted by staff/faculty.
A job submitted by a student to the staff queue will be denied access and routed instead to the student_short queue. Conversely, a job submitted by staff/faculty will not run on any of the student queues and will be re-routed to the staff queue. As for how long a job may execute in a queue, a job submitted to the student_short queue is terminated by the system after four hours, whereas a job submitted to the staff queue has a default runtime of 72 hours. Currently, when you log in to argo, a chart with both the default times and maximum times is displayed. For more information about how to tell your job which queue to use, click here. |
||||||||||||||||||||||||||||||||||
| Environmental Variables | ||||||||||||||||||||||||||||||||||
|
There are two environments, each with its own defined variables, available to you:
Shell environmental variables
To see a list of your shell environmental variables, type env | more at your shell prompt. To pass ALL the variables (not just a subset) to your job, include the -V option on the qsub command.
Torque environmental variables
Every user job has the following Torque environmental variables available to it:
|
||||||||||||||||||||||||||||||||||
| Commands | ||||||||||||||||||||||||||||||||||
The following four commands are important and you will use them often:
|
||||||||||||||||||||||||||||||||||
| Submitting a job for execution | ||||||||||||||||||||||||||||||||||
|
The command for submitting a job on the cluster is:
The script file, referenced in the above commands with the name my_script, contains one or more commands that tell the system what to do. The file may simply contain one line - the path and name of the executable to run:
One other point. Always include the -V option on the qsub command. |
||||||||||||||||||||||||||||||||||
| Job Output and Management | ||||||||||||||||||||||||||||||||||
|
After you submit a program for execution, it is assigned an identifying number called a job id:
where XXX is the number (you can disregard the argo.cc.uic.edu portion). To see the status of your job, use the qstat command:
For stdout and stderr, Torque creates two files. (If you are not familiar with stdout/stderr files, then click here.) The names of the two files are constructed from the job name, the letter e (for stderr) or o (for stdout), and the job number. So for the sample hello world program (listed earlier) that had job id 338, you would have the following files:
Let's take a look:
An empty error file is a good sign. Let's see what we have in stdout:
Gives:
And, that's what we should have. |
||||||||||||||||||||||||||||||||||
| Node Selection and Properties | ||||||||||||||||||||||||||||||||||
|
Every node has multiple properties associated with it. The property that
clients are most familiar with is the node name. Properties may be used to
identify and use a particular node. The properties of a particular node
can be displayed by using the qmgr command. For example, to see the
properties associated with argo1-1:
The property amd tells you that argo1-1 contains an Opteron processor. The property smp tells you that argo1-1 has dual processors (both of which, by virtue of the amd property, are Opterons). The generic syntax of the qsub command is:
where node_spec is:
A series of examples follows:
|
||||||||||||||||||||||||||||||||||
| My job does not run | ||||||||||||||||||||||||||||||||||
|
November 9, 2011. This section no longer reflects current information. Systems will re-write it. Please do not use.
Before you read this section, make sure you understand the information presented regarding qsub. If not, click here. The two most likely causes of a job not running are:
The request violates two policies. One, the user is requesting a total of
ten processors - two processors (ppn=2) on five nodes (nodes=5). The job is headed to the
student_short queue. As was explained previously, the maximum number of
processors (ncpus) a job may use on the student_short queue is eight (the
following command with the resulting answer tells you that) but the request
is for ten:
Two, the job is also requesting more nodes (max.nodect) than is permitted (the user wants five nodes when four is the maximum):
The following message should have appeared after issuing the qsub
command:
The user requests a particular node to run a job. Since the user is a student and did not identify a queue, the job is routed, by default, to the student_short queue. But argo1-1 is not assigned to the student_short queue. The job is requesting a resource that the queue does not own and that is, therefore, unavailable. The job does not run.
It is important to note that the maximum number of nodes and CPUs is cumulative across all your submitted jobs.
qstat -u jsmith1
JobID Username Queue Jobname SessID NDS TSK Memory Time S Time
----- -------- ----- --------- ------ --- --- ------ ---- - ----
1234 jsmith1 student_ my_script 12345 3 1 -- 04:00 R 03:41
argo13-2/1+argo13-2/0+argo7-4/1+argo7-4/0+argo7-3/1+argo7-3
1235 jsmith1 student_ my_script -- 3 1 -- 04:00 Q --
--
Why is the first job "running" - indicated by an R (for running) in the
status (S) column as well as by the names of the nodes assigned to the job -
while the second job is queued (Q), awaiting execution?
Both jobs were submitted by the student jsmith1 for execution on the
student_short queue, the second job soon after the first. And both were
submitted using the following qsub command invocation:
The first job requests six CPUs: two CPUs (ppn=2) on three nodes (nodes=3). The maximum number of CPUs a student may use on the student_short queue is eight:
Since jsmith1 held no other resources, the first job is assigned the six CPUs and begins execution. The second job also requests six CPUs. The user already has six CPUs (the first job) and requests an additional six (the second job) for a total of twelve, four OVER the limit of eight. The second job will not be assigned the requested resources and will sit, queued, awaiting the release of four CPUs from the first job. However, the release of CPUs is an all-or-nothing proposition. Therefore, the first job has to end before the second job begins.
The two commands that are most useful for diagnosing problems pertaining to jobs not running are:
The operand for both commands is the job id. The output of checkjob can be very cryptic, but the reason why the
job is not running is there. For example, suppose a student issues the
following command:
The job is assigned id 1277 but remains queued, which is indicated by the capital Q in the S (status) column in the output of the qstat 1277 command:
Job id  Name       User    Time Use  S  Queue
------  ---------  ------  --------  -  -------------
1277    my_script  jsmith  0         Q  student_short
If the student issues the checkjob 1277 command, the output will include
the following (the output is long and has been abbreviated here):
Messages: procs too high (12 > 8)
PE: 12.00 StartPriority: 7
cannot select job 1277 for partition DEFAULT (job hold active)
Look closely: "PolicyViolation: procs too high (12 > 8)." The
student is asking for twelve processors. Go back and take a look at the qsub command:
Four nodes multiplied by three processors per node results in twelve
processors. But the student is limited to a total of eight processors on
the student_short queue:
Remember to delete the queued job:
|
||||||||||||||||||||||||||||||||||
| How do I see what's going on on a compute node | ||||||||||||||||||||||||||||||||||
|
Though you can't log in to a compute node, you can see what's going on
via the rsh command. For example, suppose you want to see what processes
you are running. For processes on the master, it's the basic ps command:
For a compute node, you would do the following:
Make sure you enclose the entire command (after the node name) in double quotation marks:
All the shell commands are available. Suppose user jsmith had used the /tmp filesystem on the
compute node as a scratch area for a temporary file and, upon completion of the job, wants to
erase it:
$ rsh -l jsmith argo1-1 "ls -al /tmp/junk1"
|
||||||||||||||||||||||||||||||||||
| What programs I've run on argo | ||||||||||||||||||||||||||||||||||
|
Your job history is available. Click here to get a listing of all the months for which statistics are available. Or, you can go directly to the month by entering the URL:
where YYYY is replaced with the year and MM is replaced with the two-digit month. Examples: |
||||||||||||||||||||||||||||||||||
| Web Monitor | ||||||||||||||||||||||||||||||||||
|
There is also a very nice web-based tool to view the system.
To access it, point your browser to the following URL:
Notice that you must use https and not the insecure http. You will be prompted for your netid and password before
you are allowed in. If you can't access the tool, then it means you are not on campus or, if off campus, not
running a VPN on your computer. For more information about using a VPN, click here. |
||||||||||||||||||||||||||||||||||
| 2011-11-9 ACCC Systems Group |
|