Difference between revisions of "Tracing Jobs"

From PDP/Grid Wiki
Jump to navigationJump to search
Line 7: Line 7:
  
 
You report the misbehaving job to the VO in question, and they ask you for the grid job ID.
 
You report the misbehaving job to the VO in question, and they ask you for the grid job ID.
 +
 +
=== Getting the info from qstat ===
  
 
If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, take a look at the job name:
 
If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, take a look at the job name:

Revision as of 12:32, 1 September 2010

This page helps you find out, given some "ID" for a job, how to find out other information.

Tracing a job given the PBS ID

Situation : you have identified the job via qstat, so you have an identifier like

7416345

or

7416345.stro.nikhef.nl

You report the misbehaving job to the VO in question, and they ask you for the grid job ID.

Getting the info from qstat

If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, take a look at the job name: bosui:~> qstat

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
7420628.stro              crash_808840023  alisgm02        23:27:59 R alice          
7422743.stro              crash_043407298  lhcbpi01        24:59:28 R lhcb           
7422877.stro              crash_907230299  lhcbpi01        25:25:50 R medium    
7423087.stro              crash_718305047  lhcbpi01        22:53:30 R lhcb           
7423147.stro              STDIN            lhcbpi01        23:23:57 R medium         
7423150.stro              STDIN            lhcbpi01        23:55:57 R lhcb [ ... ]

If the jobname looks like cr???_NNNNNNNN

In this case if cr??? == crash, your job was submitted to stremsel; if cr??? == crime, it went to juk. The piece after the underscore is the CREAM internal job ID. Proceed now to the section on #CREAM CE tracing

If the jobname is STDIN

In this case, and if you can still do a 'qstat' on the job, do so. If you can't do a qstat on the job, proceed to #Getting info from the PBS server logs

Do a qstat -f for that job:

bosui:~> qstat -f 7423147 | head
Job Id: 7423147.stro.nikhef.nl
    Job_Name = STDIN
    Job_Owner = lhcbpi01@trekker.nikhef.nl
    resources_used.cput = 23:28:25
    resources_used.mem = 2236848kb
    resources_used.vmem = 4094020kb
    resources_used.walltime = 24:06:58
    [ ... ]

The "Job_Owner" field tells you that this job was submitted to the CE "trekker". Proceed now to the section on #LCG CE tracing.

Getting info from the PBS server logs

If the job is no longer visible from qstat, you'll need to look at the PBS server logs on stro to get a bit more information, for example you need to know the name of the CE machine through which the job entered the system.

First, log into stro and look for the job in the PBS server logs:

stro:~> grep 7416345 /var/spool/pbs/server_logs/2010083*
/var/spool/pbs/server_logs/20100830:08/30/2010 13:25:19;0008;PBS_Server;Job;7416345.stro.nikhef.nl;Job Queued at request of lhcbpi01@gazon.nikhef.nl, owner = lhcbpi01@gazon.nikhef.nl, job name = STDIN, queue = lhcb
[ ... ]
/var/spool/pbs/server_logs/20100830:08/30/2010 13:25:24;0002;PBS_Server;Job;7416345.stro.nikhef.nl;child reported success for job after 0 seconds (dest=wn-lui2-008.farm.nikhef.nl), rc=0
[ ... ]
/var/spool/pbs/server_logs/20100831:08/31/2010 10:26:46;0010;PBS_Server;Job;7416345.stro.nikhef.nl;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=21:01:22

Given these lines you already have lots of information: you know which CE the job entered the system from, this is given by the "owner", lhcbpi01@gazon.nikhef.nl, hence the CE in this case was "gazon". Note the "Job Name" is STDIN, this is already a giveaway that the job went through an LCG CE. For a CREAM CE job, the lines look slightly different:

Here is a CREAM example : grepping for a different jobID on stro gave 'stremsel' as an answer:

stro:~> grep 7415602 /var/spool/pbs/server_logs/2010083*
/var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb

The "job name" here is "crash_559557119". This job name consists of two pieces, first a tag that identifies the CE (crash is associated with stremsel, and crime is associated with juk ... the tag has to be five characters starting with 'cr'). The second part, after the underscore (559557119) , is the unique identifier for the job on that CE.

After this step, you know the PBS job ID, the CE hostname, and in the case of a CREAM job, you know the unique job identifier used by the CE. Now proceed either to #LCG CE tracing or #CREAM CE tracing .

LCG CE tracing

For an lcg-CE, log into the CE. Remember the PBS jobid was 7416345 ... first find this jobID in the gatekeeper log file:

gazon:~> grep 7416345 /var/log/globus-gatekeeper.log
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs

The gatekeeper's ID for the job is the long string after the tag GATEKEEPER_JM_ID, so now grep the gatekeeper log file on that string:

gazon:~> grep 2010-08-30.13:25:15.0000011355 /var/log/globus-gatekeeper.log
LCMAPS 0: 2010-08-30.13:25:15.0000011355.0000000000 : 
LCMAPS 7: 2010-08-30.13:25:15.0000011355.0000000000 : Initialization LCMAPS version 1.4.7
[ ... lots of LCMAPS messages skipped ] 
PID: 11355 -- Notice: 0: GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136
JMA 2010/08/30 13:25:15 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg'
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 mapped to lhcbpi01 (71001, 2056)
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 JM exiting

Now you know under whose credentials the job entered the system:

/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz

You know which system submitted the job to gazon:

gazon:~> host 128.142.167.136 
136.167.142.128.in-addr.arpa domain name pointer wms203.cern.ch.

and now you know what the "grid jobID" is:

EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg'

CREAM CE tracing

Log in to the CE host and look for this CREAM jobID in the cream log file:

stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.?       
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO  org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \
   - (http-8443-Processor24) Job inserted. JobId = CREAM559557119
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,652 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \
   - (http-8443-Processor24) JOB CREAM559557119 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \
   [delegationId=12779882352E959602wms2162Ecern2Ech]
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,878 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:826) \
   - (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \
   USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \
   commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \
   jobId=CREAM559557119; status=PROCESSING;
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,879 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \
   - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: REGISTERED => PENDING [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \
   [delegationId=12779882352E959602wms2162Ecern2Ech]
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,906 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \
   - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: PENDING => IDLE [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] [delegationId=12779882352E959602wms2162Ecern2Ech]
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,907 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:839) - \
   (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \
   USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \
   commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \
   jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl;

Here you can see rather clearly most of the info you want to know : the gridJobID (second line), the submitting user DN. There are more lines matched by the grep than are shown here. Some of them show the state transitions when the job starts to run and when it exits.