Difference between revisions of "Tracing Jobs"
(9 intermediate revisions by the same user not shown) | |||
Line 8: | Line 8: | ||
You report the misbehaving job to the VO in question, and they ask you for the grid job ID. | You report the misbehaving job to the VO in question, and they ask you for the grid job ID. | ||
− | If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, | + | === Getting the info from qstat === |
− | bosui:~> qstat | + | |
+ | If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, you can follow this section; otherwise go to [[ #Getting info from the PBS server logs ]]. | ||
+ | |||
+ | bosui:~> qstat | ||
Job id Name User Time Use S Queue | Job id Name User Time Use S Queue | ||
------------------------- ---------------- --------------- -------- - ----- | ------------------------- ---------------- --------------- -------- - ----- | ||
Line 17: | Line 20: | ||
7423087.stro crash_718305047 lhcbpi01 22:53:30 R lhcb | 7423087.stro crash_718305047 lhcbpi01 22:53:30 R lhcb | ||
7423147.stro STDIN lhcbpi01 23:23:57 R medium | 7423147.stro STDIN lhcbpi01 23:23:57 R medium | ||
− | 7423150.stro STDIN lhcbpi01 23:55:57 R lhcb [ ... ] | + | 7423150.stro STDIN lhcbpi01 23:55:57 R lhcb |
+ | [ ... ] | ||
+ | |||
+ | ==== Parsing a CREAM Job Name ==== | ||
+ | |||
+ | If the job name looks like cr???_NNNNNNNN, then the job went to a CREAM CE. For example, the first job name listed here is crash_808840023. This job name consists of two pieces. First comes a tag that identifies the CE. The tag "crash" is associated with CE machine stremsel, and the tag "crime" is associated with CE machine juk. The tag names must follow the pattern "five characters starting with 'cr'", hence the silliness. The second part of the job name, after the underscore (808840023) , is the unique identifier for this job used internal to the CE service. | ||
+ | |||
+ | Now you can proceed to the section on [[ #CREAM CE tracing ]] | ||
'''If the jobname is STDIN''' | '''If the jobname is STDIN''' | ||
+ | |||
+ | In this case, the job went through an LCG CE. If you can still do a 'qstat' on the job, do so. If you can't do a qstat on the job, proceed to [[ #Getting info from the PBS server logs ]] | ||
+ | |||
Do a qstat -f for that job: | Do a qstat -f for that job: | ||
bosui:~> qstat -f 7423147 | head | bosui:~> qstat -f 7423147 | head | ||
Line 30: | Line 43: | ||
resources_used.walltime = 24:06:58 | resources_used.walltime = 24:06:58 | ||
[ ... ] | [ ... ] | ||
− | The "Job_Owner" field tells you that this job was submitted to the CE "trekker". Proceed now to the section on [[ # | + | The "Job_Owner" field tells you that this job was submitted to the CE "trekker". Proceed now to the section on [[ #LCG CE tracing]]. |
− | + | === Getting info from the PBS server logs === | |
− | |||
− | |||
+ | If the job is no longer visible from qstat, you'll need to look at the PBS server logs on stro to get a bit more information, for example you need to know the name of the CE machine through which the job entered the system. | ||
First, log into stro and look for the job in the PBS server logs: | First, log into stro and look for the job in the PBS server logs: | ||
Line 44: | Line 56: | ||
[ ... ] | [ ... ] | ||
/var/spool/pbs/server_logs/20100831:08/31/2010 10:26:46;0010;PBS_Server;Job;7416345.stro.nikhef.nl;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=21:01:22 | /var/spool/pbs/server_logs/20100831:08/31/2010 10:26:46;0010;PBS_Server;Job;7416345.stro.nikhef.nl;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=21:01:22 | ||
− | Given these lines you already have lots of information: you know which CE the job entered the system from, this is given by the "owner", lhcbpi01@gazon.nikhef.nl, hence the CE in this case was "gazon". | + | Given these lines you already have lots of information: you know which CE the job entered the system from, this is given by the "owner", lhcbpi01@gazon.nikhef.nl, hence the CE in this case was "gazon". Note the "Job Name" is STDIN, this is already a giveaway that the job went through an LCG CE. Now that you know the PBS job ID and the CE hostname, you can proceed to [[ #LCG CE tracing ]]. |
− | + | For a CREAM CE job, the lines look slightly different: here is an example. Grepping for a different jobID on stro gave 'stremsel' as an answer: | |
+ | stro:~> grep 7415602 /var/spool/pbs/server_logs/2010083* | ||
+ | /var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb | ||
+ | Proceed now to the section [[ #Parsing a CREAM Job Name ]]. | ||
=== LCG CE tracing === | === LCG CE tracing === | ||
For an lcg-CE, log into the CE. | For an lcg-CE, log into the CE. | ||
− | Remember the jobid was 7416345 ... first find this jobID in the gatekeeper log file: | + | Remember the PBS jobid was 7416345 ... first find this jobID in the gatekeeper log file: |
gazon:~> grep 7416345 /var/log/globus-gatekeeper.log | gazon:~> grep 7416345 /var/log/globus-gatekeeper.log | ||
JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs | JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs | ||
Line 70: | Line 85: | ||
gazon:~> host 128.142.167.136 | gazon:~> host 128.142.167.136 | ||
136.167.142.128.in-addr.arpa domain name pointer wms203.cern.ch. | 136.167.142.128.in-addr.arpa domain name pointer wms203.cern.ch. | ||
− | and you know what the "grid jobID" is: | + | and now you know what the "grid jobID" is: |
EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg' | EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg' | ||
− | + | === CREAM CE tracing === | |
− | + | Log in to the CE host and look for this CREAM jobID in the cream log file: | |
− | |||
− | |||
− | |||
− | |||
stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.? | stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.? | ||
/opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \ | ||
Line 101: | Line 112: | ||
commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ | commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ | ||
jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl; | jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl; | ||
− | Here you can see rather clearly most of the info you want to know : | + | Here you can see rather clearly most of the info you want to know : the gridJobID (second line), the submitting user DN. There are more lines matched by the grep than are shown here. Some of them show the state transitions when the job starts to run and when it exits. |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |
Latest revision as of 12:45, 1 September 2010
This page helps you find out, given some "ID" for a job, how to find out other information.
Tracing a job given the PBS ID
Situation : you have identified the job via qstat, so you have an identifier like
7416345
or
7416345.stro.nikhef.nl
You report the misbehaving job to the VO in question, and they ask you for the grid job ID.
Getting the info from qstat
If you are lucky and the job is still in the system, OR if you had enough foresight to save the qstat output, you can follow this section; otherwise go to #Getting info from the PBS server logs .
bosui:~> qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 7420628.stro crash_808840023 alisgm02 23:27:59 R alice 7422743.stro crash_043407298 lhcbpi01 24:59:28 R lhcb 7422877.stro crash_907230299 lhcbpi01 25:25:50 R medium 7423087.stro crash_718305047 lhcbpi01 22:53:30 R lhcb 7423147.stro STDIN lhcbpi01 23:23:57 R medium 7423150.stro STDIN lhcbpi01 23:55:57 R lhcb [ ... ]
Parsing a CREAM Job Name
If the job name looks like cr???_NNNNNNNN, then the job went to a CREAM CE. For example, the first job name listed here is crash_808840023. This job name consists of two pieces. First comes a tag that identifies the CE. The tag "crash" is associated with CE machine stremsel, and the tag "crime" is associated with CE machine juk. The tag names must follow the pattern "five characters starting with 'cr'", hence the silliness. The second part of the job name, after the underscore (808840023) , is the unique identifier for this job used internal to the CE service.
Now you can proceed to the section on #CREAM CE tracing
If the jobname is STDIN
In this case, the job went through an LCG CE. If you can still do a 'qstat' on the job, do so. If you can't do a qstat on the job, proceed to #Getting info from the PBS server logs
Do a qstat -f for that job:
bosui:~> qstat -f 7423147 | head Job Id: 7423147.stro.nikhef.nl Job_Name = STDIN Job_Owner = lhcbpi01@trekker.nikhef.nl resources_used.cput = 23:28:25 resources_used.mem = 2236848kb resources_used.vmem = 4094020kb resources_used.walltime = 24:06:58 [ ... ]
The "Job_Owner" field tells you that this job was submitted to the CE "trekker". Proceed now to the section on #LCG CE tracing.
Getting info from the PBS server logs
If the job is no longer visible from qstat, you'll need to look at the PBS server logs on stro to get a bit more information, for example you need to know the name of the CE machine through which the job entered the system.
First, log into stro and look for the job in the PBS server logs:
stro:~> grep 7416345 /var/spool/pbs/server_logs/2010083* /var/spool/pbs/server_logs/20100830:08/30/2010 13:25:19;0008;PBS_Server;Job;7416345.stro.nikhef.nl;Job Queued at request of lhcbpi01@gazon.nikhef.nl, owner = lhcbpi01@gazon.nikhef.nl, job name = STDIN, queue = lhcb [ ... ] /var/spool/pbs/server_logs/20100830:08/30/2010 13:25:24;0002;PBS_Server;Job;7416345.stro.nikhef.nl;child reported success for job after 0 seconds (dest=wn-lui2-008.farm.nikhef.nl), rc=0 [ ... ] /var/spool/pbs/server_logs/20100831:08/31/2010 10:26:46;0010;PBS_Server;Job;7416345.stro.nikhef.nl;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=21:01:22
Given these lines you already have lots of information: you know which CE the job entered the system from, this is given by the "owner", lhcbpi01@gazon.nikhef.nl, hence the CE in this case was "gazon". Note the "Job Name" is STDIN, this is already a giveaway that the job went through an LCG CE. Now that you know the PBS job ID and the CE hostname, you can proceed to #LCG CE tracing .
For a CREAM CE job, the lines look slightly different: here is an example. Grepping for a different jobID on stro gave 'stremsel' as an answer:
stro:~> grep 7415602 /var/spool/pbs/server_logs/2010083* /var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb
Proceed now to the section #Parsing a CREAM Job Name .
LCG CE tracing
For an lcg-CE, log into the CE. Remember the PBS jobid was 7416345 ... first find this jobID in the gatekeeper log file:
gazon:~> grep 7416345 /var/log/globus-gatekeeper.log JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs
The gatekeeper's ID for the job is the long string after the tag GATEKEEPER_JM_ID, so now grep the gatekeeper log file on that string:
gazon:~> grep 2010-08-30.13:25:15.0000011355 /var/log/globus-gatekeeper.log LCMAPS 0: 2010-08-30.13:25:15.0000011355.0000000000 : LCMAPS 7: 2010-08-30.13:25:15.0000011355.0000000000 : Initialization LCMAPS version 1.4.7 [ ... lots of LCMAPS messages skipped ] PID: 11355 -- Notice: 0: GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136 JMA 2010/08/30 13:25:15 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg' JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136 JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 mapped to lhcbpi01 (71001, 2056) JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 JM exiting
Now you know under whose credentials the job entered the system:
/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz
You know which system submitted the job to gazon:
gazon:~> host 128.142.167.136 136.167.142.128.in-addr.arpa domain name pointer wms203.cern.ch.
and now you know what the "grid jobID" is:
EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg'
CREAM CE tracing
Log in to the CE host and look for this CREAM jobID in the cream log file:
stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.? /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \ - (http-8443-Processor24) Job inserted. JobId = CREAM559557119 /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,652 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (http-8443-Processor24) JOB CREAM559557119 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,878 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:826) \ - (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ jobId=CREAM559557119; status=PROCESSING; /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,879 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: REGISTERED => PENDING [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,906 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: PENDING => IDLE [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,907 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:839) - \ (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl;
Here you can see rather clearly most of the info you want to know : the gridJobID (second line), the submitting user DN. There are more lines matched by the grep than are shown here. Some of them show the state transitions when the job starts to run and when it exits.