Difference between revisions of "Tracing Jobs"
Line 49: | Line 49: | ||
/var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb | /var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb | ||
− | The "job name" here is "crash_559557119". | + | The "job name" here is "crash_559557119". This job name consists of two pieces, first a tag that identifies the CE (crash is associated with stremsel, and crime is associated with juk ... the tag has to be five characters starting with 'cr'). The second part, after the underscore, is the unique identifier for the job according to the CE. Since this job came from stremsel (see again the 'owner' field above), log into stremsel and look for this jobID in the cream log: |
+ | stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.? | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \ | ||
+ | - (http-8443-Processor24) Job inserted. JobId = CREAM559557119 | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,652 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ | ||
+ | - (http-8443-Processor24) JOB CREAM559557119 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ | ||
+ | [delegationId=12779882352E959602wms2162Ecern2Ech] | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,878 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:826) \ | ||
+ | - (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ | ||
+ | USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ | ||
+ | commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ | ||
+ | jobId=CREAM559557119; status=PROCESSING; | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,879 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ | ||
+ | - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: REGISTERED => PENDING [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ | ||
+ | [delegationId=12779882352E959602wms2162Ecern2Ech] | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,906 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ | ||
+ | - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: PENDING => IDLE [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] [delegationId=12779882352E959602wms2162Ecern2Ech] | ||
+ | /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,907 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:839) - \ | ||
+ | (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ | ||
+ | USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ | ||
+ | commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ | ||
+ | jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl; | ||
+ | Here you can see rather clearly most of the info you want to know : thhe gridJobID (second line), the submitting user DN. There are more lines matched by the grep, that show the state transitions when the job starts to run and when it exits. | ||
+ | |||
+ | Note, if the job still exists on the system, you can immediately see if it's a CREAM job and if so, you don't even need to go to stro: | ||
+ | bosui:~> qstat | grep lhc | tail | ||
+ | 7428058.stro STDIN lhcbpi01 0 R medium | ||
+ | 7428059.stro STDIN lhcbpi01 0 R medium | ||
+ | 7428060.stro crash_709209640 lhcbpi01 0 R medium | ||
+ | The last job listed is a CREAM job on stremsel (tag "crash"), the part after "crash_" is already the internal stremsel ID, so one can go directly to stremsel. | ||
+ | stremsel:~> grep 709209640 /opt/glite/var/log/glite-ce-cream.log | ||
+ | 01 Sep 2010 13:55:59,293 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) - (http-8443-Processor24) Job inserted. JobId = CREAM709209640 | ||
+ | 01 Sep 2010 13:55:59,427 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) - (http-8443-Processor24)\ | ||
+ | JOB CREAM709209640 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://lcglb02.gridpp.rl.ac.uk:9000/Ub88MWpNWjebUa1vR15Yjg] \ | ||
+ | [delegationId=12819733242E857889lcgwms012Egridpp2Erl2Eac2Euk] |
Revision as of 12:01, 1 September 2010
This page helps you find out, given some "ID" for a job, how to find out other information.
Tracing a job given the PBS ID
Situation : you have identified the job via qstat, so you have an identifier like
7416345
or
7416345.stro.nikhef.nl
You report the misbehaving job to the VO in question, and they ask you for the grid job ID. First, log into stro and look for the job in the PBS server logs:
stro:~> grep 7416345 /var/spool/pbs/server_logs/2010083* /var/spool/pbs/server_logs/20100830:08/30/2010 13:25:19;0008;PBS_Server;Job;7416345.stro.nikhef.nl;Job Queued at request of lhcbpi01@gazon.nikhef.nl, owner = lhcbpi01@gazon.nikhef.nl, job name = STDIN, queue = lhcb [ ... ] /var/spool/pbs/server_logs/20100830:08/30/2010 13:25:24;0002;PBS_Server;Job;7416345.stro.nikhef.nl;child reported success for job after 0 seconds (dest=wn-lui2-008.farm.nikhef.nl), rc=0 [ ... ] /var/spool/pbs/server_logs/20100831:08/31/2010 10:26:46;0010;PBS_Server;Job;7416345.stro.nikhef.nl;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=21:01:22
Given these lines you already have lots of information: you know which CE the job entered the system from, this is given by the "owner", lhcbpi01@gazon.nikhef.nl, hence the CE in this case was "gazon". You also know which worker node the job ran on (dest=wn-lui2-008.farm.nikhef.nl) and the last line tells you what time the job ended, how much cpu and memory it used, and wall time. The record for the memory is wrong in this case (0 kb memory used, not very likely!), most likely because there was a power glitch on this day, causing the WN to go down during the job execution ... the end record came when the WN came back up and the job was already gone.
What to do next, depends on whether the CE in question is an LCG-CE or a CREAM CE.
LCG CE tracing
For an lcg-CE, log into the CE. Remember the jobid was 7416345 ... first find this jobID in the gatekeeper log file:
gazon:~> grep 7416345 /var/log/globus-gatekeeper.log JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs
The gatekeeper's ID for the job is the long string after the tag GATEKEEPER_JM_ID, so now grep the gatekeeper log file on that string:
gazon:~> grep 2010-08-30.13:25:15.0000011355 /var/log/globus-gatekeeper.log LCMAPS 0: 2010-08-30.13:25:15.0000011355.0000000000 : LCMAPS 7: 2010-08-30.13:25:15.0000011355.0000000000 : Initialization LCMAPS version 1.4.7 [ ... lots of LCMAPS messages skipped ] PID: 11355 -- Notice: 0: GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136 JMA 2010/08/30 13:25:15 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg' JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 for /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz on 128.142.167.136 JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 mapped to lhcbpi01 (71001, 2056) JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 has GRAM_SCRIPT_JOB_ID 7416345.stro.nikhef.nl manager type pbs JMA 2010/08/30 13:25:20 GATEKEEPER_JM_ID 2010-08-30.13:25:15.0000011355.0000000000 JM exiting
Now you know under whose credentials the job entered the system:
/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz
You know which system submitted the job to gazon:
gazon:~> host 128.142.167.136 136.167.142.128.in-addr.arpa domain name pointer wms203.cern.ch.
and you know what the "grid jobID" is:
EDG_WL_JOBID 'https://wms203.cern.ch:9000/2gABtYdoEjq6R-fNdKH9vg'
CREAM CE tracing
Here is a CREAM example : grepping for the jobID on stro gave 'stremsel' as an answer:
stro:~> grep 7415602 /var/spool/pbs/server_logs/2010083* /var/spool/pbs/server_logs/20100830:08/30/2010 12:04:38;0008;PBS_Server;Job;7415602.stro.nikhef.nl;Job Queued at request of lhcbpi01@stremsel.nikhef.nl, owner = lhcbpi01@stremsel.nikhef.nl, job name = crash_559557119, queue = lhcb
The "job name" here is "crash_559557119". This job name consists of two pieces, first a tag that identifies the CE (crash is associated with stremsel, and crime is associated with juk ... the tag has to be five characters starting with 'cr'). The second part, after the underscore, is the unique identifier for the job according to the CE. Since this job came from stremsel (see again the 'owner' field above), log into stremsel and look for this jobID in the cream log:
stremsel:~> grep 559557119 /opt/glite/var/log/glite-ce-cream.log.? /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,401 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) \ - (http-8443-Processor24) Job inserted. JobId = CREAM559557119 /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,652 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (http-8443-Processor24) JOB CREAM559557119 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,878 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:826) \ - (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ jobId=CREAM559557119; status=PROCESSING; /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:35,879 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: REGISTERED => PENDING [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] \ [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,906 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) \ - (Worker Thread 29) JOB CREAM559557119 STATUS CHANGED: PENDING => IDLE [localUser=lhcbpi01] [gridJobId=https://wms216.cern.ch:9000/eSjUmw9LYh65LY52D2zsow] [delegationId=12779882352E959602wms2162Ecern2Ech] /opt/glite/var/log/glite-ce-cream.log.5:30 Aug 2010 12:04:40,907 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:839) - \ (Worker Thread 29) REMOTE_REQUEST_ADDRESS=128.142.167.157; USER_DN=/DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz; \ USER_FQAN={ /lhcb/Role=pilot/Capability=NULL; /lhcb/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; \ commandName=JOB_START; cmdExecutorName=BLAHExecutor; userId=_DC_es_DC_irisgrid_O_ecm_ub_CN_Ricardo_Graciani_Diaz_lhcb_Role_pilot_Capability_NULL; \ jobId=CREAM559557119; status=PROCESSING; lrmsAbsJobId=pbs/20100830/7415602.stro.nikhef.nl;
Here you can see rather clearly most of the info you want to know : thhe gridJobID (second line), the submitting user DN. There are more lines matched by the grep, that show the state transitions when the job starts to run and when it exits.
Note, if the job still exists on the system, you can immediately see if it's a CREAM job and if so, you don't even need to go to stro:
bosui:~> qstat | grep lhc | tail 7428058.stro STDIN lhcbpi01 0 R medium 7428059.stro STDIN lhcbpi01 0 R medium 7428060.stro crash_709209640 lhcbpi01 0 R medium
The last job listed is a CREAM job on stremsel (tag "crash"), the part after "crash_" is already the internal stremsel ID, so one can go directly to stremsel.
stremsel:~> grep 709209640 /opt/glite/var/log/glite-ce-cream.log 01 Sep 2010 13:55:59,293 INFO org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) - (http-8443-Processor24) Job inserted. JobId = CREAM709209640 01 Sep 2010 13:55:59,427 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2094) - (http-8443-Processor24)\ JOB CREAM709209640 STATUS CHANGED: -- => REGISTERED [localUser=lhcbpi01] [gridJobId=https://lcglb02.gridpp.rl.ac.uk:9000/Ub88MWpNWjebUa1vR15Yjg] \ [delegationId=12819733242E857889lcgwms012Egridpp2Erl2Eac2Euk]