Difference between revisions of "NL Cloud Monitor Instructions"
Revision as of 09:33, 8 January 2010
Introduction
This page gives step-by-step instructions for shifters of the ATLAS NL-cloud regional operation to check several key monitoring pages used by ATLAS Distributed Computing (ADC). These pages are also monitored by the official ADC shifters (e.g. ADCoS, DAST).
The general architecture of ADC operation is shown below. The shifters addressed here are part of the "regional operation team". Their contribution will be credited by OTSMU.
Things to monitor
Follow the instructions below to check each monitoring page, and notify the NL cloud squad team of any problems via adc-nl-cloud-support@nikhef.nl.
ADCoS eLog
The ADCoS eLog is mainly used by ADC experts and ADCoS shifters to log the actions taken on a site in response to a site issue, for example removing a site from, or adding it back into, the ATLAS production system. The shifter has to notify the squad team if an issue has not been followed up for a long while (~24 hours).
The eLog entries related to NL-cloud can be found here.
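The ~24 hour rule above can be expressed as a small check. This is only a sketch: the function name and the idea of having the timestamp of the last follow-up at hand are assumptions for illustration, and nothing here talks to the eLog itself.

```python
from datetime import datetime, timedelta

# Follow-up limit taken from the "~24 hours" guideline on this page.
FOLLOW_UP_LIMIT = timedelta(hours=24)

def needs_escalation(last_follow_up, now, limit=FOLLOW_UP_LIMIT):
    """Return True if an issue has had no follow-up for longer than the limit.

    Hypothetical helper: both arguments are datetime objects that the
    shifter reads off the eLog entry by hand.
    """
    return now - last_follow_up > limit

# Example: an entry last updated 30 hours ago should be escalated.
now = datetime(2010, 1, 8, 9, 0)
stale = needs_escalation(datetime(2010, 1, 7, 3, 0), now)
fresh = needs_escalation(datetime(2010, 1, 8, 1, 0), now)
```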
DDM Dashboard
DDM Dashboard is used for monitoring the data transfer activities between sites.
The main monitoring page is explained below.
There are a few things to note on this page:
- The summary shows the data transfers "TO" a particular cloud or site. For example, transfers from RAL to SARA are categorized under "SARA", while transfers from SARA to RAL are categorized under "RAL".
- Each cloud is labelled with the name of its Tier-1. For example, "SARA" represents all transfers "TO" the NL cloud.
- It is handy to remember that the "yellow" bar indicates transfers to the NL cloud.
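The "TO" convention in the notes above can be illustrated with a short Python sketch. The record layout and the site-to-cloud mapping are hypothetical and are not taken from the DDM Dashboard; only the "SARA" and "RAL" labels appear on this page.

```python
from collections import Counter

# Hypothetical mapping from a site to the Tier-1 label of its cloud.
# Placing NIKHEF under the "SARA" label is an assumption for illustration.
SITE_TO_CLOUD = {"SARA": "SARA", "NIKHEF": "SARA", "RAL": "RAL"}

def errors_by_destination_cloud(transfers):
    """Count failed transfers per cloud, keyed by the cloud of the destination site."""
    counts = Counter()
    for transfer in transfers:
        if transfer["failed"]:
            counts[SITE_TO_CLOUD[transfer["destination"]]] += 1
    return counts

# Made-up sample records, not real dashboard data.
transfers = [
    {"source": "RAL", "destination": "SARA", "failed": True},
    {"source": "SARA", "destination": "RAL", "failed": True},
    {"source": "RAL", "destination": "NIKHEF", "failed": True},
    {"source": "RAL", "destination": "SARA", "failed": False},
]
summary = errors_by_destination_cloud(transfers)
```

In this sample the failed RAL to SARA transfer is counted under "SARA", and the failed SARA to RAL transfer is counted under "RAL", matching the convention described above.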
To check this page, follow these simple steps:
- Look at the bottom-right plot (total transfer errors). If the yellow bar persists every hour with a significant number of errors, go on to check the summary table below the plot.
- To check the failed transfers to the NL cloud, click on the "SARA" entry in the summary table. The table will expand to show the detailed transfers to the sites within the NL cloud. From there you can see which site is in trouble.
- Once you have identified the destination site of the problematic transfers, click on the "+" sign in front of the site. The table will expand again to show the "source site" of the transfers. Clicking on the number of transfer errors shown in the table (the 4th column from the end) displays the error message. These steps are illustrated in the figure below.
Report the problem to the NL squad team when the number of errors is high.
There are some cases in which you do not need to report:
- Don't report the problem if the error message indicates that it is a "SOURCE" error.
- Don't report the problem if the site is in downtime. The downtime schedule can be found here: http://lxvm0350.cern.ch:12409/agis/calendar/
- Don't report the problem again if the same error persists during your shift and you have already reported it.
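The reporting rules above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function name, its arguments and the numeric threshold are hypothetical. The page only says that the number of errors should be "high", so the default of 10 is a placeholder, not an official value.

```python
def should_report(error_message, site_in_downtime, already_reported,
                  error_count, threshold=10):
    """Decide whether a transfer problem should be sent to the NL squad team.

    threshold is an assumed placeholder for "the number of errors is high".
    """
    if "SOURCE" in error_message:
        return False  # source-side errors are not reported
    if site_in_downtime:
        return False  # the site is in a scheduled downtime
    if already_reported:
        return False  # same error already reported during this shift
    return error_count >= threshold

# Made-up error messages for illustration only.
report_new = should_report("DESTINATION error: timeout", False, False, 50)
skip_source = should_report("SOURCE error: file not found", False, False, 50)
skip_downtime = should_report("DESTINATION error: timeout", True, False, 50)
```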