NDPFAccounting
Systems involved
vlaai | gridmapdir NFS | create poolmap files based on the gridmapdir state on a daily basis |
stro | Torque server | conversion from Torque accounting files and inserting these into the NDPF accounting database on fulda, using a poolmapfile (it will try to collect one automatically if it can mount the gridmapdir) |
bosui | any EL5 system | extraction of data from the database into lcgRecords format and upload through AMQ-OpenWire (but note the program is called 'stompfeeder', in the vague hope that one day the APEL group implements a usable upload protocol ...) |
The relevant scripts for collecting and assembling the accounting information are contained in a single RPM package "ndpf-acctfeeder" that contains both the local and the EGEE scripts (and the accuse client tool). Formally, the dependencies include only perl, perl-DBI, and perl-DBD-MySQL, but there are a few others needed on specific hosts:
- pbsnodes
- needed for the facility capacity option (default) in ndpf-acctfeed on the Torque server
They have not been included in the rpm dependencies, so as to be able to have a single RPM that installs everywhere. This RPM does not install any cron jobs, and you must edit these two files where relevant:
- /etc/pbsaccdb.conf
- on the Torque server, needed for pbsaccdb.pl and pbsstatusdb.pl
- /etc/stompfeeder.conf
- on the uploader box, needed for ndpf-stompfeeder to (optionally) set a new default group-to-VO definition, as well as database access passwords and the like, unless specified on the command line
Both files, if present, must be readable only by root (uid 0).
The new AMQ uploader tool, along with some necessary AMQ libraries, is installed on bosui via the rpm 'nikhef-apel'. Looking in /usr/local/bin/nikhef-apel is reasonably helpful for figuring out how to configure the thing. The AMQ uploader basically pipes SQL statements into the OpenWire uploader CLI, which sends the records off to the UK.
Sources
All relevant sources and the database schema are in SVN at
https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/
NDPF Local Accounting
The local accounting is the most important element, and must be (and is :-) fully reliable, because it is the basis for cost reimbursement in projects where we make in-kind contributions in the form of compute cycles. These data are collected (yearly) from the NDPF accounting database on a per-VO basis.
Data is inserted into this database on a daily basis. The records are (or should be) inserted just after midnight, when the pbs/torque accounting files have been closed and are complete. Since the accounting is based on the "E" records in that file, we thus get all completed jobs. Jobs that are still running will not be accounted; they will be filed only when they have finished.
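To illustrate what "based on the E records" means, here is a minimal sketch that keeps only the job-end records of an accounting file. It assumes the usual Torque layout of `date time;record type;job id;key=value ...` per line. The function name and the sample host name are made up for illustration, and this is not the code used by pbsaccdb.pl.

```python
def end_records(lines):
    """Yield (job_id, fields) for every job-end ("E") record.

    Assumes the Torque accounting layout
    "MM/DD/YYYY HH:MM:SS;TYPE;JOBID;key=value key=value ...".
    """
    for line in lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # queued/started/deleted records are not accounted
        _timestamp, _rtype, job_id, message = parts
        # Values are split on whitespace; anything without "=" is skipped.
        fields = dict(
            item.split("=", 1) for item in message.split() if "=" in item
        )
        yield job_id, fields


if __name__ == "__main__":
    sample = [
        "12/01/2008 00:01:02;S;123.stro.example;user=alice queue=short",
        "12/01/2008 01:02:03;E;123.stro.example;user=alice group=atlas",
    ]
    for job_id, fields in end_records(sample):
        print(job_id, fields["user"], fields["group"])
```

A job that is still running at midnight has only "Q" and "S" records in that day's file, so it shows up in the file of the day it ends.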
Master insertion
Insertion in the database requires the collaboration of two components:
- the mapping from poolaccounts to grid user DNs
- the extraction of the pbs data from the accounting file, and linking the unix users of the facility to their grid credentials
At the moment, the grid group (FQAN) mappings are not part of this scheme, and only unix groups are stored in the database. The unix group-to-gridVO mapping is only done in the EGEE upload phase. This is partly historical, but since the VO-FQAN mapping side of the grid software is in constant flux anyway, it is better the way it is done now. The FQAN info is added at a later stage with the CEJoiner, see below.
To ease the insertion, a meta-utility has been developed: ndpf-acctfeed. It is to be run on the PBS master (stro) every night, and by default will process yesterday's accounting file:
Usage: ndpf-acctfeed [-v] [-h] [-f] [--mapfile <poolmap>|--gridmapdir <dir>] [--date|-d Ymd] [--nocapacity] [-n|--dryrun] [--pbsaccdir dir] [--progacc command] [--progcapacity command]
and in accounting.ncm-cron.cron:
15 0 * * * root (PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1
The utility will do its very best to find a mapping from unix uids to gridDNs. By default it will look for the .poolmap.YYYYMMDD files that are created at midnight on the gridmapdir NFS server in the gridmapdir (currently on vlaai at 23:58 local time). If such a poolmap cannot be found, it will first try the file you specified on the command line without the YYYYMMDD postfix, then will read the gridmapdir (/share/gridmapdir by default) and create a temporary poolmapfile just for this run. If both the gridmapdir and the poolmapfile(s) are unreadable, the utility aborts.
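The lookup order described above can be sketched as follows. This is one reading of the paragraph, not the actual ndpf-acctfeed logic: the function name and the returned labels are hypothetical, and only the `.poolmap.YYYYMMDD` naming and the `/share/gridmapdir` default come from the text.

```python
import os


def find_poolmap_source(day, gridmapdir="/share/gridmapdir", mapfile=None):
    """Return (kind, path) for the first usable uid-to-DN mapping source.

    day is a YYYYMMDD string; mapfile is an optional explicit poolmap file.
    """
    # 1. The dated poolmap written nightly into the gridmapdir.
    dated = os.path.join(gridmapdir, ".poolmap." + day)
    if os.access(dated, os.R_OK):
        return ("dated-poolmap", dated)
    # 2. The file named on the command line, without a date suffix.
    if mapfile and os.access(mapfile, os.R_OK):
        return ("given-mapfile", mapfile)
    # 3. Fall back to scanning the gridmapdir itself for a temporary map.
    if os.path.isdir(gridmapdir) and os.access(gridmapdir, os.R_OK):
        return ("scan-gridmapdir", gridmapdir)
    # 4. Nothing readable: give up, as the utility does.
    raise RuntimeError("no poolmap file and no readable gridmapdir")
```

Note that the third option only reflects the mappings as they are at the time of the run, which is why the dated files matter when re-inserting older data.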
Parsing PBS accounting data
The pbsaccdb.pl script parses the PBS accounting files, extracts key user information and the list of participating nodes, and calculates (normalised) usage times. It then inserts the results into the database.
pbsaccdb.pl /var/spool/pbs/server_priv/accountings/YYYYMMDD
Options:
- -c
- configuration file. Defaults to /etc/pbsaccdb.conf.
- --init
- try and insert a schema into the database. NOT recommended anymore, since there is a separate schema sql file now
- -u --user
- database user name. Usually read from the config file.
- --password
- password for the insertion user (must have INSERT and REPLACE rights). Usually read from the config file.
- -f --facility
- name of this facility (e.g. "lcg2elprod")
- --thishost
- used only for database GRANT statement creation
- -h
- database host
- -p
- database port
- --map
- filename of the account map file for sDN translation (NOT the grid-mapfile!)
Collecting data from PBS/Torque
The file /etc/pbsaccdb.conf must be present on the collecting system (i.e. the Torque server) and be formatted as described in NDPFAccouting_pbsaccdbconf.
Scaling and the GHzHourEquivalent
The GHzHoursEquiv normalisation (a bit like "Processor Node Hours") is an internal unit. 1 GHzEquivHour corresponds to 410 SI2k-'rate'-base. The GHzEquivHours value used by a job is taken as the sumproduct of the participating cores and the time on each core. Calculating it needs the actual list of nodes and the performance of each of the participating cores.
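As a worked illustration of the sumproduct, the sketch below assumes that each participating core is described by its SI2k rating and the hours the job spent on it, and that 410 SI2k counts as one GHz equivalent. The function name and the input layout are assumptions made for this example. pbsaccdb.pl may organise the calculation differently.

```python
SI2K_PER_GHZ_EQUIV = 410.0  # 1 GHzEquivHour corresponds to 410 SI2k


def ghz_equiv_hours(core_usage):
    """Sum (core performance in GHz equivalents) x (hours on that core).

    core_usage is an iterable of (si2k_rating, hours) pairs, one per core.
    """
    return sum(si2k / SI2K_PER_GHZ_EQUIV * hours for si2k, hours in core_usage)


if __name__ == "__main__":
    # One core rated 820 SI2k for 2 hours plus one core rated 410 SI2k
    # for 1 hour: 2.0 * 2 + 1.0 * 1 GHzEquivHours.
    print(ghz_equiv_hours([(820, 2.0), (410, 1.0)]))
```

The per-core ratings are what make the list of nodes necessary: two jobs with the same wall time can end up with different normalised values.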
Incorporating grid mappings
The grid mappings are matched to the unix uids by the pbsaccdb.pl script, based on a mapfile. This map file is formatted with one mapping per line, with a single TAB character (\t) between the uid and the DN strings, like in:
unixuid /DNstring
This file is not generated by pbsaccdb.pl, but needs to be prepared and passed as a command-line argument (otherwise a sensible default is taken, based on the date in the pbs accounting filename given). Normally, the ndpf-acctfeed meta-utility takes care of matching the poolmapfile and the accounting file based on dates, but this utility too will look for a 'true' mapfile that reflects the actual grid-DN-mappings in use on that date. So, such a file must be generated daily. Note that you do need mapfiles for the actual dates, since poolaccounts expire after some time and get recycled (usually after 100 days of inactivity).
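Reading such a mapfile is straightforward as long as the split is done on the first TAB only, because DN strings routinely contain spaces. A minimal sketch, where the function name and the sample DN are invented for illustration:

```python
def read_poolmap(lines):
    """Return {unix_uid: DN} from lines of the form "uid<TAB>DN"."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        uid, sep, dn = line.partition("\t")
        if not sep or not uid:
            continue  # skip blank or malformed lines
        mapping[uid] = dn
    return mapping


if __name__ == "__main__":
    sample = ["atlas001\t/O=example/O=users/CN=Some User\n"]
    print(read_poolmap(sample))
```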
The mapfiles are generated by the poolmaprun script (part of the managepoolmap package, vlaai currently has managepoolmap-1.1-2 installed) on the NFS server hosting the gridmapdir (today: vlaai). This script will both create the mapfile of today and afterwards release any poolaccount mappings that have been idle for 100 days. This script is run from cron on the gridmapdir server:
/etc/cron.d/manage_gridmapdir: 58 23 * * * root MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun
The cron job is installed automatically by the RPM. The script must run on the server since the directory is (and should be) root-owned and the files are written (as '.poolmap.YYYYMMDD') to this directory. Really historic poolmaps are then later moved to a subdirectory gridmapdir/.history/
VO FQAN and ingress information
The base job table only knows about the reference to the groupid and userid entries from the grid map file. Additional information on the user, such as more attributes and the list of FQANs used via VOMS, is only recorded at the job ingress points. As such, the initial record in the job table will not have this information. It needs to be augmented after insertion through the credential linkage tables, which are populated on each of the ingress points. When the ingress point fails to supply this info, only basic information (VO, userID, X.509 subject name) will be available.
Re-inserting historic data
First make sure you have accurate poolmap files, and then re-run the ndpf-acctfeed program with a date option:
ndpf-acctfeed --date 20081201
and do this for every missing day. It is harmless to re-insert the same day twice: the rows in the database will just be replaced, as the table is keyed on the JobID, which is generated like
md5_base64($pbsinfo{qtime}.$MasterFQDN.$JobID)
where qtime is the time the job was put in the queue by the user, the MasterFQDN is the hostname of the Torque master server, and the JobID is the numerical part of the Torque job id, including the sub-job ID (e.g. "4132121-42"). These remain constant once a job is submitted by the user.
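For those who want to compute the same key outside Perl, the sketch below mirrors the expression above. It relies on the fact that md5_base64 from Perl's Digest::MD5 returns the base64 string without trailing "=" padding. The function name and the sample values are made up, and it assumes qtime is used exactly as it appears in the accounting record.

```python
import base64
import hashlib


def job_key(qtime, master_fqdn, job_id):
    """Python counterpart of md5_base64($qtime . $MasterFQDN . $JobID)."""
    data = "{}{}{}".format(qtime, master_fqdn, job_id).encode("utf-8")
    digest = hashlib.md5(data).digest()
    # Perl's md5_base64 does not pad, so strip the "==" that b64encode adds.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    key = job_key(1228089600, "stro.example.org", "4132121-42")
    print(key, len(key))
```

Because all three inputs are fixed at submission time, re-running the insertion for the same day yields the same keys and the rows are simply replaced.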
GOCDB APEL uploads
The uploads to the GOC Accounting system use the direct AMQ OpenWire interface by inserting into the lcgRecords table. All data put there is taken exclusively from the NDPF accounting database, and it does not use any other source of records.
Requirements on the submission host
You must be able to talk to the database in perl, so you need perl-DBI and perl-DBD-MySQL, but you'll also need ndpf-stompfeeder (the translator) and nikhef-apel (the OpenWire babbler).
Make sure this script runs daily, but well after the records from the previous day have been inserted into the local NDPF accounting database. For example in /etc/cron.d/amqrecords:
34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword select --dbuser anon --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1
Note: this replaces the old append-records script from hooimijt.
Uploading historic data to the GOC
To do this, first make sure the data is actually there in the NDPF accounting database (see there). Then you can re-invoke the meta-utility like:
/usr/local/sbin/ndpf-stompfeeder -s '2011-03-30 00:00:00' -e '2011-04-04 12:00:00' -v -n -q --dbpassword PASSWORD --dbuser USER --pipe '/usr/local/bin/nikhef-apel'
and it will work as usual (including the writing of a log file with the current time as the time stamp). If you upload too much data, you WILL explode the apel-broker at RAL. If you do anything at all, you MAY explode the apel-broker at RAL. Be prepared to pick up broken pieces any time, 24x7x365...
Translating accounting records to APEL format
This is done by the ndpf-stompfeeder program (a specialised version of accuse, actually). The scaling to sumproduct-weighted normalised units (GHzHoursEquivalent) has already been taken care of by the pbsaccdb.pl script and should not be done again here. Basically, ndpf-stompfeeder will use the calendar walltime and GHzEquivWallTime (the normalised one) to re-obtain the performance of the worker node in SpecInt2k. In case of multi-node jobs that have used different CPU types, the 'core performance' in SI2k will be the weighted average of the nodes participating in the job. Anyway, you cannot convert hours to SI2k without re-inspecting the host list, since adding and multiplication in our algebra do not commute.
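The back-calculation of the core performance can be sketched as below. It assumes that both time values are expressed in the same unit and cover the same set of cores, and that 410 SI2k corresponds to one GHz equivalent. The function name is hypothetical and this is not the ndpf-stompfeeder source.

```python
SI2K_PER_GHZ_EQUIV = 410.0  # 1 GHzEquivHour corresponds to 410 SI2k


def average_core_si2k(walltime, ghz_equiv_walltime):
    """Recover the time-weighted average core performance in SI2k.

    walltime and ghz_equiv_walltime must use the same time unit.
    """
    if walltime <= 0:
        raise ValueError("walltime must be positive")
    return SI2K_PER_GHZ_EQUIV * ghz_equiv_walltime / walltime


if __name__ == "__main__":
    # 2 hours of wall time normalised to 4 GHzEquivHours: a core twice
    # as fast as the 410 SI2k reference.
    print(average_core_si2k(2.0, 4.0))
```

For a job that ran on mixed hardware the result is only the weighted average, so the per-node ratings cannot be recovered from these two numbers.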
The output of the ndpf-stompfeeder script is
- a list of APEL-compliant SQL "REPLACE INTO" commands
- when used with the "--pipe" option, the named program is invoked and the same SQL REPLACE INTO commands are sent to it on stdin
- when used with "-v", a summary is printed for each job as well
Upload protocol
For now this appears to be OpenWire, over an SSL-secured link with client authentication. The Java program (nikhef-apel) uses a manually-crafted JCE trust store with the host cert of the APEL broker inside it (i.e. it must be recreated every year), and the client cert of the hostname listed in the GOCDB as the APEL-client host. It can be any host, and does not even have to be the host where you run the nikhef-apel program.
Log monitoring and status pages
Have a look at the CESGA accounting GANTT chart once in a while to see if everything is still working.
Cron jobs to do accounting
There are three cron jobs:
- vlaai
- installed automatically from the managepoolmap-1.1-2 package in /etc/cron.d/manage_gridmapdir (when missing, ndpf-acctfeed will try to create a map file on the spot as well)
58 23 * * * root MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun
- stro
- installed via quattor in /etc/cron.d/accounting.ncm-cron.cron
15 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1
- bosui
- installed via quattor in /etc/cron.d/amqrecords.ncm-cron.cron
34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword PASSWORD --dbuser USERNAME --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1
- on each lcg-CE or GRAM node (with NDPF pbs.in) new
- to be installed
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl /var/log/messages /var/log/messages.1 /var/log/messages.1.gz) >> /var/log/cejoiner.ncm-cron.log 2>&1
- on each CREAM CE node new
- to be installed. The choice of the last 5 days of accounting data is arbitrary, but you need at least the maximum queue+run time. Also keep in mind that if you touch the directory, the ls -1tr command no longer gives you the latest n files!
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl `ls -1tr /opt/glite/var/log/accounting/* | tail -5`) >> /var/log/cejoiner.ncm-cron.log 2>&1
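The `ls -1tr ... | tail -5` idiom in the CREAM entry picks the five most recently modified files. The same selection, written out as a small sketch with a hypothetical function name, makes it clear why changed modification times upset the result:

```python
import glob
import os


def newest_files(pattern, count=5):
    """Return the `count` most recently modified regular files, oldest first."""
    paths = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    # Same ordering as "ls -tr": sorted by modification time, oldest first.
    paths.sort(key=os.path.getmtime)
    return paths[-count:]


if __name__ == "__main__":
    for path in newest_files("/opt/glite/var/log/accounting/*"):
        print(path)
```

The selection depends only on modification times, so an old file that gets rewritten or touched will displace a genuinely recent one.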
Using the information in other ways
to get CSV files for wLCG manglement
The PHP script at
http://www.nikhef.nl/grid/accuse/
gets the data directly from the NDPF accounting database on fulda. It is exclusively geared towards wLCG management and cannot do much more. For better control, install the ndpf-acctfeeder package and use the accuse command.
to view graphs
The graphs on http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short are generated completely independently of the accounting database, using perl scripts run from cron on ribble/naab. They are orthogonal and don't cross.
to view the GOC accounting status
The accounting Gantt chart at CESGA is a good place to see if the data is still getting across. A quick view is, for example:
http://www3.egee.cesga.es/gridsite/accounting/CESGA/gantt.php?option=ROC&optval=1.8&sYear=2008&sMonth=10&eYear=2009&eMonth=12&type=Production&tree=TIER1
with the full interface at http://goc.grid-support.ac.uk/gridsite/accounting/
to get accurate statistics
Use the accuse command on bosui, or any other system where ndpf-acctfeeder is installed. Add -h to get a lot of help.
to get the data for the VL-e accountant
Use /user/davidg/bin/vle-usage on any ikonet/hefnet host with the mysql client commands installed.
Database Schema
The schema for the accounting database (for Nikhef at bedstee.nikhef.nl) looks like the graph shown on the right.
The SQL sources are at https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/accountingdb-IDextend.sql
The new schema allows arbitrary credentials to be stored and data to be collected from multiple sources, but the new tables like pubIdent and ingressPoint need a collector/aggregator to be run on each and every CE, over all log records written at the queueTime moment, for all jobs that are already in the database based on the endTime, i.e. recorded when the job has actually completed and the record has been uploaded from the Torque server.
The scripts to do that are in the same Subversion repository as the database schema. In particular, ndpf-cejoiner.pl can run over both NDPF-style JobManager/pbs.pm logs written in /var/log/messages (daemon.info) and the CREAM accounting files in /opt/glite/var/log/accounting/. The format is inferred automatically, and it can process gzipped files as well as regular text.
Uploading and registering data
Merging CE ingress information
Additional information is available only on the CE, such as the name of the ingress point (obviously), but also public identity information such as the user DN as seen on the CE, the primary FQAN, etc. For the subjectDN, this provides an alternative to the userid.commonName which is reverse-engineered from the gridmapdir.
help on ndpf-cejoiner.pl
Usage: ./ndpf-cejoiner.pl [-s|--start starttime:Y-m-d] [-e|--end endtime:Y-m-d]
       [-h] [-v] [-c|--config configfile] [-n|--dry]
       [--dbhost hostname] [--dbport port] [--dbname name]
       [--dbuser name] [--dbpassword password]
       [--pbsserver hostname] [--facility facilityname]
       [--ingressdefault defaultingresspointname]
       [--ingressname|--ceid fixedingresspointname] logfile ...

Merge job identity data logged on the CE with existing job records in the
accounting database, using the StartTime, facility name and the jobID to
find the matching job entry and then link it up with a new or existing
pubIdentity record using alinkage table

It will process the log files given on the command line (inasfar as they
exist), but only between <starttime> and <endtime> inclusive, and update
the database with these identities where possible.

  starttime       Y-m-d [HH:MM:SS]  jobs ended from this time onwards
                                    default: 2010-11-02 00:00:00
  endtime         Y-m-d [HH:MM:SS]  jobs ended up to this time
                                    default: 2010-11-03 00:00:00
  dbhost          hostname          hostname of the database server
                                    default: bedstee.nikhef.nl
  dbport          portnumber        portnumber of the database server
                                    default: 3306
  dbname          name              name of the accouting database
                                    default: accounting
  dbuser          name              username to access the acc database
                                    default: accounter
  dbpassword      string            password for the database
                                    default: do NOT use the command line, use a secured config
  facility        string            name of the NDPF facility
                                    default: lcg2elprod
  ingressdefault  string            name of this ingress point if no explicit name is found
                                    default: bosui.nikhef.nl:2119/jobmanager-pbs (for lcg-CE)
                                    default: bosui.nikhef.nl (all other ingress points)
  ingressname     string            force ingressname to be <name>
                                    default: none

The <logfile> can be either plain-text line oriented, or a gzipped log file.
The log file is parsed for lines matching an LRMS registration and
associated user subject and group information.

The log file can be either a NDPF-enhanced syslog file containing the tokens
produced by the JobManager/pbs.pm management script, or a CREAM CE
accounting file (from /opt/glite/var/log/accounting/)

Lines in these log files pertaining to jobs not in the database, or to jobs
that have not ended within the time window specified, will be ignored. This
means that you MUST iterate over a few days(?) worth of historic CE
accounting logs, since the association information is usually written on
job queue time, not the job end time.

Also, this script must be run *after* the job accounting records have been
published ot the database from the LRMS server itself (i.e. after you have
run pbsaccdb.pl on the LRMS server).

By default, the time window is the whole of the previous calender day.
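The default time window mentioned in the help text (the whole of the previous calendar day) can be sketched as follows. This is an illustration with a hypothetical function name, using local time, and is not taken from the ndpf-cejoiner.pl source.

```python
from datetime import datetime, time, timedelta


def default_window(now):
    """Return (start, end) spanning the previous calendar day of `now`."""
    midnight_today = datetime.combine(now.date(), time.min)
    return midnight_today - timedelta(days=1), midnight_today


if __name__ == "__main__":
    start, end = default_window(datetime.now())
    print(start.strftime("%Y-%m-%d %H:%M:%S"), end.strftime("%Y-%m-%d %H:%M:%S"))
```

Run on 2010-11-03, this gives the same two values as the defaults shown in the help output.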