NDPFAccounting

From PDP/Grid Wiki
Jump to navigationJump to search

Systems involved

vlaai gridmapdir NFS create poolmap files based on the gridmapdir state on a daily basis
stro Torque server conversion from Torque accounting files and inserting these into the NDPF accounting database on fulda, using a poolmapfile (it will try to collect one automatically if it can mount the gridmapdir)
bosui any EL5 system extraction of data from database into lcgRecords format and upload through AMQ-OpenWire (but note the program is called 'stompfeeder', vaguely hoping that one time the APEL group implements a usable upload protocol ...)
any ingress system GT2/5 gatekeeper, lcg-CE or CREAM joining of the ingress records (user DN, VOMS FQAN) to jobs already in the database, using the (pbsJobID, masterFQDN) combination as a unique key.

The relevant scripts for collecting and assembling the accounting information are contained in a single RPM package "ndpf-acctfeeder" that contains both the local and the EGEE scripts (and the accuse client tool). Formally, the dependencies include only perl, perl-DBI, and perl-DBD-MySQL, but there are a few others needed on specific hosts:

pbsnodes
needed for the facility capacity option (default) in ndpf-acctfeed on the Torque server

They have not been included in the rpm dependencies, so as to be able to have a single RPM that installs everywhere. This RPM does not install any cron jobs, and you must edit these two files where relevant:

/etc/pbsaccdb.conf
on the Torque server, needed for pbsaccdb.pl and pbsstatusdb.pl
/etc/stompfeeder.conf
on the uploader box, needed for ndpf-stompfeeder to (optionally) a new default group-to-VO definition, as well as database access passwords and the like unless specified on the command line

both files, if present, must only be readable by root (uid 0).

The new AMQ uploader tool, along with some necessary AMQ libraries, is installed on bosui via the rpm 'nikhef-apel'. Looking in /usr/local/bin/nikhef-apel is reasonably helpful for figuring out how to configure the thing. The AMQ uploader basically pipes SQL statements into the OpenWire uploader CLI, which sends the records off to the UK.

Sources

All relevant sources and the database schema are in SVN at

https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/

NDPF Local Accounting

The local accounting is the most important element, and must (and is :-) fully reliable, because it is used as the basis for the cost reimbursement for projects where we contribute in-kind contributions in the form of compute cycles. These data are collected (yearly) from the NDPF accounting database on a per-VO basis.

Data is inserted into this database on a daily basis. The records are (or should) inserted just after midnight, when the pbs/torque accounting files have been closed and are complete. Since the accounting is based on the "E" records in that file, we thus get all completed jobs. Jobs that are still running will not be accounted -- they will be filed only then they are finished.

Master insertion

Insertion in the database requires the collaboration of two components:

  • the mapping from poolaccounts to grid user DNs
  • the extraction of the pbs data from the accounting file, and linking the unix users of the facility to their grid credentials

At the moment, the grid group (FQAN) mappings are not part of this scheme, and only unix groups are stored in the database. The unix group-to-gridVO mapping is only done in the EGEE upload phase. This is partly historical, but since the VO-FQAN mapping side of the grid software is in constant flux anyway it is better like it is done now. The FQAN info is added in a later stage with the CEJoiner, see below.

To ease the insertion, a meta-utility has been developed: ndpf-acctfeed. It is to be run on the PBS master (stro) every night, and by default will process yesterday's accounting file:

Usage: ndpf-acctfeed [-v] [-h] [-f] [--mapfile <poolmap>|--gridmapdir <dir>]
           [--date|-d Ymd] [--nocapacity] [-n|--dryrun]
           [--pbsaccdir dir] [--progacc command] [--progcapacity command]

and in accounting.ncm-cron.cron:

15 0 * * * root (PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1

The utility will do its very best to find a mapping from unix uids to gridDNs. By default it will look for the .poolmap.YYYYMMDD files that are created at midnight on the gridmapdir NFS server in the gridmapdir (currently on vlaai at 23:58 local time). If such a poolmap cannot be found, it will first try the file you specified on the command line without the YYYYMMDD postfix, then will read the gridmapdir (/share/gridmapdir by default) and create a temporary poolmapfile just for this run. If both the gridmapdir and the poolmapfile(s) are unreadable, the utility aborts.

Parsing PBS accounting data

the pbsaccdb.pl script parses the PBS accounting files, extract key user information and the list of participating nodes, and calculates (normalised) usage times. It then inserts it into the database.

pbsaccdb.pl /var/spool/pbs/server_priv/accountings/YYYYMMDD

Options:

-c
configuration file. Defaults to /etc/pbsaccdb.conf.
--init
try and insert a schema into the database. NOT reocmmended anymore, since there is a separate schema sql file now
-u --user
database user name. Usually read from the config file.
--password
password for the insertion user (must have INSERT and REPLACE rights). Usually read from the config file.
-f --facility
name of this facility (e.g. "lcg2elprod")
--thishost
used only for database GRANT statement creation
-h
database host
-p
database port
--map
filename of the account map file for sDN translation (NOT the grid-mapfile!)

Collecting data from PBS/Torque

The file /etc/pbsaccdb.conf must be present on the collecting system (i.e. the Torque server) and be formatted as described in NDPFAccouting_pbsaccdbconf.


Scaling and the GHzHourEquivalent

The GHzHoursEquiv normalisation (a bit like "Processor Node Hours") is an internal unit. 1 GHzEquivHour corresponds to 410 SI2k-'rate'-base. The job used GHzEquivHours value is taken as the sumproduct of the participating cores and the time on each core. Calculating it needs to actual list of nodes and the performance for each of the participating cores.

Incorporating grid mappings

The grid mappings are matches to the unix uids by the pbsaccdb.pl script, based on a mapfile. This map file is formatted with one mapping per line, with a single TAB character (\t) between the uid and the DN strings, like in:

unixuid       /DNstring

This file is not generated by pbsaccdb.pl, but needs to be prepared and passed as a command-line argument (or a sensible default is taken, based on the date specification in the pbs accounting filename given). Normally, the ndpf-acctfeed meta-utility takes care of matching the poolmapfile and the accounting file based on dates, but also this utility will look for a 'true' mapfile that reflects the actual grid-DN-mappings in use on that date. So, such a file must be generated daily. Note that you do need actual dates mapfiles, since poolaccounts will expire after some time and get re-cycled (usually after 100 days of inactivity).

The mapfiles are generated by the poolmaprun script (part of the managepoolmap package, vlaai currently has managepoolmap-1.1-2 installed) on the NFS server hosting the gridmapdir (today: vlaai). This script will both create the mapfile of today and afterwards release any poolaccount mappings that have been idle for 100 days. This script is run from cron on the gridmapdir server:

/etc/cron.d/manage_gridmapdir:
58 23 * * * root    MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun

The cron job is installed automatically by the RPM. The script must run on the server since the directory is (and should be) root-owned and the files are written (as '.poolmap.YYYYMMDD') to this directory. Really historic poolmaps are then later moved to a subdirectory gridmapdir/.history/

VO FQAN and ingress information

The base job table only knows about the refefence to the groupid and userid entries from the grid map file. Additional information on the user, more attributes and the list of FQANs used via VOMS, are only recorded at the job ingress points. As such, the initial record in the job table will not have this information. It needs to be augmented after insertion through the credential linkage tables, which are populated on each of the ingress points. When the ingress point fails to supply this info, only basic information (VO, userID, X.509 subject name) will be available.


Re-inserting historic data

First make sure you have accurate poolmap files, and then re-run the ndpf-acctfeed program with a date option:

ndpf-acctfeed --date 20081201

and do this for every missing day. It is harmless to re-insert the same day twice (the rows in the database will just be replaced, as the table is keyed on the JobID which is generated like

md5_base64($pbsinfo{qtime}.$MasterFQDN.$JobID)

where qtime is the time the job was put in the queue by the user, the MasterFQDN is the hostname of the Torque master server, and the JobID is the numerical part of the Torque job id, including the sub-job ID (e.g. "4132121-42"). These remain constant once a job is submitted by the user.

GOCDB APEL uploads

The uploads to the GOC Accounting system use the direct AMQ OpenWire interface by inserting into the lcgRecords table. All data put there is taken exclusively from the NDPF accounting database, and it does not use any other source of records.

Requirements on the submission host

You must be able to talk to the database in perl, so you need perl_DBI and perl-DBD-MySQL, but you'll also need ndpf-stompfeeder (the translator) and nikhef-apel (the OpenWire babbler).

Make sure this script runs daily, but well after the records from the previous day have been inserted into the local NDPF accounting database. For example in /etc/cron.d/amqrecords:

34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword select --dbuser anon --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1

Note: this replaces the old append-records script from hooimijt.

Uploading historic data to the GOC

To do this, first make sure the data is actually there in the NDPF accounting database (see there). Then you can re-invoke the meta-utility like:

/usr/local/sbin/ndpf-stompfeeder -s '2011-03-30 00:00:00' -e '2011-04-04 12:00:00' -v -n -q --dbpassword PASSWORD --dbuser USER --pipe '/usr/local/bin/nikhef-apel' 

and it will work as usual (including the writing of a log file with the current time as the time stamp. If you upload too much data, you WILL explode the apel-broker at RAL. If you do nothing at all, you MAY explode the apel-broker at RAL. Be prepared to pick up broken pieces any time, 24x7x365...

Translating accounting records to APEL ormat

This is done by the ndpf-stompfeeder program (a specialised version of accuse, actually). The scaling (to GHzHoursEquivalent) to sumproduct-weighted normalised units has already been taken case of by the pbsaccdb.pl script and should not be done again here. Basically, ndpf-stompfeeder will use the calender walltime and GHzEquivWallTime (the normalised one) to re-obtain the performance of the worker node in SpecInt2k. In case of multi-node jobs that have used different CPU types, the 'core performance' in SI2k will be the weighted average of the nodes participating in the job. Anyway, you cannot convert hours to SI2k with re-inspecting the host list, since adding and multiplication in our algebra do not commute.

The output of the ndpf-stompfeeder script is

  • a list of APEL-compliant SQL "REPLACE INTO" commands
  • when used with the "--pipe" option, this program is invoked and the same SQL REPLACE INTO commands sent to the named program on stdin
  • when used with "-v", a summary is printed for each job as well

Upload protocol

This for now appears to be OpenWire, over an SSL secured link with client auth. The Java program (nikhef-apel) uses a manually-crafted JCE trust store with the host cert of the APEL broker inside it (i.e. it must be recreated every year), and the client cert of the hostname listed in the GOCDB as the APEL-client host. It can be any host, and does not even have to belong ot the host where you run the nikhef-apel program.

Log monitoring and status pages

Have a look at the CESGA accounting GANTT chart once in a while to see if everything is still working.

Cron jobs to do accounting

There are three cron jobs:

vlaai
installed automatically from the managepoolmap-1.1-2 package in (when missing, the ndpf-acctfeeder will try and instantly create a map file as well)
58 23 * * * root    MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun
stro
installed via quattor in /etc/cron.d/accounting.ncm-cron.cron
15 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1
bosui
installed via quattor in /etc/cron.d/amqrecords.ncm-cron.cron
34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword PASSWORD --dbuser USERNAME --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1


on each lcg-CE or GRAM node (with NDPF pbs.in) new
to be installed
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl /var/log/messages /var/log/messages.1 /var/log/messages.1.gz) >> /var/log/cejoiner.ncm-cron.log 2>&1
on each CREAM CE node new
to be installed, the last 5 days of accounting data is arbitrary ... but yoiu need the max queue+run time at least. Also keep in mind that if you touch the directory, the ls -1tr command no longer gives you the latest n files!
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl `ls -1tr /opt/glite/var/log/accounting/* | tail -5`) >> /var/log/cejoiner.ncm-cron.log 2>&1

Using the information in other ways

to get CVS files for wLCG manglement

The PHP script at

http://www.nikhef.nl/grid/accuse/

gets the data directly from the NDPF accounting database on fulda. It is exclusively geared towards wLCG management and cannot do much more. For better control, install the ndpf-acctfeeder package and use the accuse command.

to view graphs

The graphs on http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short are generated completely independent from the account database using perl scripts from cron on ribble/naab. They are orthogonal and don't cross.

to view the GOC accounting status

The accounting grantt chart at CESGA is a good point to see if the data is still getting across. Q uick view is, for example:

http://www3.egee.cesga.es/gridsite/accounting/CESGA/gantt.php?option=ROC&optval=1.8&sYear=2008&sMonth=10&eYear=2009&eMonth=12&type=Production&tree=TIER1

with the full interface at http://goc.grid-support.ac.uk/gridsite/accounting/

to get accurate statistics

use the accuse command on bosui, or any other system where ndpf-acctfeeder is installed. Add -h to get a lot of help

to get the data for the VL-e accountant

Use /user/davidg/bin/vle-usage on any ikonet/hefnet host with the mysqql client commands installed.

Database Schema

The schema for the accounting database (for Nikhef at bedstee.nikhef.nl) looks like the graph shown on the right.

Description

The SQL sources are at https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/accountingdb-IDextend.sql

The new schema allows for arbitrary credentials to be stored, and data collected from multiple sources, but the new tables like pubIdent and ingressPoint need a collector/aggregator be be run on each and every CE, over all log records written at the queueTime moment, for all jobs that are already in the database based on the endTime, i.e. recorded when the job has actually completed and the record uploaded from the Torque server.

The scripts to do that are in the save SubVersion repository as the database scema. In particular, ndpf-cejoiner.pl can run over both NDPF-style JobManager/pbs.om logs written in /var/log/messages (daemon.info), as well as the CREAM accounting files in /opt/glite/var/log/accounting/. The format is inferred automatically, and it can process gzipped files as well as regular text.

Uploading and registering data

Merging CE ingress information

Additional information is available only on the CE, such as the name of the ingress point (obviously), but also public identity information such as the user DN as seen on the CE, the primary FQAN, etc. For the subjectDN, this provides an alternative to the userid.commonName which is reverse-engineered from the gridmapdir. The joiner can eat our own Gatekeeper log entries from syslog (/var/log/messages) as well as /opt/glite/var/log/accounting/blahp.log files. It will figure out itself what the format is.

Help on ndpf-cejoiner.pl

Usage: ./ndpf-cejoiner.pl [-s|--start starttime:Y-m-d] [-e|--end endtime:Y-m-d]
  [-h] [-v] [-c|--config configfile] [-n|--dry]
  [--dbhost hostname] [--dbport port] [--dbname name] [--dbuser name]
  [--dbpassword password] [--pbsserver hostname] [--facility facilityname]
  [--ingressdefault defaultingresspointname]
  [--ingressname|--ceid fixedingresspointname]
  logfile ...

Merge job identity data logged on the CE with existing job records
in the accounting database, using the StartTime, facility name and the
jobID to find the matching job entry and then link it up with a new
or existing pubIdentity record using alinkage table
It will process the log files given on the command line (inasfar as they
exist), but only between <starttime> and <endtime> inclusive, and update
the database with these identities where possible.

  starttime    Y-m-d [HH:MM:SS]       jobs ended from this time onwards
               default: 2010-11-02 00:00:00
  endtime      Y-m-d [HH:MM:SS]       jobs ended up to this time
               default: 2010-11-03 00:00:00
  dbhost       hostname               hostname of the database server
               default: bedstee.nikhef.nl
  dbport       portnumber             portnumber of the database server
               default: 3306
  dbname       name                   name of the accouting database
               default: accounting
  dbuser       name                   username to access the acc database
               default: accounter
  dbpassword   string                 password for the database
               default: do NOT use the command line, use a secured config

  facility     string                 name of the NDPF facility
               default: lcg2elprod

  ingressdefault  string              name of this ingress point if no
                                      explicit name is found
                  default: bosui.nikhef.nl:2119/jobmanager-pbs (for lcg-CE)
                  default: bosui.nikhef.nl  (all other ingress points)
  ingressname  string                 force ingressname to be <name>
               default: none

The <logfile> can be either plain-text line oriented, or a gzipped log
file. The log file is parsed for lines matching an LRMS registration and
associated user subject and group information.
The log file can be either a NDPF-enhanced syslog file containing the
tokens produced by the JobManager/pbs.pm management script, or a CREAM CE
accounting file (from /opt/glite/var/log/accounting/)

Lines in these log files pertaining to jobs not in the database, or
to jobs that have not ended within the time window specified, will
be ignored. This means that you MUST iterate over a few days(?) worth
of historic CE accounting logs, since the association information is
usually written on job queue time, not the job end time.
Also, this script must be run *after* the job accounting records
have been published ot the database from the LRMS server itself (i.e.
after you have run pbsaccdb.pl on the LRMS server).

By default, the time window is the whole of the previous calender day.