NDPFAccounting

Systems involved

vlaai (gridmapdir NFS server): creates the daily poolmap files based on the gridmapdir state
stro (Torque server): converts the Torque accounting files and inserts them into the NDPF accounting database on fulda, using a poolmap file (it will try to collect one automatically if it can mount the gridmapdir)
bosui (any EL5 system): extracts data from the database into lcgRecords format and uploads it through AMQ-OpenWire (note that the program is called 'stompfeeder', in the vague hope that the APEL group will one day implement a usable upload protocol ...)
any ingress system (GT2/5 gatekeeper, lcg-CE or CREAM): joins the ingress records (user DN, VOMS FQAN) to jobs already in the database, using the (pbsJobID, masterFQDN) combination as a unique key.

The relevant scripts for collecting and assembling the accounting information are shipped in a single RPM package, "ndpf-acctfeeder", which contains both the local and the EGEE scripts (and the accuse client tool). Formally, the dependencies include only perl, perl-DBI, and perl-DBD-MySQL, but a few others are needed on specific hosts:

pbsnodes
needed for the facility capacity option (default) in ndpf-acctfeed on the Torque server

They have not been included in the rpm dependencies, so that a single RPM can be installed everywhere. This RPM does not install any cron jobs, and you must edit these two files where relevant:

/etc/pbsaccdb.conf
on the Torque server, needed for pbsaccdb.pl and pbsstatusdb.pl
/etc/stompfeeder.conf
on the uploader box, needed for ndpf-stompfeeder to (optionally) set a new default group-to-VO definition, as well as the database access passwords and the like, unless these are given on the command line

Both files, if present, must be readable only by root (uid 0).
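
On each host, for whichever of the two files applies there, something like this will do (a minimal sketch using the paths above):

# restrict the accounting configuration file to root (uid 0) only
chown root:root /etc/pbsaccdb.conf     # or /etc/stompfeeder.conf on the uploader box
chmod 600 /etc/pbsaccdb.conf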

The new AMQ uploader tool, along with some necessary AMQ libraries, is installed on bosui via the rpm 'nikhef-apel'. Looking in /usr/local/bin/nikhef-apel is reasonably helpful for figuring out how to configure the thing. The AMQ uploader basically pipes SQL statements into the OpenWire uploader CLI, which sends the records off to the UK.
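
To see what that rpm actually provides, the standard queries suffice; the wrapper script itself is the best documentation:

# list the files installed by the uploader package and read the wrapper
rpm -ql nikhef-apel
less /usr/local/bin/nikhef-apel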

Sources

All relevant sources and the database schema are in SVN at

https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/
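
A working copy can be obtained with a plain Subversion checkout:

svn checkout https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/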

NDPF Local Accounting

The local accounting is the most important element, and must be (and is :-) fully reliable, because it is used as the basis for the cost reimbursement for projects to which we contribute in-kind contributions in the form of compute cycles. These data are collected (yearly) from the NDPF accounting database on a per-VO basis.

Data is inserted into this database on a daily basis. The records are (or should be) inserted just after midnight, when the pbs/torque accounting files have been closed and are complete. Since the accounting is based on the "E" records in that file, we thus get all completed jobs. Jobs that are still running will not be accounted -- they will be filed only when they have finished.
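
A quick way to verify that yesterday's file really holds completed jobs (a sketch; the accounting path is the one used in the pbsaccdb.pl example below, and Torque writes job-end records with the ";E;" tag):

# count the "E" (job end) records in yesterday's Torque accounting file
grep -c ';E;' /var/spool/pbs/server_priv/accountings/$(date -d yesterday +%Y%m%d)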

Master insertion

Insertion in the database requires the collaboration of two components:

  • the mapping from poolaccounts to grid user DNs
  • the extraction of the pbs data from the accounting file, and linking the unix users of the facility to their grid credentials

At the moment, the grid group (FQAN) mappings are not part of this scheme, and only unix groups are stored in the database. The unix group-to-gridVO mapping is only done in the EGEE upload phase. This is partly historical, but since the VO-FQAN mapping side of the grid software is in constant flux anyway, it is better done the way it is now. The FQAN info is added at a later stage by the CEJoiner; see below.

To ease the insertion, a meta-utility has been developed: ndpf-acctfeed. It is to be run on the PBS master (stro) every night, and by default will process yesterday's accounting file:

Usage: ndpf-acctfeed [-v] [-h] [-f] [--mapfile <poolmap>|--gridmapdir <dir>]
           [--date|-d Ymd] [--nocapacity] [-n|--dryrun]
           [--pbsaccdir dir] [--progacc command] [--progcapacity command]

and in accounting.ncm-cron.cron:

15 0 * * * root (PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1

The utility will do its very best to find a mapping from unix uids to gridDNs. By default it will look for the .poolmap.YYYYMMDD files that are created at midnight on the gridmapdir NFS server in the gridmapdir (currently on vlaai at 23:58 local time). If such a poolmap cannot be found, it will first try the file you specified on the command line without the YYYYMMDD postfix, then will read the gridmapdir (/share/gridmapdir by default) and create a temporary poolmapfile just for this run. If both the gridmapdir and the poolmapfile(s) are unreadable, the utility aborts.
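
To check by hand whether the expected poolmap exists for a given day (a sketch, using the default locations mentioned above):

# look for yesterday's poolmap on the gridmapdir NFS export
ls -l /share/gridmapdir/.poolmap.$(date -d yesterday +%Y%m%d)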

Parsing PBS accounting data

The pbsaccdb.pl script parses the PBS accounting files, extracts key user information and the list of participating nodes, and calculates (normalised) usage times. It then inserts the result into the database.

pbsaccdb.pl /var/spool/pbs/server_priv/accountings/YYYYMMDD

Options:

-c
configuration file. Defaults to /etc/pbsaccdb.conf.
--init
try and insert a schema into the database. NOT recommended anymore, since there is a separate schema SQL file now
-u --user
database user name. Usually read from the config file.
--password
password for the insertion user (must have INSERT and REPLACE rights). Usually read from the config file.
-f --facility
name of this facility (e.g. "lcg2elprod")
--thishost
used only for database GRANT statement creation
-h
database host
-p
database port
--map
filename of the account map file for sDN translation (NOT the grid-mapfile!)
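
A typical explicit invocation, with the facility name and map file spelled out, would look like this (a sketch; normally the ndpf-acctfeed meta-utility assembles this command for you):

# insert yesterday's Torque records, using yesterday's poolmap
pbsaccdb.pl -f lcg2elprod --map /share/gridmapdir/.poolmap.$(date -d yesterday +%Y%m%d) \
    /var/spool/pbs/server_priv/accountings/$(date -d yesterday +%Y%m%d)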

Collecting data from PBS/Torque

The file /etc/pbsaccdb.conf must be present on the collecting system (i.e. the Torque server) and be formatted as described in NDPFAccouting_pbsaccdbconf.


Scaling and the GHzHourEquivalent

The GHzHoursEquiv normalisation (a bit like "Processor Node Hours") is an internal unit. 1 GHzEquivHour corresponds to 410 SI2k-'rate'-base. The GHzEquivHours value used by a job is taken as the sum-product of the participating cores and the time spent on each core. Calculating it needs the actual list of nodes and the performance of each of the participating cores.
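
A rough worked example, with illustrative numbers only and the 410 SI2k base from above:

# a job on 4 cores for 2 hours, on nodes rated at 2.2 GHzEquiv per core:
#   WallGHzEquivalent = 4 cores * 2 h * 2.2 GHz = 17.6 GHzEquivHours
#   which corresponds to 17.6 * 410 = 7216 SI2k-hours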

Incorporating grid mappings

The grid mappings are matched to the unix uids by the pbsaccdb.pl script, based on a mapfile. This map file is formatted with one mapping per line, with a single TAB character (\t) between the uid and the DN strings, like in:

unixuid       /DNstring

This file is not generated by pbsaccdb.pl, but needs to be prepared and passed as a command-line argument (or a sensible default is taken, based on the date specification in the pbs accounting filename given). Normally the ndpf-acctfeed meta-utility takes care of matching the poolmapfile and the accounting file based on dates, but it will also look for a 'true' mapfile that reflects the actual grid-DN mappings in use on that date. So such a file must be generated daily. Note that you do need properly dated mapfiles, since poolaccounts expire after some time and get re-cycled (usually after 100 days of inactivity).
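
A quick sanity check on a poolmap file, relying only on the one-TAB-per-line format described above (a sketch):

# report lines that do not consist of exactly 'uid<TAB>DN'
awk -F'\t' 'NF != 2 { print "bad line " NR ": " $0 }' /share/gridmapdir/.poolmap.$(date +%Y%m%d)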

The mapfiles are generated by the poolmaprun script (part of the managepoolmap package, vlaai currently has managepoolmap-1.1-2 installed) on the NFS server hosting the gridmapdir (today: vlaai). This script will both create the mapfile of today and afterwards release any poolaccount mappings that have been idle for 100 days. This script is run from cron on the gridmapdir server:

/etc/cron.d/manage_gridmapdir:
58 23 * * * root    MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun

The cron job is installed automatically by the RPM. The script must run on the server since the directory is (and should be) root-owned and the files are written (as '.poolmap.YYYYMMDD') to this directory. Really historic poolmaps are then later moved to a subdirectory gridmapdir/.history/

VO FQAN and ingress information

The base job table only knows about the reference to the groupid and userid entries from the grid map file. Additional information on the user, further attributes and the list of FQANs used via VOMS are only recorded at the job ingress points. As such, the initial record in the job table will not have this information. It needs to be augmented after insertion through the credential linkage tables, which are populated on each of the ingress points. When the ingress point fails to supply this info, only basic information (VO, userID, X.509 subject name) will be available.


Re-inserting historic data

First make sure you have accurate poolmap files, and then re-run the ndpf-acctfeed program with a date option:

ndpf-acctfeed --date 20081201

and do this for every missing day. It is harmless to re-insert the same day twice (the rows in the database will just be replaced), as the table is keyed on the JobID, which is generated like:

md5_base64($pbsinfo{qtime}.$MasterFQDN.$JobID)

where qtime is the time the job was put in the queue by the user, the MasterFQDN is the hostname of the Torque master server, and the JobID is the numerical part of the Torque job id, including the sub-job ID (e.g. "4132121-42"). These remain constant once a job is submitted by the user.
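
For debugging, the same key can be reproduced from the shell (a sketch; Digest::MD5 ships with perl, and the three arguments are example placeholders for the qtime epoch, the master FQDN and the numerical job id):

perl -MDigest::MD5=md5_base64 -e 'print md5_base64(join("", @ARGV)), "\n"' \
    1322697600 stro.nikhef.nl 4132121-42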

GOCDB APEL uploads

The uploads to the GOC Accounting system use the direct AMQ OpenWire interface by inserting into the lcgRecords table. All data put there is taken exclusively from the NDPF accounting database, and it does not use any other source of records.

Requirements on the submission host

You must be able to talk to the database in perl, so you need perl-DBI and perl-DBD-MySQL, but you'll also need ndpf-stompfeeder (the translator) and nikhef-apel (the OpenWire babbler).
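
A quick check that the pieces are in place on the uploader host:

# verify the perl database modules and the two accounting packages
perl -MDBI -MDBD::mysql -e 'print "DBI and DBD::mysql OK\n"'
rpm -q ndpf-acctfeeder nikhef-apel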

Make sure this script runs daily, but well after the records from the previous day have been inserted into the local NDPF accounting database. For example in /etc/cron.d/amqrecords:

34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword select --dbuser anon --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1

Note: this replaces the old append-records script from hooimijt.

Uploading historic data to the GOC

To do this, first make sure the data is actually there in the NDPF accounting database (see there). Then you can re-invoke the meta-utility like:

/usr/local/sbin/ndpf-stompfeeder -s '2011-03-30 00:00:00' -e '2011-04-04 12:00:00' -v -n -q --dbpassword PASSWORD --dbuser USER --pipe '/usr/local/bin/nikhef-apel' 

and it will work as usual (including the writing of a log file with the current time as the time stamp). If you upload too much data, you WILL explode the apel-broker at RAL. If you do nothing at all, you MAY explode the apel-broker at RAL. Be prepared to pick up broken pieces at any time, 24x7x365...

Translating accounting records to APEL format

This is done by the ndpf-stompfeeder program (a specialised version of accuse, actually). The scaling (to GHzHoursEquivalent), i.e. to sumproduct-weighted normalised units, has already been taken care of by the pbsaccdb.pl script and should not be done again here. Basically, ndpf-stompfeeder will use the calendar walltime and GHzEquivWallTime (the normalised one) to re-obtain the performance of the worker node in SpecInt2k. In the case of multi-node jobs that have used different CPU types, the 'core performance' in SI2k will be the weighted average of the nodes participating in the job. Anyway, you cannot convert hours to SI2k without re-inspecting the host list, since addition and multiplication in our algebra do not commute.
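
Continuing the worked example from the scaling section (illustrative numbers only):

# reverse calculation as sketched above:
#   WallTime = 2 h (calendar), WallGHzEquivalent = 17.6 GHzEquivHours
#   => 17.6 / 2 = 8.8 GHzEquiv over the job's 4 cores, i.e. 2.2 GHzEquiv per core
#   => core performance = 2.2 * 410 = 902 SI2k (weighted over the participating nodes)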

The output of the ndpf-stompfeeder script is

  • a list of APEL-compliant SQL "REPLACE INTO" commands
  • when used with the "--pipe" option, this program is invoked and the same SQL REPLACE INTO commands sent to the named program on stdin
  • when used with "-v", a summary is printed for each job as well
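
To inspect the generated records without sending anything to the broker, the --pipe hook can be pointed at a harmless command instead of nikhef-apel (a sketch; dates and credentials are placeholders, and the flags are those used in the cron entry above):

/usr/local/sbin/ndpf-stompfeeder -v -n -q --dbuser USER --dbpassword PASSWORD \
    -s '2011-03-30 00:00:00' -e '2011-03-31 00:00:00' \
    --pipe '/bin/cat' > /var/tmp/apel-records.sql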

Upload protocol

This for now appears to be OpenWire, over an SSL-secured link with client auth. The Java program (nikhef-apel) uses a manually-crafted JCE trust store with the host cert of the APEL broker inside it (i.e. it must be recreated every year), and the client cert of the hostname listed in the GOCDB as the APEL-client host. It can be any host, and does not even have to belong to the host where you run the nikhef-apel program.
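
Recreating the trust store after the broker certificate rolls over is, in essence, a keytool import (a sketch only; the keystore path, alias and certificate file name are assumptions here, check /usr/local/bin/nikhef-apel for the locations it actually uses):

# import the renewed APEL broker host certificate into the JCE trust store
keytool -importcert -alias apel-broker -file apel-broker-hostcert.pem \
    -keystore nikhef-apel-truststore.jks -storepass changeit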

Log monitoring and status pages

Have a look at the CESGA accounting GANTT chart once in a while to see if everything is still working.

Cron jobs to do accounting

There are three cron jobs in production, plus two new ones still to be installed on the CE nodes:

vlaai
installed automatically from the managepoolmap-1.1-2 package in /etc/cron.d/manage_gridmapdir (when the poolmap is missing, ndpf-acctfeeder will try to create a map file on the fly as well)
58 23 * * * root    MAX_AGE=100 CLEANING=1 /usr/local/bin/poolmaprun
stro
installed via quattor in /etc/cron.d/accounting.ncm-cron.cron
15 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-acctfeed) >> /var/log/accounting.ncm-cron.log 2>&1
bosui
installed via quattor in /etc/cron.d/amqrecords.ncm-cron.cron
34 9 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin; LANG=C; unset TZ; date; echo Uploading new accounting data; /usr/local/sbin/ndpf-stompfeeder -v -n -q --dbpassword PASSWORD --dbuser USERNAME --pipe '/usr/local/bin/nikhef-apel' 2>&1 ; date ; echo Accounting upload completed.) >> /var/log/amqrecords.ncm-cron.log 2>&1


on each lcg-CE or GRAM node (with NDPF pbs.in) new
to be installed
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl /var/log/messages /var/log/messages.1 /var/log/messages.1.gz) >> /var/log/cejoiner.ncm-cron.log 2>&1
on each CREAM CE node new
to be installed; the choice of the last 5 days of accounting data is arbitrary ... but you need at least the maximum queue+run time. Also keep in mind that if you touch the directory, the ls -1tr command no longer gives you the latest n files!
45 0 * * * root (date --iso-8601=seconds --utc; PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin; /usr/local/sbin/ndpf-cejoiner.pl `ls -1tr /opt/glite/var/log/accounting/* | tail -5`) >> /var/log/cejoiner.ncm-cron.log 2>&1

Using the information in other ways

to get CSV files for wLCG manglement

The PHP script at

http://www.nikhef.nl/grid/accuse/

gets the data directly from the NDPF accounting database on fulda. It is exclusively geared towards wLCG management and cannot do much more. For better control, install the ndpf-acctfeeder package and use the accuse command.

to view graphs

The graphs on http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short are generated completely independently from the accounting database, using perl scripts run from cron on ribble/naab. They are orthogonal and don't cross.

to view the GOC accounting status

The accounting Gantt chart at CESGA is a good place to see if the data is still getting across. A quick view is, for example:

http://www3.egee.cesga.es/gridsite/accounting/CESGA/gantt.php?option=ROC&optval=1.8&sYear=2008&sMonth=10&eYear=2009&eMonth=12&type=Production&tree=TIER1

with the full interface at http://goc.grid-support.ac.uk/gridsite/accounting/

to get accurate statistics

Use the accuse command on bosui, or on any other system where ndpf-acctfeeder is installed. Add -h to get a lot of help.

to get the data for the VL-e accountant

Use /user/davidg/bin/vle-usage on any ikonet/hefnet host with the mysql client commands installed.

Database Schema

The schema for the accounting database (for Nikhef at bedstee.nikhef.nl) is shown in the AccountingDB.gif schema diagram on the wiki page. The SQL sources are at https://ndpfsvn.nikhef.nl/repos/ndpf/nl.nikhef.ndpf.tools/ndpf-acctfeeder/accountingdb-IDextend.sql

The new schema allows for arbitrary credentials to be stored, and for data to be collected from multiple sources, but the new tables such as pubIdent and ingressPoint need a collector/aggregator to be run on each and every CE, over all log records written at the queueTime moment, for all jobs that are already in the database based on the endTime, i.e. recorded when the job has actually completed and the record has been uploaded from the Torque server.

The scripts to do that are in the same Subversion repository as the database schema. In particular, ndpf-cejoiner.pl can run over both NDPF-style JobManager/pbs.pm logs written to /var/log/messages (daemon.info) and the CREAM accounting files in /opt/glite/var/log/accounting/. The format is inferred automatically, and it can process gzipped files as well as regular text.

Uploading and registering data

Merging CE ingress information

Additional information is available only on the CE, such as the name of the ingress point (obviously), but also public identity information such as the user DN as seen on the CE, the primary FQAN, etc. For the subjectDN, this provides an alternative to the userid.commonName which is reverse-engineered from the gridmapdir. The joiner can eat our own Gatekeeper log entries from syslog (/var/log/messages) as well as /opt/glite/var/log/accounting/blahp.log files. It will figure out itself what the format is.

Help on ndpf-cejoiner.pl

Usage: ./ndpf-cejoiner.pl [-s|--start starttime:Y-m-d] [-e|--end endtime:Y-m-d]
  [-h] [-v] [-c|--config configfile] [-n|--dry]
  [--dbhost hostname] [--dbport port] [--dbname name] [--dbuser name]
  [--dbpassword password] [--pbsserver hostname] [--facility facilityname]
  [--ingressdefault defaultingresspointname]
  [--ingressname|--ceid fixedingresspointname]
  logfile ...

Merge job identity data logged on the CE with existing job records
in the accounting database, using the StartTime, facility name and the
jobID to find the matching job entry and then link it up with a new
or existing pubIdentity record using a linkage table.
It will process the log files given on the command line (insofar as they
exist), but only between <starttime> and <endtime> inclusive, and update
the database with these identities where possible.

  starttime    Y-m-d [HH:MM:SS]       jobs ended from this time onwards
               default: 2010-11-02 00:00:00
  endtime      Y-m-d [HH:MM:SS]       jobs ended up to this time
               default: 2010-11-03 00:00:00
  dbhost       hostname               hostname of the database server
               default: bedstee.nikhef.nl
  dbport       portnumber             portnumber of the database server
               default: 3306
  dbname       name                   name of the accounting database
               default: accounting
  dbuser       name                   username to access the acc database
               default: accounter
  dbpassword   string                 password for the database
               default: do NOT use the command line, use a secured config

  facility     string                 name of the NDPF facility
               default: lcg2elprod

  ingressdefault  string              name of this ingress point if no
                                      explicit name is found
                  default: bosui.nikhef.nl:2119/jobmanager-pbs (for lcg-CE)
                  default: bosui.nikhef.nl  (all other ingress points)
  ingressname  string                 force ingressname to be <name>
               default: none

The <logfile> can be either plain-text line oriented, or a gzipped log
file. The log file is parsed for lines matching an LRMS registration and
associated user subject and group information.
The log file can be either a NDPF-enhanced syslog file containing the
tokens produced by the JobManager/pbs.pm management script, or a CREAM CE
accounting file (from /opt/glite/var/log/accounting/)

Lines in these log files pertaining to jobs not in the database, or
to jobs that have not ended within the time window specified, will
be ignored. This means that you MUST iterate over a few days(?) worth
of historic CE accounting logs, since the association information is
usually written on job queue time, not the job end time.
Also, this script must be run *after* the job accounting records
have been published to the database from the LRMS server itself (i.e.
after you have run pbsaccdb.pl on the LRMS server).

By default, the time window is the whole of the previous calendar day.