SoftDrive backend

From PDP/Grid Wiki
Revision as of 15:01, 14 July 2016 by Dennisvd@nikhef.nl (talk | contribs) (Some softdrive CVMFS documentation)

The Softdrive system is implemented with CVMFS on the Nikhef CVMFS Stratum-0 (mesthoop.nikhef.nl).

Mesthoop is managed by Quattor. It manages several CVMFS repositories, one of which is softdrive.nl.

The only thing that sets this repository apart from 'normal' repositories is the way it is populated. A cron job

/etc/cron.d/cvmfs-softdrive-rsync.ncm-cron.cron

runs an rsync with the source system softdrive.grid.sara.nl.

It's useful for administrators to have an account on that machine for testing purposes; this must be requested from SARA grid support.

The cron script runs the cvmfs-rsync-multi script that tries to be clever about which portions of the source tree it syncs.

The sources of the script are found at

svn+ssh://svn@ndpfsvn.nikhef.nl/repos/pdpsoft/trunk/nl.nikhef.ndpf.tools/cvmfs

To build it, run bootstrap.sh and make dist, copy the resulting source tarball to

${HOME}/rpmbuild/SOURCES

and then run rpmbuild on the spec file on a CentOS 6 machine (e.g. stal).


Each rsync cycle starts a CVMFS transaction, which is committed once the sync has completed successfully.
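The cycle described above can be sketched roughly as follows. This is not the code of cvmfs-rsync-multi, only a minimal illustration. The function name sync_cycle and the injectable run parameter are made up for this example, and aborting the transaction when rsync fails is an assumption about sensible behaviour, not something confirmed for the real script. In CVMFS the commit step is cvmfs_server publish.

```python
import subprocess


def sync_cycle(repo, rsync_cmd, run=subprocess.call):
    """Run one sync cycle; return True if the changes were published.

    `run` takes an argument list and returns the exit status. It is a
    parameter only so the flow can be exercised without a real repository.
    """
    # Open the repository for writing.
    if run(["cvmfs_server", "transaction", repo]) != 0:
        return False
    # Copy the changes into the writable tree.
    if run(rsync_cmd) != 0:
        # Assumption: a failed sync discards the open transaction.
        run(["cvmfs_server", "abort", "-f", repo])
        return False
    # The sync succeeded, so commit (publish) the transaction.
    return run(["cvmfs_server", "publish", repo]) == 0
```

A call would look like sync_cycle("softdrive.nl", ["rsync", "-a", SOURCE, TARGET]), where SOURCE and TARGET are placeholders for the real rsync arguments, which are not given on this page.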

Because of the high volume of transactions, garbage collection must be turned on. Garbage collection takes a long time, so it is not run automatically as part of each cycle but only once per day.

Troubleshooting

There are a number of ways in which all of this can fail.

  • The rsync may be stuck (TODO: implement --timeout)
  • A transaction may hang (and prevent further transactions)
  • The disk may be full
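Two of the failure modes listed above lend themselves to simple guards. The following is a minimal sketch, not part of the existing script: the helper names has_free_space and run_with_deadline and the 10% threshold are invented for illustration. The first checks free disk space before starting a cycle. The second puts a hard wall-clock limit on a command, which is a cruder alternative to the I/O timeout of rsync's own --timeout option.

```python
import shutil
import subprocess


def has_free_space(path, min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


def run_with_deadline(cmd, deadline_seconds):
    """Run `cmd` and return its exit status.

    If the command runs longer than `deadline_seconds`, it is killed
    and None is returned instead.
    """
    try:
        return subprocess.run(cmd, timeout=deadline_seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

A wrapper could skip the cycle when has_free_space() returns False, and treat a None result from run_with_deadline() like any other failed rsync.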

The monitoring on the SARA end of things will notice if changes no longer come through (and they will tell us).

The official documentation of CVMFS is indispensable to understand the details of the system, and to help get the system unwedged.

On rare occasions I've seen an rsync hang; a simple kill usually gets rid of it.

The script has some built-in logic to prevent multiple copies from running at the same time, so a stale lock file (which should not happen either) may prevent it from running.
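Whether a lock file is stale can be checked along the following lines. This is only a sketch under a strong assumption: it presumes the lock file contains the PID of the process that created it, which may not match what cvmfs-rsync-multi actually writes. The function name lock_is_stale is made up for this example.

```python
import os


def lock_is_stale(lock_path):
    """Return True if `lock_path` exists but its owner is no longer running.

    Assumption: the lock file holds the owner's PID as plain text.
    """
    try:
        with open(lock_path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return False  # no lock file at all, so nothing is stale
    except ValueError:
        return True  # content is not a PID: treat the lock as left over
    if pid <= 0:
        return True  # not a valid PID of a single process
    try:
        os.kill(pid, 0)  # signal 0 only tests whether the process exists
    except ProcessLookupError:
        return True  # no such process: the lock was left behind
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False  # the owner is still running
```

If the function returns True, removing the lock file by hand should let the next cron run proceed. Check first that no sync is really still running.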