Cvmfs errors/warnings

General issues and their remedies

An overfull disk due to cache overflow or possibly cache corruption may be remedied by running

cvmfs_config wipecache

In despair, one may wipe the cache manually and hope for the best:

rm -rf /var/lib/cvmfs/atlas.cern.ch/*
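
If the node can first be drained of jobs, a slightly safer variant is to stop the cvmfs service before wiping, so that no mounted repository is pulled away from under a running process. A sketch, using the init script that also appears further down this page and the default cache location:

service cvmfs stop                      # unmount all repositories
rm -rf /var/lib/cvmfs/atlas.cern.ch/*   # wipe the cache of the affected repository
service cvmfs start                     # remount with an empty cache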

Errors and Warnings from the Nagios check

Below is an overview of cvmfs errors and warnings as reported by the Nagios check.

Note: this list is not yet complete!


SERVICE STATUS: N I/O errors detected: repository revision rev

This warning means that I/O errors have occurred. Nothing needs to be fixed. The warning can be cleared by logging in on the node and executing the following command:

cvmfs_talk -i <repository> reset error counters

The event handler for Nagios will try to clear the error counters a few times using this command. Therefore, this warning will usually disappear after some time. If the warning persists for more than 2 hours, a manual reset of the counters may be needed.
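
If more than one repository on the node reports I/O errors, the counters can be reset for each of them in turn; a sketch, where the repository names are only examples and should be replaced by the ones actually mounted locally:

for repo in atlas.cern.ch lhcb.cern.ch; do
    cvmfs_talk -i $repo reset error counters
done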

SERVICE STATUS: offline (repository via squid server): repository revision rev

The full warning may comprise multiple repositories and/or Squid servers:

SERVICE STATUS: offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via
http://pachter.nikhef.nl:3128): offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via
http://zonnewijzer.nikhef.nl:3128): offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via
http://karnton.nikhef.nl:3128): repository revision 1197

This warning means that the client could not reach the repository via the listed Squid servers. If the remote repository is the same for all local Squid servers, the problem most likely lies in the connection to the remote site; this is particularly true if the warning occurs for many clients. The example above refers to the remote repository at RAL via three local Squid servers, implying a problem at or on the way to the remote end.

However, if connections to more than one remote repository are failing (RAL, CERN and/or BNL), the problem is likely to be in the local Squid servers. In this situation, a restart of the Squid servers may be required.
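
To tell the two cases apart, one can try to fetch the repository manifest through each local Squid directly; a sketch using the hosts from the example warning above (.cvmfspublished is the usual manifest name, but the exact path may differ for this repository layout):

for squid in pachter zonnewijzer karnton; do
    curl -sI -x http://$squid.nikhef.nl:3128 \
        http://cernvmfs.gridpp.rl.ac.uk/opt/atlas/.cvmfspublished | head -1
done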


SERVICE STATUS: N I/O errors detected: space on cache partition low: repository revision rev

This warning indicates that the free space on the cache partition is running low. The actual disk space used may be considerably higher than the cache usage reported by cvmfs-talk:

[root@wn-car-040 cvmfs2]# cvmfs-talk -i lhcb.cern.ch cache size
Current cache size is 4185MB (4389334621 Bytes), pinned: 1821MB (1909501952 Bytes)
[root@wn-car-040 cvmfs2]# du -sk lhcb.cern.ch
10294824        lhcb.cern.ch
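
To compare the figures reported by cvmfs with the actual state of the cache partition, the filesystem can be inspected directly; a sketch, assuming the cache lives under the default /var/lib/cvmfs (older installations may use /var/cache/cvmfs2 instead):

df -h /var/lib/cvmfs       # free space left on the cache partition
du -sh /var/lib/cvmfs/*    # on-disk usage per repository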

The cache usage may be reduced as follows:

[root@wn-car-040 lhcb.cern.ch]# cvmfs-talk -i lhcb.cern.ch cleanup 1000
Not fully cleaned (there might be pinned chunks)
[root@wn-car-040 lhcb.cern.ch]# cvmfs-talk -i lhcb.cern.ch cache size
Current cache size is 3599MB (3774548992 Bytes), pinned: 1821MB (1909501952 Bytes)

But there may still be an inconsistency:

[root@wn-car-040 cvmfs2]# du -sk lhcb.cern.ch
5878980         lhcb.cern.ch

(cvmfs reports roughly 3.6 GB of cache usage, while du reports roughly 5.6 GB on disk).

The safe workaround is to put the node offline. Once the last cvmfs client process has disappeared, the service can be restarted with an empty cache:

service cvmfs restartclean
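
Whether any cvmfs client processes are still around can be checked before the restart; a sketch, assuming the default mount point under /cvmfs:

pgrep -l cvmfs2                  # remaining cvmfs client processes
fuser -m /cvmfs/lhcb.cern.ch     # processes with files open under the mount, if it is still mounted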