Cvmfs errors/warnings
General issues and their remedies
An overfull disk due to cache overflow or possibly cache corruption may be remedied by running
cvmfs_config wipecache
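Whether the cache partition really is (nearly) full can be checked before and after wiping; a minimal sketch, assuming the default cache location /var/lib/cvmfs used elsewhere on this page:
df -h /var/lib/cvmfs       # free space on the cache partition (default location assumed)
du -sh /var/lib/cvmfs/*    # per-repository cache usage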
In despair, one may wipe the cache manually and hope for the best:
rm -rf /var/lib/cvmfs/atlas.cern.ch/*
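If the cache is wiped by hand, it is safer to stop the client first so that no running cvmfs process still has files in the cache open. A minimal sketch, assuming the init script used further down this page also provides stop and start targets:
service cvmfs stop                       # assumed init-script target: unmount the repositories
rm -rf /var/lib/cvmfs/atlas.cern.ch/*    # wipe the atlas.cern.ch cache by hand
service cvmfs start                      # assumed init-script target: remount the repositories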
Errors and Warnings from the Nagios check
Below is an overview of cvmfs errors and warnings as reported by the Nagios check.
Note: this list is not yet complete!
SERVICE STATUS: N I/O errors detected: repository revision rev
This warning means that I/O errors have been recorded for the repository. Nothing needs to be fixed; the warning can be cleared by logging in on the node and executing the following command:
cvmfs_talk -i <repository> reset error counters
The event handler for Nagios will try to clear the error counters a few times using this command. Therefore, this warning will usually disappear after some time. If the warning persists for more than 2 hours, a manual reset of the counters may be needed.
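To see whether the reset had any effect, the error counter can be read back from the mount point; a sketch assuming the atlas.cern.ch repository and the nioerr extended attribute exposed by the cvmfs client (the attribute name is an assumption):
attr -g nioerr /cvmfs/atlas.cern.ch                # I/O errors recorded so far (xattr name assumed)
cvmfs_talk -i atlas.cern.ch reset error counters   # clear the counter, as above
attr -g nioerr /cvmfs/atlas.cern.ch                # should report 0 again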
SERVICE STATUS: offline (repository via squid server): repository revision rev
The full warning may comprise multiple repositories and/or Squid servers:
SERVICE STATUS: offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via http://pachter.nikhef.nl:3128):
offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via http://zonnewijzer.nikhef.nl:3128):
offline (http://cernvmfs.gridpp.rl.ac.uk/opt/atlas via http://karnton.nikhef.nl:3128):
repository revision 1197
This warning means that the client could not connect to the repository via the listed Squid servers. If the remote repository is the same for all local Squid servers, the problem is most likely a connection problem to the remote site, particularly if the warning occurs for many clients. The example above refers to the remote repository at RAL via three local Squid servers, implying a problem at, or on the way to, the remote end.
However, if connections to more than one remote repository are failing (RAL, CERN and/or BNL), the problem is likely to be in the local Squid servers. In this situation, a restart of the Squid servers may be required.
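Whether a particular local Squid can still reach a remote repository can be checked from any node by fetching the repository manifest (.cvmfspublished) through that proxy; a sketch using the repository and Squid server from the example above:
env http_proxy=http://pachter.nikhef.nl:3128 \
    curl -sI http://cernvmfs.gridpp.rl.ac.uk/opt/atlas/.cvmfspublished
If the fetch fails through every local Squid for only one remote repository, the problem is at the remote end; if it fails for several remote repositories, a restart of Squid on the local proxy is in order (for example with service squid restart, init-script name assumed).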
SERVICE STATUS: N I/O errors detected: space on cache partition low: repository revision rev
This warning indicates that the free space on the cache partition is running low. The actual disk space used may be considerably higher than the cache usage reported by cvmfs-talk:
[root@wn-car-040 cvmfs2]# cvmfs-talk -i lhcb.cern.ch cache size
Current cache size is 4185MB (4389334621 Bytes), pinned: 1821MB (1909501952 Bytes)
[root@wn-car-040 cvmfs2]# du -sk lhcb.cern.ch
10294824 lhcb.cern.ch
The cache usage can be reduced by asking the client to clean the cache down to a target size in megabytes (here 1000 MB):
[root@wn-car-040 lhcb.cern.ch]# cvmfs-talk -i lhcb.cern.ch cleanup 1000
Not fully cleaned (there might be pinned chunks)
[root@wn-car-040 lhcb.cern.ch]# cvmfs-talk -i lhcb.cern.ch cache size
Current cache size is 3599MB (3774548992 Bytes), pinned: 1821MB (1909501952 Bytes)
But there may still be an inconsistency:
[root@wn-car-040 cvmfs2]# du -sk lhcb.cern.ch
5878980 lhcb.cern.ch
(cvmfs reports about 3.6 GB of cache usage, whereas du sees almost 6 GB on disk).
The safe workaround is to put the node offline. Once the last cvmfs process has exited, the service can be restarted with an empty cache:
service cvmfs restartclean
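A sketch of the complete procedure, assuming a Torque/PBS batch system (the pbsnodes commands are an assumption; use your batch system's equivalent):
pbsnodes -o wn-car-040      # take the node offline so no new jobs start (assumed Torque/PBS)
# wait until the running jobs, and with them the last cvmfs processes, have finished
service cvmfs restartclean  # restart the client with an empty cache
pbsnodes -c wn-car-040      # put the node back online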