NDPF

From PDP/Grid Wiki

Nikhef Data Processing Facility

NDPF TODO List
pending actions needed on NDPF
Acceptable Use Policy
also available in Dutch ("Oud-Hollandsch", i.e. Old Dutch) as "Gebruiksvoorwaarden" (terms of use), for those who are linguistically challenged

Hardware Inventory and Connections

NDPF_Systems
system function and logical network tables: NDPF:System_Functions (which node does what) and NDPF:Network (network connections in the NDPF, covering all routers and switches)
NDPF_Standalone_Nodes
list of manually managed hosts, with access, usage and restoration info on the CT Wiki
Hardware_Issues
overview of current hardware problems
NDPF_OPN_Interfaces
Cabling of the OPN switches and routers
Marsepein node type details
Details of the Dell PowerEdge R630 worker node type 'marsepein'
Oliebol node type details
Details of the Supermicro 4U 36-disk node type 'oliebol'
Strijker node type details
Details of the Dell PowerEdge R515 2U 12-disk node type 'strijker'
Biet node type details
Details of the Dell PowerEdge R710 disk node type 'biet'
NDPF_Disk_Servers
which disk server contains what?
NDPF_DomUlist
the domUs hosted on each Dom0
HW_X4500_MAC_20080417
MAC addresses of the "EasterEgg" Sun X4500 Thumpers
HooiThumpersLocation
Where is my Thumper? Physical locations per serial number
Valentine_memory
Status of and information about the DRAM read errors on the Valentines
SintMaarten_network
Analysis of the performance issue with the SintMaarten (HP) worker nodes
Agile_testbed
Management of the agile testbed for experimentation
OpenVPN
How to use the OpenVPN servers
EGI Infra
Infrastructure and order lists for EGI TF10
NDPF_System_Locations
old cabling information

Operating procedures and manuals

External_Procedures_and_Tools
a collection of links to external documentation on procedures and websites "to get things done"
SaltReclass
Guide to using the salt/reclass configuration management system that is taking over from quattor.
CT Wiki NDPF
system descriptions of all stand-alone systems, access, usage and restoration
GridPBX
The Grid Asterisk/Elastix service
SurfVideoConf
The Big Grid and Nikhef videoconference
DTConsole
The Cantankerous HoD Guide to accessing machine consoles from your Desktop
Shutting_Down_WorkerNodes
How to shut down the worker nodes, for example when the temperature in the server room becomes too high
WLCG_Accounting
Guide to assembling the monthly accounting summaries for WLCG
Monthly Review
Guide to assembling the monthly "vak N review" for the NL-T1 meeting.
GLite on Nikhef Desktops
Guide to installing the gLite UI tarball for use on Nikhef desktops

User Management

NDPFDirectoryImplementation
Structure of the NDPF Directory and Authorization in the NDPF
Creating Pool Accounts With LDAP
how to create pool accounts for a new VO (or extend an existing set), or even recover from an empty gridmapdir
Adding local users
How to create a new user account.
Adding a new VO
How to add a new VO
NDPF_UidsAndGids
Uid and Gid number plan for the NDPF clusters

Grid, Quattor and Yaim

GRID Issues
Articles that may help resolve issues seen when working with the Grid.
How to work with our Quattor setup
How to work with the template repository, compiling and deploying node templates
Quattor package management with yum-based ncm-spma
Description of package management with the yum-based ncm-spma
Installing a new VM via Quattor
What to do to create a new virtual machine from scratch using our Quattor system.
Installing updates: OS, CAs Quattor, VL-e
The procedure to download, configure and install updates for the OS via Quattor.
Upgrading Quattor managed glite servers
Information on how to upgrade Quattor managed glite servers.
Upgrading Quattor Components
Information on how to upgrade quattor components using the CVS repository on stal and the local tools for rpm management.
Quattor and IPv6
Summary of what has been done so far to configure servers with IPv6 connectivity via Quattor.
Pan Tutorial
Information about the Pan language.
RPM Repository
Information about working with the RPM repository.
Checkdeps
A Quattor tool to verify the RPM dependencies before deploying on a host
Maintaining_local_Yaim_functions
How to build a release of the local nikhef-yaim-* rpms and include them in the Quattor profiles.
Namespaced Quattor configuration
Documentation on the Quattor setup, including the namespaced templates
AII version 2 and complex block device schemas
How to configure complex block device schemas in Pan with AII version 2
Setting up a gLite WMS
Notes for installing and configuring a gLite WMS
LCAS and LCMAPS installation for gLExec and (GT4) gatekeepers
What to install and what/how to configure the components
Requesting or Renewing Host certificates
Guide on the procedure to request a new host certificate or renew an existing certificate
Working with CREAM CE servers and VOs
Information about adding plugins and files for VO authorization to CREAM-CE servers in Quattor.

Security Operations

How to ban users with quattor
Using quattor to ban a user on WMS, CE, DPM nodes.
How to install security updates using quattor
Update one or more vulnerable RPMs on Quattor-managed hosts.
HowToLimitSSHScans
How to limit ssh scans by blocking probing IPs using standard iptables modules on RL4 and EL5
HowTo Disk images, RAID, LVM
How to access data in disk images created with dd

Misc

Adding a VO to a VOMS server
How to add a new VO to a VOMS-is-evil server
Monitoring Script
Monitors the status of jobs to identify those that have aborted but are still in a queue.
Managing RAID Controllers
How to deal with failing disks and batteries; how do I find my disk?
USB stick for Fujitsu Firmware update
how to prepare a bootable USB stick for firmware updates for Fujitsu servers.

Batch Systems and Scheduling

Adding/removing nodes to PBS
Information on how to add or remove nodes from PBS.
NDPFAccouting
Information on how the accounting chain works from PBS to the local and APEL accounting portals. (historic information is in Accounting).
VIRGOAccounting
Documentation of how VIRGO daily accounting has been implemented.
VoViewGraphing
Creating the voview/grisview graphs on www.nikhef.nl/grid/stats, and the cron jobs that run on naab.
Enabling multicore jobs and jobs requesting large amounts of memory
Overview of the steps to enable flexible resource requests
Maui reservations
Creating a system reservation in Maui
Various local tools
Overview of local tools: when_idle (perform an action when a node has drained), mom-taskset (pin a job to the slots/cores allocated), prune_userprocs (remove user processes not related to a batch job).

CVMFS

SoftDrive backend
behind the curtains of SoftDrive.
CVMFS servers and repositories
Overview of the cvmfs repositories hosted at the Stratum-0 and Stratum-1 servers at Nikhef
Adding a new cvmfs repository
How to build a new cvmfs repository at the Stratum-0 server, update its contents, synchronize the contents to the Stratum-1 server and configure the clients to use the repository.

Systems Documentation

NDPFAlarmsSystem
handling NL-T1 alarms at Nikhef
Nagios Monitoring Setup
current setup of Nagios monitoring
NDPFSubVersion
Setup of the NDPF Subversion repository
NDPFIpmiNikhefNlBindConfig
modifying the ipmi.nikhef.nl domain
NDPF rsync backup
backup of service nodes using rsync and indirect ADSM usage
NDPF MySQL configurations
Configuration and monitoring of the MySQL service on bedstee
BuildingCentOSXfsKernelModules
Building the XFS kernel modules for CentOS (if CentOSPlus is late)
CricketGraphing
generating the configuration for network graphs
NDPF Dell switch config
setup and configuration of the dell switches (disabling port-fast) - required for every new installation
Nortel 4548GT switches
Luilak/Bulldozer subcluster switch documentation
Arista 7148S switches
Core switch documentation
Remote usage of the Dell console switches
How to remotely connect to the Dell console switches.
Serial Consoles
how to set up your serial (over LAN) console and IPMI 2.0 SoL, even with Xen, on a PE1950
SettingLcdPanelText
how to set the LCD panel text at runtime on a Dell PowerEdge using IPMI


NDPF_Dell_OpenManage_SA_Installation
use and configuration of the chassis and other controls of the Dell (PE1950) systems
NDPF_Equinox_ELS16
documentation for the Equinox ELS-16 terminal server
NDPF GS environment
Grid Service (specialties) environment documentation
RAID-1 configuration and management
how to set up RAID-1 and how to manage RAID devices
SunX4500Documentation
some tidbits on the X4500 Thumpers
Increasing Thumper filesystems for DPM
recipe for increasing the logical volume, file system and making it visible to DPM
HP BL460c G6
How to set up, configure and manage the "Sint Maarten" cluster
HooiMaanden
How to set up, configure and manage the "Hooimaanden"
Hooikar/hooiwagen
Configuration of DPM disk servers exporting iSCSI-mounted file systems
Bieten
How to set up, configure and manage the "Suiker Bieten"
Hooikanon storage details
Details of the NetApp E2860
Hooikanon server set up (ppc64le)
Set up details for the Power9 hooikanon systems
backup_umts_location
Information about the backup UMTS connection
Remote Location
Information about the remote location

Problem Recovery

Procedures to be followed when recovering from major outages. No extensive descriptions, just a list of what to do, without having to think/know much about the topic.

rebooting XCP VMs the hard way
How to handle a virtual machine that no longer responds.
cvmfs errors/warnings
description of the errors and warnings that cvmfs reports via its Nagios check
Restoring Services
order in which to bring services back on-line
Resurrecting kuiken
steps to restart and check the VOMS server kuiken

Troubleshooting

How to figure out various things, diagnose problems, fix them, etc.

Tracing Jobs
Given a job ID, how to figure out where the job came from.
CREAM CE Troubleshooting
Error messages and how to solve them for CREAM.
Generating crash reports with abrt
How to generate core dumps from crashing services

System Utilities

PBS_Caching_Utilities
The current LCG software can put a severe load on a PBS server because of the number of times it calls qstat (we have seen as many as 25 qstat calls per second). This load can eventually bring the entire system to a halt. To reduce the impact of this problem (a full solution requires rewriting part of the grid job manager), Davide wrote a set of utilities that wrap the original PBS commands and cache their output.
MoniFarm Utilities
Graphing package for farm utilisation (used to make the production graphs for the NDPF Facilities)
NDPF_VOBOX_Alice_Renewalscript
The proxy renewal wrapper script, enhanced with status info
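The caching idea behind PBS_Caching_Utilities can be illustrated with a minimal Python sketch. It assumes a simple time-based cache: identical qstat invocations that arrive within a short window are answered from a file, and the PBS server is not queried again. This is not Davide's implementation. The function name cached_output, the 30-second TTL and the cache directory are hypothetical choices made for illustration only.

```python
import hashlib
import os
import subprocess
import tempfile
import time
from typing import Callable, Sequence

# Hypothetical defaults, not taken from the original utilities.
DEFAULT_TTL = 30.0  # seconds a cached result is considered fresh
DEFAULT_CACHE_DIR = os.path.join(tempfile.gettempdir(), "pbs-cache")


def run_command(argv: Sequence[str]) -> str:
    """Run the real PBS command (e.g. qstat) and return its standard output."""
    return subprocess.run(list(argv), check=True,
                          capture_output=True, text=True).stdout


def cached_output(argv: Sequence[str],
                  ttl: float = DEFAULT_TTL,
                  cache_dir: str = DEFAULT_CACHE_DIR,
                  runner: Callable[[Sequence[str]], str] = run_command) -> str:
    """Return the output of argv, running the command at most once per ttl seconds."""
    os.makedirs(cache_dir, exist_ok=True)
    # One cache file per distinct command line, so "qstat -f" and
    # "qstat -Q" are cached separately.
    key = hashlib.sha256("\0".join(argv).encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    try:
        if time.time() - os.path.getmtime(path) < ttl:
            with open(path, encoding="utf-8") as f:
                return f.read()  # fresh enough: do not touch the PBS server
    except FileNotFoundError:
        pass  # no cached copy yet
    output = runner(argv)
    # Write to a temporary file and rename it into place, so concurrent
    # readers never see a half-written entry. mkstemp creates the file with
    # mode 0600, so this cache is private to the calling user.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(output)
    os.replace(tmp, path)
    return output
```

A wrapper script placed ahead of the real qstat in PATH could call cached_output with the path of the real qstat binary followed by its own arguments. The trade-off of any cache like this is that callers may see job states that are up to one TTL old. In exchange, a burst of 25 identical calls per second costs the PBS server a single query per TTL.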


Virtualization

GSP Virtualisation with Xen
Grid Server Park virtualisation with Xen Cloud Platform
Xen on CentOS 5
Installation and configuration of XEN Dom0 CentOS 5 x86_64, DomU CentOS 4 i386
Managing the security training sites
Quick writeup of managing the security training hosts with XCP, cobbler and salt.

Virtualisation Issues

NDPF VMware authentication
controlling and creating VMs on the central VMware server
NDPF VMware tips
tips and tricks for generating vmware images
Xen on CentOS 5 - Notes
Notes for installing and configuring a XEN System on CentOS 5
Xen on CentOS - Automating Installation-Administration
Quattor managed Xen-Dom0 and DomUs
Xen 3.2, CentOS 5.1 and NAT HOWTO
HOWTO document to set up Xen with NAT networking

Performance data and analysis

NDPF Node Performance
performance figures for nodes used for accounting purposes
NDPF Disk Performance
performance figures for disk servers
RunningSPEC
how to run the SPEC2006 and SPEC2000 suites at Nikhef, including using the Intel Compiler Suite and SmartHeap
Running HS06
notes from running the HEP-SPEC06 benchmark
Historical WN information
which nodes had what names during which time periods, etc.

User Security and eTokens

Using an Aladdin eToken PRO to store grid certificates
Using an Aladdin eToken PRO with grid certificates (including gridproxy generation)
Using voms-proxy-init on an OSX (10.4 or higher) system
A short how-to on getting voms-proxy-init working on your Mac.

User Tips and Tricks

PXE UEFI booting and installing
what to do when your systems are so hip they don't do legacy BIOS boot anymore
Passing job requirements through the WMS
If your jobs need more cores or more memory, and you still want to use the WMS
How to control access rights for LFC/SRM files
How to control access rights to your files stored on the grid in LFC, SRM or both
Checksumming support in SRM implementations
An overview of the support for checksumming in different SRM implementations
VO-specific software and modules
How to install VO-specific software including automatic modules support
Qsub-tunnel
job submission from the desktop and login systems to the NDPF clusters
Qsub-tunnel-admin
administering the qsub tunnel and adding new users

Software development tips

How to handle OpenSSL and not get hurt
Intended for developers who use the OpenSSL library. Organized as a (collective) brain dump.
Building gLExec from src rpm
How to (re)build gLExec from a .src.rpm.
Software development on Arista hardware
Useful tips, how-tos and more.
Funny Curly things
An investigation into the different versions of the cURL tool.
How to work with VOMS
A detailed overview of how VOMS should work and how to work with it.

AOB

Grid Opendag preparation
What to do, checklists, ideas and other pieces of information suited to prepare yourself for the Opendag (Open House) at Nikhef.
Grid contributions Open Dag 2013
Plans and programme: what to show during the Open Dag 2013 (5 October).
Updating a kernel module for XenServer 7
Getting Intel X722 10Gbps SFP+ cards to work with XenServer 7.0.
A workaround for a powerpc saltstack bug
Getting around the powerpc64le/ppc64le host architecture SNAFU in Saltstack.