NDPF
Revision as of 12:48, 21 June 2012 by Ronalds@nikhef.nl (talk | contribs) (→Batch Systems and Scheduling)
Nikhef Data Processing Facility
- NDPF TODO List
- pending actions needed on NDPF
- Acceptable Use Policy
- also available in Oud-Hollandsch as "Gebruiksvoorwaarden" for those that are linguistically challenged
Hardware Inventory and Connections
- NDPF_Systems
- new node function tables, namely NDPF_System_Functions (which node does what) and NDPF_System_Locations; Network Connections in the NDPF (all routers and switches)
- NDPF_Standalone_Nodes
- list of manually managed hosts, with access, usage and restoration info on the CT Wiki
- Hardware_Issues
- overview of current hardware problems
- NDPF_OPN_Interfaces
- Cabling of the OPN switches and routers
- NDPF_Disk_Servers
- which disk server contains what?
- NDPF_DomUlist
- domUs on a Dom0
- HW_X4500_MAC_20080417
- MAC addresses "EasterEgg" Sun X4500 Thumpers
- HooiThumpersLocation
- Where is my Thumper? Physical locations per serial number
- Valentine_memory
- Status and information about the DRAM read errors on the Valentines
- SintMaarten_network
- Analysis of the performance issue with the SintMaarten (HP) worker nodes
- Agile_testbed
- Management of the agile testbed for experimentation
- OpenVPN
- How to use the OpenVPN servers
- EGI Infra
- Infrastructure and order lists for EGI TF10
Operating procedures and manuals
- External_Procedures_and_Tools
- a collection of links to external documentation on procedures and websites "to get things done"
- CT Wiki NDPF
- system descriptions of all stand-alone systems, access, usage and restoration
- GridPBX
- The Grid Asterisk/Elastix service
- SurfVideoConf
- The Big Grid and Nikhef videoconference
- DTConsole
- The Cantankerous HoD Guide to accessing machine consoles from your Desktop
- Shutting_Down_WorkerNodes
- How to shut down the worker nodes, for example when the temperature in the server room becomes too high
- WLCG_Accounting
- Guide to assembling the monthly accounting summaries for WLCG
- Monthly Review
- Guide to assembling the monthly "vak N review" for the NL-T1 meeting.
- GLite on Nikhef Desktops
- Guide to installing the gLite UI tarball for Nikhef desktop use
User Management
- NDPFDirectoryImplementation
- Structure of the NDPF Directory and Authorization in the NDPF
- Creating_Pool_Accounts_With_LDAP
- how to create pool accounts for a new VO (or extend an existing set), or even recover from an empty gridmapdir
- Adding local users
- How to create a new user account.
- Adding a new VO
- How to add a new VO
- NDPF_UidsAndGids
- Uid and Gid number plan for the NDPF clusters
Grid, Quattor and Yaim
- GRID Issues
- Articles that may help resolve issues seen when working with the Grid.
- How to work with our Quattor setup
- How to work with the template repository, compiling and deploying node templates
- Installing a new VM via Quattor
- What to do to create a new virtual machine from scratch using our Quattor system.
- Installing updates: OS, CAs Quattor, VL-e
- The procedure to download, configure and install updates for the OS via Quattor.
- Upgrading Quattor managed glite servers
- Information on how to upgrade Quattor managed glite servers.
- Upgrading Quattor Components
- Information on how to upgrade Quattor components using the CVS repository on stal and the local tools for RPM management.
- Pan Tutorial
- Information about the Pan language.
- RPM Repository
- Information about working with the RPM repository.
- Checkdeps
- A Quattor tool to verify the RPM dependencies before deploying on a host
- Maintaining_local_Yaim_functions
- How to build a release of the local nikhef-yaim-* RPMs and include them in the Quattor profiles.
- Namespaced Quattor configuration
- Documentation on the Quattor setup, including the namespaced templates
- AII version 2 and complex block device schemas
- How to configure complex block device schemas in Pan with AII version 2
- Setting up a gLite WMS
- Notes for installing and configuring a gLite WMS
- LCAS and LCMAPS installation for gLExec and (GT4) gatekeepers
- What to install and what/how to configure the components
- Requesting or Renewing Host certificates
- Guide on the procedure to request a new host certificate or renew an existing certificate
Security Operations
- How to ban users with quattor
- Using Quattor to ban a user on WMS, CE, and DPM nodes.
- How to install security updates using quattor
- Update one or more vulnerable RPMs on Quattor-managed hosts.
- HowToLimitSSHScans
- How to limit ssh scans by blocking probing IPs using standard iptables modules on RL4 and EL5
Misc
- Adding a VO to a VOMS server
- How to add a new VO to a VOMS-is-evil server
- Monitoring Script
- Monitors the status of jobs in order to identify those that have aborted but are still in a queue.
Batch Systems and Scheduling
- Adding/removing nodes to PBS
- Information on how to add or remove nodes from PBS.
- NDPFAccouting
- Information on how the accounting chain works from PBS to the local and APEL accounting portals (historic information is in Accounting).
- VoViewGraphing
- Creating the voview/grisview graphs on www.nikhef.nl/grid/stats, and the cron jobs that run on naab.
- Enabling multicore jobs and jobs requesting large amounts of memory
- Overview of the steps to enable flexible resource requests
Systems Documentation
- NDPFAlarmsSystem
- handling NL-T1 alarms at Nikhef
- Nagios Monitoring Setup
- current setup of Nagios monitoring
- NDPFSubVersion
- NDPF SubVersion Repository set up
- NDPFIpmiNikhefNlBindConfig
- modifying the ipmi.nikhef.nl domain
- NDPF rsync backup
- backup of service nodes using rsync and indirect ADSM usage
- NDPF MySQL configurations
- Configuration and monitoring of the MySQL service on bedstee
- BuildingCentOSXfsKernelModules
- Building the XFS kernel modules for CentOS (if CentOS plus is late)
- CricketGraphing
- generating the configuration for network graphs
- NDPF Dell switch config
- setup and configuration of the Dell switches (disabling port-fast), required for every new installation
- Nortel 4548GT switches
- Luilak/Bulldozer subcluster switch documentation
- Arista 7148S switches
- Core switch documentation
- Remote usage of the Dell console switches
- How to remotely connect to the Dell console switches.
- Serial Consoles
- how to set up your serial (over LAN) console and IPMI 2.0 SoL stuff, even with Xen, on a PE1950
- SettingLcdPanelText
- how to set the LCD panel text at runtime on a Dell PowerEdge using IPMI
- NDPF_Dell_OpenManage_SA_Installation
- use and configuration of the chassis and other controls of the Dell (PE1950) systems
- NDPF_Equinox_ELS16
- documentation for the Equinox ELS-16 TS
- NDPF GS environment
- Grid Service (specialties) environment documentation
- RAID-1 configuration and management
- how to set up RAID-1 and how to manage RAID devices
- SunX4500Documentation
- some tidbits on the X4500 Thumpers
- Increasing Thumper filesystems for DPM
- recipe for increasing the logical volume, file system and making it visible to DPM
- HP BL460c G6
- How to set up, configure, and manage the "Sint Maarten" cluster
- HooiMaanden
- How to set up, configure, and manage the "Hooimaanden"
- Hooikar/hooiwagen
- Configuration of DPM disk servers exporting iSCSI-mounted file systems
- Bieten
- How to set up, configure, and manage the "Suiker Bieten"
- backup_umts_location
- Information about the backup UMTS connection
- Remote Location
- Information about the remote location
Problem Recovery
Procedures to be followed when recovering from major outages. No extensive descriptions, just a list of what to do, without having to think/know much about the topic.
- Restoring Services
- order in which to bring services back on-line
- Resurrecting kuiken
- steps to restart and check the VOMS server kuiken
Troubleshooting
How to figure out various things, diagnose problems, fix them, etc.
- Tracing Jobs
- Given a job ID, how to figure out where the job came from.
- CREAM CE Troubleshooting
- Error messages and how to solve them for CREAM.
System Utilities
- PBS_Caching_Utilities
- The current LCG software can put severe load on a PBS server because of how often it calls qstat (we have seen as many as 25 qstat calls per second). This load can eventually bring the entire system to a halt. To reduce the impact of this problem (a full solution requires rewriting part of the grid job manager), Davide wrote a set of utilities that wrap the original PBS commands and provide caching.
- MoniFarm Utilities
- Graphing package for farm utilisation (used to make the production graphs for the NDPF Facilities)
- NDPF_VOBOX_Alice_Renewalscript
- The proxy renewal wrapper script, enhanced with status info
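The caching idea behind the PBS_Caching_Utilities entry above can be illustrated with a minimal sketch. This is not the code of the actual utilities: the names CachedCommand and run_command, the 30-second default lifetime, and the in-memory cache are all assumptions made for illustration (a wrapper that serves many separate processes would more likely need a cache shared on disk or through a daemon).

```python
import subprocess
import time
from typing import Callable, Dict, Sequence, Tuple


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return result.stdout


class CachedCommand:
    """Answer repeated identical invocations from memory for `ttl` seconds.

    Hypothetical helper: `runner` and `clock` are injectable so the cache
    can be exercised without a real PBS server.
    """

    def __init__(
        self,
        runner: Callable[[Sequence[str]], str] = run_command,
        ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._runner = runner
        self._ttl = ttl
        self._clock = clock
        # Maps the full argument tuple to (time stored, command output).
        self._cache: Dict[Tuple[str, ...], Tuple[float, str]] = {}

    def __call__(self, *argv: str) -> str:
        now = self._clock()
        hit = self._cache.get(argv)
        if hit is not None and now - hit[0] < self._ttl:
            # Still fresh: do not bother the batch server again.
            return hit[1]
        output = self._runner(argv)
        self._cache[argv] = (now, output)
        return output
```

With such a wrapper, creating cached_qstat = CachedCommand(ttl=30.0) and calling cached_qstat("qstat", "-f") repeatedly within 30 seconds would run the real qstat only once. The trade-off is that callers may see job states that are up to ttl seconds old.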
Virtualization
- GSP Virtualisation with Xen
- Grid Server Park virtualisation with Xen Cloud Platform
- Xen on CentOS 5
- Installation and configuration of Xen Dom0 CentOS 5 x86_64, DomU CentOS 4 i386
Virtualisation Issues
- NDPF VMware authentication
- controlling and creating VMs on the central VMware server
- NDPF VMware tips
- tips and tricks for generating VMware images
- Xen on CentOS 5 - Notes
- Notes for installing and configuring a Xen system on CentOS 5
- Xen on CentOS - Automating Installation-Administration
- Quattor managed Xen-Dom0 and DomUs
- Xen 3.2, CentOS 5.1 and NAT HOWTO
- HOWTO document to set up Xen with NAT networking
Performance data and analysis
- NDPF Node Performance
- performance figures for nodes used for accounting purposes
- NDPF Disk Performance
- performance figures for disk servers
- RunningSPEC
- how to run the SPEC2006 and SPEC2000 suites at Nikhef, including using the Intel Compiler Suite and SmartHeap
User Security and eTokens
- Using an Aladdin eToken PRO to store grid certificates
- Using an Aladdin eToken PRO with grid certificates (including gridproxy generation)
- Using voms-proxy-init on an OSX (10.4 or higher) system
- Tiny how-to on getting voms-proxy-init working on your Mac.
User Tips and Tricks
- How to control access rights for LFC/SRM files
- How to control access rights to your files stored on the grid in LFC, SRM or both
- Checksumming support in SRM implementations
- An overview of the support for checksumming in different SRM implementations
- VO-specific software and modules
- How to install VO-specific software including automatic modules support
- Qsub-tunnel
- job submission from the desktop and login systems to the NDPF clusters
- Qsub-tunnel-admin
- administering the qsub tunnel and adding new users
Software development tips
- How to handle OpenSSL and not get hurt
- Intended for developers who use the OpenSSL library. Organized as a (collective) braindump.
- Building gLExec from src rpm
- How to (re)build gLExec from a .src.rpm.
- Software development on Arista hardware
- Useful tips, how-tos and more.
- Funny Curly things
- An investigation into the different versions of the cURL tool.
- How to work with VOMS
- A detailed overview of how VOMS must work and how you need to work with VOMS.
AOB
- Grid Opendag preparation
- What to do, checklists, ideas and other pieces of information suited to prepare yourself for the Opendag (Open House) at Nikhef.