[[Image:P4ctb-3.svg|thumb|Diagram of the agile test bed]]
  
 
== Introduction to the Agile testbed ==

TODO: integrate the contents of the [[Testbed_Update_Plan]] with this page.

The ''Agile Testbed'' is a collection of virtual machine servers, tuned for quickly setting up virtual machines for testing, certification and experimentation in various configurations.
=== Getting access ===
[[Image:Testbed-access-v2.svg|thumb|Schema of access methods]]

Access to the testbed is by one of:

# ssh access from '''bleek.nikhef.nl''',
# IPMI Serial-over-LAN (only for the ''physical'' nodes), or
# serial console access from libvirt (only for the ''virtual'' nodes).
  
 
The only machine that can be reached with ssh from outside the testbed is the management node '''bleek.nikhef.nl'''. Inbound ssh is restricted to the Nikhef network. The other testbed hardware lives on a LAN with no inbound connectivity. Since bleek also has an interface in this network, you can log on to the other machines from bleek.

Access to bleek.nikhef.nl is restricted to users who have a home directory with their ssh public key in ~/.ssh/authorized_keys and an entry in /etc/security/access.conf.

Since all access has to go through bleek, it is convenient to set up ssh to proxy connections to *.testbed through bleek, in combination with connection sharing, in ~/.ssh/config:
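For example (a sketch; <code>ProxyJump</code> needs a reasonably recent OpenSSH client, and the control path is only a suggestion):

 Host bleek bleek.nikhef.nl
     HostName bleek.nikhef.nl
     ControlMaster auto
     ControlPath ~/.ssh/cm-%r@%h:%p
     ControlPersist 10m
 
 Host *.testbed
     ProxyJump bleek.nikhef.nl
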
==== Installing the machine ====

* '''Choose a VM host''' to start the installation on. Peruse the [[#Hardware|hardware inventory]] and pick one of the available machines.
* '''Choose a [[#Storage|storage option]]''' for the machine's disk image.
* '''Choose OS, memory and disk space''' as needed.
* '''Figure out''' which network bridge to use, as [[#Network|determined by the network]] the machine should be connected to (see the sketch after this list).
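A minimal sketch of such an installation with virt-install (the name, sizes, bridge and installer URL are examples, not testbed policy):

 # create and boot a new KVM guest with a 20 GB volume from the vmachines pool, attached to bridge br88
 virt-install --connect qemu:///system \
     --name test.testbed --ram 2048 --vcpus 2 \
     --disk pool=vmachines,size=20 \
     --network bridge=br88 \
     --location http://deb.debian.org/debian/dists/stable/main/installer-amd64/ \
     --graphics none --extra-args 'console=ttyS0,115200n8'
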
==== adding a new user to the testbed ====

Users are known from their LDAP entries. All it takes to allow another user on the testbed is adding their name to
 /etc/security/access.conf
on bleek (at least if logging on to bleek is necessary), adding a home directory on bleek, and copying the ssh key of the user to the appropriate file.

Something along these lines (but this is untested):
 # create the home directory from the skeleton and hand it over to the new user
 test -d /user/$NEWUSER || cp -r /etc/skel /user/$NEWUSER
 chown -R $NEWUSER:`id -ng $NEWUSER` /user/$NEWUSER
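The access.conf entry could look like the line below (a sketch with a made-up username; pam_access evaluates rules in order, so it has to appear before any final deny-all rule):

 + : jdoe : ALL
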
  
 
==== removing a user from the testbed ====

==== Requesting certificates from the testbed CA ====

There is a cheap 'n easy (and entirely untrustworthy) CA installation on bleek:/srv/ca-testbed/ca/.
The DN is
 /C=NL/O=VL-e P4/CN=VL-e P4 testbed CA 2
and it supersedes the testbed CA that had a key on an eToken (which was more secure but inconvenient).

Generating a new host cert is as easy as

 cd /srv/ca-testbed/ca
 ./gen-host-cert.sh test.testbed

You must enter the password for the CA key. The resulting certificate and key will be copied to
 /var/local/hostkeys/pem/(hostname)/

The testbed CA files (both the earlier CA as well as the new one) are distributed as rpm and deb packages from http://bleek.nikhef.nl/extras.
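The result can be inspected with openssl; the file name below is only an illustration of what might end up in that directory:

 openssl x509 -in /var/local/hostkeys/pem/test.testbed/hostcert.pem -noout -subject -issuer -dates
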
  
 
==== Automatic configuration of machines ====

== Network ==

The testbed machines are connected to three VLANs:

{| class="wikitable"
! VLAN
! Name
! Network
! Gateway
! Bridge
! ACL
|-
| 82
| [[NDPF_System_Functions#P4CTB|P4CTB]]
| 194.171.96.16/28
| 194.171.96.30
| br82
| No inbound traffic on privileged ports
|-
| 88
| [[NDPF_System_Functions#Nordic (Open_Experimental)|Open/Experimental]]
| 194.171.96.32/27
| 194.171.96.62
| br88
| Open
|-
| 97 (untagged)
| local
| 10.198.0.0/16
|
|
| testbed only
|- style="color: #999;"
| 84
| [[NDPF System Functions#MNGT/IPMI|IPMI and management]]
| 172.20.0.0/16
|
|
| separate management network for IPMI and Serial-Over-Lan
|}
The untagged VLAN is for internal use by the physical machines. The VMs are connected to bridge devices according to their purpose, and groups of VMs are isolated by using nested VLANs (Q-in-Q).

This is an example configuration of /etc/network/interfaces:

 # The primary network interface
 auto eth0
 iface eth0 inet manual
         up ip link set $IFACE mtu 9000
 
 auto br0
 iface br0 inet dhcp
         bridge_ports eth0
 
 auto eth0.82
 iface eth0.82 inet manual
         up ip link set $IFACE mtu 9000
         vlan_raw_device eth0
 
 auto br2
 iface br2 inet manual
         bridge_ports eth0.82
 
 auto eth0.88
 iface eth0.88 inet manual
         vlan_raw_device eth0
 
 auto br8
 iface br8 inet manual
         bridge_ports eth0.88
 
 auto vlan100
 iface vlan100 inet manual
         up ip link set $IFACE mtu 1500
         vlan_raw_device eth0.82
 
 auto br2_100
 iface br2_100 inet manual
         bridge_ports vlan100

In this example VLAN 82 is configured on top of the physical interface as eth0.82 and attached to the bridge br2; the nested VLAN 100 (with eth0.82 as its raw device) is configured on top of that and attached to the bridge br2_100. VMs that are added to br2_100 will therefore only receive traffic coming from VLAN 100 nested inside VLAN 82.
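To connect a VM to one of these bridges, the bridge name is simply used in the VM's network definition; for instance with virsh (a sketch, the domain name is only an example):

 # attach an extra virtio NIC on the nested bridge; takes effect at the next start of the VM
 virsh attach-interface test.testbed bridge br2_100 --model virtio --config
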
 
=== NAT ===

  SNAT      all  --  10.198.0.0/16        0.0.0.0/0          to:194.171.96.17 
  ...
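Such a rule lives in the POSTROUTING chain of the nat table on the machine doing the NAT; it could be (re)created with something like this (a sketch, interface selection omitted):

 # rewrite the source address of outgoing testbed traffic to 194.171.96.17
 iptables -t nat -A POSTROUTING -s 10.198.0.0/16 -j SNAT --to-source 194.171.96.17
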
=== Multicast ===

The systems for clustered LVM and Ganglia rely on multicast to work. Some out-of-the-box Debian installations end up with a host entry like

 127.0.1.1 arrone.testbed

Such entries should be removed!
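One way to get rid of such an entry (a sketch; double-check /etc/hosts afterwards):

 # remove the 127.0.1.1 line the Debian installer adds for the local hostname
 sed -i '/^127\.0\.1\.1/d' /etc/hosts
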
  
 
== Storage ==

The hypervisors of the testbed all connect to the same shared storage backend (a Fujitsu DX200 system called KLAAS) over iSCSI.
The storage backend exports a number of pools to the testbed. These are formatted as LVM volume groups and shared through a clustered LVM setup.

In libvirt, the VG is known as a 'pool' under the name <code>vmachines</code> (location <code>/dev/p4ctb</code>).
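A disk for a new VM can then be carved out of this pool with virsh, for example (a sketch; the volume name is only an illustration):

 # show the pool, create a 20 GiB volume for a new VM, and list the result
 virsh pool-list --all
 virsh vol-create-as vmachines test.testbed-disk0 20G
 virsh vol-list vmachines
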
=== Clustered LVM setup ===

The clustering of nodes is provided by corosync. Here are the contents of the configuration file /etc/corosync/corosync.conf:

 totem {
         version: 2
         cluster_name: p4ctb
         token: 3000
         token_retransmits_before_loss_const: 10
         clear_node_high_bit: yes
         crypto_cipher: aes256
         crypto_hash: sha256
         interface {
                 ringnumber: 0
                 bindnetaddr: 10.198.0.0
                 mcastport: 5405
                 ttl: 1
         }
 }
 
 logging {
         fileline: off
         to_stderr: no
         to_logfile: no
         to_syslog: yes
         syslog_facility: daemon
         debug: off
         timestamp: on
         logger_subsys {
                 subsys: QUORUM
                 debug: off
         }
 }
 
 quorum {
         provider: corosync_votequorum
         expected_votes: 2
 }

The crypto settings refer to a file /etc/corosync/authkey which must be present on all systems. There is no predefined definition of the cluster: any node can join, which is why the shared authentication key is a good idea. You don't want any unexpected members joining the cluster. The quorum of 2 is, of course, because there are only 3 machines at the moment.

As long as the cluster is quorate, everything should be fine. That means that at any time one of the machines can be maintained, rebooted, etc. without affecting the availability of the storage on the other nodes.

As long as at least one node has the cluster up and running, others should be able to join even if the cluster is not quorate. That means that if only a single node out of three is up, the cluster is no longer quorate and storage queries are blocked; but when another node joins, the cluster is quorate again and should unblock.

==== installation ====

Based on Debian 9.

Install the required packages:

 apt-get install corosync clvm

Set up clustered locking in lvm:

 sed -i 's/^    locking_type = 1$/    locking_type = 3/' /etc/lvm/lvm.conf

Make sure all nodes have the same corosync.conf file and the same authkey. A key can be generated with corosync-keygen.
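For example (a sketch; the node name below is made up, repeat the copy for every cluster member):

 # generate /etc/corosync/authkey once, then copy config and key to the other nodes
 corosync-keygen
 scp /etc/corosync/corosync.conf /etc/corosync/authkey root@blade-14.testbed:/etc/corosync/
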
==== Running ====

Start corosync:

 systemctl start corosync

Test the cluster status with

 corosync-quorumtool -s
 dlm_tool -n ls

This should show all nodes.

Start the iscsi and multipath daemons:

 systemctl start iscsid
 systemctl start multipathd

See if the iscsi paths are visible:

 multipath -ll
 3600000e00d2900000029295000110000 dm-1 FUJITSU,ETERNUS_DXL
 size=2.0T features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
 |-+- policy='service-time 0' prio=50 status=active
 | |- 6:0:0:1 sdi 8:128 active ready running
 | `- 3:0:0:1 sdg 8:96  active ready running
 `-+- policy='service-time 0' prio=10 status=enabled
   |- 4:0:0:1 sdh 8:112 active ready running
   `- 5:0:0:1 sdf 8:80  active ready running
 3600000e00d2900000029295000100000 dm-0 FUJITSU,ETERNUS_DXL
 size=2.0T features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
 |-+- policy='service-time 0' prio=50 status=active
 | |- 4:0:0:0 sdb 8:16  active ready running
 | `- 5:0:0:0 sdc 8:32  active ready running
 `-+- policy='service-time 0' prio=10 status=enabled
   |- 3:0:0:0 sdd 8:48  active ready running
   `- 6:0:0:0 sde 8:64  active ready running

Only then start the clustered lvm:

 systemctl start lvm2-cluster-activation.service

==== Troubleshooting ====

Cluster log messages are found in /var/log/syslog.
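A quick filter for the relevant messages could be (just a suggestion):

 grep -E 'corosync|dlm|lvm' /var/log/syslog | tail -n 50
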
  
== Services ==

There are a number of services maintained in the testbed, most of them running on '''bleek.nikhef.nl'''.
! host(s)
! description
|-
| [[#DHCP|DHCP]]
| bleek
| DHCP is part of dnsmasq
|-
| [[#backup|backup]]
|-
| shared git repositories
|-
| DNS
| bleek.testbed
| DNS is part of dnsmasq
|-
| home directories
| bleek
| NFS exported to hosts in vlan2 and vlan17
|-
| X509 host keys and pre-generated ssh keys
| bleek
| NFS exported directory /var/local/hostkeys
|-
| kickstart
|-
| firstboot
| bleek
|-
| ganglia
| bleek
| Multicast 239.1.1.9; see the Ganglia [http://ploeg.nikhef.nl/ganglia/?c=Experimental web interface] on ploeg.
|-
| Nagios
| bleek
| https://bleek.nikhef.nl:8444/nagios/ (requires authorisation)
|}
  
=== Backup ===

The home directories and other precious data on bleek are backed up to SURFsara with the Tivoli Storage Manager system. To interact with the system, run the dsmj tool.
A home-brew init.d script in /etc/init.d/adsm starts the service. The key is kept in /etc/adsm/TSM.PWD. The backup logs to /var/log/dsmsched.log, and this log is rotated automatically as configured in /opt/tivoli/tsm/client/ba/bin/dsm.sys.

Bleek is also backed up daily to beerput with rsync.

Since bleek has been virtualised, the risk of data loss through a disk failure has been greatly reduced, so maintaining multiple backup strategies is probably no longer so urgent.
  
 
=== Squid cache ===

The virtual machines koji-hub.testbed, koji-builder.testbed and koji-boulder.testbed run automated builds of grid security middleware.

== Hardware ==
  
 
Changes here should probably also go to [[NDPF System Functions]].

! disk
! [http://www.dell.com/support/ service tag]
! Fibre Channel
! location
! remarks
|-style="background-color: #cfc;"
| blade13
| bl0-13
| 70 GB + 1 TB Fibre Channel (shared)
| 5NZWF4J
| yes
| C08 blade13
|
|-style="background-color: #cfc;"
| blade14
| bl0-14
| align="right"|16GB
| Debian 6, KVM
| 70 GB
| 4NZWF4J
| yes
| C08 blade14
|
|-style="background-color: #cfc;"
| melkbus
| bl0-02
| PEM600
| Intel E5450 @3.00GHz
| 2&times;4
| align="right"|32GB
| VMWare ESXi
| 2&times; 320GB SAS disks + 1 TB Fibre Channel (shared)
| 76T974J
| yes
| C08, blade 2
|
|-style="background-color: #ffc;"
| arrone
| 70 GB + 400 GB iSCSI (shared)
| 982MY2J
| no
| C10
|
|-style="background-color: #ffc;"
| aulnes
| 70 GB + 400 GB iSCSI (shared)
| B82MY2J
| no
| C10
|
|-style="background-color: #ffc;"
| toom
| Hardware raid1 2&times;715GB disks
| DC8QG3J
| no
| C10
|
|-style="background-color: #ffc;"
| span
| Hardware raid10 on 4&times;470GB disks (950GB net)
| FP1BL3J
| no
| C10
| plus [[#Squid|squid proxy]]
|-style="color: #444;"
| kudde
| C10
| Contains hardware encryption tokens for robot certificates; managed by Jan Just
|-style="color: #444;"
| storage
| put
| PE2950
| Intel E5150 @ 2.66GHz
| 2&times;2
| align="right"|8GB
| FreeNAS 8.3
| 6&times; 500 GB SATA, raidz (ZFS)
| HMXP93J
| C03
| former garitxako
|- style="color: #444;"
| ent
| &mdash;
| SATA 80GB
| &mdash;
| no
| C24
| OS X box (no virtualisation)
|-style="color: #444;"
| ren
| bleek
| PE1950
| Intel 5150  @ 2.66GHz
| 2&times;2
| align="right"|8GB
| CentOS 5
| software raid1 2&times;500GB disks
| 7Q9NK2J
| no
| C10
| High Availability, dual power supply; former bleek
|}
  
  ipmi-oem -h host.ipmi.nikhef.nl -u username -p password dell get-system-info service-tag

Most machines run [http://www.debian.org/releases/stable/ Debian wheezy] with [http://www.linux-kvm.org/page/Main_Page KVM] for virtualization, managed by [http://libvirt.org/ libvirt].

See [[NDPF_Node_Functions#P4CTB|the official list]] of machines for the most current view.
=== Console access to the hardware ===

In some cases direct ssh access to the hardware may not work anymore (for instance when the gateway host is down). All machines have been configured to have a serial console that can be accessed through IPMI.

* For details, see [[Serial Consoles]]. The setup for Debian squeeze is [[#Serial over LAN for hardware running Debian|slightly different]].
* Access can be done with <code>ipmitool -I lanplus -H name.ipmi.nikhef.nl -U user sol activate</code>.
* SOL access needs to be activated in the BIOS ''once'', by setting console redirection through COM2.

For older systems that do not have a web interface for IPMI, the command-line version can be used. Install the OpenIPMI service so root can use ipmitool. Here is a sample of commands to add a user and give SOL access.

 ipmitool user enable 5
 ipmitool user set name 5 ctb
 ipmitool user set password 5 '<blah>'
 ipmitool channel setaccess 1 5 ipmi=on
 # make the user administrator (4) on channel 1
 ipmitool user priv 5 4 1
 ipmitool channel setaccess 1 5 callin=on ipmi=on link=on
 ipmitool sol payload enable 1 5

==== Serial over LAN for hardware running Debian ====

On Debian squeeze you need to tell grub2 what to do with the kernel command line in the file /etc/default/grub. Add or uncomment the following settings:

 GRUB_CMDLINE_LINUX_DEFAULT=""
 GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
 GRUB_TERMINAL=console
 GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"

Then run '''update-grub'''.
  
 
=== Installing Debian and libvirt on new hardware ===

This just means qemu could not create the domain!
=== Installing Debian on blades with Fiber Channel ===

Although FC support on Debian works fine, using the multipath-tools-boot package is a bit tricky. It updates the initrd to include the multipath libraries and tools, to make them available at boot time. This happened on blade-13; on reboot it was unable to mount the root partition (the message was 'device or resource busy') because the device mapper had somehow taken hold of the SCSI disk. By changing the root=UUID=xxxx stanza in the GRUB menu to root=/dev/dm-2 (this was guess-work) I managed to boot the system. There were several possible remedies for the issue:
# rerun update-grub; this should replace the UUID= with a link to /dev/mapper/xxxx-part1
# blacklist the disk in the device mapper (and rerun mkinitramfs)
# remove the multipath-tools-boot package altogether.

I opted for blacklisting; this is what's in /etc/multipath.conf:
 blacklist {
   wwid 3600508e000000000d6c6de44c0416105
 }
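After changing /etc/multipath.conf the initramfs has to be regenerated (and, if you take the update-grub route instead, the boot configuration as well); on Debian that would be something like:

 # rebuild the initramfs so the blacklist is picked up at boot, and refresh grub
 update-initramfs -u -k all
 update-grub
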
== Migration plans to a cloud infrastructure ==

Previous testbed cloud experiences are [[Agile testbed/Cloud|reported here]].

Currently, using plain libvirt seems to fit most of our needs.