From PDP/Grid Wiki

This article describes the setup of the DPM disk servers hooikar and hooiwagen. These disk servers export file systems which are mounted via iSCSI from the Sun X4500 hosts (hooi-ei-*).


The Sun X4500 hosts were originally installed with the operating system CentOS-4 x86-64 and the gLite 3.1 version of DPM. The gLite project dropped support for the gLite 3.1 version of DPM in the course of 2011, so existing installations had to be upgraded to the gLite 3.2 version of DPM under EL5.

Unfortunately, the Sun X4500 series is only certified for use with RHEL4; the combination with RHEL5 is explicitly not supported by the vendor.

Although aware of this support issue, we installed CentOS-5 x86-64 on a test machine and started testing. The results were disappointing. When used under modest load (using rsync to copy from another host, a single-threaded operation), the test machine would halt with a kernel panic in ~6 hours. Leaving the test machine without any load would lead to a kernel panic in a week or so.

These findings and the lack of vendor support made us look for another way to serve the ~280 TB of disk space with the gLite 3.2 version of DPM.

iSCSI-based setup

The new setup keeps the CentOS-4 x86-64 installation on the Sun X4500 hosts and uses them as iSCSI target hosts. Each of these hosts exports 20 TB of disk space as a single block device via iSCSI.

Two new hosts come into play: hooikar and hooiwagen. They are Dell R610 hosts with dual Myricom 10 Gbps network cards, installed with CentOS-5 x86-64 and the gLite 3.2 version of the DPM disk server middleware. They are configured as iSCSI clients and connected to the Sun X4500 hosts via a private network.

Each of the 2 disk servers mounts block devices from 7 X4500 hosts and exports them to DPM. (This raises an interesting issue with DPM's file selection mechanism; see below.)

Configuration of the iscsi-target

Download the RPM source package from http://www.cryptoforge.net/iscsi/RPMS, maintained by Bastiaan Bakker. As of this writing, use version 0.4.12; version 0.4.13 fails with the updated kernel.

Use the "-6" release and rebuild it on an RHEL4 system with the kernel module build source RPM installed.

The result for kernel 2.6.9-89.0.16.EL is at:


alongside the source.

Install the kernel module RPMs, run "depmod -a", and load the module with "modprobe iscsi_trgt".

On the Sun X4500 hosts, create 4 physical volumes and combine them in a single volume group "data" (the script mkdataordered handles that). Then, create a logical volume in the volume group. This volume is the block device that is exported.
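The volume setup can be sketched as follows. The member devices, LV name, and sizing below are assumptions for illustration; the actual layout on our hosts is handled by the local mkdataordered script.

```shell
#!/bin/bash
# Hypothetical sketch of the volume setup done by mkdataordered:
# combine 4 physical volumes into volume group "data", then carve one
# logical volume to export as a block device. Device names are examples.
set -e
DEVICES="/dev/sdb /dev/sdc /dev/sdd /dev/sde"   # assumed: 4 member disks
VG=data
LV=ei13                                         # matches the ietd.conf example

run() {
    # With DRYRUN=1 (the default here) only print the plan; set DRYRUN=0
    # on a real X4500 to actually execute the commands.
    if [ "${DRYRUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

for dev in $DEVICES; do
    run pvcreate "$dev"                  # initialize each physical volume
done
run vgcreate "$VG" $DEVICES              # combine them into volume group "data"
run lvcreate -l 100%FREE -n "$LV" "$VG"  # one LV spanning all free extents
```

Run once with the default dry-run setting to review the plan before executing.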

Reference that block device in the configuration file /etc/ietd.conf, which contains for example:

 Target iqn.2010-11.nl.nikhef:opnstorage.easteregg.host-hooi-ei-13.lun1
       IncomingUser username twelvecharss
       Lun 0 Path=/dev/data/ei13,Type=fileio
       Alias iDISK0

In this case the password is sent in the clear over the network, so this setup must only be used on a private SAN network!

Finally, (re)start the service:

  /etc/init.d/iscsi-target restart

Configuration of the iSCSI client

Make sure the iscsi-initiator-utils package is installed.

Static configuration of the client

Create (or edit) the /etc/iscsi/iscsid.conf file with at least

node.session.auth.username = username
node.session.auth.password = twelvecharss
discovery.sendtargets.auth.username = username
discovery.sendtargets.auth.password = twelvecharss
node.startup = automatic

and then restart the service:

/etc/init.d/iscsi restart

This WILL give an error:

 Setting up iSCSI targets: iscsiadm: No records found!

Get the iSCSI initiator name; a fresh one can be generated with

 iscsi-iname

and set it in /etc/iscsi/initiatorname.iscsi:

 InitiatorName=<the generated name>
Discover targets

Then it is time to discover the targets. Since there are potentially 14 targets, repeat the following steps one target at a time; otherwise it becomes very hard to tell which target ends up on which device.

To discover the target at IP address a.b.c.d:

 iscsiadm -m discovery --type=st --portal=a.b.c.d

and restart the service.

 # /etc/init.d/iscsi restart
 Stopping iSCSI daemon:
 iscsid dead but pid file exists                            [  OK  ]
 Starting iSCSI daemon:                                     [  OK  ]
                                                            [  OK  ]
 Setting up iSCSI targets: Logging in to [iface: default, target:
 iqn.2010-11.nl.nikhef:opnstorage.easteregg.hooikoorts.lun1, portal: a.b.c.d,3260]

and dmesg will report where the device is connected:

Loading iSCSI transport class v2.0-871.
cxgb3i: tag itt 0x1fff, 13 bits, age 0xf, 4 bits.
iscsi: registered transport (cxgb3i)
Broadcom NetXtreme II CNIC Driver cnic v2.1.0 (Oct 10, 2009)
Broadcom NetXtreme II iSCSI Driver bnx2i v2.1.0 (Dec 06, 2009)
iscsi: registered transport (bnx2i)
iscsi: registered transport (tcp)
iscsi: registered transport (iser)
iscsi: registered transport (be2iscsi)
scsi48 : iSCSI Initiator over TCP/IP
  Vendor: IET       Model: VIRTUAL-DISK      Rev: 0
  Type:   Direct-Access                      ANSI SCSI revision: 04
SCSI device sdaw: 125829120 512-byte hdwr sectors (64425 MB)
sdaw: Write Protect is off
sdaw: Mode Sense: 77 00 00 08
SCSI device sdaw: drive cache: write back
SCSI device sdaw: 125829120 512-byte hdwr sectors (64425 MB)
sdaw: Write Protect is off
sdaw: Mode Sense: 77 00 00 08
SCSI device sdaw: drive cache: write back
 sdaw: unknown partition table
sd 48:0:0:0: Attached scsi disk sdaw
sd 48:0:0:0: Attached scsi generic sg48 type 0

Then continue with the next target.
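The discovery steps above can be collected in a small helper that handles one portal at a time. This is a sketch; run it as root on the disk server, and note that the example portal address is hypothetical.

```shell
#!/bin/bash
# Sketch: discover and log in to one iSCSI target at a time, then show
# the recent kernel messages so the new /dev/sdX assignment can be
# recorded before moving on to the next target.

discover_target() {
    local portal=$1
    iscsiadm -m discovery --type=st --portal="$portal"
    /etc/init.d/iscsi restart
    sleep 5                       # give the new session time to log in
    dmesg | tail -n 20            # note which sdX device was attached
}

# Example (hypothetical private-SAN address):
#   discover_target 192.168.1.13
```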

Show connections to local block device

To find out which remote iSCSI block device is attached to which local block device, use the following command:

 # iscsiadm -m session -P 3 | egrep '(^Target|Attached scsi)'
 Target: iqn.2010-11.nl.nikhef:opnstorage.easteregg.host-hooi-ei-12.lun1
                         Attached scsi disk sdj          State: running
 Target: iqn.2010-11.nl.nikhef:opnstorage.easteregg.host-hooi-ei-10.lun1
                         Attached scsi disk sdm          State: running
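If the mapping is needed in scripts, the same information can be extracted with a small awk filter. This is a sketch that parses the `iscsiadm -m session -P 3` output shown above:

```shell
#!/bin/bash
# Print "target-IQN device" pairs from `iscsiadm -m session -P 3` output.
map_sessions() {
    awk '/Target:/            {t=$2}
         /Attached scsi disk/ {print t, $4}'
}

# Typical use on a disk server:
#   iscsiadm -m session -P 3 | map_sessions
```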

Create file systems and mount points

Create a file system on each local block device. It is a good idea to add a label to the file system via which it can be identified:

 # mkfs.xfs -L ei10-atlas /dev/sdm

because the local mapping will change when remounting.

Finally, add mount points to /etc/fstab:

 LABEL=ei10-atlas        /export/data/atlprd10   xfs     _netdev         1 3
 LABEL=ei12-atlas        /export/data/atlprd12   xfs     _netdev         1 3

which mount the file systems by their labels. Since the file systems can only be mounted when the network is up and running, the mount option "_netdev" is required.

If needed, create the local mount points:

 # mkdir /export/data/atlprd10

before mounting the file systems. When all file systems are mounted, they can be added to a DPM pool.
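Adding the mounted file systems to a pool can be sketched as below. The pool name "atlaspool" is hypothetical, and the dpm-addfs options should be checked against the installed DPM version before use.

```shell
#!/bin/bash
# Sketch: register each mounted file system with a DPM pool.
# Pool name and mount points are examples; verify dpm-addfs flags
# against the installed DPM release.

add_filesystems() {
    local pool=$1 server=$2; shift 2
    local fs
    for fs in "$@"; do
        dpm-addfs --poolname "$pool" --server "$server" --fs "$fs"
    done
}

# Example:
#   add_filesystems atlaspool hooikar.nikhef.nl \
#       /export/data/atlprd10 /export/data/atlprd12
```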

Remove a target

If, for some reason, the iSCSI-target with IP=a.b.c.d should be removed from the client database, use the following command:

 iscsiadm -m discoverydb -p a.b.c.d:3260 -t sendtargets -o delete

Note on DPM's file selection mechanism

The above setup was used to add the 14 file systems behind 2 disk servers to one pool:

 hooikar.nikhef.nl /export/data/atlprd01 CAPACITY 19.99T FREE 3.12T ( 15.6%) 
 hooikar.nikhef.nl /export/data/atlprd07 CAPACITY 20.00T FREE 3.11T ( 15.5%) RDONLY
 hooimaand-01.nikhef.nl /export/data/atlasprd CAPACITY 69.50T FREE 9.89T ( 14.2%)
 hooimaand-12.nikhef.nl /export/data/atlasprd CAPACITY 64.01T FREE 27.19T ( 42.5%)
 hooiwagen.nikhef.nl /export/data/atlprd08 CAPACITY 20.00T FREE 19.90T ( 99.5%) 
 hooiwagen.nikhef.nl /export/data/atlprd14 CAPACITY 20.00T FREE 19.89T ( 99.4%) RDONLY

At least up to DPM version 1.8.0, the DPM head node selects the file system in a pool for storing a new file by means of a round-robin mechanism, so every file system receives the same number of files. With a setup as described above, where one disk server exports 7 file systems to the same disk pool, the disk servers hooikar and hooiwagen receive 7 times as many files as disk servers that contribute a single file system. That may cause a high load on hooikar and hooiwagen when lots of data are written and read simultaneously.

In future versions of DPM, a weight can be assigned to a file system to distribute the load. For the short term, we prevent overloading of the disk servers by making 1 of the 7 file systems on each host writable and keeping the other 6 read-only. A cron job that runs every 15 minutes selects the file system with the most free space to be writable (and makes the other 6 read-only).
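A minimal sketch of such a rotation job is shown below. The server name, the mount point list, and the dpm-modifyfs options are assumptions for illustration and should be verified against the installed DPM version.

```shell
#!/bin/bash
# Sketch of the 15-minute cron job: make the file system with the most
# free space writable and set the other six read-only on one disk server.

SERVER=hooikar.nikhef.nl
FSLIST="/export/data/atlprd01 /export/data/atlprd02"   # ... through atlprd07

pick_max() {
    # stdin: "<free-kB> <path>" per line; print the path with the most free space
    sort -rn | awk 'NR==1 {print $2}'
}

rotate() {
    local writable fs st
    writable=$(for fs in $FSLIST; do
                   df -P "$fs" | awk -v mp="$fs" 'NR==2 {print $4, mp}'
               done | pick_max)
    for fs in $FSLIST; do
        # assumed dpm-modifyfs status values: 0 = writable, RDONLY = read-only
        if [ "$fs" = "$writable" ]; then st=0; else st=RDONLY; fi
        dpm-modifyfs --server "$SERVER" --fs "$fs" --st "$st"
    done
}

# Invoked from cron every 15 minutes, e.g.:
#   */15 * * * * root /usr/local/sbin/rotate-writable-fs
```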