Difference between revisions of "SintMaarten network"

From PDP/Grid Wiki

Latest revision as of 11:51, 24 November 2009

In October 2009 the SintMaarten cluster was commissioned. This cluster is based on HP blades. Soon after commissioning, a serious performance issue was reported:

 [BG-NLT1-Support] #287: bad gridftp transfer rate - smrt wns

This page is the result of the analysis of this performance issue.

Problem report

The performance issue was seen when copying a file from the Nikhef storage system to a SintMaarten worker node. Transfer speeds were OK at first but dropped to very low levels after about 120 MB of data, eventually causing timeouts in the lcg-cp command used. Copying the exact same file from the exact same storage element to a slightly older worker node did not show this problem:

 ===
 wn-smrt-006 (Bad!)
 ===
 # lcg-cp --vo atlas -v srm://.... file://.....
 [snip]
 # streams: 1
    62914560 bytes   1279.98 KB/sec avg    512.00 KB/sec inst

vs

 ===
 wn-val-066 (Good!)
 ===
 # lcg-cp --vo atlas -v srm://.... file://.....
 [snip]
 # streams: 1
    1672478720 bytes  68053.21 KB/sec avg  70142.84 KB/sec inst
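
For a side-by-side comparison, the average rates in the two progress lines above can be pulled out and converted to MB/s. This is only a sketch: the function name avg_rate_mb is made up here, and it assumes that the progress line has exactly the layout shown above (byte count, then the average rate in KB/sec) and that 1 MB = 1024 KB.

```shell
# avg_rate_mb: hypothetical helper, not part of lcg-cp or globus-url-copy.
# Reads progress lines on stdin and prints the average rate in MB/s
# (assumption: 1 MB = 1024 KB).
avg_rate_mb() {
    awk '$2 == "bytes" && $4 == "KB/sec" { printf "%.2f MB/s\n", $3 / 1024 }'
}

# Example with the two progress lines quoted above:
printf '%s\n' \
    '   62914560 bytes   1279.98 KB/sec avg    512.00 KB/sec inst' \
    ' 1672478720 bytes  68053.21 KB/sec avg  70142.84 KB/sec inst' \
    | avg_rate_mb
```

For the two lines above this gives 1.25 MB/s against 66.46 MB/s, so the older worker node is roughly a factor of 50 faster on average.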


Analysis

At first it was thought that the lcg-cp command itself was causing the error:

  • when copying the file using lcg-cp the command timed out after several minutes
  • when copying the exact same file using globus-url-copy the command finished in less than a minute

However, when using

 globus-url-copy -nodcau

the command also timed out. It is worth noting that

  • -nodcau means 'no data channel authentication'; it makes file transfers less secure but works better when firewalls are in place
  • lcg-cp also disables data channel authentication

The next step was to rule out the grid middleware altogether: by using netcat (nc) to transfer a file we also managed to bring the network speed to a crawl:

 on wn-smrt-011.farm.nikhef.nl:
 # nc -l 20009 > /tmp/bigfile
 on hooi-ei-09:
 # cat bigfile | nc wn-smrt-011.farm.nikhef.nl 20009

Transfer speeds are initially OK but after roughly 100 MB they drop to about 200 KB/s. A very useful tool to see this is Pipe Viewer (http://www.ivarch.com/programs/pv.shtml):

 # cat bigfile | pv -c | nc wn-smrt-011.farm.nikhef.nl 20009

So now we know it's not the (grid) middleware but more likely a networking issue.

The hooi-ei-09 storage element is part of a DPM disk pool. It is a Sun Fire X4500 ('Thumper') with 3 gigabit controllers, which are bonded using 802.3ad into a single interface on a 3Com switch. Each connection to such a configuration will use only a single network controller, hence a single globus-url-copy should max out at roughly 110 MB/s.

The SintMaarten worker nodes have dual Broadcom NetXtreme II 10GE adapters:

 # grep Broadcom /var/log/dmesg
 eth0: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fb000000, IRQ 114, node addr 0017a4770028
 eth1: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fa000000, IRQ 122, node addr 0017a477002a

of which only the first is connected to an Arista switch. The connection speed is 1 Gbps as can be seen using ethtool eth0:

 [root@wn-smrt-011 ~]# ethtool eth0
 Settings for eth0:
       Supported ports: [ FIBRE ]
       Supported link modes:   1000baseT/Full
                               2500baseX/Full
       Supports auto-negotiation: Yes
       Advertised link modes:  1000baseT/Full
                               2500baseX/Full
                               10000baseT/Full
       Advertised auto-negotiation: Yes
       Speed: 1000Mb/s
       Duplex: Full
       Port: FIBRE
       PHYAD: 1
       Transceiver: internal
       Auto-negotiation: on
       Supports Wake-on: g
       Wake-on: g
       Current message level: 0x00000000 (0)
       Link detected: yes
 [root@wn-smrt-011 ~]# ethtool eth1 | grep Link
       Link detected: no
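
With the link negotiated at 1 Gbps, the ceiling for a single TCP stream can be estimated from the Speed line in the output above. The sketch below is an illustration only: link_ceiling is a made-up helper name, and the estimate assumes standard 1500-byte frames without TCP options (1460 payload bytes for every 1538 bytes on the wire) and counts 1 MB as 10^6 bytes.

```shell
# link_ceiling: hypothetical helper. Reads `ethtool <iface>` output on stdin,
# takes the "Speed: <n>Mb/s" line and prints an approximate ceiling for TCP
# payload in MB/s (10^6 bytes), assuming standard 1500-byte frames:
#   payload  = 1500 MTU - 20 IP - 20 TCP                      = 1460 bytes
#   on wire  = 1500 + 14 header + 4 FCS + 8 preamble + 12 gap = 1538 bytes
link_ceiling() {
    awk '$1 == "Speed:" {
        mbit = $2 + 0    # "1000Mb/s" -> 1000
        printf "%d Mb/s link, about %.0f MB/s of TCP payload\n", mbit, mbit / 8 * 1460 / 1538
    }'
}

# On the worker node this would be:  ethtool eth0 | link_ceiling
# Example using the Speed line from the output above:
echo '       Speed: 1000Mb/s' | link_ceiling
```

The result, about 119 MB/s, is a theoretical upper bound; the figure of roughly 110 MB/s mentioned earlier is a more conservative practical value. Either way, the observed 200 KB/s is far below what the link allows.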

Several things were tried to see if they had any effect:

  • remove channel bonding on the Sun box: no effect
  • disable checksum offloading, scatter-gather and TCP segmentation offload on the SintMaarten worker node (ethtool -K eth0 rx off tx off sg off tso off): no effect
  • tweak the Linux kernel /proc/sys/net/*/*{rmem,wmem} parameters: no effect
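
Before and after tweaking it helps to record the values actually in effect. The current offload settings can be listed with ethtool -k eth0 (lowercase k). For the kernel buffer parameters, the sketch below prints every rmem/wmem entry under /proc/sys/net; show_net_buffers is a hypothetical helper, and it uses a slightly wider pattern than the one quoted above so that the net.core limits are listed as well.

```shell
# show_net_buffers: hypothetical helper that prints the current value of
# every socket-buffer sysctl file below a given root (default /proc/sys/net).
show_net_buffers() {
    root=${1:-/proc/sys/net}
    for f in "$root"/*/*rmem* "$root"/*/*wmem*; do
        # An unmatched glob stays a literal pattern; -r filters it out.
        if [ -r "$f" ]; then
            printf '%s = %s\n' "${f#"$root"/}" "$(tr '\t' ' ' < "$f")"
        fi
    done
}

show_net_buffers
```

Saving this output before each change makes it easy to restore the original values once a tweak turns out to have no effect.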

At this point we went for the 'wild guess': it's a network adapter/driver issue with the Broadcom NetXtreme II adapters. Let's see which parameters we can pass to the driver for this card:

 # strings /lib/modules/2.6.18-164.6.1.el5/kernel/drivers/net/bnx2x.ko | grep parm
 parm=debug: Default debug msglevel
 parmtype=debug:int
 parm=mrrs: Force Max Read Req Size (0..3) (for debug)
 parmtype=mrrs:int
 parm=poll: Use polling (for debug)
 parmtype=poll:int
 parm=int_mode: Force interrupt mode (1 INT#x; 2 MSI)
 parmtype=int_mode:int
 parm=disable_tpa: Disable the TPA (LRO) feature
 parmtype=disable_tpa:int
 parm=multi_mode: Use per-CPU queues
 parmtype=multi_mode:int
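
The parm= lines are easier to scan when the parmtype= lines are filtered out. A minimal sketch, assuming the strings output has the layout shown above; list_parms is a made-up name. The modinfo -p bnx2x command should give much the same information directly from the module.

```shell
# list_parms: hypothetical helper. Reads `strings <module>.ko` output on stdin
# and prints one "name - description" line per module parameter, skipping
# the parmtype= lines.
list_parms() {
    sed -n 's/^ *parm=\([^:]*\): *\(.*\)$/\1 - \2/p'
}

# On the worker node this would be:
#   strings /lib/modules/$(uname -r)/kernel/drivers/net/bnx2x.ko | list_parms
# Example with two of the lines shown above:
printf '%s\n' \
    ' parm=disable_tpa: Disable the TPA (LRO) feature' \
    ' parmtype=disable_tpa:int' \
    | list_parms
```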

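Hmmm, what is "TPA (LRO)" and why would I want to disable it? LRO is large receive offload, where the adapter merges incoming TCP segments into larger packets before handing them to the kernel. A web search turned up a report suggesting that it needs to be disabled: Bugzilla Report (https://bugzilla.redhat.com/show_bug.cgi?id=518531)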

Solution

By adding a file /etc/modprobe.d/network with contents

 # cat /etc/modprobe.d/network
 options bnx2x disable_tpa=1

and rebooting the worker node, we now see that transfers are consistently in the 60 MB/s range, both with nc and lcg-cp.
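
When the same fix has to be rolled out to more worker nodes, it is handy to check that the option is really present in the module configuration. The following is a sketch under assumptions: tpa_disabled_in is a hypothetical helper, and the two file names checked are the one used above plus /etc/modprobe.conf.

```shell
# tpa_disabled_in: hypothetical helper. Succeeds if the given modprobe
# configuration file has an active (uncommented) "options bnx2x" line
# that sets disable_tpa=1.
tpa_disabled_in() {
    grep -Eq '^[[:space:]]*options[[:space:]]+bnx2x([[:space:]]+[^#[:space:]]+)*[[:space:]]+disable_tpa=1([[:space:]]|$)' "$1"
}

# Check the file used above and the main modprobe configuration file.
for f in /etc/modprobe.d/network /etc/modprobe.conf; do
    if [ -r "$f" ] && tpa_disabled_in "$f"; then
        echo "disable_tpa=1 is set in $f"
    fi
done
```

This only checks the configuration file, not the running driver; the option takes effect when the bnx2x module is next loaded, hence the reboot. On newer distributions, files in /etc/modprobe.d generally need a .conf suffix to be read.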