SintMaarten network
Latest revision as of 11:51, 24 November 2009
In October 2009 the SintMaarten cluster was commissioned. This cluster is based on HP blades. Soon after commissioning a serious performance issue was reported:
[BG-NLT1-Support] #287: bad gridftp transfer rate - smrt wns
This page is the result of the analysis of this performance issue.
Problem report
The performance issue was seen when copying a file from the Nikhef storage system to a SintMaarten worker node. Transfer speeds were OK at first but dropped to very low levels after about 120 Mb of data, eventually causing timeouts in the lcg-cp command used. Copying the exact same file from the exact same storage element to a slightly older worker node did not show this problem:
    === wn-smrt-006 (Bad!) ===
    # lcg-cp --vo atlas -v srm://.... file://.....
    [snip]
    # streams: 1
    62914560 bytes 1279.98 KB/sec avg 512.00 KB/sec inst
vs
    === wn-val-066 (Good!) ===
    # lcg-cp --vo atlas -v srm://.... file://.....
    [snip]
    # streams: 1
    1672478720 bytes 68053.21 KB/sec avg 70142.84 KB/sec inst
Analysis
At first it was thought that the lcg-cp command itself was causing the error:
- when copying the file using lcg-cp the command timed out after several minutes
- when copying the exact same file using globus-url-copy the command finished in less than a minute
However, when using
    globus-url-copy -nodcau
the command also timed out. It is worth noting that
- -nodcau means 'no data channel authentication'; it makes file transfers less secure but works better when firewalls are in place
- lcg-cp also disables data channel authentication
The next step was to rule out the grid middleware altogether: by using netcat (nc) to transfer a file we also managed to bring the network speed to a crawl:
on wn-smrt-011.farm.nikhef.nl:
    # nc -l 20009 > /tmp/bigfile
on hooi-ei-09:
    # cat bigfile | nc wn-smrt-011.farm.nikhef.nl 20009
Transfer speeds are initially OK, but after roughly 100 Mb they drop to about 200 KB/s. A very useful tool to see this is PipeViewer (http://www.ivarch.com/programs/pv.shtml):
    # cat bigfile | pv -c | nc wn-smrt-011.farm.nikhef.nl 20009
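Where pv is not installed, a similar picture can be obtained on the receiving node by sampling the size of the file that nc is writing. The following is only a minimal sketch: the helper names rate_kbs and watch_file are made up for illustration, and it assumes bash and GNU stat.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average rate in KB/s between two byte counts that
# were sampled `interval` seconds apart (integer maths, 1 KB = 1024 bytes).
rate_kbs() {
    local before=$1 after=$2 interval=$3
    echo $(( (after - before) / 1024 / interval ))
}

# Hypothetical helper: print the rate at which a file grows, once per
# interval, until it stops growing. Assumes GNU stat (-c %s prints the size).
watch_file() {
    local file=$1 interval=${2:-5} prev cur
    prev=$(stat -c %s "$file")
    while sleep "$interval"; do
        cur=$(stat -c %s "$file")
        [ "$cur" -eq "$prev" ] && break
        echo "$(rate_kbs "$prev" "$cur" "$interval") KB/s"
        prev=$cur
    done
}

# Example, run on the receiving worker node while nc is writing:
#   watch_file /tmp/bigfile 5
```

With a healthy link the printed rate should stay roughly constant; the problem described here would show up as a sharp drop after the first samples.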
So now we know it's not the (grid) middleware but more likely a networking issue.
The hooi-ei-09 storage element is part of a DPM disk pool. It is a Sun XFire ('Thumper') with 3 gigabit controllers which are bonded using 802.3ad into a single interface using a 3Com switch. Each connection to such a configuration will use only a single network controller, hence a single globus-url-copy should max out at roughly 110 MB/s.
The SintMaarten worker nodes have dual Broadcom NetXtreme II 10GE adapters:
    # grep Broadcom /var/log/dmesg
    eth0: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fb000000, IRQ 114, node addr 0017a4770028
    eth1: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fa000000, IRQ 122, node addr 0017a477002a
of which only the first is connected to an Arista switch. The connection speed is 1 Gbps as can be seen using ethtool eth0:
    [root@wn-smrt-011 ~]# ethtool eth0
    Settings for eth0:
            Supported ports: [ FIBRE ]
            Supported link modes:   1000baseT/Full
                                    2500baseX/Full
            Supports auto-negotiation: Yes
            Advertised link modes:  1000baseT/Full
                                    2500baseX/Full
                                    10000baseT/Full
            Advertised auto-negotiation: Yes
            Speed: 1000Mb/s
            Duplex: Full
            Port: FIBRE
            PHYAD: 1
            Transceiver: internal
            Auto-negotiation: on
            Supports Wake-on: g
            Wake-on: g
            Current message level: 0x00000000 (0)
            Link detected: yes
    [root@wn-smrt-011 ~]# ethtool eth1 | grep Link
            Link detected: no
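Both ends of the transfer are thus limited to a single gigabit link per connection. As a rough sanity check of the 110 MB/s figure mentioned earlier, the negotiated speed can be turned into a raw byte-rate ceiling. The sketch below assumes the ethtool output format shown above; link_summary is a hypothetical helper name, and the ceiling it prints ignores all Ethernet, IP and TCP overhead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ethtool <iface>` output on stdin and print the
# link state plus the raw ceiling in MB/s (Mb/s divided by 8, no overhead).
link_summary() {
    awk '
        /Speed:/         { sub(/Mb\/s/, "", $2); speed = $2 }
        /Link detected:/ { link = $3 }
        END {
            if (link == "yes" && speed + 0 > 0)
                printf "link up, %d Mb/s, raw ceiling %.0f MB/s\n", speed, speed / 8
            else
                print "link down"
        }'
}

# Example:
#   ethtool eth0 | link_summary
```

For the eth0 output above this gives a raw ceiling of 125 MB/s; once protocol overhead is subtracted, a single stream topping out at roughly 110 MB/s is plausible.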
Several things were tried to see if they had any effect:
- remove channel bonding on the Sun box: no effect
- disable checksum offloading, scatter-gather and TCP segmentation offload on the SintMaarten worker node (ethtool -K eth0 rx off tx off sg off tso off): no effect
- tweak the Linux kernel /proc/sys/net/*/*{rmem,wmem} parameters: no effect
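Before changing settings like these it helps to record the current values, so that they can be restored afterwards. The current offload settings can be listed with ethtool -k eth0 (lower-case k). For the kernel buffer parameters, the sketch below prints every file matched by the glob from the list above; show_tcp_buffers is a hypothetical helper name, and the optional base directory argument exists only so that the function can be tried outside /proc.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each file matching <base>/*/*rmem and
# <base>/*/*wmem together with its current value.
show_tcp_buffers() {
    local base=${1:-/proc/sys/net} f
    for f in "$base"/*/*rmem "$base"/*/*wmem; do
        [ -r "$f" ] || continue   # skip unmatched globs and unreadable files
        printf '%s = %s\n' "$f" "$(tr '\t' ' ' < "$f")"
    done
}

# Example:
#   show_tcp_buffers > /root/tcp-buffers.before
```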
At this point we went for the 'wild guess': it's a network adapter/driver issue with the Broadcom NetXtreme II adapters. Let's see which parameters we can pass to the driver for this card:
    # strings /lib/modules/2.6.18-164.6.1.el5/kernel/drivers/net/bnx2x.ko | grep parm
    parm=debug: Default debug msglevel
    parmtype=debug:int
    parm=mrrs: Force Max Read Req Size (0..3) (for debug)
    parmtype=mrrs:int
    parm=poll: Use polling (for debug)
    parmtype=poll:int
    parm=int_mode: Force interrupt mode (1 INT#x; 2 MSI)
    parmtype=int_mode:int
    parm=disable_tpa: Disable the TPA (LRO) feature
    parmtype=disable_tpa:int
    parm=multi_mode: Use per-CPU queues
    parmtype=multi_mode:int
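The parm= and parmtype= lines are easier to scan once reduced to one line per parameter. The following is a minimal sketch that filters output of the kind shown above; list_module_params is a hypothetical helper name. On systems where it is available, modinfo -p bnx2x should give a comparable listing directly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `strings <module>.ko` output on stdin and print
# one "name  description" line per parm= entry, skipping parmtype= lines.
list_module_params() {
    awk '/^parm=/ {
        line  = substr($0, 6)            # drop the leading "parm="
        colon = index(line, ":")
        name  = substr(line, 1, colon - 1)
        desc  = substr(line, colon + 1)
        sub(/^ +/, "", desc)             # trim blanks after the colon
        printf "%-12s %s\n", name, desc
    }'
}

# Example:
#   strings /lib/modules/$(uname -r)/kernel/drivers/net/bnx2x.ko | list_module_params
```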
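Hmmm, what is "TPA (LRO)" and why would I want to disable it? TPA (Transparent Packet Aggregation) is the bnx2x driver's name for LRO (Large Receive Offload), which merges incoming TCP segments into larger ones before they reach the network stack. A web search turned up a report suggesting that it needs to be disabled: Bugzilla Report (https://bugzilla.redhat.com/show_bug.cgi?id=518531)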
Solution
By adding a file /etc/modprobe.d/network with contents
    # cat /etc/modprobe.d/network
    options bnx2x disable_tpa=1
and rebooting the worker node, we now see that transfers are consistently in the 60 MB/s range, both with nc and lcg-cp.
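When the fix has to be rolled out to many worker nodes, it is useful to be able to check whether the option is already present on a given node. The following is a minimal sketch of such a check; tpa_disabled_in is a hypothetical helper name, and the check only inspects the configuration file, not the state of the loaded driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given modprobe configuration file has
# an active (non-comment) "options bnx2x ... disable_tpa=1" line.
tpa_disabled_in() {
    grep -Eq '^[[:space:]]*options[[:space:]]+bnx2x[[:space:]](.*[[:space:]])?disable_tpa=1([[:space:]]|$)' "$1"
}

# Example:
#   tpa_disabled_in /etc/modprobe.d/network && echo "TPA disabled in config"
```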