Difference between revisions of "SintMaarten network"


Revision as of 11:33, 24 November 2009

In October 2009 the SintMaarten cluster was commissioned. This cluster is based on HP blades. Soon after commissioning, a serious performance issue was reported:

 [BG-NLT1-Support] #287: bad gridftp transfer rate - smrt wns

This page is the result of the analysis of this performance issue.

Problem report

The issue appeared when copying a file from the Nikhef storage system to a SintMaarten worker node. Transfer speeds were OK at first but dropped to very low levels after about 120 MB of data, eventually causing timeouts in the lcg-cp command used. Copying the exact same file from the exact same storage element to a slightly older worker node did not show this problem:

 ===
 wn-smrt-006 (Bad!)
 ===
 # lcg-cp --vo atlas -v srm://.... file://.....
 [snip]
 # streams: 1
    62914560 bytes   1279.98 KB/sec avg    512.00 KB/sec inst

vs

 ===
 wn-val-066 (Good!)
 ===
 # lcg-cp --vo atlas -v srm://.... file://.....
 [snip]
 # streams: 1
    1672478720 bytes  68053.21 KB/sec avg  70142.84 KB/sec inst
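
To put the two reports above in perspective, the following sketch estimates how long the full 1672478720-byte file would take at each reported average rate. The helper name transfer_seconds is hypothetical, and the code assumes that the tool's "KB" means 1024 bytes; the figures in both reports are consistent with that, but it remains an assumption.

```shell
# Hypothetical helper: estimate the transfer time in whole seconds.
# Arguments: size in bytes, average rate in KB/sec.
# Assumption: 1 KB = 1024 bytes.
transfer_seconds() {
    awk -v bytes="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", bytes / (rate * 1024) }'
}

# Average rate reported on wn-val-066 (good)
transfer_seconds 1672478720 68053.21
# Average rate reported on wn-smrt-006 (bad)
transfer_seconds 1672478720 1279.98
```

Under these assumptions the good node needs roughly 24 seconds and the bad node roughly 21 minutes, and that is before the rate on the bad node drops further.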


Analysis

At first it was thought that the lcg-cp command itself was causing the error:

  • when copying the file using lcg-cp the command timed out after several minutes
  • when copying the exact same file using globus-url-copy the command finished in less than a minute

However when using

 globus-url-copy -nodcau

the command also timed out. It is worth noting that

  • -nodcau means 'no data channel authentication'; it makes file transfers less secure but works better when firewalls are in place
  • lcg-cp also disables data channel authentication

The next step was to rule out the grid middleware altogether. A plain netcat (nc) file transfer also slowed the network to a crawl:

 on wn-smrt-011.farm.nikhef.nl:
 # nc -l 20009 > /tmp/bigfile
 on hooi-ei-09:
 # cat bigfile | nc wn-smrt-011.farm.nikhef.nl 20009

Transfer speeds were initially OK, but after roughly 100 MB they dropped to about 200 KB/s. A very useful tool to see this is Pipe Viewer (pv): http://www.ivarch.com/programs/pv.shtml
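
As a minimal sketch of how Pipe Viewer can be placed in the sending side of the netcat test above: pv copies the file to standard output and prints progress and the current rate on standard error, so the drop in speed becomes visible while the transfer runs. The function name send_with_progress is hypothetical, and the fallback to cat when pv is not installed is an addition for illustration only.

```shell
# Hypothetical helper: write a file to stdout, showing throughput via pv
# (progress goes to stderr). Falls back to plain cat if pv is missing.
send_with_progress() {
    if command -v pv >/dev/null 2>&1; then
        pv "$1"
    else
        cat "$1"
    fi
}

# Usage on the sending host (same host name and port as in the test above):
#   send_with_progress bigfile | nc wn-smrt-011.farm.nikhef.nl 20009
```

The receiving side stays unchanged; only the cat in the sending pipeline is replaced.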

Solution