100g Storage Box
Latest revision as of 21:57, 1 February 2021

This setup was presented at HEPiX Spring 2018. The idea behind it is to build an efficient, fast, dense, and affordable 100Gbit/s storage solution.

Hardware shopping list

  • 1x IBM S822L or S922L with dual-socket 8-core CPUs and 128GB RAM
  • 1x Mellanox MCX516A-CDAT or Chelsio T62100-LP-CR (choose your favorite)
  • 2x LSI 9405W quad-port SAS HBA
  • 2x Seagate AssuredSAN 4004 with 8TB drives
  • 8x SAS3 cables

As an alternative, you can also use an IBM LC922 with dual-socket 16-core CPUs and 128GB RAM.

Configuration and tuning

We used Ubuntu 18.04 for this setup. If you want to use another distribution, make sure that the kernel is new enough and supports the hardware.

Make sure that the system is installed before connecting it to the AssuredSAN boxes.

Physical setup

It's important that the HBAs are connected to the same CPU. For the S822L, I advise using slots C3 and C5. For the S922L, I advise using slots C3 and C4.

Make sure that the server is booted before connecting the SAS cables. Use the shortest SAS3 cables that still fit between the server and the AssuredSAN boxes.

Only use ports 0 and 2 on the AssuredSAN boxes. It doesn't matter how you connect the SAS cables to the HBAs.
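Once the cables are in, the negotiated SAS link rates can be read from sysfs. A quick sketch using the Linux SAS transport class; on this hardware every phy should report 12.0 Gbit:

```shell
# Print the negotiated link rate of every SAS phy via the sysfs SAS
# transport class; on a machine without SAS HBAs it reports that none
# were found instead.
found=0
for phy in /sys/class/sas_phy/phy-*; do
  [ -e "$phy" ] || continue
  found=1
  printf '%s: %s\n' "${phy##*/}" "$(cat "$phy/negotiated_linkrate")"
done
[ "$found" = "1" ] || echo "no SAS phys found"
```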

Place the 100Gbit/s card in slot C6 on the S822L and slot C9 on the S922L. Don't forget to install the drivers from the vendor: they give you more offloading than the standard driver delivered with the distribution.

Configure the AssuredSAN

For how to do the configuration, I refer you to the manual.

The raidsets need to be configured like this:

  • 14 drives per RAID6
  • Read ahead set to 32MB
  • Divide the raidsets over both controllers
  • Divide the raidsets over the SAS ports of each controller
  • Map each raidset to one port
  • Only use ports 0 and 2 on each controller

Don't use lun 0 for mapping the raidsets! Lun 0 is used for the control channel between the server and the AssuredSAN.

Configure the server

Install the following packages: multipath-tools sg3-utils mdadm
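On Ubuntu 18.04 (as used above), all three come straight from the standard repositories:

```shell
# Install the multipath, SCSI utility, and software-RAID tooling
# (standard Ubuntu 18.04 packages).
apt install -y multipath-tools sg3-utils mdadm
```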

Configure multipath

Place this piece of configuration in the file /etc/multipath.conf:

defaults {
        user_friendly_names     yes
        find_multipaths         yes
        path_grouping_policy    group_by_prio
        path_checker            tur
        path_selector           "round-robin 0"
        prio                    "alua"
        rr_weight               uniform
        failback                immediate
        no_path_retry           18
        rr_min_io               100
}
blacklist {
        devnode "^(ram|sda|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
}
devices {
        device {
                vendor                  "DotHill"
                product                 "DH4554"
                path_grouping_policy    group_by_prio
                path_checker            tur
                path_selector           "round-robin 0"
                prio                    "alua"
                rr_weight               priorities
                failback                immediate
                hardware_handler        "1 alua"
                no_path_retry           18
                rr_min_io               100
        }
}

To start multipath, run: multipath

You should see something like this:

mpathe (3600c0ff00029accca7c8a95a01000000) dm-2 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 3:0:2:1 sdd 8:48  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:1:1 sdj 8:144 active ready running
mpathd (3600c0ff00029394ca8c0cb5901000000) dm-5 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 3:0:3:3 sdg 8:96  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:2:3 sdm 8:192 active ready running
mpathc (3600c0ff0002939299dc0cb5901000000) dm-4 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 4:0:2:2 sdl 8:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 3:0:3:2 sdf 8:80  active ready running
mpathb (3600c0ff00029394c07c0cb5901000000) dm-0 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 3:0:1:1 sdb 8:16  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:3:1 sdn 8:208 active ready running
mpatha (3600c0ff000293929fcbfcb5901000000) dm-1 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 4:0:3:4 sdo 8:224 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 3:0:1:4 sdc 8:32  active ready running
mpathh (3600c0ff00029ad379ac8a95a01000000) dm-3 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 4:0:1:4 sdk 8:160 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 3:0:2:4 sde 8:64  active ready running
mpathg (3600c0ff00029acccbac8a95a01000000) dm-7 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 3:0:4:3 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:4:3 sdq 65:0  active ready running
mpathf (3600c0ff00029ad37b2c8a95a01000000) dm-6 DotHill,DH4554
size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 4:0:4:2 sdp 8:240 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 3:0:4:2 sdh 8:112 active ready running

If you get an error, it could be that the multipath module isn't loaded. In that case, run: modprobe dm_multipath

It is also important to check that the SAS paths are balanced, as in the example above. If they are not, check all the cables, check that the SAS luns are running at 12Gbps, and check that the mapping is neatly distributed over the SAS ports.
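To check the balance without counting by hand, a small helper can tally the active paths per SCSI host from the multipath output. A sketch: `path_balance` is just an illustrative name, and it assumes the output layout shown above, where the active path line directly follows the `status=active` group line.

```shell
# Tally active paths per SCSI host (the first field of H:C:T:L) from
# `multipath -ll` output read on stdin; with two HBAs, the two hosts
# should end up with the same count when the paths are balanced.
path_balance() {
  awk '/status=active/ { getline; split($3, a, ":"); count[a[1]]++ }
       END { for (h in count) print "host " h ": " count[h] " active paths" }'
}
# Usage: multipath -ll | path_balance
```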

Running "rescan-scsi-bus --forcerescan" scans for new block devices. This command also helps to find all the SAS luns.

Configure the RAID0

Let's create the RAID0 to fuse all the luns into one raidset, which gives us the combined performance. To demonstrate the performance, we place a simple XFS filesystem directly on top of the raidset. You can also add LVM if you want to slice it into multiple pieces.

mdadm --create md6 --chunk=16M --level 0 --raid-devices 8 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 /dev/dm-4 /dev/dm-5 /dev/dm-6 /dev/dm-7
mkfs.xfs -f /dev/md6
mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k,swalloc,largeio,inode64 /dev/md6 /mnt
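The commands above build and mount the array for the current boot only. A sketch of making it persistent across reboots, assuming the Ubuntu 18.04 default paths:

```shell
# Record the array in mdadm.conf so it is assembled at boot, refresh
# the initramfs, and add a matching fstab entry for the mount.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
echo '/dev/md6 /mnt xfs noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k,swalloc,largeio,inode64 0 0' >> /etc/fstab
```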

If everything went well, you should now have roughly 700TiB of storage mounted on /mnt.
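As a sanity check, the expected RAID0 capacity is simply the number of member luns times the lun size (eight luns of about 87TiB each, per the multipath output above):

```shell
# 8 member luns of ~87 TiB each in a RAID0 -> roughly 700 TiB usable.
luns=8
size_tib=87
echo "$((luns * size_tib)) TiB"   # prints: 696 TiB
```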

Tuning parameters

If you want to go faster, you can do the following things:

  • Change the scheduler for all luns:
echo deadline > /sys/block/sdb/queue/scheduler
  • Making the readahead buffer bigger will also help the performance of the system. There is a good example here.
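Both tweaks have to be applied per device. A sketch that loops over the member disks: the sd names are taken from the multipath output above and must be adjusted to your system, and the readahead value of 16384 sectors (8MiB) is an assumption to tune to taste.

```shell
# Set the deadline scheduler and a larger readahead (in 512-byte
# sectors) on every SAS member disk of the array.
for dev in sd{b..q}; do
  echo deadline > /sys/block/$dev/queue/scheduler
  blockdev --setra 16384 /dev/$dev
done
```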

More tuning options will follow.

How we have tested this

We're using an in-house benchmark tool that we also use for acceptance tests when we procure new storage systems. The tool is located here.

The tool can test local read and write performance as well as GridFTP performance, with a read/write ratio of 2:1.