100g Storage Box
This setup was presented at HEPiX Spring 2018. The idea behind it is to build an efficient, fast, dense and affordable 100Gbit/s storage solution.
Hardware shopping list
- 1x IBM S822L or S922L with dual socket 8 core CPUs and 128GB RAM
- 1x Mellanox MCX516A-CDAT or Chelsio T62100-LP-CR (choose your favorite)
- 2x LSI 9405W quad-port SAS HBA
- 2x Seagate AssuredSAN 4004 with 8TB drives
- 8x SAS3 cables
As an alternative, you can also use an IBM LC922 with dual socket 16 core CPUs and 128GB RAM.
Configuration and tuning
We have used Ubuntu 18.04 for this setup. If you want to use another distribution, make sure that the kernel is recent enough to support the hardware.
Make sure that the system is installed before connecting it to the AssuredSAN boxes.
Physical setup
It's important that both HBAs are connected to the same CPU. For the S822L, I advise using slots C3 and C5. For the S922L, I advise using slots C3 and C4.
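One way to check afterwards that both HBAs really hang off the same CPU is to compare the NUMA node that the kernel reports for each of them. The sketch below is not part of the original setup. It assumes that the HBAs show up as PCI devices with the SAS controller class code (0x0107) and that the platform fills in the numa_node attribute in sysfs. The function name sas_hba_numa_nodes is made up for this example.

```shell
# Print "<PCI address> <NUMA node>" for every PCI device whose class code
# marks it as a SAS controller (class 0x01, subclass 0x07).
# The optional argument overrides the sysfs directory (useful for testing).
sas_hba_numa_nodes() {
    dir="${1:-/sys/bus/pci/devices}"
    for dev in "$dir"/*; do
        [ -r "$dev/class" ] && [ -r "$dev/numa_node" ] || continue
        case "$(cat "$dev/class")" in
            0x0107*) printf '%s %s\n' "${dev##*/}" "$(cat "$dev/numa_node")" ;;
        esac
    done
}
```

Calling sas_hba_numa_nodes without arguments reads /sys/bus/pci/devices. Both HBAs should report the same node; a value of -1 means the platform did not report one.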
Make sure that the server is booted before connecting the SAS cables. Use the shortest SAS3 cables that still fit between the server and the AssuredSAN boxes.
Only use ports 0 and 2 on the AssuredSAN boxes. It doesn't matter which HBA ports you connect the SAS cables to.
Place the 100Gbit/s card in slot C6 on the S822L and slot C9 on the S922L. Don't forget to install the drivers from the vendor. They give you more offloading than the standard driver shipped with the distribution.
Configure the AssuredSAN
For how to do the configuration itself, I refer you to the manual.
The raidsets need to be configured like this:
- 14 drives per RAID6
- Read ahead set to 32MB
- Divide the raidsets over both controllers
- Divide the raidsets over the SAS ports of each controller. Map a raidset to one port. Only use ports 0 and 2 on each controller.
- Don't use lun 0 for mapping the raidsets! Lun 0 is used for the control channel between the server and the AssuredSAN.
Configure the server
Install the following packages: multipath-tools sg3-utils mdadm
Configure multipath
Place this piece of configuration in the file /etc/multipath.conf:

    defaults {
        user_friendly_names yes
        find_multipaths yes
        path_grouping_policy group_by_prio
        path_checker tur
        path_selector "round-robin 0"
        prio "alua"
        rr_weight uniform
        failback immediate
        no_path_retry 18
        rr_min_io 100
    }

    blacklist {
        devnode "^(ram|sda|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
    }

    devices {
        device {
            vendor "DotHill"
            product "DH4554"
            path_grouping_policy group_by_prio
            path_checker tur
            path_selector "round-robin 0"
            prio "alua"
            rr_weight priorities
            failback immediate
            hardware_handler "1 alua"
            no_path_retry 18
            rr_min_io 100
        }
    }
To start multipath, run: multipath
You should see something like this:
    mpathe (3600c0ff00029accca7c8a95a01000000) dm-2 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 3:0:2:1 sdd 8:48 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 4:0:1:1 sdj 8:144 active ready running
    mpathd (3600c0ff00029394ca8c0cb5901000000) dm-5 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 3:0:3:3 sdg 8:96 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 4:0:2:3 sdm 8:192 active ready running
    mpathc (3600c0ff0002939299dc0cb5901000000) dm-4 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 4:0:2:2 sdl 8:176 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 3:0:3:2 sdf 8:80 active ready running
    mpathb (3600c0ff00029394c07c0cb5901000000) dm-0 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 3:0:1:1 sdb 8:16 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 4:0:3:1 sdn 8:208 active ready running
    mpatha (3600c0ff000293929fcbfcb5901000000) dm-1 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 4:0:3:4 sdo 8:224 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 3:0:1:4 sdc 8:32 active ready running
    mpathh (3600c0ff00029ad379ac8a95a01000000) dm-3 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 4:0:1:4 sdk 8:160 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 3:0:2:4 sde 8:64 active ready running
    mpathg (3600c0ff00029acccbac8a95a01000000) dm-7 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 3:0:4:3 sdi 8:128 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 4:0:4:3 sdq 65:0 active ready running
    mpathf (3600c0ff00029ad37b2c8a95a01000000) dm-6 DotHill,DH4554
    size=87T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | `- 4:0:4:2 sdp 8:240 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      `- 3:0:4:2 sdh 8:112 active ready running
If you get an error, it could be that the multipath module isn't loaded. In that case, run: modprobe dm_multipath
It is also important that the SAS paths are balanced, as in the example above, where each of the two SCSI hosts (3 and 4) carries the active path for half of the luns. If they are not, check that all the cables and SAS luns are running at 12Gbps, and check that the mapping is evenly distributed over the SAS ports.
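With eight luns it is easy to miscount by eye, so the balance check can be scripted. The sketch below is my own addition and assumes the output format shown in the example above: each path group line carries a status= field and each path line starts with a host:channel:target:lun tuple. The function name count_active_per_host is hypothetical.

```shell
# Read `multipath -ll` style output on stdin and count, per SCSI host,
# how many paths sit in a path group with status=active.
count_active_per_host() {
    awk '
        # A path group line: remember whether this group is the active one.
        /status=/ { active = ($0 ~ /status=active/) }
        # A path line inside an active group: pick out host:channel:target:lun.
        active && match($0, /[0-9]+:[0-9]+:[0-9]+:[0-9]+/) {
            split(substr($0, RSTART, RLENGTH), hctl, ":")
            count[hctl[1]]++
        }
        END { for (h in count) printf "host%s %d\n", h, count[h] }
    ' | sort
}
```

Use it as: multipath -ll | count_active_per_host. For the example output above it reports 4 active paths on host3 and 4 on host4.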
Running "rescan-scsi-bus --forcerescan" will scan for new block devices. This command will also help to find all the SAS luns.
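For the 12Gbps check mentioned above, the negotiated link rate can be read from sysfs instead of guessing from the cabling. This is a sketch under assumptions, not something from the original setup: it assumes the HBA driver exposes its phys through the kernel's SAS transport class under /sys/class/sas_phy, and that a 12Gbps link is reported as the string "12.0 Gbit". The function name slow_sas_phys is made up.

```shell
# List every SAS phy whose negotiated link rate differs from the wanted one.
# Usage: slow_sas_phys [SYSFS_DIR] [WANTED_RATE]
slow_sas_phys() {
    dir="${1:-/sys/class/sas_phy}"
    wanted="${2:-12.0 Gbit}"
    for phy in "$dir"/*; do
        [ -r "$phy/negotiated_linkrate" ] || continue
        rate=$(cat "$phy/negotiated_linkrate")
        # Print only the phys that did not negotiate the wanted rate.
        [ "$rate" = "$wanted" ] || printf '%s %s\n' "${phy##*/}" "$rate"
    done
}
```

No output means every phy negotiated the wanted rate. Phys without a cable attached will also be listed, so compare the result against the ports you actually use.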
Configure the raid0
Let's create the raid0 so that all the luns are fused into one raidset. This gives us their combined performance. To demonstrate the performance, we place a simple xfs filesystem directly on top of the raidset. You can also add LVM if you want to slice it into multiple pieces.
    mdadm --create md6 --chunk=16M --level 0 --raid-devices 8 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 /dev/dm-4 /dev/dm-5 /dev/dm-6 /dev/dm-7
    mkfs.xfs -f /dev/md6
    mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k,swalloc,largeio,inode64 /dev/md6 /mnt
If everything went well, you should now have roughly 700TiB of storage mounted on /mnt.
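The 700TiB figure follows from the raidset layout: 14 drives per RAID6 leaves 12 data drives of 8TB each, and eight such raidsets are striped together. The small sketch below just redoes that arithmetic. It assumes the drive size is quoted in decimal terabytes (10^12 bytes); the function name usable_tib is made up for the example.

```shell
# Usable capacity in TiB.
# Usage: usable_tib DRIVES_PER_SET PARITY_DRIVES DRIVE_SIZE_TB NUMBER_OF_SETS
usable_tib() {
    awk -v n="$1" -v p="$2" -v tb="$3" -v sets="$4" 'BEGIN {
        bytes = (n - p) * tb * 1e12 * sets    # only the data drives count
        printf "%.1f\n", bytes / (1024 ^ 4)   # bytes to TiB
    }'
}

usable_tib 14 2 8 1   # one raidset: prints 87.3
usable_tib 14 2 8 8   # all eight raidsets: prints 698.5
```

The per-raidset result matches the size=87T that multipath reports for each lun.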
Tuning parameters
If you want to go faster, you can do the following things:
Change the scheduler for all luns:
echo deadline > /sys/block/sdb/queue/scheduler
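The command above only changes one device. A loop along the following lines applies the same setting to every sd* device. This is a sketch, not the exact procedure we ran: the function name set_scheduler is made up, and the glob also covers the local boot disk (sda on this system), so narrow it if you don't want that. On kernels that use the multi-queue block layer the equivalent scheduler may be called mq-deadline; read the scheduler file to see which names are offered.

```shell
# Write the given scheduler name to every sd* device's queue/scheduler file.
# Usage: set_scheduler NAME [SYSFS_BLOCK_DIR]
set_scheduler() {
    sched="$1"
    dir="${2:-/sys/block}"
    for q in "$dir"/sd*/queue/scheduler; do
        [ -w "$q" ] || continue   # skip unmatched globs and read-only files
        echo "$sched" > "$q"
    done
}
```

Run it as root with: set_scheduler deadline. The setting does not survive a reboot, so it has to be reapplied at boot time.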
More tuning options will follow.
How we have tested this
We're using an in-house benchmark tool that we also use for acceptance tests when we procure new storage systems. The tool is located here.
The tool can test local read and write performance as well as GridFTP performance, with a read/write ratio of 2:1.