Using FusionIO drives on Red Hat Enterprise Linux 5

%{color:red}Update:% Please note that this post is getting a bit old. Currently I am running these IBM FusionIO drives on RHEL 6. I’ll be posting about that and a few other PCIe-SSD subjects in the near future. - 24 Apr 2012

FusionIO ioDrive Overview

So at work we have a rather large IBM x3850 X5 server. It has four sockets, each with six cores and hyperthreading (not that I’m necessarily a fan of hyperthreading; really, I haven’t done enough research to make up my mind), which ends up with RHEL 5 seeing 48 CPUs.

$ cat /proc/cpuinfo | grep proc | wc -l
48

Fun.

But the important part of this post is that this server also has three 640GB FusionIO drives, which I have installed and configured as a volume group called fio.

$ ls /dev/fio
fio/  fioa  fiob  fioc  fiod  fioe  fiof  
$ vgs fio
  VG   #PV #LV #SN Attr   VSize VFree
  fio    6   4   0 wz--n- 1.76T 1.08T

where fio[a-f] are the drives; each 640 GB card actually appears as two 320 GB disks.

$ dmesg  |grep -i "found device"
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:89:00.0: Found device 0000:89:00.0
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:8a:00.0: Found device 0000:8a:00.0
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:93:00.0: Found device 0000:93:00.0
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:94:00.0: Found device 0000:94:00.0
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:98:00.0: Found device 0000:98:00.0
fioinf IBM 640GB High IOPS MD Class PCIe Adapter 0000:99:00.0: Found device 0000:99:00.0

Resources

The most important resource for using these FusionIO drives is the official knowledge base, which has several articles specifically for Linux. I would suggest reading all of them. :)

Install

Once the cards were put into the server (somewhat harrowing, given their individual cost) and the server was booted, I installed the software drivers downloaded from the IBM website. This server runs RHEL 5

$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

as that is the RHEL version that IBM supports for these drivers.

$ rpm -i iodrive-driver-1.2.7.5-1.0_2.6.18_164.el5.x86_64.rpm \
iodrive-firmware-1.2.7.6.43246-1.0.noarch.rpm \
iodrive-jni-1.2.7.5-1.0.x86_64.rpm \
iodrive-snmp-1.2.7.5-1.0.x86_64.rpm \
iodrive-util-1.2.7.5-1.0.x86_64.rpm

Currently I am using the drivers as they were downloaded, which means running the specific kernel version they were built against. The drivers do come with a source RPM so that you can rebuild them for a newer kernel (sketched below), but I have opted not to do that yet. So install the matching kernel

$ yum install kernel-2.6.18-164.el5

and reboot.
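If you do want to rebuild the drivers for a newer kernel instead, it is the usual source-RPM dance. A rough sketch, untested by me, with an illustrative source RPM filename:

# rebuild the driver module against the currently running kernel
$ rpmbuild --rebuild iodrive-driver-1.2.7.5-1.0.src.rpm
# on RHEL 5 the rebuilt packages land under /usr/src/redhat
$ rpm -i /usr/src/redhat/RPMS/x86_64/iodrive-driver-*.rpm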

However, I am also using the amazing ksplice service to ensure that, despite running a rather old kernel to match the FusionIO drivers, the kernel is still up to date in terms of security issues:

$ uptrack-uname -r
2.6.18-238.12.1.el5
$ uname -r
2.6.18-164.el5

The @uptrack-uname -r@ command asks uptrack which kernel version the running kernel is the security equivalent of. Great stuff, that Ksplice.
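Keeping the kernel current is then just a matter of running the Uptrack client tools periodically (assuming the uptrack packages are installed and the machine is registered with the service):

# list the rebootless updates currently applied
$ uptrack-show
# download and apply any new rebootless updates
$ uptrack-upgrade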

Once the drivers are installed we can load the modules

$ modprobe fio-driver

and now we can see the drives

$ ls /dev/fio*
fioa  fiob  fioc  fiod  fioe  fiof 

and at this point we can configure the drives.

Worker processes

Once the drivers are installed there is an /etc/init.d/iodrive startup script. One of the things this script does is start some worker processes, which I believe are used to move data around the FusionIO drives to ensure their performance and longevity.

$ chkconfig --list iodrive
iodrive 0:off	1:on	2:on	3:on	4:on	5:on	6:off

$ ps ax | grep worker
 5271 ?        S<   1169:51 [fct0-worker]
 5588 ?        S<   1168:07 [fct1-worker]
 5593 ?        S<   359:01 [fct2-worker]
 5598 ?        R<   206:02 [fct3-worker]
 5603 ?        S<   203:15 [fct4-worker]
 5608 ?        S<   203:12 [fct5-worker]
20921 pts/2    S+     0:00 grep worker

These processes will take up some CPU time. Frankly, because there are 48 CPUs in this server, using up one to run these workers is OK. But it was a little confusing at first seeing all this activity: one worker process for each ioDimm module, so six in total.

Configuration

Given that we are going to manage the FusionIO drives via LVM, we will need to configure LVM to accept the fio devices. See this knowledge base article. (The second value, 16, is the maximum number of partitions allowed on that device type.)

$ grep fio /etc/lvm/lvm.conf
    types = [ "fio", 16 ]

Then add each /dev/fio* drive as a physical volume and gather them into a volume group.
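For completeness, the creation steps look roughly like this; I am reconstructing them after the fact, since the volume group already existed by the time I wrote this post:

# label each FusionIO block device as an LVM physical volume
$ pvcreate /dev/fioa /dev/fiob /dev/fioc /dev/fiod /dev/fioe /dev/fiof
# collect all six into one volume group named fio
$ vgcreate fio /dev/fioa /dev/fiob /dev/fioc /dev/fiod /dev/fioe /dev/fiof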

$ pvs | grep fio
  /dev/fioa  fio    lvm2 a-   300.31G 320.00M
  /dev/fiob  fio    lvm2 a-   300.31G 100.31G
  /dev/fioc  fio    lvm2 a-   300.31G 100.31G
  /dev/fiod  fio    lvm2 a-   300.31G 300.31G
  /dev/fioe  fio    lvm2 a-   300.31G 300.31G
  /dev/fiof  fio    lvm2 a-   300.31G 300.31G
$ vgs fio
  VG   #PV #LV #SN Attr   VSize VFree
  fio    6   4   0 wz--n- 1.76T 1.08T

fio-status

The fio-status utility is a useful way to check the status of the FusionIO drives.

$ fio-status

Found 6 ioDrives in this system with 3 ioDrive Duos
Fusion-io driver version: 1.2.7.5

Adapter: ioDrive Duo
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:59518
	PCIE Power limit threshold: 24.75W
	Connected ioDimm modules:
	  fct0:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77479
	  fct1:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77478

fct0	Attached as 'fioa' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77479
	Alt PN:68Y7382
	Located in 0 Upper slot of ioDrive Duo SN:59518
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 56.6 degC, max 59.6 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%

fct1	Attached as 'fiob' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77478
	Alt PN:68Y7382
	Located in 1 Lower slot of ioDrive Duo SN:59518
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 61.0 degC, max 63.0 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%


Adapter: ioDrive Duo
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:59507
	PCIE Power limit threshold: 24.75W
	Connected ioDimm modules:
	  fct2:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77143
	  fct3:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77144

fct2	Attached as 'fioc' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77143
	Alt PN:68Y7382
	Located in 0 Upper slot of ioDrive Duo SN:59507
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 62.0 degC, max 65.5 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%

fct3	Attached as 'fiod' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77144
	Alt PN:68Y7382
	Located in 1 Lower slot of ioDrive Duo SN:59507
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 64.0 degC, max 66.4 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%


Adapter: ioDrive Duo
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:100366
	PCIE Power limit threshold: 24.75W
	Connected ioDimm modules:
	  fct4:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77344
	  fct5:	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77345

fct4	Attached as 'fioe' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77344
	Alt PN:68Y7382
	Located in 0 Upper slot of ioDrive Duo SN:100366
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 68.9 degC, max 71.9 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%

fct5	Attached as 'fiof' (block device)
	IBM 640GB High IOPS MD Class PCIe Adapter, Product Number:68Y7381 SN:77345
	Alt PN:68Y7382
	Located in 1 Lower slot of ioDrive Duo SN:100366
	Firmware v43246
	322.46 GBytes block device size, 396 GBytes physical device size
	Internal temperature: avg 63.0 degC, max 66.0 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10%



XFS

Prior to finding out about the official knowledge base, I had decided to purchase a subscription from Red Hat for the XFS file system. Then, upon reading this kb article, I found that they heavily recommend XFS as the file system to run on top of a FusionIO drive:

XFS is currently the recommended filesystem. It can achieve up to 3x 
the performance of a tuned ext2/ext3 solution. At this time, there is 
no known additional tuning for running XFS in a single- or multi-ioDrive 
configuration 

so that is the file system we use.
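Carving out a logical volume and putting XFS on it is the usual routine. A minimal sketch (the 200G size is illustrative, and per the KB advice above I stick with stock mkfs.xfs defaults):

# create a logical volume in the fio volume group
$ lvcreate -L 200G -n vault1 fio
# format it with XFS defaults and mount it
$ mkfs.xfs /dev/fio/vault1
$ mkdir -p /var/lib/vault1
$ mount /dev/fio/vault1 /var/lib/vault1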

$ mount | grep fio
/dev/mapper/fio-vault1 on /var/lib/vault1 type xfs (rw)

Mounting drives after a reboot

I’ll admit I hadn’t thought of this during the initial installation. After a few days we moved the server to a new location, which required a power down and restart.

While the server was restarting (and I was standing in the cold, loud server room, because the new room didn’t have any networking for IPMI, which is not good), I noticed it took a very long time to get past the udev portion of the boot, and in fact the FusionIO drives failed to mount from fstab. Of course there is a logical reason for that; read about it here.

Because we are using the 1.2 driver, I followed the straightforward instructions here.
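The gist of the workaround is to keep the filesystem out of the normal boot-time mount pass and mount it only after the driver has attached the devices. A hypothetical fstab entry along those lines (see the linked instructions for the exact procedure):

# noauto stops the boot from blocking on a device that is not attached yet;
# a late init script then mounts it once the fio devices exist
/dev/mapper/fio-vault1  /var/lib/vault1  xfs  defaults,noauto  0 0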

Performance testing

Performance testing is hard. Maybe it’s just me. But testing superdisks like these FusionIO drives on a server with 48 CPUs and 64 GB of main memory is not easy. Again, I will admit I took a shot at benchmarking the FusionIO disks before reading the kb. I messed around with Bonnie++, io-whatever, but nothing quite came out right, partly because I didn’t put a lot of time into it, and partly because the server has so much memory that it is hard to beat the cache (I did try to reduce the memory the OS could see via kernel configuration, but didn’t have much luck with that).

Finally I read this kb article, which suggested using the fio utility (which, as far as I can tell, is not put out by FusionIO; it is just aptly named).

The fio tool is not in the RHEL repositories, but it is in rpmforge/repoforge.

$ cd /var/tmp
$ wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm
$ rpm -Uvh rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm
$ yum repolist | grep forge
rpmforge                           RHEL 5Server - RPMforge.net - enabled: 10,636
$ yum search fio | grep -i benchmark
fio.x86_64 : I/O benchmark and stress/hardware verification tool
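With the repository enabled, installing it is a one-liner:

$ yum install fio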


Here are a couple of example runs. Please note that at this point I do not know much about fio. Benchmarking disks is a highly technical thing to do, and getting these tests right would take a lot of research and consideration, which I have not done.

It seems that the fio benchmark utility supports direct=1, which means non-buffered I/O: it skips the page cache and goes straight to the disk.
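As a quick sanity check that direct I/O behaves as advertised, dd can do the same thing via oflag=direct (the path here is just my test mount point):

# write 1 GB with O_DIRECT, bypassing the page cache
$ dd if=/dev/zero of=/mnt/fio-test-xfs/ddtest bs=1M count=1024 oflag=direct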

$ cat fio-randwrite.fio 
[randwrite]

direct=1
rw=randwrite 
bs=1m 
size=5G 
numjobs=4 
runtime=10 
group_reporting 
directory=/mnt/fio-test-xfs
$ fio fio-randwrite.fio 
randwrite: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
...
randwrite: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
fio 1.55
Starting 4 processes
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 4 (f=4): [wwww] [100.0% done] [0K/522.8M /s] [0 /510  iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=28487
  write: io=4556.0MB, bw=466161KB/s, iops=455 , runt= 10008msec
    clat (msec): min=1 , max=1692 , avg= 9.83, stdev=22.04
     lat (msec): min=1 , max=1692 , avg= 9.84, stdev=22.04
    bw (KB/s) : min=  559, max=264126, per=24.79%, avg=115540.55, stdev=20377.90
  cpu          : usr=0.10%, sys=14.85%, ctx=59071, majf=0, minf=92
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/4556/0, short=0/0/0

     lat (msec): 2=0.53%, 4=1.27%, 10=97.17%, 20=0.59%, 50=0.09%
     lat (msec): 100=0.18%, 250=0.15%, 2000=0.02%

Run status group 0 (all jobs):
  WRITE: io=4556.0MB, aggrb=466161KB/s, minb=477349KB/s, maxb=477349KB/s,
  mint=10008msec, maxt=10008msec

Disk stats (read/write):
  dm-11: ios=0/158802, merge=0/0, ticks=0/55956241, in_queue=55915327, 
  util=66.05%, aggrios=0/159667, aggrmerge=0/0, aggrticks=0/55932489,
  aggrin_queue=55785218, aggrutil=65.96%
    fioc: ios=0/159667, merge=0/0, ticks=0/55932489, in_queue=55785218, 
    util=65.96%

And then a similar test using RAID10 SAS disks formatted with ext3.

$ cat fio-randwrite.fio 
[randwrite]

direct=1
rw=randwrite 
bs=1m 
size=5G 
numjobs=4 
runtime=10 
group_reporting 
directory=/mnt/sas-test
$ fio fio-randwrite.fio 
randwrite: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
...
randwrite: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
fio 1.55
Starting 4 processes
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
randwrite: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 4 (f=4): [wwww] [inf% done] [0K/0K /s] [0 /0  iops] [eta 1158050441d:07h:00m:04s]
...
Jobs: 1 (f=1): [___w] [66.1% done] [0K/0K /s] [0 /0  iops] [eta 00m:19s]
randwrite: (groupid=0, jobs=4): err= 0: pid=28586
  write: io=4096.0KB, bw=112369 B/s, iops=0 , runt= 37326msec
    clat (usec): min=12140K, max=37183K, avg=32696578.04, stdev= 0.00
     lat (usec): min=12140K, max=37183K, avg=32696579.88, stdev= 0.00
    bw (KB/s) : min=   27, max=   83, per=31.61%, avg=34.46, stdev= 0.00
  cpu          : usr=0.00%, sys=51.90%, ctx=9598, majf=0, minf=102
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/4/0, short=0/0/0

     lat (msec): >=2000=100.00%

Run status group 0 (all jobs):
  WRITE: io=4096KB, aggrb=109KB/s, minb=112KB/s, maxb=112KB/s, 
  mint=37326msec, maxt=37326msec

Disk stats (read/write):
  dm-12: ios=128/4721384, merge=0/0, ticks=5582/602531980, in_queue=602926524,
  util=97.85%, aggrios=129/87424, aggrmerge=0/4634618, aggrticks=5631/10828734,
  aggrin_queue=10826088, aggrutil=98.01%
    sdb: ios=129/87424, merge=0/4634618, ticks=5631/10828734, in_queue=10826088,
    util=98.01%

That’s a pretty big difference: io=4556.0MB for the FusionIO drives versus io=4096.0KB for the SAS RAID10, or roughly 455 MB/s against 110 KB/s of aggregate write bandwidth: thousands of times faster on this admittedly naive test. I’m going to have to look into this more! :)

PS. I found this list of device bandwidths interesting.