Slow storage performance - HW or DBA issue?

OS: SLES 11 SP3
Storage: Multipath FC SAN on RAID 10.

The DBA ran an I/O-bound job (indexing) and complained about slow performance, since he had run a similar exercise on another system (identical hardware/storage) previously and that job completed about 4 hours faster.

While the job was running I got the chance to collect iostat output, and as far as I can tell there is no bottleneck at the storage/hardware level (the /oracle file system is on /dev/dm-0):

# iostat -xk /dev/dm-0  2 6
Linux 3.0.76-0.11-default (thltlp)      01/18/15        _x86_64_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.00    0.16    2.55    0.00   94.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0            124.99     6.01  669.11   43.72 16733.30   855.89    49.35     3.22    4.52   0.82  58.52

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.25    0.00    0.20   14.01    0.00   82.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.50     0.00 1145.50   11.50 12644.00    65.00    21.97     6.55   10.92   0.86 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.28    0.00    0.19   12.41    0.00   84.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             12.00     4.00 1066.00   12.00 12580.00    57.00    23.45     5.47    6.89   0.93 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.16    0.00    0.15   13.02    0.00   83.67

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.50     0.00  806.00   10.00 10008.00    49.00    24.65     4.79    4.66   1.23 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.11    0.00    0.23   11.81    0.00   84.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              6.50     0.00 1144.00   10.00 12684.00    49.00    22.07     4.73    3.09   0.87 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    0.11   12.01    0.00   84.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.00     0.00  882.00   12.00 11112.00    50.00    24.97     4.81    4.30   1.12 100.00

Please guide me: is there a performance issue on the hardware/storage side, or should the DBA take this up with DB Support?

Hi sharfuddin,

While the job was running I got the chance to collect iostat output, and as far as I can tell there is no bottleneck at the storage/hardware level

How many CPUs does the OS see, 8? It might well be that approx. 12% iowait means one core/process is waiting on I/O all the time. “%util” at 100 supports that…
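
If in doubt, something along these lines should show what the OS actually works with (standard lscpu/nproc, nothing distribution-specific assumed):

# lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
# nproc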

My guess would be your DBMS is saturating the IO.

Regards,
Jens

How many CPUs does the OS see, 8?

The OS sees 40 logical CPUs (this machine has 2 sockets, each with 10 cores, plus hyperthreading, so 40 cores/CPUs altogether).

Yes, the DBA did say that the job is very I/O intensive (indexing), but he has run a similar job on almost identical hardware and that job took 4 hours less to complete. Also, I forgot to mention that the LUN for the Oracle file system is RAID 10 on SSD disks.

Hi sharfuddin,

well, the point is that the server system seems to point to the storage as the bottleneck - have you checked the throughput of the FC link and corresponding fabric & SAN server logs for problems?
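
As a rough sketch of what I mean - assuming your HBAs expose the usual fc_host entries in sysfs, you can read the negotiated link speed, port state and low-level error counters directly on the host:

# cat /sys/class/fc_host/host*/speed
# cat /sys/class/fc_host/host*/port_state
# grep . /sys/class/fc_host/host*/statistics/{link_failure_count,loss_of_sync_count,invalid_crc_count}

Steadily rising error counters there would point at the link/fabric rather than at the OS.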

If you believe this to be caused at the OS layer, you might want to look into differences regarding the I/O scheduler for the device and hardware-level settings regarding the FC card (and its connecting bus), although I would not expect too much out of that.
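
For comparing the scheduler settings between the two machines, something like the following would be a start (dm-0 taken from your iostat output; sdX stands for whichever path devices "multipath -ll" reports behind it):

# multipath -ll
# cat /sys/block/dm-0/queue/scheduler
# cat /sys/block/sdX/queue/scheduler
# cat /sys/block/sdX/queue/nr_requests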

BTW, how much slower was the job, relatively? 10 minutes on the “fast” server and 250 minutes on the “slow” one? Or 240 vs. 244 hours? :wink:

Regards,
Jens

The storage/hardware guys have checked the logs and found nothing. Also, we are not considering the OS as the cause of the slow performance, but we do have doubts about the hardware side (SAN).

So, does anything in the iostat output point towards an I/O bottleneck? And what test should I run to check the storage performance?
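
For example, would a synthetic fio run along these lines be a meaningful test here? (The file path, size, block size and queue depth below are just placeholders, not tuned to our Oracle workload.)

# fio --name=randread --filename=/oracle/fio.test --size=4G --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 --group_reporting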

[QUOTE=jmozdzen;25893]Hi sharfuddin,

BTW, how much slower was the job, relatively? 10 minutes on the “fast” server and 250 minutes on the “slow” one? Or 240 vs. 244 hours? :wink:

Regards,
Jens[/QUOTE]

It's 240 vs. 244 ;-).

Hi sharfuddin,

yes, the reported 100% utilization - see “man iostat”.

There are various tools out there that will evaluate disk performance. On the other hand, the consequence is usually a set of tuning measures, giving you relative improvements (or not :wink: ). So as you already have your test case (the DBA’s job), I personally wouldn’t run too many other performance-measuring tools, but rather check the whole data chain, then try to understand and tune the components involved. As you’re handling write transactions through a file system (and the delta between the fast and the slow machine is only about 1.7%), you might want to look into

[LIST]
[*]file system optimization, i.e. not updating the “last access time” (leading to faster execution time - no other way to measure, IMO; see the sketch after this list)
[*]tuning your scheduler (resulting in less I/O wait and/or higher throughput in vmstat, iostat)
[*]tuning your hardware (i.e. checking IRQ balancing - leading to more throughput in vmstat, iostat; see the sketch after this list)
[*]tuning your storage network and back-end
[/LIST]
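To make the first and third points a bit more concrete - a rough sketch, with /oracle standing in for your actual mount point:

# mount -o remount,noatime /oracle
# grep /oracle /etc/fstab
# cat /proc/interrupts
# pgrep -l irqbalance

The remount takes effect immediately; persisting it means adding “noatime” to the corresponding /etc/fstab entry. In /proc/interrupts, check whether the FC HBA interrupts all land on a single CPU.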
This is a tedious task… and depending on the already existing efficiency of your system, can be less rewarding than expected.

Regards,
Jens