Slow storage performance - HW or DBA issue?

OS: SLES 11 SP3
Storage: Multipath FC SAN on RAID 10.

The DBA ran an I/O-bound job (indexing) and complained about slow performance, since he had run a similar exercise on another system (identical hardware/storage) previously and that job completed about 4 hours faster.

While the job was running I got the chance to collect iostat output, and as far as I can tell there is no bottleneck at the storage/hardware level (the /oracle file system is on /dev/dm-0):

# iostat -xk /dev/dm-0  2 6
Linux 3.0.76-0.11-default (thltlp)      01/18/15        _x86_64_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.00    0.16    2.55    0.00   94.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0            124.99     6.01  669.11   43.72 16733.30   855.89    49.35     3.22    4.52   0.82  58.52

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.25    0.00    0.20   14.01    0.00   82.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.50     0.00 1145.50   11.50 12644.00    65.00    21.97     6.55   10.92   0.86 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.28    0.00    0.19   12.41    0.00   84.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             12.00     4.00 1066.00   12.00 12580.00    57.00    23.45     5.47    6.89   0.93 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.16    0.00    0.15   13.02    0.00   83.67

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.50     0.00  806.00   10.00 10008.00    49.00    24.65     4.79    4.66   1.23 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.11    0.00    0.23   11.81    0.00   84.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              6.50     0.00 1144.00   10.00 12684.00    49.00    22.07     4.73    3.09   0.87 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    0.11   12.01    0.00   84.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0             13.00     0.00  882.00   12.00 11112.00    50.00    24.97     4.81    4.30   1.12 100.00

Please guide me: is there a performance issue on the hardware/storage side, or should the DBA take this up with DB Support?

Hi sharfuddin,

While the job was running I got the chance to collect iostat output, and as far as I can tell there is no bottleneck at the storage/hardware level

How many CPUs does the OS see, 8? It might well be that approx. 12% iowait means one core/process is waiting on I/O all the time. “%util” at 100 supports that…
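
If in doubt, something along these lines should show what the OS actually works with (standard lscpu/nproc, nothing distribution-specific assumed):

# lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
# nproc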

My guess would be your DBMS is saturating the IO.

Regards,
Jens

How many CPUs does the OS see, 8?

The OS sees 40 logical CPUs (this machine has 2 sockets, each with 10 cores, plus hyperthreading, so 40 cores/CPUs altogether).

Yes, the DBA did say that the job is very I/O intensive (indexing), but he has run a similar job on almost identical hardware and that job took 4 hours less to complete. Also, I forgot to mention that the LUN for the Oracle file system is RAID 10 on SSD disks.

Hi sharfuddin,

well, the point is that the server system seems to point to the storage as the bottleneck - have you checked the throughput of the FC link and corresponding fabric & SAN server logs for problems?
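
As a rough sketch of what I mean - assuming your HBAs expose the usual fc_host entries in sysfs, you can read the negotiated link speed, port state and low-level error counters directly on the host:

# cat /sys/class/fc_host/host*/speed
# cat /sys/class/fc_host/host*/port_state
# grep . /sys/class/fc_host/host*/statistics/{link_failure_count,loss_of_sync_count,invalid_crc_count}

Steadily rising error counters there would point at the link/fabric rather than at the OS.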

If you believe this to be caused at the OS layer, you might want to look into differences regarding the I/O scheduler for the device and hardware-level settings regarding the FC card (and its connecting bus), although I would not expect too much out of that.
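
For comparing the scheduler settings between the two machines, something like the following would be a start (dm-0 taken from your iostat output; sdX stands for whichever path devices "multipath -ll" reports behind it):

# multipath -ll
# cat /sys/block/dm-0/queue/scheduler
# cat /sys/block/sdX/queue/scheduler
# cat /sys/block/sdX/queue/nr_requests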

BTW, how much slower was the job, relatively? 10 minutes on the “fast” server and 250 minutes on the “slow” one? Or 240 vs. 244 hours? :wink:

Regards,
Jens

The storage/hardware guys have checked the logs and found nothing. Also, we are not considering the OS as the cause of the slow performance, but we do have doubts about the hardware side (SAN).

So, does anything in the iostat output point towards an I/O bottleneck? And what test should I run to check the storage performance?
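
For example, would a synthetic fio run along these lines be a meaningful test here? (The file path, size, block size and queue depth below are just placeholders, not tuned to our Oracle workload.)

# fio --name=randread --filename=/oracle/fio.test --size=4G --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 --group_reporting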

[QUOTE=jmozdzen;25893]Hi sharfuddin,

BTW, how much slower was the job, relatively? 10 minutes on the “fast” server and 250 minutes on the “slow” one? Or 240 vs. 244 hours? :wink:

Regards,
Jens[/QUOTE]

It's 240 vs. 244 ;-).

Hi sharfuddin,

yes, the reported 100% utilization - see “man iostat”.

There are various tools out there that will evaluate disk performance. On the other hand, the consequence is usually a set of tuning measures, giving you relative improvements (or not :wink: ). So as you already have your test case (the DBA’s job), I personally wouldn’t run too many other performance-measuring tools, but rather check the whole data chain, then try to understand and tune the components involved. As you’re handling write transactions through a file system (and the delta between the fast and the slow machine is only about 1.7%), you might want to look into

[LIST]
[*]file system optimization, i.e. not updating the “last access time” (leading to faster execution time - no other way to measure, IMO; see the sketch after this list)
[*]tuning your scheduler (resulting in less I/O wait and/or higher throughput in vmstat, iostat)
[*]tuning your hardware (i.e. checking IRQ balancing - leading to more throughput in vmstat, iostat; see the sketch after this list)
[*]tuning your storage network and back-end
[/LIST]
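To make the first and third points a bit more concrete - a rough sketch, with /oracle standing in for your actual mount point:

# mount -o remount,noatime /oracle
# grep /oracle /etc/fstab
# cat /proc/interrupts
# pgrep -l irqbalance

The remount takes effect immediately; persisting it means adding “noatime” to the corresponding /etc/fstab entry. In /proc/interrupts, check whether the FC HBA interrupts all land on a single CPU.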
This is a tedious task… and depending on the already existing efficiency of your system, can be less rewarding than expected.

Regards,
Jens