copy operation hanged

sharfuddin · September 19, 2012, 11:52am

SLES 11 SP1 x86_64

My customer is using a NAS device for backup.

192.168.0.y:/mnt/HD_a2 on /external_disk type nfs (rw,addr=192.168.0.y)

Also, the directory they back up is on a local file system “/backup”:
/dev/cciss/c0d0p8 on /backup type ext3 (rw,acl,user_xattr)
This /backup file system contains some very large files, e.g. 129 GB.

The problem is that when we try to copy a very large file “/backup/19sep/large-file”, which is about 129 GB in size, to the NAS, we found that

1 - blocks in (bi) and blocks out (bo) remain very low
[HTML]
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 0 11230940 91200 19138368 0 0 102 7 18 156 0 0 98 2 0
0 1 0 11231312 91200 19139952 0 0 0 0 1116 1321 0 0 96 4 0
0 1 0 11231296 91200 19140744 0 0 0 0 350 475 0 0 96 4 0
2 1 0 11231296 91200 19141592 0 0 0 0 767 876 0 0 95 5 0
0 1 0 11231544 91208 19143008 0 0 0 12 801 961 0 0 97 3 0
0 1 0 11231544 91208 19144184 0 0 0 268 727 990 0 0 95 5 0
1 1 0 11232108 91208 19145680 0 0 0 16 1020 1498 0 0 96 3 0
0 1 0 11231744 91208 19148480 0 0 0 0 1125 1311 1 0 93 5 0
0 1 0 11231496 91208 19150044 0 0 0 0 863 1034 0 0 96 4 0
0 1 0 11231744 91216 19152192 0 0 32 0 1318 1965 0 0 94 5 0
1 1 0 11231480 91216 19154592 0 0 0 944 1385 1563 0 0 96 4 0
0 1 0 11231736 91216 19155932 0 0 0 0 1050 1091 0 0 96 4 0
0 1 0 11231116 91308 19157936 0 0 516 136 1176 2495 0 0 93 6 0
2 1 0 11231116 91308 19158368 0 0 0 0 217 343 0 0 95 5 0
0 1 0 11227396 91308 19164352 0 0 0 0 1504 1717 1 0 95 4 0
0 2 0 11227396 91324 19166280 0 0 80 372 1373 1706 0 0 95 5 0
0 1 0 11227520 91324 19167488 0 0 8 0 885 1121 0 0 96 4 0
1 1 0 11227528 91324 19169596 0 0 0 0 1254 1473 0 0 95 4 0
0 1 0 11226784 91324 19173440 0 0 0 0 1479 2022 1 0 96 4 0
0 1 0 11225372 91324 19176268 0 0 0 0 1265 1545 0 0 94 6 0
1 1 0 11225372 91336 19178540 0 0 12 468 1067 1220 0 0 96 4 0
[/HTML]

2 - then, after about 2 hours, we found that
[HTML]
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 11230940 91200 19138368 0 0 108 12 40 131 0 0 95 5 0
1 3 0 11231312 91200 19139952 0 0 576 288 1571 2068 0 0 85 14 0
1 3 0 11231296 91200 19140744 0 0 216 344 1585 3430 1 0 86 13 0
[/HTML]
That is, blocked processes (b), in, cs and wa are all high, while bi and bo remain low. Also, the copy operation becomes uninterruptible: “ps aux” shows it in state D+, and we have to reboot the server to recover.

My questions
1 - Why does the copy start out so slow? (The ‘bi’ and ‘bo’ values in vmstat remain very low when we copy the 129 GB file from the local disk to the NAS.)
2 - Why does the copy operation then hang/freeze and go into state D+?

And where is the problem? Is there something wrong with the NAS device or with our local disk/file system (/backup)?

please help

Bob-O-Rama · September 20, 2012, 6:00am

No idea… but a couple theories:

As a test, copy the same data to /dev/null and see if you have issues or slowness. If so, something bad is happening with your Smart Array. Check the dmesg output and see if the cciss driver (or whatever it is now) is screaming. Make sure you have the latest HP firmware updates. I have had issues where a disk was going bad, but not quite, and it acted this way, struggling when it hit the bad disk.
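
The read test above could be sketched roughly as follows. This assumes GNU dd and a POSIX shell; the function names `read_test` and `cciss_messages` are made up for illustration and are not from any package.

```shell
# read_test FILE [MB]: read FILE and discard the data, so that only the
# source side (disk, controller, file system) is exercised.
# The optional MB argument limits the test to the first MB mebibytes.
read_test() {
    src=$1
    mb=${2:-}
    if [ -n "$mb" ]; then
        dd if="$src" of=/dev/null bs=1M count="$mb"
    else
        dd if="$src" of=/dev/null bs=1M
    fi
}

# cciss_messages: show the most recent kernel messages that mention cciss.
# (Hypothetical helper; reading dmesg may require root on some systems.)
cciss_messages() {
    dmesg | grep -i cciss | tail -n 20
}
```

For example, `read_test /backup/19sep/large-file 10240` would read only the first 10 GiB of the file from the original post. Watching `vmstat 1` in a second terminal while it runs shows whether bi stays low even without the NAS involved.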

On the NAS side of things, you can run dd count=129000 ibs=1M obs=1M < /dev/zero > some_file_on_the_NAS, or whatever, to exercise the NAS.
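
A smaller variant of that write test, as a sketch only: it assumes GNU dd (for `conv=fsync`), and the function name `write_test` and the scratch file name are arbitrary choices, not anything the NAS requires.

```shell
# write_test DIR MB: write MB mebibytes of zeros to a scratch file in DIR,
# force them out to the storage before dd exits, then remove the file.
# Returns dd's exit status.
write_test() {
    dir=$1
    mb=$2
    out="$dir/ddtest.$$"    # scratch file name is arbitrary
    dd if=/dev/zero of="$out" bs=1M count="$mb" conv=fsync
    status=$?
    rm -f "$out"
    return "$status"
}
```

Starting with something like `write_test /external_disk 1024` (1 GiB to the NFS mount from the original post) gives a quick throughput figure from dd's summary line before committing to a write as large as the real file.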

This tests each storage device independent of the other. If they seem to be handling the I/O properly, then we have to look at what is in between them (your backup script). With certain filers, you need to disable file locking, which may be an issue depending on the I/O patterns used. Disabling opportunistic locking, for example.

Like I said, total guesses. But if you cut the problem in half, you might be able to determine which side is the issue.
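
While either half of the test runs, it may help to see which processes are stuck in uninterruptible sleep and where in the kernel they are waiting. A minimal sketch, assuming the procps `ps` found on Linux; the function names `dstate_filter` and `list_dstate` are made up for illustration.

```shell
# dstate_filter: read "PID STAT WCHAN COMMAND" lines on stdin and keep only
# those whose STAT field starts with D (uninterruptible sleep).
# The ps header line is dropped too, since "STAT" does not start with D.
dstate_filter() {
    awk '$2 ~ /^D/'
}

# list_dstate: feed the filter from ps. The wchan column is the kernel
# function the process is sleeping in, which can hint at whether it is
# waiting on the NFS side or on the local disk.
list_dstate() {
    ps -eo pid,stat,wchan:32,args | dstate_filter
}
```

Running `list_dstate` every few seconds during the copy would show whether cp is already sitting in D state during the slow phase, or only once it has hung.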

– Bob

sharfuddin · September 24, 2012, 12:06pm

very nice recommendations/tips, thanks a lot.

I have forwarded the link to this thread to my customer and am awaiting their response.

I will update if and when I get anything from my customer.

