NFS client freezes when NFS server system loses connection

I have a few servers on a LAN where I share out the /scratch folder on each via NFS-server,
and on each server, using NFS-client, I mount the others’ scratch folders.
The problem arises when any one of the servers goes down, for example for a reboot.
When an NFS-server is down, I cannot log in to any other system that is an NFS-client to that server.
Specifically:

  • an SSH connection to the NFS-client system can be established successfully,
  • you can enter your username and password,
  • but after entering your password you never get a prompt and the SSH connection is frozen.
  • Whenever the NFS-server system finally becomes available, however long that takes, you get the prompt and can use that SSH connection.

How can I make it so that this does not happen?

There is this post at StackExchange which further describes the problem in general; however, there does not seem to be a solution:
http://unix.stackexchange.com/questions/267138/preventing-broken-nfs-connection-from-freezing-the-client-system

My /etc/fstab file contains: hpc2:/scratch /scratch_hpc2 nfs defaults 0 0

The /etc/exports file on the nfs-server system contains: /scratch <ip_address>(rw,root_squash,sync,no_subtree_check)

I have set up NFS-server via YaST, and I use neither NFSv4 nor GSS security.

/proc/mounts on an nfs-client system shows:

hpc2:/scratch /scratch_hpc2 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<ip_address>,mountvers=3,mountport=40551,mountproto=udp,local_lock=none,addr=<ip_address> 0 0

On 09/19/2016 09:14 AM, ron7000 wrote:

> When an NFS-server is down, I cannot log in to any other system that is
> an NFS-client to that server. [...] How can I make it so that this does
> not happen?
>
> There is this post at StackExchange which further describes the problem
> in general; however, there does not seem to be a solution:
> http://unix.stackexchange.com/questions/267138/preventing-broken-nfs-connection-from-freezing-the-client-system

The biggest difference I see between your post and the one at
StackExchange is that you are having a problem merely on login, whereas
the StackExchange thread has a problem when somebody tries to access the
mountpoint explicitly.

Are you trying, via login somehow (.bashrc, .profile, user’s home
directory assignment), to access that mountpoint? If so, why?

Do not try to access mountpoints that are not there; that’s just silly.
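
If some login script really must reference that path, guard the access so it cannot block the login. A minimal sketch for ~/.bashrc (the SCRATCH variable and the 2-second limit are placeholders; SIGKILL is used because a process stuck waiting on NFS often ignores plain SIGTERM):

# only touch the NFS path if it answers within two seconds
if timeout -s KILL 2 stat /scratch_hpc2 >/dev/null 2>&1; then
    export SCRATCH=/scratch_hpc2
fi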

Does this happen if you just launch a new bash shell via an
already-logged-in user?

bash


Good luck.

Hi,

try mounting it with the “soft” option. From “man 5 nfs”:

soft / hard    Determines the recovery behavior of the NFS client after an
               NFS request times out. If neither option is specified (or if
               the hard option is specified), NFS requests are retried
               indefinitely. If the soft option is specified, then the NFS
               client fails an NFS request after retrans retransmissions
               have been sent, causing the NFS client to return an error to
               the calling application.
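
With your fstab entry, that would look something like this (the timeo/retrans values are only examples; timeo is in tenths of a second):

hpc2:/scratch /scratch_hpc2 nfs soft,timeo=100,retrans=3 0 0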

Regards,
J

Everything I’ve read about the soft option says to never use it, because it can easily lead to data corruption.

I think you are missing my point; otherwise you are effectively saying “don’t use NFS”, because there is no way to handle the situation when a system that is an nfs-server goes down.

When all systems are up and running, everything is fine and the mount points work. The problem is when one system goes down, for whatever reason, at any point in time: on all the other systems, the shell of any user who has one open will freeze. Yes, the mount points to the system that is now offline are still there. But I have no way of automatically knowing exactly when a system goes down so that I can get in quickly and undo all the mount points to it.
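
The closest I can imagine is a watchdog on every client that polls each server and lazy-unmounts its share when it stops answering. A rough sketch (hpc2, the mount point, and the 5-second limit are just from my setup):

#!/bin/sh
# if the NFS service on hpc2 stops answering, force/lazy-unmount its
# share so shells that are already open stop blocking on it
if ! timeout 5 rpcinfo -t hpc2 nfs >/dev/null 2>&1; then
    umount -f -l /scratch_hpc2
fi

But running something like that from cron on every machine is exactly the kind of workaround I was hoping to avoid.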

This freezing will also happen in any existing shell window that is open for any user, in addition to when you try to log in to an nfs-client system where one of its mount points is unreachable.
This is completely separate from having a working shell window where, at the prompt, you try to do: cd /folder_from_nfsserver_thathascrashed/
What I am saying is that when an nfs-server goes down, the shell windows that are already open become unresponsive until the nfs-server comes back online.
Even when I am logged in under my user account, I cannot switch user with su to become root and then type: umount /folder_from_nfsserver_thathascrashed/
After doing su and typing the root password, I never get a prompt!!!

On 09/27/2016 11:54 AM, ron7000 wrote:

> Everything I’ve read about the soft option says to never use it, because
> it can easily lead to data corruption.
>
>> Do not try to access mountpoints that are not there; that’s just silly.
>
> I think you are missing my point; otherwise you are effectively saying
> “don’t use NFS”, because there is no way to handle the situation when a
> system that is an nfs-server goes down.

I definitely do not mean to discourage NFS use that much. I also cannot
duplicate your issue, at least not entirely, though my system’s setup
(SLES 12) is using NFSv4, which you indicated is not the case with your
systems. One thing that is similar-ish: if I am on an NFS client
system, and I then stop the NFS service on my server (not stopping the
entire server yet, just its NFS service), and then I go into the parent
directory of my mountpoint (/import), I cannot do a long listing of the
directory contents, which would effectively be the mount points. Since
you have your mount points, I think, right off the root of your filesystem
(/), perhaps that is something similar. If you have something in a login
script trying to do a listing of the filesystem root, or doing something
equivalent with another command, perhaps that is where it blocks because
of the inability to look at the mount point fully.

Even in that case, after I lock up the ‘ls’ command, hitting Ctrl+c fixes
it, so perhaps try that during your SSH login that hangs, to see if you are
stuck in a login script. If you are, figure out which one, and where, and
maybe a fix can be implemented that way too. It would be neat to see what
happens if you move your mountpoints one level deeper: for example,
instead of /scratch_hpc2, put them at /import/scratch_hpc2 or somewhere
else not as public as the root of the filesystem.
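
Something like this, perhaps (the /import location is just my habit):

mkdir -p /import/scratch_hpc2

and then in /etc/fstab:

hpc2:/scratch /import/scratch_hpc2 nfs defaults 0 0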

If you cannot get in even with Ctrl+c after the SSH login takes the
username and password successfully, the next step is to figure out what is
blocking, and for that I would use strace. While already logged into the
box, find a command that locks things up, as loading a new shell
(‘bash’) may. To use strace, run the following, and be prepared for a lot
of output, so make sure your terminal’s scrollback buffer is big:

strace -ttt -ff bash
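
If the scrollback cannot hold it all, you can also send the trace to files, one per process (the /tmp/bash-trace name is only an example; -ff with -o creates one file per PID):

strace -ttt -ff -o /tmp/bash-trace bash
tail -n 5 /tmp/bash-trace.*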

Post the output here and let’s see what the last line is.


Good luck.

Hi ron7000,

Well, I haven’t seen that happen so far, but of course the theoretical risk is there. So YMMV.

The hanging sessions, including logins, point at hanging NFS mounts somewhere in the search path: either the shell search path ($PATH) or some access path used by one of the generally started programs. I even had Eclipse traverse a multitude of directories that should have been none of its business… and I noticed because those accesses hung on a bad NFS mount point.
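
To check whether anything in the search path sits on NFS, a minimal sketch (the 2-second limit is arbitrary; SIGKILL is used because plain SIGTERM may not interrupt a process stuck in NFS I/O):

echo "$PATH" | tr ':' '\n' | while read -r d; do
    timeout -s KILL 2 stat -f -c '%T %n' "$d" 2>/dev/null
done | grep -i nfs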

So from my point of view, you have only two options with NFS:

Either you have the client wait for any NFS access ad infinitum (the “hard” behavior), which helps avoid corruption but may cause severe “hangs” from the users’ POV.

Or you allow the I/O operations to fail when the server is down (the “soft” option).

If you’ve found a third option, please let me know: we run many systems relying on NFS mounts all over the place, with some of the servers intentionally unavailable over extended periods of time. So we’d suffer from hangs without the “soft” option whenever those resources weren’t properly unmounted before killing the server, which can easily happen when the NFS server is a mobile system.

Regards,
J