Is there a ‘simple’ guide to getting the most from an NFS4 server?
Our servers generally have 60-90 users hanging off them and they
are fine, but occasionally things conspire so that the full 180 users
are attached to each one and they get very slow.
I have fiddled with ‘use_kernel_nfsd_number’, and increasing it from
the default 4 to 64 makes a lot of difference, but over 128 makes it
slower - I am assuming all those processes exhaust something else.
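For reference, this is roughly how I set it (a sketch; the sysconfig variable name and the on-the-fly `rpc.nfsd` call are as on our openSUSE servers - check the paths on yours):

```shell
# /etc/sysconfig/nfs - number of kernel nfsd threads started at boot
USE_KERNEL_NFSD_NUMBER="64"

# Change the running server without a restart; the kernel adjusts
# the thread pool on the fly.
rpc.nfsd 64
```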
These servers just sit there being NFS4 servers (for openSUSE
clients) and nothing else.
Has anyone got some guides?
Ta
M
Hi M,
the typical bottlenecks apply here. Server-wise these are disk I/O, CPU and memory. Network matters too (not only bandwidth: NFS has pretty small window sizes, so latency is a factor as well). Protocol overheads, additional services (e.g. DNS delays) and NFS4-specific side effects (e.g. Kerberos performance) will have an influence, too.
What you haven’t told us is the workload these users run across the NFS mounts. Are these the home directories of up to 180 users, with all the temp files and databases (KDE, Firefox, …) running via NFS?
The main factors in our environment were disk I/O throughput and latencies (since switching to bcache, that has no longer been a problem), and setting the “async” option server-side.
use_kernel_nfsd_number
It depends on the number of requests that come in in parallel - “4” is a very conservative number, more suitable for test setups. We’re running at 128. Take a look at /proc/net/rpc/nfsd (the “th” values): the first value is the number of threads currently active, and the second counts the number of times all threads were busy. Depending on your kernel NFS server version, you may not have meaningful values there (all zeros); in that case /proc/fs/nfsd/pool_stats is the file to look at (see knfsd-stats.txt for a description).
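As a quick way to pull those two values out, something like the following works (a sketch - the exact layout of the “th” line varies between kernel versions, so treat the field positions as an assumption; a sample line is hard-coded here so the script is self-contained, on a live server you would use `grep '^th' /proc/net/rpc/nfsd` instead):

```shell
#!/bin/sh
# Report from the "th" line of /proc/net/rpc/nfsd:
#   field 2 = number of nfsd threads
#   field 3 = number of times all threads were busy at once
# Sample line used here; on a real server:
#   th_line=$(grep '^th' /proc/net/rpc/nfsd)
th_line="th 64 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000"

# Split the line into positional parameters on whitespace.
set -- $th_line
threads=$2
all_busy=$3
echo "threads=$threads all_busy=$all_busy"
```

If the second number keeps climbing while users report slowness, the thread count is too low for the load.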
Regards,
Jens
Hi, thanks for replying.
Yes, they are home directories (for openSUSE 13.2 clients).
Looking at /proc/net/rpc/nfsd shows 0 in the second column.
I was wondering if nfsd is running out of buffer space - the machines have 12G of RAM, so it could have most of it.
It seems to happen more often when 180 users are all using LibreOffice - I am going to investigate LO’s file locking as a suspect.
Ta
M
Hi M,
It seems to happen more often when 180 users are all using LibreOffice - I am going to investigate LO’s file locking as a suspect
and try to run some long-term statistics on server memory/CPU/network/disk I/O. If you have no systems management tool set up for this, fetch MRTG or similar and monitor the appropriate SNMP variables. The resulting graphs may shed some light on the actual bottleneck.
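If setting up MRTG feels like too much to start with, even a minimal sampling loop gives you something to correlate with the slow periods (a sketch using /proc files that exist on any Linux box; the log path, interval and sample count are placeholders - in practice you would raise the interval to a minute or more, let it run for days, and add iostat/nfsstat output):

```shell
#!/bin/sh
# Append a timestamped load/memory sample to a log file every
# INTERVAL seconds; run it under nohup or from cron on the server.
LOG="${LOG:-/tmp/nfs-server-stats.log}"
INTERVAL="${INTERVAL:-1}"   # bump to 60 or more for real monitoring
SAMPLES="${SAMPLES:-3}"     # small here; use a large value in practice

i=0
while [ "$i" -lt "$SAMPLES" ]; do
    load=$(cut -d' ' -f1-3 /proc/loadavg)
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    echo "$(date '+%F %T') load=[$load] memfree_kb=$free_kb" >> "$LOG"
    i=$((i + 1))
    [ "$i" -lt "$SAMPLES" ] && sleep "$INTERVAL"
done
```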
Regards,
Jens
I looked at /proc/fs/nfsd/pool_stats and ‘sockets-enqueued’ is significantly non-zero
- as in, about half a million… hmmm
M
Hi M,
[QUOTE=interele;28674]I looked at /proc/fs/nfsd/pool_stats and ‘sockets-enqueued’ is significantly non-zero
- as in, about half a million… hmmm
M[/QUOTE]
I guess the more important question is: since when? You wrote that you started with 4 threads… if there was no server reboot in between, those large numbers could well date from that time.
It’s more important to monitor that number (as in “increase per time unit”) and correlate it with reports of bad overall performance.
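A sketch of what “increase per time unit” could look like in practice - two reads of the counter some seconds apart, then a delta (the column position of sockets-enqueued is an assumption from the layouts I have seen; verify it against knfsd-stats.txt for your kernel; sample values are hard-coded here so the script is self-contained):

```shell
#!/bin/sh
# Compute the growth rate of sockets-enqueued between two samples.
# On a live server the two values would come from something like:
#   awk 'NR==2 {print $3}' /proc/fs/nfsd/pool_stats
# taken $interval seconds apart (column 3 = sockets-enqueued,
# assuming the usual header layout - check knfsd-stats.txt).
before=500000
after=500360
interval=60    # seconds between the two reads

rate=$(( (after - before) / interval ))
echo "sockets-enqueued grew by $rate/s"
```

A steady rate of zero during the slow periods rules the thread count out; a spike that lines up with the complaints points straight at it.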
Regards,
Jens