NFS issue with K8S RKE

We are using RKE with Kubernetes 1.18. We have deployed the mito.ai application, which consists of a PostgreSQL database, the Airflow workflow scheduler, and Spark applications. The Postgres, Airflow scheduler, and Airflow webserver pods are long-running. The Airflow scheduler periodically launches Airflow worker pods and Spark application pods; both are short-lived, running only for a few minutes (~5-10 minutes). All of these pods mount their required PVCs (backed by NFS).
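
For context, the PVCs bind to statically provisioned NFS volumes, roughly as sketched below. This is a minimal illustration only; the server address, export path, and object names are placeholders, not our actual values:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags-pv              # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:                               # statically provisioned NFS volume
    server: 10.0.0.10                # placeholder NFS server address
    path: /exports/airflow-dags      # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags-pvc             # placeholder name
spec:
  storageClassName: ""               # bind statically, skip dynamic provisioning
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: airflow-dags-pv        # bind directly to the PV above
EOF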

After a fresh installation of mito.ai, things work without issues for a while (~1-2 weeks); after that, short-lived pod launches intermittently fail with an IOError, as seen in the attached log file. The issue only affects pods at launch time or right after launching: if a pod starts fine, the error is never encountered during its run. This led us to suspect the issue occurs only when a new connection to the NFS server is made. We were able to isolate and replicate the issue using our diagnostic pods as well (see the sketch below).
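
Our diagnostic pods are essentially throwaway pods that mount the same PVC and read every file from it, mimicking what a freshly launched Airflow worker does at startup. A minimal sketch of one (the image, pod name, and claim name are placeholders):

# Launch a short-lived pod that mounts the suspect PVC and reads all files;
# on an affected cluster this reproduces the intermittent IOError.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nfs-diagnostic               # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: reader
    image: busybox
    command: ["sh", "-c", "find /mnt/dags -type f -exec wc -l {} \\;"]
    volumeMounts:
    - name: dags
      mountPath: /mnt/dags
  volumes:
  - name: dags
    persistentVolumeClaim:
      claimName: airflow-dags-pvc    # placeholder claim name
EOF
kubectl logs -f nfs-diagnostic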

Please find attached snippets of all the error logs we saw while debugging this.

As a workaround, the infra team provided us with a new NFS server to use instead of the one mentioned above. At first we faced a file-permission issue on this server, where all files were owned by uid 99 and gid 99. To fix this, the VEON infra team made some fixes on the server side and gave us additional mount options to use. Applying their recommendations led to our current problem, where the Postgres pods do not start and produce no logs at all.
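
For what it's worth, uid/gid 99 is the "nobody" account on CentOS/RHEL, which usually points at NFSv4 ID mapping or root squashing on the export, and a pod that produces no logs at all typically never reached its entrypoint, i.e. the volume mount itself failed. These are the checks we run to confirm both; pod name, namespace, and paths below are placeholders:

# Mount failures surface as pod events, not container logs.
kubectl describe pod postgres-0 -n mito             # placeholder pod/namespace
kubectl get events -n mito --sort-by=.lastTimestamp

# On the node hosting the pod: the options the share was actually
# mounted with (nfsvers, hard/soft, timeo, and so on).
grep nfs /proc/mounts

# Numeric ownership as seen from the kubelet's mount point (path applies
# to in-tree NFS volumes); uid/gid 99 here means the server or idmapper
# is squashing our users to "nobody".
ls -ln /var/lib/kubelet/pods/*/volumes/kubernetes.io~nfs/*/ 2>/dev/null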

Please advise on how we can resolve this.

Error logs seen in the Airflow worker pod

[2021-07-12 14:15:09,698] {dagbag.py:450} ERROR - [Errno 5] I/O error: '/usr/local/airflow/dags/dags/usecases/sdp-charging-live/batch/sdp-charging-live-inference-2021_07_08_09_24.py'

Error messages seen during diagnostics when running the shell command find /usr/local/airflow/dags/dags -type f -exec wc -l {} \;

/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/snapshot/sdp-charging-live-v027-2-snapshot-2021_08_10_11_46.py: IO Error
/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/snapshot/__pycache__/sdp-charging-live-v027-2-snapshot-2021_08_10_11_46.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/batch/sdp-charging-live-v027-2-inference-2021_08_10_12_28.py: IO Error
/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/batch/__pycache__/sdp-charging-live-v027-2-inference-2021_08_10_12_28.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/training/sdp-charging-live-v027-2-training-2021_08_10_12_28.py: IO Error
/usr/local/airflow/dags/dags/usecases/sdp-charging-live-v027-2/training/__pycache__/sdp-charging-live-v027-2-training-2021_08_10_12_28.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/snapshot/air-charging-live-v027-2-snapshot-2021_08_10_11_49.py: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/snapshot/__pycache__/air-charging-live-v027-2-snapshot-2021_08_10_11_49.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/batch/air-charging-live-v027-2-inference-2021_08_10_13_16.py: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/batch/__pycache__/air-charging-live-v027-2-inference-2021_08_10_13_16.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/training/air-charging-live-v027-2-training-2021_08_10_13_16.py: IO Error
/usr/local/airflow/dags/dags/usecases/air-charging-live-v027-2/training/__pycache__/air-charging-live-v027-2-training-2021_08_10_13_16.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/test_dag/dummy_dag.py: IO Error
/usr/local/airflow/dags/dags/test_dag/__pycache__/dummy_dag.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/test_dag/__pycache__/dummy_dag_2.cpython-38.pyc: IO Error
/usr/local/airflow/dags/dags/test_dag/dummy_dag_2.py: IO Error
/usr/local/airflow/dags/dags/test_dag/pod_template.yaml: IO Error

The following hung-task message was sometimes seen in the kernel log on the node

Sep 26 13:47:35 master03 kernel: INFO: task mount.nfs:78174 blocked for more than 120 seconds.
Sep 26 13:47:35 master03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 13:47:35 master03 kernel: mount.nfs       D ffff957a3dc5acc0     0 78174  78173 0x00000080
Sep 26 13:47:35 master03 kernel: Call Trace:
Sep 26 13:47:35 master03 kernel: [<ffffffffaa8c7745>] ? wake_up_bit+0x25/0x30
Sep 26 13:47:35 master03 kernel: [<ffffffffaaf85d89>] schedule+0x29/0x70
Sep 26 13:47:35 master03 kernel: [<ffffffffaaf83891>] schedule_timeout+0x221/0x2d0
Sep 26 13:47:35 master03 kernel: [<ffffffffaa8ae46b>] ? lock_timer_base.isra.38+0x2b/0x50
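
When this appears, we check the affected node for mount.nfs processes stuck in uninterruptible sleep (state D), which is what the trace above shows. For example (run as root on the node; the PID is the one from the message above):

# Processes stuck in uninterruptible sleep; a mount.nfs entry here
# corresponds to the hung-task warning in the kernel log.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Kernel stack of the stuck mount (PID taken from the log above).
cat /proc/78174/stack

# Recent NFS/RPC related kernel messages.
dmesg | grep -iE 'nfs|rpc' | tail -n 50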

Regards
Chirag