I’m currently experiencing an issue where the Longhorn Manager is unable to connect to the Instance Manager on the same node, or at the very least suffers transient failures when connecting. Here’s an example of the logs I’m seeing in the Longhorn Manager:
longhorn-manager-q4x8p longhorn-manager W1215 03:35:22.010195 1 logging.go:59] [core] [Channel #197 SubChannel #198] grpc: addrConn.createTransport failed to connect to {
longhorn-manager-q4x8p longhorn-manager "Addr": "10.42.10.18:8502",
longhorn-manager-q4x8p longhorn-manager "ServerName": "10.42.10.18:8502",
longhorn-manager-q4x8p longhorn-manager "Attributes": null,
longhorn-manager-q4x8p longhorn-manager "BalancerAttributes": null,
longhorn-manager-q4x8p longhorn-manager "Type": 0,
longhorn-manager-q4x8p longhorn-manager "Metadata": null
longhorn-manager-q4x8p longhorn-manager }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.10.18:8502: operation was canceled"
Because of this, Longhorn appears to be unstable: volumes mounted into Pods periodically experience an I/O error, which causes the Pod to restart.
I’ve deployed Longhorn 1.5.3 into a Rancher-managed RKE v1.27.6 cluster using Ubuntu 22.04 nodes. I’ve also reconfigured the taints and tolerations, as I only want one set of nodes’ disks to be used, and I want the UI / driver on the tools nodes. I’ve used the following values.yaml for this deployment:
defaultSettings:
  createDefaultDiskLabeledNodes: true
  taintToleration: "type=tools:NoSchedule; katonic.ai/node-pool:NoSchedule"
longhornManager:
  tolerations:
    - key: "type"
      operator: "Equal"
      value: "tools"
      effect: "NoSchedule"
    - key: "katonic.ai/node-pool"
      operator: "Exists"
      effect: "NoSchedule"
longhornDriver:
  tolerations:
    - key: "type"
      operator: "Equal"
      value: "tools"
      effect: "NoSchedule"
  nodeSelector:
    type: "tools"
longhornUI:
  tolerations:
    - key: "type"
      operator: "Equal"
      value: "tools"
      effect: "NoSchedule"
  nodeSelector:
    type: "tools"
I’ve also created two debug Pods on the same node and confirmed, using nc, that the two Pods can communicate with each other.
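Roughly what that check looked like (the pod IP and port are illustrative; both Pods were pinned to the affected node):

# Pod A: listen on an arbitrary port
nc -l -p 8502

# Pod B (same node): connect to Pod A's IP; typed lines arrive in Pod A
nc 10.42.10.19 8502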
Any help on this would be greatly appreciated.
Des