addrConn.createTransport failed to connect to Instance Manager

I’m currently experiencing an issue where the Longhorn Manager is unable to connect to the Instance Manager on the same node, or at the very least has transient failures connecting to it. Here’s an example of a log line I’m seeing in the Longhorn Manager:

longhorn-manager-q4x8p longhorn-manager W1215 03:35:22.010195       1 logging.go:59] [core] [Channel #197 SubChannel #198] grpc: addrConn.createTransport failed to connect to {
longhorn-manager-q4x8p longhorn-manager   "Addr": "10.42.10.18:8502",
longhorn-manager-q4x8p longhorn-manager   "ServerName": "10.42.10.18:8502",
longhorn-manager-q4x8p longhorn-manager   "Attributes": null,
longhorn-manager-q4x8p longhorn-manager   "BalancerAttributes": null,
longhorn-manager-q4x8p longhorn-manager   "Type": 0,
longhorn-manager-q4x8p longhorn-manager   "Metadata": null
longhorn-manager-q4x8p longhorn-manager }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.10.18:8502: operation was canceled"

As a result, Longhorn appears to be unstable: volumes mounted into Pods periodically experience an I/O error, which causes the Pod to restart.

I’ve deployed Longhorn 1.5.3 into a Rancher-managed RKE v1.27.6 cluster using Ubuntu 22.04 nodes. I’ve also reconfigured the taints and tolerations, as I only want one set of nodes’ disks to be used, and I want the UI / driver on the tools nodes. I used the following values.yaml for this deployment:

defaultSettings:
  createDefaultDiskLabeledNodes: true
  taintToleration: type=tools:NoSchedule; katonic.ai/node-pool:NoSchedule
longhornManager:
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "tools"
    effect: "NoSchedule"
  - key: "katonic.ai/node-pool"
    operator: "Exists"
    effect: "NoSchedule"
longhornDriver:
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "tools"
    effect: "NoSchedule"
  nodeSelector:
    type: "tools"
longhornUI:
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "tools"
    effect: "NoSchedule"
  nodeSelector:
    type: "tools"

I’ve also created two debug Pods on the same node and confirmed, using nc, that the two Pods can communicate with each other.
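For reference, the reachability check I ran with nc can be sketched in Python. This is just an illustration of the TCP-level test, with a local listener standing in for the instance-manager endpoint (the real target was 10.42.10.18:8502); all names here are illustrative, not Longhorn APIs:

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds, like `nc -z`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Local listener standing in for the instance-manager gRPC port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # ephemeral port instead of 8502
srv.listen(1)
host, port = srv.getsockname()

print(tcp_reachable(host, port))   # True while the listener is up
srv.close()
```

In the cluster, the equivalent was running nc from one debug Pod against the other Pod’s IP and port, which succeeded, so basic Pod-to-Pod TCP connectivity on the node looks fine.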

Any help on this would be greatly appreciated 🙏

Des

Hi @des! Could you reproduce the problem and send us a support bundle at longhorn-support-bundle@suse.com?

Hi,

With more digging, it seems that these connection issues are quite common; I discovered that we also see them periodically in other clusters. The difference is that those other clusters are much smaller, whereas in the new, bigger cluster I was tailing all instance-manager logs, so the errors were more obvious. It also seems the instability can occur occasionally when creating new volumes, but it resolves in a reasonably short amount of time.

So I probably don’t need to take any further action. But out of interest, what is the process for creating a support bundle?

Thanks.

Des