Hi! I am pretty new to Kubernetes in general, and I’ve been tinkering with a three-worker-node Rancher cluster for the last two months. Lab use, just to learn. I have had a lot of trouble with NFS for PVCs (specifically, permissions-related issues once the image creates directories a few levels down and sets ownership on them, regardless of the uid/gid I set), and bind mounting from the node isn’t really scalable - which led me to Longhorn.
I’ve read the documentation that is available and watched the couple of related Rancher Labs videos, but I still have a number of questions stemming from my uncertainty about what the expected behavior is.
In the Cloud Native 2019 video, it is noted that a statefulset is preferable to a deployment for use with Longhorn, due to the need to keep each mounted volume on one node (RWO). When deployed as a statefulset, can it be scaled beyond one pod/node, using the block-level replication to line up the volumes?
Edit: To clarify - in testing I have been unable to scale beyond one live pod; every other pod, regardless of node, gets stuck at “unable to attach volumes”.
Is it designed to operate that way? If not:
I: How can you scale to multiple pods or nodes?
II: What is the point of defining it as a statefulset vs. just a single-pod deployment?
III: Is there any way to switch nodes in this scenario, other than detaching the volume from the Longhorn UI and then reattaching it to the node with the desired pod?
IV: What is the point of the replicas? Just for data resiliency? *Edit* Or HA in the scenario described below under #V?
V: In all of my tests, I’ve never been able to get a volume to re-attach to a new pod without forcefully deleting the pod and detaching the volume in a failed-node scenario. I see the expected timeline (5 minutes, some time, 6 minutes, etc.) documented under the node-failure section - but unless I manually intervene, nothing changes. I assume this is because, as a statefulset, the pods come up one by one, and since the state of the last pod is unknown, it won’t finish creating. Is that correct/expected? *Edit* Re-reading the document, I see that it is, although I have never seen that final six-minute step happen; I’ve had to detach the volume and re-attach it to another node. Is there a better approach I can take to automate this?
VI: I’ve noticed many examples of Longhorn using deployments, despite this. I see that as long as you lock your pods to the one node that has the mounted volume, this works at scale. I assume these examples are just for show, and that this is not desirable since it could lead to data corruption (1 node, 3x pods, 1x volume). Is that correct?
Are you using a volumeClaimTemplate in the statefulset? That’s what makes a statefulset different from a deployment: a statefulset can create PVCs automatically, and is thus scalable with RWO-type storage.
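Something like this minimal sketch is what I mean (the names, image, and mount path are placeholders, and the storage class is assumed to be named `longhorn`):

```bash
# Sketch: a StatefulSet whose volumeClaimTemplate creates one Longhorn-backed
# RWO PVC per replica (data-orgtest-0, data-orgtest-1, ...). The names, image,
# and mount path here are hypothetical, not a canonical example.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: orgtest
spec:
  clusterIP: None            # the headless service the StatefulSet requires
  selector:
    app: orgtest
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orgtest
spec:
  serviceName: orgtest
  replicas: 3
  selector:
    matchLabels:
      app: orgtest
  template:
    metadata:
      labels:
        app: orgtest
    spec:
      containers:
      - name: organizr
        image: organizrtools/organizr-v2
        volumeMounts:
        - name: data
          mountPath: /config
  volumeClaimTemplates:      # one PVC per pod, rather than one shared PVC
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
EOF
```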
I saw the references to that, but I didn’t understand what it meant. I was creating the statefulset using the Rancher GUI and assumed the template was created by it. I also assumed the required headless service was made by it, but I couldn’t see that either… so there might be some missing pieces in my test.
When you say it would be scalable, then… would it be as I described - basically, with the block replication handling the back end between the - say - three replicas, one on each node, and then a pod on each node with its respective volume replica mounted?
It actually looks like you requested that it be added back in 2018, and it was, so while I’d still like to understand the answers to my questions above, I wonder if an additional question I should ask is: how was that implemented, and what is the correct way to use it?
They mention it here, but not specifically how to use it from the GUI.
Oh! That makes sense; my mistake was continuing to create it as a standalone persistent volume claim.
Two more questions:
When I make the statefulset with the volume claim template, I do see the little prompt about the upstream Kubernetes issue. If I set the desired number of replicas as I create it, it seems fine, but if I go to edit anything I get the error noted in the link. Is that just the current state, with the workaround being: set your scale as you create the set, and… you can’t edit those sets?
Second question - data is replicating across the set of three, but the replication is slow enough that I could cycle through a load-balanced set and see the changes come into effect from the other replicas over a minute or so. In practice it worked, but what happens if there are human-generated conflicts with that long a delay?
The upstream issue is related to editing the statefulset’s volumeClaimTemplate. I think you probably don’t need to edit the volumeClaimTemplate after creation? It will scale with the statefulset pods automatically.
What workload are you running? The replication between different Longhorn volumes (PVs) is done by the workload, not Longhorn. Also check the node CPU load to see if that’s the bottleneck.
I was getting errors after attempting to make any change to the set from the GUI, including adding a loadbalancer service.
Not testing with anything specific; that test just happened to use the docker image organizrtools/organizr-v2, since it was the last one I tested with.
On that note, my fundamental assumptions about this were incorrect - I had thought that the replicas would be mounted separately, but in this scenario there is a new PV for each pod, each with the default replica count (3). So, as you noted, the workload is doing the replication (or not). I can’t imagine why it would, so I’m not sure why I was seeing that; I assume I was mistaken and was seeing the same pod over and over, with the change reversion just being cached content (?).
It does bring me back to my original set of questions about the how and why of scaling, and I think the answer I’m inferring is that if you want to use stateful applications with Longhorn for storage, the workload itself needs to be the source of replication between pods and their attached volumes. There is no out-of-the-box solution for scaling pods if the application is stateful, and the only purpose of the replicas for each volume is data resiliency. Is that correct?
You can still use a statefulset or a deployment with one replica to do that. The replication in this case will be done by Longhorn.
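For example, a sketch of that one-replica pattern with a standalone Longhorn PVC (names are placeholders):

```bash
# Sketch: a one-replica Deployment on a standalone Longhorn RWO PVC.
# Longhorn replicates this single volume's data across nodes for resiliency.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orgtest-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orgtest
spec:
  replicas: 1                # must stay at 1: the volume is RWO
  strategy:
    type: Recreate           # avoids two pods attaching the volume during a rollout
  selector:
    matchLabels:
      app: orgtest
  template:
    metadata:
      labels:
        app: orgtest
    spec:
      containers:
      - name: organizr
        image: organizrtools/organizr-v2
        volumeMounts:
        - name: data
          mountPath: /config
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: orgtest-data
EOF
```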
If your application can share one directory for data, you can use an RWX type of persistent storage, e.g. an NFS server, which allows multiple pods to access the same piece of data. But in that case, the application needs to implement some kind of locking by itself to avoid different instances writing to the same file simultaneously, which can cause data corruption.
In the end, it all depends on which workload you’re running and how it scales. Any workload designed to scale needs to either store different data for each instance (which fits RWO), or share the same data across all instances but implement a mechanism to coordinate between them (which fits RWX).
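For comparison, an RWX claim would look something like this sketch, assuming a StorageClass (hypothetically named `nfs` here) whose provisioner supports ReadWriteMany:

```bash
# Sketch: an RWX claim that multiple pods can mount at once, assuming an
# RWX-capable provisioner behind the (hypothetical) "nfs" StorageClass.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi
EOF
```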
Your comments were very helpful in understanding the scope of what Longhorn can and cannot do with a workload, and after a few hours of testing different workloads I think I have a better appreciation for where it would be useful. However - there is still one thing I am stuck on - this statefulset editing issue.
I mentioned the alert I see above, and your reply makes sense - I shouldn’t need to edit that volume claim. But the end result I see is that I can’t even make changes to services, such as if I wanted to switch from a nodeport to a loadbalancer. Am I missing something?
Here is the error I get in that example: “Validation failed in API: StatefulSet.apps “orgtest10” is invalid: spec: Forbidden: updates to statefulset spec for fields other than ‘replicas’, ‘template’, and ‘updateStrategy’ are forbidden”
Those are limitations imposed by Kubernetes. Editing a resource on Kubernetes normally implies you want a non-disruptive upgrade to the new spec, so many fields cannot be changed. In this case, you need to delete the objects and recreate them.
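Roughly like this sketch (the manifest file name is a placeholder). Note that PVCs created from a volumeClaimTemplate are not deleted along with the statefulset, and a recreated set with the same name and template picks the existing PVCs back up:

```bash
# Sketch: recreate a StatefulSet to change a forbidden field. The PVCs
# (and the Longhorn volumes behind them) survive the delete and are
# re-bound by the recreated set, since PVC names follow
# <template>-<set>-<ordinal>.
kubectl delete statefulset orgtest10
kubectl apply -f orgtest10.yaml   # the updated spec, same name and volumeClaimTemplate
```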
So, if you wanted the data to persist, you would need to re-attach the volume to the new object?
Edit: In that case, if you wanted to do that without restoring from a backup - rather, just re-attach the persistent volume after the PVC is removed - how would you do that? The default reclaim policy is Delete, so the PV is dumped as soon as the claim is removed, and you can’t just reassign it while the claim is in place.
Am I on the wrong track, and is there both a yaml and Rancher GUI approach to dealing with that?
Hmm… I tried detaching the volume, then deleting the statefulset, recreating it, and attaching the existing volume claim - it attaches, but the prior data is gone. This is with it initially being created as a volumeClaimTemplate for the statefulset, having some data written, and then being detached before the workload is deleted. Am I taking the wrong approach?
I think the data is gone because you’ve detached the volume first, which may prevent the filesystem from flushing/syncing the data to the volume.
Instead, after you write the data in the workload, you can scale down the statefulset, and you will see the volume detach automatically. Then scale the statefulset back up; the volume will be attached automatically and the data will still be there.
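From kubectl, that would be something like this (set name taken from your earlier error message):

```bash
# Sketch: scale the set to zero (pods terminate, Longhorn detaches the
# volumes cleanly), then scale back up (PVCs re-bind, data intact).
kubectl scale statefulset orgtest22 --replicas=0
kubectl scale statefulset orgtest22 --replicas=3
```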
You can change the ReclaimPolicy to Retain if you don’t want PV/PVC deletion to cause Longhorn to delete the volume.
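A sketch of a storage class with that set (the provisioner name and parameters depend on your Longhorn version, so treat these as assumptions):

```bash
# Sketch: a Longhorn StorageClass with reclaimPolicy Retain, so deleting
# the PVC leaves the PV and the Longhorn volume (and its data) in place.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-retain
provisioner: driver.longhorn.io   # may be rancher.io/longhorn on older Flexvolume setups
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
reclaimPolicy: Retain             # the default class uses Delete
EOF
```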
Whoops, my mistake - I accidentally tested on a deployment. I redid it with a statefulset, but I can’t scale down the statefulset; I get the:
“Validation failed in API: StatefulSet.apps “orgtest22” is invalid: spec: Forbidden: updates to statefulset spec for fields other than ‘replicas’, ‘template’, and ‘updateStrategy’ are forbidden” error we were talking about before.
Also - maybe the ReclaimPolicy? Where is that set for Longhorn?
I see that under Customize for the storage class, you have the option to change the reclaim policy from Delete to Retain via the Rancher GUI, but if you flip it you get:
“Validation failed in API: StorageClass.storage.k8s.io “longhorn” is invalid: [parameters: Forbidden: updates to parameters are forbidden., reclaimPolicy: Forbidden: updates to reclaimPolicy are forbidden.]”
So I’m guessing the only option in this case would be to redeploy with that value set?