Suggestion: Add rollback option for ingress changes

Hi,

I don't know if this has been considered, but I inadvertently deleted a host-mapping route in the UI (Workloads -> Load Balancing) and found out the “hard” way that there isn't a “rollback” option for ingress changes…

I'm not sure if this is natively supported as a “rollout”/“revision” on the Ingress type, or if it would need to be something at the Rancher level (I admit I only searched very superficially outside Rancher). Either way, I believe this would be quite helpful.

In general, I think some deletion actions are a bit “too easy” (or maybe some actions should be “easier”, so we don't get in the habit of cmd+clicking). E.g. deleting a deployment is exactly the same effort & experience as deleting a single pod. This can cause problems if you are doing things under pressure, for example, or just “on auto-pilot” (user error, I know, but it's something I'd revisit).

I wonder how others feel?

The only “native” rollback is for deployments, because it keeps old ReplicaSets around at a scale of 0 after transitioning from the old one to the new one.
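For reference, that native rollback flow looks roughly like this (a minimal sketch; the deployment and namespace names are placeholders, and note this only applies to Deployments, not Ingress):

```shell
# List the revisions kubectl knows about (one per retained ReplicaSet)
kubectl rollout history deployment/mydeployment -n mynamespace

# Roll back to the previous revision...
kubectl rollout undo deployment/mydeployment -n mynamespace

# ...or to a specific one
kubectl rollout undo deployment/mydeployment -n mynamespace --to-revision=2
```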

Hi @vincent, thanks for confirming that! I had realized as much after a bit of digging…

Would this be something that Rancher could handle elsewhere, be it through an annotation in the YAML itself or some other mechanism? (Is this something you guys believe Rancher should be implementing?)

Any comments about the “deletion difficulty levels” I mentioned? Do you guys think that is relevant/useful? Is there a better forum for suggesting features/changes?

Another issue the other day prompted an internal discussion here about approvals/workflows. Is this something Rancher is going to get into in the future? From a corporate/enterprise point of view I believe there is demand for this sort of thing…

We are currently testing keel with approvals in order to prevent errors from seeping into production (although it still doesn't help with ingress)… Another option for minimizing issues is using git-based tools, which would leave Rancher as more of a “view-only” tool (which is a pity imho).

Interested in hearing more about what you guys think for the future.

I assume you are deploying via an automated release pipeline and that the YAML (or Helm chart) you use to specify your resources (Ingress in this case) is under source control. In that situation, recovering from an accidental delete is simply a matter of re-running your pipeline (most changes are idempotent).
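In practice that recovery is just re-applying what's in source control (a rough sketch, with illustrative paths and release names):

```shell
# From a clean checkout of the repo that holds the manifests
git checkout main && git pull

# kubectl apply is declarative: resources that still exist are left alone,
# and the accidentally deleted Ingress gets recreated
kubectl apply -f k8s/ --namespace mynamespace

# Or, if the resources are packaged as a Helm chart
helm upgrade --install myapp ./chart --namespace mynamespace -f values-prod.yaml
```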

Yes, ideally everything is there, but with Rancher you are able to adjust things “on the fly”… with greater flexibility comes a greater possibility of screwing things up…

Do you use Rancher as a read-only interface?

The problem with the “re-run the pipeline” scenario is that if you are making some adjustments and inadvertently delete something midway through, “resetting the entire environment” can be more work than manually going in and recreating whatever you deleted… Anyway, it isn't something that happens that frequently, but it is annoying enough that I thought I'd bring it up.

In our case we never use the Rancher UI for production deployments; those are only permitted via release pipelines, which for us are implemented in Azure DevOps. One reason is that we need deployments to be audited (yes, I am aware that Rancher has an audit log, and we do export that to our CISO's GSOC), and we also have approval steps so that business owners or release managers accept those production changes. That is on top of the automated, repeatable and versioned logic for both the pipeline itself and the assets it creates or deploys, and the need to ensure a separation of concerns between development and release. So to me, the act of deploying resources into a cluster is much more than just the technical steps, which in themselves are relatively straightforward. Deployments for us are entirely ‘hands free’, but YMMV.

That's not to say that we don't call the Rancher API as part of this process, since that lets us tap into all the power of the platform without sacrificing the engineering principles that we want to apply. The UI is convenient, yes, but that model doesn't scale when you have tens or even hundreds of micro-services to manage through their life-cycle.

Our scale isn't especially high, but you might find that for you the UI still gives you the ‘shift-left’ self-service DevSecOps model you want without CI/CD integration. My background in automation and repeatable test-driven development tells me otherwise, but that doesn't make me right, or wrong. It's a choice, and in some cases an operating-model mandate.

HTHs

Fraser.

Thanks for sharing your scenario Fraser!

Yeah, that's pretty much what it looks like / should look like imo too… One of my questions in the follow-up post regarding approvals ties into exactly this… Currently we have a separate system handling it (enabling API-based or chatops-based approvals for our deployments by X managers, etc.)… In our case it doesn't support ingress changes yet…

That prompted me to ask whether the fine folks at Rancher have any plans to tackle this problem… For production we will most likely limit real usage of Rancher to visualization, etc., as it's just too risky to leave things in the hands of people who could inadvertently delete a resource… It's in dev environments or even staging, where we tweak routes for tests etc. more frequently, that we see the problems happening, because that is where people have access…

In our case we tend to try things out, sometimes tweaking the YAML and applying it, or applying it in Rancher (which is the problem area); then once we are “happy with the changes” we save those as the “final” YAML, commit it to git (or whatever VCS you want), and that goes through the pipeline to production.
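Concretely, that loop looks something like this for us (a sketch; resource and file names are made up):

```shell
# Iterate in the dev/staging cluster
kubectl apply -f ingress-draft.yaml -n staging

# Once happy, capture the final state and put it under version control
# (the exported YAML includes status and other server-added fields you may want to strip)
kubectl get ingress my-ingress -n staging -o yaml > ingress.yaml
git add ingress.yaml && git commit -m "Update ingress routes"

# The pipeline then applies ingress.yaml to production
```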

Generally k8s “is what it is” and we don’t try to fight that, because anything we add ends up breaking something that expects the native kubectl flow, or can’t really be enforced without taking away access to that entirely.

One thing in the direction of what you’re talking about is using helm/apps to manage the version/config that is deployed (from git), and potentially taking away the ability to update/delete the underlying resource from the regular users.
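The “taking away update/delete” part can be done with plain Kubernetes RBAC; a minimal sketch with placeholder role/user/namespace names (in Rancher you'd more likely express this through project/cluster role templates):

```shell
# Regular users only get read access to Ingresses in the namespace;
# changes then have to come through the pipeline's service account
kubectl create role ingress-viewer \
  --verb=get --verb=list --verb=watch \
  --resource=ingresses \
  -n myapp

kubectl create rolebinding dev-ingress-viewer \
  --role=ingress-viewer \
  --user=dev-user \
  -n myapp
```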

@vincent That makes a lot of sense!

I can see us changing a bit of our workflow in the future as we experiment with finding the right balance of easy flexibility vs safety/immutability/reproducibility across different environments (where you want the former in developer environments, the latter in production, and a balance in “unified development”, staging, UAT, etc.…)

Anyway, it's clearer now how far Rancher intends to go with this.

A quick question: in some places I see warnings about functionality being disabled or not functioning because a cluster or deployment wasn't made with Rancher - is there a document that explains why this is and what those functionality impairments/trade-offs are?

The other “where we're going” is Rio, which is more like the Cattle UX on top of the k8s ecosystem. There we have somewhat more flexibility; it's still ultimately all k8s/knative/istio resources, but you don't necessarily expect to interact with them on a daily basis.

Not sure what you mean on warnings, do you have a specific message?

This is one of them; if I come across others I can post those too…

(The issue above isn't a problem for us, as we don't really want Rancher creating services for us since we have things already defined, but I just wonder what the limitations are and where they are mapped out.)

It's sort of the opposite: that's not a “limitation” or something being “disabled”, but a “lack of added convenience”…

If you use the Rancher API (or UI, which is just an API client) to create a workload then we do a few extra things like create a corresponding service for the ports you expose (because that’s probably what you want) or pick an imagePullSecret that works for the registry the image is coming from.

If you’re creating raw resources through kubectl/kube-api, they are what they are and you have to specify the service or pull secret or whatever yourself.
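In other words, going the raw-kubectl route you end up writing things like the following yourself (illustrative names; this is just the generic Kubernetes pattern, not anything Rancher-specific):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myns
spec:
  selector:
    app: myapp          # you have to make this match the pod labels yourself
  ports:
    - port: 80
      targetPort: 8080
---
# ...and in the Deployment's pod spec, reference the registry credentials yourself:
# spec:
#   template:
#     spec:
#       imagePullSecrets:
#         - name: my-registry-secret
```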

Yes, I understand that's the “effect”, but I was wondering about the “cause”… What would be missing from the deployment that prevents this “added convenience”?

Once the workload has been created, be it through Rancher or kube, it's in Kubernetes, and as you said, “it is what it is”. I wanted to understand a bit better what prevents this sort of “convenience” from extending to non-Rancher-created services. Is it some sort of missing data that Rancher depends on being stashed away in the metadata, or a business decision by Rancher, etc.… (the second being just as valid a reason as the first imo!)? I'm just interested in the technical aspect, because I see Rancher “sticking to Kubernetes” and then adding convenience layers in some areas…

I'd also like to understand (if there is such documentation) a bit more about the metadata and cattle-related annotations in the YAML; is this mapped out anywhere? (Sometimes I export something and I'm unsure what to keep / what to remove. Learning more would be great.)

Thanks for taking the time to answer these questions! Greatly appreciated!

For both of those examples, it's not that something is missing from the deployment that you could “turn on” by adding labels/annotations to it.

It’s more like a translation layer where you POST /v3/projects/foo/workload X and that results in a POST to the k8s API with a modified X' deployment definition and separately a POST for a new service definition Y.

The workload you originally “asked for” is transformed (e.g. by finding a suitable secret and adding `imagePullSecret: blah` to the deployment, and adding a label that can be used for the Service selector) and the original thing you posted to Rancher is never actually persisted to k8s.

There are some things that are actually annotations written to k8s and picked up by controllers, which cause something to happen. The ones that come to mind are what project a namespace belongs to, and creatorId on most resources to identify which user created them, but there may be a couple of others. I don't think we have any docs currently detailing those, though.
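As a concrete illustration, an exported namespace will typically carry something along these lines (the IDs are placeholders, and the exact annotation keys may vary between Rancher versions, so treat this as an assumption and compare against an export from your own cluster):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  annotations:
    field.cattle.io/projectId: c-abc12:p-xyz34   # which Rancher project the namespace belongs to
    field.cattle.io/creatorId: user-abcde        # which Rancher user created the resource
```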