Keep calm and stay fault tolerant

Posted by Emily Zall

Hi there! My name is Emily and I am excited to share my experiences and thoughts on the world of DevOps and tech through this blog. With over a decade of experience in the tech industry and specifically in DevOps since 2018, I have a lot to share about this ever-evolving field. When I’m not working, I love to singing Barbershop, playing complicated board games and all things cats. I have two cats (Puppy and Cation), a foster cat, and I also volunteer doing Trap Neuter Return. I live in Rhode Island. I’m the organizer of https://www.meetup.com/rhode-island-codes/ and I’m excited to connect with readers from all over and build a community centered around all things DevOps.

ChatGPT: do not trust but do verify

Posted by Emily Zall

As a DevOps Engineer I know not to trust ChatGPT, always fact check and make sure it makes sense. But with that major caveat I still find it a very helpful tool in many scenarios.

Here’s some things I like to ask it to do:

Explain a concept
Give me troubleshooting ideas
Summarize a long document (especially yaml/kubernetes
Answer a question I have about a specific document. Including, but not limited to, you can give it a long yaml file and ask it to tell you what a certain field is nested under. But now that I have a yaml plugin for my IDE, that’s an easier way to see that. I may write a post on Jetbrains plugins too.
Evaluate my work to provide any suggestions
Improve my documentation
Identify issues with Kubernetes resources and yaml files. It doesn’t always work but you can copy it in and ask what’s wrong with it. It’s the most intelligent diff tool I’ve seen, it will compare your resource to known examples and attempt to not just show you every difference but to use context to see which of the differences are relevant. You can give it a specific document to refer to or let it use everything in its corpus.
Variable naming. If it’s a challenging one, you can explain what the variable does and get suggestions. I found the suggested names were descriptive yet fairly concise and I like how it explained the reasoning behind them.
Ask it for reassurance when you’re feeling discouraged! ^_^
Just for fun. Behold the Dapr Kafka ^

Commenting values.yaml in a helm chart

Posted by Emily Zall

Helm charts are supposed to give you an easy way to install a set of kubernetes resources. In theory you just choose the chart you want to install, pass your variables and away you go. In practice there is usually a lot of different configuration values and some are not intuitive or explained so you end up having to dig through the templates to figure out what configuration you need.

When developing a helm chart myself I want the values to have descriptive self-explanatory names and if the name doesn’t say it all, then document it. The helm charts I’ve worked on are not public but here is an example of what you could do with commenting a values.yaml file: https://github.com/concourse/concourse-chart/blob/master/values.yaml. You won’t necessarily want to go quite this verbose but I think this is a useful example of the kind of information users want to know and how you can format it. For example – how do you indicate what values are acceptable if this is not self-evident?

Crossplane Resource Types In a Nutshell

Posted by Emily Zall

I think the way Crossplane explains this could be clearer.

CompositeResourceDefinition (XRD, related but not exactly the same as a CRD) – this is like the interface for a function that defines the parameters, their types and which are required.

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xglobal-vault-accesses.azure.simbachain.com
spec:
  group: azure.simbachain.com
  # for the CompositeResource XR, the actual resource created by the claim
  names:
    kind: XGlobal-Vault-Access
    plural: xglobal-vault-accesses
  claimNames:
    kind: Global-Vault-Access
    plural: global-vault-accesses
  versions:
    - name: v1beta1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
<<snip>>

Composition – this is like the internal implementation code of the function, note however that there can be more than one implementation of a given function with labels used to distinguish them.

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: global-vault-access
  labels:
    crossplane.io/xrd: xglobal-vault-accesses.azure.simbachain.com
    provider: azure
spec:
  compositeTypeRef:
    apiVersion: azure.simbachain.com/v1beta1
    kind: XGlobal-Vault-Access
  resources:
<<snip>>

Claim – this is a request like the invocation of the function, you pass in your parameters. Note: the kind is not Claim, it will be whatever was under claimNames > kind. The word Claim generally does not appear anywhere, you can add it to a comment if that helps

# below is a crossplane claim that requests an instance of XGlobal-Vault-Access and its managed resources
---
apiVersion: azure.simbachain.com/v1beta1
kind: Global-Vault-Access
metadata:
  annotations:
    crossplane.io/external-name: blocks-simba-kongtest
  # will be used to generate names for the resources in the composition
  name: blocks-simba-kongtest
  namespace: blocks-simba-kongtest
spec:
  compositionRef:
    name: global-vault-access
  parameters:
    location: eastus
<<snip>>

CompositeResource (XR) – Sometimes this is talked about as if it’s the same as the claim but it’s a separate Kubernetes resource and it’s important to differentiate them. This results from the resources above instead of being defined directly. You define an XRD, make a Composition that defines the internal implementation of that XRD, make a claim that passes parameters then a CompositeResource will be created automatically. When I am troubleshooting this is the first place I want to look because it will has Synced and Ready status and contains resource refs to all the individual resources. Note: the kind is not CompositeResource, it will be whatever was under names > kind. It is traditionally named starting with X in the XRD.

kubectl describe xGlobal-Vault-Access

Crossplane will ‘eat’ your errors if you don’t run it with debug

Posted by Emily Zall

There was no error showing on my Crossplane CompositeResource but one of the resources was not being created. No errors in the Crossplane log: kubectl -n crossplane-system logs -lapp=crossplane --since=1h

A helpful person on https://crossplane.slack.com/ssb/redirect pointed out that I needed to run crossplane with args --debug in order to see the error message. This surprised me, I am used to --debug being for relatively obscure information but now I know that in Crossplane this is not the case. Details here: https://github.com/crossplane/crossplane/discussions/4886

Why is Kubernetes so hard?

Posted by Emily Zall

In the spirit of https://jvns.ca/blog/2023/10/06/new-talk–making-hard-things-easy/ and Matt Surabian’s DevOps Days Boston talk “Teaching and Learning When Words No Good Sense Make” – why is Kubernetes so hard?

In a word, complexity, but let’s break that out a couple dimensions of that complexity

Overloading of terms

Overloading is when a single term can have multiple meanings. For example, What The Heck Is Ingress? <— a whole episode of the Kubernetes Unpacked podcast devoted to that question

Very long, very nested resources defined in yaml

We have a single Kubernetes Composition in our codebase that is 529 lines long and has 5 levels of nesting. So the level of Cognitive Complexity is high, putting a strain on working memory.

Distributed system

Kubernetes is a distributed system because it spreads workloads, storage, and operations across multiple machines, coordinating them to appear as one cohesive system from the user’s perspective. This distributed nature allows Kubernetes to provide its core benefits of scalability, resilience, and flexibility. However,

Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed.
https://aws.amazon.com/builders-library/challenges-with-distributed-systems/

Declarative syntax

Indirectness of Action:
- Unlike imperative languages where code flows sequentially, in Kubernetes, you specify a desired state. The system’s controllers then work to realize it. This indirect approach can make it difficult to pinpoint problems since you’re not commanding each step but relying on Kubernetes’ logic to interpret and act on your intent.
Asynchronous Behavior:
- While imperative code executes in a predictable sequence, Kubernetes’ actions often run asynchronously. After declaring a desired state, the system might not immediately reflect that outcome. And it can be hard to tell if something has failed or whether it’s still processing.
Error Reporting:
- In imperative programming, errors are usually tied to a specific action or code line. In Kubernetes, however, errors may be symptomatic of deeper issues. For example, a pod in a “CrashLoopBackOff” state might be due to various reasons, demanding a more in-depth examination of logs, events, and configurations to trace the root cause.

How do we make it easier?

There’s no one simple answer to this. But to start, I recommend:

COMMENT your Kubernetes!
learn the fundamentals: https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/
When you get discouraged, take a break but don’t give up. You’re not alone in your struggles!

“the object has been modified; please apply your changes to the latest version and try again”

Posted by Emily Zall

I don’t know why this error sometimes comes up on a resource managed by FluxCD but it’s useful to know that it seems to usually resolve itself with time.

Edited to add: per https://github.com/crossplane/crossplane/issues/2114 this error is “almost never a cause for concern” and “frequently a red herring”.

Troubleshooting with Crossplane and FluxCD… could be better

Posted by Emily Zall

I’m gaining valuable experience with Crossplane and FluxCD at my current job. While the GitOps approach promises seamless operations, my journey has been a mix of discovery and challenges. If you are considering adopting these tools I would be hesitant to recommend it because I find they have been error prone and difficult to troubleshoot. While I appreciate the innovations, there’s a part of me that still misses the (relative) simplicity of Terraform.

Obscure Errors

Well, for starters, I’ve posted on several of these already:

https://faulttolerant.work/2023/08/15/flux-helm-upgrade-failed-another-operation-install-upgrade-rollback-is-in-progress/

https://faulttolerant.work/2023/08/15/flux-error-timed-out-waiting-for-the-condition/

https://faulttolerant.work/2023/07/11/flux-kustomization-wont-sync-but-theres-no-error-message/

https://faulttolerant.work/2023/06/16/flux-reconcile-helmrelease-doesnt-do-what-it-sounds-like/

Dependency Resolution Creates Duplicates (and more obscure errors)

This is at least the second time in a couple months I’ve run into this. Crossplane Upbound Provider B depends on Crossplane Upbound Provider A so both Providers are created. Then if I also have an explicit creation of Crossplane Upbound Provider A that happens afterward, I get an error (at least if the name I gave it was different than the auto-generated name).

I think that was what caused the two errors below. It was not clear to me at the time that these meant a duplicate Crossplane Provider.

#On the ProviderWarning  
InstallPackageRevision  33m (x6 over 39m)     packages/provider.pkg.crossplane.io  cannot apply package revision: cannot patch object: Operation cannot be fulfilled on providerrevisions.pkg.crossplane.io "upbound-provider-azure-managedidentity-xxxx": the object has been modified; please apply your changes to the latest version and try again

#On the ProviderRevisionWarning  
ResolveDependencies  2m1s (x31 over 27m)  packages/providerrevision.pkg.crossplane.io  cannot resolve package dependencies: node already exists

Why Did The Finalizer Block Deletion?

Crossplane managed objects have a Finalizer, which at times has been a hurdle when trying to delete a resource. Generally, there is a reason for adding a finalizer into the code, so you should always investigate before manually deleting it.

So how does one investigate the finalizer? By looking at the documentation/code/logs for that crossplane managed resource. According to https://docs.crossplane.io/latest/concepts/managed-resources/#finalizers:

When Crossplane deletes a managed resource the Provider begins deleting the external resource, but the managed resource remains until the external resource is fully deleted.

When the external resource is fully deleted Crossplane removes the Finalizer and deletes the managed resource object.

You should track down the external resource and see why that is failing to delete. But it gets very tempting to just remove the finalizer.

Logging

Logs can be invaluable when troubleshooting, but with Crossplane and FluxCD, it’s often perplexing to determine which log to consult. Is it the Crossplane pod logs, Provider Logs, or one of the many Flux logs? Crossplane and Flux both provide guides, but a consolidated guide might be a topic for another day.

How do you decide which log you consult first? Share your experiences in the comments below.

Flux: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress

Posted by Emily Zall

Delete the secret sh.helm.release.v1.<release-name>.<version> and retrigger the reconciliation loop (flux suspend then flux resume on the helmrelease in question).

It may also be useful to know that helm ls will not show any reconciling or failed helmreleases if you want to see them all, you need to run helm ls --all

Flux error: “timed out waiting for the condition”

Posted by Emily Zall

Flux has the unfortunate habit of having a helmrelease fail with the error: “timed out waiting for the condition” without providing any details about what condition it was waiting for.

Here’s a couple things to try:

Look for unhealthy pods and check their logs – somewhat frequently this works, however not always because it might be waiting for a resource other than the pods or it might be timing out on deleting something
Check flux logs for errors. There should be info here but I haven’t necessarily found them that useful for troubleshooting. Don’t get too caught up in an error that could be a red herring flux logs --all-namespaces --level=error
Delete the helmrelease. This is what I had to do today. I had removed several Kustomizations that should have led to the helmrelease being deleted but for some reason flux was timing out on that. When I deleted the helmrelease using kubectl delete though, it remained gone and seems to have cleaned up the resources it had been using
Disable wait on the helmrelease – you can disable the health checks to view the failing resources. This is recommended on https://fluxcd.io/flux/cheatsheets/troubleshooting/ but I would only use it as a last resort. The health check is usually there for a reason so removing it could lead to more messiness.

# USE WITH CAUTION, may lead to instability
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
 name: podinfo
 namespace: default
spec:
 install:
   disableWait: true
 upgrade:
   disableWait: true

Flux Kustomization won’t sync but there’s no error message

Posted by Emily Zall

I added a resource and it’s in a file that is listed in my Flux Kustomization. The gitrepository has synced but my change is not showing up. I did kubectl describe to see the events on the Kustomization but I don’t see any messages about my resource. What’s wrong?

Flux processes changes in a batch and must have a successful dry-run before applying changes. This means that any Warnings on the Kustomization will prevent changes from syncing. So run kubectl describe and look for any other warnings on the Kustomization and resolve them, even if they are on a different resource than the one you are concerned with.

Fault Tolerant

Faults happen – build resiliently

Keep calm and stay fault tolerant

ChatGPT: do not trust but do verify

Commenting values.yaml in a helm chart

Crossplane Resource Types In a Nutshell

Crossplane will ‘eat’ your errors if you don’t run it with debug

Why is Kubernetes so hard?

How do we make it easier?

“the object has been modified; please apply your changes to the latest version and try again”

Troubleshooting with Crossplane and FluxCD… could be better

Obscure Errors

Dependency Resolution Creates Duplicates (and more obscure errors)

Why Did The Finalizer Block Deletion?

Logging

Flux: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress

Flux error: “timed out waiting for the condition”

Flux Kustomization won’t sync but there’s no error message