Crossplane will ‘eat’ your errors if you don’t run it with debug

There was no error showing on my Crossplane CompositeResource, but one of the resources was not being created. There were no errors in the Crossplane log either: kubectl -n crossplane-system logs -lapp=crossplane --since=1h

A helpful person on https://crossplane.slack.com/ssb/redirect pointed out that I needed to run Crossplane with the --debug argument in order to see the error message. This surprised me: I'm used to --debug being for relatively obscure information, but now I know that in Crossplane this is not the case. Details here: https://github.com/crossplane/crossplane/discussions/4886
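If you installed Crossplane with its Helm chart, one way to turn this on is via chart values. This is a sketch, assuming the chart exposes an args list that is passed through to the Crossplane container (which is what the chart I used does):

```yaml
# values.yaml for the Crossplane Helm chart
# Assumes the chart's `args` value is appended to the Crossplane container's args
args:
  - --debug
```

Then something like helm upgrade crossplane crossplane-stable/crossplane -n crossplane-system -f values.yaml, and re-check the pod logs.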

Why is Kubernetes so hard?

In the spirit of https://jvns.ca/blog/2023/10/06/new-talk--making-hard-things-easy/ and Matt Surabian’s DevOps Days Boston talk “Teaching and Learning When Words No Good Sense Make” – why is Kubernetes so hard?

In a word: complexity. But let’s break out a couple of dimensions of that complexity.

Overloading of terms

Overloading is when a single term can have multiple meanings. For example: What The Heck Is Ingress? A whole episode of the Kubernetes Unpacked podcast is devoted to that one question.

Very long, very nested resources defined in yaml

We have a single Crossplane Composition in our codebase that is 529 lines long and has 5 levels of nesting. The level of cognitive complexity is high, putting a strain on working memory.

Distributed system

Kubernetes is a distributed system because it spreads workloads, storage, and operations across multiple machines, coordinating them to appear as one cohesive system from the user’s perspective. This distributed nature allows Kubernetes to provide its core benefits of scalability, resilience, and flexibility. However,

Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed.

https://aws.amazon.com/builders-library/challenges-with-distributed-systems/

Declarative syntax

  1. Indirectness of Action:
    • Unlike imperative languages where code flows sequentially, in Kubernetes, you specify a desired state. The system’s controllers then work to realize it. This indirect approach can make it difficult to pinpoint problems since you’re not commanding each step but relying on Kubernetes’ logic to interpret and act on your intent.
  2. Asynchronous Behavior:
    • While imperative code executes in a predictable sequence, Kubernetes’ actions often run asynchronously. After declaring a desired state, the system might not immediately reflect that outcome. And it can be hard to tell if something has failed or whether it’s still processing.
  3. Error Reporting:
    • In imperative programming, errors are usually tied to a specific action or code line. In Kubernetes, however, errors may be symptomatic of deeper issues. For example, a pod in a “CrashLoopBackOff” state might be due to various reasons, demanding a more in-depth examination of logs, events, and configurations to trace the root cause.
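To make the CrashLoopBackOff example concrete, the in-depth examination usually looks something like this (pod and namespace names are placeholders):

```shell
# Why is the container restarting? Check the pod's events and status.
kubectl describe pod my-pod -n my-namespace

# Logs from the previous (crashed) container instance, not the current one
kubectl logs my-pod -n my-namespace --previous

# Recent events across the namespace, newest last
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp
```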

How do we make it easier?

There’s no one simple answer to this. But to start, I recommend:

Troubleshooting with Crossplane and FluxCD… could be better

I’m gaining valuable experience with Crossplane and FluxCD at my current job. While the GitOps approach promises seamless operations, my journey has been a mix of discovery and challenges. If you are considering adopting these tools, I would hesitate to recommend them: I have found them error-prone and difficult to troubleshoot. While I appreciate the innovations, there’s a part of me that still misses the (relative) simplicity of Terraform.

Obscure Errors

Well, for starters, I’ve posted on several of these already:

https://faulttolerant.work/2023/08/15/flux-helm-upgrade-failed-another-operation-install-upgrade-rollback-is-in-progress/

https://faulttolerant.work/2023/08/15/flux-error-timed-out-waiting-for-the-condition/

https://faulttolerant.work/2023/07/11/flux-kustomization-wont-sync-but-theres-no-error-message/

https://faulttolerant.work/2023/06/16/flux-reconcile-helmrelease-doesnt-do-what-it-sounds-like/

Dependency Resolution Creates Duplicates (and more obscure errors)

This is at least the second time in a couple of months I’ve run into this. Crossplane Upbound Provider B depends on Crossplane Upbound Provider A, so both Providers are created. Then if I also have an explicit creation of Crossplane Upbound Provider A that happens afterward, I get an error (at least if the name I gave it was different from the auto-generated name).

I think that was what caused the two errors below. It was not clear to me at the time that these meant a duplicate Crossplane Provider.

# On the Provider
Warning  InstallPackageRevision  33m (x6 over 39m)  packages/provider.pkg.crossplane.io  cannot apply package revision: cannot patch object: Operation cannot be fulfilled on providerrevisions.pkg.crossplane.io "upbound-provider-azure-managedidentity-xxxx": the object has been modified; please apply your changes to the latest version and try again

# On the ProviderRevision
Warning  ResolveDependencies  2m1s (x31 over 27m)  packages/providerrevision.pkg.crossplane.io  cannot resolve package dependencies: node already exists
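One way to spot this situation is to list the Providers and ProviderRevisions and look for two entries wrapping the same package image:

```shell
# Two Providers pointing at the same package image suggest a duplicate
kubectl get providers.pkg.crossplane.io -o wide

# The revisions show which package each Provider wraps
kubectl get providerrevisions.pkg.crossplane.io
```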

Why Did The Finalizer Block Deletion?

Crossplane managed objects have a Finalizer, which at times has been a hurdle when trying to delete a resource. Generally, there is a reason for adding a finalizer into the code, so you should always investigate before manually deleting it.

So how does one investigate the finalizer? By looking at the documentation/code/logs for that Crossplane managed resource. According to https://docs.crossplane.io/latest/concepts/managed-resources/#finalizers:

When Crossplane deletes a managed resource the Provider begins deleting the external resource, but the managed resource remains until the external resource is fully deleted.

When the external resource is fully deleted Crossplane removes the Finalizer and deletes the managed resource object.

You should track down the external resource and see why that is failing to delete. But it gets very tempting to just remove the finalizer.
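To investigate (and, as an absolute last resort, remove) the finalizer, something like the following works. The kind and name here (a Bucket managed resource) are placeholders for whatever resource you are stuck on:

```shell
# See which finalizers are blocking deletion, and the resource's conditions
kubectl get bucket my-bucket -o jsonpath='{.metadata.finalizers}'
kubectl describe bucket my-bucket

# LAST RESORT: clear the finalizers so Kubernetes can delete the object.
# This can orphan the external resource, which will keep existing (and
# costing money) at the cloud provider.
kubectl patch bucket my-bucket --type=merge -p '{"metadata":{"finalizers":[]}}'
```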

Logging

Logs can be invaluable when troubleshooting, but with Crossplane and FluxCD, it’s often perplexing to determine which log to consult. Is it the Crossplane pod logs, Provider Logs, or one of the many Flux logs? Crossplane and Flux both provide guides, but a consolidated guide might be a topic for another day.

How do you decide which log to consult first? Share your experiences in the comments below.



Flux error: “timed out waiting for the condition”

Flux has the unfortunate habit of having a helmrelease fail with the error: “timed out waiting for the condition” without providing any details about what condition it was waiting for.

Here are a couple of things to try:

  • Look for unhealthy pods and check their logs. This works fairly often, but not always: the release might be waiting on a resource other than the pods, or it might be timing out while deleting something.
  • Check the Flux logs for errors: flux logs --all-namespaces --level=error. There should be information here, but I haven’t found it that useful for troubleshooting; don’t get too caught up in an error that could be a red herring.
  • Delete the helmrelease. This is what I had to do today. I had removed several Kustomizations that should have led to the helmrelease being deleted but for some reason flux was timing out on that. When I deleted the helmrelease using kubectl delete though, it remained gone and seems to have cleaned up the resources it had been using
  • Disable wait on the helmrelease – you can disable the health checks to view the failing resources. This is recommended on https://fluxcd.io/flux/cheatsheets/troubleshooting/ but I would only use it as a last resort. The health check is usually there for a reason so removing it could lead to more messiness.
# USE WITH CAUTION, may lead to instability
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  install:
    disableWait: true
  upgrade:
    disableWait: true

Flux Kustomization won’t sync but there’s no error message

I added a resource and it’s in a file that is listed in my Flux Kustomization. The gitrepository has synced but my change is not showing up. I did kubectl describe to see the events on the Kustomization but I don’t see any messages about my resource. What’s wrong?

Flux processes changes in a batch and must have a successful dry-run before applying changes. This means that any Warnings on the Kustomization will prevent changes from syncing. So run kubectl describe and look for any other warnings on the Kustomization and resolve them, even if they are on a different resource than the one you are concerned with.
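In practice the triage looks something like this (the Kustomization name and namespace are placeholders):

```shell
# Look for Warning events on the Kustomization, even ones about other resources
kubectl describe kustomization my-app -n flux-system

# A quick overview of whether each Kustomization is Ready, and why not
flux get kustomizations --all-namespaces
```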

CLI command to show status of an AKS upgrade

The other day I was monitoring an AKS Kubernetes version upgrade, but the notifications in the Azure Portal had stopped updating. It turns out I can check the status from the command line.

emilyzall@Emilys-MBP ~ % az aks show -g my-rg -n my-cluster --query 'provisioningState'
"Upgrading"
emilyzall@Emilys-MBP ~ % az aks show -g my-rg -n my-cluster --query 'provisioningState'
"Succeeded"

“|” and “|-” in YAML and Kubernetes

How this came up

I was looking into an issue with a Flux Kustomization patch and I noticed that in the example given, one of the patches started with “|” and the other started with “|-”

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  # ...omitted for brevity
  patches:
    - patch: |-
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: not-used
        spec:
          template:
            metadata:
              annotations:
                cluster-autoscaler.kubernetes.io/safe-to-evict: "true"        
      target:
        kind: Deployment
        labelSelector: "app.kubernetes.io/part-of=my-app"
    - patch: |
        - op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337
        - op: add
          path: /spec/template/spec/containers/0/securityContext
          value:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            capabilities:
              drop:
                - ALL        
      target:
        kind: Deployment
        name: podinfo
        namespace: apps

What are “|” and “|-” in this context?

These are YAML syntax components.

A patch field in a Kustomization expects a YAML- or JSON-formatted string. You could write it as a single-line string with newline escapes and spaces for indentation, but you usually wouldn’t, because it is awkward to read.

patch: "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: not-used\nspec:\n  template:\n    metadata:\n      annotations:\n        cluster-autoscaler.kubernetes.io/safe-to-evict: \"true\""

You would rather write this as multiple lines with the proper indentation. However:

# NO: here the value of patch is a YAML list, not a string
    - patch:
        - op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337

# NO: a quoted multi-line scalar folds the newlines into spaces instead of preserving them
    - patch:
        "- op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337"

So how do I write a valid YAML string across multiple lines? Use the Literal Block Style Indicator: |

This is how you indicate in YAML that everything nested below this line should be interpreted as a multiline string with internal newlines and indentation preserved.

What about the minus sign then?

It only matters if there are trailing newlines.

Block Chomping Indicators: The chomping indicator in YAML determines what should be done with trailing newlines in a block scalar. It can be one of three values:

  • No indicator: This means the value is “clipped”: a single final newline is kept, and any additional trailing newlines are excluded.
  • The ‘+’ indicator: This means all trailing newlines will be included in the value.
  • The ‘-‘ indicator: This means that all trailing newlines will be excluded from the value.
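The three rules are small enough to sketch as code. This is not PyYAML’s implementation, just an illustrative function; raw stands for a block scalar’s content with all of its trailing newlines still attached:

```python
def chomp(raw: str, indicator: str = "") -> str:
    """Apply a YAML block chomping indicator to raw block-scalar content."""
    if indicator == "+":       # keep: preserve every trailing newline
        return raw
    body = raw.rstrip("\n")
    if indicator == "-":       # strip: drop all trailing newlines
        return body
    return body + "\n"         # no indicator (clip): exactly one final newline


# "a\nb\n\n" is the content of a block with one trailing blank line
print(repr(chomp("a\nb\n\n")))        # clip  -> 'a\nb\n'
print(repr(chomp("a\nb\n\n", "-")))   # strip -> 'a\nb'
print(repr(chomp("a\nb\n\n", "+")))   # keep  -> 'a\nb\n\n'
```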

How do I tell if there are trailing newlines?

You can use cat -e <filename> to see EOL (end-of-line) characters. This displays Unix line endings (\n or LF) as $ and Windows line endings (\r\n or CRLF) as ^M$
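For example, piping a string with a trailing blank line through cat -e makes each newline visible:

```shell
# Each $ marks a \n; the bare $ on the last line is the trailing blank line
printf 'line1\nline2\n\n' | cat -e
```

This prints line1$, line2$, and then a lone $ on its own line.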

Do trailing newlines matter?

It really depends. I suggest adhering closely to the style used in the documentation you are referring to in order to be on the safe side.

flux reconcile helmrelease doesn’t do what it sounds like

I was faced with a failed Helm release.

emilyzall@Emilys-MBP datadog-agent % helm history datadog-agent -n datadog
REVISION	UPDATED                 	STATUS  	CHART         	APP VERSION	DESCRIPTION
1       	Thu May 25 09:45:40 2023	deployed	datadog-3.25.1	7          	Install complete
2       	Tue Jun 13 09:10:18 2023	failed  	datadog-3.25.1	7          	Upgrade "datadog-agent" failed: timed out waiting for the condition
3       	Tue Jun 13 09:30:23 2023	failed  	datadog-3.25.1	7          	Release "datadog-agent" failed: timed out waiting for the condition

I want to see if this error is still happening. Being new to Flux CD, at first I thought that it might retry this failed Helm release automatically as part of the sync. This didn’t seem to be the case, though, based on the last date in the history above.

Maybe I could force it to retry the Helm Release using the flux reconcile command. It sounded good at the time!

emilyzall@Emilys-MBP datadog-agent % flux reconcile
The reconcile sub-commands trigger a reconciliation of sources and resources.

So…

emilyzall@Emilys-MBP datadog-agent % flux reconcile helmrelease datadog-agent -n datadog --verbose
► annotating HelmRelease datadog-agent in datadog namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✗ HelmRelease reconciliation failed: Helm rollback failed: release datadog-agent failed: timed out waiting for the condition

Oh, I guess it’s still failing. But wait: why is it timing out within seconds? That seems mighty quick. I found that the default timeout is 5 minutes and my HelmRelease resource was configured with a 20 minute timeout. So what gives? I searched for something like “flux reconcile helmrelease” and found https://stackoverflow.com/questions/65677606/is-there-a-way-to-manually-retry-a-helmrelease-for-fluxcd-helmoperator.

There is a way to manually retry a Helm release with Flux, but it is not flux reconcile helmrelease. You actually have to do flux suspend and then flux resume.

emilyzall@Emilys-MBP datadog-agent % flux suspend helmrelease datadog-agent -n datadog --verbose
► suspending helmrelease datadog-agent in datadog namespace
✔ helmrelease suspended
emilyzall@Emilys-MBP datadog-agent % flux resume helmrelease datadog-agent -n datadog --verbose
► resuming helmrelease datadog-agent in datadog namespace
✔ helmrelease resumed
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✔ applied revision 3.25.1