Troubleshooting with Crossplane and FluxCD… could be better

I’m gaining valuable experience with Crossplane and FluxCD at my current job. While the GitOps approach promises seamless operations, my journey has been a mix of discovery and challenges. If you are considering adopting these tools, I would hesitate to recommend them, because I have found them error-prone and difficult to troubleshoot. While I appreciate the innovations, there’s a part of me that still misses the (relative) simplicity of Terraform.

Obscure Errors

Well, for starters, I’ve posted on several of these already:

https://faulttolerant.work/2023/08/15/flux-helm-upgrade-failed-another-operation-install-upgrade-rollback-is-in-progress/

https://faulttolerant.work/2023/08/15/flux-error-timed-out-waiting-for-the-condition/

https://faulttolerant.work/2023/07/11/flux-kustomization-wont-sync-but-theres-no-error-message/

https://faulttolerant.work/2023/06/16/flux-reconcile-helmrelease-doesnt-do-what-it-sounds-like/

Dependency Resolution Creates Duplicates (and more obscure errors)

This is at least the second time in a couple of months that I’ve run into this. Crossplane Upbound Provider B depends on Crossplane Upbound Provider A, so both Providers are created. If I then also create Crossplane Upbound Provider A explicitly afterward, I get an error (at least when the name I gave it differs from the auto-generated name).
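Here is a sketch of the shape of the problem; the package names and version tags are illustrative, not the exact ones I used:

# provider-azure-managedidentity declares a dependency on the Azure family
# provider, so Crossplane installs that dependency automatically.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-azure-managedidentity
spec:
  package: xpkg.upbound.io/upbound/provider-azure-managedidentity:v0.38.0
---
# Declaring the family provider again, under a name that differs from the
# auto-generated one, is what produced the duplicate for me.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: my-azure-family
spec:
  package: xpkg.upbound.io/upbound/provider-family-azure:v0.38.0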

I think that was what caused the two errors below. It was not clear to me at the time that these meant a duplicate Crossplane Provider.

# On the Provider
Warning  InstallPackageRevision  33m (x6 over 39m)  packages/provider.pkg.crossplane.io  cannot apply package revision: cannot patch object: Operation cannot be fulfilled on providerrevisions.pkg.crossplane.io "upbound-provider-azure-managedidentity-xxxx": the object has been modified; please apply your changes to the latest version and try again

# On the ProviderRevision
Warning  ResolveDependencies  2m1s (x31 over 27m)  packages/providerrevision.pkg.crossplane.io  cannot resolve package dependencies: node already exists

Why Did The Finalizer Block Deletion?

Crossplane managed objects have a Finalizer, which at times has been a hurdle when trying to delete a resource. Generally, there is a reason a finalizer was added in the code, so you should always investigate before manually removing it.

So how does one investigate the finalizer? By looking at the documentation, code, and logs for that Crossplane managed resource. According to https://docs.crossplane.io/latest/concepts/managed-resources/#finalizers:

When Crossplane deletes a managed resource the Provider begins deleting the external resource, but the managed resource remains until the external resource is fully deleted.

When the external resource is fully deleted Crossplane removes the Finalizer and deletes the managed resource object.

You should track down the external resource and see why that is failing to delete. But it gets very tempting to just remove the finalizer.
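A few commands usually get me started on that (a sketch; the resource kinds and names are placeholders):

# List Crossplane managed resources and check their Ready/Synced conditions
kubectl get managed
# Inspect the stuck resource's events and finalizers
kubectl describe <kind> <name>
kubectl get <kind> <name> -o jsonpath='{.metadata.finalizers}'
# Last resort only: remove the finalizers, accepting that the external
# resource may be orphaned in the cloud provider
kubectl patch <kind> <name> --type=merge -p '{"metadata":{"finalizers":null}}'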

Logging

Logs can be invaluable when troubleshooting, but with Crossplane and FluxCD, it’s often perplexing to determine which log to consult. Is it the Crossplane pod logs, Provider Logs, or one of the many Flux logs? Crossplane and Flux both provide guides, but a consolidated guide might be a topic for another day.
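In the meantime, here are the usual places I check (a sketch, assuming the default crossplane-system and flux-system namespaces):

# Crossplane core logs
kubectl logs -n crossplane-system deployment/crossplane
# Find and read a Provider pod's logs
kubectl get pods -n crossplane-system
kubectl logs -n crossplane-system <provider-pod-name>
# Aggregated Flux controller errors
flux logs --all-namespaces --level=error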

How do you decide which log to consult first? Share your experiences in the comments below.



Flux error: “timed out waiting for the condition”

Flux has the unfortunate habit of having a helmrelease fail with the error “timed out waiting for the condition”, without providing any details about which condition it was waiting for.

Here are a couple of things to try:

  • Look for unhealthy pods and check their logs – this works fairly often, but not always: Flux might be waiting on a resource other than a pod, or it might be timing out while deleting something
  • Check the Flux logs for errors with flux logs --all-namespaces --level=error. There should be information here, though I haven’t found it that useful for troubleshooting; don’t get too caught up in an error that could be a red herring
  • Delete the helmrelease. This is what I had to do today. I had removed several Kustomizations that should have led to the helmrelease being deleted, but for some reason Flux was timing out on that. When I deleted the helmrelease with kubectl delete, though, it stayed gone and seems to have cleaned up the resources it had been using
  • Disable wait on the helmrelease – you can disable the health checks to view the failing resources, as in the YAML below. This is recommended on https://fluxcd.io/flux/cheatsheets/troubleshooting/ but I would only use it as a last resort. The health check is usually there for a reason, so removing it could lead to more messiness.
# USE WITH CAUTION, may lead to instability
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  install:
    disableWait: true
  upgrade:
    disableWait: true
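
For the first and third suggestions above, the commands look something like this (a sketch; names and namespaces are placeholders):

# Spot unhealthy pods across the cluster
kubectl get pods -A | grep -vE 'Running|Completed'
# Read a pod's logs, including a crashed container's previous run
kubectl logs -n <namespace> <pod-name> --previous
# Last resort: delete the stuck helmrelease
kubectl delete helmrelease -n <namespace> <helmrelease-name>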

Flux Kustomization won’t sync but there’s no error message

I added a resource and it’s in a file that is listed in my Flux Kustomization. The gitrepository has synced but my change is not showing up. I did kubectl describe to see the events on the Kustomization but I don’t see any messages about my resource. What’s wrong?

Flux processes changes in a batch and must have a successful dry-run before applying them. This means that any Warning on the Kustomization will prevent changes from syncing. So run kubectl describe, look for any other warnings on the Kustomization, and resolve them, even if they are on a different resource than the one you are concerned with.
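For example (a sketch; the Kustomization name is a placeholder):

# Show events, including Warnings, for the Flux Kustomization
kubectl describe kustomization <name> -n flux-system
# Or trigger a fresh reconciliation and watch for the dry-run failure
flux reconcile kustomization <name> --with-source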

CLI command to show status of an AKS upgrade

The other day I was monitoring an AKS Kubernetes version upgrade, but the notifications in the Azure Portal had stopped updating. I found out that I can check the status from the command line.

emilyzall@Emilys-MBP ~ % az aks show -g my-rg -n my-cluster --query 'provisioningState'
"Upgrading"
emilyzall@Emilys-MBP ~ % az aks show -g my-rg -n my-cluster --query 'provisioningState'
"Succeeded"

“|” and “|-” in YAML and Kubernetes

How this came up

I was looking into an issue with a Flux Kustomization patch, and I noticed that in the example given, one of the patches started with “|” and the other started with “|-”.

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  # ...omitted for brevity
  patches:
    - patch: |-
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: not-used
        spec:
          template:
            metadata:
              annotations:
                cluster-autoscaler.kubernetes.io/safe-to-evict: "true"        
      target:
        kind: Deployment
        labelSelector: "app.kubernetes.io/part-of=my-app"
    - patch: |
        - op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337
        - op: add
          path: /spec/template/spec/containers/0/securityContext
          value:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            capabilities:
              drop:
                - ALL        
      target:
        kind: Deployment
        name: podinfo
        namespace: apps

What are “|” and “|-” in this context?

These are YAML block scalar indicators.

A patch field in a Kustomization expects a YAML- or JSON-formatted string. You could write it as a single-line string with escaped newlines and spaces for indentation, but you usually wouldn’t, because it is awkward to read.

patch: "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: not-used\nspec:\n  template:\n    metadata:\n      annotations:\n        cluster-autoscaler.kubernetes.io/safe-to-evict: \"true\""

You would rather write this as multiple lines with the proper indentation. However:

# NO, the value for patch is YAML but not a string
    - patch:
        - op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337

# NO, this does not preserve newlines and is not recommended
    - patch:
        "- op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337"

So how do I write a valid YAML string across multiple lines? Use the Literal Block Style Indicator: |

This is how you indicate in YAML that everything nested below this line should be interpreted as a multiline string with internal newlines and indentation preserved.
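So the corrected form of the second “NO” example above looks like this:

# YES, the literal block indicator makes the nested lines one string
    - patch: |
        - op: add
          path: /spec/template/spec/securityContext
          value:
            runAsUser: 10000
            fsGroup: 1337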

What about the minus sign then?

It will only matter if there are trailing newlines.

Block Chomping Indicators: The chomping indicator in YAML determines what should be done with trailing newlines in a block scalar. It can be one of three values:

  • No indicator: This means that a single final newline will be included in the value, and any additional trailing newlines will be excluded (this is called “clipping”).
  • The ‘+’ indicator: This means all trailing newlines will be included in the value (“keeping”).
  • The ‘-’ indicator: This means that all trailing newlines, including the final one, will be excluded from the value (“stripping”).
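
A quick way to see the difference for yourself (assuming python3 with PyYAML is available; the demo strings are illustrative):

# Clip (no indicator): keeps a single final newline
python3 -c 'import yaml; print(repr(yaml.safe_load("key: |\n  text\n\n\n")["key"]))'   # 'text\n'
# Keep (+): keeps all trailing newlines
python3 -c 'import yaml; print(repr(yaml.safe_load("key: |+\n  text\n\n\n")["key"]))'  # 'text\n\n\n'
# Strip (-): drops all trailing newlines
python3 -c 'import yaml; print(repr(yaml.safe_load("key: |-\n  text\n\n\n")["key"]))'  # 'text'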

How do I tell if there are trailing newlines?

You can use cat -e <filename> to see EOL (end-of-line) characters. This displays Unix line endings (\n or LF) as $ and Windows line endings (\r\n or CRLF) as ^M$.
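
For example (demo.txt here is just a throwaway file):

% printf 'hello\n\n' > demo.txt
% cat -e demo.txt
hello$
$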

Do trailing newlines matter?

It really depends. I suggest adhering closely to the style used in the documentation you are referring to in order to be on the safe side.