
Fluentd a Bit Slow

k8s logging fluentd

Installing fluentd helm chart

Starting off, I installed the fluentd helm chart like any other helm chart: grab the values.yaml file and then slowly change the settings, being careful not to break anything before it's even installed.
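
For reference, the steps looked roughly like this (I'm assuming the official fluent helm chart repo here; release name and namespace are whatever you picked):

    helm repo add fluent https://fluent.github.io/helm-charts
    helm show values fluent/fluentd > values.yaml
    # edit values.yaml carefully, then
    helm install fluentd fluent/fluentd -f values.yaml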

After deploying the helm chart, the single pod in the stateful set is crashing. Looking at the logs for the pod, there aren't any for the fluentd container. It is quite a bit harder when something is crashing but you don't know whether it's crashing because of the config or a setting you made in the values.yaml file.
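
At this point the debugging loop is just kubectl (the pod name fluentd-0 is an assumption based on the single-replica stateful set):

    kubectl logs fluentd-0 -c fluentd              # empty
    kubectl logs fluentd-0 -c fluentd --previous   # logs from the last crashed container, also empty
    kubectl describe pod fluentd-0                 # restart count and events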

The next step was to check the configs. I grabbed all of the configs and found the docker image to run locally. I set up the local directory to match what the config files were expecting. Everything starts just fine.
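
Running it locally was roughly this, assuming the upstream fluent/fluentd image (pick whichever tag matches the chart) and the configs copied into a local conf/ directory:

    docker run --rm -it \
      -v "$(pwd)/conf:/fluentd/etc" \
      fluent/fluentd:v1.16-1

The image's default command starts fluentd against /fluentd/etc/fluent.conf, so a broken config fails right away with a parse error.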

Ok, so the configs are fine then? Maybe? To answer that question I disabled all of the configs in the values.yaml file so fluentd loads just the bare minimum. It still crashes though.
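
Bare minimum here means a config along these lines (how it slots into the chart's values depends on the chart version, in my case the file config section of values.yaml):

    <source>
      @type forward
      port 24224
    </source>

    <match **>
      @type stdout
    </match>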

What is wrong? I exec into the pod and try to see if it's reading from the right config location. Sure enough, the configs exist and it's reading from the right location. The pod restarts after a few seconds, so there is only time to check with one or two commands.
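
Something along the lines of (again, fluentd-0 is the assumed pod name):

    kubectl exec -it fluentd-0 -- ls -la /fluentd/etc
    kubectl exec -it fluentd-0 -- find /fluentd/etc -type f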

Let's see what command is getting run: ps aux | grep fluentd. Ok, it's picking up the right config.

Let's see if the config itself is wrong: fluentd --dry-run -c /fluentd/etc/config. The pod crashes.

Well, the pod is not even alive long enough to finish this command.

Let's disable the liveness and readiness probes and run the config check again. Waiting. Waiting. Waiting. Finally, everything is ok.
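
The quickest way I know to turn the probes off is to null them out in values.yaml, since helm drops a key when you set it to null (this assumes the chart exposes them as livenessProbe/readinessProbe blocks):

    livenessProbe: null
    readinessProbe: null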

Let's check the logs for the pod now that it has been up for a few minutes. Yep, there is the output of it starting.

The liveness and readiness probes were just too short. The initial delay was the default of 10s. After a bit of trial and error, incrementing the initial delay by one minute each time, I finally settled on three minutes.
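
In the values file that boils down to roughly this; helm merges it over the chart's default probe definitions, so only the delay changes:

    livenessProbe:
      initialDelaySeconds: 180
    readinessProbe:
      initialDelaySeconds: 180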

The interesting part of this is that the default settings didn't work out of the box. I added no plugins; all I did was remove the parts for ingesting kubernetes logs. Something to keep in mind: with more plugins the startup time can get even longer, which means increasing the initial delay even more.

Three minutes is quite a long time for the initial probe delay. Another thing we could change is how many failures are allowed before the pod is restarted, but we want to keep that number low for when the app is actually running and hits a real issue. It is best to just delay the initial probe instead.
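
To make the trade-off concrete, the slack a probe gives you is roughly initialDelaySeconds + failureThreshold * periodSeconds. Sketching it with hypothetical numbers:

    # what I ended up with: absorb the slow startup in the delay
    livenessProbe:
      initialDelaySeconds: 180
      periodSeconds: 10
      failureThreshold: 3    # ~30s to restart a genuinely stuck pod later on

    # covering the same three minutes with the threshold instead would need
    # failureThreshold: 18 with periodSeconds: 10, so a real failure at runtime
    # would also take ~3 minutes to act on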