Alert on Failed or Long Running Cronjobs with Kubernetes and Prometheus

I think cronjobs are one of the more poorly documented and under-discussed resources in Kubernetes, and the buck doesn’t stop at configuring your cronjob manifest properly. I have also seen a lot of wrong, outdated, or overly complex Prometheus queries for monitoring cronjobs. At Lumo, we run a lot of cronjobs, so it’s important to be alerted accurately on failures or longer-than-average runtimes.

The bottom line of our cronjob alerting philosophy: since all cronjobs have different schedules and different runtimes, it’s impossible to create one or two alerts that work for all of them – so create alerts individually. Who cares if you have large rules files? In terms of alerting, we find the two most important things to watch for are:

  1. Failed cronjobs (cronjobs that complete in an Error state)
  2. Cronjobs that have been running too long – i.e. your job usually finishes in x minutes but for some reason it’s taking x+n to complete.

Important note about case #1: there’s a bug (which has since been patched) where, when the .spec.template.spec.restartPolicy field is set to OnFailure, the back-off limit may be ineffective. If you are running a version that has not been patched, set the restart policy of the embedded template to Never.

Additionally, make sure you have your concurrencyPolicy and restartPolicy set correctly. It’s also important that whatever you run in the cronjob exits with the correct status, whether that’s an error or a success. We usually use these policy settings, but please read the documentation as your case may vary:

.spec.concurrencyPolicy: Forbid
.spec.jobTemplate.spec.template.spec.restartPolicy: Never

Setting the following is useful for debugging or just seeing job history in kubectl (again, read the docs about these settings):

.spec.failedJobsHistoryLimit: 2
.spec.successfulJobsHistoryLimit: 1
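
Putting these settings together, a minimal CronJob manifest might look roughly like this – the name, schedule, and image are placeholders for illustration, not a prescription:

apiVersion: batch/v1beta1        # batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/10 * * * *"         # runs every 10 minutes
  concurrencyPolicy: Forbid        # don't start a new run while one is still active
  failedJobsHistoryLimit: 2        # keep the last 2 failed jobs around for debugging
  successfulJobsHistoryLimit: 1    # keep the last successful job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never     # avoids the OnFailure back-off issue noted above
          containers:
            - name: my-cronjob
              image: example.com/my-cronjob:latest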

Alerting on failed or long-running cronjobs

So now to the Prometheus alert rules for failed or long-running cronjobs. This is what we’ve come up with and found to be useful:

Alert when a cronjob has been running longer than it should

- alert: MyCronjobRunningTooLong
  expr: max(abs(kube_job_status_start_time{job=~"my-cronjob.*"} - kube_job_status_completion_time{job=~"my-cronjob.*"})) > 90
  for: 1m
  labels:
    app: my-cronjob
    severity: warning
  annotations:
    description: 'my-cronjob has been running for {{ $value }} seconds (it averages 70s)'
    summary: 'my-cronjob has been running for {{ $value }} seconds (it averages 70s)'

If you are unsure how long your jobs usually run, you can easily graph the above expression (remove the > 90 condition) in Prometheus itself or Grafana. This alert is also more sensitive because it uses the max() function – if you want it to be less sensitive to spikes in runtime, use avg() instead.
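
For example, graphing something like the first expression below (the my-cronjob.* matcher is specific to this setup) gives you a feel for typical per-run durations, and the avg() variant is what you would drop into the alert if max() proves too noisy:

# graph this to see per-run durations (in seconds)
abs(kube_job_status_start_time{job=~"my-cronjob.*"} - kube_job_status_completion_time{job=~"my-cronjob.*"})

# less spike-sensitive variant of the alert expression
avg(abs(kube_job_status_start_time{job=~"my-cronjob.*"} - kube_job_status_completion_time{job=~"my-cronjob.*"})) > 90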

Alert when a cronjob terminates with a Failure

- alert: MyCronjobFailed
  expr: sum(rate(kube_job_status_failed{job=~"my-cronjob.*"}[10m])) > 0
  for: 1m
  labels:
    app: my-cronjob
    severity: warning
  annotations:
    description: 'my-cronjob is failing!'
    summary: 'my-cronjob is failing!'

The rate window on the failure alert should match the interval at which your cronjob runs. In this case, the cronjob runs every 10 minutes, so I’ve scoped the range accordingly and the rule only looks for failures in the last 10 minutes. If your use case is different, this is something you’ll want to modify – see the sketch below. You can additionally tweak the `for` field if you expect occasional failures but not persistent ones – sometimes our jobs fail due to flaky third-party services but succeed on the next run.
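
As a sketch, the same rule adapted for a hypothetical cronjob that runs hourly might look like this – the name, rate window, and `for` duration are illustrative, not prescriptive:

- alert: MyHourlyCronjobFailed
  expr: sum(rate(kube_job_status_failed{job=~"my-hourly-cronjob.*"}[1h])) > 0
  for: 5m
  labels:
    app: my-hourly-cronjob
    severity: warning
  annotations:
    description: 'my-hourly-cronjob is failing!'
    summary: 'my-hourly-cronjob is failing!'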

Also, note the .* in the job matcher – you need it when using a kube_job metric because a timestamp is appended to the name of each Job the cronjob creates. If you are using a kube_cronjob metric in Prometheus, the cronjob should appear as cronjob="my-cronjob" without a timestamp.
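
As a rough illustration (label names here follow the queries above; your kube-state-metrics version may expose different ones), the two styles of matcher look like this:

# kube_job_* metrics: one series per Job the cronjob created, so match the prefix
kube_job_status_failed{job=~"my-cronjob.*"}

# kube_cronjob_* metrics: one series per CronJob, exact name match
kube_cronjob_status_active{cronjob="my-cronjob"}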

Conclusion

The next step from here is pushing custom metrics via the Pushgateway to detect more than just failures or long-running jobs. One thing I didn’t mention is .spec.activeDeadlineSeconds – this will effectively kill your job once it has been running too long. However, we have never used it, given the nature of the parameter: in most cases we don’t want our cronjobs killed, because sometimes they just run long.
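
If you do want that behavior, the field sits on the Job spec inside the cronjob’s jobTemplate – the 300-second value below is just an example:

spec:
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300   # terminate the job's pods after 5 minutes of runtime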

You can also get fancy with labels in your descriptions/summaries, or record the average runtime to make the alert description for long-running jobs more dynamic, but I don’t find that necessary for the relatively straightforward case here. Like I mentioned above, most of the battle is knowing your cronjob – how long it usually takes to run, whether it exits correctly on failures, etc.
