If you have node_exporter installed and scraped by Prometheus, you have access to an interesting metric: node_boot_time_seconds, which represents the timestamp, in seconds, at which the machine booted.
To detect a reboot, we should first check that the current uptime is low, and also that the uptime was higher a while earlier. If you only check the first condition, you will also get an alert every time a new node is created.
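The two conditions can be sketched outside Prometheus as a small Python function. This is only an illustration of the logic, not something the alert itself runs: the function name has_rebooted, the window parameter, and the sample timestamps are all hypothetical. A missing earlier sample (None) stands for a node that did not exist yet.

```python
from typing import Optional


def has_rebooted(now: float, boot_time: float,
                 boot_time_before: Optional[float],
                 window: float = 600) -> bool:
    """Return True when the node booted less than `window` seconds ago
    and had already been up for more than `window` seconds one window earlier."""
    if boot_time_before is None:
        # No sample one window ago: the node is new, not rebooted.
        return False
    uptime_now = now - boot_time
    uptime_before = (now - window) - boot_time_before
    return uptime_now < window and uptime_before > window


now = 100_000.0  # hypothetical "current" Unix timestamp

# Rebooted 2 minutes ago, previous boot was long before that.
print(has_rebooted(now, now - 120, 10_000.0))  # True
# Freshly created node: low uptime, but no earlier sample exists.
print(has_rebooted(now, now - 120, None))      # False
# Long-running node: uptime is high, so the first condition fails.
print(has_rebooted(now, 10_000.0, 10_000.0))   # False
```

Only the first case satisfies both conditions, which is why the second check is needed to keep new nodes from triggering the alert.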
We will then create two alerts: one that checks whether the node has rebooted in the last 10 minutes, and a second that checks over the last hour. The offset modifier is what lets us look at the metric's earlier state here.
groups:
- name: node-exporter.rules
  rules:
  - alert: NodeHasRebooted
    annotations:
      description: Node has rebooted
      summary: Node {{ (or $labels.node $labels.instance) }} has rebooted {{ $value }} seconds ago.
    expr: |
      (time() - node_boot_time_seconds < 600) and (time() - 600 - (node_boot_time_seconds offset 10m) > 600)
    labels:
      severity: critical
  - alert: NodeHasRebooted
    annotations:
      description: Node has rebooted
      summary: Node {{ (or $labels.node $labels.instance) }} has rebooted {{ $value }} seconds ago.
    expr: |
      (time() - node_boot_time_seconds < 3600) and (time() - 3600 - (node_boot_time_seconds offset 60m) > 3600)
    labels:
      severity: warning
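To get a feel for how long the 10-minute rule stays active after a reboot, here is a rough Python simulation of its expression. It is a sketch under simplifying assumptions, not a reproduction of Prometheus evaluation: the reboot time, the 60-second evaluation step, and the helper names boot_time_at and rule_fires are hypothetical, and offset is modeled as a plain lookup of the value the metric had 600 seconds earlier.

```python
WINDOW = 600       # seconds, as in the first rule
REBOOT_AT = 3000   # hypothetical moment of the reboot (seconds)
OLD_BOOT = 0       # boot timestamp reported before the reboot
NEW_BOOT = REBOOT_AT  # boot timestamp reported after the reboot


def boot_time_at(t: int) -> int:
    """Boot timestamp the node would report at time t."""
    return NEW_BOOT if t >= REBOOT_AT else OLD_BOOT


def rule_fires(t: int) -> bool:
    """Mirror of the rule: low uptime now, high uptime one window ago."""
    uptime_now = t - boot_time_at(t)
    uptime_before = (t - WINDOW) - boot_time_at(t - WINDOW)
    return uptime_now < WINDOW and uptime_before > WINDOW


# Evaluate the rule every 60 seconds over a 100-minute timeline.
firing = [t for t in range(0, 6000, 60) if rule_fires(t)]
print(firing[0], firing[-1])  # 3000 3540
```

In this simplified model the expression is true from the reboot until just under 10 minutes later, then stops matching on its own, so the alert clears without manual action.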
Enjoy!