Measuring and Monitoring With Prometheus and Alertmanager Part 2

This is part two in the series about Prometheus and Alertmanager.

In the first part we installed the Prometheus server and the node exporter, in addition to discovering some of the measuring and graphing capabilities using the Prometheus server web interface.

In this part, we will be looking at Grafana to expand the possibilities of graphing our metrics, and we will use Alertmanager to alert us of any metrics that are outside the boundaries we define for them. Finally, we will install a dashboard application for a nice tactical overview of our Prometheus monitoring platform.

Installing Grafana

The installation of Grafana is fairly straightforward, and all of the steps involved are described in full detail in the official documentation.

For Ubuntu, which we’re using in this series, the steps involved are:

sudo apt-get install -y apt-transport-https software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana

At this point Grafana will be installed, but the service has not been started yet.

To start the service and verify that the service has started:

sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server

The status output should report the service as active (running).

If you want Grafana to start at boot time, run the following:

sudo systemctl enable grafana-server.service

Grafana listens on port 3000 by default, so at this point you should be able to access your Grafana installation at http://<IP of your Grafana server>:3000
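
Before opening the browser, a quick check from the command line can confirm that Grafana is answering on that port. The following is a minimal sketch, assuming Grafana's /api/health endpoint and a server reachable on localhost:3000; the helper name grafana_db_ok is made up for this example.

```shell
# grafana_db_ok: reads a Grafana /api/health JSON response on stdin and
# succeeds only if the "database" field is reported as "ok".
# (Hypothetical helper name, not part of Grafana itself.)
grafana_db_ok() {
    grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage against a live server (assumes Grafana on localhost:3000):
#   curl -fsS http://localhost:3000/api/health | grafana_db_ok && echo "Grafana is healthy"
```

The function only inspects the response it is given, so it can be reused in a provisioning script or a simple smoke test.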

You will be welcomed by the login screen. The default login after installation is admin with password admin.

After successfully logging in, you will be asked to change the password for the admin user. Do this immediately!

Creating the Prometheus Data Source

Our next step is to create a new data source in Grafana that connects to our Prometheus installation. To do this, go to Configuration > Data Sources and click the blue Add data source button.

Grafana supports various time series data sources, but we will pick the top one, which is Prometheus.

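Enter the URL of your Prometheus server (for example http://<IP of your Prometheus server>:9090, since Prometheus listens on port 9090 by default), and that’s it! Leave all the other fields untouched; they are not needed at this point.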

You should now have a Prometheus data source in Grafana, and we can start creating some dashboards!
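
If you prefer to keep this step in version control rather than clicking through the UI, Grafana can also read data sources from provisioning files at startup. The sketch below writes such a file; it assumes the default provisioning directory of a package install and a Prometheus server on localhost:9090. The function name and the two variables are made up for this example.

```shell
# Directory Grafana reads data source provisioning files from on a
# package install. Override PROVISIONING_DIR to try this out elsewhere.
PROVISIONING_DIR="${PROVISIONING_DIR:-/etc/grafana/provisioning/datasources}"
# Assumed Prometheus address; adapt to your own server.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# write_prometheus_datasource: (hypothetical helper) writes a provisioning
# file that declares a single Prometheus data source.
write_prometheus_datasource() {
    mkdir -p "$PROVISIONING_DIR" &&
    cat > "$PROVISIONING_DIR/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $PROMETHEUS_URL
    isDefault: true
EOF
}
```

Run write_prometheus_datasource as root and restart grafana-server afterwards so the file is picked up.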

Creating Our First Grafana Dashboard

A lot of community-created dashboards can be found at https://grafana.com/grafana/dashboards. We’re going to use one of them that will give us a very nice overview of the metrics scraped from the node exporter.

To import a dashboard click the + icon in the side menu, and then click Import.

Enter the dashboard ID 1860 in the ‘Import via grafana.com’ field and click ‘Load’.

The dashboard should be loaded, and the only thing we still need to do is select the Prometheus data source we just created in the dropdown at the bottom of the page and click ‘Import’.

You should now have your first pretty Grafana dashboard, which shows all of the important metrics offered by the node exporter.

Adding Alertmanager to the Mix

Now that we have all these metrics of our nodes flowing into Prometheus, and we have a nice way of visualising this data, it would be nice if we could also raise alerts when things don’t go as planned. Grafana offers some basic alerting functionality for Prometheus data sources, but if you want more advanced features, Alertmanager is the way to go.

Alerting rules are set up in the Prometheus server. These rules allow you to define alert conditions based on PromQL expressions. Whenever an alert expression returns one or more results, the alert is considered active for each of them.

To turn this active alert condition into an action, Alertmanager comes into play. It can send notifications through a wide variety of channels, such as email, communication platforms such as Slack or Mattermost, and incident/on-call management tools such as PagerDuty and Opsgenie. Alertmanager also handles summarization, aggregation, rate limiting and silencing of alerts.

Let’s go ahead and install Alertmanager on the Prometheus server instance we installed in part one of this blog.

Installing Alertmanager

Start off by creating a separate user for Alertmanager:

useradd -M -r -s /bin/false alertmanager

Next, we need a directory for the configuration:

mkdir /etc/alertmanager
chown alertmanager:alertmanager /etc/alertmanager

Then download Alertmanager and verify its integrity:

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/alertmanager/releases/download/v0.21.0/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in alertmanager-0.21.0.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted, and you should try again.
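
To see what the checksum verification does without touching the real download, we can reproduce it on a throwaway file. This is only an illustration: it uses sha256sum from GNU coreutils instead of shasum, and the file names and the function name demo_checksum are made up.

```shell
# demo_checksum: creates a dummy "tarball" in a scratch directory, records
# its SHA-256 sum, verifies it, then corrupts the file and verifies again.
demo_checksum() (
    workdir="$(mktemp -d)" || exit 1
    cd "$workdir" || exit 1

    printf 'pretend this is a release tarball\n' > demo.tar.gz
    sha256sum demo.tar.gz > sha256sums.txt

    # Intact file: the check succeeds and reports "demo.tar.gz: OK".
    sha256sum -c sha256sums.txt

    # Append a byte to simulate a corrupted download: the check now fails.
    printf 'x' >> demo.tar.gz
    sha256sum -c sha256sums.txt 2>/dev/null || echo "checksum mismatch detected"

    cd / && rm -rf "$workdir"
)

demo_checksum
```

The second check exits with a non-zero status, which is what lets a script abort before unpacking a damaged archive.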

Next we unpack the file and move the various components into place:

tar xzf alertmanager-0.21.0.linux-amd64.tar.gz
cp alertmanager-0.21.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

And clean up our downloaded files in /tmp:

rm -f /tmp/alertmanager-0.21.0.linux-amd64.tar.gz
rm -rf /tmp/alertmanager-0.21.0.linux-amd64

We need to supply Alertmanager with an initial configuration. For our first test, we will configure alerting by email (Be sure to adapt this configuration for your email setup!):

global:
  smtp_from: 'AlertManager <alertmanager@example.com>'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_hello: 'alertmanager'
  smtp_auth_username: 'username'
  smtp_auth_password: 'password'
  smtp_require_tls: true

route:
  group_by: ['instance', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: myteam

receivers:
  - name: 'myteam'
    email_configs:
      - to: 'user@example.com'

Save this in a file called /etc/alertmanager/alertmanager.yml and set permissions:

chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml

To be able to start and stop our Alertmanager instance, we will create a systemd unit file. Use your favorite editor to create the file /etc/systemd/system/alertmanager.service and add the following to it (replacing <server IP> with the IP or resolvable FQDN of your server):

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/etc/alertmanager/
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --web.external-url http://<server IP>:9093

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager

The command systemctl status alertmanager should now indicate that our service is up and running.

Now we need to alter the configuration of our Prometheus server to inform it about our Alertmanager instance. Edit the file /etc/prometheus/prometheus.yml. There should already be an alerting section. All we need to do is change the section so it looks like this:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

We also need to tell Prometheus where our alerting rules live. Change the rule_files section to look like this:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*.yml"

Save the changes, and create the directory for the alert rules:

mkdir /etc/prometheus/rules
chown prometheus:prometheus /etc/prometheus/rules

Restart the Prometheus server to apply the changes:

systemctl restart prometheus

Creating Our First Alert Rule

Alerting rules are written in PromQL, the Prometheus expression language. One of the easiest things to check is whether all Prometheus targets are up, and trigger an alert when a certain exporter target becomes unreachable. This is done with the up metric, which Prometheus sets to 1 for every target it scrapes successfully and to 0 when a scrape fails.

Let’s create our first alert by creating the file /etc/prometheus/rules/alert-rules.yml with the following content:

groups:
- name: alert-rules
  rules:
  - alert: ExporterDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down for more than 5 minutes.'
      summary: 'Exporter down (instance {{ $labels.instance }})'

This alert will fire once any of the exporter targets in Prometheus has been reported as down for more than 5 minutes. We apply the severity label critical to it.

Restart Prometheus with systemctl restart prometheus to load the new alert rule.

You should now be able to see the alert rule in the Prometheus web interface too, by going to the Alerts section.

Now the easiest way for us to check if this alert actually fires, and we get our email notification, is to stop the node exporter service:

systemctl stop node_exporter

Shortly after we do this, we can see that the alert status has changed in the Prometheus server dashboard. It is now marked as pending, but is not yet firing, because the condition needs to persist for a minimum of 5 minutes, as specified in our alert rule.

When the 5-minute mark is reached, the alert fires, and we should receive an email from Alertmanager notifying us about the situation.

We should also be able to manage the alert now in the Alertmanager web interface. Open http://<server IP>:9093 in your browser and the alert that we just triggered should be listed. We can choose to silence the alert, to prevent any more alerts from being sent out.

Click silence, and you will be able to configure the duration of the silence period, add a creator and a description for some more metadata, and expand or limit the group of alerts this particular silence applies to. If, for example, I wanted to silence all ExporterDown alerts for the next 2 hours, I could remove the instance matcher.
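
The same silence can be created from the command line with amtool, which we copied to /usr/local/bin together with Alertmanager. The following is a sketch only: it assumes Alertmanager is reachable on localhost:9093, and the function name, the DRY_RUN switch and the comment text are made up for this example.

```shell
# Both variables are illustrative defaults; adapt them to your setup.
AMTOOL="${AMTOOL:-amtool}"
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# silence_exporter_down: (hypothetical helper) silences every ExporterDown
# alert, regardless of instance, for the given duration (default 2h).
# With DRY_RUN=1 the amtool command is printed instead of executed.
silence_exporter_down() {
    ${DRY_RUN:+echo} "$AMTOOL" silence add alertname=ExporterDown \
        --alertmanager.url="$ALERTMANAGER_URL" \
        --duration="${1:-2h}" \
        --comment="Planned maintenance"
}

# Usage:
#   DRY_RUN=1 silence_exporter_down 2h   # show the command only
#   silence_exporter_down 2h             # create the silence
```

Because only the alertname matcher is given, the silence applies to all instances, which mirrors removing the instance matcher in the web interface.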

More Advanced Alert Examples

Since Prometheus alerts use the same powerful PromQL expressions as queries, we are able to define rules that go way beyond whether a service is up or down. For a full rundown of all the PromQL functions available, check out the Prometheus documentation or the excellent PromQL for humans.

Memory Usage

For starters, here is an example of an alert rule to check the memory usage of a node. It fires once the available memory has been below 10% of the total memory for 5 minutes:

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host out of memory (instance {{ $labels.instance }})'
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Disk Space

We can do something similar for disk space. This alert will fire once one of our targets’ filesystems has had less than 10% of its capacity available for 5 minutes:

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host out of disk space (instance {{ $labels.instance }})'
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

CPU Usage

To alert on CPU usage, we can use the metrics available under node_cpu_seconds_total. In the previous part of this blog we already went into which specific metrics we can find there.

This alert takes the per-second rate of idle CPU seconds over the last 5 minutes and multiplies it by 100 to get the percentage of time the CPU was idle. We average this by instance to include all CPUs (cores); otherwise we would end up with a separate percentage for each CPU in the system. Subtracting the result from 100 gives the percentage of time the CPUs were busy.

The alert will fire when the average CPU usage of the system exceeds 80% for 5 minutes:

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host high CPU load (instance {{ $labels.instance }})'
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
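
To make the arithmetic behind this expression concrete, here is a small worked example outside of Prometheus. It takes made-up idle rates for two cores, averages them and subtracts the result from 100, just like the expression does per instance. The function name is hypothetical.

```shell
# cpu_usage_percent: each argument is the per-second idle rate of one core
# (a value between 0 and 1, as returned by rate() on the idle counter).
# Prints 100 - avg(idle) * 100, i.e. the average CPU usage in percent.
cpu_usage_percent() {
    printf '%s\n' "$@" |
        awk '{ sum += $1; n++ } END { printf "%.1f\n", 100 - (sum / n) * 100 }'
}

# Two cores that were idle 15% and 25% of the time: average idle is 20%.
cpu_usage_percent 0.15 0.25   # prints 80.0
```

A result of exactly 80.0 would not trigger the alert yet, since the rule requires the usage to exceed 80.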

Predictive Alerting

Using the PromQL function predict_linear we can expand on the disk space alert mentioned earlier. predict_linear can predict the value of a certain time series X seconds from now. We can use this to predict when our disk is going to fill up, if the pattern follows a linear prediction model.

The following alert will trigger if the linear prediction algorithm, using disk usage patterns over the last hour, determines that the disk will fill up in the next four hours:

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Disk {{ $labels.device }} will fill up in the next 4 hours'
      description: |
        Based on the trend over the last hour, it looks like the disk {{ $labels.device }} on {{ $labels.mountpoint }}
        will fill up in the next 4 hours (predicted free space in 4 hours: {{ $value | humanize }} bytes)
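
The idea behind predict_linear can be illustrated with a few lines of awk: fit a straight line through the samples using least squares and extend it into the future. This is a simplified sketch, not the Prometheus implementation; it extrapolates from the last sample rather than from the evaluation time, and the function name and sample data are made up.

```shell
# predict_linear_demo: $1 is the horizon in seconds; stdin holds
# "unix_timestamp value" pairs, one per line (at least two samples with
# different timestamps). Prints the value predicted by a least-squares
# line, $1 seconds after the last sample.
predict_linear_demo() {
    awk -v horizon="$1" '
        NR == 1 { t0 = $1 }   # shift timestamps so the numbers stay small
        {
            x = $1 - t0; y = $2
            n++; sx += x; sy += y; sxy += x * y; sxx += x * x; last = x
        }
        END {
            slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
            intercept = (sy - slope * sx) / n
            printf "%.0f\n", intercept + slope * (last + horizon)
        }'
}

# Free space drops by 60 bytes every minute, starting at 1000 bytes.
printf '0 1000\n60 940\n120 880\n' | predict_linear_demo 600   # prints 280
```

With a horizon of 1200 seconds the same data yields -320, and a negative prediction is exactly the condition the alert rule tests for with < 0.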

Give Me More!

If you are interested in more examples of alert rules, you can find a very extensive collection at Awesome Prometheus alerts. You can find examples here for exporters we haven’t covered too, such as the Blackbox or MySQL exporter.

Syntax Checking Your Alert Rule Definitions

Prometheus comes with a tool that allows you to verify the syntax of your alert rules. This will come in handy for local development of rules or in CI/CD pipelines, to make sure that no broken syntax makes it to your production Prometheus platform.

You can invoke the tool by running promtool check rules /etc/prometheus/rules/alert-rules.yml

# promtool check rules /etc/prometheus/rules/alert-rules.yml
Checking /etc/prometheus/rules/alert-rules.yml
  SUCCESS: 5 rules found
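
In a CI/CD pipeline you will typically want to check every rule file and fail the build if any of them is broken. A minimal sketch of such a step could look like this; the function name and the PROMTOOL variable (which only exists so a different binary path can be supplied) are assumptions of this example.

```shell
# Path to the promtool binary; override if it is not on the PATH.
PROMTOOL="${PROMTOOL:-promtool}"

# check_rules_dir: (hypothetical helper) runs "promtool check rules" on
# every *.yml file in the directory given as $1. Returns non-zero if at
# least one file fails the check, but still checks the remaining files.
check_rules_dir() {
    status=0
    for rule_file in "$1"/*.yml; do
        [ -e "$rule_file" ] || continue   # directory without rule files
        "$PROMTOOL" check rules "$rule_file" || status=1
    done
    return "$status"
}

# Usage:
#   check_rules_dir /etc/prometheus/rules || echo "rule check failed"
```

promtool also accepts several files in one invocation; the loop is used here so that each file is reported separately.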

Scraping Metrics From Alertmanager

Alertmanager has a built-in metrics endpoint that exports metrics about how many alerts are firing, resolved or silenced. Now that we have all components running, we can add Alertmanager as a target to our Prometheus server to start scraping these metrics.

On your Prometheus server, open /etc/prometheus/prometheus.yml with your favorite editor and add the following new job under the scrape_configs section (replace 192.168.0.10 with the IP of your alertmanager instance):

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.0.10:9093']

Restart Prometheus, and check in the Prometheus web console if you can see the new Alertmanager section under Status > Targets. If all goes well, a query in the Prometheus web console for alertmanager_cluster_enabled should return one result with the value 1.
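
If the target does not show up as expected, it can help to look at the metrics endpoint directly. The helper below is a small sketch that picks a single label-free metric out of the text that Alertmanager serves on /metrics; the function name is made up, and the IP address is the example address used above.

```shell
# metric_value: $1 is a metric name; stdin is Prometheus text exposition
# format. Prints the value of the sample whose name matches exactly
# (comment lines and samples that carry labels are ignored).
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage against a live Alertmanager:
#   curl -fsS http://192.168.0.10:9093/metrics | metric_value alertmanager_cluster_enabled
```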

We can now continue with adding alert rules for Alertmanager itself:

  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Prometheus not connected to alertmanager (instance {{ $labels.instance }})'
      description: "Prometheus cannot connect to the Alertmanager\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Prometheus AlertManager notification failing (instance {{ $labels.instance }})'
      description: "Alertmanager is failing to send notifications\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

The first rule will fire when Prometheus has had no Alertmanager instance to send alerts to for over 5 minutes; the second rule will fire when Alertmanager fails to send out notifications. But how will we know about the alert, if notifications are failing? That’s where the next section comes in handy!

Alertmanager Dashboard Using Karma

The Alertmanager web console is useful for a basic overview of alerts and to manage silences, but it is not really suitable for use as a dashboard that gives us a tactical overview of our Prometheus monitoring platform.

For this, we will use Karma.

Karma offers a nice overview of active alerts, grouping of alerts by a certain label, silence management, alert acknowledgement and more.

We can install it on the same machine where Alertmanager is running using the following steps.

Start off by creating a separate user and configuration folder for Karma:

useradd -M -r -s /bin/false karma
mkdir /etc/karma
chown karma:karma /etc/karma

Then download the file and verify its checksum:

cd /tmp
wget https://github.com/prymitive/karma/releases/download/v0.78/karma-linux-amd64.tar.gz
wget -O - -q https://github.com/prymitive/karma/releases/download/v0.78/sha512sum.txt | grep linux-amd64 | shasum -c -

Make sure the last command returns karma-linux-amd64.tar.gz: OK again. Now unpack the file and move it into place:

tar xzf karma-linux-amd64.tar.gz
mv karma-linux-amd64 /usr/local/bin/karma
rm karma-linux-amd64.tar.gz

Create the file /etc/karma/karma.yml and add the following default configuration (replace the username and password):

alertmanager:
  interval: 1m
  servers:
    - name: alertmanager
      uri: http://localhost:9093
      timeout: 20s
authentication:
  basicAuth:
    users:
      - username: cartman
        password: secret

Set the proper permissions on the config file:

chown karma:karma /etc/karma/karma.yml
chmod 640 /etc/karma/karma.yml

Create the file /etc/systemd/system/karma.service with the following content:

[Unit]
Description=Karma Alertmanager dashboard
Wants=network-online.target
After=network-online.target
After=alertmanager.service

[Service]
User=karma
Group=karma
Type=simple
WorkingDirectory=/etc/karma/
ExecStart=/usr/local/bin/karma \
    --config.file=/etc/karma/karma.yml

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start karma
systemctl enable karma

The command systemctl status karma should now indicate that Karma is up and running.

You should now be able to visit your new Karma dashboard at http://<alertmanager server IP>:8080. If we stop the node_exporter service again and wait 5 minutes for the alert to fire, the ExporterDown alert shows up on the dashboard.

If you want to explore all the possibilities and configuration options of Karma, then please see the documentation.

Conclusion

In this series we’ve installed Prometheus, the node exporter, and Alertmanager. We’ve given a small introduction to PromQL and how to write Prometheus queries and alert rules, and used Grafana to graph metrics and Karma to offer an overview of triggered alerts.

If you want to explore further, check out the following resources: