Measuring and Monitoring With Prometheus and Alertmanager

You have very likely heard of Prometheus, one of the most successful projects of the Cloud Native Computing Foundation (CNCF). Initially built at SoundCloud in 2012 to fulfil their monitoring needs, Prometheus is now one of the most popular solutions for time-series based monitoring.

At Leaseweb, we use Prometheus for a variety of purposes – from basic system monitoring of our internal systems, to blackbox monitoring from several of our network locations, to cloud data usage and capacity monitoring.

Whether you have one or several servers, it is always good to have insight into what your systems are doing and how they are performing. In this article, we will show you how to set up a basic Prometheus server and expose system metrics using node_exporter.

In later posts in this series, we will add Alertmanager to our Prometheus server and use Grafana to graph our recorded metrics.

This is an overview of the components involved and their role:

  • Prometheus: Scrapes metrics from external data sources (or ‘exporters’), stores them in a time-series database, and exposes them through an HTTP API.
  • node_exporter: Exposes several system metrics, such as CPU and disk usage.
  • Alertmanager: Handles alerts generated by the Prometheus server. Takes care of deduplicating, grouping, and routing alerts to the correct alert channel such as email, Telegram, PagerDuty, Slack, etc.
  • Grafana: Uses Prometheus as a datasource to graph the recorded metrics.

For this tutorial, we are going to use three servers running Ubuntu 18.04 LTS. However, the instructions can easily be adapted for any other recent Linux distribution. The servers can be either bare metal servers or cloud instances. When your Prometheus setup grows and you start to scrape more and more metrics, it is advisable to use SSD-based storage in your Prometheus server.

If you want to start out small or experiment, you can also combine several components on one system.

A Note on Security

Since Prometheus was designed to be run in a private network/cloud setting, it does not offer any authentication or access control out of the box. Because of this, be careful not to expose any of the services to the outside world. There are several ways to achieve this (implementing them is outside the scope of this tutorial).

For example, you could use the Leaseweb private networking feature and bind the Prometheus-related services to your private network interface. Other options are to put a reverse proxy that implements basic authentication in front of the services, or to use firewall rules that only allow certain IP addresses to connect to your Prometheus-related services.

Installing Prometheus

To start off, we will install the Prometheus server. The prometheus package is part of the standard Ubuntu distribution repositories, but unfortunately the version (2.1.0) is quite old. At the time of writing this blog post, the latest version is 2.16.0, which is what we will be using.

On the system that will be our Prometheus server, we start off by creating a user and group called prometheus:

useradd -M -r -s /bin/false prometheus

Next, we create the directories that will contain the configuration and the data of Prometheus:

mkdir /etc/prometheus /var/lib/prometheus

Download the Prometheus server and verify its integrity:

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.16.0/prometheus-2.16.0.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/prometheus/releases/download/v2.16.0/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in prometheus-2.16.0.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted. Next, we unpack the file and move the various components into place:

tar xzf prometheus-2.16.0.linux-amd64.tar.gz
cp prometheus-2.16.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
cp -r prometheus-2.16.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/
cp prometheus-2.16.0.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

chown -R prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

Finally, we clean up our downloaded files in /tmp:

rm -f /tmp/prometheus-2.16.0.linux-amd64.tar.gz
rm -rf /tmp/prometheus-2.16.0.linux-amd64

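The default prometheus.yml that we copied into place already lists the Prometheus server itself as a scrape target, so no configuration changes are needed at this point.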

To be able to start and stop our Prometheus server, we will create a systemd unit file. Use your favorite editor to create the file /etc/systemd/system/prometheus.service and add the following to it:

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

The command systemctl status prometheus should now indicate that our service is up and running.

You should now be able to access the web interface of the Prometheus server at http://<server IP>:9090.

If we go to Status > Targets we can see that the Prometheus server itself has already been added as a scraping target for metrics. This default target collects metrics about the performance of the Prometheus server. You can view the metrics that are being recorded under http://<server IP>:9090/metrics.

Prometheus provides two convenient endpoints for monitoring its health and status. You can add these to any other monitoring system you might have.

root@HRA-blogtest:~# curl localhost:9090/-/healthy
Prometheus is Healthy.
root@HRA-blogtest:~# curl localhost:9090/-/ready
Prometheus is Ready.
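
To plug these endpoints into another monitoring system, a small wrapper that turns them into an exit code is often enough. The following is a minimal sketch rather than part of the setup above: the function name prometheus_is_up and the two-second timeout are assumptions chosen for illustration. It relies on curl -f returning a non-zero status when the server answers with an HTTP error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if both the health and the readiness
# endpoint answer with an HTTP success status.
prometheus_is_up() {
    local base_url="${1:-http://localhost:9090}"
    local endpoint
    for endpoint in healthy ready; do
        # -f: treat HTTP errors as failures, -sS: quiet but still report errors,
        # --max-time 2: assumed timeout, adjust to taste
        curl -fsS --max-time 2 -o /dev/null "${base_url}/-/${endpoint}" || return 1
    done
    return 0
}

# Example use (commented out so that sourcing this file has no side effects):
# prometheus_is_up http://localhost:9090 && echo "OK" || echo "CRITICAL"
```

The exit code can then be consumed by whatever check mechanism your other monitoring system offers.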

Monitor System Metrics with the Node Exporter

To make things a little more interesting, we are going to add a target to obtain system metrics of the Prometheus server. For this, we need to install the node exporter first.

Installing the node exporter

Download the Prometheus node exporter and verify its integrity:

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/node_exporter/releases/download/v0.18.1/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in node_exporter-0.18.1.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted.
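
If you are curious how this verification pipeline behaves, you can try it on throwaway files. The sketch below is only a demonstration: the file names are made up, and it uses sha256sum from coreutils instead of shasum. It shows why the grep step is there (the checksum list covers archives we did not download) and what happens when a file is corrupted.

```shell
#!/usr/bin/env bash
# Demonstration on throwaway files; all file names below are made up.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Pretend a release ships two archives plus a checksum list covering both.
printf 'linux build\n'  > demo-1.0.linux-amd64.tar.gz
printf 'darwin build\n' > demo-1.0.darwin-amd64.tar.gz
sha256sum demo-1.0.*.tar.gz > sha256sums.txt

# We only "downloaded" the Linux archive.
rm demo-1.0.darwin-amd64.tar.gz

# Keep only the checksum line for our file and verify it.
intact=$(grep linux-amd64 sha256sums.txt | sha256sum -c -)
echo "$intact"

# Simulate a corrupted download: the same check now fails.
printf 'garbage\n' >> demo-1.0.linux-amd64.tar.gz
if grep linux-amd64 sha256sums.txt | sha256sum -c - >/dev/null 2>&1; then
    corrupted="OK"
else
    corrupted="FAILED"
fi
echo "after corruption: $corrupted"
```

Without the grep filter, the check would also complain about the archive that was never downloaded.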

Next we unpack the file and move the node exporter into place:

tar xzf node_exporter-0.18.1.linux-amd64.tar.gz
cp node_exporter-0.18.1.linux-amd64/node_exporter /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/node_exporter

Finally, we clean up our downloaded files in /tmp:

rm -f /tmp/node_exporter-0.18.1.linux-amd64.tar.gz
rm -rf /tmp/node_exporter-0.18.1.linux-amd64

Using your favorite editor, create a unit file /etc/systemd/system/node_exporter.service for the node exporter with the following contents:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Reload the systemd configuration to activate our unit file, start the service, and enable the service to start at boot time:

systemctl daemon-reload
systemctl start node_exporter.service
systemctl enable node_exporter.service

The node exporter should now be running. You can verify this with systemctl status node_exporter.

The node exporter listens on TCP port 9100. You should be able to see the node exporter metrics now at http://<server IP>:9100/metrics.
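
The /metrics page is plain text, so it can also be inspected from the command line, for example with curl -s http://localhost:9100/metrics | grep '^node_memory_'. The sketch below applies a similar filter to a small hand-written sample so that it runs without an exporter. The sample values are invented for illustration and the function name metric_value is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one label-free metric from
# metrics text on stdin. The "# HELP" and "# TYPE" comment lines are
# skipped because their first field is "#", not the metric name.
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Invented sample in the same text format that /metrics returns.
metrics_sample='# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.0e+09'

printf '%s\n' "$metrics_sample" | metric_value node_memory_MemAvailable_bytes
```

Against a live exporter, you would replace the printf with the curl command mentioned above.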

Adding the node exporter target to Prometheus

Now that the node exporter is running, we need to adapt the configuration of the Prometheus server so it can start scraping our node exporter metrics.

Open /etc/prometheus/prometheus.yml in your editor and adapt the scrape config section to look like the following:

# Scrape configurations for two endpoints:
# Prometheus itself and the node exporter.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
    - targets: ['localhost:9100']
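
A syntax error in prometheus.yml prevents the server from starting, so it can be worth validating a configuration before restarting. The promtool binary that we copied to /usr/local/bin earlier provides a check config subcommand for this. The sketch below writes a small example configuration to a temporary file and checks it if promtool is available. The second target address, 10.0.0.12, is a made-up example of a further server running the node exporter.

```shell
#!/usr/bin/env bash
# Write an example configuration to a temporary file (not the live config).
candidate=$(mktemp)
cat > "$candidate" <<'EOF'
scrape_configs:
  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
    # 10.0.0.12 is a hypothetical second server running node_exporter.
    - targets: ['localhost:9100', '10.0.0.12:9100']
EOF

# Validate it only when promtool is installed on this machine.
if command -v promtool >/dev/null 2>&1; then
    promtool check config "$candidate"
else
    echo "promtool not found, skipping validation of $candidate"
fi
```

The same subcommand can be pointed at /etc/prometheus/prometheus.yml directly.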

Save the changes and restart the Prometheus server with systemctl restart prometheus so that it loads the new configuration.

The Prometheus server web interface should now show a new target under Status > Targets.

Querying and Graphing the Recorded Metrics

Now that everything is set up, it is time to start looking into some of the things we are now measuring! Switch to the Graph tab in the Prometheus server web interface.

Enter node_memory_MemAvailable_bytes and click Execute. The Console tab will show you the current amount of available memory in bytes.

Switch to the Graph tab and you will see a graph of the available memory in bytes over the course of the last hour. You can increase and decrease the time range with the plus and minus buttons at the top left of the graph.

There is another metric that records the total amount of memory in the system. It is called node_memory_MemTotal_bytes. We can use this to calculate the percentage of memory available in the system. Enter the following in the query area and click Execute:

(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

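The graph will now show the percentage of available memory over time.

MemAvailable is the kernel's own estimate of how much memory is available for starting new applications without swapping. On older kernels that do not report it, we can approximate the same figure by adding buffered and cached memory to the free memory: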

((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100

Or turn it around and show the percentage of used memory instead:

(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100
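
On Linux, the node exporter reads these values from /proc/meminfo, so the arithmetic behind the last query can be reproduced on the command line. The following sketch computes the used-memory percentage from a hand-written sample in the /proc/meminfo format. The numbers are invented and the function name mem_used_pct is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of used memory, calculated the same way
# as the query above. Reads a file in /proc/meminfo format (values in kB).
mem_used_pct() {
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END { printf "%.2f\n", (total - free - buffers - cached) / total * 100 }
    ' "${1:-/proc/meminfo}"
}

# Invented sample values, chosen so the result is easy to check by hand.
meminfo_sample=$(mktemp)
cat > "$meminfo_sample" <<'EOF'
MemTotal:        8000000 kB
MemFree:         2000000 kB
Buffers:          500000 kB
Cached:          1500000 kB
EOF

# (8000000 - 2000000 - 500000 - 1500000) / 8000000 * 100
mem_used_pct "$meminfo_sample"
```

Called without an argument, the function reads the live /proc/meminfo of the machine it runs on.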

The CPU usage is recorded in the metrics under node_cpu_seconds_total. This metric records the time the CPU has spent in each of several modes:

  • user: Time spent in userland
  • system: Time spent in the kernel
  • iowait: Time spent waiting for I/O
  • idle: Time the CPU had nothing to do
  • irq and softirq: Time servicing interrupts
  • guest: If you are running VMs, the CPU they use
  • steal: If you are a VM, time other VMs “stole” from your CPUs

These metrics are recorded as counters, so to get per-second values we will use the irate function, which calculates the rate from the last two samples in the given range:

irate(node_cpu_seconds_total{job="node"}[5m])

As you can see, when you have multiple CPUs in your server, it will return metrics for each CPU individually. To get the overall value across all CPUs, we can use PromQL’s aggregation features with sum by:

sum by (mode, instance) (irate(node_cpu_seconds_total{job="node"}[5m]))

We can also calculate the percentage of CPU used by taking the per second idle rate and multiplying it by 100 (to get the percent CPU idle), and then subtracting it from 100%:

100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)
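
To see what happens to a counter here, consider two consecutive samples of the idle counter for a single CPU. The per-second rate is the difference between the two samples divided by the time between them, and the usage percentage follows as in the query above. The sample values and the function name cpu_used_pct in this sketch are invented. The 5 second spacing matches the scrape_interval we configured for the node job.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the arithmetic for one CPU.
# $1: earlier idle counter value (seconds)
# $2: later idle counter value (seconds)
# $3: seconds between the two samples
cpu_used_pct() {
    awk -v prev="$1" -v curr="$2" -v dt="$3" \
        'BEGIN { idle_rate = (curr - prev) / dt; printf "%.2f\n", 100 - idle_rate * 100 }'
}

# The CPU was idle for 4.5 of the 5 seconds between the samples.
cpu_used_pct 1000.0 1004.5 5
```

The real query additionally averages this idle rate over all CPUs of an instance before subtracting it from 100.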

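And finally, to get the amount of data sent or received by our server, we can use irate(node_network_transmit_bytes_total{device!="lo"}[1m]) and irate(node_network_receive_bytes_total{device!="lo"}[1m]). This will give us a bytes-per-second graph, because irate always returns a per-second rate and the [1m] only sets how far back it looks for samples. The device!="lo" makes sure we exclude the local loopback interface.

To turn this into megabits per second, we will have to do some math (dividing by 1024 twice strictly yields mebibits, use 1000 instead for decimal megabits):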

(sum(irate(node_network_receive_bytes_total{device!="lo"}[1m])) by (instance, device) * 8 / 1024 / 1024)
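
As a quick check of the conversion used in this query, the same arithmetic can be run on a single number: multiply the bytes per second by 8 to get bits, then divide by 1024 twice. The function name bytes_to_mbit and the input value below are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the "* 8 / 1024 / 1024" part of the query.
bytes_to_mbit() {
    awk -v bps="$1" 'BEGIN { printf "%.2f\n", bps * 8 / 1024 / 1024 }'
}

# 1310720 bytes/s * 8 = 10485760 bits/s, which is 10 * 1024 * 1024
bytes_to_mbit 1310720
```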

To get a full idea of the possibilities of the PromQL query language, see the documentation. By investigating the metrics available in the node exporter, you can create many more graphs like these, for example for the amount of available disk space or the number of file descriptors used.
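
The expressions in this section can also be evaluated outside the web interface. Prometheus exposes an HTTP API, and an instant query is a GET request to /api/v1/query with the expression passed in the query parameter. The following sketch wraps that in a small helper. The function name prom_query and the default address are assumptions, and the response is a JSON document that you may want to pipe through a JSON tool of your choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an instant PromQL query against the HTTP API.
# $1: PromQL expression
# $2: base URL of the Prometheus server (assumed default: localhost:9090)
prom_query() {
    local expr="$1"
    local base_url="${2:-http://localhost:9090}"
    # -G sends the URL-encoded data as query string parameters of a GET request.
    curl -fsS -G "${base_url}/api/v1/query" --data-urlencode "query=${expr}"
}

# Example use (commented out so that sourcing this file has no side effects):
# prom_query '(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
```

Letting curl do the URL encoding avoids having to escape braces, quotes, and spaces in the expression by hand.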

In the next part of this blog, we will go deeper into visualizing the metrics using Grafana, and will also define alerting rules to receive alerts through Alertmanager.
