Creating a Real-time HoneyPot Attack Map

Every device connected to the internet is open for cyber attacks. It takes less than one minute before a system is attacked once it is connected to the internet. Recently, I worked on a hackathon project to visualize honeypot attacks on a map in real-time.

A honeypot is a computer system that mimics a target for hackers. It tries to fool hackers into thinking it is a real computer system, distracting them from other targets.

Initial Setup

There are many honeypot systems around, but for this project I used T-Pot, created by Deutsche Telekom Security. It bundles many honeypot daemons and tools out of the box and is easy to set up.

To create a live map, we need to have the following:

  • Running T-Pot instance
  • Small server with a webserver (nginx) and Node.js/NPM

Follow the instructions to install T-Pot. Confirm your T-Pot instance is running and you see attacks appearing in the dashboards.

There are many instructions on how to install a server with Nginx and Node.js (one example can be found here).

Node.js Application to Receive Logs

On the webserver, we create a small Node.js application that will do two simple tasks:

  • Receive data from the T-Pot installation (logstash)
  • Run a small websocket server to broadcast the received data to connected clients

Install required packages
In our Node.js application we use two packages: `express` and `ws`. First install both packages:

npm install ws express

Now we create a small application called `server.js`:

vi server.js

Insert the following code into the file:

#!/usr/bin/env nodejs
const http = require('http');
const WebSocket = require('ws');
const express = require('express');
const app = express();

const PORT = 8080;
const WS_PORT = 8081;

app.use(express.json());

// Create a WebSocket Server so clients can connect to it
const wss = new WebSocket.Server({ port: WS_PORT })
wss.on('connection', function connection(ws) {
});

// Now we create a simple HTTP server which receives a message
// and forwards the message to the connected WebSocket clients
app.get('/', (req, res) => {
  res.send('Hello');
});

app.post('/', (req, res) => {
  wss.clients.forEach(function each(client) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(JSON.stringify(req.body));
    }
  });
  res.sendStatus(200);
});

app.listen(PORT, () => {
   console.log('The server is running at port 8080!');
});

This small Node.js application simply listens on port 8080 to receive a POST message. WebSockets can connect to port 8081. Once a message is received on port 8080, it is sent to all connected WebSocket clients.

Test Application

To test your application, make it executable:

chmod +x server.js

Now run the application:

./server.js

The output will be:

The server is running at port 8080!

You can use a process manager like PM2 to daemonize your application:

sudo npm install -g pm2
pm2 start server.js

PM2 will restart an application automatically if it crashes or is killed. To have your application start after a system (re)boot, you will need to execute another command:

pm2 startup systemd

This command output might include a command which needs to be run with superuser privileges:

sudo env PATH=$PATH:/usr/bin /usr/lib/node_modules/pm2/bin/pm2 startup systemd -u your_user --hp /home/your_user

Webpage Showing the Map

Now we will create a small webpage, and add some Javascript code. This code will open a websocket to receive updates and plot them on a map. To create a world map, I used Mapbox GL JS. You will need to create a (free) account in order to create an API key that will be used to create a map.

If you have a server running default Nginx, create a new `index.html` in the web-root folder:

cd /var/www/html
vi index.html

Insert the following HTML code into the file:

<!DOCTYPE HTML>
<html>
   <head> 
   <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-giJF6kkoqNQ00vy+HMDP7azOuL0xtbfIcaT9wjKHr8RbDVddVHyTfAAsrekwKmP1" crossorigin="anonymous">
   <script src='https://api.mapbox.com/mapbox-gl-js/v2.0.1/mapbox-gl.js'></script>
    <link href='https://api.mapbox.com/mapbox-gl-js/v2.0.1/mapbox-gl.css' rel='stylesheet' />  
    <style>
        @import url(https://fonts.googleapis.com/css?family=Inconsolata:400,500);

        body { background-color: black }
        #map { height: calc(100vh - 275px); z-index: 1; }
        .table { color: #fff; font-family: Inconsolata,sans-serif; font-size: 15px; border-color: #525252;}    
        .thead { font-weight: 700; color: #525252; }            
        
        @-webkit-keyframes flashrow {
           from { background-color: #525252; }
           to { background-color: var(--bs-table-bg); }
         }
         @-moz-keyframes flashrow {
           from { background-color: #525252; }
           to { background-color: var(--bs-table-bg); }
         }
         @-o-keyframes flashrow {
           from { background-color: #525252; }
           to { background-color: var(--bs-table-bg); }
         }
         @keyframes flashrow {
           from { background-color: #525252; }
           to { background-color: var(--bs-table-bg); }
         }
         .flashrow {
           -webkit-animation: flashrow 1.5s; /* Safari 4+ */
           -moz-animation:    flashrow 1.5s; /* Fx 5+ */
           -o-animation:      flashrow 1.5s; /* Opera 12+ */
           animation:         flashrow 1.5s; /* IE 10+ */
         }
    </style>   
   </head>   
   <body>
   <div class="container">   
    <div class="row">
      <div class="col">
         <div id="map"></div>
      </div>
   </div>
   <div class="row">
      <div class="col" id="ticker">
         <table id="tickettable" class="table table-black ticker">
            <thead class="text-uppercase thead">
               <tr>
                  <th class="col-lg-1 thead">Time</th>
                  <th class="col-lg-2 thead">Country</th>
                  <th class="col-lg-3 thead">AS Organisation</th>
                  <th class="col-lg-6 thead">TYPE</th>
               </tr>
            </thead>
            <tbody>            
            </tbody>
         </table>         
      </div>
   </div>
</div>
<script src='map.js'></script>
</body>
</html>

This page simply loads the Mapbox GL JavaScript library and some styles. It also loads another JavaScript file (at the bottom) which will open the WebSocket and update the map.
Let’s create this JavaScript file:

vi map.js

and insert the following code into the file:

// Set the IP to your webserver IP
const WEBSOCKET_SERVER = 'ws://<YOUR_WEBSERVERIP>:8081';
// Set your mapboxGL AccessToken
const MAPBOX_TOKEN = 'YOUR_ACCESS_TOKEN';

// Remove points from map after x-seconds
var displayTime = 300;

// Set some defaults for the map

var framesPerSecond = 15; 
var initialOpacity = 1;
var opacity = initialOpacity;
var initialRadius = 3;
var radius = initialRadius;
var maxRadius = 15;
let points = new Map();
var timers = [];

//Set your accessToken here
mapboxgl.accessToken = MAPBOX_TOKEN;

//Create new mapboxGl Map. Set your used style
var map = new mapboxgl.Map({
    container: 'map',
    style: 'mapbox://styles/leaseweb/ckkiepmg40ds717ry6l0htwag',
    center: [0, 0],
    zoom: 1.75
});

// Create a popup, but don't add it to the map yet.
var popup = new mapboxgl.Popup({
   closeButton: false,
   closeOnClick: false
});

// Once the map is loaded, we open the Websockets
map.on('load', function () {   
   openWebSockets(map); 
});

function openWebSockets(map) { 
   if ("WebSocket" in window) {
      // Let us open a web socket
      var ws = new WebSocket( WEBSOCKET_SERVER); 

      ws.onopen = function() {
         // Web Socket is connected, send data using send()         
         console.log("WS Open...");
      };
   
      ws.onmessage = function (event) { 
         var received_msg = JSON.parse(event.data);             
         addPoint(received_msg);            
      };

      ws.onerror = function(error) {
         console.log('Websocket error: ');
         console.log(error);
      }

      ws.onclose = function() { 
         // websocket is closed.
         console.log("Connection is closed..."); 
      };

   } else {
      // The browser doesn't support WebSocket
      alert("WebSocket NOT supported by your Browser!");
   }    
}

function animateMarker(timestamp, pointId) {
   if(!(pointId === undefined)) {
      if (points.has(pointId)) {         
        timers[pointId] = setTimeout(function() {
            requestAnimationFrame(function(timestamp) {
              animateMarker(timestamp, pointId);
            });
            

            radius = points.get(pointId)[0];
            opacity = points.get(pointId)[1];
            
            radius += (maxRadius - radius) / framesPerSecond;            
            opacity -= ( .9 / framesPerSecond );
            if (opacity < 0) {
               opacity = 0;
            }

            map.setPaintProperty('point-'+pointId, 'circle-radius', radius);
            map.setPaintProperty('point-'+pointId, 'circle-opacity', opacity);
         
            if (opacity <= 0) {
                radius = initialRadius;
                opacity = initialOpacity;
            } 
            points.set(pointId,[radius, opacity ]);        
        }, 1000 / framesPerSecond);
     } else {
      //The point is removed, we don't do anything at this moment
     }
  }
}

function addPoint(msg) {
   var geo = JSON.parse(msg.geoip);     
   var ip = geo.ip;
   //Create a geohash based on the lat/lon of the IP. We used factor 7 to prevent overlapping point animations
   var geohash = encodeGeoHash(geo.latitude, geo.longitude, 7);
   //Get the AS Organisation name (or unknown)
   var ASORG = (geo.as_org === undefined ? 'Unknown' : geo.as_org);

   //Remove the flashrow style from last added row
   var flashrows = document.getElementById("tickettable").getElementsByClassName('flashrow');
   while (flashrows[0]) {
      flashrows[0].classList.remove('flashrow');
   }
   
   //Get table to add the newly added point information
   var tbody = document.getElementById("tickettable").getElementsByTagName('tbody')[0];
   
   tbody.insertRow().innerHTML = '<tr><td class="flashrow">' + new Date().toLocaleTimeString() + '</td>' +
         '<td class="flashrow">' + geo.country_name + '</td>' +
         '<td class="flashrow">' + ASORG + '</td>' +
         '<td class="flashrow">' + msg.protocol.toUpperCase() + ' Attack on port ' + msg.dest_port +'</td>' +    
         '</tr>';
   
   //If we have more than 5 items in the list, remove the first one
   if (tbody.rows.length > 5) {
      tbody.deleteRow(0);
   }

   //Add the point to the map if it is not already on the map
   if (!(geohash === undefined)) {               
      if (!(points.has(geohash))) {
         
         //Add the point to the map of active points to keep track of them and prevent duplicates.
         points.set(geohash, [initialRadius, initialOpacity ]);               

         //Set a timer to remove the point after displayTime seconds (5 minutes by default)
         setTimeout(function() { removePoint(geohash) }, displayTime * 1000);            

         map.addSource('points-'+geohash, {
           "type": "geojson",
           "data": {
               "type": "Feature",
               "geometry": {
                  "type": "Point",
                  "coordinates": [ geo.longitude, geo.latitude]
               },
               "properties": {
                  "description": "<strong>" + ASORG + " (AS " + geo.asn +")</strong><p>IP: " + ip + "<BR>City: " + (geo.city_name === undefined ? 'Unknown' : geo.city_name) + 
                     "<BR>Region: " + (geo.region_name === undefined ? 'Unknown' : geo.region_name) + "<BR>Country: " + (geo.country_name === undefined ? 'Unknown' : geo.country_name) + "</P>"
               }
           }
         });
         
         map.addLayer({
           "id": "point-"+geohash,
           "source": "points-"+geohash,
           "type": "circle",
           "paint": {
               "circle-radius": initialRadius,
               "circle-radius-transition": {duration: 0},
               "circle-opacity-transition": {duration: 0},
               "circle-color": "#dd7cbf"
           }
         });  

         map.on('mouseenter', 'point-'+geohash, function (e) {
            // Change the cursor style as a UI indicator.
            map.getCanvas().style.cursor = 'pointer';
             
            var coordinates = e.features[0].geometry.coordinates.slice();
            var description = e.features[0].properties.description;
             
            // Ensure that if the map is zoomed out such that multiple
            // copies of the feature are visible, the popup appears
            // over the copy being pointed to.
            while (Math.abs(e.lngLat.lng - coordinates[0]) > 180) {
               coordinates[0] += e.lngLat.lng > coordinates[0] ? 360 : -360;
            }
             
            // Populate the popup and set its coordinates
            // based on the feature found.
            popup.setLngLat(coordinates).setHTML(description).addTo(map);
         });
             
         map.on('mouseleave', 'point-'+geohash, function () {
            map.getCanvas().style.cursor = '';
            popup.remove();
         });

         //Animate the added point.
         animateMarker(0, geohash);               
      }
   }
}

function removePoint(ip) {
   clearTimeout(timers[ip]);
   points.delete(ip);
   map.removeLayer('point-'+ip);
   map.removeSource('points-'+ip);         
}

function encodeGeoHash(latitude, longitude, precision) {
  var BITS = [16, 8, 4, 2, 1];

  var BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";
  var isEven = 1;
  var lat = [-90.0, 90.0];
  var lon = [-180.0, 180.0];
  var bit = 0;
  var ch = 0;
  precision = precision || 12;

  var geohash = "";
  while (geohash.length < precision) {
    var mid;
    if (isEven) {
      mid = (lon[0] + lon[1]) / 2;
      if (longitude > mid) {
        ch |= BITS[bit];
        lon[0] = mid;
      } else {
        lon[1] = mid;
      }
    } else {
      mid = (lat[0] + lat[1]) / 2;
      if (latitude > mid) {
        ch |= BITS[bit];
        lat[0] = mid;
      } else {
        lat[1] = mid;
      }
    }

    isEven = !isEven;
    if (bit < 4) {
      bit++;
    } else {
      geohash += BASE32[ch];
      bit = 0;
      ch = 0;
    }
  }
  return geohash;
}; 
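
To get a feel for what the geohash precision does, here is a compact, standalone sketch of the same encoding that can be run with Node.js outside the browser. The function name `geohash` and the sample coordinates are only illustrative. A shorter hash is always a prefix of a longer one for the same location, so a lower precision means a larger cell, and all attacks that fall in one cell share a single key (and therefore a single animated point) on the map.

```javascript
// Standalone geohash sketch (illustrative; mirrors the encodeGeoHash logic above).
const BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

function geohash(latitude, longitude, precision) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = '';
  let bitCount = 0;
  let value = 0;
  let useLon = true; // bits alternate between longitude and latitude, longitude first

  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const coord = useLon ? longitude : latitude;
    const mid = (range[0] + range[1]) / 2;

    value <<= 1;
    if (coord > mid) {
      value |= 1;      // upper half of the current interval
      range[0] = mid;
    } else {
      range[1] = mid;  // lower half of the current interval
    }

    useLon = !useLon;
    bitCount += 1;
    if (bitCount === 5) { // every 5 bits become one base32 character
      hash += BASE32[value];
      bitCount = 0;
      value = 0;
    }
  }
  return hash;
}

// Hypothetical coordinates: the same location at two precisions.
const coarse = geohash(52.3702, 4.8952, 5);
const fine = geohash(52.3702, 4.8952, 7);
console.log(coarse, fine, fine.startsWith(coarse));
```

With precision 7 the cells are small enough to keep different cities apart, while repeated attacks from the same location reuse one point instead of stacking animations on top of each other.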

You will need to make two small modifications at the top of `map.js`:

  • Set the IP of your webserver
  • Set your Mapbox access token

Once you have done this, open the page in your browser and a map should appear. NOTE: nothing will happen at this moment 🙂
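
If you want to see a point appear before Logstash is wired up, you can send a hand-made event to the Node.js application. The snippet below is a minimal sketch using only Node's built-in `http` module. The helper names `buildSampleEvent` and `sendEvent`, the documentation-range IP address and all geo values are made up for illustration. The field names follow what `map.js` reads; note that `geoip` is sent as a JSON string because `addPoint` calls `JSON.parse` on it.

```javascript
const http = require('http');

// Build a fake attack event in the shape that map.js expects.
// All values are illustrative; 203.0.113.10 is a documentation-range address.
function buildSampleEvent() {
  return {
    type: 'Cowrie',
    protocol: 'ssh',
    source: '203.0.113.10',
    dest_port: '22',
    geoip: JSON.stringify({
      ip: '203.0.113.10',
      latitude: 52.37,
      longitude: 4.89,
      country_name: 'Netherlands',
      region_name: 'North Holland',
      city_name: 'Amsterdam',
      as_org: 'Example Org',
      asn: 64496
    })
  };
}

// POST the event as JSON to the relay application and report the HTTP status.
function sendEvent(host, port, event, done) {
  const body = JSON.stringify(event);
  const req = http.request({
    host: host,
    port: port,
    path: '/',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, function (res) {
    res.resume(); // discard the response body
    done(null, res.statusCode);
  });
  req.on('error', done);
  req.end(body);
}

// Example call (assumes server.js runs on this machine, port 8080):
// sendEvent('127.0.0.1', 8080, buildSampleEvent(), console.log);
```

With the page open in a browser, running the example call should add one row to the table and one point to the map.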

Configure Logstash

Now we are all set for the last part: configuring Logstash to also forward (some) logs to our Node.js application.
On your T-Pot server, we need to get the Logstash configuration as described on the T-Pot Wiki:

docker exec -it logstash ash
cd /etc/logstash/conf.d/
cp logstash.conf /data/elk/logstash.conf
exit

Open the Logstash configuration and add the following lines to the output section, after the Elasticsearch output:

if [type] == "ConPot" and [dest_port] and [event_type] == "NEW_CONNECTION" and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
         "type" => "%{type}"
         "protocol" => "Elastic"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Ciscoasa" and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "Ciscoasa"
         "source" => "%{src_ip}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Mailoney" and [dest_port] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "Mail"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "ElasticPot" and [dest_port] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "Elastic"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Adbhoney" and [dest_port] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "ADB"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Dionaea" and [dest_port] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "%{[connection][transport]}"
         "service" => "%{[connection][protocol]}"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Fatt" and [protocol] != "ssh" and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "%{protocol}"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "Cowrie" and [dest_port] and [protocol] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "%{protocol}"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}
if [type] == "HoneyTrap" and [dest_port] and [src_ip] != "${MY_INTIP}" {
   http {
     url => "http://${HTTP_LOGIP}"
     http_method => "post"
     mapping => {
          "type" => "%{type}"
         "protocol" => "%{[attack_connection][protocol]}"
         "source" => "%{src_ip}"
         "dest_port" => "%{dest_port}"
         "geoip" => "%{geoip}"
     }
   }
}

We need to add a new variable to the docker environment:

vi /opt/tpot/etc/compose/elk_environment

Then add the following line to the file, so Logstash knows where to send the messages (the Node.js application on your webserver, listening on port 8080):

HTTP_LOGIP=<YOUR_WEBSERVER_IP>:8080

Now add a new docker volume for the Logstash service:

vi /opt/tpot/etc/tpot.yml

Go to the Logstash service section and add the following line:

- /data/elk/logstash.conf:/etc/logstash/conf.d/logstash.conf

Now we are all set and it’s time to restart your T-Pot service:

systemctl restart tpot

That’s It!

Now take a look at your map. If there are attacks on your server, they should appear on the map and in the listing below.

You can trigger an event by, for example, opening a regular SSH session to your T-Pot server:

ssh <TPOT_SERVER>

Simply close the connection once it is established, and your location should appear on the map.

Daily figures

Almost immediately after you start running a honeypot, you will see attacks. Within one day, I saw over 200,000 attacks, mostly on common ports such as HTTP(S), SSH and SMTP. You can use this data to make your environments safer, or just use it for some fun projects.

Some notes
As this was a quick project with limited time, there is definitely room for optimisation and cleaner code 🙂 The JavaScript will throw a few errors after some time, probably because a point is removed from the map while an update for that same point is still scheduled.
In addition, some points on the map will suddenly run on steroids, animating at a higher frame rate than they did initially. The Node.js application was made quick and dirty but is suitable for this demo.
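
One way to make the animation easier to reason about is to separate the per-frame arithmetic from the timers and the map calls. The sketch below is not the code used in the demo; it restates the update from `animateMarker` as a pure function, with the constants copied from the defaults in `map.js` and a hypothetical name `nextFrame`.

```javascript
// Defaults copied from map.js.
const FRAMES_PER_SECOND = 15;
const INITIAL_RADIUS = 3;
const INITIAL_OPACITY = 1;
const MAX_RADIUS = 15;

// Given the stored radius and opacity of a point, return the values to paint
// for this frame and the values to store for the next frame.
function nextFrame(radius, opacity) {
  // Grow towards the maximum radius and fade out a little each frame.
  const paintRadius = radius + (MAX_RADIUS - radius) / FRAMES_PER_SECOND;
  const paintOpacity = Math.max(0, opacity - 0.9 / FRAMES_PER_SECOND);

  // Once the circle is fully transparent, the pulse restarts from the initial values.
  const finished = paintOpacity <= 0;
  return {
    paint: { radius: paintRadius, opacity: paintOpacity },
    next: finished
      ? { radius: INITIAL_RADIUS, opacity: INITIAL_OPACITY }
      : { radius: paintRadius, opacity: paintOpacity }
  };
}

console.log(nextFrame(INITIAL_RADIUS, INITIAL_OPACITY));
```

A caller could then check that the layer still exists (for example with `map.getLayer`) before passing the `paint` values to `setPaintProperty`, which may avoid updating a point that has just been removed. This has not been tested against the demo.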


Measuring and Monitoring With Prometheus and Alertmanager Part 2

This is part two in the series about Prometheus and Alertmanager.

In the first part we installed the Prometheus server and the node exporter, in addition to discovering some of the measuring and graphing capabilities using the Prometheus server web interface.

In this part, we will be looking at Grafana to expand the possibilities of graphing our metrics, and we will use Alertmanager to alert us of any metrics that are outside the boundaries we define for them. Finally, we will install a dashboard application for a nice tactical overview of our Prometheus monitoring platform.

Installing Grafana

The installation of Grafana is fairly straightforward, and all of the steps involved are described in full detail in the official documentation.

For Ubuntu, which we’re using in this series, the steps involved are:

sudo apt-get install -y apt-transport-https software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana

At this point Grafana will be installed, but the service has not been started yet.

To start the service and verify that the service has started:

sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server

The status output should report the service as active (running).

If you want Grafana to start at boot time, run the following:

sudo systemctl enable grafana-server.service

Grafana listens on port 3000 by default, so at this point you should be able to access your Grafana installation at http://<IP of your Grafana server>:3000

You will be welcomed by the login screen. The default login after installation is admin with password admin.

After successfully logging in, you will be asked to change the password for the admin user. Do this immediately!

Creating the Prometheus Data Source

Our next step is to create a new data source in Grafana that connects to our Prometheus installation. To do this, go to Configuration > Data Sources and click the blue Add data source button.

Grafana supports various time series data sources, but we will pick the top one, which is Prometheus.

Enter the URL of your Prometheus server, and that’s it! Leave all the other fields untouched; they are not needed at this point.

You should now have a Prometheus data source in Grafana, and we can start creating some dashboards!

Creating Our First Grafana Dashboard

A lot of community-created dashboards can be found at https://grafana.com/grafana/dashboards. We’re going to use one of them that will give us a very nice overview of the metrics scraped from the node exporter.

To import a dashboard click the + icon in the side menu, and then click Import.

Enter the dashboard ID 1860 in the ‘Import via grafana.com’ field and click ‘Load’.

The dashboard should be imported, and the only thing we still need to do is select the Prometheus data source we just created in the dropdown at the bottom of the page and click ‘Import’.

You should now have your first pretty Grafana dashboard, that shows all of the important metrics offered by the node exporter.

Adding Alertmanager in the Mix

Now that we have all these metrics of our nodes flowing into Prometheus, and we have a nice way of visualising this data, it would be nice if we could also raise alerts when things don’t go as planned. Grafana offers some basic alerting functionality for Prometheus data sources, but if you want more advanced features, Alertmanager is the way to go.

Alerting rules are set up in Prometheus server. These rules allow you to define alert conditions based on PromQL expressions. Whenever an alert expression amounts to a result, the alert is considered active.

To turn this active alert condition into an action, Alertmanager comes into play. It can send notifications through a wide variety of channels, such as email, communication platforms like Slack or Mattermost, and incident/on-call management tools like PagerDuty and Opsgenie. Alertmanager also handles summarization, aggregation, rate limiting and silencing of alerts.

Let’s go ahead and install Alertmanager on the Prometheus server instance we installed in part one of this blog.

Installing Alertmanager

Start off by creating a separate user for Alertmanager:

useradd -M -r -s /bin/false alertmanager

Next, we need a directory for the configuration:

mkdir /etc/alertmanager
chown alertmanager:alertmanager /etc/alertmanager

Then download Alertmanager and verify its integrity:

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/alertmanager/releases/download/v0.21.0/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in alertmanager-0.21.0.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted, and you should try again.

Next we unpack the file and move the various components into place:

tar xzf alertmanager-0.21.0.linux-amd64.tar.gz
cp alertmanager-0.21.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

And clean up our downloaded files in /tmp:

rm -f /tmp/alertmanager-0.21.0.linux-amd64.tar.gz
rm -rf /tmp/alertmanager-0.21.0.linux-amd64

We need to supply Alertmanager with an initial configuration. For our first test, we will configure alerting by email (Be sure to adapt this configuration for your email setup!):

global:
  smtp_from: 'AlertManager <alertmanager@example.com>'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_hello: 'alertmanager'
  smtp_auth_username: 'username'
  smtp_auth_password: 'password'
  smtp_require_tls: true

route:
  group_by: ['instance', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: myteam

receivers:
  - name: 'myteam'
    email_configs:
      - to: 'user@example.com'

Save this in a file called /etc/alertmanager/alertmanager.yml and set permissions:

chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml

To be able to start and stop our Alertmanager instance, we will create a systemd unit file. Use your favorite editor to create the file /etc/systemd/system/alertmanager.service and add the following to it (replacing <server IP> with the IP or resolvable FQDN of your server):

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/etc/alertmanager/
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --web.external-url http://<server IP>:9093

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager

The command systemctl status alertmanager should now indicate that our service is up and running.

Now we need to alter the configuration of our Prometheus server to inform it about our Alertmanager instance. Edit the file /etc/prometheus/prometheus.yml. There should already be an alerting section. All we need to do is change the section so it looks like this:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093

We also need to tell Prometheus where our alerting rules live. Change the rule_files section to look like this:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*.yml"

Save the changes, and create the directory for the alert rules:

mkdir /etc/prometheus/rules
chown prometheus:prometheus /etc/prometheus/rules

Restart the Prometheus server to apply the changes:

systemctl restart prometheus

Creating Our First Alert Rule

Alerting rules are written using the Prometheus expression language or PromQL. One of the easiest things to check is whether all Prometheus targets are up, and trigger an alert when a certain exporter target becomes unreachable. This is done with the simple expression up.

Let’s create our first alert by creating the file /etc/prometheus/rules/alert-rules.yml with the following content:

groups:
- name: alert-rules
  rules:
  - alert: ExporterDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down for more than 5 minutes.'
      summary: 'Exporter down (instance {{ $labels.instance }})'

This alert will trigger as soon as any of the exporter targets in Prometheus is not reported as up for more than 5 minutes. We apply the severity label critical to it.
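
The effect of the for clause can be illustrated with a small simulation. The sketch below is not Prometheus code; it only mimics the behaviour described above. The function name `alertStates` and the sample timestamps are made up. An alert becomes pending when the expression first matches, starts firing once it has matched continuously for the configured duration, and goes back to inactive as soon as the expression stops matching.

```javascript
// Simulate the states of an "up == 0" alert with a "for" duration.
// samples: list of { t: seconds, up: 0 or 1 } in evaluation order.
function alertStates(samples, forSeconds) {
  let activeSince = null; // time at which the expression started matching

  return samples.map(function (sample) {
    if (sample.up !== 0) {
      activeSince = null; // condition cleared, the hold timer starts over
      return 'inactive';
    }
    if (activeSince === null) {
      activeSince = sample.t;
    }
    return sample.t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Hypothetical evaluations: the exporter goes down at t=60 and recovers at t=420.
const states = alertStates([
  { t: 0, up: 1 },
  { t: 60, up: 0 },
  { t: 120, up: 0 },
  { t: 360, up: 0 },
  { t: 420, up: 1 }
], 300);
console.log(states);
```

A short outage that recovers before the five minutes are over never leaves the pending state, so no notification is sent for it.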

Restart prometheus with systemctl restart prometheus to load the new alert rule.

You should be able to see the alert rule in the prometheus web interface now too, by going to the Alerts section.

Now the easiest way for us to check whether this alert actually fires, and whether we get our email notification, is to stop the node exporter service:

systemctl stop node_exporter

As soon as we do this, we can see that the alert status has changed in the Prometheus server dashboard. It is now marked as pending, but is not yet firing, because the condition needs to persist for a minimum of 5 minutes, as specified in our alert rule.

When the 5 minute mark is reached, the alert fires, and we should receive an email from Alertmanager alerting us about the situation.

We should also be able to manage the alert now in the Alertmanager web interface. Open http://<server IP>:9093 in your browser and the alert that we just triggered should be listed. We can choose to silence the alert, to prevent any more alerts from being sent out.

Click silence, and you will be able to configure the duration of the silence period, add a creator and a description for some more metadata, and expand or limit the group of alerts this particular silence applies to. If, for example, I wanted to silence all ExporterDown alerts for the next 2 hours, I could remove the instance matcher.

More Advanced Alert Examples

Since Prometheus alerts use the same powerful PromQL expressions as queries, we are able to define rules that go way beyond whether a service is up or down. For a full rundown of all the PromQL functions available, check out the Prometheus documentation or the excellent PromQL for humans.

Memory Usage

For starters, here is an example of an alert rule to check the memory usage of a node. It fires once the percentage of memory available is smaller than 10% of the total memory available for a duration of 5 minutes:

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host out of memory (instance {{ $labels.instance }})'
      description: 'Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

Disk Space

We can do something similar for disk space. This alert will fire as soon as one of our target’s filesystems has less than 10% of its capacity available for a duration of 5 minutes:

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host out of disk space (instance {{ $labels.instance }})'
      description: 'Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

CPU Usage

To alert on CPU usage, we can use the metrics available under node_cpu_seconds_total. In the previous part of this blog we already went into which specific metrics we can find there.

This alert takes the per-second rate of idle CPU seconds over the last 5 minutes and multiplies it by 100 to get the percentage of time the CPU was idle. We average this by instance to cover all CPUs (cores); otherwise we would end up with a separate percentage for each CPU in the system. Subtracting the result from 100 gives the percentage of CPU in use.

The alert will fire when the average CPU usage of the system exceeds 80% for 5 minutes:

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Host high CPU load (instance {{ $labels.instance }})'
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
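
To make the arithmetic behind this expression concrete, here is a minimal Python sketch that reproduces it by hand for a hypothetical two-core machine. The counter values are made up for illustration, and the helper is a simplification: it ignores the counter-reset handling and extrapolation that Prometheus applies in rate().

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle[window])) * 100 for one instance.

    idle_start / idle_end hold the per-core idle counters (in seconds) at
    the start and end of the window. Illustrative only: no reset handling.
    """
    per_core_idle_rate = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle_rate = sum(per_core_idle_rate) / len(per_core_idle_rate)
    return 100 - avg_idle_rate * 100


# Hypothetical samples taken 300 seconds (5 minutes) apart on two cores.
idle_at_start = [1000.0, 2000.0]
idle_at_end = [1030.0, 2060.0]  # core 0 idled for 30 s, core 1 for 60 s

usage = cpu_usage_percent(idle_at_start, idle_at_end, 300)
print(f"CPU usage: {usage:.1f}%")  # 85.0%, which is above the 80% threshold
```

With these numbers the cores were idle 10% and 20% of the time, so the average idle share is 15% and the rule's expression evaluates to 85.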

Predictive Alerting

Using the PromQL function predict_linear we can expand on the disk space alert mentioned earlier. predict_linear can predict the value of a certain time series X seconds from now. We can use this to predict when our disk is going to fill up, if the pattern follows a linear prediction model.

The following alert will trigger if the linear prediction algorithm, using disk usage patterns over the last hour, determines that the disk will fill up in the next four hours:

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Disk {{ $labels.device }} will fill up in the next 4 hours'
      description: |
        Based on the trend over the last hour, it looks like the disk {{ $labels.device }} on {{ $labels.mountpoint }}
        will fill up in the next 4 hours (predicted free bytes in 4 hours: {{ $value | humanize }})
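
To get a feel for what predict_linear does, here is a minimal Python sketch of the underlying idea: fit a straight line through the samples with least squares and read off the value some seconds into the future. This is a simplified illustration and not the Prometheus implementation (for one thing, it extrapolates from the last sample instead of from the evaluation time), and the sample data is made up.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


GIB = 1024 ** 3

# Hypothetical disk: 20 GiB free at the start, losing 1 GiB every 10 minutes,
# sampled for one hour (7 samples, 600 seconds apart).
samples = [(i * 600, (20 - i) * GIB) for i in range(7)]

predicted = predict_linear(samples, 4 * 3600)
print(f"Predicted free space in 4 hours: {predicted / GIB:.1f} GiB")
print("Alert condition met:", predicted < 0)
```

The disk has 14 GiB left after the hour and loses 6 GiB per hour, so the prediction four hours ahead is negative and the rule's condition (< 0) would hold.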

Give Me More!

If you are interested in more examples of alert rules, you can find a very extensive collection at Awesome Prometheus alerts. You can find examples here for exporters we haven’t covered too, such as the Blackbox or MySQL exporter.

Syntax Checking Your Alert Rule Definitions

Prometheus comes with a tool that allows you to verify the syntax of your alert rules. This will come in handy for local development of rules or in CI/CD pipelines, to make sure that no broken syntax makes it to your production Prometheus platform.

You can invoke the tool by running promtool check rules followed by the path to your rules file:

# promtool check rules /etc/prometheus/rules/alert-rules.yml
Checking /etc/prometheus/rules/alert-rules.yml
  SUCCESS: 5 rules found

Scraping Metrics From Alertmanager

Alertmanager has a built-in metrics endpoint that exports metrics about how many alerts are firing, resolved or silenced. Now that we have all components running, we can add Alertmanager as a target to our Prometheus server to start scraping these metrics.

On your Prometheus server, open /etc/prometheus/prometheus.yml with your favorite editor and add the following new job under the scrape_configs section (replace 192.168.0.10 with the IP of your alertmanager instance):

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.0.10:9093']

Restart Prometheus, and check in the Prometheus web console if you can see the new Alertmanager section under Status > Targets. If all goes well, a query in the Prometheus web console for alertmanager_cluster_enabled should return one result with the value 1.

We can now continue with adding alert rules for Alertmanager itself:

  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Prometheus not connected to alertmanager (instance {{ $labels.instance }})'
      description: "Prometheus cannot connect to Alertmanager\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Prometheus AlertManager notification failing (instance {{ $labels.instance }})'
      description: "Alertmanager is failing to send notifications\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

The first rule will fire when Prometheus has not been connected to any Alertmanager for over 5 minutes; the second rule will fire when Alertmanager fails to send out notifications. But how will we know about the alert, if notifications are failing? That’s where the next section comes in handy!

Alertmanager Dashboard Using Karma

The Alertmanager web console is useful for a basic overview of alerts and to manage silences, but it is not really suitable for use as a dashboard that gives us a tactical overview of our Prometheus monitoring platform.

For this, we will use Karma.

Karma offers a nice overview of active alerts, grouping of alerts by a certain label, silence management, alert acknowledgement and more.

We can install it on the same machine where Alertmanager is running using the following steps:

Start off by creating a separate user and configuration folder for karma:

useradd -M -r -s /bin/false karma
mkdir /etc/karma
chown karma:karma /etc/karma

Then download the file and verify its checksum:

cd /tmp
wget https://github.com/prymitive/karma/releases/download/v0.78/karma-linux-amd64.tar.gz
wget -O - -q https://github.com/prymitive/karma/releases/download/v0.78/sha512sum.txt | grep linux-amd64 | shasum -c -

Make sure the last command returns karma-linux-amd64.tar.gz: OK again. Now unpack the file and move it into place:

tar xzf karma-linux-amd64.tar.gz
mv karma-linux-amd64 /usr/local/bin/karma
rm karma-linux-amd64.tar.gz

Create the file /etc/karma/karma.yml and add the following default configuration (replace the username and password):

alertmanager:
  interval: 1m
  servers:
    - name: alertmanager
      uri: http://localhost:9093
      timeout: 20s
authentication:
  basicAuth:
    users:
      - username: cartman
        password: secret

Set the proper permissions on the config file:

chown karma:karma /etc/karma/karma.yml
chmod 640 /etc/karma/karma.yml

Create the file /etc/systemd/system/karma.service with the following content:

[Unit]
Description=Karma Alertmanager dashboard
Wants=network-online.target
After=network-online.target
After=alertmanager.service

[Service]
User=karma
Group=karma
Type=simple
WorkingDirectory=/etc/karma/
ExecStart=/usr/local/bin/karma \
    --config.file=/etc/karma/karma.yml

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start karma
systemctl enable karma

The command systemctl status karma should now indicate that karma is up and running.

You should now be able to visit your new Karma dashboard at http://<alertmanager server IP>:8080. If we stop the node_exporter service again and wait 5 minutes for the alert to fire, the alert will show up on the dashboard.

If you want to explore all the possibilities and configuration options of Karma, then please see the documentation.

Conclusion

In this series we’ve installed Prometheus, the node exporter, and the Alertmanager. We’ve given a small introduction to PromQL and how to write Prometheus queries and alert rules, and used Grafana to graph metrics and Karma to offer an overview of triggered alerts.

If you want to explore further, check out the following resources:


Understanding and Interpreting CPU Steal Time on Virtual Machines

Virtual machines report on different types of usage metrics, such as server load, memory usage, and steal time. Customers often ask about steal time – what is it, and why is it reported on their virtual machines? Read on as we explain how steal time works to better understand what it means for your virtual machine. 

What is Steal Time? 

Steal time is the percentage of time the virtual machine process is waiting on the physical CPU for its CPU time. You can monitor processes and resource usage by running the “top” command on your Linux server. Among the usage metrics it reports, steal time is labeled ‘st’.
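
On Linux, tools such as top derive this percentage from the counters on the aggregate cpu line in /proc/stat. Here is a minimal Python sketch of that calculation between two samples; the two sample lines are made up for illustration, and on a real server you would read the first line of /proc/stat twice, a few seconds apart.

```python
FIELD_NAMES = ["user", "nice", "system", "idle", "iowait",
               "irq", "softirq", "steal", "guest", "guest_nice"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    values = line.split()[1:]  # drop the leading 'cpu' label
    return dict(zip(FIELD_NAMES, (int(value) for value in values)))


def steal_percent(before, after):
    """Share of elapsed CPU time reported as steal between two samples."""
    # guest and guest_nice are already included in user and nice,
    # so they are left out of the total to avoid counting them twice.
    counted = FIELD_NAMES[:8]
    total = sum(after[name] - before[name] for name in counted)
    if total == 0:
        return 0.0
    return 100.0 * (after["steal"] - before["steal"]) / total


# Hypothetical samples, values in clock ticks.
first = parse_cpu_line("cpu  1000 0 500 8000 100 0 50 350 0 0")
second = parse_cpu_line("cpu  1300 0 600 8500 100 0 50 450 0 0")

print(f"steal: {steal_percent(first, second):.1f}%")  # steal: 10.0%
```

In this example 1000 ticks passed between the samples, of which 100 were spent waiting for the hypervisor, giving 10% steal time.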

CPU in Virtual Environments

In cloud environments, the hypervisor acts as the interface between the physical server and its virtualized environment. The hypervisor kernel manages all these tasks by scheduling the running processes to the physical cores of the server. Processes such as virtual machines, networking operations, and storage I/O requests are each given some CPU time to process jobs. Because CPU time is divided among these processes, priorities shift and the processes contend for the physical cores.

%Idle Time

Steal time can also be visible on virtual machines alongside idle time. Idle time means that there is CPU time allocated by the hypervisor, but the virtual machine did not use that time. In this case, we can assume there was no effect on the performance at all.

When the idle time percentage is 0 and steal time is present, we can assume that processes on the virtual machine are processed with a delay.

Multi-Tenant Cloud

Leaseweb cloud platforms consist of single-tenant and multi-tenant environments. Leaseweb CloudStack products allow you to develop and run a multi-tenant environment, enabling different kinds of users to run their cloud infrastructures at a lower cost. Along with not overselling virtual cores on our premium CloudStack platforms, we also do not pin virtual machines to CPU cores. This allows the hypervisor to allocate CPU time from all the server’s physical cores to any of its active processes.

Theoretically speaking, if the virtual machine has immediate access to its assigned cores 100% of the time, there would be no steal time visible. However, hypervisors are running many different tasks and are continuously performing actions such as rescheduling tasks for efficiency and processing received data from other systems. All these processes require CPU time from the hypervisor’s CPU, resulting in delayed access to the physical cores and adding steal time to the virtual machine.

Analyze Service Performance

A small amount of steal time is often unavoidable in modern hosting environments, particularly when running on shared cloud hosting. The steal time virtual machines experience is not always visible from outside the virtualized operating system.

If you see a constant steal time registered by the virtual machine, try finding a correlation with the tasks you are executing. More importantly, how does this steal time result in performance loss? Are you noticing any loss in performance on your applications? If so, try measuring output to discover latency in the whole flow of your application in accordance with steal time. Keep your hosting provider informed if you do experience an impact on your application. In many situations, they can find a more suitable environment by moving your virtual machine to a different hypervisor.


Using Correlation IDs in API Calls

Over the years, the IT industry has moved from a single domain, monolithic architecture to a microservice architecture. In a microservice architecture, complex processes are split into smaller and simpler sub-processes. While this kind of architecture has many benefits, there are also some downsides. For example, if you send one request to a Leaseweb API, it results in multiple requests to other backend systems [FIGURE 1]. How do you keep track of requests and responses processed by multiple systems? This is where Correlation IDs come into play.

[FIGURE 1: Example request/response flow]

Using a Correlation ID

A Correlation ID is a unique, randomly generated identifier value that is added to every request and response. In a microservice architecture, the initial Correlation ID is passed to your sub-processes. If a sub-system also makes sub-requests, it will also pass the Correlation ID to those systems.

How you pass the Correlation ID to other systems depends on your architecture. At Leaseweb we are using REST APIs a lot, with HTTP headers to pass on the Correlation ID. As a rule, we assign a Correlation ID as soon as possible, and always use a Correlation ID if it is passed on. Our public API only accepts Correlation IDs from internally trusted clients. For any other client (such as an employee or customer API clients) a new Correlation ID is generated for the request.
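
The rule described above (reuse a Correlation ID from a trusted client, generate a new one otherwise) can be sketched in a few lines of Python. This is a framework-independent illustration and not Leaseweb's implementation: the function name and the client_is_trusted flag are hypothetical, and how you decide whether a client is trusted depends on your own setup.

```python
import uuid

HEADER = "X-My-Correlation-ID"  # illustrative header name


def resolve_correlation_id(headers, client_is_trusted):
    """Return the Correlation ID to use for the current request.

    An incoming ID is only reused when the caller is a trusted internal
    client; every other request gets a freshly generated ID.
    """
    incoming = headers.get(HEADER)
    if client_is_trusted and incoming:
        return incoming
    return str(uuid.uuid4())


# A trusted internal service passes its ID along and it is kept.
print(resolve_correlation_id({HEADER: "abc-123"}, client_is_trusted=True))

# An external client cannot choose its own ID; a new one is generated.
print(resolve_correlation_id({HEADER: "abc-123"}, client_is_trusted=False))
```

Generating the ID at the edge and ignoring values from untrusted callers keeps external clients from injecting identifiers into your internal logs.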

Real Value of Correlation IDs

The real value of Correlation IDs is realized when you also log the Correlation IDs. Debugging or tracing requests becomes much easier, as you can search all of your logs for the same Correlation ID. Combined with central logging solutions (such as the ELK stack), searching logs becomes even easier and can be done by non-technical colleagues. Providing tools to your colleagues to troubleshoot issues allows them to have more responsibility and gives you more time to work on more technical projects.

We mainly use Correlation IDs at Leaseweb for debugging purposes. When an error occurs, we provide the Correlation ID to the client/customer. If users provide the Correlation ID when submitting a support ticket, we can visualize the entire process needed to fulfil the client’s initial intent. This has significantly improved the time it takes us to fix bugs.

[FIGURE 2: Example of one Correlation ID with multiple requests]

Debugging issues is a time-consuming process if Correlation IDs are not used. When your environment scales, you will need to find solutions to group transactions happening in your systems. By using a Correlation ID, you can easily group requests and events in your systems, allowing you to spend more time fixing the problem and less time trying to find it.

Practical examples on how to implement Correlation IDs

The following examples use Symfony, a popular web application framework. These concepts can also be applied to any other framework, such as Laravel, Django, Flask or Ruby on Rails.

If you are unfamiliar with the concept of Service Containers and Dependency Injection, we recommend reading the excellent Symfony documentation about it here: https://symfony.com/doc/current/service_container.html

Using Monolog to append Correlation IDs to your application logs

When processing an HTTP request, your application often logs some information, such as when an error occurred or when an important change was made in your system that you want to keep track of. When using the Monolog logging library in PHP (https://seldaek.github.io/monolog/), you can use the concept of “Processors” (read more about that on symfony.com) to add the Correlation ID to every log record.

One way to do this is by creating a Monolog Processor class:

<?php

namespace App\Monolog\Processor;

use Symfony\Component\HttpFoundation\RequestStack;

class CorrelationIdProcessor
{
    protected $requestStack;

    public function __construct(RequestStack $requestStack)
    {
        $this->requestStack = $requestStack;
    }

    public function processRecord(array $record)
    {
        $request = $this->requestStack->getCurrentRequest();

        // A processor must always return the record, otherwise the log line is lost
        if (!$request) {
            return $record;
        }

        $correlationId = $request->headers->get('X-My-Correlation-ID');

        if (empty($correlationId)) {
            return $record;
        }

        // If we have a correlation id include it in every monolog line
        $record['extra']['correlation_id'] = $correlationId;

        return $record;
    }
}

Then register this class on the service container as a monolog processor in services.yml:

# app/config/services.yml

services:
  App\Monolog\Processor\CorrelationIdProcessor:
    arguments: ["@request_stack"]
    tags:
      - name: monolog.processor
        method: processRecord

Now, every time you log something in your application with Monolog:

$this->logger->info('shopping_cart_emptied', ['cart_id' => 123]);

You will see the Correlation ID of the HTTP Request in your log files:

$ grep 'shopping_cart_emptied' var/logs/prod.log

[2020-07-03 12:14:45] app.INFO: shopping_cart_emptied {"cart_id":123} {"correlation_id":"d135d5f1-3dd0-45fa-8f26-55d8d6a44876"}

You can utilize the same pattern to log the name of the user that is currently logged in, the remote IP address of the API client, or anything else that makes troubleshooting faster for you.
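
As mentioned, the same idea carries over to other stacks. As a rough illustration for a Python application (for example Django or Flask), here is a minimal sketch using only the standard library: a logging filter plays the role of the Monolog processor and a context variable holds the Correlation ID of the current request. The names correlation_id_var and CorrelationIdFilter are made up for this example, and in a real application a request hook would set the variable from the incoming header.

```python
import contextvars
import io
import logging

# Holds the Correlation ID of the request currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current Correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # log to memory so the result is easy to inspect
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(message)s correlation_id=%(correlation_id)s"))

logger = logging.getLogger("shop")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Normally set once per request, from the incoming HTTP header.
correlation_id_var.set("d135d5f1-3dd0-45fa-8f26-55d8d6a44876")
logger.info("shopping_cart_emptied cart_id=%s", 123)

print(stream.getvalue(), end="")
```

Every log line written while handling the request now carries the same correlation_id field, without the calling code having to pass it along.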

Using Guzzle to append Correlation IDs when making sub-requests

If your API makes API calls to other microservices (and you use Guzzle to do this) you can make use of Handlers and Middleware.

Some teams at Leaseweb depend on many downstream microservices, and can therefore have multiple Guzzle clients as services on the service container. While each Guzzle client is configured with its own base URL and/or authentication, it is possible for all of the Guzzle clients to share the same HandlerStack.

First, create the middleware:

<?php

namespace App\Guzzle\Middleware;

use Symfony\Component\HttpFoundation\RequestStack;
use Psr\Http\Message\RequestInterface;

class CorrelationIdMiddleware
{
    protected $requestStack;
 
    public function __construct(RequestStack $requestStack)
    {
        $this->requestStack = $requestStack;
    }

    public function __invoke(callable $handler)
    {
        return function (RequestInterface $request, array $options = []) use ($handler) {
            // The incoming Symfony request; kept separate from the outgoing PSR-7 $request
            $currentRequest = $this->requestStack->getCurrentRequest();

            if (!$currentRequest) {
                return $handler($request, $options);
            }

            $correlationId = $currentRequest->headers->get('X-My-Correlation-ID');

            if (empty($correlationId)) {
                return $handler($request, $options);
            }

            // Copy the Correlation ID onto the outgoing request
            $request = $request->withHeader('X-My-Correlation-ID', $correlationId);

            return $handler($request, $options);
        };
    }
}

Define this middleware as a service on the service container and create a HandlerStack:

# app/config/services.yml

services:
  correlation_id_middleware:
    class: App\Guzzle\Middleware\CorrelationIdMiddleware
    arguments: ["@request_stack"]

  correlation_id_handler_stack:
    class: GuzzleHttp\HandlerStack
    factory: ['GuzzleHttp\HandlerStack', 'create']
    calls:
      - [push, ["@correlation_id_middleware", "correlation_id_forwarder"]]

With these two services defined, you can now configure all your Guzzle clients using the HandlerStack so that the Correlation ID of the current HTTP request is forwarded to downstream HTTP requests:

# app/config/services.yml

services:
  my_downstream_api:
    class: GuzzleHttp\Client
    arguments:
      - base_uri: https://my-downstream-api.example.com
        handler: "@correlation_id_handler_stack"

Now every API call that you make to https://my-downstream-api.example.com will include the HTTP request header ‘X-My-Correlation-ID’ and have the same value as the Correlation ID of the current HTTP request. You can also apply the same Monolog and Guzzle tricks described here to the downstream API.

Expose Correlation IDs in error responses

The missing link between these processes is to now expose your Correlation IDs to your users so they can also log them or use them in support cases they report to your organization.

Symfony makes this easy using Event Listeners. You can define Event Listeners in Symfony to pre-process HTTP requests as well as to post-process HTTP Responses just before they are returned by Symfony to the API caller. In this example, we will create an HTTP Response listener and add the Correlation ID of the current HTTP request as an HTTP Header in the HTTP Response.

First, we create the listener class:

<?php
 
namespace App\Listener;
 
use Symfony\Component\HttpFoundation\RequestStack;
use Symfony\Component\HttpKernel\Event\FilterResponseEvent;

class CorrelationIdResponseListener
{
    protected $requestStack;
 
    public function __construct(RequestStack $requestStack)
    {
        $this->requestStack = $requestStack;
    }

    public function onKernelResponse(FilterResponseEvent $event)
    {
        $request = $this->requestStack->getCurrentRequest();

        if (!$request) {
            return;
        }

        $correlationId = $request->headers->get('X-My-Correlation-ID');

        if (empty($correlationId)) {
            return;
        }

        // Echo the Correlation ID back to the API caller
        $event->getResponse()->headers->set('X-My-Correlation-ID', $correlationId);
    }
}

Now configure it as a Symfony Event Listener:

# app/config/services.yml

services:
  correlation_id_response_listener:
    class: App\Listener\CorrelationIdResponseListener
    arguments: ["@request_stack"]
    tags:
      - { name: kernel.event_listener, event: kernel.response, method: onKernelResponse }

Every response that is generated by your Symfony application for a request carrying a Correlation ID will now include an X-My-Correlation-ID HTTP response header with the same Correlation ID as the HTTP request.

The Value of Correlation IDs

Using Correlation IDs throughout your whole stack gives you more insight into all (sub)requests during a transaction. Using the right tools allows others to debug issues, giving your developers more time to work on new awesome features.

Implementing Correlation IDs isn’t hard to do, and can be achieved quickly depending on your software stack. At Leaseweb, the use of Correlation IDs has saved us hours of time while debugging issues on numerous occasions.

Technical Careers at Leaseweb

We are searching for the next generation of engineers and developers to help us build infrastructure to automate our global hosting services! If you are interested in finding out more, check out our Careers at Leaseweb.


Measuring and Monitoring With Prometheus and Alertmanager Part 1

Prometheus is one of the most successful projects of the Cloud Native Computing Foundation (CNCF), so it is highly likely that you have heard of it. Initially built at SoundCloud in 2012 to fulfil their monitoring needs, Prometheus is now one of the most popular solutions for time-series based monitoring.

At Leaseweb, we use Prometheus for a variety of purposes – from basic system monitoring of our internal systems, to blackbox monitoring from several of our network locations, to cloud data usage and capacity monitoring.

Whether you have one or several servers, it is always good to have insight into what your systems are doing and how they are performing. In this article, we will show you how to set up a basic Prometheus server and expose system metrics using node_exporter.

In later posts in this series, we will add Alertmanager to our Prometheus server and use Grafana to graph our recorded metrics.

This is an overview of the components involved and their role:

  • Prometheus: Scrapes metrics from external data sources (or ‘exporters’), stores metrics in a time-series database, and exposes metrics through an API.
  • node_exporter: Exposes several system metrics, such as CPU & disk usage.
  • Alertmanager: Handles alerts generated by the Prometheus server. Takes care of deduplicating, grouping, and routing alerts to the correct alert channel such as email, Telegram, PagerDuty, Slack, etc.
  • Grafana: Uses Prometheus as a datasource to graph the recorded metrics.

For this tutorial, we are going to use three servers running Ubuntu 18.04 LTS. However, the instructions can be easily adapted for any other recent Linux distribution. These can either be bare metal servers or cloud instances. When your Prometheus setup grows and you start to scrape more and more metrics, it is advisable to have SSD based storage in your Prometheus server.

If you want to start out small or experiment, you can also combine several components on one system.

A Note on Security

Since Prometheus was designed to be run in a private network/cloud setting, it does not offer any authentication or access control out of the box. Because of this, be careful not to expose any of the services to the outside world. There are several ways to achieve this, though implementing them is outside the scope of this tutorial.

For example, you could use the Leaseweb private networking feature and bind the Prometheus-related services to your private networking interface. Other options are to use a reverse proxy that implements basic authentication, or to use firewall rules to only allow certain IP addresses to connect to your Prometheus-related services.

Installing Prometheus

To start off, we will install the Prometheus server. The prometheus package is part of the standard Ubuntu distribution repositories, but unfortunately the version (2.1.0) is quite old. At the time of writing this blog post, the latest version is 2.16.0, which is what we will be using.

On the system that will be our Prometheus server, we start off by creating a user and group called prometheus:

useradd -M -r -s /bin/false prometheus

Next, we create the directories that will contain the configuration and the data of Prometheus:

mkdir /etc/prometheus /var/lib/prometheus

Download Prometheus server and verify its integrity:

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.16.0/prometheus-2.16.0.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/prometheus/releases/download/v2.16.0/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in prometheus-2.16.0.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted. Next we unpack the file and move the various components into place:

tar xzf prometheus-2.16.0.linux-amd64.tar.gz
cp prometheus-2.16.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
cp -r prometheus-2.16.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/
cp prometheus-2.16.0.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

chown -R prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

And clean up our downloaded files in /tmp:

rm -f /tmp/prometheus-2.16.0.linux-amd64.tar.gz
rm -rf /tmp/prometheus-2.16.0.linux-amd64

The default prometheus.yml we just copied already lists Prometheus itself as a scrape target, so we will have metrics to look at right away.

To be able to start and stop our Prometheus server, we will create a systemd unit file. Use your favorite editor to create the file /etc/systemd/system/prometheus.service and add the following to it:

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Activate and start the service with the following commands:

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

The command systemctl status prometheus should now indicate that our service is up and running.

You should now be able to access the web interface of the Prometheus server on http://<server IP>:9090.

If we go to Status > Targets we can see that the Prometheus server itself has already been added as a scraping target for metrics. This default target collects metrics about the performance of the Prometheus server. You can view the metrics that are being recorded under http://<server IP>:9090/metrics.

Prometheus provides two convenient endpoints for monitoring its health and status. You can add these to any other monitoring system you might have.

root@HRA-blogtest:~# curl localhost:9090/-/healthy
Prometheus is Healthy.
root@HRA-blogtest:~# curl localhost:9090/-/ready
Prometheus is Ready.

Monitor System Metrics with the Node Exporter

To make things a little more interesting, we are going to add a target to obtain system metrics of the Prometheus server. For this, we need to install the node exporter first.

Installing the node exporter

Download Prometheus node exporter and verify its integrity:

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
wget -O - -q https://github.com/prometheus/node_exporter/releases/download/v0.18.1/sha256sums.txt | grep linux-amd64 | shasum -c -

The last command should result in node_exporter-0.18.1.linux-amd64.tar.gz: OK. If it doesn’t, the downloaded file is corrupted.

Next we unpack the file and move the node exporter into place:

tar xzf node_exporter-0.18.1.linux-amd64.tar.gz
cp node_exporter-0.18.1.linux-amd64/node_exporter /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/node_exporter

And clean up our downloaded files in /tmp:

rm -f /tmp/node_exporter-0.18.1.linux-amd64.tar.gz
rm -rf /tmp/node_exporter-0.18.1.linux-amd64

Create a unit file /etc/systemd/system/node_exporter.service for the node exporter using your favorite editor, with the following content:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Reload the systemd configuration to activate our unit file, start the service, and enable the service to start at boot time:

systemctl daemon-reload
systemctl start node_exporter.service
systemctl enable node_exporter.service

The node exporter should now be running. You can verify this with systemctl status node_exporter.

The node exporter listens on TCP port 9100. You should be able to see the node exporter metrics now at http://<server IP>:9100/metrics.

Adding the node exporter target to Prometheus

Now that the node exporter is running, we need to adapt the configuration of the Prometheus server so it can start scraping our node exporter metrics.

Open /etc/prometheus/prometheus.yml in your editor and adapt the scrape config section to look like the following:

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
    - targets: ['localhost:9100']

Save the changes and restart the Prometheus server with systemctl restart prometheus.

The Prometheus server web interface should now show a new target under Status > Targets.

Querying and Graphing the Recorded Metrics

Now that everything is set up, it is time to start looking into some of the things we are now measuring! Switch to the Graph tab in the Prometheus server web interface.

Enter node_memory_MemAvailable_bytes and click Execute. The Console tab will show you the current amount of available memory in bytes.

Switch to the Graph tab and you will see a graph of the available memory in bytes over the course of the last hour. You can increase and decrease the time range with the plus and minus on the top left of the graph.

There is another metric that records the total amount of memory in the system. It is called node_memory_MemTotal_bytes. We can use this to calculate the percentage of memory available in the system. Enter the following in the query area and click Execute:

(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

The graph will now show the percentage of available memory over time.

An alternative is to calculate this ourselves from the free, buffered, and cached memory:

((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100

Or turn it around and show the percentage of used memory instead:

(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100

The CPU usage is recorded in the metrics under node_cpu_seconds_total. This metric has several modes of the CPU recorded:

  • user: Time spent in userland
  • system: Time spent in the kernel
  • iowait: Time spent waiting for I/O
  • idle: Time the CPU had nothing to do
  • irq and softirq: Time spent servicing hardware and software interrupts
  • guest: If you are running VMs, the CPU they use
  • steal: If you are a VM, time other VMs “stole” from your CPUs

These metrics are recorded as counters, so to get per-second values we use the irate function:

irate(node_cpu_seconds_total{job="node"}[5m])
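irate derives a per-second rate from the last two samples inside the range window. The sketch below is a simplified illustration of that idea rather than Prometheus's actual implementation; the function name and sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def instant_rate(samples: Sequence[Sample]) -> Optional[float]:
    """Per-second increase between the last two samples of a counter."""
    if len(samples) < 2:
        return None  # a rate needs at least two data points
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    interval = t_last - t_prev
    if interval <= 0:
        return None
    # A counter that went down is treated as having been reset to zero,
    # so the increase since the reset is the last value itself.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / interval


# Hypothetical idle-time counter for one CPU, scraped every 5 seconds.
idle_samples = [(0.0, 100.0), (5.0, 104.5), (10.0, 109.0)]
print(instant_rate(idle_samples))  # 0.9
```

A result of 0.9 means the CPU spent 0.9 seconds idle per second of wall-clock time, which is why these values can later be read as fractions of a CPU.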

As you can see, when your server has multiple CPUs, the query returns metrics for each CPU individually. To get the overall value across all CPUs, we can use PromQL’s aggregation features with sum by:

sum by (mode, instance) (irate(node_cpu_seconds_total{job="node"}[5m]))

We can also calculate the percentage of CPU used by averaging the per-second idle rate across all CPUs, multiplying it by 100 (to get the percent CPU idle), and subtracting the result from 100:

100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)
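The aggregation in that query can be mimicked in a few lines: group the per-CPU idle rates by instance, average them, and subtract the idle percentage from 100. This is only a sketch of the arithmetic with made-up input values; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cpu_used_percent(idle_rates: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Compute 100 - avg(idle rate) * 100 per instance from {(instance, cpu): idle rate}."""
    per_instance: Dict[str, List[float]] = defaultdict(list)
    for (instance, _cpu), rate in idle_rates.items():
        per_instance[instance].append(rate)
    return {
        instance: 100 - (sum(rates) / len(rates)) * 100
        for instance, rates in per_instance.items()
    }


# Hypothetical two-CPU host: one CPU is 75% idle, the other 25% idle.
idle_rates = {
    ("localhost:9100", "0"): 0.75,
    ("localhost:9100", "1"): 0.25,
}
print(cpu_used_percent(idle_rates))  # {'localhost:9100': 50.0}
```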

And finally, to get the amount of data sent or received by our server, we can use irate(node_network_transmit_bytes_total{device!="lo"}[1m]) and irate(node_network_receive_bytes_total{device!="lo"}[1m]). This will give us a bytes-per-second graph, since irate always returns a per-second rate regardless of the range used. The device!="lo" matcher makes sure we exclude the local loopback interface.

To turn this into megabits per second, we multiply by 8 (bytes to bits) and divide by 1024 twice:

(sum(irate(node_network_receive_bytes_total{device!="lo"}[1m])) by (instance, device) * 8 / 1024 / 1024)
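As a quick check of the unit conversion, the same math can be applied to a single number. The helper name and the sample rate below are hypothetical.

```python
def bytes_per_second_to_mbit(bytes_per_second: float) -> float:
    """Convert bytes/s to megabits/s using the same factors as the query above."""
    return bytes_per_second * 8 / 1024 / 1024


# Hypothetical receive rate of 1,310,720 bytes per second.
print(bytes_per_second_to_mbit(1_310_720))  # 10.0
```

Because the divisor is 1024 * 1024, the result is strictly in mebibits per second; dividing by 1,000,000 instead would give SI megabits.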

To get a full idea of what the PromQL query language can do, see the documentation. By investigating the metrics available in the node exporter, you can create many more graphs like these, for example for the amount of available disk space or the number of file descriptors in use.

In the next part of this blog, we will go deeper into visualizing the metrics using Grafana, and will also define alerting rules to receive alerts through Alertmanager.
