Overview of Monitoring System with Prometheus and Grafana

Nov 27, 2023

Hello, my name is Seva, and I work as a backend developer at Doubletapp. I also handle some DevOps tasks. In this article, I’ll share insights into monitoring our backend applications: collecting metrics, visualizing them, and sending notifications. I’ll provide examples of configurations with detailed comments and share GitHub links.

At Doubletapp, we do outsourced development. New projects come in often, while some of the older ones transition into support mode. As a result, we have many services that are no longer under active development, deployed on clients’ servers. To avoid learning about issues from users days later, we need an overview system that collects metrics from all projects and answers questions such as:

  • Is the server/service alive?
  • How is the Requests Per Second (RPS) changing?
  • What is the response time?
  • Is there enough memory and CPU?
  • How much disk space is left?

For this purpose, we chose a popular solution: collecting metrics using Prometheus, displaying them on dashboards in Grafana, and sending alerts to Telegram through Prometheus Alert Manager.


Applications are usually deployed on cloud services, and each client has their own infrastructure. Therefore, we cannot use service discovery, which allows Prometheus to automatically detect new objects such as Docker containers on a host or pods in a Kubernetes cluster.

Instead, we use static configuration. Editing the config has to be done manually, but it allows us to configure each source: specify the interval for collecting metrics, the endpoint, authentication data, and more.

Prometheus Configuration


# The global section specifies parameters for all metric collection configurations
global:
  # Default interval for scraping metrics
  scrape_interval: 10s
  # Interval for evaluating alerting rules (more details in alerts.yml)
  evaluation_interval: 10s

# List of metric collection configurations
scrape_configs:
  # Name of the process collecting metrics
  - job_name: example-service
    static_configs:
      # List of domains/IPs from which metrics will be collected
      - targets:
          - example.url.com
        # Additional data to be added to the record
        labels:
          # Project name
          instance_group: example_project
          # Application environment
          instance_env: test
          # Type of metric source
          instance_type: service
          # Team responsible for the project
          team: backend
. . .

To add a new service, you need to modify the config and restart Prometheus. You can enable configuration reloading at runtime, but we simply restart Prometheus during the rollout of a new config via CI/CD.

Collecting Metrics

Unlike Graphite or InfluxDB, where applications push their metrics, Prometheus itself goes to services and pulls telemetry from them. For this to work, a service must expose an endpoint reporting its state in a format Prometheus understands: either its own text-based exposition format or OpenMetrics.

This is usually done with an exporter: a daemon that inspects the system’s state or logs, prepares telemetry, and serves it over HTTP(S). Ready-made exporters exist for most widely used services. In simple cases, you can implement /metrics in the application itself, for example, if you only want to check that the application is running and ready to handle requests.
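As a sketch of what such a hand-rolled /metrics endpoint might look like, here is a minimal Python example using only the standard library; a real project would more likely use a client library such as prometheus_client, and the metric names app_up and app_requests_total are illustrative, not from our codebase:

```python
# Minimal hand-rolled /metrics endpoint using only the standard library.
# Metric names (app_up, app_requests_total) are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # would be incremented by real request handlers

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments plus samples
    return (
        "# HELP app_up Whether the application is up and ready to serve.\n"
        "# TYPE app_up gauge\n"
        "app_up 1\n"
        "# HELP app_requests_total Total number of handled requests.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int) -> None:
    # Blocks forever; run it in the application's process or a side thread
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

After calling serve(8000), a scrape of http://localhost:8000/metrics returns the samples in the text format Prometheus parses.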

Exporters usually work over HTTP without authentication by default, allowing anyone to access them. It’s essential to set up TLS and authorization. This can be done either through the exporter’s settings or with a proxy like Nginx.
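On the Prometheus side, the scrape configuration can then point at the protected endpoint. A sketch in the same style as the config above, where the job name, credentials, and target are placeholders (scheme and basic_auth are standard scrape_config options):

```yaml
# Sketch of a scrape config for an exporter behind TLS and basic auth;
# the job name, credentials, and target below are placeholders.
scrape_configs:
  - job_name: example-service-secure
    # Scrape over HTTPS instead of plain HTTP
    scheme: https
    # Credentials that Nginx (auth_basic) or the exporter itself verifies
    basic_auth:
      username: prometheus
      password: <secret>
    static_configs:
      - targets:
          - example.url.com
```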

Application Metrics

For all our projects, we use Nginx to proxy requests to the API. Therefore, we can build basic metrics from its logs. There is an official exporter from Nginx Inc. For the standard version of Nginx, you can get information about active connections and the number of processed requests, but we need more information, at least HTTP response codes and request processing time. The official exporter can collect this data from the Nginx API, but it is only available in the Nginx Plus version.

For free Nginx, there is an alternative exporter — martin-helmich/prometheus-nginxlog-exporter. To configure it, you need to specify the path and log format it will parse. You also need to set the bucket boundaries for histograms to see how much time was spent processing a certain fraction of requests. In the default Nginx log format, there is no information about response time, so we’ll change that too.

Nginx Configuration


# Log format
log_format timed_combined
    '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';

server {
    # Port that Nginx will listen on
    listen 80;

    # Forwarding requests to the API
    location / {
        # Log file
        access_log /var/log/nginx/app.log timed_combined;
        proxy_pass http://app:80/;
        . . .
    }

    # Forwarding requests to /metrics to the exporter
    location /metrics {
        # Logging is turned off to avoid affecting the metrics
        access_log off;

        proxy_pass http://nginx-exporter:4040/metrics;
        . . .
    }
}

Nginx-exporter Configuration


listen {
    # Port that the exporter will listen on
    port = 4040
    # Endpoint for fetching metrics
    metrics_endpoint = "/metrics"
}

namespace "nginx" {
    # Nginx log format for parsing
    format = "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\" rt=$request_time uct=\"$upstream_connect_time\" uht=\"$upstream_header_time\" urt=\"$upstream_response_time\""
    # Histogram bucket boundaries
    histogram_buckets = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
    # Log source
    source {
        files = ["/var/log/nginx/app.log"]
    }
}

Here is an example application with Nginx and the exporter.

Server Metrics

It’s also important to collect metrics from the servers where applications are running — monitor CPU load, memory consumption, and disk usage. For collecting these metrics, Prometheus has an official exporter — prometheus/node_exporter. It runs as a daemon and periodically collects metrics from system files and interfaces.

Here’s the link to the project with the exporter.


Dashboards in Grafana

A dashboard in Grafana is a tool for real-time visualization of metrics. A dashboard consists of panels, each displaying a specific set of data, for example as a graph or a table. You can use pre-built dashboards or create your own from scratch in the web interface. To create a panel, you write a data query in PromQL and configure how it is displayed.

In Grafana, we’ve created three dashboards: a main one that gathers all applications and servers, a dashboard for a specific application, and a dashboard for a specific server.

Main Dashboard

The main dashboard contains tables for servers and applications with key metrics: status (up or down), CPU usage, memory, and disk usage for the server and RPS, status, response codes, and response time for the application. It allows a quick assessment of the state of all our projects. The table consists of individual panels for servers and applications, divided into production and test ones. You can navigate to a specific server/application dashboard by clicking on its name or one of the metrics.

Server Table

PromQL queries used in the server table


Status

up{instance_type="node", instance_env="prod"}

CPU Used

100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle', instance_env="prod"}[1m])))

Memory Available

avg by(instance)(node_memory_MemAvailable_bytes{instance_env="prod"})

Memory Total

avg by(instance)(node_memory_MemTotal_bytes{instance_env="prod"})

Disk Available

avg by(instance)(node_filesystem_avail_bytes{mountpoint='/', instance_env="prod"})

Disk Total

avg by(instance)(node_filesystem_size_bytes{mountpoint='/', instance_env="prod"})

Application Table

PromQL queries used in the application table




RPS (average over 24 hours)

sum by(instance)(rate(nginx_http_response_count_total{status=~'...'}[24h]))

Requests in the last 24 hours

floor(sum by(instance)(increase(nginx_http_response_count_total{status=~'...'}[24h])))

Successful (2xx) requests in the last 24 hours

floor(sum by(instance)(increase(nginx_http_response_count_total{status=~'2..'}[24h])))

Response time, 95th percentile (ms)

histogram_quantile(0.95, sum by (instance,le) (rate(nginx_http_upstream_time_seconds_hist_bucket{status='200'}[10m]))) * 1000

Application Dashboard

On the application page, you can see the status, RPS, the number of requests and their response codes, and the response time.

PromQL queries used to build panels

Status Panel


RPS Panel


• The metric nginx_http_response_count_total{instance=~'$instance:.+',status='200'}
contains information about the number of processed HTTP requests.

• The average per-second rate of growth of this metric is computed over 1-minute windows (rate(...)[1m])

• The sum is calculated for all HTTP methods and response statuses sum(...)

Requests Panel for the Last 24 Hours


• The metric nginx_http_response_count_total{instance=~'$instance:.+',status='200'}
contains information about the number of processed HTTP requests.

• The increase in the number of processed HTTP requests over 24 hours is calculated (increase(...)[24h])

• The sum is calculated over all HTTP methods and response statuses (sum(...)), and the result is then rounded down to an integer (floor(...))

Requests Panel by Response Status for the Last 24 Hours


Similar to the previous panel, but data is requested for each type of response.

Response Time Panel at Different Percentiles

histogram_quantile(0.95, sum by (le) (rate(nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'}[10m]))) * 1000

Requests per Second Chart Panel


Response Status Chart Panel


4xx Requests Chart Panel

floor(sum by (status) (increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'4..'}[1m])))

5xx Requests Chart Panel

floor(sum by (status) (increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'5..'}[1m])))

Response Time Chart Panel at Different Percentiles

histogram_quantile(0.99, sum by (le) (rate(nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'}[10m]))) * 1000

• The metric
nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'} contains information about the distribution of time taken by the application to process requests with successful responses, across histogram buckets.

• The average growth rate is requested for 10-minute intervals for this metric (rate(...)[10m])

• Multiple curves are plotted for different percentiles (0.5, 0.75, 0.9, 0.95, 0.99) using histogram_quantile(n, sum by (le)(...))
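To make the percentile math concrete, here is a rough Python sketch of the linear interpolation histogram_quantile performs over cumulative "le" buckets; the bucket boundaries and counts below are invented for illustration:

```python
# Rough sketch of histogram_quantile's linear interpolation over
# cumulative buckets (the "le" label); the counts below are invented.
def histogram_quantile(q, bounds, cumulative):
    """bounds: upper bucket boundaries (seconds), the last one +Inf;
    cumulative: number of observations with value <= matching bound."""
    rank = q * cumulative[-1]            # which observation we want
    i = next(i for i, c in enumerate(cumulative) if c >= rank)
    if bounds[i] == float("inf"):
        return bounds[i - 1]             # fell into the open-ended bucket
    lower = bounds[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0
    # Assume observations are spread uniformly inside the bucket
    return lower + (bounds[i] - lower) * (rank - prev) / (cumulative[i] - prev)

# 50 requests finished within 0.1 s, 90 within 0.25 s, 100 in total
bounds = [0.1, 0.25, float("inf")]
counts = [50, 90, 100]
p75 = histogram_quantile(0.75, bounds, counts)  # 0.19375 s
```

This is why the bucket boundaries chosen in the exporter config matter: the quantile is interpolated inside a bucket, so coarse buckets give coarse percentiles.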

Server Dashboard

On the server page, there are graphs showing CPU load, memory consumption, and disk usage of the server.

PromQL queries used to build panels

Overview Panel

CPU Used

100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))

• The metric node_cpu_seconds_total{mode='idle',instance=~'$instance.*'} contains information about the cumulative time the processor has spent in the idle state.

• The per-second instantaneous rate of increase is calculated over 1-minute windows for this metric (irate(...)[1m])

• From this value, the percentage of time the processor spends in any state other than idle is derived (100 * (1 - …))

Memory Available

(node_memory_MemAvailable_bytes{instance=~'$instance.*'}) / (node_memory_MemTotal_bytes{instance=~'$instance.*'}) * 100

Percentage ratio of available memory to total memory.

Disk Available

(node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}) / (node_filesystem_size_bytes{instance=~'$instance.*', mountpoint='/'}) * 100

Percentage ratio of available disk space to total disk space.

Memory Panel

Free Memory Graph

node_memory_MemAvailable_bytes{instance=~'$instance.*'}

Used Memory Graph

node_memory_MemTotal_bytes{instance=~'$instance.*'} - node_memory_MemAvailable_bytes{instance=~'$instance.*'}

The panel is configured to display the graphs one above the other.

Disk Space Panel

Free Disk Space Graph

node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}

Used Disk Space Graph

node_filesystem_size_bytes{instance=~'$instance.*', mountpoint='/'} - 
node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}

The panel is configured similarly to the memory panel.

CPU Panel

100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))

The PromQL query is similar to the CPU Used query in the overview panel, but the data is visualized as a graph.

CPU Cores Panel

100 * (1 - irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m]))

The PromQL query is similar to the CPU Used query in the overview panel, but there is no aggregation by instance. This provides data on the load for each CPU core.

CPU Cores Usage Panel

100 * (1 - irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m]))

The PromQL query is similar to the previous one, but the data is visualized as a graph.

The data we collect from all servers and applications is the same, so we use the same dashboards to display them, only changing the Instance variable, whose values are pulled from Prometheus.

As Grafana’s local database for storing configuration (dashboards, users, etc.), we use sqlite3. It is convenient, however, to also keep dashboards as JSON files in the project repository, so that Grafana can be deployed to a new server. You can export them on the dashboard page and import them through the Grafana menu.

Link to the project with Prometheus+Grafana — here.


Alerts

Without notifications about critical situations, you can only learn about problems by staring at the dashboards all day, which is an extremely ineffective way to detect them and quite a boring pastime. To learn about problems the moment they occur, alerting comes to the rescue.

Alert rules are configured in Prometheus. A rule is a PromQL expression that must become true to activate an alert. Prometheus evaluates these expressions in a loop with a configured frequency.

We have notifications set up for basic critical situations:

  • The application/server is down. The up metric has been 0 for 5 minutes.
  • The server is low on free memory/disk. The ratio of used space to total is greater than 95 percent.
  • High CPU consumption on the server. The processor spends more than 80 percent of its time in states other than idle.
  • The application is unavailable. The number of 502 Bad Gateway responses from the proxy grows over 3 minutes.
  • The application returns 5xx response codes. At least one 5xx response occurred in the last minute.
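The "for 5 minutes" behavior in the first rule comes from the for: clause in the rule definition. A simplified Python sketch of that state machine; the class and state-handling code are our illustration, not Prometheus internals:

```python
# Simplified model of Prometheus's "for:" clause: an alert fires only
# after its expression has been true continuously for the whole duration.
class AlertRule:
    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.active_since = None  # moment the expression became true

    def evaluate(self, expr_true: bool, now: float) -> str:
        if not expr_true:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"           # would be sent to Alertmanager
        return "pending"              # true, but not yet long enough

rule = AlertRule(for_seconds=300)     # corresponds to "for: 5m"
```

A single scrape that returns up = 1 resets the timer, so short blips do not trigger the alert.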

Active alerts (those whose condition became true) are sent to Alertmanager, which routes them to communication channels such as email or messengers; many integrations are supported out of the box. You can also forward alerts to webhooks, which helps integrate channels without ready-made support, for example, if you want to post about high server load in your community’s public channel.

Prometheus Alerts Configuration


groups:
  - name: default
    rules:
      # Alert name
      - alert: NodeLowMemory
        # PromQL expression for evaluation
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes <= 0.05
        # Alert is considered active if the expr expression is true for this duration
        for: 5m
        # Additional data for the alert
        labels:
          severity: critical
        annotations:
          # Textual description
          summary: "Node {{ $labels.instance }} is low on RAM for more than 5 minutes"

The configuration of alerts and their dispatch in the Prometheus main configuration


. . .

# List of files with alert descriptions
rule_files:
  - alerts.yml

# Options related to Alertmanager: protocol, URL prefix, address
alerting:
  alertmanagers:
    - scheme: http
      path_prefix: /alertmanager
      static_configs:
        - targets: [ "alertmanager:9093" ]

For notification purposes, we utilize Telegram. Although Alertmanager has its own configuration specifically for Telegram, for more flexible message formatting, we have a webhook application in place. Alertmanager sends POST requests with data about active alerts in JSON format via URLs like /alert/{id} (id being the Telegram channel identifier) to the webhook receiver. The webhook receiver formats the data and sends messages to Telegram.
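A trimmed sketch of what such a webhook receiver might look like; the payload fields follow Alertmanager's webhook JSON format, but the message formatting and the stubbed send function are simplified illustrations, not our production code:

```python
# Trimmed sketch of a webhook receiver: parse Alertmanager's webhook
# payload and format a Telegram message. Sending is stubbed out; a real
# receiver would call the Telegram Bot API's sendMessage method.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_message(payload: dict) -> str:
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "unknown").upper()
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {summary}".rstrip())
    return "\n".join(lines)

def send_to_telegram(channel_id: str, text: str) -> None:
    print(f"-> {channel_id}: {text}")  # stub: replace with a Bot API call

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Alertmanager posts to URLs like /alert/<tg-channel-id>
        channel_id = self.path.rsplit("/", 1)[-1]
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        send_to_telegram(channel_id, format_message(payload))
        self.send_response(200)
        self.end_headers()

# HTTPServer(("", 3000), AlertHandler).serve_forever() would start it
```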

For each project, we create separate Telegram channels. Additionally, there is a channel for infrastructure servers and applications. Those responsible for the project, including developers, DevOps engineers, and managers, subscribe to the respective channel to receive notifications about issues and respond promptly.

Alertmanager Configuration


# Parameters that will be inherited by all nodes in the routing tree
route:
  group_wait: 0s
  group_interval: 1s
  repeat_interval: 4h
  group_by: [...]

  # Alert routing nodes
  routes:
    # Parameters for selecting notifications
    - match:
        instance_group: example_group
      # Name of the receiver to which the notification will be sent
      receiver: example-receiver

# List of notification receivers
receivers:
  # Name of the receiver
  - name: example-receiver
    webhook_configs:
      # Notify about resolved alerts as well
      - send_resolved: True
        # Where the POST request with the notification will be sent
        url: http://webhook-receiver:3000/alert/<tg-channel-id>

In this article, I’ve explained how, at Doubletapp, we collect metrics using Prometheus, visualize them in Grafana, and send notifications through Alertmanager to Telegram.

I’ll be happy to answer your questions and discuss everything in the comments.

Links to repositories with projects — Prometheus+Grafana, exporter for servers, application template with Nginx exporter.