Overview of a Monitoring System with Prometheus and Grafana
Hello, my name is Seva, and I work as a backend developer at Doubletapp; I also handle some DevOps tasks. In this article, I’ll share how we monitor our backend applications: collecting metrics, visualizing them, and sending notifications. I’ll provide example configurations with detailed comments and links to GitHub.
At Doubletapp, we do outsourced development. We often take on new projects, and some of the older ones move into support mode. As a result, we have many services that are no longer under active development and are deployed on clients’ servers. To avoid learning about issues from users days later, we need an overview system that collects metrics from all projects and answers questions such as:
- Is the server/service alive?
- How is the Requests Per Second (RPS) changing?
- What is the response time?
- Is there enough memory and CPU?
- How much disk space is left?
For this purpose, we chose a popular solution: collecting metrics using Prometheus, displaying them on dashboards in Grafana, and sending alerts to Telegram through Prometheus Alert Manager.
Prometheus
Applications are usually deployed on cloud services, and each client has their own infrastructure. Therefore, we cannot use service discovery, which allows Prometheus to automatically detect new objects such as Docker containers on a host or pods in a Kubernetes cluster.
Instead, we use static configuration. Editing the config has to be done manually, but it allows us to configure each source: specify the interval for collecting metrics, the endpoint, authentication data, and more.
Prometheus Configuration
# The global section specifies parameters for all metric collection configurations
global:
# Default interval for scraping metrics
scrape_interval: 10s
# Interval for evaluating rules for alerts (more details in alerts.yml)
evaluation_interval: 10s
# List of metric collection configurations
scrape_configs:
# Name of the process collecting metrics
- job_name: example-service
static_configs:
# List of domains/IPs from which metrics will be collected
- targets:
- example.url.com
# Additional data to be added to the record
labels:
# Project name
instance_group: example_project
# Application environment
instance_env: test
# Type of metric source
instance_type: service
# Team responsible for the project
team: backend
. . .
To add a new service, you need to modify the config and restart Prometheus. You can enable configuration reloading at runtime, but we simply restart Prometheus during the rollout of a new config via CI/CD.
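If you do want runtime reloading, a minimal sketch (assuming Prometheus is started with the lifecycle endpoint enabled and, for the signal option, runs in a Docker container named prometheus):
# Start Prometheus with the HTTP lifecycle endpoint enabled
prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
# Ask the running instance to re-read its configuration
curl -X POST http://localhost:9090/-/reload
# Alternatively, send SIGHUP to the process, e.g. inside a Docker container
docker kill --signal=HUP prometheus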
Collecting Metrics
Unlike Graphite or InfluxDB, Prometheus itself pulls telemetry from services. For this to work, a service must expose an endpoint that reports its state in a format Prometheus understands: either its own text exposition format or OpenMetrics.
This is usually done with an exporter, a daemon that reads the system’s state or logs, turns it into metrics, and exposes them over HTTP(S). Ready-made exporters already exist for most popular services. In simple cases, you can implement /metrics in the application itself, for example if you only want to check that the application is running and ready to handle requests.
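A minimal sketch of such a built-in endpoint, assuming a Python service and the official prometheus_client library (the Flask app and metric name are illustrative, not our production code):
# A minimal built-in /metrics endpoint (illustrative sketch)
from flask import Flask, Response
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Counts handled requests; incremented wherever it makes sense in the app
REQUESTS_TOTAL = Counter("app_requests_total", "Total handled requests")

@app.route("/ping")
def ping():
    REQUESTS_TOTAL.inc()
    return "pong"

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint; generate_latest() renders all
    # registered metrics in the Prometheus text exposition format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)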
Exporters usually work over HTTP without authentication by default, allowing anyone to access them. It’s essential to set up TLS and authorization. This can be done either through the exporter’s settings or with a proxy like Nginx.
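For example, a hypothetical setup that puts an exporter behind Nginx with basic auth, together with the matching scrape settings on the Prometheus side (hostnames, ports, and credentials are placeholders):
# Nginx: terminate TLS here and require credentials for the exporter
location /metrics {
    auth_basic           "metrics";
    auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd
    proxy_pass           http://node-exporter:9100/metrics;
}
And on the Prometheus side:
# Scrape config with matching credentials
scrape_configs:
  - job_name: example-node
    scheme: https
    basic_auth:
      username: prometheus
      password: <secret>
    static_configs:
      - targets:
          - example.url.com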
Application Metrics
For all our projects, we use Nginx to proxy requests to the API, so we can build basic metrics from its logs. There is an official exporter from Nginx Inc. For the open-source version of Nginx, it only exposes active connections and the number of processed requests, but we need more, at least HTTP response codes and request processing time. The official exporter can collect this data from the Nginx API, but that API is only available in Nginx Plus.
For free Nginx, there is an alternative exporter — martin-helmich/prometheus-nginxlog-exporter. To configure it, you need to specify the path and log format it will parse. You also need to set the bucket boundaries for histograms to see how much time was spent processing a certain fraction of requests. In the default Nginx log format, there is no information about response time, so we’ll change that too.
Nginx Configuration
# Log format
log_format timed_combined
'$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';
server {
# Port that Nginx will listen on
listen 80;
# Forwarding requests to the API
location / {
# Log file
access_log /var/log/nginx/app.log timed_combined;
proxy_pass http://app:80/;
. . .
}
# Forwarding requests to /metrics to the exporter
location /metrics {
# Logging is turned off to avoid affecting metrics
access_log off;
proxy_pass http://nginx-exporter:4040/metrics;
. . .
}
}
Nginx-exporter Configuration
listen {
# Port that the exporter will listen on
port = 4040
# Endpoint for fetching metrics
metrics_endpoint = "/metrics"
}
namespace "nginx" {
# Nginx log format for parsing
format = "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\" rt=$request_time uct=\"$upstream_connect_time\" uht=\"$upstream_header_time\" urt=\"$upstream_response_time\""
# Histogram bucket boundaries
histogram_buckets = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
# Log source
source {
files = ["/var/log/nginx/app.log"]
}
}
Here is an example application with Nginx and the exporter.
Server Metrics
It’s also important to collect metrics from the servers where applications are running — monitor CPU load, memory consumption, and disk usage. For collecting these metrics, Prometheus has an official exporter — prometheus/node_exporter. It runs as a daemon and periodically collects metrics from system files and interfaces.
Here’s the link to the project with the exporter.
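A minimal sketch of running node_exporter with Docker Compose (the mounts and flags shown here are the commonly used ones; the exact setup in our repository may differ):
# docker-compose.yml, an illustrative node_exporter setup
services:
  node-exporter:
    image: prom/node-exporter:latest
    # Mount host file systems read-only so the exporter can read system stats
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
    ports:
      - "9100:9100"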
Dashboards
A dashboard in Grafana is a tool for real-time visualization of metrics. It consists of panels, each displaying a specific set of data, for example as a graph or a table. You can use pre-built dashboards or create your own from scratch in the web interface. To create a panel, you write a data query in PromQL and configure how it is displayed.
In Grafana, we’ve created three dashboards: a main one that gathers all applications and servers, a dashboard for a specific application, and a dashboard for a specific server.
Main Dashboard
The main dashboard contains tables for servers and applications with key metrics: status (up or down), CPU usage, memory, and disk usage for the server and RPS, status, response codes, and response time for the application. It allows a quick assessment of the state of all our projects. The table consists of individual panels for servers and applications, divided into production and test ones. You can navigate to a specific server/application dashboard by clicking on its name or one of the metrics.
Server Table
PromQL queries used in the server table
State
up{instance_type="node", instance_env="prod"}
CPU Used
100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle', instance_env="prod"}[1m])))
Memory Available
avg by(instance)(node_memory_MemAvailable_bytes{instance_env="prod"})
Memory Total
avg by(instance)(node_memory_MemTotal_bytes{instance_env="prod"})
Disk Available
avg by(instance)(node_filesystem_avail_bytes{mountpoint='/', instance_env="prod"})
Disk Total
avg by(instance)(node_filesystem_size_bytes{mountpoint='/', instance_env="prod"})
Application Table
PromQL queries used in the application table
State
up{instance_type="service"}
RPS
sum by(instance)(rate(nginx_http_response_count_total{status=~'...'}[24h]))
RP24H
floor(sum by(instance)(increase(nginx_http_response_count_total{status=~'...'}[24h])))
2xx
floor(sum by(instance)(increase(nginx_http_response_count_total{status=~'2..'}[24h])))
95th
histogram_quantile(0.95, sum by (instance,le) (rate(nginx_http_upstream_time_seconds_hist_bucket{status='200'}[10m]))) * 1000
Application Dashboard
On the application dashboard, you can see the status, RPS, the number of requests with their response codes, and the response time.
PromQL queries used to build panels
Status Panel
up{instance=~'$instance:.+'}
RPS Panel
sum(rate(nginx_http_response_count_total{instance=~'$instance:.+'}[1m]))
• The metric nginx_http_response_count_total{instance=~'$instance:.+'} contains information about the number of processed HTTP requests.
• The average per-second growth rate of this metric is computed over 1-minute intervals (rate(...[1m]))
• The sum is calculated over all HTTP methods and response statuses (sum(...))
Requests Panel for the Last 24 Hours
floor(sum(increase(nginx_http_response_count_total{instance=~'$instance:.+'}[24h])))
• The metric nginx_http_response_count_total{instance=~'$instance:.+'} contains information about the number of processed HTTP requests.
• The increase in the number of processed HTTP requests over 24 hours is calculated (increase(...[24h]))
• The sum is calculated over all HTTP methods and response statuses (sum(...)) and then rounded down to the nearest integer (floor(...))
Requests Panel by Response Status for the Last 24 Hours
floor(sum(increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'2..'}[24h])))
Similar to the previous panel, but data is requested for each type of response.
Response Time Panel at Different Percentiles
histogram_quantile(0.95, sum by (le) (rate(nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'}[10m]))) * 1000
Requests per Second Chart Panel
sum(rate(nginx_http_response_count_total{instance=~'$instance:.+',status=~'...'}[1m]))
Response Status Chart Panel
floor(sum(increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'2..'}[1m])))
4xx Requests Chart Panel
floor(sum by (status) (increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'4..'}[1m])))
5xx Requests Chart Panel
floor(sum by (status) (increase(nginx_http_response_count_total{instance=~'$instance:.+',status=~'5..'}[1m])))
Response Time Chart Panel at Different Percentiles
histogram_quantile(0.99, sum by (le) (rate(nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'}[10m]))) * 1000
• The metric nginx_http_upstream_time_seconds_hist_bucket{instance=~'$instance:.+',status='200'} contains the distribution, across histogram buckets, of the time the application spent processing requests with successful responses.
• The average per-second growth rate of this metric is computed over 10-minute intervals (rate(...[10m]))
• Several series are plotted for different percentiles (0.5, 0.75, 0.9, 0.95, 0.99) using histogram_quantile(n, sum by (le)(...))
Server Dashboard
On the server page, there are graphs showing CPU load, memory consumption, and disk usage of the server.
PromQL queries used to build panels
Overview Panel
CPU Used
100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))
• The metric node_cpu_seconds_total{mode='idle',instance=~'$instance.*'} contains information about the time the processor spends in the idle state.
• The per-second instantaneous rate of increase of this time series is computed over 1-minute intervals (irate(...[1m]))
• From this value, the percentage of time the processor spends in any state other than idle is calculated (100 * (1 - ...))
Memory Available
(node_memory_MemAvailable_bytes{instance=~'$instance.*'}) / (node_memory_MemTotal_bytes{instance=~'$instance.*'}) * 100
Percentage ratio of available memory to total memory.
Disk Available
(node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}) / (node_filesystem_size_bytes{instance=~'$instance.*', mountpoint='/'}) * 100
Percentage ratio of available disk space to total disk space.
Memory Panel
Free Memory Graph
node_memory_MemAvailable_bytes{instance=~'$instance.*'}
Used Memory Graph
node_memory_MemTotal_bytes{instance=~'$instance.*'} -
node_memory_MemAvailable_bytes{instance=~'$instance.*'}
The panel is configured to display the graphs one above the other.
Disk Space Panel
Free Disk Space Graph
node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}
Used Disk Space Graph
node_filesystem_size_bytes{instance=~'$instance.*', mountpoint='/'} -
node_filesystem_avail_bytes{instance=~'$instance.*', mountpoint='/'}
The panel is configured similarly to the memory panel.
CPU Panel
100 * (1 - avg by(instance)
(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))
The PromQL query is similar to the CPU Used query in the overview panel, but the data is visualized as a graph.
CPU Cores Panel
100 * (1 -
(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))
The PromQL query is similar to the CPU Used query in the overview panel, but there is no aggregation by instance. This provides data on the load for each CPU core.
CPU Cores Usage Panel
100 * (1 -
(irate(node_cpu_seconds_total{mode='idle',instance=~'$instance.*'}[1m])))
The PromQL query is similar to the previous one, but the data is visualized as a graph.
The data we collect from all servers and applications is the same, so we use the same dashboards to display them, only changing the Instance variable, whose values are pulled from Prometheus.
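Such a variable can be defined as a Grafana query variable backed by the Prometheus data source, for example (the metric and label filter here are illustrative and depend on the dashboard):
label_values(up{instance_type='node'}, instance)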
Grafana stores its own configuration (dashboards, users, etc.) in a local SQLite database. However, it is convenient to keep dashboards as JSON files in the project repository so that Grafana can be deployed on a new server. You can export a dashboard from its page and import it through the Grafana menu.
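If you prefer Grafana to load those JSON files automatically instead of importing them by hand, dashboard provisioning is an option; a minimal sketch (file locations are assumptions):
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  # Load every dashboard JSON found in the given directory at startup
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards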
Link to the project with Prometheus+Grafana — here.
Alerts
Without notifications about critical situations, you only find out about problems if you stare at the dashboards all day, which is an extremely ineffective way to detect them and a rather boring pastime. To learn about problems the moment they occur, alerting comes to the rescue.
Alert rules are configured in Prometheus. A rule is a PromQL expression that must become true to activate an alert. Prometheus evaluates these expressions in a loop with a configured frequency.
We have notifications set up for basic critical situations (a sketch of two such rules follows the list):
- The application/server is down: the up metric has stayed at 0 for 5 minutes.
- The server is low on free memory/disk: the ratio of used space to total exceeds 95 percent.
- High CPU consumption on the server: the processor spends more than 80 percent of its time in states other than idle.
- The application is unavailable: the number of 502 Bad Gateway responses from the proxy keeps growing over 3 minutes.
- The application returns 5xx response codes: at least one 5xx response occurred in the last minute.
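A sketch of the first and last of these rules in Prometheus syntax (alert names, thresholds, and labels are illustrative; our actual alerts.yml differs in details):
groups:
  - name: example-sketch
    rules:
      # The application/server is down: up has stayed at 0 for 5 minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has been down for more than 5 minutes"
      # The application returned 5xx responses within the last minute
      - alert: Http5xxResponses
        expr: sum by(instance) (increase(nginx_http_response_count_total{status=~"5.."}[1m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} returned 5xx responses in the last minute"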
Active alerts (those whose condition became true) are sent to Alertmanager, which routes them to communication channels such as email or messengers; many integrations are supported out of the box. You can also forward alerts to webhooks, which helps to integrate channels for which there is no ready-made integration, for example if you want to post notifications about high server load to a public community channel.
Prometheus Alerts Configuration
groups:
- name: default
rules:
# Alert name
- alert: NodeLowMemory
# PromQL expression for evaluation
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes <= 0.05
# Alert is considered active if the expr expression is true for this duration
for: 5m
# Additional data for the alert
labels:
severity: critical
annotations:
# Textual description
summary: "Node {{ $labels.instance }} is low on RAM for more than 5 minutes"
Alert rule files and Alertmanager settings in the main Prometheus configuration
. . .
# List of files with alert descriptions
rule_files:
- alerts.yml
# Options related to Alertmanager. Protocol, URL prefix, address.
alerting:
alertmanagers:
- scheme: http
path_prefix: /alertmanager
static_configs:
- targets: [ "alertmanager:9093" ]
We use Telegram for notifications. Alertmanager has a built-in Telegram integration, but for more flexible message formatting we run our own webhook application. Alertmanager sends POST requests with active-alert data in JSON to the webhook receiver at URLs like /alert/{id} (where id is the Telegram channel identifier); the receiver formats the data and sends messages to Telegram.
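A minimal sketch of such a webhook receiver in Python (the framework, bot-token handling, and message format are illustrative assumptions; our own receiver is more elaborate):
# A minimal Alertmanager-to-Telegram webhook receiver (illustrative sketch)
import os
import requests
from flask import Flask, request

app = Flask(__name__)
BOT_TOKEN = os.environ["TG_BOT_TOKEN"]  # Telegram bot token, assumed to come from the environment

@app.post("/alert/<chat_id>")
def alert(chat_id: str):
    payload = request.get_json()
    # Alertmanager sends a JSON body with an "alerts" list;
    # each alert carries its status, labels, and annotations
    lines = []
    for a in payload.get("alerts", []):
        status = a.get("status", "unknown").upper()
        summary = a.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {summary}")
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": chat_id, "text": "\n".join(lines)},
        timeout=10,
    )
    return "", 204

if __name__ == "__main__":
    # Port 3000 matches the target URL in the Alertmanager config below
    app.run(host="0.0.0.0", port=3000)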
For each project, we create separate Telegram channels. Additionally, there is a channel for infrastructure servers and applications. Those responsible for the project, including developers, DevOps engineers, and managers, subscribe to the respective channel to receive notifications about issues and respond promptly.
Alertmanager Configuration
alertmanager/alertmanager.yml:
# Parameters inherited by all routing nodes defined under routes
route:
group_wait: 0s
group_interval: 1s
repeat_interval: 4h
group_by: [...]
# Alert routing nodes
routes:
# Parameters for selecting notifications
- match:
instance_group: example_group
# Name of the receiver to which the notification will be sent
receiver: example-receiver
# List of notification receivers
receivers:
# Name of the receiver
- name: example-receiver
webhook_configs:
# Also send a notification when the alert is resolved
- send_resolved: True
# Where the POST request with the notification will be sent
url: http://webhook-receiver:3000/alert/<tg-channel-id>
In this article, I’ve explained how, at Doubletapp, we collect metrics using Prometheus, visualize them in Grafana, and send notifications through Alertmanager to Telegram.
I’ll be happy to answer your questions and discuss everything in the comments.
Links to repositories with projects — Prometheus+Grafana, exporter for servers, application template with Nginx exporter.