---
title: Prometheus on Swarm
description: Instructions for populating Prometheus with Storidge metrics
lang: en-US
---
Stats from a Storidge cluster are easily integrated into Prometheus or similar applications, such as the Telegraf agent shipped with InfluxDB.
Storidge provides a containerized exporter (storidge/cio-prom) that exposes stats on port 16990 at the /metrics endpoint. The exporter aggregates stats from all nodes in the Storidge cluster and automatically discovers new nodes as they join. Your monitoring application can poll http://<IP_ADDRESS>:16990/metrics to scrape the metrics.
::: tip Releases of the containerized exporter (storidge/cio-prom) prior to September 17, 2020 use port 16995 instead of 16990. See the Prometheus Exporter table below. :::
Prometheus Exporter | Port | Storidge CIO compatibility |
---|---|---|
storidge/cio-prom:0.3 | 16990 | v2.0.0-3336 and above |
storidge/cio-prom:0.2 | 16995 | v1.0.0-3249 and below |
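To pick the matching exporter tag, first check which CIO release your cluster is running. The command below assumes the cio CLI provides a version subcommand; the exact output may vary by release:

```
# Show the Storidge CIO software version on any cluster node
# (assumes the cio CLI's version subcommand; output format varies by release)
cio version
```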
Prometheus is the standard open-source monitoring solution for many clusters. As it does not come with a feature-rich dashboard, it is often paired with Grafana; Prometheus gathers time-series data, and Grafana visualizes it.
This guide assumes basic familiarity with Prometheus. Follow the link to install Prometheus.
Start the exporter as a service on a Storidge cluster.
```
docker service create \
    --name cio_prom \
    --publish 16990:16990 \
    storidge/cio-prom:0.3
```
The exporter automatically gathers data from all nodes in the cluster, including data from newly added nodes.
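To confirm the exporter is up before wiring it into Prometheus, check the service task and pull a sample of metrics from any node. The 192.168.3.65 address below is the example IP used later in this guide; substitute a node in your own cluster:

```
# Confirm the exporter task is running on the Swarm
docker service ps cio_prom

# Pull a sample of cluster-level metrics from any node in the Storidge cluster
curl -s http://192.168.3.65:16990/metrics | grep '^cio_cluster_' | head
```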
If running the Prometheus monitor on an external server, add the exporter as a target to the Prometheus configuration file (prometheus.yml). In the static_configs section below, we are pointing the Prometheus monitor to 192.168.3.65 port 16990 on the Storidge cluster. Any node IP address in the Storidge cluster can be used to pull the metrics.
```
# my global config
global:
  scrape_interval: 10s
  evaluation_interval: 10s

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "prometheus.rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['192.168.3.65:16990']
```
If the Prometheus monitor runs on the Storidge cluster itself, edit the Prometheus configuration file to use a localhost target on port 16990, e.g.:
```
    static_configs:
    - targets: ['localhost:16990']
```
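Before restarting Prometheus, you may want to validate the edited file. The check below assumes the promtool utility that ships with Prometheus is available on the server, and uses the same file path as the docker run command that follows:

```
# Validate the edited Prometheus configuration (promtool ships with Prometheus)
promtool check config /home/prometheus/prometheus.yml
```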
Start the Prometheus monitor, e.g.:
```
docker run --rm -d -p 9090:9090 \
    -v /home/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    --net=host \
    --name prom-monitor \
    prom/prometheus
```
Replace the host network setting with your overlay network name as needed.
Alternatively, start Prometheus directly from the binary to watch the exporter (e.g. ./prometheus --config.file=prometheus.yml). Verify Prometheus is serving metrics by navigating to the IP address of your server at port 9090.
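If you prefer the command line, the same verification can be done with curl against the Prometheus HTTP API, assuming the server is reachable on localhost:

```
# Health and readiness endpoints of the Prometheus server
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready

# Check that the scrape targets report as healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```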
::: tip If you do not see metrics being served, verify the IP address and port number. Also confirm the time settings on the Storidge cluster and the Prometheus server are in sync. :::
To visualize the metrics with Grafana, follow the steps in Grafana with Storidge.
The following cluster stats are available on each of the nodes.
Exported Cluster Data | Description |
---|---|
cio_cluster_nodes_online | Number of nodes that are healthy |
cio_cluster_nodes_maintenance | Number of nodes that are in maintenance mode |
cio_cluster_nodes_cordoned | Number of nodes that are cordoned |
cio_cluster_drives_online | Number of drives currently in use by CIO |
cio_cluster_drives_available | Number of drives that can be used by CIO |
cio_cluster_drives_failed | Number of drives flagged as faulty that should be replaced |
cio_cluster_capacity_total | Total capacity currently available in CIO cluster |
cio_cluster_capacity_used | Total capacity currently in use |
cio_cluster_capacity_free | Total capacity that is available for use |
cio_cluster_capacity_provisioned | Total capacity that is allocated for use by CIO volumes |
cio_cluster_iops_total | Total IOPS currently available in CIO cluster |
cio_cluster_iops_used | Total IOPS currently in use |
cio_cluster_iops_free | Total IOPS that is available for use |
cio_cluster_iops_provisioned | Total IOPS that is currently reserved for use by CIO volumes |
cio_cluster_bw_total | Total bandwidth currently available in CIO cluster |
cio_cluster_bw_used | Total bandwidth currently in use |
cio_cluster_bw_free | Total bandwidth that is available for use |
cio_cluster_bw_provisioned | Total bandwidth that is currently reserved for use by CIO volumes |
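Once these series are being scraped, they can be queried in the Prometheus expression browser or through its HTTP API. The example below is illustrative; it assumes the Prometheus server is on localhost and uses two of the gauges listed above:

```
# Query the used capacity gauge through the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=cio_cluster_capacity_used'

# Express free capacity as a fraction of total capacity with a PromQL expression
curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=cio_cluster_capacity_free / cio_cluster_capacity_total'
```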
The /metrics endpoint dynamically exports the following data about drives and volumes in the Storidge cluster. Metrics are removed once volumes are deleted. The data is derived from /proc/diskstats.
The sample data below applies to drives as well; however, they will be marked as drive and their name will be generated from the node ID and drive letter, e.g. cio_drive_5927e513sdb_reads_merged.
Exported Volume Data | Description |
---|---|
cio_volume_vd0_current_ios | Number of current IOs in progress |
cio_volume_vd0_reads_completed | Number of reads that have been performed on the volume |
cio_volume_vd0_reads_merged | Number of times that two or more similar reads have been merged for increased efficiency |
cio_volume_vd0_sectors_read | Number of sectors that have been read |
cio_volume_vd0_sectors_written | Number of sectors that have been written |
cio_volume_vd0_time_doing_ios | Time doing IOs, in ms |
cio_volume_vd0_time_reading | Time spent reading, in ms |
cio_volume_vd0_writes_completed | Number of write operations that have been completed on the volume |
cio_volume_vd0_time_writing | Time spent writing, in ms |
cio_volume_vd0_writes_merged | Number of times that two or more write requests have been merged for increased efficiency |
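Because volume and drive series are named dynamically, a quick way to see what is currently exported is to filter the raw exporter output (same example node IP as earlier):

```
# List the per-volume series currently exported
curl -s http://192.168.3.65:16990/metrics | grep '^cio_volume_'

# List the per-drive series, named from the node ID and drive letter
curl -s http://192.168.3.65:16990/metrics | grep '^cio_drive_'
```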
The API response data is also exported.
Exported API Data | Description |
---|---|
cio_api_calls | The total number of API calls |
cio_api_calls_ok | The total number of calls that returned 200 OK |
cio_api_calls_bad_request | The total number of calls that returned 400 BAD REQUEST |
cio_api_calls_not_found | The total number of calls that returned 404 NOT FOUND |
cio_api_calls_conflict | The total number of calls that returned 409 CONFLICT |
cio_api_calls_internal_server_error | The total number of calls that returned 500 INTERNAL SERVER ERROR |
cio_api_calls_errors_overall | The total number of calls that returned non-200 OK responses |
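These counters lend themselves to simple error-rate checks. The query below is a sketch that assumes the Prometheus server is on localhost and that cio_api_calls_errors_overall behaves as a monotonically increasing counter:

```
# API errors per second over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=rate(cio_api_calls_errors_overall[5m])'
```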
Environment variables can be passed to control which metrics are exported. The default settings are API_LEVEL=2, DRIVE_LEVEL=1, and SYSTEM_LEVEL=1.
The following environment variables and values are supported:
Environment Variable | Description |
---|---|
API_LEVEL=2 | All API stats are exported |
API_LEVEL=1 | Only Total OK and Total Errors are exported |
API_LEVEL=0 | No API stats exported |
DRIVE_LEVEL=1 | All drive stats exported |
DRIVE_LEVEL=0 | No drive stats exported |
SYSTEM_LEVEL=1 | All cluster stats from cio info are exported |
SYSTEM_LEVEL=0 | No cluster stats from cio info exported |
Example:
```
docker service create --name cio_prom --publish 16990:16990 \
    -e API_LEVEL=1 -e SYSTEM_LEVEL=0 -e DRIVE_LEVEL=1 \
    storidge/cio-prom:latest
```
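If the exporter service is already running, the levels can also be changed in place instead of recreating the service; a sketch using docker service update:

```
# Change export levels on the running exporter service without recreating it
docker service update \
    --env-add API_LEVEL=1 \
    --env-add DRIVE_LEVEL=0 \
    cio_prom
```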