
Datadog Metrics get multiplied by 10 for some unknown reason #10944

@jrimmer-housecallpro

Description


Relevant telegraf.conf

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply surround
# them with ${}. For strings the variable must be within quotes (ie, "${STR_VAR}"),
# for numbers and booleans they should be plain (ie, ${INT_VAR}, ${BOOL_VAR})

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "30s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 2500

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 25000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "30s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for DataDog API to send metrics to.
[[outputs.datadog]]
  ## Datadog API key
  apikey = "${DD_API_KEY}"

  ## Connection timeout.
  # timeout = "5s"

  ## Write URL override; useful for debugging.
  # url = "https://app.datadoghq.com/api/v1/series"

  ## Set http_proxy (telegraf uses the system wide proxy settings if it isn't set)
  # http_proxy_url = "http://localhost:8888"


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################


# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  # fielddrop = ["uptime_format"]


# Read metrics about ECS containers
[[inputs.ecs]]
  ## ECS metadata url.
  ## Metadata v2 API is used if set explicitly. Otherwise,
  ## v3 metadata endpoint API is used if available.
  # endpoint_url = ""

  ## Containers to include and exclude. Globs accepted.
  ## Note that an empty array for both will include all containers
  # container_name_include = []
  # container_name_exclude = []

  ## Container states to include and exclude. Globs accepted.
  ## When empty only containers in the "RUNNING" state will be captured.
  ## Possible values are "NONE", "PULLED", "CREATED", "RUNNING",
  ## "RESOURCES_PROVISIONED", "STOPPED".
  # container_status_include = []
  # container_status_exclude = []

  ## ecs labels to include and exclude as tags.  Globs accepted.
  ## Note that an empty array for both will include all labels as tags
  ecs_label_include = [ "com.amazonaws.ecs.*" ]
  ecs_label_exclude = []

  ## Timeout for queries.
  # timeout = "5s"

###############################################################################
#                            service input plugins                            #
###############################################################################

# Statsd UDP/TCP Server
[[inputs.statsd]]
  ## Protocol, must be "tcp", "udp", "udp4" or "udp6" (default=udp)
  protocol = "udp"

  ## MaxTCPConnection - applicable when protocol is set to tcp (default=250)
  max_tcp_connections = 250

  ## Enable TCP keep alive probes (default=false)
  tcp_keep_alive = false

  ## Specifies the keep-alive period for an active network connection.
  ## Only applies to TCP sockets and will be ignored if tcp_keep_alive is false.
  ## Defaults to the OS configuration.
  # tcp_keep_alive_period = "2h"

  ## Address and port to host UDP listener on
  service_address = ":${TELEGRAF_AGENT_PORT}"

  ## The following configuration options control when telegraf clears its cache
  ## of previous values. If set to false, then telegraf will only clear its
  ## cache when the daemon is restarted.
  ## Reset gauges every interval (default=true)
  delete_gauges = true
  ## Reset counters every interval (default=true)
  delete_counters = true
  ## Reset sets every interval (default=true)
  delete_sets = true
  ## Reset timings & histograms every interval (default=true)
  delete_timings = true

  ## Percentiles to calculate for timing & histogram stats
  percentiles = [50.0, 90.0, 99.0, 99.9, 99.95, 100.0]

  ## separator to use between elements of a statsd metric
  ## DO NOT CHANGE THIS
  metric_separator = "."

  ## Parses tags in the datadog statsd format
  ## http://docs.datadoghq.com/guides/dogstatsd/
  parse_data_dog_tags = true

  ## Parses datadog extensions to the statsd format
  datadog_extensions = true

  ## Parses distributions metric as specified in the datadog statsd format
  ## https://docs.datadoghq.com/developers/metrics/types/?tab=distribution#definition
  datadog_distributions = true

  ## Statsd data translation templates, more info can be read here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/TEMPLATE_PATTERN.md
  # templates = [
  #     "cpu.* measurement*"
  # ]

  ## Number of UDP messages allowed to queue up, once filled,
  ## the statsd server will start dropping packets
  allowed_pending_messages = 10000

  ## Number of timing/histogram values to track per-measurement in the
  ## calculation of percentiles. Raising this limit increases the accuracy
  ## of percentiles but also increases the memory usage and cpu time.
  percentile_limit = 1000

  ## Max duration (TTL) for each metric to stay cached/reported without being updated.
  # max_ttl = "1000h"

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

Logs from Telegraf

2022-04-06T16:49:25.901-07:00	2022-04-06T23:49:25Z I! Starting Telegraf 1.21.4

2022-04-06T16:49:25.901-07:00	2022-04-06T23:49:25Z I! Using config file: /etc/telegraf/telegraf.conf

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded inputs: cpu disk diskio ecs kernel mem processes statsd swap system

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded aggregators:

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded processors:

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded outputs: datadog sumologic

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Tags enabled: host=baf01b4f1d4a

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"baf01b4f1d4a", Flush Interval:30s

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z W! [inputs.statsd] 'parse_data_dog_tags' config option is deprecated, please use 'datadog_extensions' instead

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [inputs.statsd] UDP listening on "[::]:8126"

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [inputs.statsd] Started the statsd service on ":8126"

System info

Telegraf 1.21.4, Docker image 1.21-alpine, AWS ECS

Docker

Dockerfile:

FROM telegraf:1.21-alpine

COPY dist/telegraf.conf /etc/telegraf/telegraf.conf

RUN apk add --no-cache \
        curl \
        python3 \
        py3-pip \
    && pip3 install --upgrade pip \
    && pip3 install --no-cache-dir \
        awscli \
    && rm -rf /var/cache/apk/*

RUN aws --version

COPY custom_entrypoint.sh /custom_entrypoint.sh

COPY parse_tags.py /parse_tags.py

ENTRYPOINT ["/custom_entrypoint.sh"]

CMD ["telegraf"]


parse_tags.py:

#!/usr/bin/env python

import sys
import json

# Load the `aws ec2 describe-tags` JSON written by the entrypoint script.
file = sys.argv[1]
f = open(file)
data = json.load(f)

# Emit one shell `export EC2_TAG_<KEY>=<value>` line per tag, replacing
# colons (common in aws:* tag keys) since they are not valid in shell
# variable names.
for i in data['Tags']:
    if i['Key'] and i['Value']:
        output = "export EC2_TAG_" + i['Key'].upper() + '=' + i['Value']
        print(output.replace(':', '_'))

f.close()
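
For context, a minimal sketch of the input this script expects and the line it emits; the tag key/value and the sample ResourceId are assumptions, and the "Tags" structure mirrors the aws ec2 describe-tags output:

# Illustrative only: exercise parse_tags.py with a hand-built describe-tags payload.
import json
import subprocess
import tempfile

sample = {"Tags": [{"Key": "Env", "Value": "prod",
                    "ResourceId": "i-0123456789abcdef0", "ResourceType": "instance"}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)

result = subprocess.run(["python3", "/parse_tags.py", tmp.name],
                        capture_output=True, text=True)
print(result.stdout)  # expected: export EC2_TAG_ENV=prod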

custom_entrypoint.sh:

#!/bin/sh

set -e

output_file="/tmp/ec2_tags.json"

instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)

aws ec2 describe-tags --filters "Name=resource-id,Values=$instance_id" --region=us-west-2 > $output_file

# Apply the printed `export EC2_TAG_*` lines in this shell via command
# substitution so telegraf inherits them; plain `eval python3 ...` would
# only run the script and discard its output.
eval "$(python3 /parse_tags.py "$output_file")"

exec /entrypoint.sh "$@"

Steps to reproduce

  1. Install telegraf and configure it to send data to Datadog
  2. Send custom metrics (counters) via statsd to Datadog (see the sketch after this list)
  3. Observe that the values reported in Datadog are multiplied by 10
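
A minimal sketch of step 2, assuming the statsd listener from the config above is reachable on localhost:8126 (the port shown in the logs) and using a made-up metric name:

# Illustrative reproduction of step 2: increment a counter 3 times over UDP.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(3):
    sock.sendto(b"repro.test.counter:1|c", ("127.0.0.1", 8126))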

Expected behavior

If a counter is incremented 3 times and transmitted, Datadog should show a value of "3"

Actual behavior

If a counter is incremented 3 times and transmitted, Datadog instead shows a value of "30"
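
For what it's worth, one mechanism that yields an exact x10 factor in statsd pipelines is a client-side sample rate of 0.1: by statsd convention, a counter received as name:1|c|@0.1 is scaled up by 1/0.1 on the server. Whether that applies here is unconfirmed; the arithmetic below is illustrative only:

# Illustrative arithmetic only; not confirmed as the cause of this issue.
increments = 3          # counter incremented 3 times, as in "Expected behavior"
sample_rate = 0.1       # hypothetical "|@0.1" suffix on each datagram
print(increments * (1 / sample_rate))   # 30.0, matching the observed value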

Additional info

No response


    Labels

    area/aws (AWS plugins including cloudwatch, ecs, kinesis), bug (unexpected problem or unintended behavior)
