For all JVMs we have, if HEAP space hits a limit (90%?), send an alert.
Related Objects
- Mentioned In
- T153951: Yarn node manager JVM memory leaks
Event Timeline
Note: we have created https://grafana.wikimedia.org/dashboard/db/analytics-hadoop to monitor all the GC/Heap metrics of the Hadoop cluster actors, but we are still missing alarms. Maybe something based on graphite thresholds? I can see stuff like the following in puppet:
monitoring::graphite_threshold { 'restbase_analytics_<<some-metric-name>>':
    description   => 'Analytics RESTBase req/s returning 5xx http://grafana.wikimedia.org/#/dashboard/db/restbase',
    metric        => '<<the metric and any transformations>>',
    from          => '10min',
    warning       => '<<warning threshold>>', # <<explain>>
    critical      => '<<critical threshold>>', # <<explain>>
    percentage    => '20',
    contact_group => 'aqs-admins',
}
So we already have all the graphite metrics we need, and the only thing left is to figure out the correct thresholds. I'm going to ask for this ticket to be prioritized, since it seems rather important.
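As a rough sketch of what a heap alarm could look like with the same define: the resource title, the metric path, the heap size variable, the time window, the percentage and the contact group below are all placeholders made up for illustration, not the values from the actual patches. The idea is only to derive the warning and critical thresholds from the configured max heap, so that 90% of the heap maps to a concrete number in graphite.

```puppet
# Hypothetical value: max heap configured for the daemon, in MB.
$namenode_heapsize_mb = 4096

monitoring::graphite_threshold { 'analytics_hadoop_namenode_heap_usage':
    description   => 'HDFS NameNode JVM heap usage https://grafana.wikimedia.org/dashboard/db/analytics-hadoop',
    # Placeholder metric path: replace with the real graphite series
    # that reports the used heap in MB for this daemon.
    metric        => '<<graphite path of the heap-used-MB metric>>',
    from          => '60min',
    warning       => $namenode_heapsize_mb * 0.9,  # 90% of max heap
    critical      => $namenode_heapsize_mb * 0.95, # 95% of max heap
    # Share of datapoints in the window that must cross the threshold
    # (placeholder value, needs tuning).
    percentage    => '60',
    contact_group => '<<contact group>>',
}
```

A longer window and a higher percentage than in the RESTBase example would probably make sense here, since heap usage normally climbs close to the max right before a GC and a short spike should not page anyone.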
While reviewing this task, I opened https://phabricator.wikimedia.org/T153951 :D
Change 330154 had a related patch set uploaded (by Elukey):
Add JVM Heap usage alarms for basic Hadoop daemons
Change 337574 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms
Change 337575 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms
Change 337886 had a related patch set uploaded (by Ottomata):
Fix heapsize alert conditionals so that they work in labs
Change 337886 merged by Ottomata:
Fix heapsize alert conditionals so that they work in labs