Monitor Hadoop cluster running out of HEAP space with Icinga
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To

Authored By

	• Nuria
	Feb 5 2015, 1:13 AM

Description

For all JVMs we have, if HEAP space hits a limit (90%?), send an alert.

Details

Subject	Repo	Branch	Lines +/-
Fix and tune the new Analytics Hadoop alarms	operations/puppet	production	+8 -8
Fix heapsize alert conditionals so that they work in labs	operations/puppet	production	+101 -71
Fix and tune the new Analytics Hadoop alarms	operations/puppet	production	+8 -8
Add JVM Heap usage alarms for basic Hadoop daemons	operations/puppet	production	+87 -1

Related Objects

Mentioned In: T153951: Yarn node manager JVM memory leaks

Event Timeline

• Nuria created this task. Feb 5 2015, 1:13 AM

• Nuria raised the priority of this task from to High.

• Nuria updated the task description.

• Nuria added a project: Analytics-Kanban.

• Nuria subscribed.

Restricted Application added a subscriber: Aklapper. Feb 5 2015, 1:13 AM

• Nuria renamed this task from Icinga Monitoring should detect cluster running out of space to Icinga Monitoring should detect cluster running out of HEAP space. Feb 5 2015, 1:13 AM

• Nuria set Security to None.

• kevinator renamed this task from Icinga Monitoring should detect cluster running out of HEAP space to Monitor (Icinga) cluster running out of HEAP space. Feb 5 2015, 1:19 AM

• kevinator added a project: Analytics-Clusters.

• kevinator renamed this task from Monitor (Icinga) cluster running out of HEAP space to Monitor cluster running out of HEAP space with Icinga. Feb 5 2015, 1:21 AM

• kevinator assigned this task to Ottomata. Feb 9 2015, 10:55 PM

Namenodes only (that's where the problem is really serious)

• ggellerman edited projects, added Analytics-Backlog; removed Analytics-Kanban. Jun 12 2015, 9:35 PM

• ggellerman moved this task from Incoming to Backlog on the Analytics-Backlog board. Jul 10 2015, 4:33 PM

• ggellerman edited projects, added Analytics; removed Analytics-Backlog. Jan 12 2016, 7:42 PM

Milimetric moved this task from Incoming to Event Platform on the Analytics board. Jan 12 2016, 7:42 PM

elukey moved this task from Backlog to Q4 2019/2020 on the Analytics-Clusters board. Jul 27 2016, 12:37 PM

elukey subscribed.

Ottomata removed Ottomata as the assignee of this task. Aug 8 2016, 7:45 PM

Ottomata subscribed.

Note: we have created https://grafana.wikimedia.org/dashboard/db/analytics-hadoop to monitor all the GC/heap metrics of the Hadoop cluster daemons, but we are still missing alarms. Maybe something based on Graphite thresholds? I can see definitions like the following in Puppet:

monitoring::graphite_threshold { 'restbase_analytics_<<some-metric-name>>':
    description   => 'Analytics RESTBase req/s returning 5xx http://grafana.wikimedia.org/#/dashboard/db/restbase',
    metric        => '<<the metric and any transformations>>',
    from          => '10min',
    warning       => '<<warning threshold>>', # <<explain>>
    critical      => '<<critical threshold>>', # <<explain>>
    percentage    => '20',
    contact_group => 'aqs-admins',
}

So we have all the Graphite metrics we need, and we would only have to figure out the correct thresholds. I am going to ask for this ticket to be prioritized, since it seems rather important.
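
As a rough illustration of what such an alarm could look like, the template above might be filled in for NameNode heap usage along these lines. This is only a sketch under assumptions, not the change that was eventually merged: the resource title, the 80% warning level and the use of Graphite's `asPercent()` are illustrative choices, and every `<<...>>` placeholder still has to be replaced with real values. Only the parameters already shown in the template are used.

```puppet
# Hypothetical sketch: the resource title and thresholds are assumptions,
# and each <<...>> placeholder must be filled in before this can be applied.
monitoring::graphite_threshold { 'analytics_hadoop_namenode_heap_usage':
    description   => 'HDFS NameNode JVM heap usage https://grafana.wikimedia.org/dashboard/db/analytics-hadoop',
    # asPercent() is a standard Graphite function; both series are placeholders.
    metric        => 'asPercent(<<heap used metric>>, <<heap max metric>>)',
    from          => '10min',
    warning       => '80', # assumed early-warning level
    critical      => '90', # the 90% limit floated in the task description
    percentage    => '20', # copied unchanged from the RESTBase example
    contact_group => '<<contact group>>',
}
```

Expressing the metric as a percentage would let the same thresholds apply to daemons with different maximum heap sizes, at the cost of needing both a "used" and a "max" series for each JVM.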

elukey moved this task from Event Platform to Operational Excellence Future on the Analytics board. Aug 10 2016, 8:37 AM

Milimetric assigned this task to elukey. Sep 15 2016, 4:12 PM

Milimetric updated the task description.

Milimetric set the point value for this task to 8.

Milimetric moved this task from Operational Excellence Future to Dashiki on the Analytics board.

Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.

• Nuria moved this task from Backlog (Later) to Wikistats on the Analytics board. Oct 5 2016, 4:44 PM

Let's do it!

elukey removed elukey as the assignee of this task. Dec 14 2016, 11:02 AM

elukey added a project: User-Elukey.

Agreed, this seems like a good one to add to ops-excellence next quarter.

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. Dec 14 2016, 5:44 PM

• Nuria moved this task from Wikistats to Backlog (Later) on the Analytics board. Dec 15 2016, 5:54 PM

elukey mentioned this in T153951: Yarn node manager JVM memory leaks. Dec 22 2016, 4:00 PM

While reviewing this task, I opened https://phabricator.wikimedia.org/T153951 :D

elukey claimed this task. Dec 22 2016, 4:03 PM

• Nuria edited projects, added Analytics-Kanban; removed Analytics. Dec 22 2016, 4:10 PM

Change 330154 had a related patch set uploaded (by Elukey):
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

gerritbot added a project: Patch-For-Review. Jan 2 2017, 3:46 PM

ArielGlenn subscribed. Jan 2 2017, 5:59 PM

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Jan 17 2017, 4:04 PM

• Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Jan 30 2017, 4:07 PM

• Nuria moved this task from Paused to Ready to Deploy on the Analytics-Kanban board. Jan 30 2017, 4:49 PM

elukey moved this task from Analytics Backlog to In Progress on the User-Elukey board. Jan 31 2017, 3:02 PM

Change 330154 merged by Elukey:
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

elukey moved this task from Ready to Deploy to Paused on the Analytics-Kanban board. Feb 7 2017, 2:35 PM

Change 337574 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337574

Change 337575 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

Change 337575 merged by Elukey:
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

elukey renamed this task from Monitor cluster running out of HEAP space with Icinga to Monitor Hadoop cluster running out of HEAP space with Icinga. Feb 14 2017, 1:39 PM

elukey moved this task from Paused to Done on the Analytics-Kanban board.

• Tnegrin unsubscribed. Feb 14 2017, 1:45 PM

Change 337886 had a related patch set uploaded (by Ottomata):
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Change 337886 merged by Ottomata:
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Milimetric closed this task as Resolved. Feb 23 2017, 4:46 PM

Milimetric reopened this task as Open.

elukey closed this task as Resolved. Feb 24 2017, 9:49 AM

Change 337574 abandoned by Elukey:
Fix and tune the new Analytics Hadoop alarms