
Add hardware capacity to AQS
Closed, Resolved · Public · 0 Estimated Story Points

Related Objects

Status      Assigned
Resolved    elukey
Resolved    RobH
Resolved    elukey

Event Timeline

Current capacity of AQS/Pageview API is documented here:

https://wikitech.wikimedia.org/wiki/Analytics/AQS#Scaling:_Settings.2C_Failover_and_Capacity_Projections

We know that at our current storage resolution we will run out of capacity in 6 months. While we are going to investigate whether we can lower the resolution of our data (and thus lighten our storage requirements, see: https://phabricator.wikimedia.org/T144837), we need to order hardware now so that we are ready to meet demand in 6 months.
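
For reference, a minimal back-of-envelope sketch of the runway calculation; the usage and growth numbers below are illustrative placeholders, not the actual cluster figures (those are on the wikitech page linked above):

```python
# Rough capacity-runway estimate. All numbers are illustrative placeholders;
# the real figures live on the wikitech capacity-projections page linked above.
total_usable_tb = 12.0        # usable space across the cluster after replication/overhead (assumed)
used_tb = 7.0                 # space currently consumed (assumed)
growth_tb_per_month = 0.8     # monthly growth at current resolution (assumed)

months_left = (total_usable_tb - used_tb) / growth_tb_per_month
print(f"Approx. months until full: {months_left:.1f}")
```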

Ok, some HW context!

We recently added 3 new AQS nodes (aqs100[456]) to replace 3 OOW ones (aqs100[123]). We ordered these nodes with 8 very large SSDs, with the intention of increasing capacity by just adding more large SSDs later. In the meantime, we've learned from the Cassandra pros over in Services (e.g. @Eevans) that the amount of data served per Cassandra instance should be limited, and the number of Cassandra instances per node is limited by available RAM.

We are currently running 2 Cassandra instances per aqs node, each using 4 disks.
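
To make the per-instance load concrete, here is a small sketch of how data per Cassandra instance scales with node count and instances per node; the dataset size and replication factor below are assumptions for illustration only:

```python
# Per-instance data load under different layouts. Dataset size and replication
# factor are assumed values for illustration, not measured cluster numbers.
def tb_per_instance(raw_dataset_tb, replication_factor, nodes, instances_per_node):
    """Approximate on-disk data each Cassandra instance has to serve."""
    replicated_tb = raw_dataset_tb * replication_factor
    return replicated_tb / (nodes * instances_per_node)

# Current layout: 3 nodes x 2 instances (assuming RF=3 and a 10 TB raw dataset)
print(tb_per_instance(10, 3, nodes=3, instances_per_node=2))  # ~5.0 TB per instance
# A third instance per node spreads the same data thinner, at the cost of more RAM
print(tb_per_instance(10, 3, nodes=3, instances_per_node=3))  # ~3.3 TB per instance
```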

Whatever we do to increase capacity, we'd like to keep the aqs cluster as homogeneous as possible. Some possible options:

  • Buy new nodes with exactly the same specs. This would be unnecessarily expensive, as more, smaller SSDs would be preferable to fewer large ones.
  • Add more RAM and 4 more SSDs to each existing node. We'd then run 3 Cassandra instances on each node. This would work, but would also increase our failure profile, as a single node failure would take down 3 Cassandra instances.
  • Some combo of the above options.
  • Swap nodes or SSDs with others? Perhaps someone in ops has a need for large 1.6 TB SSDs. We have 24 of them that we could give away, and order a large batch of new, smaller ones. Or we could swap the whole nodes with someone and order a full new cluster.

If we were ordering a new aqs cluster with the information we have now, we'd likely order 6 nodes, each with 12 ~1 TB SSDs (or perhaps smaller). Since we already have 3 nodes with 8 × 1.6 TB SSDs, we're trying to figure out the most cost effective thing to do to increase capacity, while still keeping the aqs cluster homogeneous.
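
To compare the options on raw SSD capacity alone, a quick sketch; disk counts come from the description above, while the size of any added SSDs and the 12 × ~1 TB new-node layout are the assumptions already stated:

```python
# Raw (pre-replication) SSD capacity under the options discussed above, in TB.
# Counts from the task description; added-SSD size assumed to match the existing 1.6 TB drives.
current         = 3 * 8 * 1.6           # existing aqs100[456]: 3 nodes x 8 x 1.6 TB
opt_same_specs  = current + 3 * 8 * 1.6 # buy 3 more identical nodes
opt_add_disks   = 3 * 12 * 1.6          # add RAM + 4 SSDs per existing node, 3 instances/node
opt_new_cluster = 6 * 12 * 1.0          # hypothetical fresh cluster: 6 nodes x 12 x ~1 TB

print(f"current:          {current:.1f} TB raw")
print(f"3 identical nodes:{opt_same_specs:.1f} TB raw")
print(f"add 4 SSDs/node:  {opt_add_disks:.1f} TB raw")
print(f"new 6-node cluster:{opt_new_cluster:.1f} TB raw")
```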

I don't know if it would be relevant, but we recently decommissioned 16 servers from the elasticsearch cluster due to being out of warranty. These machines had 32 SSDs (@300GB each, iirc). I'm not entirely clear on the history of things, but if my understanding is correct those machines were pre-existing and the SSDs were added at a later date, meaning there may be warranty life left on those SSDs. I don't know how much, or if it's worthwhile to kick the can down the road for a year, but that might be 9.6T of SSD space you could use (until they are out of warranty, earlier than the servers).

Again, I'm not sure that's particularly useful to you... but commenting on the off chance it is.

Milimetric triaged this task as Medium priority. Sep 15 2016, 3:40 PM
Milimetric edited projects, added Analytics; removed Analytics-Kanban.
Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board.
Milimetric assigned this task to elukey.
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric set the point value for this task to 0.