Status | Assigned | Task
---|---|---
Resolved | elukey | T144833 Add hardware capacity to AQS
Resolved | RobH | T149920 Analytics AQS cluster expansion
Unknown Object (Task) | |
Unknown Object (Task) | |
Unknown Object (Task) | |
Resolved | elukey | T155654 rack and set up aqs100[7-9]
Event Timeline
Current capacity of AQS/Pageview API is documented here:
We know that, at our current data resolution, we will run out of storage capacity in 6 months. While we are going to investigate whether we can lower the resolution of our data (and thus lighten our storage requirements; see https://phabricator.wikimedia.org/T144837), we need to order hardware now so that we are ready to meet demand in 6 months.
Ok, some HW context!
We recently added 3 new AQS nodes (aqs100[456]) to replace 3 OOW ones (aqs100[123]). We ordered these nodes with 8 very large SSDs, with the intention of increasing capacity by just adding more large SSDs later. In the meantime, we've learned from the Cassandra pros over in Services (e.g. @Eevans) that the amount of data served per Cassandra instance should be limited, and the number of Cassandra instances per node is limited by available RAM.
We are currently running 2 Cassandra instances per aqs node, each using 4 disks.
Whatever we do to increase capacity, we'd like to keep the aqs cluster as homogeneous as possible. Some possible options:
- Buy new nodes with exactly the same specs. This would be unnecessarily expensive, as more smaller SSDs would be preferred to fewer large SSDs.
- Add more RAM and 4 more SSDs to each existing node. We'd then run 3 Cassandra instances on each node. This would work, but would also increase our failure profile, as a single node failure would take down 3 Cassandra instances.
- Some combo of the above options.
- Swap nodes or SSDs with others? Perhaps someone in ops has a need for large 1.6 TB SSDs. We have 24 of them that we could give away, and then order a large batch of new, smaller ones. Or we could swap the whole nodes with someone and order a full new cluster.
If we were ordering a new aqs cluster with the information we have now, we'd likely order 6 nodes, each with 12 ~1 TB SSDs (or perhaps smaller). Since we already have 3 nodes with 8 1.6 TB SSDs each, we're trying to figure out the most cost-effective way to increase capacity while still keeping the aqs cluster homogeneous.
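To make the options above easier to compare, here is a rough sketch of the raw numbers. It only multiplies out figures quoted in this thread (3 nodes, 8 x 1.6 TB SSDs, 4 disks per Cassandra instance). The `NodeLayout` class and its field names are hypothetical, "~1 TB" is assumed to mean 1000 GB, and the 4-disks-per-instance layout is assumed to carry over to a new cluster. It reports raw SSD space only and ignores replication, filesystem overhead and compaction headroom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLayout:
    """Hypothetical helper describing one cluster layout (raw SSD space only)."""
    name: str
    nodes: int
    ssds_per_node: int
    ssd_gb: int                  # size of one SSD in GB
    disks_per_instance: int = 4  # from the thread: each instance uses 4 disks

    @property
    def raw_gb(self) -> int:
        # Total raw SSD space across the whole cluster.
        return self.nodes * self.ssds_per_node * self.ssd_gb

    @property
    def instances_per_node(self) -> int:
        # How many Cassandra instances the disks of one node can back.
        return self.ssds_per_node // self.disks_per_instance

    @property
    def gb_per_instance(self) -> int:
        # Raw space available to a single Cassandra instance.
        return self.disks_per_instance * self.ssd_gb


# Figures taken from the discussion; 1000 GB for "~1 TB" is an assumption.
current = NodeLayout("current aqs100[456]", nodes=3, ssds_per_node=8, ssd_gb=1600)
expanded = NodeLayout("add 4 SSDs per node", nodes=3, ssds_per_node=12, ssd_gb=1600)
fresh = NodeLayout("new 6-node cluster", nodes=6, ssds_per_node=12, ssd_gb=1000)

for layout in (current, expanded, fresh):
    print(
        f"{layout.name}: {layout.raw_gb / 1000:.1f} TB raw, "
        f"{layout.instances_per_node} instances/node, "
        f"{layout.gb_per_instance / 1000:.1f} TB per instance"
    )
```

The point of the comparison is the last column: smaller SSDs reduce the amount of data each instance has to serve, and spreading instances over more nodes means a single node failure takes out a smaller share of the cluster.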
I don't know if this is relevant, but we recently decommissioned 16 servers from the elasticsearch cluster because they were out of warranty. These machines had 32 SSDs (300 GB each, iirc). I'm not entirely clear on the history, but if my understanding is correct, those machines were pre-existing and the SSDs were added at a later date, meaning there may be warranty life left on those SSDs. I don't know how much, or whether it's worthwhile to kick the can down the road for a year, but that might be 9.6 TB of SSD space you could use (until the SSDs are out of warranty, which would be earlier than the servers).
Again, I'm not sure that's particularly useful to you, but I'm commenting on the off chance it is.
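For scale, a quick back-of-the-envelope check of that figure, assuming the 300 GB size recalled above is right. The comparison against a 4 x 1.6 TB Cassandra instance uses the disk layout described earlier in this thread; the variable names are made up for illustration.

```python
import math

# Reclaimed elasticsearch SSDs ("iirc" figures from the comment above).
reclaimed_ssds = 32
reclaimed_ssd_gb = 300
reclaimed_total_gb = reclaimed_ssds * reclaimed_ssd_gb

# One current AQS Cassandra instance: 4 disks of 1.6 TB each.
instance_gb = 4 * 1600

# Number of 300 GB SSDs needed to match one instance's raw space.
ssds_per_instance_equiv = math.ceil(instance_gb / reclaimed_ssd_gb)

print(f"reclaimed raw space: {reclaimed_total_gb / 1000:.1f} TB")
print(f"300 GB SSDs to match one instance: {ssds_per_instance_equiv}")
print(f"instances' worth of space: {reclaimed_total_gb / instance_gb:.2f}")
```

In other words, the reclaimed SSDs would add raw space equivalent to about one and a half of the current instances, spread over many more physical disks.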