We have a table in our DB with 20MM rows. Each row is a small bit of data plus an array of integers representing dates. That array can have 10 to 1600 entries (average 30). The base data is approximately 10G. The primary key is 56 bytes (a SHA3-224 hash).
Array data looks like: [1443398400, 1443484800, 1443571200, 1443657600]
We multi-index on that array so we can quickly find items that apply to a specific date for a specific account.
The index is defined like this (Ruby syntax):

```ruby
->(d){ d['dates'].map{|e| [d['account_id'], e]} }, multi: true
```
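For concreteness, here is what that mapping function emits for a sample document (the document shape and values are hypothetical, inferred from the description above):

```ruby
# Hypothetical sample document matching the shape described above
doc = {
  'account_id' => 'acct-1',
  'dates'      => [1443398400, 1443484800, 1443571200]
}

# The multi-index mapping function: one [account_id, date] pair per array entry
index_fn = ->(d){ d['dates'].map{|e| [d['account_id'], e]} }

pairs = index_fn.call(doc)
# With multi: true, each pair becomes its own secondary-index entry,
# so this document contributes 3 index entries.
```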
Note: We tried a compound index on account_id and two dates (start & end), but secondary indexes don't filter as intuitively as we imagined when dates are involved. It required us to filter results after using the index, which is 10x slower than the multi-index approach.
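To illustrate the post-filtering problem with the compound start/end approach, a pure-Ruby sketch (field names and values hypothetical): an index range scan can bound the start date, but the end-date condition still has to be checked row by row afterwards:

```ruby
d = 300  # the date we are querying for

rows = [
  { 'account_id' => 'a1', 'start' => 100, 'end' => 400 },  # active on d
  { 'account_id' => 'a1', 'start' => 150, 'end' => 160 },  # expired before d
  { 'account_id' => 'a1', 'start' => 350, 'end' => 500 }   # starts after d
]

# What a compound-index range scan can do: bound account_id and start <= d
candidates = rows.select { |r| r['account_id'] == 'a1' && r['start'] <= d }

# What it cannot do: the end-date condition must be applied as a post-filter
matches = candidates.select { |r| r['end'] >= d }
```

The index narrows to 2 candidates, but only 1 actually matches; with long-lived rows the candidate set grows and the post-filter dominates.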
We have 2 nodes (db-1 & db-2), which according to the admin panel have 9.5MM rows each.
The table is set to 2 shards/2 replicas.
- on db-1 this table takes up: 242G (1 shard + 1 replica)
- on db-2 this table takes up: 241G (1 shard + 1 replica)
We were adding a 3rd node and resharding this table to 3 shards/3 replicas -- the data on db-3 exceeded 1.3T.
We've had trouble with this table before...
Here's the part where we believe there's a bug:
Deleting and rebuilding just this one index can reduce the space by as much as 50%.
This is seriously impacting our ability to grow (we can't add a 3rd node, for example). We have tables with 40MM entries where the rows are much larger but the file takes up less space.
Our math:

(56 bytes for the ID + ~20 bytes for each date in the index)
× 30 dates (average per row)
× 20MM rows = ~45G

Our db-3 server needed 31x that amount of space for this one table.
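That estimate, spelled out (the ~20 bytes of per-entry index overhead is our rough assumption):

```ruby
id_bytes       = 56           # primary key size (SHA3-224 key as stored)
per_date_bytes = 20           # rough per-index-entry overhead (assumption)
avg_dates      = 30           # average array length per row
rows           = 20_000_000

estimate = (id_bytes + per_date_bytes) * avg_dates * rows
# => 45_600_000_000 bytes, i.e. ~45G

# db-3's ~1.4T of storage divided by that estimate:
bloat = 1_400_000_000_000.0 / estimate
# ~30.7x, in line with the ~31x we observed
```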
FYI:
- RethinkDB 2.3.4 from your APT repos on Ubuntu 14.04.1, no virtualization, XFS filesystem
- Servers have 256G RAM, 24 cores, 1.4T storage