Skip to content

Sharding with Multi-index causes severe file bloat #5923

@bsharpe

Description

@bsharpe

We have a table in our DB that has 20MM rows in it. Each row is a small bit of data and an array of integers representing dates. That array can be 10 to 1600 entries (average 30). The base data is approximately 10G. The primary key is 56-bytes (sha3-224bit).
Array data looks like: [1443398400, 1443484800, 1443571200, 1443657600]

We multi-index on that array to be able to quickly find items that apply to a specific date for a specific account quickly.

The index is defined like this: (Ruby syntax)

->(d){ d['dates'].map{|e| [d['account_id'], e]} }, multi: true

Note: We tried a compound index with account_id and two dates (start & end) -- but the secondary indexes don't filter as intuitively as imagined with dates involved. This required us to filter results after using the index which is 10x slower than the mutli-index approach.

We have 2 nodes (db-1 & db-2), which according to the admin panel have 9.5MM rows each.
The table is set to 2 shards/2 replicas.

  • on db-1 this table takes up: 242G (1 shard + 1 replica)
  • on db-2 this table takes up: 241G (1 shard + 1 replica)

We were adding a 3rd node and resharding this table to be (3/3) -- the data on db-3 exceeded 1.3T

We've had trouble with this table before...
Here's the part where we believe there's a bug:

Deleting and rebuilding just this one index can reduce the space by as much as 50%.

This is seriously impacting our ability to grow (we can't add a 3rd node, for example). We have tables with 40MM entries where the rows are much larger but the file takes up less space.

Our Math:

(56bytes for ID + ~20bytes for each date in the index) 
 * 30 dates(average per row) 
 * 20MM rows = ~45G

our db-3 server was needing 31x that amount of space for this one table.

FYI:

  • RethinkDB 2.3.4 from your APT repos on Ubuntu 14.04.1 w/ no virtualization using XFS
  • Servers have 256G RAM, 24 cores, 1.4T storage

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions