Hash Field Expiration#13172
Conversation
|
we need to update db->hexpire when executing writable hash command or other commands that may affect hash.
|
|
I'm wondering why we introduced list ebucket which is only one per db. |
Hello @sundb,
You can review test tests at file: |
@sundb , I am not sure that i understand the comment. There is no relation between the fact that there is only one per db and the fact that it adds "complexity". In fact the code is rather simple, doing unregister-and-register, from source to destination, to BTW, currently there is no real reason why |
Abstract
This proposal advocates for enhancing Redis expiration functionality by introducing the option to set expiration on individual hash fields. In addition, it challenges the existing keyspace expiration implementation by presenting an alternative data structure known as
ebuckets, initially designed for HFE usage.ebucketscombine different recurring ideas that were raised up by the community, in the context of keyspace, potentially extending their relevance to hash fields as well. The new DS is optimized for active expiration, sorted by TTL, boasting a reduced memory cost of approximately ~20 bytes of metadata per item compared to the existing ~40 bytes in the dict for expiry. It uses an embedded expiry struct in the registered item which saves another pointer and gives along the way an O(1) TTL lookup.In that context, we also introduce a new structure, named
MSTR, which stands for immutable STRing with Metadata. It comes as a replacement forSDS-fields in a hash. WhereasSDSonly allows manipulation of strings, withMSTRwe can add and remove metadata attached to immutable strings generically.Overview
Redis is often used as a cache, and as such, one of its main features is being able to expire keys based on timeout. Over the years there was a recurring demand for more granular expiration control, such as to expire members in a set [1,#135], or members in sorted set [1,2], and particularly there are requests for expiration of fields of a hash [1,2,3,#167,#1042]. Several initiatives have been undertaken to address these needs, taking the form of third-party modules [1,2] and Redis forks [1].
The key aspects of this feature include:
EXPIREcommand for keyspace which only needs to decide whether a given object is expired. Here we need to fine-tune a few hash commands for corresponding action to be taken.EXPIREfeature for keys, however, some commands adopt a slightly different approach.Active Expiration
When designing a data structure for hash field expiration, it might be a good idea to examine first the current Redis implementation of key expiration, considering its pros and cons. By evaluating the existing approach, we can determine whether adopting a similar strategy would be beneficial. This examination may not only aid in designing an effective data structure for hash field expiration but potentially, may also improve the mechanism of Redis for key expiration. Therefore, the terms “key” and “field” (or “hash field”) might be used interchangeably. In addition, even if eventually it might be implemented only for hash field expiration, it allows us to challenge the current implementation and provide a good reference.
Redis uses a hash table DS to implement key expiration, wherein the insertion of a new key with TTL into the table is determined by the key's hash value. The pros and cons are:
Pros:
Cons:
As can be seen, using a hash-table for expiration brings with it one major advantage of O(1) access time to a given item, but on the other hand it raises several severe issues regarding active expiration (as highlighted in previous issues [#8700,#10802,#9480]). An alternative DS obviously should be sorted by expiration time for efficient active expiration, and be able to scale from a few items to millions.
RAXis a candidate that keeps reoccurring in that context [1,#9480], as it is already available and well tested. There are more advanced algorithms for caching and support for expiration [1,2] but they might require major change and they don't scale from a few items for the case of hash fields.One might claim that the downside of using
RAXis log(n) access time. This claim can be mitigated by keeping no more than 6 bytes of the expiration time (which is sufficient until the date of 02 August, 10889) and as a result the depth of the RADIX tree will be limited to 6 levels. Another concern might be thatRAXis wasting too much memory for metadata . This can be mitigated as well by holding a segment of multiple items, instead of holding a single item at the leaf of the tree. Those segments will be optimized in memory such that the items themselves will embed the structure of the segment as a linked list. Furthermore, it also reduces the frequency of updates of the rax tree and resolves conflicts of items with the same expiration time.Having said that, the idea to embed metadata attached to the hash fields (or keys) is problematic as long as the hash fields, or keys, are of type
SDS. Hence, as an alternative, we introduce a new structure, namedMSTR, which stands for immutable STRing with Metadata. WhereasSDSallows to keep only strings, withMSTRwe can add and remove metadata attached to the hash field (or key) string in a generic way. It might seem overkill to have general purposeMSTRjust for TTL, but it is likely that other features down the road, require attaching their own metadata to hash fields (or keys) as well.New DS: ebuckets
ebuckets, which stands for expiry-buckets, is used to store items that are set with expiration-time. It supports the basic API of add, remove, and active expiration. Its implementation is based on aRAXtree, or a plain linked list when small.The expiration time of an item is used as the key to traverse the
RAXtree.The
ebucketsdata structure is organized hierarchically as follows:RAX. Each leaf in theRAXtree is a bucket.RAXtree for each bucket represents low bound expiration time for the items within this ebucket and the key of the next bucket represents the upper bound expiration time.EB_SEG_MAX_ITEMS(currently it is 16) items as a linked list. If needed to add an item to a full segment, then the segment will try to split the bucket. To avoid wasting memory, it is a singly linked list (single pointer to next item). The list is cyclic to allow efficient removal of items from the middle of the segment without necessarily traversing theRAXtree.ebucketsshould embed theExpireMetastructure and supply getter function.ExpireMetaholds the expiration time of the item and a few more fields that are used to maintain the segments data-structure (Described at the end of this doc).The following diagram summarizes those ideas of using
RAX, segments, items and embeddedExpireMeta. It also gives a hint toMSTRlayout and how hash of type dict is going to integrate with it:Splitting bucket
Each segment can hold up-to
EB_SEG_MAX_ITEMSitems. On insertion of a new item, it will try to split the segment. Here is an example For adding another item to segment number 1 that already reached its maximum capacity which will cause to split of the segment and in turn split of the bucket as well to a finer grained ranges:Extending bucket
In the example above, the reason it wasn't split evenly, is that the segment must have been holding items with the same TTL and they must reside together in the same bucket after the split. Which brings us to another important point. If there is a segment that reaches its maximum capacity and all the items have the same expiration-time key, then we cannot split the bucket but aggregate all the items, with the same expiration time key, by allocating an extended-segment and chain it to the first segment in the visited bucket. In that sense, extended segments will only hold items with the same expiration-time key.
The following diagram describes in more details the way to extend a full segment and chain a new segment:
Memory evaluation
Optimal memory utilization occurs when multiple items share the same expiration time, as they can be stored efficiently within a single bucket utilizing extended segments, akin to a cost-effective linked list structure. Initially, let's disregard this scenario and assume that each item possesses a unique expiration time, resulting in all buckets operating without extended segments.
Note that:
ExpireMeta.Use case: ebuckets contains few items. No more than
EB_LIST_MAX_ITEMSWhen ebuckets contains no more than
EB_LIST_MAX_ITEMS(=16) items, it is optimized not to allocate any memory. It just uses embeddedExpireMetato maintain its own data-structure as a plain list. Therefore, memory overhead is 16 bytes per item.Use case: Most items are removed via Active-Expire
If most of the items are being removed by active expiration, then sizes of the segments (buckets) will be an outcome of split and will hold at least 8 items and no more than 16 (
=EB_SEG_MAX_ITEMS). So, it is expected to have an average of 12 items per bucket.Average of 12 items in bucket: 16 + 16/12 + 40/12 = 20.6 bytes
Use case: Items removed NOT only by Active-Expire
If expected to have many "non-sequential" removal of items from buckets, then many buckets can shrink down to 4 items in a bucket. Below it,
ebucketswill try merge the bucket with adjacent one. So, it is expected to have an average of 10 items per segment bucket.Average of 10 items in bucket: 16 + 16/10 + 40/10 = 21.6 bytes
Worst case of 4 items in bucket: 16 + 16/4 + 40/4 = 30 bytes
Use case: All items with same expiration-time
In that case
ebucketswill have a single bucket of type extended-segments with an unbounded number of items in it.ExpireMetabecomes the dominant value andNextSegHdr, of size 24 bytes, will be allocated per segment. Assuming 16 items per segment, we get:16 + 24/16 = 17.5 bytes
Performance Evaluation
ebucketsand then actively expiring all of them.ebucketsDS).raxmemory along with segment headers.To run the benchmark:
MSTR (immutable string with metadata)
SDSstring is widely used across the system and serves as a general purpose container to hold data. The need to optimize memory and aggregate strings along with metadata and store it into Redis data-structures as single bulk keep reoccur. One thought might be to attach metadata toSDS. The trouble is thatSDSis a mutable string in its nature, with a wide API (split, join, etc.). Pushing metadata logic intoSDSwill make it very fragile, and complex to maintain.As an alternative, we introduce a new concept of immutable strings, simplified, with limited API, and with the option to attach metadata. This idea isn’t new and was suggested in different contexts. The new thing here is defining it as infrastructure to different kinds of use cases. One use case can be attaching TTL to hash fields. Another use case can be attaching TTL and encoding to keys. Etc.
The representation of the string, without any metadata, in its basic form, resembles
SDSbut without the API to manipulate the string. The following diagram shows the memory layout of mstring (mstrhdr8) when no metadata is attached:If metadata flag is set, depicted in diagram above as
m-bitin the diagram, then the header will be preceded with additional 16 bits of metadata flags such that if i'th bit is set, then corresponding i'th metadata structure is attached to the mstring. The metadata layout and their sizes are defined bymstrKindstructure (More below). The following diagram shows the memory layout ofMSTR(mstrhdr8) when 3 bits inmFlagsare set to indicate that 3 fields of metadata are attached to the mstring at the beginning.Note: Initially
MSTRwas designed to support implict conversion fromSDStoMSTR. That is, the layout ofMSTRwithout any metadata attached is aligned withSDSlayout. But later on it was decided to discard this approach and to strive for explicit conversion. This decision can be challenged in the future as the code evolves.Kinds of MSTR
The
MSTRallows defining different kinds (classes) of mstrings, each with its own unique metadata layout. For example, in case of hash fields, all instances of it can optionally have TTL metadata attached to it. This is achieved by prototyping oncemstrKindstructure that defines the metadata layout and metadata sizes of this specific kind.Here is
mstrKindprototype for hash fields:Note that each hash field instance still has the freedom to optionally attach the expiration metadata to it. Most of them will start their life without any metadata attached.
In the future, the keys of Redis keyspace can be another kind of
MSTRthat aggregate TTL, LRU, or even dictEntry metadata embedded into a single allocation. Here is a general idea of how it might look like:This idea is further elaborated in Appendix B.
Alignment issues
There are two types of alignments to take into consideration:
Alignment of the metadata - As the metadatas layout are reversed to their enumeration, it is recommended to put metadata with "better" alignment first in memory layout (enumerated last) and the worst, or those that simply don't require any alignment will be last in memory layout (enumerated first). This is similar to the applied consideration when defining a new struct in C. Note also that each metadata might either be attached to
MSTRor not which complicates a little the design phase of a newmstrKind. In the example above, HKEY_META_VAL_REF_COUNT, with the worst alignment of 4 bytes, is enumerated first, and therefore, will be last in memory layout.Alignment of returned MSTR pointer - Few optimizations in Redis rely on the fact that
SDSaddress is always an odd pointer. We can achieve the same with a little effort. It was already taken care that all headers of typemstrhdrXhas odd size. With that in mind, if a new kind ofMSTRis required to be limited to odd addresses, then we must make sure that sizes of all related metadatas that are defined inmstrKindare even in size.Put the pieces together
Each hash instance will maintain its own set of HFEs in its private
ebuckets. It can be attached to dict structure as metadata. In order to support active expiration across hash instances, hashes with associated HFE will be also registered in globalebuckets, per db, with expiration time value that reflects their next minimum time to expire. The global active expiration of HFE will be triggered from theactiveExpireCycle()function and will invoke "local" HFE Active expiration for each hash instance that has expired fields.In addition, current implementation of hashes by dict will be modified from using
SDSas fields tohfield, which is kind ofMSTR, in order to be able to attachExpireMetastructure to the hash field. A field of hash can initially be allocated withoutExpireMetaand following setting of TTL, it will be reallocated withExpireMetaand will be added to theebucketsof the hash. If the new TTL of the hash field is also the new minimum TTL of the hash, then the hash will be updated in the globalebucketsas well.Lazy Expiration
Conceptually, every time before a hash object is accessed, it is needed to delete all expired fields. This approach, even though will bring the object to its desired state, might freeze the main thread for a long time in case there considerable amount of items to expire in a short period of time. As an alternative, we can refine the action taken before each command, to better suit and reduce the risk of stucking with a long expiry operation.
HGET, HLEN, HMGET, HINCRBY(FLOAT), HEXISTS - Will make lazy expire to the field.
HRANDFIELD - To keep it simple, first it will delete all expired fields, before picking random fields.
HLEN - Return a number of fields, expired or not!
HGETALL - Filter expired fields from the reply. Might return empty array (just like KEYS)
HSCAN - Will skip expired fields (may result in an empty array, like SCAN)
API
Appendix A: Structure ExpireMeta
ExpireMetastruct should be embedded inside the item that needs to be stored inebuckets:Appendix B: Enhance EXPIRE keyspace (Idea for future work)
Now that we have a functioning
MSTRandebucketsfor hash fields, we can consider challenging the existing keyspace expiration implementation with the same approach and even take it further.As a first step, we can try
ebucketsinstead of hashtable DS forEXPIRE. This also requires to modifySDS-keys to be based on kind ofMSTRand attaching itExpireMeta, just like we did for hash fields in this suggestion.The next step might be to consider even to merge
dictEntryandrobjinto keys which are kind ofMSTR, after the previous step.