Skip to content

Modules: Add remaining list API functions#8439

Merged
oranagra merged 6 commits intoredis:unstablefrom
zuiderkwast:modules-api-list
Sep 14, 2021
Merged

Modules: Add remaining list API functions#8439
oranagra merged 6 commits intoredis:unstablefrom
zuiderkwast:modules-api-list

Conversation

@zuiderkwast
Copy link
Contributor

@zuiderkwast zuiderkwast commented Feb 2, 2021

List functions operating on elements by index:

  • RM_ListGet
  • RM_ListSet
  • RM_ListInsert
  • RM_ListDelete

List iterator functions: RM_ListIteratorStart, RM_ListIteratorStop, RM_ListIteratorNext, RM_ListIteratorInsert, RM_ListIteratorDelete

Iteration is done using a simple for loop over indices. The index based functions use an internal iterator as an optimization. This is explained in the docs:

 * Many of the list functions access elements by index. Since a list is in
 * essence a doubly-linked list, accessing elements by index is generally an
 * O(N) operation. However, if elements are accessed sequentially or with
 * indices close together, the functions are optimized to seek the index from
 * the previous index, rather than seeking from the ends of the list.
 *
 * This enables iteration to be done efficiently using a simple for loop:
 *
 *     long n = RM_ValueLength(key);
 *     for (long i = 0; i < n; i++) {
 *         RedisModuleString *elem = RedisModule_ListGet(key, i);
 *         // Do stuff...
 *     }

@zuiderkwast
Copy link
Contributor Author

As listed on #8157. RM_ListLen is not needed since RM_ValueLength gives the length of a list.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a preliminary review.
I'm not sure i like the iterator API, something feels odd and i'm not sure i can put my finger on it.

I suppose that the fact that the iterator is actually embedded in the RedisModuleKey (rather than returning some opaque object) is ok, since in theory the user can open the same key twice. and if he modifies it during iteration that's wrong in any case that has two iterators open at the same time.

i see that zset API doesn't prevent the module from doing edits during the iterations, and also don't offer any iterator specific API for modifications (it just leaves this use case unsolved).

here, if we want to solve it, maybe we can go the extra mile and just make the normal modification APIs implicitly work during an iteration rather than provide a specific APIs for that?

i'm not sure what's the right thing, maybe i need to "sleep on it".

@zuiderkwast
Copy link
Contributor Author

Thanks for the review.

I'm not sure i like the iterator API, something feels odd and i'm not sure i can put my finger on it.

👉 Functions are duplicated with iterator and non-iterator variants. Not so nice. Is that the main point?

I suppose that the fact that the iterator is actually embedded in the RedisModuleKey (rather than returning some opaque object) is ok, since in theory the user can open the same key twice. and if he modifies it during iteration that's wrong in any case that has two iterators open at the same time.

Yes, this is modeled after zset and stream. Zset was first to set this standard. Good point though. Maybe it's worth documenting that you shouldn't open the key twice and do modifications on one while iterating on the other.

i see that zset API doesn't prevent the module from doing edits during the iterations, and also don't offer any iterator specific API for modifications (it just leaves this use case unsolved).

Good point. I can look into this (see what happens to the iterator if changes are made, maybe add a test case and think about improving the situation) later.

here, if we want to solve it, maybe we can go the extra mile and just make the normal modification APIs implicitly work during an iteration rather than provide a specific APIs for that?

It can totally be done. Then we don't need the duplicated functions. We can add a special index value (a macro) for CURRENT, usable only while iterating.

i'm not sure what's the right thing, maybe i need to "sleep on it".

Sleep well. Same here, after posting these comments.

@zuiderkwast
Copy link
Contributor Author

I'm not sure i like the iterator API, something feels odd and i'm not sure i can put my finger on it.

I've been thinking about this for a while and I think we need to find actual use cases to motivate what we should have. I have some ideas/problems:

  • Next() allocates a string. This makes typical use cases very expensive, such as searching for a particular element for returning or removing it. It's actually worse than using RM_Call() with LPOS and LREM. We could add a RM_ListFind() function, or better IMO is to have a function which exposes the pointer, length or interger value (if string pointer is NULL), just like a ziplist entry.
  • Removing and inserting may elements one-by-one is slightly expensive, since ziplists use memmove() on these operations. Perhaps we should work with ranges here, e.g. DeleteRange(key, start, count), or inserting multiple elements using a format string, accepting integers, C-strings and RedisModuleStrings as va_args?
  • Even if the main use case is stuff like queues (push and pop) like for doubly-linked lists, it can also be used for traversal and access in the middle since there is no other type for these. Seeking is a bit like a skiplist since it's really a linked list of ziplists.
  • We could make the iterator completely hidden, only providing access by index and automatically using an interator behind the scenes when users access index 0 (or -1), hoping they will access 1 (or -2) next. Maybe it gives the wrong illusions but the optimization could be documented in that case...

Just an observation: Push/Pop is cheaper on the tail than on the head (also for RPUSH/RPOP vs LPUSH/LPOP) since ziplists are cheaper in the tail (no memmove).

To avoid allocations, I've also been contemplating integer access to plain keys (RM_IntegerSet/Get e.g. for efficient implementation of things like INCRBY) and to ziplists (lists and hashes). I might draft it later. Avoiding allocations has great optimization potential in modules, or so I've heard.

I'm sort of waiting for input about how to continue this PR.

@oranagra
Copy link
Member

oranagra commented Feb 8, 2021

i don't have response for everything yet, but i wanna offer some perspective.
i feel that we don't want to make the API too coupled with the internals, or too hard to use.
i.e. i think it would be a bad idea to expose any char* to internal memory, and integer values.

I think the API needs to be easy to use, and flexible enough for that future changes in the internals of redis can still support it.
if that's not efficient enough for some module, it can always implement it's own data type, rather than using redis's ones.

@zuiderkwast
Copy link
Contributor Author

i don't have response for everything yet, but i wanna offer some perspective.
i feel that we don't want to make the API too coupled with the internals, or too hard to use.
i.e. i think it would be a bad idea to expose any char* to internal memory, and integer values.

I think the API needs to be easy to use, and flexible enough for that future changes in the internals of redis can still support it.
if that's not efficient enough for some module, it can always implement it's own data type, rather than using redis's ones.

Thanks. It's your decisions to make. Let me just give our perspective.

Maybe we're not using modules as intended, but we've been using them to speed up things that could also be done in Lua. In one of our products, they hit 100% CPU load using Lua (while testing, not live). When they rewrote it to a module, it was much faster even using RM_Call(). We don't really know why. Maybe LuaJIT would make it faster.

Moving a part of the application logic into the database itself is a great way of designing fast software. You save network traffic for starters.

We've identified allocations (jemalloc's not-very-fast malloc_usable_size(), which we have a patch for disabling and instead storing the usable size inside the allocation, something Redis already can do if HAVE_MALLOC_SIZE is undefined, trading one word of memory for CPU), RM_Call()'s allocations and the fact that you need to allocate strings for most things, as particularly expensive.

Given the already existing RedisModule_StringDMA which provides direct read/write access to internal memory, I thought returning a const char * or a long long wouldn't be that problematic, but it's easy for me to say---your're the maintainers who will suffer forever once things are merged. :)

@oranagra
Copy link
Member

oranagra commented Feb 8, 2021

I think string DMA is a special case, letting modules store blobs of bytes, and also, that's an API which we'll probably always be able to support (no matter what changes we'll do internally in redis), and it is also quite easy to use for the user. i'm not sure the same can be said about ziplists and their integers.
maybe if you draft an API declaration, we'll be able to look at it and come to some new understandings.
or maybe we need to figure out some primitive access API that is able to abstract these, and still be useful and efficient.

p.s. unsetting HAVE_MALLOC_SIZE, you also lose track of internal fragmentation, tracking its cost, or being able to take advantage of the spare memory in some places.

@yossigo
Copy link
Collaborator

yossigo commented Feb 8, 2021

StringDMA is actually an historic item in the API that dates back to the time module data types did not exist, it might even be a candidate for deprecation (if we decide to start doing that) IMHO.

To work around costly short-lived allocations, perhaps we should consider using the stack more? I think this could be more feasible if RedisModuleString would not be bound only to robj+sds but could have a const reference to arbitrary buffers.

@oranagra
Copy link
Member

oranagra commented Feb 8, 2021

that's an interesting idea, so like OBJ_STATIC_REFCOUNT, you won't be able to RM_RetainString these, but you'll be able to RM_StringPtrLen and RM_HoldString them (creating a new heap based string).
and they'll possibly vanish when you exit the scope or do some operation that destroys their underlying data.
is that what you mean?

this could possibly still be an robj, and we have 60 bits for the length.
or we can make a revolution and stop using robj.

@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Feb 8, 2021

RM_ListIteratorDelete() and RM_ListIteratorInsert() can be deleted. Their non-iterator versions can function even if there is an iterator. The iterator can be re-seeked behind the scenes on the next Next() when necessary. The non-iterator function need index, so Next() can return the index also:

int RM_ListIteratorNext(RedisModuleKey *key, long *index, RedisModuleString **elem);

If RedisModuleString can be stack-allocated, that's awesome!

I already drafted on the following non-allocating Next() (before the idea of stack strings):

int RM_ListIteratorNextPtrOrLongLong(RedisModuleKey *key,
                                     long *index,
                                     const char **stringptr,
                                     unsigned int *lenptr,
                                     long long *intptr);

Fetches the next element in a list, either as a char pointer or as an integer
value. If you prefer to fetch the element as a RedisModuleString*, use
RM_ListIteratorNext() instead.

  • key -- A key opened for reading.
  • index -- The 0-based index of the found element is set here. Use NULL
    if you don't want to receive the index of the element.
  • stringptr -- A pointer to a string where the the result is returned when
    the element cannot be represented as a long long. May be set
    to NULL if you're only interested in integer elements.
  • lenptr -- The length of the string returned at *stringptr.
  • intptr -- if the element contains an integer value, *intptr is set to
    the value and *stringptr is set to NULL.

On success, *index is set and REDISMODULE_OK is returned.

If the element can be represented as a long long, it is set at *intptr, and
*stringptr is set to NULL (if stringptr is provided).

If the element cannot be represented as a long long, *stringptr is set to
the characters of the element and *stringlen is set to its length. Note that
this string is not nul-terminated and must not be modified.

If stringptr is not provided (i.e. it is NULL) and the element cannot be
represented as a long long, REDISMODULE_ERR is returned and errno is set to
ERANGE.

@yossigo
Copy link
Collaborator

yossigo commented Feb 8, 2021

@oranagra Yes I guess it could be a new robj type, the real culprit here is sds. I think that would be the best of all worlds - no copies, no breaking API changes, and no exposing of internals.

@oranagra
Copy link
Member

oranagra commented Feb 8, 2021

robj is 16 bytes (on 64bit systems), has encoding, refcount and lru which are completely useless for our cause.

you may not like it, but here's my proposals:

  1. create a new type of for the type field: #define OBJ_STRING_PTR 15, and use the encoding, refcount and lru fields bits for the length (60 bits), and the ptr as the data.
  2. RedisModuleString is always a heap allocation, so we can set the LSB of the pointer we return to the module to 1 when the string is RedisModuleStaticString allocation rather than an robj, in which case we cast it to that type, a struct which has size_t len, and void* ptr (also 16 bytes on 64bit systems)
  3. if we don't like using the LSB of the pointer we hand to the user, we can reserve one bit in the robj's type member instead, and store lengths of up to 63 bits (note that we can't use a bit of the ptr member, since these are not always heap allocation pointers).

i prefer option 2, considering RedisModuleString is an opaque type, the modules don't need normally need to care how me manage it.
although they will need to know that some operations are impossible with some strings, like they already have (not being able to RM_RetainString OBJ_STATIC_REFCOUNT

@oranagra
Copy link
Member

oranagra commented Feb 8, 2021

p.s. we can also let the module create such strings via some API, when it passes them to redis. e.g. when adding something from the module's own memory to the reply buffers, there's no reason for the module to copy it into a RedisModuleString just so that redis can copy it again into the reply buffers. one of these copies can be eliminated by letting the module create a static string.

@zuiderkwast
Copy link
Contributor Author

Let's move the static string discussion here to this issue: #8473. Feel free to change all the text.

@zuiderkwast
Copy link
Contributor Author

Regarding how to continue this API, here are some more concerns/observations/ideas:

  • Insert and Delete by index is impossible to replicate using a regular command. Perhaps write APIs should mimic the write commands for this reason. The commands LREM, LSET and LINSERT operate by occurence, not by index.
  • Should the iterator mimic the ZsetRange, e.g. Start/Stop, First/Last, Next/Prev? (Changing direction is not yet possible in the quicklist iterator, but it's easy to implement.)
  • FindNext/FindPrev (return index or seek the iterator, maybe both?)

If this design requires too much of you precious time, I will focus on something else for a while.

@oranagra oranagra mentioned this pull request Jul 26, 2021
52 tasks
@zuiderkwast zuiderkwast added the state:needs-design the solution is not obvious and some effort should be made to design it label Aug 20, 2021
List functions operating on elements by index:

* RM_Get
* RM_Set
* RM_Insert
* RM_Delete

List iterator functions:

* RM_ListIteratorStart
* RM_ListIteratorStop
* RM_ListIteratorNext
* RM_ListIteratorInsert
* RM_ListIteratorDelete
@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Aug 30, 2021

So, what we don't want is to have both iterator + index versions of insert/delete/set. I see two ways to continue this.

  1. Let the iterator return the current element and index (e.g. we have next/prev/currentElement/currentIndex) and let the write functions (insert/delete/set) operate on index. We can optimize the insert/delete/set call to the current index, e.g. ListInsert(ListIteratorCurrentIndex(key), newelement), by reusing the current iterator to avoid seeking using a fresh iterator, when index is matching. This optimization can be documented so the user can rely on it.

    API functions, alt. 1
    /* list iterator */
    int RM_ListIteratorStart(RedisModuleKey key); /* OK or ERR */
    int RM_ListIteratorStop(RedisModuleKey key); /* OK or ERR */
    int RM_ListIteratorFirst(RedisModuleKey key); /* OK or ERR */
    int RM_ListIteratorLast(RedisModuleKey key); /* OK or ERR */
    int RM_ListIteratorNext(RedisModuleKey key); /* OK or ERR */
    int RM_ListIteratorPrev(RedisModuleKey key); /* OK or ERR */
    RedisModuleString *RM_ListIteratorCurrentElement(RedisModuleKey key);
    long RM_ListIteratorCurrentIndex(RedisModuleKey key);
    /* other list functions */
    RedisModuleString *RM_ListGet(RedisModuleKey key, long index);
    int *RM_ListSet(RedisModuleKey key, long index, RedisModuleString *value);
    int *RM_ListInsert(RedisModuleKey key, int before_or_after, long index, RedisModuleString *value);
    int *RM_ListDelete(RedisModuleKey key, long index);
  2. To take this one step further, we can even hide the iterator alltogether and just optimize get/set/insert/delete (all by index) so that when index+1 is accessed, the iterator is stepped. No Iterator functions are added to the API. You'd just iterate using for(;;):

    int n = RedisModule_ValueLength(key);
    for (int i = 0; i < n; i++) {
        RedisModuleString *elem = RedisModule_ListGet(key, i);
        RedisModule_ListSet(key, i, new_value);
        ...
    }
    API functions, alt. 2
    /* list functions */
    RedisModuleString *RM_ListGet(RedisModuleKey key, long index);
    int *RM_ListSet(RedisModuleKey key, long index, RedisModuleString *value);
    int *RM_ListInsert(RedisModuleKey key, int before_or_after, long index, RedisModuleString *value);
    int *RM_ListDelete(RedisModuleKey key, long index);

I think #2 presents a very minimal API and allows a single iterator to be used under the hood. The implementation get simpler than #1 since the internal iterator can be used by all functions with less special cases. @yossigo or @oranagra WDYT?

The list iterator API functions are gone. Iterate using a simple
for loop over indices. The functions use an internal iterator as
an optimization.
@zuiderkwast
Copy link
Contributor Author

PR is updated. See the top comment for the current API suggestion.

Copy link
Contributor

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this look like a nice API, it got a lot simpler when hiding iterators.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like the API you proposed, it's a lot easier for the user to use.
on the other hand, it hides the (performance) implications from the user.
so a user that's doing a replace during an iteration, will have no idea that we invalidated the iterator and the next call is gonna be expensive. the user won't even notice that since the testing is likely to be on small lists.
the upside is that we can always later improve the implementation, and it won't affect the API or the modules that use it.

i'd like like to hear @guybe7 @MeirShpilraien @yossigo opinion about this (read this proposal)

@guybe7
Copy link
Collaborator

guybe7 commented Sep 2, 2021

@zuiderkwast nice PR!

@oranagra i like the current API (hidden iterator) and i understand your concern. maybe we can document somewhere that modifying the list inside an iteration loop is possible but may cause a degradation in performance?

@oranagra oranagra added approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus and removed state:needs-design the solution is not obvious and some effort should be made to design it labels Sep 2, 2021
@oranagra
Copy link
Member

oranagra commented Sep 2, 2021

@redis/core-team please approve.
note the considerations specified here and here, and how it looks when compared to the zset iteration and modification APIs.

Copy link
Member

@itamarhaber itamarhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, especially the simple-to-use API.

Copy link
Contributor

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just reviewed the APIs)
I like the APIs as well.

Copy link
Collaborator

@yossigo yossigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the API, LGTM (plus a typo and a doc suggestion).

@oranagra oranagra merged commit ea36d4d into redis:unstable Sep 14, 2021
@oranagra
Copy link
Member

@zuiderkwast thank you for pushing this through 🎉

@zuiderkwast zuiderkwast deleted the modules-api-list branch September 14, 2021 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

7 participants