server: enable checkpoint for recurrent models#1310
Conversation
Force-pushed from 03c1862 to 8c298f8
|
Is the kv cache being deleted for every batch?

INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140581237866496" timestamp=1771922359 id_slot=0 id_task=216 p0=0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140581237866496" timestamp=1771922366 id_slot=0 id_task=216 p0=4096

The console says "Common part does not match fully", but the text of the cache and the prompt are identical. Even without deleting any message, the whole context is still being rebuilt from position 0. This is without the interval, btw; with the interval only a small portion is being rebuilt. That probably means the cache is the problem, since my front end should be sending the exact same prompt every time.
|
Better check whether your front end injects or changes anything in your prompt. If you see "Common part does not match fully", your prompt does not fully contain your cache. Check the log to see the difference.
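For readers unfamiliar with the message: the "common part" check amounts to a longest-common-prefix comparison between the tokens already in the kv cache and the incoming prompt. The sketch below is purely illustrative (the function and type names are made up, not the server's actual code); it just shows the condition under which the server falls back to re-processing from the divergence point.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: count how many leading tokens of the new prompt
// match the tokens already in the kv cache. Everything from that point
// on has to be re-processed ("kv cache rm [p0, end)").
static size_t common_prefix_len(const std::vector<int32_t> &cached,
                                const std::vector<int32_t> &prompt) {
    const size_t n = std::min(cached.size(), prompt.size());
    size_t i = 0;
    while (i < n && cached[i] == prompt[i]) ++i;
    return i;
}

// "Common part does not match fully" corresponds to the case where the
// common prefix is shorter than the cached sequence, i.e. the new prompt
// does not fully contain what is already in the cache.
static bool prompt_contains_cache(const std::vector<int32_t> &cached,
                                  const std::vector<int32_t> &prompt) {
    return common_prefix_len(cached, prompt) == cached.size();
}
```

If even one token differs early (an injected timestamp, a reformatted system prompt, a changed template), the common prefix collapses to that position and everything after it is rebuilt.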
|
LGTM, but can we have 2-3 people confirm that it works for them? |
|
String ban works again. |
|
Updated to save only the recurrent cache for the checkpoint. This reduces the checkpoint size at longer context sizes.
|
@firecoperana Could you try this? Send a long prompt. Because I did this with b3cf43e and, as expected, the kv cache didn't need to be rebuilt. True, I did this with SillyTavern, and that program has a million settings, so missing one that changes the prompt isn't impossible. But I kept the same browser window open for both tests, same ST session; I only changed the ik PR. This, btw, I tried both with Qwen3.5 and MinimaxM2.5, which of course isn't hybrid.
|
These two are thinking models. Did the old prompt get saved to RAM when it was rebuilt from 0? I need to see your detailed logs to understand what's going on. I tried with Qwen3, and it worked fine.
Fair, I've been lazy with the tests on this one. I'm using --reasoning-format none and handling thinking tags at the front-end level using a custom Jinja template. This might sound strange, but it's the only way to get every model to work in both thinking and non-thinking mode: thinking tags are treated as if they were normal text, and the Jinja template outputs everything unprocessed. Again, this works for every model in main, Qwen3.5 included, but not in this PR. But I don't think the thinking is the problem, because I have a 12000-token system prompt, and in non-hybrid models like minimax-m2.5, even if the thinking doesn't match, I would still expect those 12000 tokens not to be recalculated. I'll compare everything with -v properly, size of the .bin file included. I'll also test 68bd30d specifically, which, unless I'm wrong, is this PR's parent. This PR might not be the one that introduced the problem after all.
|
I checked out your branch, merged it with the main branch at 0bf7043, and got a problem when running Qwen3 Coder Next.
Force-pushed from 95555e2 to b82f0ef
- create checkpoint after cancel
- fix ban string and rm context during rewind
- add checkpoint interval
- only save recurrent cache
Force-pushed from b82fe0f to 2338987
Yes, with a non-hybrid model, processing from 0 does not look right. I just synced to the latest main and removed some code that could cause different behavior. See if that fixed it.
If you continue to send new prompts, does it work after that? I've seen this happen with other models, so it may not be related to this PR. It could also indicate some issue with the kv cache, but I'm not familiar with it. @ikawrakow Do you know what is wrong?
|
Doesn't one need to take a snapshot after processing the system prompt but before any other tokens have been added? Otherwise the system prompt will always have to be re-processed, so people using long system prompts (like @MrHills-rs) will definitely notice.
Yes, I noticed that checkpoints are created during TG and not at the end of PP, at least when --ctx-checkpoints-interval is used; I'm not sure what happens when it's unset. This is also a problem when starting a conversation with a very long user message, for example when you want to feed a long text to the AI and ask questions about it, or when loading a large old conversation. In both cases you're unable to swipe the first AI answer, which isn't optimal. Ideally one would create checkpoints both during TG and during PP, so that when loading a large old conversation one can actually delete a few messages without having to rebuild. Now, thanks to everyone, the kv cache is extremely efficient: 131k f16 is only around 4 GB, plus half a GB for every checkpoint with Qwen3.5. Very long contexts are going to be increasingly common even for consumers, so I think this is quite important.
Do you have an easy reproduction? |
No. I ran my agent loop, which just appends new messages to the context and sends it over. After this the output is total junk, and I got the problem of leaking context (my agent A got agent B's context, very weird). I suspect this is related to the save/load and slot-picking mechanism, because it happens when I run 2-3 loops in parallel.
|
It was DeepSeek V2.5. What fixed it for me back then was to modify the prompt slightly, after which it generated the response normally. I thought it could just be a poor model and didn't investigate further. @MrHills-rs The interval of checkpoints created during PP must be a multiple of the batch size, but it's better than no checkpoints at all.
|
@chulucninh09 Can you pull the latest PR, which has a fix for prompts sent in parallel? #1303 |
Yeah, I suspected that. Unfortunately this means that if you save a checkpoint before the last batch, you don't really know how many tokens before generation it includes; it can be anything between 1 and the batch size. Well, with a fresh first prompt you'll have no checkpoints anyway, so you might as well save one after processing each of the last --ctx-checkpoints batches. Even if you have to redo a whole batch, that's generally not that time-consuming, and it's far better than redoing a whole 100k+ tokens.
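The batch-boundary constraint discussed above can be pictured with a small sketch (a hypothetical helper, not this PR's code): during PP, candidate checkpoint positions only exist at the end of each batch, so the effective spacing is the requested interval rounded up to the next batch boundary.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: compute where checkpoints could be taken while
// processing a prompt of n_prompt tokens in batches of n_batch tokens,
// with a minimum token interval between checkpoints. Positions are only
// available at batch boundaries, so the real spacing between two
// checkpoints is the interval rounded up to a multiple of n_batch.
static std::vector<size_t> pp_checkpoint_positions(size_t n_prompt,
                                                   size_t n_batch,
                                                   size_t interval) {
    std::vector<size_t> positions;
    size_t last = 0;  // position of the previous checkpoint
    for (size_t pos = n_batch; pos <= n_prompt; pos += n_batch) {
        if (pos - last >= interval) {
            positions.push_back(pos);
            last = pos;
        }
    }
    return positions;
}
```

For example, with a 4096-token prompt, a batch size of 512, and an interval of 1000, checkpoints land at 1024, 2048, 3072, and 4096: the 1000-token interval is effectively rounded up to 1024, the nearest batch boundary.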
|
2338987 works. I can save and load caches from .bin, and the cache isn't rebuilt completely after every message, both for MiniMax and Qwen3.5. I've had a little garbled output every once in a while, but that happened with main too, and it may very well be because of my really tight IQ2_XS quant. It gets really bad at long context, but no one expects a 2bpw model to do well at 40k ctx. Next week I'll have more RAM and I'll test IQ4_XS in depth. For now, everything looks fine.
|
@firecoperana it's still happening. For more information, this is my git log; I checked out your branch and merged from main.
|
People have reported that batch processing does not work very well as of now. Your issue may not be related to this PR.
This PR enables checkpoints for recurrent models without the need to port the recurrent-cache code from mainline. It always creates a checkpoint after the prompt is fully processed and the response ends.
- `--ctx-checkpoints`: set the number of checkpoints per slot.
- `--ctx-checkpoints-interval`: minimum number of tokens between context checkpoints. Set it to a small value to create checkpoints more frequently. If set to a positive number, checkpoints are saved during TG at this interval; during PP, a checkpoint can only be saved once per batch, so the value acts as a minimum number of tokens between checkpoints.
- Use `llm_arch_is_hybrid` to replace `QWEN3NEXT` and `QWEN35MOE` when dealing with recurrent/hybrid models.
- Fix the bug that ban strings did not work and the kv cache was not removed. @SneedwareInc