
Conversation

@abetlen (Owner) commented Jan 23, 2024

Uses prompt lookup decoding but the draft model class can be extended to support almost any existing method.

Server Usage

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --draft_model=prompt-lookup-decoding --draft_model_num_pred_tokens=2
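The server exposes the usual OpenAI-compatible endpoints, so speculative decoding is transparent to clients. A minimal request sketch, assuming the server above is running with its defaults (localhost:8000); the prompt and parameters are illustrative only:

import requests

# The /v1/completions endpoint is unchanged; prompt lookup decoding
# happens entirely server-side.
response = requests.post(
    "https://localhost:8000/v1/completions",
    json={"prompt": "def fibonacci(n):", "max_tokens": 128, "temperature": 0},
)
print(response.json()["choices"][0]["text"])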

Python Usage

>>> from llama_cpp import Llama
>>> from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
>>> llm = Llama(
...     model_path="./models/7B/llama-model.gguf",
...     draft_model=LlamaPromptLookupDecoding(
...         num_pred_tokens=10,  # good default for GPU offloading; 2 is better for CPU-only machines
...         max_ngram_size=2,  # matches the huggingface implementation and worked best for me as well
...     )
... )
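For reference, the core of prompt lookup decoding is just an n-gram match over the context: find the most recent earlier occurrence of the trailing max_ngram_size tokens and propose the tokens that followed it as the draft. A rough sketch of that lookup step (the function name and shape of this code are illustrative assumptions, not necessarily how LlamaPromptLookupDecoding implements it):

import numpy as np

def find_draft_tokens(input_ids, max_ngram_size=2, num_pred_tokens=10):
    # Try the longest n-gram first, falling back to shorter ones.
    for ngram_size in range(min(max_ngram_size, len(input_ids) - 1), 0, -1):
        ngram = input_ids[-ngram_size:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if np.array_equal(input_ids[start:start + ngram_size], ngram):
                end = start + ngram_size
                # Propose the tokens that followed the earlier match.
                return input_ids[end:end + num_pred_tokens]
    return np.empty(0, dtype=input_ids.dtype)  # no match, nothing to draft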

Performance

This is a very dumb / easy example but it looks like it's working!

With prompt lookup decoding

(screenshot)

Without prompt lookup decoding

(screenshot)

Closes #675

@abetlen changed the title from "Add speculative decoding support" to "Add speculative decoding" on Jan 23, 2024
@abetlen (Owner, Author) commented Jan 24, 2024

Tried this on a more realistic example and got worse performance; I think I'll need to tune / implement a heuristic for draft models similar to https://huggingface.co/blog/assisted-generation:

Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise.
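Applied to the draft model after each verification step, that heuristic could look roughly like the sketch below. The function, the num_accepted/num_drafted arguments, and the assumption that num_pred_tokens is mutable on the draft model are illustrative, not part of this PR's API:

def adjust_num_pred_tokens(draft_model, num_accepted, num_drafted,
                           min_tokens=1, max_tokens=20):
    # Heuristic from the HF assisted-generation post: grow the draft
    # length by 2 when every drafted token was accepted, otherwise
    # shrink it by 1 (clamped to a sane range).
    if num_drafted > 0 and num_accepted == num_drafted:
        draft_model.num_pred_tokens = min(draft_model.num_pred_tokens + 2, max_tokens)
    else:
        draft_model.num_pred_tokens = max(draft_model.num_pred_tokens - 1, min_tokens)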

@abetlen (Owner, Author) commented Jan 24, 2024

Added the adaptive heuristic and it does do better, but generation is still occasionally slower even with temperature=0; will need to investigate.

@oobabooga (Contributor) commented:
Highly appreciated PR. Is it possible to make prompt_lookup_num_tokens a generation parameter on the same footing as temperature as is done in the transformers library? That would make it possible to change that parameter without having to reload the model.

@abetlen (Owner, Author) commented Jan 25, 2024

@oobabooga I saw that; I was using the hf implementation as a reference. I could add it as a general num_pred_tokens parameter, since I want to keep it open to other implementations of speculative decoding. I'll think on that one.

@abetlen (Owner, Author) commented Jan 31, 2024

@oobabooga going to merge this now. To update the draft model or its properties without re-creating the entire Llama class, you can access llm.draft_model directly; set it to None to disable speculative decoding.
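In other words, something like the following should work between generations without reloading the model weights (parameter values are just the ones from the example above):

from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Swap in a differently tuned draft model between generations...
llm.draft_model = LlamaPromptLookupDecoding(num_pred_tokens=2, max_ngram_size=2)

# ...or disable speculative decoding entirely.
llm.draft_model = None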

@abetlen merged commit fb762a6 into main on Jan 31, 2024
@oobabooga (Contributor) commented:
Awesome, thanks @abetlen!
