Parallel Python execution for tool completion #470
    gen_item.completed = True
    continue
...
    pool = ThreadPool(CPU_COUNT)
It's probably fine to use a single process to start subprocesses when the CPU count is low.
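The pattern under review can be sketched roughly as below. Names like `run_tool` and the serial fallback threshold are assumptions for illustration, not the PR's actual code; only `ThreadPool(CPU_COUNT)` comes from the diff above.

```python
import os
import subprocess
import sys
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for the PR's CPU_COUNT.
CPU_COUNT = os.cpu_count() or 1


def run_tool(code: str) -> str:
    # Each worker thread starts its own child process; the thread blocks
    # in a C-level wait on the child, during which the GIL is released.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def run_all(snippets):
    # Hypothetical serial fallback when there is little parallelism to gain.
    if CPU_COUNT <= 2:
        return [run_tool(s) for s in snippets]
    with ThreadPool(CPU_COUNT) as pool:
        return pool.map(run_tool, snippets)
```

`pool.map` preserves input order, so results line up with the prompts that produced them.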
Georgepu1 left a comment:
lgtm, thanks for investigating! btw, were you able to validate that the completions were close enough between the "467s total, 283s tool use" and "862s total, 687s tool use" runs? Just want to make sure nothing weird is happening on the generation side.
There are actually differences, but for a hand-picked sample I can't repro either result for some reason. I wonder if this is due to some tiny randomness in vLLM or in the Python execution. Will investigate a bit more.
Took a bit more of a look into this, @Georgepu1: 15/1000 output texts changed. I manually checked 5 of them and I don't think the Python code execution results are different; the differences appear to come from randomness in sampling, and the sentences in both cases made sense to me.
model-engine/model_engine_server/inference/batch_inference/vllm_batch.py
…ine into yunfeng-parallel-python
Pull Request Summary
Tested with some sample data (1000 prompts) on my devbox (96 CPU cores). Since I used a ThreadPool to start the subprocesses, process startup might still be slowed by the GIL, so utilization across the 96 cores is probably not high.
With this change: 467s total, 283s tool use
Without this change: 862s total, 687s tool use
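To illustrate why a ThreadPool helps here despite the GIL: the parent thread blocks in a C-level wait while the child process runs, and that wait releases the GIL, so child processes overlap in wall-clock time. A minimal timing sketch (not the PR's code; the 0.5s sleep and pool size of 8 are arbitrary):

```python
import subprocess
import sys
import time
from multiprocessing.pool import ThreadPool


def run_sleep(_):
    # Blocks waiting on the child; the GIL is released during the wait.
    subprocess.run([sys.executable, "-c", "import time; time.sleep(0.5)"])


start = time.monotonic()
with ThreadPool(8) as pool:
    pool.map(run_sleep, range(8))
elapsed = time.monotonic() - start

# Eight half-second children run concurrently, so total wall time is well
# under the ~4s a serial loop would take (plus interpreter startup overhead).
print(f"{elapsed:.2f}s")
```

The remaining serial cost is the Python-side work of forking/execing each child, which is one plausible reason core utilization stayed low on the 96-core box.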
Test Plan and Usage Guide
Tested with sample data.
Also verified that the CPU count detection works both inside and outside a container.
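On the container point, a common gotcha is that `os.cpu_count()` reports the host's cores even when a container restricts the cpuset, while `os.sched_getaffinity(0)` reflects the CPUs the process may actually run on. A sketch of a container-aware count (an assumption about the approach, not necessarily what the PR does):

```python
import os


def usable_cpu_count() -> int:
    # sched_getaffinity honors cpuset restrictions (e.g. a container's
    # CPU pinning); it is Linux-only, so fall back elsewhere (e.g. macOS).
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1
```

Note that CPU-quota limits (cfs quota, as opposed to cpuset pinning) are not visible through either call and would need a cgroup file check if they matter.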