I wondered whether the embeddings endpoint uses parallelism / elastic scaling to process the multiple documents in a single request in parallel. After a few short experiments, it appears that it does not. Consequently, there is no need to maximize your batch size and send a single request per minute when embedding huge corpora; you can split the corpus into a larger number of smaller requests in favor of more fluent progress updates, unless network overhead becomes significant. This observation is only a snapshot of the current server load and the ada-002 model, though, and the behavior might change in the future. Maybe this saves someone else a few minutes.
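For reference, the kind of batching I mean looks roughly like this — a minimal sketch assuming the pre-1.0 Python `openai` client; `embed_corpus` and `batch_size` are just illustrative names:

```python
import openai  # assumes the pre-1.0 Python client (openai.Embedding.create)

def embed_corpus(chunks, batch_size=64, model="text-embedding-ada-002"):
    """Embed a corpus in several smaller requests so progress can be reported."""
    embeddings = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        resp = openai.Embedding.create(model=model, input=batch)
        # one embedding per input string, returned in order
        embeddings.extend(item["embedding"] for item in resp["data"])
        print(f"embedded {min(start + batch_size, len(chunks))}/{len(chunks)} chunks")
    return embeddings
```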
You are likely referring to the embeddings endpoint’s ability to take not just a string, but also a list of strings (array), and return an embedding for each of them.
There is a clue that lets you understand how this works: the maximum you can send in total, across all strings, is still 8k tokens.
If the endpoint were dispatching multiple inputs to multiple AI instances, this limit on the total input would not make sense.
What does make sense is that the entire embedding request is loaded into a single AI context, and the hidden embedding state after each sequence, as it is processed individually by that AI, is returned. There are “resets” at each input boundary as the model works through the context.
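To illustrate the array input — a minimal sketch, assuming the pre-1.0 Python client; the texts are placeholders:

```python
import openai

texts = ["first document chunk", "second document chunk", "third document chunk"]

# One request, several inputs: the token count summed over all strings
# still has to fit within the model's ~8k-token limit.
resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)

for item in resp["data"]:
    # one 1536-dimensional vector per input string, in the same order
    print(item["index"], len(item["embedding"]))
```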
Interestingly though, with language inference you get a significant speedup with n>1, which asks for multiple outputs for the same prompt.
This has to be more than just precalculating the state from the shared input once for all instances, given the magnitude of the increase in total token rate for such a job. Parallelism is indicated.
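For example, with the completions endpoint (the model name here is only illustrative):

```python
import openai

# n > 1 asks for several completions of the same prompt in one request
resp = openai.Completion.create(
    model="text-davinci-003",   # illustrative; any completion model works
    prompt="Write a one-line summary of what an embedding is.",
    n=4,
    max_tokens=40,
)

for choice in resp["choices"]:
    print(choice["index"], choice["text"].strip())
```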
Yes, you are right, thank you for the additional context. I do need to investigate making parallel requests, although honestly this is something the API should handle for me. I noticed https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py, but I will have to port it to the language I am working with.
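The core idea to port is small, something like this much-simplified Python sketch (a plain thread pool, without the rate-limit and retry handling the cookbook script adds; the function names are just illustrative):

```python
import concurrent.futures
import openai

def embed_batch(batch, model="text-embedding-ada-002"):
    resp = openai.Embedding.create(model=model, input=batch)
    return [item["embedding"] for item in resp["data"]]

def embed_parallel(batches, max_workers=4):
    # several embedding requests in flight at once; results keep batch order
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_batch, batches))
```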
What are you referring to by “language inference”? Are we still talking about embeddings?
Inference → completion that deduces the best output → talking to a chatbot