Description
Which version of LM Studio?
Example: LM Studio 0.4.3
Which operating system?
Windows
What is the bug?
- Making a request to `/v1/embeddings` always returns responses with `prompt_tokens` and `total_tokens` set to 0 (tested with 4 different embedding models, please see the screenshot below):
```json
...
"usage": {
  "prompt_tokens": 0,
  "total_tokens": 0
}
...
```
- Tried the same process with Ollama and it worked without any issue (please check the last screenshot below).
Screenshots
- LM Studio:
- Ollama:
Logs
2026-02-22 13:51:00 [DEBUG]
[INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (4096). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: L:\lmstudio-models\gpustack\bge-reranker-v2-m3-GGUF\bge-reranker-v2-m3-Q8_0.gguf
2026-02-22 13:51:00 [DEBUG]
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
2026-02-22 13:51:01 [DEBUG]
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1060 6GB) (0000:01:00.0) - 5189 MiB free
2026-02-22 13:51:01 [DEBUG]
llama_model_loader: loaded meta data with 35 key-value pairs and 393 tensors from L:\lmstudio-models\gpustack\bge-reranker-v2-m3-GGUF\bge-reranker-v2-m3-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Bge M3
llama_model_loader: - kv 3: general.organization str = BAAI
llama_model_loader: - kv 4: general.size_label str = 568M
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.tags arr[str,4] = ["transformers", "sentence-transforme...
llama_model_loader: - kv 7: general.languages arr[str,1] = ["multilingual"]
llama_model_loader: - kv 8: bert.block_count u32 = 24
llama_model_loader: - kv 9: bert.context_length u32 = 8192
llama_model_loader: - kv 10: bert.embedding_length u32 = 1024
llama_model_loader: - kv 11: bert.feed_forward_length u32 = 4096
llama_model_loader: - kv 12: bert.attention.head_count u32 = 16
llama_model_loader: - kv 13: bert.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 14: general.file_type u32 = 7
llama_model_loader: - kv 15: bert.attention.causal bool = false
llama_model_loader: - kv 16: tokenizer.ggml.model str = t5
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
2026-02-22 13:51:01 [DEBUG]
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,250002] = ["<s>", "<pad>", "</s>", "<unk>", ","...
2026-02-22 13:51:01 [DEBUG]
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,250002] = [0.000000, 0.000000, 0.000000, 0.0000...
2026-02-22 13:51:01 [DEBUG]
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,250002] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.add_space_prefix bool = true
llama_model_loader: - kv 22: tokenizer.ggml.token_type_count u32 = 1
llama_model_loader: - kv 23: tokenizer.ggml.remove_extra_whitespaces bool = true
2026-02-22 13:51:01 [DEBUG]
llama_model_loader: - kv 24: tokenizer.ggml.precompiled_charsmap arr[u8,237539] = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 28: tokenizer.ggml.seperator_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.cls_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.mask_token_id u32 = 250001
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - type f32: 247 tensors
llama_model_loader: - type q8_0: 146 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 599.70 MiB (8.86 BPW)
2026-02-22 13:51:02 [DEBUG]
load: model vocab missing newline token, using special_pad_id instead
2026-02-22 13:51:02 [DEBUG]
load: 0 unused tokens
2026-02-22 13:51:02 [DEBUG]
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 4
2026-02-22 13:51:02 [DEBUG]
load: token to piece cache size = 2.1668 MB
print_info: arch = bert
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 1024
print_info: n_embd_inp = 1024
print_info: n_layer = 24
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 1.0e-05
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 4096
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 335M
print_info: model params = 567.75 M
print_info: general.name = Bge M3
print_info: vocab type = UGM
print_info: n_vocab = 250002
print_info: n_merges = 0
print_info: BOS token = 0 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 3 '<unk>'
print_info: SEP token = 2 '</s>'
print_info: PAD token = 1 '<pad>'
print_info: MASK token = 250001 '[PAD250000]'
print_info: LF token = 0 '<s>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
2026-02-22 13:51:02 [DEBUG]
load_tensors: offloading output layer to GPU
load_tensors: offloading 23 repeating layers to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 291.41 MiB
load_tensors: CUDA0 model buffer size = 308.29 MiB
2026-02-22 13:51:02 [DEBUG]
common_init_result: added </s> logit bias = -inf
2026-02-22 13:51:02 [DEBUG]
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
2026-02-22 13:51:02 [DEBUG]
llama_context: CUDA_Host output buffer size = 0.96 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
2026-02-22 13:51:02 [DEBUG]
sched_reserve: CUDA0 compute buffer size = 168.02 MiB
sched_reserve: CUDA_Host compute buffer size = 112.05 MiB
sched_reserve: graph nodes = 779
sched_reserve: graph splits = 4 (with bs=4096), 2 (with bs=1)
sched_reserve: reserve took 25.16 ms, sched copies = 1
2026-02-22 13:51:02 [DEBUG]
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2026-02-22 13:51:02 [DEBUG]
[INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
2026-02-22 13:51:07 [DEBUG]
Received request: POST to /v1/embeddings with body {
"model": "text-embedding-bge-reranker-v2-m3",
"input": "Lorem ipsum dolor sit amet, consectetur adipiscing... <Truncated in logs> ...tristique ipsum, nec semper magna condimentum non."
}
2026-02-22 13:51:07 [INFO]
Received request to embed: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec ultrices sapien eu placerat vehicula....
2026-02-22 13:51:07 [INFO]
Returning embeddings (not shown in logs)
To Reproduce
Steps to reproduce the behavior:
- Start Dev Server
- Make a `POST` request to `/v1/embeddings` with this JSON request body:
```json
{
"model": "text-embedding-bge-reranker-v2-m3",
"input": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec ultrices sapien eu placerat vehicula. Suspendisse nibh mi, iaculis ut ligula posuere, iaculis ornare magna. Suspendisse vitae nulla ut odio sodales maximus eu quis felis. Donec suscipit non erat eget dignissim. Integer aliquet mattis rutrum. Donec blandit ullamcorper erat, et finibus lorem. Etiam rutrum, urna in iaculis malesuada, est velit maximus quam, ut faucibus ante magna vel leo. Phasellus velit lectus, tristique sit amet iaculis eget, luctus eget tellus. Phasellus congue gravida gravida. Praesent et eros vel elit pharetra sollicitudin. Fusce ac tellus elementum, rutrum massa eu, eleifend elit. Phasellus iaculis nibh quis rutrum fermentum. Aliquam euismod interdum mi. Nulla porta odio tortor, varius semper nulla dapibus et. Donec egestas velit in diam scelerisque, at ultricies mi tempus. Aliquam aliquam ante et odio ullamcorper, eu commodo lacus porta. Suspendisse cursus ligula at ipsum pharetra mollis. Quisque varius fermentum turpis. Integer laoreet ligula leo, ut aliquet nibh interdum sit amet. Suspendisse in eleifend odio, interdum cursus leo. Duis a purus elit. Ut consectetur mollis dui, nec luctus mauris pretium nec. In et tellus ligula. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras quis mi non tortor maximus lacinia. Ut tempor fermentum nisi, ac aliquam lacus cursus eu. Nam sagittis odio sit amet enim facilisis imperdiet nec vel augue. Fusce diam sem, semper vel tempus eu, finibus id turpis."
}
```
- Check the response
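For convenience, the request above can also be scripted. A minimal sketch using only the Python standard library; the helper name and the base URL `http://localhost:1234` (LM Studio's default server address) are assumptions, not part of the report:

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build the POST /v1/embeddings request from the steps above.

    Hypothetical helper for illustration; base_url must point at a
    running OpenAI-compatible server.
    """
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# localhost:1234 is LM Studio's default server port (an assumption here).
req = build_embeddings_request(
    "http://localhost:1234",
    "text-embedding-bge-reranker-v2-m3",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending it with `urllib.request.urlopen(req)` against a running server and inspecting `usage` in the JSON response reproduces the zero counters described above.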
Why this issue matters
Some RAG systems use the usage parameter to verify that embeddings were generated correctly before saving them to memory (database, files, etc.).
If usage is 0, the system may treat the embedding as failed and store [0, 0, 0, ...] instead of the real vector.
I faced this exact issue while creating a PR for Flowise to integrate LM Studio. For more details, please check the "Important Note" section of my Flowise pull request.
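To make the failure mode concrete, here is a minimal sketch of the kind of validation described above. The function name and the non-zero counts in the "healthy" response are illustrative assumptions, not taken from any specific RAG implementation:

```python
def embedding_looks_valid(response: dict) -> bool:
    """Sanity check in the style described above: accept the embedding
    only when the server reported non-zero token usage."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens", 0) > 0 and usage.get("total_tokens", 0) > 0


# Shape currently returned by LM Studio (see the bug report above):
lmstudio_response = {"usage": {"prompt_tokens": 0, "total_tokens": 0}}
# A healthy response with illustrative non-zero counters:
healthy_response = {"usage": {"prompt_tokens": 310, "total_tokens": 310}}

print(embedding_looks_valid(lmstudio_response))  # False -> vector treated as failed
print(embedding_looks_valid(healthy_response))   # True
```

With the current LM Studio response, a check like this rejects every embedding, even though the vectors themselves are correct.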