[Usage]: There is no module or parameter named 'language_model' in Gemma3ForCausalLM #15031

@risqaliyevds

Description:
I'm encountering an error when serving a merged model with vLLM. The merged model was created using the following command:

model.save_pretrained_merged("/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit", tokenizer)

I then start the vLLM server with:

vllm serve /home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit --chat-template /home/mata/llm/data/models/chat_temp/google--gemma-3-12b-it.jinja --gpu-memory-utilization 0.9 --max_model_len 8192

However, the server fails to load the model and returns the following error:

ValueError: There is no module or parameter named 'language_model' in Gemma3ForCausalLM

The stack trace indicates that during the weight loading process, vLLM attempts to locate a submodule or parameter called language_model in the model class Gemma3ForCausalLM, but it isn’t found.
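To confirm what the merged checkpoint actually contains, it helps to list the tensor names in its safetensors shards. The following is a minimal sketch, assuming the merged directory holds *.safetensors shards and the safetensors Python package is installed (the path is the one from the commands above):

from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit")
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            # Names starting with "language_model." indicate the multimodal
            # (Gemma3ForConditionalGeneration-style) layout rather than the
            # text-only Gemma3ForCausalLM layout that vLLM instantiated here.
            print(name)

If the printed names carry a language_model. prefix while config.json lists Gemma3ForCausalLM under architectures, that would explain the mismatch described below.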

Steps to Reproduce:

  1. Merge the model using model.save_pretrained_merged(...) as shown above.
  2. Run the vLLM server with the provided command.
  3. Observe the error during model loading.

Expected Behavior:
The merged model should load without error, and vLLM should serve the model correctly.

Additional Context:
The issue appears to be a mismatch between the structure of the saved merged checkpoint and the architecture vLLM instantiates. The error suggests the checkpoint supplies weights under a language_model prefix (the layout used by the multimodal Gemma3ForConditionalGeneration class), while config.json declares the text-only Gemma3ForCausalLM, which has no language_model submodule, so weight loading fails on the first prefixed tensor. A possible workaround or fix might involve adjusting either the saved weight names or the model loader so the two agree.
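As a purely hypothetical workaround sketch (the output directory name, the language_model. prefix assumption, and the list of multimodal-only prefixes are mine, not from the report), the shards could be rewritten with the prefix stripped so the names match Gemma3ForCausalLM, dropping any tensors the text-only class cannot hold:

import json
from pathlib import Path
from safetensors.torch import load_file, save_file

ckpt_dir = Path("/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit")
out_dir = ckpt_dir.with_name("merged_16bit_text_only")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

PREFIX = "language_model."
SKIP = ("vision_tower.", "multi_modal_projector.")  # multimodal-only weights, if present

def rename(name):
    return name[len(PREFIX):] if name.startswith(PREFIX) else name

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    tensors = load_file(str(shard))
    kept = {rename(n): t for n, t in tensors.items() if not n.startswith(SKIP)}
    save_file(kept, str(out_dir / shard.name), metadata={"format": "pt"})

# Keep the shard index consistent with the renamed tensors.
index_path = ckpt_dir / "model.safetensors.index.json"
if index_path.exists():
    index = json.loads(index_path.read_text())
    index["weight_map"] = {rename(n): f for n, f in index["weight_map"].items()
                           if not n.startswith(SKIP)}
    (out_dir / index_path.name).write_text(json.dumps(index, indent=2))

The remaining files (config.json, tokenizer files, generation config) would be copied into the new directory unchanged; alternatively, fixing the merge step or vLLM's loader, as requested below, would avoid rewriting checkpoints at all.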

Request:
Could you please investigate whether the issue lies with the merging process (i.e., save_pretrained_merged) or with vLLM's model loading logic? Any guidance or workaround would be appreciated.

Full error

(venv-llm) azureuser@a100gpu:/home/mata/llm$ vllm serve /home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit --chat-template /home/mata/llm/data/models/chat_temp/google--gemma-3-12b-it.jinja --gpu-memory-utilization 0.9 --max_model_len 8192
INFO 03-18 11:37:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-18 11:37:32 [api_server.py:972] vLLM API server version 0.8.0rc3.dev6+g53a0cf8b
INFO 03-18 11:37:32 [api_server.py:973] args: Namespace(subparser='serve', model_tag='/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/home/mata/llm/data/models/chat_temp/google--gemma-3-12b-it.jinja', chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', 
scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7aae78540f40>)
INFO 03-18 11:37:38 [config.py:583] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-18 11:37:38 [config.py:1677] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-18 11:37:44 [__init__.py:256] Automatically detected platform cuda.
INFO 03-18 11:37:47 [core.py:53] Initializing a V1 LLM engine (v0.8.0rc3.dev6+g53a0cf8b) with config: model='/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit', speculative_config=None, tokenizer='/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-18 11:37:47 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fda449f1b10>
INFO 03-18 11:37:48 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-18 11:37:48 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 03-18 11:37:48 [gpu_model_runner.py:1128] Starting to load model /home/mata/llm/data/models/tuned/unsloth/gemma-3-12b-it-bnb-4bit-17-03-2025/checkpoint-625/merged_16bit...
WARNING 03-18 11:37:48 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
ERROR 03-18 11:37:48 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/v1/engine/core.py", line 332, in run_engine_core
ERROR 03-18 11:37:48 [core.py:340]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-18 11:37:48 [core.py:340]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/v1/engine/core.py", line 287, in __init__
ERROR 03-18 11:37:48 [core.py:340]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/v1/engine/core.py", line 59, in __init__
ERROR 03-18 11:37:48 [core.py:340]     self.model_executor = executor_class(vllm_config)
ERROR 03-18 11:37:48 [core.py:340]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 03-18 11:37:48 [core.py:340]     self._init_executor()
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 03-18 11:37:48 [core.py:340]     self.collective_rpc("load_model")
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 03-18 11:37:48 [core.py:340]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 03-18 11:37:48 [core.py:340]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/utils.py", line 2216, in run_method
ERROR 03-18 11:37:48 [core.py:340]     return func(*args, **kwargs)
ERROR 03-18 11:37:48 [core.py:340]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 03-18 11:37:48 [core.py:340]     self.model_runner.load_model()
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/v1/worker/gpu_model_runner.py", line 1131, in load_model
ERROR 03-18 11:37:48 [core.py:340]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 03-18 11:37:48 [core.py:340]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 03-18 11:37:48 [core.py:340]     return loader.load_model(vllm_config=vllm_config)
ERROR 03-18 11:37:48 [core.py:340]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/model_executor/model_loader/loader.py", line 426, in load_model
ERROR 03-18 11:37:48 [core.py:340]     loaded_weights = model.load_weights(
ERROR 03-18 11:37:48 [core.py:340]                      ^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/model_executor/models/gemma3.py", line 528, in load_weights
ERROR 03-18 11:37:48 [core.py:340]     return loader.load_weights(weights)
ERROR 03-18 11:37:48 [core.py:340]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 03-18 11:37:48 [core.py:340]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 03-18 11:37:48 [core.py:340]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 11:37:48 [core.py:340]   File "/home/mata/llm/src/benchmark/vllm/vllm/model_executor/models/utils.py", line 224, in _load_module
ERROR 03-18 11:37:48 [core.py:340]     raise ValueError(msg)
ERROR 03-18 11:37:48 [core.py:340] ValueError: There is no module or parameter named 'language_model' in Gemma3ForCausalLM
ERROR 03-18 11:37:48 [core.py:340] 
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

CRITICAL 03-18 11:37:48 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
