[inductor] dynamo benchmark model dm_nfnet_f0 fails with torch._inductor.exc.InductorError: KeyError: 'op566' #151423

@jianyizh

🐛 Describe the bug

After #150845, the CI log https://ossci-raw-job-status.s3.amazonaws.com/log/40563911423 shows that training of the timm model dm_nfnet_f0 fails:
2025-04-15T10:14:44.3504086Z loading model: 0it [00:00, ?it/s]
2025-04-15T10:14:44.3504436Z loading model: 0it [00:02, ?it/s]
2025-04-15T10:14:44.3504749Z cuda train dm_nfnet_f0
2025-04-15T10:15:27.7967917Z ERROR:common:Backend dynamo failed in warmup()
2025-04-15T10:15:27.7968391Z Traceback (most recent call last):
2025-04-15T10:15:27.7969315Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/common.py", line 2533, in warmup
2025-04-15T10:15:27.7969857Z fn(model, example_inputs)
2025-04-15T10:15:27.7970432Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 671, in _fn
2025-04-15T10:15:27.7971149Z raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
2025-04-15T10:15:27.7971921Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 766, in _compile_fx_inner
2025-04-15T10:15:27.7972662Z raise InductorError(e, currentframe()).with_traceback(
2025-04-15T10:15:27.7973394Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 750, in _compile_fx_inner
2025-04-15T10:15:27.7974193Z mb_compiled_graph = fx_codegen_and_compile(
2025-04-15T10:15:27.7974913Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1356, in fx_codegen_and_compile
2025-04-15T10:15:27.7975793Z return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
2025-04-15T10:15:27.7976659Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1245, in codegen_and_compile
2025-04-15T10:15:27.7977368Z compiled_module = graph.compile_to_module()
2025-04-15T10:15:27.7978026Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2205, in compile_to_module
2025-04-15T10:15:27.7978679Z return self._compile_to_module()
2025-04-15T10:15:27.7979316Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2213, in _compile_to_module
2025-04-15T10:15:27.7980099Z self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
2025-04-15T10:15:27.7980826Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2150, in codegen
2025-04-15T10:15:27.7981441Z self.scheduler.codegen()
2025-04-15T10:15:27.7982034Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 4309, in codegen
2025-04-15T10:15:27.7982649Z else self._codegen(self.nodes)
2025-04-15T10:15:27.7983255Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 4445, in _codegen
2025-04-15T10:15:27.7983909Z self.get_backend(device).codegen_node(node)
2025-04-15T10:15:27.7984680Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 104, in codegen_node
2025-04-15T10:15:27.7985469Z return self._triton_scheduling.codegen_node(node)
2025-04-15T10:15:27.7986172Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1318, in codegen_node
2025-04-15T10:15:27.7986838Z return self.codegen_node_schedule(
2025-04-15T10:15:27.7987542Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1359, in codegen_node_schedule
2025-04-15T10:15:27.7988558Z self.codegen_node_schedule_with_kernel(node_schedule, kernel)
2025-04-15T10:15:27.7989397Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1439, in codegen_node_schedule_with_kernel
2025-04-15T10:15:27.7990150Z node.decide_inplace_update()
2025-04-15T10:15:27.7990800Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 550, in decide_inplace_update
2025-04-15T10:15:27.7991506Z and single_index_in_fused_node(input_buf)
2025-04-15T10:15:27.7992484Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 484, in single_index_in_fused_node
2025-04-15T10:15:27.7993249Z buf_to_be_inplaced.scheduler.get_fused_node(user_node)
2025-04-15T10:15:27.7993944Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 2989, in get_fused_node
2025-04-15T10:15:27.7996226Z return self.name_to_fused_node[node.get_first_name()]
2025-04-15T10:15:27.7996734Z torch._inductor.exc.InductorError: KeyError: 'op566'
2025-04-15T10:15:27.7997104Z warmup_failed
2025-04-15T10:15:31.8469481Z Run failed with return code: 255
2025-04-15T10:15:31.8469861Z Output: None
2025-04-15T10:15:31.8470092Z Error: None
2025-04-15T10:15:35.6188853Z
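
For readers skimming the traceback: the last frame is a plain dictionary lookup, `self.name_to_fused_node[node.get_first_name()]`, so the `KeyError` means the name `op566` was not a key in that mapping when `decide_inplace_update` ran. Below is a minimal toy sketch of that failure shape. `ToyNode` and `ToyScheduler` are hypothetical stand-ins written for illustration, not inductor classes, and the sketch makes no claim about why the entry is missing in the real scheduler.

```python
# Toy model only: these classes are hypothetical and are NOT torch._inductor code.
class ToyNode:
    def __init__(self, name):
        self.name = name

    def get_first_name(self):
        return self.name


class ToyScheduler:
    def __init__(self, nodes):
        # Same shape as the lookup in the last traceback frame:
        # a dict from node name to the node registered under that name.
        self.name_to_fused_node = {n.get_first_name(): n for n in nodes}

    def get_fused_node(self, node):
        # Raises KeyError if the node's name was never registered (or was dropped).
        return self.name_to_fused_node[node.get_first_name()]


live = ToyNode("op1")
stale = ToyNode("op566")  # assumed: still referenced as a user, but absent from the map
scheduler = ToyScheduler([live])

# A registered node resolves normally.
assert scheduler.get_fused_node(live) is live

# An unregistered node fails the same way the log does.
missing = None
try:
    scheduler.get_fused_node(stale)
except KeyError as exc:
    missing = exc.args[0]
    print(f"KeyError: {missing!r}")
```

If this shape matches the real failure, the useful question for triage is which pass left a user node referenced by a buffer without a corresponding entry in `name_to_fused_node`.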

Versions

This error occurs on current main (40ce4fb); see https://hud.pytorch.org/benchmark/timm_models/inductor_no_cudagraphs?dashboard=torchinductor&startTime=Sat,%2001%20Mar%202025%2007:14:57%20GMT&stopTime=Wed,%2016%20Apr%202025%2007:14:57%20GMT&granularity=hour&mode=training&model=dm_nfnet_f0&dtype=amp&deviceName=cuda%20(a100)&lBranch=main&lCommit=ccfce9ae868131cc87dd99584ab79e316c14e7d4&rBranch=main&rCommit=ccfce9ae868131cc87dd99584ab79e316c14e7d4

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @aakhundov

Labels: module: inductor, triaged