Cranelift: Big performance difference between two cases with the same semantics

Hi, I use the `Emscripten` to compile the attached test case(`flops.fast.c`) and run it on Wasmtime. And then I do a very small change(`flops.slow.c`, only move the location of two `printf` statements) in this case which don't affect the program semantics. However, the execution time of `flops.slow.c` is almost 4 slower than that of `flops.fast.c` (measured by `time` tool).

- `flops.fast.c`: 0.615s
- `flops.slow.c`: 2.395s

![image](https://github.com/bytecodealliance/wasmtime/assets/32137313/b6ba42cc-d2e7-4bda-83a9-245a4c20c1b8)

And then I run these two cases in `WasmEdge` and `Wamr` with AOT mode, and there is almost not difference between the execution time, so I think this bug is not caused by `Emscripten`.

#### WasmEdge
- `flops.fast.c`: 0.43s
- `flops.slow.c`: 0.43s

#### Wamr
- `flops.fast.c`: 0.42s
- `flops.slow.c`: 0.42s


### Try to Debug
I try to increase the number of iterations (the case only contains one loop) from `156250000` to `1562500000`, and I find that the execution time will also increase by the same multiple(10), so I guess such small changes will affect the execution time of the loop.

Since `Wasmtime` will compile the wasm code to binary using `Cranelift` before execution, I dump the binary using `wasmtime compile flops.fast.wasm -o flops.fast.o` and check the difference of the loop between such two binary code but I don't find any semantics difference in my eyes.

```
# flops.fast.c
2f2:	8d 71 01             	lea    esi,[rcx+0x1]
2f5:	c5 4b 59 25 73 04 00 	vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:	00 
2fd:	c4 41 53 5e fc       	vdivsd xmm15,xmm5,xmm12
302:	c5 1b 2a e1          	vcvtsi2sd xmm12,xmm12,ecx
306:	c4 41 03 59 e4       	vmulsd xmm12,xmm15,xmm12
30b:	c4 c1 1b 59 fc       	vmulsd xmm7,xmm12,xmm12
310:	c5 63 59 e7          	vmulsd xmm12,xmm3,xmm7
314:	c5 1b 58 e2          	vaddsd xmm12,xmm12,xmm2
318:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
31d:	c5 1b 58 e0          	vaddsd xmm12,xmm12,xmm0
321:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
326:	c4 41 1b 58 e6       	vaddsd xmm12,xmm12,xmm14
32b:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
330:	c4 41 1b 58 e5       	vaddsd xmm12,xmm12,xmm13
335:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
33a:	f3 44 0f 6f 44 24 10 	movdqu xmm8,XMMWORD PTR [rsp+0x10]
341:	c4 41 1b 58 e0       	vaddsd xmm12,xmm12,xmm8
346:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
34b:	c4 41 1b 58 e1       	vaddsd xmm12,xmm12,xmm9
350:	c4 41 23 58 cc       	vaddsd xmm9,xmm11,xmm12
355:	81 fe 90 2f 50 09    	cmp    esi,0x9502f90
35b:	0f 84 08 00 00 00    	je     369 <wasm[0]::function[3]+0x329>
361:	48 89 f1             	mov    rcx,rsi
364:	e9 89 ff ff ff       	jmp    2f2 <wasm[0]::function[3]+0x2b2>
```

```
# flops.slow.c
2c5:	8d 7e 01             	lea    edi,[rsi+0x1]
2c8:	c5 1b 2a e6          	vcvtsi2sd xmm12,xmm12,esi
2cc:	c5 1b 59 25 8c 04 00 	vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
2d3:	00 
2d4:	c4 41 1b 59 e4       	vmulsd xmm12,xmm12,xmm12
2d9:	c4 41 33 59 ec       	vmulsd xmm13,xmm9,xmm12
2de:	c4 41 13 58 e8       	vaddsd xmm13,xmm13,xmm8
2e3:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2e8:	c5 13 58 ee          	vaddsd xmm13,xmm13,xmm6
2ec:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2f1:	c5 13 58 ed          	vaddsd xmm13,xmm13,xmm5
2f5:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2fa:	c5 13 58 ec          	vaddsd xmm13,xmm13,xmm4
2fe:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
303:	c5 13 58 eb          	vaddsd xmm13,xmm13,xmm3
307:	c4 41 1b 59 e5       	vmulsd xmm12,xmm12,xmm13
30c:	c4 41 1b 58 d2       	vaddsd xmm10,xmm12,xmm10
311:	c5 2b 58 15 57 04 00 	vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
318:	00 
319:	81 ff 90 2f 50 09    	cmp    edi,0x9502f90
31f:	0f 84 08 00 00 00    	je     32d <wasm[0]::function[3]+0x2ed>
325:	48 89 fe             	mov    rsi,rdi
328:	e9 98 ff ff ff       	jmp    2c5 <wasm[0]::function[3]+0x285>
```

### Steps to Reproduce

1. Compile the two C cases into Wasm: 
  ```
    emcc  -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.fast.c -o flops.fast.wasm
    emcc  -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.slow.c -o flops.slow.wasm
  ```
2. Run wasm in `Wasmtime` and collect execution time using `time`
  ```
   time wasmtime flops.fast.wasm
   time wasmtime flops.slow.wasm
  ```

### Versions and Environment

- `Wasmtime`: f10d665bb9cacceee6051989e99c0934c542ad84
- `WasmEdge`: 777d8b4ba50567b3b39aa8c9d81a6dfade355af8
- `Wamr`: 1ff41ebdc257b45c9f574d8d1f7ee6df4be79645
- `Emscripten`:
  emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.45 (ef3e4e3b044de98e1811546e0bc605c65d3412f4)
clang version 18.0.0 (https://github.com/llvm/llvm-project d1e685df45dc5944b43d2547d0138cd4a3ee4efe)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /home/ringzzz/emsdk/upstream/bin

- kernel version: Linux 5.15.0-83-generic

[code.zip](https://github.com/bytecodealliance/wasmtime/files/12718355/code.zip)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cranelift: Big performance difference between two cases with the same semantics #7085

WasmEdge

Wamr

Try to Debug

Steps to Reproduce

Versions and Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Cranelift: Big performance difference between two cases with the same semantics #7085

Description

WasmEdge

Wamr

Try to Debug

Steps to Reproduce

Versions and Environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions