Skip to content

Cranelift: Big performance difference between two cases with the same semantics #7085

@hungryzzz

Description

@hungryzzz

Hi, I use the Emscripten to compile the attached test case(flops.fast.c) and run it on Wasmtime. And then I do a very small change(flops.slow.c, only move the location of two printf statements) in this case which don't affect the program semantics. However, the execution time of flops.slow.c is almost 4 slower than that of flops.fast.c (measured by time tool).

  • flops.fast.c: 0.615s
  • flops.slow.c: 2.395s

image

And then I run these two cases in WasmEdge and Wamr with AOT mode, and there is almost not difference between the execution time, so I think this bug is not caused by Emscripten.

WasmEdge

  • flops.fast.c: 0.43s
  • flops.slow.c: 0.43s

Wamr

  • flops.fast.c: 0.42s
  • flops.slow.c: 0.42s

Try to Debug

I try to increase the number of iterations (the case only contains one loop) from 156250000 to 1562500000, and I find that the execution time will also increase by the same multiple(10), so I guess such small changes will affect the execution time of the loop.

Since Wasmtime will compile the wasm code to binary using Cranelift before execution, I dump the binary using wasmtime compile flops.fast.wasm -o flops.fast.o and check the difference of the loop between such two binary code but I don't find any semantics difference in my eyes.

# flops.fast.c
2f2:	8d 71 01             	lea    esi,[rcx+0x1]
2f5:	c5 4b 59 25 73 04 00 	vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:	00 
2fd:	c4 41 53 5e fc       	vdivsd xmm15,xmm5,xmm12
302:	c5 1b 2a e1          	vcvtsi2sd xmm12,xmm12,ecx
306:	c4 41 03 59 e4       	vmulsd xmm12,xmm15,xmm12
30b:	c4 c1 1b 59 fc       	vmulsd xmm7,xmm12,xmm12
310:	c5 63 59 e7          	vmulsd xmm12,xmm3,xmm7
314:	c5 1b 58 e2          	vaddsd xmm12,xmm12,xmm2
318:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
31d:	c5 1b 58 e0          	vaddsd xmm12,xmm12,xmm0
321:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
326:	c4 41 1b 58 e6       	vaddsd xmm12,xmm12,xmm14
32b:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
330:	c4 41 1b 58 e5       	vaddsd xmm12,xmm12,xmm13
335:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
33a:	f3 44 0f 6f 44 24 10 	movdqu xmm8,XMMWORD PTR [rsp+0x10]
341:	c4 41 1b 58 e0       	vaddsd xmm12,xmm12,xmm8
346:	c4 41 43 59 e4       	vmulsd xmm12,xmm7,xmm12
34b:	c4 41 1b 58 e1       	vaddsd xmm12,xmm12,xmm9
350:	c4 41 23 58 cc       	vaddsd xmm9,xmm11,xmm12
355:	81 fe 90 2f 50 09    	cmp    esi,0x9502f90
35b:	0f 84 08 00 00 00    	je     369 <wasm[0]::function[3]+0x329>
361:	48 89 f1             	mov    rcx,rsi
364:	e9 89 ff ff ff       	jmp    2f2 <wasm[0]::function[3]+0x2b2>
# flops.slow.c
2c5:	8d 7e 01             	lea    edi,[rsi+0x1]
2c8:	c5 1b 2a e6          	vcvtsi2sd xmm12,xmm12,esi
2cc:	c5 1b 59 25 8c 04 00 	vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
2d3:	00 
2d4:	c4 41 1b 59 e4       	vmulsd xmm12,xmm12,xmm12
2d9:	c4 41 33 59 ec       	vmulsd xmm13,xmm9,xmm12
2de:	c4 41 13 58 e8       	vaddsd xmm13,xmm13,xmm8
2e3:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2e8:	c5 13 58 ee          	vaddsd xmm13,xmm13,xmm6
2ec:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2f1:	c5 13 58 ed          	vaddsd xmm13,xmm13,xmm5
2f5:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
2fa:	c5 13 58 ec          	vaddsd xmm13,xmm13,xmm4
2fe:	c4 41 1b 59 ed       	vmulsd xmm13,xmm12,xmm13
303:	c5 13 58 eb          	vaddsd xmm13,xmm13,xmm3
307:	c4 41 1b 59 e5       	vmulsd xmm12,xmm12,xmm13
30c:	c4 41 1b 58 d2       	vaddsd xmm10,xmm12,xmm10
311:	c5 2b 58 15 57 04 00 	vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
318:	00 
319:	81 ff 90 2f 50 09    	cmp    edi,0x9502f90
31f:	0f 84 08 00 00 00    	je     32d <wasm[0]::function[3]+0x2ed>
325:	48 89 fe             	mov    rsi,rdi
328:	e9 98 ff ff ff       	jmp    2c5 <wasm[0]::function[3]+0x285>

Steps to Reproduce

  1. Compile the two C cases into Wasm:
  emcc  -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.fast.c -o flops.fast.wasm
  emcc  -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.slow.c -o flops.slow.wasm
  1. Run wasm in Wasmtime and collect execution time using time
 time wasmtime flops.fast.wasm
 time wasmtime flops.slow.wasm

Versions and Environment

  • Wasmtime: f10d665

  • WasmEdge: 777d8b4ba50567b3b39aa8c9d81a6dfade355af8

  • Wamr: 1ff41ebdc257b45c9f574d8d1f7ee6df4be79645

  • Emscripten:
    emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.45 (ef3e4e3b044de98e1811546e0bc605c65d3412f4)
    clang version 18.0.0 (https://github.com/llvm/llvm-project d1e685df45dc5944b43d2547d0138cd4a3ee4efe)
    Target: wasm32-unknown-emscripten
    Thread model: posix
    InstalledDir: /home/ringzzz/emsdk/upstream/bin

  • kernel version: Linux 5.15.0-83-generic

code.zip

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugIncorrect behavior in the current implementation that needs fixingcraneliftIssues related to the Cranelift code generator

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions