
PHP benchmark: report histogram and unconstrained scenarios#12760

Merged
ZhouyihaiDing merged 1 commit into grpc:master from ZhouyihaiDing:opt_php_benchmark
Oct 8, 2017

Conversation

@ZhouyihaiDing
Contributor

@ZhouyihaiDing ZhouyihaiDing commented Sep 28, 2017

  1. Currently, each time the PHP client makes a call, it uploads the latency result through a streaming RPC immediately, and that time is counted in the benchmark time. Now, it sends histogram stats to the Ruby proxy once every 2000 ping-pongs.

QPS tests on my local machine (protobuf C extension, 20s warm-up, 60s benchmark):

QPS            before    after
unary          2865.9    4166.7
streaming      2898.0    4200.0

  2. When an unconstrained scenario wants to use multiple channels, Ruby creates multiple PHP clients and assigns a server target to each of them.
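The batching change described above can be sketched as follows. This is a minimal illustration, not the actual benchmark code; `LatencyBatcher` and its flush callback are hypothetical names standing in for the client's buffer and the streaming RPC write:

```ruby
# Hypothetical sketch of the batching above: instead of one stats RPC
# per call, buffer per-call latencies and flush one message per
# BATCH_SIZE ping-pongs via the supplied callback.
class LatencyBatcher
  BATCH_SIZE = 2000

  def initialize(&flush)
    @flush = flush       # stands in for the streaming RPC write
    @pending = []
    @flushes = 0
  end

  def record(latency_seconds)
    @pending << latency_seconds
    return if @pending.size < BATCH_SIZE
    @flush.call(@pending)
    @flushes += 1
    @pending = []
  end

  attr_reader :flushes
end

sent = []
batcher = LatencyBatcher.new { |batch| sent << batch.size }
5000.times { batcher.record(0.001) }
puts batcher.flushes  # two full batches sent; 1000 latencies still buffered
```

The point is simply that the per-call RPC overhead is amortized over 2000 calls, so the upload time stops dominating the measured benchmark window.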

Contributor

@vjpai vjpai Sep 28, 2017


Hi there,
It shouldn't be assumed that there are exactly 3 marks. It is possible that we would want more statistics reporting points than that, and all the benchmarks are written to assume that there can be an unlimited number of marks. That's why the mark has a reset field: to allow some marks to specify that the stats should be cleared. It's fine to react to marks differently from other operations, but I don't think you should assume a count of marks.

Contributor Author


Thanks for the correction. Then I also can't assume that only the last mark is for the benchmark.
If a middle mark is used for the benchmark, uploading the data still consumes time from the next mark, which lowers the QPS. I didn't see where I can pause the driver to let Ruby finish uploading before the next mark starts.
I guess I can keep using the signal approach to let PHP know when a mark is called, but instead of uploading all the latencies, I should implement a histogram on the PHP side and upload it once before each mark.
Do you have any suggestions for letting PHP know when to upload, other than using a signal? The response from the server seems useless because it stays the same until the stream is closed.

Contributor Author


Is it right that "reqs.each do |req|" works like this?

while (1) {
  in = stream.recv();  // the program will hang here for the warm-up time, costing lots of benchmark time
  case in.isSetup: ...
  case in.isMark: get data from PHP, upload the histogram to the driver
}

Thus in the in.isMark case, the streaming upload will always be expensive, and it costs time from the next mark. What I'm doing now is sending the split latencies together, which may speed up cache or buffer re-use or something. But the best way is just to upload once, right?

Contributor

@vjpai vjpai Sep 29, 2017


I think the right thing is to restructure the proxy worker. Right now the PHP client talks to the proxy worker over a 1-sided client-side streaming RPC called ReportTime. Instead, it should be changed to a bidi-streaming RPC. On the bidi streaming RPC, the proxy worker should send each mark to the PHP client. The PHP client should read that and then respond with a histogram that it has been collecting over the execution. The proxy worker can then pass that histogram back to the driver. That will require maintaining a histogram in the PHP client and sending a ClientStats proto back to the proxy worker on a mark. Does that make sense? Maybe the Quit and other messages could also be passed from the proxy worker to the PHP client like that to take signal out of the mix.
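The proposed bidi flow can be sketched without the grpc gem by modeling the two stream directions as a pair of queues: the proxy worker pushes marks (and quit) to the client, and the client replies with the stats it has accumulated, resetting when asked. The message shapes below are illustrative, not the real benchmark protos:

```ruby
# Minimal sketch of the proposed restructuring: the bidi stream is
# modeled as two Queues (illustrative message shapes, not real protos).
to_client = Queue.new   # proxy worker -> PHP client: marks / quit
to_proxy  = Queue.new   # PHP client -> proxy worker: stats replies

client = Thread.new do
  count = 0
  loop do
    msg = to_client.pop
    case msg[:type]
    when :call     # stand-in for one benchmark ping-pong being counted
      count += 1
    when :mark     # reply with stats collected so far; reset if asked
      to_proxy << { calls: count }
      count = 0 if msg[:reset]
    when :quit
      break
    end
  end
end

3.times { to_client << { type: :call } }
to_client << { type: :mark, reset: true }
2.times { to_client << { type: :call } }
to_client << { type: :mark, reset: false }
to_client << { type: :quit }
client.join

stats = []
stats << to_proxy.pop until to_proxy.empty?
p stats  # two stats replies: 3 calls before the reset mark, then 2
```

In the real setup the :mark message would carry the driver's Mark proto and the reply would be a ClientStats containing the histogram; the sketch only shows the request/response pairing on one stream that removes the need for signals.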

Contributor Author


Using a bidi stream sounds cool. I'll try an async approach, or fork a child process listening to this bidi-streaming RPC in a sync way. Thanks for your help!
I'll close this PR and reopen it once I finish.

Contributor


Oh yes, I forgot about that point: neither of those options is really natural in PHP. I think that's why I went with the 1-directional streaming approach in the first place. It probably needs better investigation then.

Contributor

@vjpai vjpai Sep 29, 2017


Another possibility: go with the example that you're currently using (storing the latency results in an array and sending it out as a batch) but send it out periodically like once every 100 ms or something so that you get the batching effects that you're talking about but without having to alter the framework altogether. The important thing is just to send far fewer stats-reporting RPCs.
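That time-based variant can be sketched as below. It is a rough illustration with hypothetical names (`PeriodicReporter`, the flush callback), driven here by a fake integer-millisecond clock so the example is deterministic; a real implementation would use something like Process.clock_gettime:

```ruby
# Sketch of the suggested time-based batching: keep appending latencies
# and flush the whole array as one batch once the interval has elapsed.
class PeriodicReporter
  def initialize(flush_interval_ms, clock, &flush)
    @interval = flush_interval_ms   # e.g. 100 for "once every 100 ms"
    @clock = clock                  # -> integer milliseconds
    @flush = flush                  # stands in for the batched stats RPC
    @pending = []
    @last_flush = clock.call
  end

  def record(latency)
    @pending << latency
    now = @clock.call
    return if now - @last_flush < @interval
    @flush.call(@pending)
    @pending = []
    @last_flush = now
  end
end

# Deterministic demo: one call per simulated millisecond.
t = 0
batches = []
reporter = PeriodicReporter.new(100, -> { t }) { |b| batches << b.length }
300.times do
  t += 1
  reporter.record(0.001)
end
puts batches.inspect  # [100, 100, 100]
```

The effect is the same batching win as the count-based approach, but the reporting cadence stays bounded even when the QPS varies.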

Contributor Author


I'm still using a signal to let them talk to each other. I tried creating a child process and letting a streaming RPC listen there, but it seems gRPC PHP doesn't work in a child process (the same code works in the parent but hangs in the child). I don't know yet whether it's a PHP problem or a gRPC PHP problem.
Currently every mark triggers a PHP upload, and it continues until the data is uploaded.
I'll implement the histogram on the PHP side next.

@ZhouyihaiDing
Contributor Author

I tried to create a child process in PHP and let it handle the stats stream, but gRPC seems to have a problem inside the child process, so I can only handle the signal inside the main process.
Ticks also have an influence on the PHP client side, but setting a time alarm also needs ticks. I'll check whether sending the histogram for every ping-pong is faster once I implement histogram merging on the Ruby side.
Each signal blocks the Ruby proxy, because it has to wait until the PHP program has started or the data has been transferred.
Bucket data can't be sent back to the Ruby proxy; I am checking why.

@ZhouyihaiDing ZhouyihaiDing force-pushed the opt_php_benchmark branch 2 times, most recently from ad2cdfb to 2d52819 on October 2, 2017 at 16:00
@ZhouyihaiDing
Contributor Author

Results show that batch sending once every 1000 calls is faster than using a signal as the trigger.

@grpc grpc deleted a comment from grpc-testing Oct 2, 2017
@ZhouyihaiDing ZhouyihaiDing force-pushed the opt_php_benchmark branch 2 times, most recently from 41d4ac6 to b1365d4 on October 2, 2017 at 18:29
@grpc grpc deleted a comment from grpc-testing Oct 2, 2017
@ZhouyihaiDing
Contributor Author

I changed it on the Ruby side:

  1. Each client channel creates a thread which starts a PHP client. (Using a thread seems unnecessary, but starting a PHP client is really slow; most of the time the first mark is called before the PHP client has initialized. Using a signal to keep the order would introduce ticks on the PHP side, which slows the program down.)
  2. PHP reports a histogram to Ruby via "report_hist" once every 2000 ping-pongs. It's faster than using a signal to trigger uploading the histogram.

Can you please help me review the Ruby code? @vjpai I changed proxy-server.rb and histogram.rb.
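The reason merging histograms on the Ruby side is cheap can be shown with a small sketch. This is not the actual histogram.rb; the class, its resolution default, and its bucketing formula are illustrative of the exponentially sized buckets the gRPC benchmarks use, where merging is just adding per-bucket counts:

```ruby
# Illustrative exponential-bucket histogram: values map to buckets by
# log base (1 + resolution), so merging is a per-bucket count addition.
class Hist
  attr_reader :buckets, :count

  def initialize(resolution = 0.01)
    @multiplier = 1.0 + resolution
    @buckets = Hash.new(0)   # bucket index -> count
    @count = 0
  end

  def bucket_for(value)
    value <= 1.0 ? 0 : (Math.log(value) / Math.log(@multiplier)).to_i
  end

  def add(value)
    @buckets[bucket_for(value)] += 1
    @count += 1
  end

  # The proxy can fold many per-client histograms into one before
  # reporting to the driver, instead of forwarding raw latencies.
  def merge!(other)
    other.buckets.each { |idx, n| @buckets[idx] += n }
    @count += other.count
    self
  end
end

a = Hist.new
b = Hist.new
1000.times { a.add(500.0) }
1000.times { b.add(500.0) }
a.merge!(b)
puts a.count          # 2000
puts a.buckets.size   # 1 (identical latencies land in the same bucket)
```

Since a merge touches only the occupied buckets, folding one histogram per 2000 ping-pongs into the proxy's running total is far cheaper than streaming every individual latency to the driver.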

Contributor


space after if for consistency

Contributor Author


Done.

Contributor


add a space around the = for consistency

Contributor Author


Done.

Contributor


Is this intended?

Contributor Author


Oh, I was using this to print the benchmark scenarios from other languages, to see how many channels / RPCs per channel they use.
I'll delete it.

@grpc grpc deleted a comment from grpc-testing Oct 5, 2017
@grpc-testing

[microbenchmarks] No significant performance differences


@grpc-testing

[trickle] No significant performance differences


@grpc-testing

[microbenchmarks] No significant performance differences


@ZhouyihaiDing
Contributor Author

Asan C (internal CI) -
Basic Tests C++ Linux [opt] (internal CI) - (100 failures in last 24h) json_run_localhost:cpp_protobuf_async_client_unary_1channel_64wide_128Breq_8MBresp_secure
Basic Tests MacOS [opt] (internal CI) - #12857
Basic Tests Windows [dbg] (internal CI) - #12858
Interop Cloud-to-Prod Tests (internal CI) - #10763
Tsan C (internal CI) - #12859

Contributor


nit: I'd just enumerate "php7" and "php7_protobuf_c" here; no need to use wildcards.
Also, there's no reason to pass $CONFIG as an argument; that variable will be available in build_performance_php.sh anyway. Taking an env variable and passing it as a param only to make it an env variable again seems unnecessary.

Contributor Author


Done. I used wildcards in case we need to add benchmarks for both php and php7, but that requires switching the language version in the VM, so for now I only focus on building the benchmark for php7.

Contributor


Also, it looks like you are now building PHP twice: once for php7 and once for php7_protobuf_c. Given that the build process seems to be the same regardless of using protobuf_c, maybe you can just build everything as language "php"?

Basically, that would mean differentiating the language name in scenario_config.py ( = str(language)) from the "language to build" by introducing language.lang_to_build().

Building PHP twice is not a problem if:
1.) the second build takes negligible time (everything already built and installed), and
2.) you can guarantee that the PHP builds won't run concurrently and therefore collide with each other.

Contributor


So the builds are not running concurrently, as they are executed one by one by build_performance.sh.

One possible trick (and possibly the simplest): if both php7 and php7_protobuf_c are passed to build_performance.sh, it can remember that PHP has already been built when the first build is triggered (PHP_ALREADY_BUILT=the_lang_that_built_it), and the second build request can be skipped ("Skipping PHP build as already built by $PHP_ALREADY_BUILT").
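A minimal sketch of that guard might look like the following. The function name and structure are illustrative, not the real build_performance.sh:

```shell
# Hypothetical sketch of the skip-if-built guard suggested above.
PHP_ALREADY_BUILT=""

build_php_once() {
  lang="$1"
  if [ -n "$PHP_ALREADY_BUILT" ]; then
    echo "Skipping PHP build as already built by $PHP_ALREADY_BUILT"
    return 0
  fi
  echo "Building PHP for $lang"
  # ... the actual PHP build steps would run here ...
  PHP_ALREADY_BUILT="$lang"
}

build_php_once php7
build_php_once php7_protobuf_c
```

The first invocation does the build and records which language triggered it; the second returns immediately, so passing both php7 and php7_protobuf_c costs only one build.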

Contributor Author


Done, thank you for the suggestions. I check PHP_ALREADY_BUILT and build only once.

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



Contributor

@jtattermusch jtattermusch left a comment


LGTM for the changes to performance testing scripts (.py and .sh). Thanks for addressing my comments, I think now things look much better!

After merging, please monitor the benchmark suite for a while to check that PHP builds are not introducing extra flakiness.

@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences


@grpc-testing

[trickle] No significant performance differences

Contributor

@vjpai vjpai left a comment


LGTM. I'm fine with merging, but I would like an opinion from @jtattermusch or @adelez: does it make sense to dashboard the PHP QPS test? The model of parallelism used here (a separate process per client channel) is completely different from the other wrapped languages (which use one process and may have multiple threads). I know this is because multithreading isn't idiomatic in PHP, but it might muddle the actual contents of the multilingual dashboard.


@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_MinInProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.counters.new: 1
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_InProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.opt.new: 1
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_InProcessCHTTP2_NoOpMutator_NoOpMutator__2M_2_0.counters.new: 1


[microbenchmarks] No significant performance differences

@ZhouyihaiDing
Contributor Author

ZhouyihaiDing commented Oct 6, 2017

Not related:
Basic Tests Windows [dbg] (internal CI) - #12872 cmake flow_control.cc
Basic Tests Windows [opt] (internal CI) - #12872
Artifact Build Windows (internal CI)
Portability Tests Windows [Build Only] (internal CI) - #12872
Interop Cloud-to-Cloud Tests (internal CI) - #12875 all node tests fail
Interop Cloud-to-Prod Tests (internal CI) - #12875
Portability Tests Linux [Build Only] (internal CI) - #12871
Basic Tests Multi-language Linux (internal CI) - #12886 - php
Basic Tests C++ Linux [dbg] (internal CI) - #12887
Basic Tests C++ Linux [opt] (internal CI) - #12887
Basic Tests MacOS [dbg] (internal CI) - #12889
Basic Tests MacOS [opt] (internal CI) - #12889
Msan C (internal CI) - #12892
Asan C (internal CI) - #12892

@fengli79
Contributor

fengli79 commented Oct 6, 2017

@vjpai @adelez @jtattermusch
I actually have the same question: what's the expected behavior in multi-core testing?
I'm under the impression that either the client or the server may become the bottleneck and occupy all the cores; however, that sounds like a blurry test scenario for QPS measurement.
For example, in an 8-core client and 8-core server test, some languages may hit the bottleneck on the client side while others hit it on the server side. Those numbers are not really comparable.
Ideally it should be:

  1. Have different sets of tests that limit the cores on the client and server sides. For example, give the client 8 cores and the server enough CPUs to serve all the traffic, and make a different test case that limits the server to 8 cores and gives the client enough CPUs.
  2. On the machine with limited CPU cores, adjust the concurrency and thread model to get the maximum QPS. I expect the QPS under a fixed number of cores to be a curve as we adjust the concurrency factors (threads/processes, etc.).

@grpc-testing

Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_MinInProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.opt.old: 3
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_MinInProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.opt.new: 1
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_InProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.counters.old: 1
    bm_fullstack_streaming_ping_pong.BM_StreamingPingPongWithCoalescingApi_MinInProcessCHTTP2_NoOpMutator_NoOpMutator__2M_1_0.counters.old: 1


[microbenchmarks] No significant performance differences


8 participants