@liaoxin01
Contributor

@liaoxin01 liaoxin01 commented Aug 26, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Group Commit Stream Load Forward Mode in Cloud Environment:
Problem:
Group commit batches most efficiently when all requests for the same table are sent
to the same BE node. However, in cloud mode behind a load balancer (LB), the LB picks
a BE node at random for each request, which breaks the group commit strategy and
reduces batching effectiveness.

Solution:
Implement a two-stage forwarding mechanism:

  1. The FE redirects the request to the public/private endpoint (the LB), as usual.
  2. The BE that receives the request forwards it a second time to the target BE node that handles the specific table.

This ensures that all requests for the same table ultimately reach the same BE node,
preserving the group commit batching strategy while still using the LB infrastructure.
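
The two stages can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code in this PR: the description does not say how the target BE for a table is chosen, so `pick_target_be` (a hash of the table id over the sorted BE list) is a hypothetical rule, and `lb_pick`, `handle_on_be`, and `stream_load` are illustrative names.

```python
import hashlib
import random
from typing import List


def pick_target_be(table_id: int, backends: List[str]) -> str:
    """Hypothetical rule: map a table id to one BE deterministically."""
    ordered = sorted(backends)
    digest = hashlib.sha256(str(table_id).encode()).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def lb_pick(backends: List[str], rng: random.Random) -> str:
    """Stage 1: the LB behind the endpoint picks any BE at random."""
    return rng.choice(backends)


def handle_on_be(receiving_be: str, table_id: int, backends: List[str],
                 already_forwarded: bool = False) -> str:
    """Stage 2: return the BE that finally processes the load.

    A request is forwarded at most once, so it cannot loop between nodes.
    """
    target = pick_target_be(table_id, backends)
    if already_forwarded or receiving_be == target:
        return receiving_be
    return handle_on_be(target, table_id, backends, already_forwarded=True)


def stream_load(table_id: int, backends: List[str], rng: random.Random) -> str:
    first_hop = lb_pick(backends, rng)                  # FE -> LB -> random BE
    return handle_on_be(first_hop, table_id, backends)  # BE -> target BE


if __name__ == "__main__":
    bes = ["be-1", "be-2", "be-3"]
    rng = random.Random(0)
    finals = {stream_load(42, bes, rng) for _ in range(100)}
    print(finals)  # a single BE, whichever node the LB picked first
```

Whatever rule is used, every BE must compute the same target for a given table; otherwise the second hop would scatter requests again.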

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…de (apache#4113)

Handle stream load redirect with optional group commit forwarding.

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33516 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 947ba49b24bcda3949c943c8f2be53fd4ead5ede, data reload: false

------ Round 1 ----------------------------------
q1	17575	5266	5085	5085
q2	1918	282	190	190
q3	10312	1228	720	720
q4	10208	970	519	519
q5	7488	2436	2267	2267
q6	179	161	131	131
q7	886	729	599	599
q8	9296	1279	1048	1048
q9	6946	5076	5090	5076
q10	6923	2383	1967	1967
q11	475	287	258	258
q12	343	344	213	213
q13	17766	3616	2991	2991
q14	228	233	227	227
q15	572	481	482	481
q16	418	418	365	365
q17	605	854	356	356
q18	7355	7255	6920	6920
q19	1364	943	555	555
q20	335	318	216	216
q21	3665	3190	2337	2337
q22	1091	1049	995	995
Total cold run time: 105948 ms
Total hot run time: 33516 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5177	5057	5064	5057
q2	242	323	219	219
q3	2179	2662	2300	2300
q4	1318	1775	1308	1308
q5	4191	4397	4429	4397
q6	212	167	133	133
q7	2010	1965	1823	1823
q8	2646	2546	2483	2483
q9	7377	7289	7368	7289
q10	3095	3322	2832	2832
q11	729	566	492	492
q12	702	747	595	595
q13	3377	3923	3366	3366
q14	307	305	277	277
q15	516	470	459	459
q16	443	489	445	445
q17	1194	1502	1434	1434
q18	7847	7639	7519	7519
q19	821	784	814	784
q20	1998	2120	2159	2120
q21	4884	4392	4258	4258
q22	1093	1061	996	996
Total cold run time: 52358 ms
Total hot run time: 50586 ms
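
The bot output carries no column headers. For the Round 1 rows, the numbers are consistent with this reading: the first value is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs; the totals are the sums of the first and last columns. That interpretation is inferred from the figures, not stated by the bot. A minimal sketch that checks it against the Round 1 rows (the `summarize` helper is hypothetical):

```python
ROUND1 = """\
q1 17575 5266 5085 5085
q2 1918 282 190 190
q3 10312 1228 720 720
q4 10208 970 519 519
q5 7488 2436 2267 2267
q6 179 161 131 131
q7 886 729 599 599
q8 9296 1279 1048 1048
q9 6946 5076 5090 5076
q10 6923 2383 1967 1967
q11 475 287 258 258
q12 343 344 213 213
q13 17766 3616 2991 2991
q14 228 233 227 227
q15 572 481 482 481
q16 418 418 365 365
q17 605 854 356 356
q18 7355 7255 6920 6920
q19 1364 943 555 555
q20 335 318 216 216
q21 3665 3190 2337 2337
q22 1091 1049 995 995
"""


def summarize(table: str) -> tuple:
    """Return (cold_total_ms, hot_total_ms) for a pasted result table."""
    cold_total = 0
    hot_total = 0
    for line in table.splitlines():
        name, cold, hot1, hot2, best = line.split()
        # The last column should be the faster of the two hot runs.
        assert int(best) == min(int(hot1), int(hot2)), name
        cold_total += int(cold)
        hot_total += int(best)
    return cold_total, hot_total


if __name__ == "__main__":
    print(summarize(ROUND1))  # (105948, 33516), matching the reported totals
```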

@doris-robot

TPC-DS: Total hot run time: 182174 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 947ba49b24bcda3949c943c8f2be53fd4ead5ede, data reload: false

query1	1019	387	416	387
query2	6545	1785	1779	1779
query3	7323	227	227	227
query4	26217	23456	22986	22986
query5	4375	629	482	482
query6	316	219	197	197
query7	4644	500	297	297
query8	276	227	229	227
query9	8613	2876	2867	2867
query10	488	322	292	292
query11	16027	15235	14777	14777
query12	165	112	122	112
query13	1674	571	421	421
query14	9336	5727	5740	5727
query15	219	186	177	177
query16	7652	641	459	459
query17	1193	722	642	642
query18	2025	412	313	313
query19	193	183	154	154
query20	122	116	151	116
query21	208	117	109	109
query22	4275	4106	4020	4020
query23	33817	32877	32909	32877
query24	8103	2367	2353	2353
query25	531	473	395	395
query26	1230	265	155	155
query27	2722	508	366	366
query28	4276	2263	2220	2220
query29	732	563	440	440
query30	294	221	189	189
query31	891	801	718	718
query32	106	72	76	72
query33	558	379	339	339
query34	791	835	514	514
query35	801	796	747	747
query36	983	1002	895	895
query37	117	115	86	86
query38	4107	4099	3977	3977
query39	1452	1410	1417	1410
query40	213	131	110	110
query41	60	61	54	54
query42	123	110	117	110
query43	498	497	468	468
query44	1334	867	871	867
query45	179	172	170	170
query46	846	1012	635	635
query47	1750	1818	1719	1719
query48	376	407	304	304
query49	716	473	384	384
query50	662	682	388	388
query51	4182	4164	4117	4117
query52	112	108	99	99
query53	236	253	194	194
query54	598	573	522	522
query55	86	86	89	86
query56	316	315	306	306
query57	1199	1191	1160	1160
query58	272	262	273	262
query59	2600	2678	2584	2584
query60	345	369	339	339
query61	130	124	122	122
query62	866	727	673	673
query63	221	197	185	185
query64	4262	1015	684	684
query65	4206	4197	4165	4165
query66	1081	416	316	316
query67	15550	15214	14952	14952
query68	8089	913	628	628
query69	473	321	277	277
query70	1200	1166	1167	1166
query71	452	338	308	308
query72	5576	4665	2322	2322
query73	722	584	361	361
query74	9018	8888	8959	8888
query75	3753	3050	2622	2622
query76	3627	1090	737	737
query77	801	410	320	320
query78	9637	9571	8878	8878
query79	2474	826	593	593
query80	615	549	461	461
query81	492	253	219	219
query82	448	142	106	106
query83	249	260	228	228
query84	249	99	86	86
query85	806	381	438	381
query86	392	324	299	299
query87	4224	4356	4216	4216
query88	3475	2226	2204	2204
query89	392	316	303	303
query90	1836	232	214	214
query91	141	135	108	108
query92	90	68	65	65
query93	1729	983	686	686
query94	689	381	306	306
query95	396	310	302	302
query96	499	568	276	276
query97	2636	2684	2602	2602
query98	246	229	225	225
query99	1643	1398	1291	1291
Total cold run time: 274504 ms
Total hot run time: 182174 ms

@doris-robot

ClickBench: Total hot run time: 32.1 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 947ba49b24bcda3949c943c8f2be53fd4ead5ede, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.25	0.07	0.07
query4	1.62	0.11	0.11
query5	0.42	0.42	0.41
query6	1.16	0.66	0.65
query7	0.03	0.02	0.02
query8	0.05	0.04	0.04
query9	0.61	0.53	0.53
query10	0.59	0.57	0.57
query11	0.16	0.10	0.11
query12	0.14	0.11	0.12
query13	0.63	0.61	0.61
query14	0.78	0.83	0.86
query15	0.89	0.86	0.87
query16	0.39	0.39	0.39
query17	1.06	1.03	1.01
query18	0.21	0.19	0.20
query19	1.93	1.81	1.77
query20	0.01	0.01	0.01
query21	15.41	0.92	0.57
query22	0.78	1.24	0.78
query23	14.75	1.29	0.60
query24	6.72	1.82	0.30
query25	0.29	0.28	0.15
query26	0.50	0.15	0.12
query27	0.06	0.05	0.04
query28	10.05	0.93	0.42
query29	12.54	3.90	3.24
query30	3.02	2.96	2.96
query31	2.83	0.56	0.38
query32	3.25	0.60	0.47
query33	3.01	3.21	3.08
query34	15.69	5.41	4.88
query35	4.91	4.91	5.01
query36	0.70	0.52	0.50
query37	0.10	0.07	0.06
query38	0.05	0.05	0.04
query39	0.04	0.02	0.03
query40	0.17	0.16	0.15
query41	0.08	0.03	0.03
query42	0.03	0.03	0.02
query43	0.04	0.03	0.04
Total cold run time: 106.07 s
Total hot run time: 32.1 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/60) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 1.50% (5/333) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.75% (17150/33137)
Line Coverage 37.23% (156240/419684)
Region Coverage 31.92% (119150/373243)
Branch Coverage 33.22% (52347/157558)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 3.92% (13/332) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.75% (23031/32554)
Line Coverage 57.01% (239188/419573)
Region Coverage 52.44% (198644/378788)
Branch Coverage 54.15% (85865/158556)

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Aug 28, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit b358032 into apache:master Aug 29, 2025
27 of 30 checks passed
morrySnow pushed a commit that referenced this pull request Sep 4, 2025
@morrySnow morrySnow mentioned this pull request Sep 22, 2025

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.1-merged, reviewed

6 participants