Automatically detect whether a v1 query could have run on the v2 query engine #13628
Conversation
yashmayya commented on Jul 16, 2024:
- With the v2 / multi-stage query engine becoming more robust and feature complete (https://github.com/apache/pinot/pulls?q=is%3Apr+label%3Amulti-stage+is%3Aclosed), existing users may want to migrate their existing v1 query workload to the multi-stage query engine.
- However, there are some differences between the two query engines in terms of supported syntax (some of which are documented in https://docs.pinot.apache.org/reference/troubleshooting/troubleshoot-multi-stage-query-engine).
- For users who want to migrate from the v1 query engine to the v2 query engine, it would be useful to automatically detect whether the existing query workload would work as is, without modification. This patch adds a counter metric (optionally enabled via a broker configuration that defaults to false for performance reasons) that is incremented each time a v1 query fails compilation on the v2 query engine (see the sketch below this list). If this counter remains 0 during regular query workload execution, it signals that users can potentially migrate their query workload to the multi-stage query engine.
- Note that there could be situations where a query compiles successfully but fails during execution. There could also be cases where a query runs successfully but the results differ between the v1 and the v2 query engine - in terms of result types, result set size, etc.
- To address these concerns, we plan to also add a tool / API in the near future that actually executes a given query against both query engines and compares the results. That approach is much more heavyweight, shouldn't be run for every query during regular execution, and is intended for ad-hoc use.
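For illustration, a minimal sketch of the idea: the class, method, and counter names below are hypothetical stand-ins (the actual patch hooks into the broker's existing metrics machinery), and the v2 compile/optimize call is a placeholder.

```java
import java.util.concurrent.atomic.AtomicLong;

public class MultiStageMigrationChecker {
  // Stand-in for the broker counter metric added by the patch.
  private final AtomicLong _invalidMultiStageQueries = new AtomicLong();

  /**
   * Tries to plan the given v1 query with the v2 (multi-stage) query engine.
   * If planning fails, the query would need rewriting before migration, so the counter is bumped.
   */
  public void checkQuery(String sqlQuery) {
    try {
      compileAndOptimizeWithV2Planner(sqlQuery);
    } catch (Exception e) {
      _invalidMultiStageQueries.incrementAndGet();
    }
  }

  public long getInvalidMultiStageQueryCount() {
    return _invalidMultiStageQueries.get();
  }

  // Placeholder: wire up the v2 (Calcite-based) compile + optimize step here; it should throw on
  // queries the multi-stage engine cannot plan.
  private void compileAndOptimizeWithV2Planner(String sqlQuery) throws Exception {
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}
```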
Metric name and broker config are up for debate (naming is hard) 😄
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## master #13628 +/- ##
============================================
+ Coverage 61.75% 61.99% +0.23%
+ Complexity 207 198 -9
============================================
Files 2436 2554 +118
Lines 133233 140652 +7419
Branches 20636 21879 +1243
============================================
+ Hits 82274 87192 +4918
- Misses 44911 46824 +1913
- Partials 6048 6636 +588
if (sqlNode.getKind().equals(SqlKind.EXPLAIN)) {
  sqlNode = ((SqlExplain) sqlNode).getExplicandum();
}
compileAndOptimizeQuery(sqlNode, plannerContext);
In the initial version, I'd skipped the optimize step here for perf reasons, but it turns out that there are some cases where bad queries are only caught in the optimize step. For instance, this test started failing because the bad query compiles fine but the PinotEvaluateLiteralRule catches the issue during the optimize phase.
_brokerMetrics.addMeteredGlobalValue(BrokerMeter.QUERIES_GLOBAL, 1);
_brokerMetrics.addValueToTableGauge(rawTableName, BrokerGauge.REQUEST_SIZE, query.length());

if (!pinotQuery.isExplain() && _enableMultistageMigrationMetric) {
Can we also add a config to sample the queries we are going to compare? I mean, instead of compiling every query in V2, we could do something like: compile one in every X queries, or X% of the queries (sketched just below).
Given the high QPS Pinot usually has, this sampling should be enough to provide a good overview without having to parse all queries in V2, which may add some impact.
Alternatively, we can delegate this to a background thread that reads from a concurrent queue, so running V1 queries won't be affected by the time spent planning in V2. The important part here is that the thread adding to the queue should use Queue.offer, so if the queue is full it just continues (the query won't be reported, but at the same time it won't be blocked).
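A minimal sketch of the sampling alternative described above, purely for illustration; the sampling-rate parameter and class name are hypothetical, and the queue-based approach the patch ultimately takes is sketched further below.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sampling gate: only a configurable fraction of v1 queries would be
// re-planned with the v2 engine, bounding the extra per-query overhead.
public class QuerySampler {
  private final double _sampleRate; // e.g. 0.01 to check 1% of queries

  public QuerySampler(double sampleRate) {
    _sampleRate = sampleRate;
  }

  public boolean shouldCheckWithV2() {
    return ThreadLocalRandom.current().nextDouble() < _sampleRate;
  }
}
```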
Thanks for the suggestions! I do agree that we need to be careful about affecting query latency in the v1 engine due to this additional query compilation overhead. I liked your second option more because the first option can still introduce unpredictable latency in some queries, which can increase the tail latency by a fair amount. I did some ad-hoc benchmarking and for some queries, the v2 engine query compilation time is actually comparable in order of magnitude to the overall v1 query execution time (and I'm sure in certain cases it could even be higher).
Also, I'm not a big fan of adding more user configuration than is necessary. I've gone with the second option for now: a background thread processing (compiling with v2) a queue of v1 query requests, where the queue capacity is currently hardcoded / not configurable (rough sketch of the pattern below). We could make this queue capacity configurable in the future if we see a need to let users tune the percentage of requests that are effectively sampled based on their query workload.
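A rough sketch of that pattern, assuming hypothetical names and a hardcoded capacity; the real integration points in the broker request handler will differ, and the v2 planning call is a placeholder.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MultiStageCompileQueue {
  // Hardcoded capacity; queries are silently dropped (not reported) when the queue is full.
  private final BlockingQueue<String> _queue = new ArrayBlockingQueue<>(1000);
  private final ExecutorService _executor = Executors.newSingleThreadExecutor();

  public MultiStageCompileQueue() {
    _executor.submit(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          String query = _queue.take();  // block until a query is available
          checkWithV2Planner(query);     // plan with the v2 engine, count failures
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
  }

  /** Called from the v1 request handling path; never blocks the query thread. */
  public void submit(String sqlQuery) {
    _queue.offer(sqlQuery);  // returns false (query dropped) if the queue is full
  }

  private void checkWithV2Planner(String sqlQuery) {
    // Placeholder for the v2 compile/optimize call and the metric increment on failure.
  }
}
```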
I've gone with a simple single-threaded executor and a blocking queue implementation for now, but on second thought I think we might run into contention issues in high-QPS environments, with the request processing threads all vying for the blocking queue's internal lock during the call to offer. I think we could potentially solve this using our own additional external lock and just drop requests if the lock isn't free (using Lock.tryLock), although this would still introduce some additional overhead. Thoughts?
Alternatively, we could potentially use a lock-free bounded queue (like BoundedFifoBuffer from commons-collections) with some additional logic on our end to avoid busy waiting on the consumer side when the buffer is empty. Or, we could use Java's ConcurrentLinkedQueue, which is lock-free but unbounded, and add some logic on our end to make it bounded (plus the logic for avoiding busy waiting on the consumer side when the queue is empty); a rough sketch follows below. Neither option is too appealing though 😄
Edit: for the waiting logic we'd need our own external lock in any case, so we could just wrap a regular non-synchronized bounded queue.
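For illustration only, a rough sketch of the bounded ConcurrentLinkedQueue variant mentioned above; the class and method names are hypothetical, and this is not the approach the PR merged. Instead of an external lock for the waiting logic, the single consumer parks briefly when the queue is empty, keeping the producer (v1 request) path entirely lock-free.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

public class BoundedLockFreeQueryQueue {
  private final Queue<String> _queue = new ConcurrentLinkedQueue<>();
  private final AtomicInteger _size = new AtomicInteger();
  private final int _capacity;

  public BoundedLockFreeQueryQueue(int capacity) {
    _capacity = capacity;
  }

  /** Producer side (v1 request threads): lock-free; drops the query if the queue is full. */
  public boolean offer(String query) {
    if (_size.incrementAndGet() > _capacity) {
      _size.decrementAndGet();
      return false;
    }
    return _queue.offer(query);  // ConcurrentLinkedQueue.offer never blocks
  }

  /** Consumer side (single background thread): parks briefly when empty instead of spinning hot. */
  public String take() {
    while (true) {
      String query = _queue.poll();
      if (query != null) {
        _size.decrementAndGet();
        return query;
      }
      // Cheap backoff: trades a little wakeup latency for zero locking on the producer path.
      LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(1));
    }
  }
}
```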
I don't think that lock will be that problematic. Given this feature is behind a config flag, it can be enabled as needed; if it ends up being a performance issue we can just turn it off.
yashmayya left a comment:
Thanks for the review, Gonzalo!
Approved. Waiting for the conflicts to be fixed, which should be pretty simple given they are in the CommonConstants.