Emit a server side metric to identify if MV columns are present as part of the selection order by queries #9117
Conversation
Codecov Report
```
@@              Coverage Diff              @@
##             master    #9117       +/-   ##
=============================================
- Coverage     69.96%   15.30%   -54.66%
+ Complexity     4674      170     -4504
=============================================
  Files          1838     1790       -48
  Lines         97349    95322     -2027
  Branches      14672    14456      -216
=============================================
- Hits          68107    14588    -53519
- Misses        24493    79700    +55207
+ Partials       4749     1034     -3715
```
I am wondering if we need to flag the offending query by sending the metadata to the broker and flagging the query in the broker log. If the metric is hit, we can search the log, find the offending query, and ask the users to clean up these queries. If the mechanism is made generic enough, it could be used for other such use cases as well.
Thanks for the suggestion! What do you mean by flagging the query on the Broker side?
@siddharthteotia and I did discuss whether we need a Broker side metric for this before opening the PR. We can always look for the metric on the servers, then check the Broker logs for the queries coming in at that time to identify what these queries are. So we didn't think it was necessary to add a metric on the Broker side too. Let me know if you think there are other advantages of adding this on the Broker side as well.
A Broker side metric is probably overkill. ServerInstanceRequestHandler.java has a requestId, so we could add a log message on the server, something like "Request " + requestId + " has MV in ORDER BY." This should make it very easy to pull the query string out of the Broker logs since the request ID is the same between server and broker. That said, the current approach is fine too.
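A minimal sketch of what such a server-side log could look like, assuming a request ID accessor on the incoming query request and a detection flag computed earlier (all identifiers here are illustrative fragments, not the actual Pinot code):

```java
// Illustrative only: flag the offending request on the server so the full query string can be
// grepped out of the Broker log via the shared request ID. Accessor and flag names are assumptions.
long requestId = queryRequest.getRequestId();
if (hasMVExpressionInOrderBy) {
  LOGGER.warn("Request {} has MV in ORDER BY", requestId);
}
```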
Great suggestion @amrishlal! I've updated the PR to add the log on the server side too. As you suggested, we can use that request ID on the Broker side to find the exact offending queries.
…rt of the selection order by queries
Let's rebase this on top of #9119
8631fd4 to 028dc4d
I read the original issue but am not convinced about the motivation to emit a metric when the order by has an MV column. My guess is that you want to identify the queries that have multi-valued columns in the order by fragment and ask users to remove them? IMO, a better solution would be to simply remove the multi-value columns from the order by fragment since they were getting ignored anyway. We can probably throw an exception a few versions later.
Yes, if you read the original PR, there are 2 categories of MV expressions in ORDER BY queries: one is already failing in the engine (with a not very user friendly exception), and in the other the MV expression in ORDER BY is silently ignored. See this comment thread - #9078 (comment). We agreed on the original PR that users should write more sensible queries, and that throwing a proper generic exception for any presence of an MV expression in ORDER BY, indicating it is not supported, is best. However, we wanted to first find out through this metric if anyone is running the 2nd category of query, work with them to change (if any), and then come back to throw the exception properly.
I don't think that's a good solution. Users should simply not write such queries. It's the engine's responsibility to throw meaningful exceptions when users write bad / unsupported queries, as opposed to silently ignoring things (one of the cases today) or throwing a very implementation specific exception (the other case).
The problem will still remain of how to do this smoothly, as opposed to a "potentially" breaking change where we suddenly start throwing an exception for the 2nd case I mentioned above.
Do you mean silently dropping the MV columns from the order-by list? Won't this also be backward incompatible for users that are expecting the MV column in the result returned by the query?
```java
// This metadata field is only required for emitting a metric to identify whether queries using MV columns (as
// identifiers or via transform) are present in the Selection Order-By queries. This was the only place where
// such a metadata could be added in the existing interfaces even though we don't need this on the Broker side.
QUERY_HAS_MV_SELECTION_ORDER_BY("queryHasMVSelectionOrderBy", MetadataValueType.STRING);
```
I'd suggest not adding this because this is a temporary step to catch the bad queries, and we need to remove it later. If we add it as a DataTable.MetadataKey, we cannot remove it in the future. See the javadoc for this enum for why it cannot be removed.
I agree. However, the current code is structured in a way that this is the only mechanism to communicate such info back from the operators to the QueryScheduler code in the server, which emits / logs metrics.
This is a problem as it forces us (like in this case) to add a metadata key when we don't even need one.
Two potential solutions, but it's important to do them cleanly / with the right abstraction:
- either change the APIs in a way that propagates ServerMetrics down into the operators (processQuery API ....)
- or have a map in DataTable which is not part of serialization but serves a transient purpose like the above
Thoughts @somandal @Jackie-Jiang
Good point! Didn't notice that ServerMetrics is not available in the query path.
I'd suggest passing down ServerMetrics because it can be useful for other features when we want to track certain stats at query runtime. ServerMetrics is available in ServerQueryExecutorV1Impl, and we need to pass it down to the plan maker and all the way to the operators.
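A hedged sketch of what that wiring could look like; the constructor argument and the extra plan-node parameter are assumptions for illustration, not the actual Pinot interfaces (imports and the other plan maker methods are omitted):

```java
// Illustrative sketch only: the executor, which already holds a ServerMetrics instance, constructs
// the plan maker with it, and the plan maker forwards it to the plan nodes that build the operators.
public class InstancePlanMakerImplV2 {
  private final ServerMetrics _serverMetrics;

  public InstancePlanMakerImplV2(ServerMetrics serverMetrics) {
    _serverMetrics = serverMetrics;
  }

  public PlanNode makeSegmentPlanNode(IndexSegment indexSegment, QueryContext queryContext) {
    // Hypothetical extra constructor argument; the real SelectionPlanNode takes (indexSegment, queryContext).
    return new SelectionPlanNode(indexSegment, queryContext, _serverMetrics);
  }
}
```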
hey @Jackie-Jiang, @siddharthteotia and I discussed the ServerMetrics change and do think it is a good idea in general. One problem is that we need an additional way to correlate a query with this metric, and for that purpose we actually extended a log in QueryScheduler in this existing PR to capture the value of this flag at the query level, associated with the request ID. This is especially important for high QPS use cases where we may need to comb through hundreds of logs on the Broker side to find the possible offending queries. In addition, it may not be desirable to emit a metric for every segment (since the operators are at a segment level); instead we want to emit just once per query.
Thinking some more, we have a couple of ideas to address this (to have both a log and a metric, and to avoid doing this for every segment belonging to the query):
Option 1 (preferred method):
- Modify the QueryContext to also contain the request ID from the ServerQueryRequest. This will then be available throughout all the plan nodes and operators. We can then use this request ID for logging purposes temporarily.
- Avoid emitting a metric and log in the SelectionOrderByOperator and instead do this at the InstanceResponseOperator level where we will have the combined IntermediateResultsBlock available. This will allow us to emit the metric and log only once per query per server. The SelectionOrderByOperator will still detect this scenario and set it in the IntermediateResultsBlock as is being done in the current PR.
- For now we can avoid modifying all the operators and plan nodes to take the ServerMetrics as input and instead only pass it down to the InstanceResponsePlanNode and InstanceResponseOperator. In the future it should be trivial to extend all operators and plan nodes to have the ServerMetrics if the need arises.
Option 2
- Add a transient metadata object to the IntermediateResultsBlock and potentially DataTable where we can store all the transient metadata fields which aren't serializable and will be ignored when creating the serialized DataTable (a rough sketch follows below). The drawback with this approach is that certain operators like Filter don't use IntermediateResultsBlock directly and we'll need to add getters/setters to all the intermediate block types and copy these over. We still only want to emit the metric and log just once per query.
Let us know your thoughts @Jackie-Jiang
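For concreteness, a rough sketch of the transient metadata idea from Option 2, assuming such a non-serialized map could be added to the block / DataTable implementations (the field and method names are made up for illustration):

```java
// Illustrative only: a transient, per-server metadata map that is never written out when the
// DataTable is serialized, so nothing extra is shipped to the Broker.
private final Map<String, String> _transientMetadata = new HashMap<>();

public void setTransientMetadata(String key, String value) {
  _transientMetadata.put(key, value);
}

public String getTransientMetadata(String key) {
  return _transientMetadata.get(key);
}
```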
Maybe there is no need to log the requestId etc. to be able to correlate with the offending query on the broker. All we need to do is simply log the ORDER BY expressions:
"Table <T> has MV column in ORDER BY. Expressions: <ORDER BY mvCol1, mvCol2>"
Check the table level metric in the internal dashboard, grep the server log for "has MV column in ORDER BY", and use the ORDER BY expressions to grep the broker log and find the full offending query.
So, we just need the following changes (a rough sketch follows the list):
- Pass ServerMetrics to plan maker
- From PlanMaker pass to SelectionOrderByPlanNode / InstanceResponsePlanNode
- Emit table level metric
- Log MV ORDER BY expressions and tableName available in QueryContext in server log for easy grep and mapping back to full query in broker log
- Done.
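Roughly, the operator-side emission could then look like the sketch below; the ServerMeter name and the way _serverMetrics reaches the operator are assumptions, while getTableName() / getOrderByExpressions() mirror the QueryContext accessors used elsewhere in this PR:

```java
// Illustrative only: emit a table-level meter plus a greppable log line once the MV ORDER BY case is detected.
private void logAndEmitMetricForMVSelectionOrderBy() {
  String tableName = _queryContext.getTableName();
  List<OrderByExpressionContext> orderByExpressions = _queryContext.getOrderByExpressions();
  // Hypothetical meter name, added to ServerMeter only for the duration of this temporary check.
  _serverMetrics.addMeteredTableValue(tableName, ServerMeter.MV_SELECTION_ORDER_BY_QUERIES, 1);
  LOGGER.warn("Table {} has MV column in ORDER BY. Expressions: {}", tableName, orderByExpressions);
}
```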
Jackie-Jiang
left a comment
Can we just emit a server metric without other changes? This is a temporary step to track the invalid queries, and after that we want to revert it and throw an exception instead when there is an order-by on MV expressions.
```java
}
List<OrderByExpressionContext> orderByExpressions = _queryContext.getOrderByExpressions();
LOGGER.warn("Table {} has MV column in ORDER BY. Expressions: {}", tableName,
    orderByExpressions == null ? "" : orderByExpressions);
```
The orderByExpressions == null check can be avoided, because if that were the case we should not have gone past line 132.
Okay, I'll remove that, thanks!
```java
// TODO: Remove this once the SelectionOrderByOperator is modified to throw an exception to catch cases where
// an MV column (identifier or via transform) is present on the order-by list
IntermediateResultsBlock intermediateResultsBlock = getCombinedResults();
logAndEmitMetricForQueryHasMVSelectionOrderBy(intermediateResultsBlock);
```
Suggest doing a bit of cleanup (rough sketch below):
- Move the IntermediateResultsBlock and InstanceResponseBlock declarations before the if-else block
- Call logAndEmitMetricForQueryHasMVSelectionOrderBy(intermediateResultsBlock) outside the if-else
- Return instanceResponseBlock after the if-else
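Something along these lines; the if-else condition and the buildInstanceResponseBlock helpers are placeholders standing in for the existing operator logic, not the actual code:

```java
// Illustrative restructuring only: declare the blocks up front, branch for the two cases,
// emit the log/metric exactly once outside the if-else, then return.
IntermediateResultsBlock intermediateResultsBlock = getCombinedResults();
InstanceResponseBlock instanceResponseBlock;
if (exceptionCaught) {  // placeholder condition
  instanceResponseBlock = buildInstanceResponseBlockForException(intermediateResultsBlock);
} else {
  instanceResponseBlock = buildInstanceResponseBlock(intermediateResultsBlock);
}
logAndEmitMetricForQueryHasMVSelectionOrderBy(intermediateResultsBlock);
return instanceResponseBlock;
```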
siddharthteotia
left a comment
lgtm. @Jackie-Jiang PTAL if you like. I want to deploy this asap to start observing the metric and do the final fix pretty quickly
Emit a metric to identify whether Selection OrderBy queries have any multi-value columns as part of the order by clause (either as an identifier or via a transform).
The original PR #9078 gives details about the issue we are trying to catch with this metric. As discussed on the original PR, we decided to emit a metric first, monitor our Pinot tables for this metric, and then make the proper fix to throw an exception for any Selection OrderBy queries with multi-value columns present.
This PR adds the metric by adding a new MetadataKey to the DataTable. The QueryScheduler then emits a metric if this MetadataKey is present and set to true. This mechanism is not ideal but is the only way we could find to emit this metric. Technically we do not need to send this data to the Broker side, nor do we need to do anything more than emit a metric for this case. The issue with updating the MetadataKey is that an extra enum field will now be left behind and cannot be cleaned up for backward compatibility reasons. As a future PR we should consider changing the interfaces slightly to allow capturing some generic metadata which does not need to be sent to the Broker side but which can be used for metrics or logging purposes.
cc @siddharthteotia @Jackie-Jiang
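For reference, a hedged sketch of how the scheduler side could consume this key and emit the metric; the DataTable accessors follow the existing metadata map pattern, while the ServerMeter name and surrounding variables are assumptions for illustration:

```java
// Illustrative only: after the server-side DataTable is built, check the temporary metadata flag
// and emit a table-level meter. The meter name is hypothetical.
String hasMVSelectionOrderBy =
    dataTable.getMetadata().get(DataTable.MetadataKey.QUERY_HAS_MV_SELECTION_ORDER_BY.getName());
if (Boolean.parseBoolean(hasMVSelectionOrderBy)) {
  _serverMetrics.addMeteredTableValue(tableNameWithType, ServerMeter.MV_SELECTION_ORDER_BY_QUERIES, 1);
}
```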