
[xnnpack] basic QDQ operators support#11912

Merged
wejoncy merged 20 commits intomainfrom
jicwen/xnnpack_QDQConv
Aug 11, 2022

Conversation

@wejoncy (Contributor) commented Jun 20, 2022

Description:

  • QDQConv qu8 qs8 qc8
  • QDQSoftmax qu8
  • QDQAveragepool qu8
  • QMaxpool qu8 qs8

@wejoncy wejoncy requested a review from a team as a code owner June 24, 2022 08:15
@wejoncy wejoncy changed the title [xnnpack] QDQConv [xnnpack] basic QDQ operators support Jun 24, 2022
@skottmckay (Contributor):

Can you also please address the cpplint warnings.

Amadeuy previously approved these changes Jul 4, 2022
@wejoncy wejoncy requested a review from skottmckay July 8, 2022 03:17
@wejoncy wejoncy force-pushed the jicwen/xnnpack_QDQConv branch 2 times, most recently from a539394 to 03a54be Compare July 8, 2022 05:21
namespace internal_nhwc_onnx {

void RegisterInternalNHWCOpset() {
ONNX_CONTRIB_OPERATOR_SCHEMA(QLinearSoftmax)
Contributor:

Not sure about the cost of this vs. using function_utils::CreateSchema or something similar to dynamically create a schema as needed.

It makes more sense to me if we also have a CPU EP implementation of the operator.

It makes less sense to me if it's XNNPACK specific, as you're going to have to create/maintain a schema for every quantized operator (including multiple opsets) that we add XNNPACK support for.

What value does this static schema provide that a dynamically created schema does not?

Contributor Author (wejoncy):

A CPU implementation is here. #12177
I will refactor it after that PR is finished.

Contributor:

OK.

Softmax isn't a layout sensitive operator though so the domain should not be onnxruntime::kMSInternalNHWCDomain. We only want to use that for operators that specifically take NCHW input.

The transpose optimizer has a list of layout sensitive ops if you ever need to check.

static std::unordered_set<std::string_view> layout_sensitive_ops = {
    "Conv", "QLinearConv", "BatchNormalization",
    "AveragePool", "GlobalAveragePool", "MaxPool",
    "GlobalMaxPool", "LRN", "GridSample",
    "DepthToSpace", "SpaceToDepth"};

@skottmckay (Contributor) commented Jul 27, 2022

Some extra info. When the layout transformer runs (TransformLayoutForEP) it will replace each layout sensitive node with a node in the NHWC domain, and wrap Transpose operators around that node to convert from the original NCHW to NHWC. That limits the new nodes in the kMSInternalNHWCDomain domain to this list, along with a few ORT specific ones listed here.

Once that process finishes, the transpose optimizer will move the Transpose nodes and cancel as many out as possible, but that process does not involve changing the domain of the other nodes. Due to this the Softmax should stay in the ONNX domain.

Say the EP asked for Conv -> Conv -> Softmax. The first step would create Transpose(to NHWC) -> Conv(kMSInternalNHWCDomain) -> Transpose(to NCHW) -> Transpose(to NHWC) -> Conv(kMSInternalNHWCDomain) -> Transpose(to NCHW) -> Softmax.

Second step would cancel out most of the Transpose ops, and push the last one past the Softmax so it's at the boundary of the nodes requested by the EP to maximize the nodes the EP can run in NHWC layout. That results in Transpose(to NHWC) -> Conv(kMSInternalNHWCDomain) -> Conv(kMSInternalNHWCDomain) -> Softmax -> Transpose(to NCHW).

Contributor Author (wejoncy):

Thanks for explaining.
On top of that, it should be fine to register only layout-sensitive ops in kMSInternalNHWCDomain; output shape inferencing is the same as the ONNX ops.

xnn_compute_type_qu8_to_fp32,*/
};

struct QuantParam {
Contributor:

Sorry - still not understanding this. A NodeUnit has QuantParam info in the input/output defs so why do we need to duplicate it here instead of using NodeUnit?

Also not understanding 'all ops will have at least X or W or Y'. Softmax only has one input. MatMul has 'A' and 'B', not 'X' and 'W'. The quant param info in NodeUnit is attached to each input/output and will match exactly what the node uses.

Jicheng Wen added 2 commits July 25, 2022 20:02
update Xnnpack to latest

unit test

adress comments
@wejoncy wejoncy force-pushed the jicwen/xnnpack_QDQConv branch from 10be23c to 3c7ede3 Compare July 25, 2022 12:05
@lgtm-com (bot) commented Jul 25, 2022

This pull request introduces 5 alerts when merging 9c522c0 into 8d0e86d - view on LGTM.com

new alerts:

  • 5 for Uncontrolled data used in path expression

const GraphViewer& graph,
const std::unordered_set<const Node*>& supported_nodes) {
const Node* fuse_with{nullptr};
static const std::unordered_set<std::string> node_to_be_fuse = {"Conv", "MaxPool", "AveragePool"};
Contributor:

We can do it in a separate PR.

I think we need to try and support both the input and activation nodes being a QDQ group. We should think about things in terms of NodeUnit and whether we can fuse one NodeUnit to another. A supported NodeUnit will have an entry in the map to the ComputeCapability. When fusing we're updating that ComputeCapability. It's just that there may be more steps if they're QDQ groups (e.g. use the zp/scale from the activation node output in the ComputeCapability).

Jicheng Wen added 2 commits July 26, 2022 16:40
@wejoncy wejoncy requested a review from skottmckay August 3, 2022 07:44
@skottmckay (Contributor) left a comment

:shipit:

@wejoncy wejoncy merged commit 819c367 into main Aug 11, 2022
@wejoncy wejoncy deleted the jicwen/xnnpack_QDQConv branch August 11, 2022 02:12
pengwa pushed a commit that referenced this pull request Aug 16, 2022
* basic ops for mobilenet,qconv,qsoftmax,qavgpool

update Xnnpack to latest

unit test

* NodeUnit: use outputedge to replace output-node

* qdq model e2e test

* use inlinedvector to replace vector

* conv bias check

* tensorshape helpers

* Refactor xnn_op minmax

* Qlinearsoftmax schema update

* Remove qlinearsoftmax registration

Co-authored-by: Jicheng Wen <[email protected]>

3 participants