
[Qlinearsoftmax] contrib cpu#12177

Merged
wejoncy merged 37 commits into main from jicwen/qlinearsoftmax
Aug 10, 2022

Conversation

@wejoncy
Contributor

@wejoncy wejoncy commented Jul 14, 2022

Description:
Qlinearsoftmax

  • uint8
  • int8


@wejoncy wejoncy requested a review from a team as a code owner July 14, 2022 11:38
The output tensor has the same shape
and contains the QLinearSoftmax values of the corresponding input.
)DOC")
.Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(-1))
Member

-1

Keep the default value the same as Softmax for consistency.

for (; first < last; first++) {
// reduceMax
uint8_t xmax = *std::max_element(x_t, x_t + D);
const size_t adjustment = xmax ^ 255;
Member

@yufenglee yufenglee Jul 15, 2022

adjustment

is the adjustment necessary? #Resolved

Contributor Author

@wejoncy wejoncy Jul 15, 2022

Yes. The lookup table is built from exp(x - 255); we assume 255 is always the max value in the input tensor.
We only learn the real max value partway through the computation, so a shift of (255 - x_max) would be required for each number at runtime; instead we just fold it into the lookup-table index for convenience.
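The shift-by-adjustment trick described above can be sketched as a standalone illustration (hypothetical code, not the PR's kernel; the real op uses an integer table, while this sketch uses doubles for clarity). The table stores exp((i - 255) * x_scale), and each input index is shifted up by adjustment = x_max ^ 255 (equal to 255 - x_max for uint8), so the row maximum always lands on the table's top entry:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Build a table where entry i holds exp((i - 255) * x_scale).
// 255 is assumed to be the maximum; real inputs are shifted up to it at runtime.
std::vector<double> BuildExpTable(float x_scale) {
  std::vector<double> table(256);
  for (int i = 0; i < 256; ++i) {
    table[i] = std::exp((i - 255) * static_cast<double>(x_scale));
  }
  return table;
}

// Softmax over one row of uint8 inputs using the shifted table lookup.
std::vector<double> SoftmaxViaTable(const std::vector<uint8_t>& x, float x_scale) {
  const std::vector<double> table = BuildExpTable(x_scale);
  const uint8_t xmax = *std::max_element(x.begin(), x.end());
  const uint8_t adjustment = static_cast<uint8_t>(xmax ^ 255);  // == 255 - xmax for uint8
  double sum = 0.0;
  std::vector<double> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    // x[i] + adjustment <= 255, so the lookup stays in range and
    // table[x[i] + adjustment] == exp((x[i] - xmax) * x_scale).
    out[i] = table[x[i] + adjustment];
    sum += out[i];
  }
  for (double& v : out) v /= sum;
  return out;
}
```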

@wejoncy wejoncy requested a review from a team as a code owner July 15, 2022 09:59
@lgtm-com

lgtm-com bot commented Jul 25, 2022

This pull request introduces 1 alert when merging 896ac1b into 0fa3aeb - view on LGTM.com

new alerts:

  • 1 for Explicit returns mixed with implicit (fall through) returns

Contributor

@jchen351 jchen351 left a comment

:shipit:

@jchen351 jchen351 self-requested a review July 27, 2022 17:10
jchen351
jchen351 previously approved these changes Jul 27, 2022
const Tensor& input, Tensor& output,
concurrency::ThreadPool* thread_pool,
const uint32_t* lookup_table) const {
const auto* Y_scale_tensor = context->Input<Tensor>(3);
Contributor

Would be good to get an idea of the binary size of this op using something like SizeBench or bloaty.

There are a lot of places where we could reduce the templatized code or refactor to split out common code.

Contributor Author

@wejoncy wejoncy Jul 30, 2022

After refactoring to remove the templatized code, this file accounts for 26.1 KiB of VM size, which should not be too much.

After consideration, I chose to keep QlinearSoftmaxCPU even though these two functions may take a lot of VM size; keeping them separate keeps them clear and fast. They may also need speedups with platform-specific intrinsic instructions later.

Contributor

On-disk size is more important. There's a size limit for mobile apps in the Google Play store that some partners are very close to.

Contributor Author

VM size and file size are equal here: 22.2 KiB after stripping debug info.

class QLinearSoftmax final : public OpKernel {
public:
QLinearSoftmax(const OpKernelInfo& info);
void BuildLookupTableIfFixed(const OpKernelInfo& info, uint32_t channels);
Contributor

Why does BuildLookupTableIfFixed need to be public (or part of the API at all)? Seems like a file local helper could be used. Keep the class declaration to the minimum required.

Contributor Author

@wejoncy wejoncy Jul 30, 2022

Would keeping this function inside the class make the binary size bigger?

I moved it to a local helper function in an anonymous namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't make the binary size bigger unless the class itself was templatized.

Comment on lines 29 to 30
std::vector<uint32_t> fixed_lookup_table_;
mutable std::vector<uint32_t> tmp_lookup_table_;
Contributor

I don't think either of these should be part of the class.

Definitely a temporary variable should not be a class member.

Contributor Author

Thanks, you are right. If we want the op to be thread-safe, different threads would compete for this buffer.


// concept enabled in cpp20
template <typename T>
constexpr bool ValidType = std::is_same_v<T, int8_t> || std::is_same_v<T, uint8_t>;
Contributor

@skottmckay skottmckay Jul 29, 2022

What's the purpose of this given we only register int8_t and uint8_t kernels?

i.e. in what scenario do we need BuildLookupTableIfFixed?

Assuming we don't actually need it, we can minimize the binary size with something like the below. It minimizes templatized code. Return value optimization means there's no copy to return the vector from QlinearBuildLookupTableUint32. It uses a local static for one-off init in GetLookupTable. The code calling GetLookupTable can branch on a constexpr std::is_same_v<T, int8_t>, which should let the compiler throw away the unused path at each call site.

namespace {
std::vector<uint32_t> QlinearBuildLookupTableUint32(const float x_scale,
                                                    size_t reduce_len,
                                                    bool is_signed) {
  std::vector<uint32_t> lookup_table(256);  // resize up front; reserve alone would leave operator[] writes undefined

  const double qscale = fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len),
                             static_cast<double>(0x7fffff));
  for (int32_t i = 0; i < 256; ++i) {
    double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
    // we can't get the real max of the input tensor here, so we just assume 255.
    // during computation, all numbers are shifted to align with 255.
    // if is_signed: 1 2 3 ......126 127 -128 -127 ..... -3 -2 -1
    uint8_t index = static_cast<uint8_t>(is_signed ? i - 128 : i);
    lookup_table[index] = static_cast<uint32_t>(lrint(scaled_exp_xi));
  }
  return lookup_table;
}

gsl::span<const uint32_t> GetLookupTable(const float x_scale,
                                         size_t reduce_len,
                                         bool is_signed) {
  if (is_signed) {
    static auto signed_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return signed_lookup_table;
  } else {
    static auto unsigned_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return unsigned_lookup_table;
  }
}
}  // namespace

Contributor Author

If x_scale is fixed, then we only need to build the lookup table once.
But if it's dynamic, then we have to rebuild it every time.

It's true that we can get rid of the template; is_signed is enough.
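The fixed-vs-dynamic trade-off in this reply can be sketched as follows (hypothetical class and helper names, not the PR's actual code; the 0x7fffff scaling mirrors the qscale bound used elsewhere in the thread). When the scale is a constant initializer, the table is built once at construction; otherwise it is rebuilt per call into a caller-owned scratch buffer so concurrent calls don't race:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: cache the lookup table when x_scale is known at
// kernel-construction time, otherwise rebuild it for every call.
class SoftmaxTableCache {
 public:
  // Pass nullptr when the scale is a dynamic (per-call) input.
  explicit SoftmaxTableCache(const float* fixed_scale) {
    if (fixed_scale != nullptr) {
      fixed_table_ = Build(*fixed_scale);
    }
  }

  // 'scratch' is caller-owned, so the dynamic path has no shared mutable state.
  const std::vector<uint32_t>& Get(float scale, std::vector<uint32_t>& scratch) const {
    if (!fixed_table_.empty()) return fixed_table_;  // built once; scale ignored
    scratch = Build(scale);
    return scratch;
  }

 private:
  static std::vector<uint32_t> Build(float scale) {
    std::vector<uint32_t> t(256);
    for (int i = 0; i < 256; ++i) {
      t[i] = static_cast<uint32_t>(
          std::lrint(0x7fffff * std::exp((i - 255) * static_cast<double>(scale))));
    }
    return t;
  }

  std::vector<uint32_t> fixed_table_;
};
```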

Comment on lines 261 to 266
const uint32_t* lookup_table = fixed_lookup_table_.data();
if (fixed_lookup_table_.size() == 0) {
tmp_lookup_table_.resize(256);
lookup_table = tmp_lookup_table_.data();
const float X_scale = *(context->Input<Tensor>(1)->Data<float>());
QlinearBuildLookupTableUint32<T>(tmp_lookup_table_.data(), X_scale, reduce_len);
Contributor

You do an allocation in tmp_lookup_table_ (so tmp_lookup_table_ owns that data) and assign it to the local pointer lookup_table. What actually a) allocates data in the fixed_lookup_table_ vector, and b) transfers the values in tmp_lookup_table_ to it?

Possibly a moot question if the above suggestion to rework how the lookup table is created works.

Contributor

What I think is actually happening is that you're re-creating the lookup table in tmp_lookup_table_ on every call to GetLookupTable given tmp_lookup_table_.data() is the pointer being returned. Guessing that would break (silently) as soon as you had concurrent calls given tmp_lookup_table_ could have a resize done from a second call whilst the first call was still using the data.

Contributor Author

Thanks for pointing that out. Would one kernel object be run on multiple threads? I didn't know that.

I don't want to store data in fixed_lookup_table_ unless its meaning stays straightforward: the quantization parameters are fixed. That means I have to recalculate the lookup table every time if they are dynamic, and I am worried about the memory allocation cost.

If the op has to be kept thread-safe, I would rather allocate 256x4 bytes on the stack so each thread owns separate memory.
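The stack-allocation idea can be sketched like this (a hypothetical illustration, assuming the 256-entry table fits comfortably on the stack; each call returns its own 1 KiB std::array by value, so there is no shared mutable state between threads):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Each call builds its table into a local std::array (256 * 4 bytes = 1 KiB).
// Returning by value means every caller (and thread) owns a separate copy.
std::array<uint32_t, 256> BuildTableOnStack(float x_scale) {
  std::array<uint32_t, 256> table{};
  for (int i = 0; i < 256; ++i) {
    table[i] = static_cast<uint32_t>(
        std::lrint(0x7fffff * std::exp((i - 255) * static_cast<double>(x_scale))));
  }
  return table;
}
```

The trade-off versus a cached member table is rebuilding 256 exp calls per invocation, which is cheap relative to the softmax itself for non-trivial row lengths.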

Comment on lines 84 to 88
// special case for Softmax, opset 1-12 and 13+ have different semantic meaning of performing axis
if (target.OpType() == "Softmax") {
replacement_attributes.insert_or_assign(
"opset", utils::MakeAttribute(std::string("opset"), int64_t(target.SinceVersion())));
}
Contributor

CreateReplacementNode is a general-purpose helper. It doesn't scale to have op-specific logic in the implementation. The caller should pass in extra_attributes to set the opset.

Basically you need to create a class derived from UnaryReplaceWithQLinear to add the Softmax-specific behavior at that level. When you register the selector and action for Softmax in qdq_selector_action_transformer.cc, that registration should use the derived action.

Contributor Author

Moved the Softmax-specific logic into UnaryReplaceWithQLinear and overrode the ExtraAttributes method.

run_test(true);
}

TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8) {
Contributor

The QLinearSoftmax kernel requires the 'opset' attribute to be set but I don't see where that happens. I see it added when we convert a QDQ group to Softmax but that should be a different path than using OpTester with a QLinearSoftmax.

Also wondering if we have tests to cover opsets prior to 13 and 13 or later.

Contributor Author

I modified the QLinearSoftmax schema; "opset" is now a required attribute.

The QDQ transformer tests cover QDQ Softmax.

Added opset 12 and 13 tests for QLinearSoftmax, with axis as the second-to-last dimension.

for (int64_t i = 1; i < dims[0]; i++) {
for (int64_t j = 0; j < dims[1]; j++) {
x_in.push_back(x_in[j]);
y_out.push_back(y_out[j]);
Contributor

What's the reason for duplicating the input data? Could we just change the shape to {2, 10} so that we're using unique items for all the input data?

Contributor Author

@wejoncy wejoncy Aug 3, 2022

To simulate a multi-dimensional tensor; I will set 'axis' to -2.

KERNEL_CLASS);

REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, uint8_t, QLinearSoftmax);
REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, int8_t, QLinearSoftmax);
Member

@yufenglee yufenglee Aug 3, 2022

You can register both types with one registration: use ONNX_CPU_OPERATOR_MS_KERNEL with a TypeConstraint that includes both int8_t and uint8_t, like:

.TypeConstraint("TB", {DataTypeImpl::GetTensorType<uint8_t>(), DataTypeImpl::GetTensorType<int8_t>()})

Contributor Author

Sure. Thanks

}

NodeAttributes ExtraAttributes(const RuntimeState&) const override { return extra_attrs_; }

Contributor

what is the reason for making this public?

Contributor Author

Made it protected.

Contributor

@skottmckay skottmckay left a comment

:shipit:

Member

@yufenglee yufenglee left a comment

:shipit:

const double qscale =
fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len), static_cast<double>(0x7fffff));
for (int32_t i = 0; i < 256; i++) {
double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
Member

qscale

What's the benefit of converting it to an integer compared to just using double or float?

Contributor Author

Sounds good to keep using the float type. Will try it in the next PR.

@wejoncy wejoncy merged commit 64e991a into main Aug 10, 2022
@wejoncy wejoncy deleted the jicwen/qlinearsoftmax branch August 10, 2022 02:52
pengwa pushed a commit that referenced this pull request Aug 16, 2022
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contribe operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md