
[Qlinearsoftmax] contrib cpu#12177

Merged
wejoncy merged 37 commits into main from jicwen/qlinearsoftmax
Aug 10, 2022

Conversation

@wejoncy
Contributor

@wejoncy wejoncy commented Jul 14, 2022

Description:
Qlinearsoftmax

  • uint8
  • int8


@wejoncy wejoncy requested a review from a team as a code owner July 14, 2022 11:38
The output tensor has the same shape
and contains the QLinearSoftmax values of the corresponding input.
)DOC")
.Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(-1))
Member

-1

Keep the default value the same as Softmax for consistency.

for (; first < last; first++) {
// reduceMax
uint8_t xmax = *std::max_element(x_t, x_t + D);
const size_t adjustment = xmax ^ 255;
Member

@yufenglee yufenglee Jul 15, 2022

adjustment

is the adjustment necessary? #Resolved

Contributor Author

@wejoncy wejoncy Jul 15, 2022

Yes. The lookup table is built from exp(x - 255); we assume 255 is always the max value in the input tensor.
We only learn the real max value partway through the computation, so a shift of (255 - x_max) would be required for each number at runtime; instead we just fold it into the lookup-table index for convenience.
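The shift-by-adjustment trick described above can be sketched as a standalone illustration (hypothetical code, not the PR's kernel; the real op uses an integer table, while this sketch uses doubles for clarity). The table stores exp((i - 255) * x_scale), and each input index is shifted up by adjustment = x_max ^ 255 (equal to 255 - x_max for uint8), so the row maximum always lands on the table's top entry:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Build a table where entry i holds exp((i - 255) * x_scale).
// 255 is assumed to be the maximum; real inputs are shifted up to it at runtime.
std::vector<double> BuildExpTable(float x_scale) {
  std::vector<double> table(256);
  for (int i = 0; i < 256; ++i) {
    table[i] = std::exp((i - 255) * static_cast<double>(x_scale));
  }
  return table;
}

// Softmax over one row of uint8 inputs using the shifted table lookup.
std::vector<double> SoftmaxViaTable(const std::vector<uint8_t>& x, float x_scale) {
  const std::vector<double> table = BuildExpTable(x_scale);
  const uint8_t xmax = *std::max_element(x.begin(), x.end());
  const uint8_t adjustment = static_cast<uint8_t>(xmax ^ 255);  // == 255 - xmax for uint8
  double sum = 0.0;
  std::vector<double> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    // x[i] + adjustment <= 255, so the lookup stays in range and
    // table[x[i] + adjustment] == exp((x[i] - xmax) * x_scale).
    out[i] = table[x[i] + adjustment];
    sum += out[i];
  }
  for (double& v : out) v /= sum;
  return out;
}
```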

@wejoncy wejoncy requested a review from a team as a code owner July 15, 2022 09:59
@lgtm-com

lgtm-com bot commented Jul 25, 2022

This pull request introduces 1 alert when merging 896ac1b into 0fa3aeb - view on LGTM.com

new alerts:

  • 1 for Explicit returns mixed with implicit (fall through) returns

Contributor

@jchen351 jchen351 left a comment

:shipit:

@jchen351 jchen351 self-requested a review July 27, 2022 17:10
jchen351
jchen351 previously approved these changes Jul 27, 2022
const Tensor& input, Tensor& output,
concurrency::ThreadPool* thread_pool,
const uint32_t* lookup_table) const {
const auto* Y_scale_tensor = context->Input<Tensor>(3);
Contributor

Would be good to get an idea of the binary size of this op using something like SizeBench or bloaty.

There are a lot of places where we could reduce the templatized code or refactor to split out common code.

Contributor Author

@wejoncy wejoncy Jul 30, 2022

After refactoring to remove the templatized code, this file accounts for 26.1 KiB of VM size, which should not be too much.

After consideration, I chose to keep QlinearSoftmaxCPU even though these two functions may take a lot of VM size; keeping them separate keeps them clear and fast. They may also need speedups with platform-specific intrinsic instructions later.

Contributor

On-disk size is more important. There's a size limit for mobile apps in the Google Play store that some partners are very close to.

Contributor Author

VM size and file size are equal here: 22.2 KiB after stripping debug info.

class QLinearSoftmax final : public OpKernel {
public:
QLinearSoftmax(const OpKernelInfo& info);
void BuildLookupTableIfFixed(const OpKernelInfo& info, uint32_t channels);
Contributor

Why does BuildLookupTableIfFixed need to be public (or part of the API at all)? Seems like a file local helper could be used. Keep the class declaration to the minimum required.

Contributor Author

@wejoncy wejoncy Jul 30, 2022

Would keeping this function inside the class make the binary size bigger?

I moved it to a local helper function in an anonymous namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't make the binary size bigger unless the class itself was templatized.

Comment on lines 29 to 30
std::vector<uint32_t> fixed_lookup_table_;
mutable std::vector<uint32_t> tmp_lookup_table_;
Contributor

I don't think either of these should be part of the class.

Definitely a temporary variable should not be a class member.

Contributor Author

Thanks, you are right. If we want the op to be thread-safe, different threads would compete for this buffer.


// concept enabled in cpp20
template <typename T>
constexpr bool ValidType = std::is_same_v<T, int8_t> || std::is_same_v<T, uint8_t>;
Contributor

@skottmckay skottmckay Jul 29, 2022

What's the purpose of this given we only register int8_t and uint8_t kernels?

i.e. in what scenario do we need BuildLookupTableIfFixed?

Assuming we don't actually need it, we can minimize the binary size with something like the below. It minimizes templatized code. Return value optimization means there's no copy to return the vector from QlinearBuildLookupTableUint32. It uses a local static for one-off init in GetLookupTable. The code calling GetLookupTable can branch on a constexpr std::is_same_v<T, int8_t>, which should let the compiler throw away the unused path at each call site.

namespace {
std::vector<uint32_t> QlinearBuildLookupTableUint32(const float x_scale,
                                                    size_t reduce_len,
                                                    bool is_signed) {
  std::vector<uint32_t> lookup_table(256);  // resize up front; reserve alone would leave operator[] writes undefined

  const double qscale = fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len),
                             static_cast<double>(0x7fffff));
  for (int32_t i = 0; i < 256; ++i) {
    double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
    // we can't get the real max of the input tensor here, so we just assume 255.
    // during computation, all numbers are shifted to align with 255.
    // if is_signed: 1 2 3 ......126 127 -128 -127 ..... -3 -2 -1
    uint8_t index = static_cast<uint8_t>(is_signed ? i - 128 : i);
    lookup_table[index] = static_cast<uint32_t>(lrint(scaled_exp_xi));
  }
  return lookup_table;
}

gsl::span<const uint32_t> GetLookupTable(const float x_scale,
                                         size_t reduce_len,
                                         bool is_signed) {
  if (is_signed) {
    static auto signed_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return signed_lookup_table;
  } else {
    static auto unsigned_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return unsigned_lookup_table;
  }
}
}  // namespace

Contributor Author

If x_scale is fixed, then we only need to build the lookup table once.
But if it's dynamic, then we have to rebuild it every time.

It's true that we can get rid of the template; is_signed is enough.
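The fixed-vs-dynamic trade-off in this reply can be sketched as follows (hypothetical class and helper names, not the PR's actual code; the 0x7fffff scaling mirrors the qscale bound used elsewhere in the thread). When the scale is a constant initializer, the table is built once at construction; otherwise it is rebuilt per call into a caller-owned scratch buffer so concurrent calls don't race:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: cache the lookup table when x_scale is known at
// kernel-construction time, otherwise rebuild it for every call.
class SoftmaxTableCache {
 public:
  // Pass nullptr when the scale is a dynamic (per-call) input.
  explicit SoftmaxTableCache(const float* fixed_scale) {
    if (fixed_scale != nullptr) {
      fixed_table_ = Build(*fixed_scale);
    }
  }

  // 'scratch' is caller-owned, so the dynamic path has no shared mutable state.
  const std::vector<uint32_t>& Get(float scale, std::vector<uint32_t>& scratch) const {
    if (!fixed_table_.empty()) return fixed_table_;  // built once; scale ignored
    scratch = Build(scale);
    return scratch;
  }

 private:
  static std::vector<uint32_t> Build(float scale) {
    std::vector<uint32_t> t(256);
    for (int i = 0; i < 256; ++i) {
      t[i] = static_cast<uint32_t>(
          std::lrint(0x7fffff * std::exp((i - 255) * static_cast<double>(scale))));
    }
    return t;
  }

  std::vector<uint32_t> fixed_table_;
};
```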

Comment on lines 261 to 266
const uint32_t* lookup_table = fixed_lookup_table_.data();
if (fixed_lookup_table_.size() == 0) {
tmp_lookup_table_.resize(256);
lookup_table = tmp_lookup_table_.data();
const float X_scale = *(context->Input<Tensor>(1)->Data<float>());
QlinearBuildLookupTableUint32<T>(tmp_lookup_table_.data(), X_scale, reduce_len);
Contributor

You do an allocation in tmp_lookup_table_ (so tmp_lookup_table_ owns that data) and assign it to the local pointer lookup_table. What actually a) allocates data in the fixed_lookup_table_ vector, and b) transfers the values in tmp_lookup_table_ to it?

Possibly a moot question if the above suggestion to rework how the lookup table is created works.

Contributor

What I think is actually happening is that you're re-creating the lookup table in tmp_lookup_table_ on every call to GetLookupTable given tmp_lookup_table_.data() is the pointer being returned. Guessing that would break (silently) as soon as you had concurrent calls given tmp_lookup_table_ could have a resize done from a second call whilst the first call was still using the data.

Contributor Author

Thanks for pointing that out. Would one kernel object be run on multiple threads? I didn't know that.

I don't want to store data in fixed_lookup_table_ unless its meaning stays straightforward: the quantization parameters are fixed. That means I have to recalculate the lookup table every time if they are dynamic, and I am worried about the memory allocation cost.

If the op has to be kept thread-safe, I would rather allocate 256x4 bytes on the stack so each thread owns separate memory.
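The stack-allocation idea can be sketched like this (a hypothetical illustration, assuming the 256-entry table fits comfortably on the stack; each call returns its own 1 KiB std::array by value, so there is no shared mutable state between threads):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Each call builds its table into a local std::array (256 * 4 bytes = 1 KiB).
// Returning by value means every caller (and thread) owns a separate copy.
std::array<uint32_t, 256> BuildTableOnStack(float x_scale) {
  std::array<uint32_t, 256> table{};
  for (int i = 0; i < 256; ++i) {
    table[i] = static_cast<uint32_t>(
        std::lrint(0x7fffff * std::exp((i - 255) * static_cast<double>(x_scale))));
  }
  return table;
}
```

The trade-off versus a cached member table is rebuilding 256 exp calls per invocation, which is cheap relative to the softmax itself for non-trivial row lengths.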

Comment on lines 84 to 88
// special case for Softmax, opset 1-12 and 13+ have different semantic meaning of performing axis
if (target.OpType() == "Softmax") {
replacement_attributes.insert_or_assign(
"opset", utils::MakeAttribute(std::string("opset"), int64_t(target.SinceVersion())));
}
Contributor

CreateReplacementNode is a general-purpose helper. It doesn't scale to have op-specific logic in the implementation. The caller should pass in extra_attributes to set the opset.

Basically you need to create a class derived from UnaryReplaceWithQLinear to add the Softmax-specific behavior at that level. When you register the selector and action for Softmax in qdq_selector_action_transformer.cc, that registration should use the derived action.

Contributor Author

Moved the Softmax-specific logic into UnaryReplaceWithQLinear and overrode the ExtraAttributes method.

run_test(true);
}

TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8) {
Contributor

The QLinearSoftmax kernel requires the 'opset' attribute to be set but I don't see where that happens. I see it added when we convert a QDQ group to Softmax but that should be a different path than using OpTester with a QLinearSoftmax.

Also wondering if we have tests to cover opsets prior to 13 and 13 or later.

Contributor Author

I modified the QLinearSoftmax schema; "opset" is now a required attribute.

The QDQ transformer tests cover QDQ Softmax.

Added opset 12 and 13 tests for QLinearSoftmax, with axis as the second-to-last dimension.

for (int64_t i = 1; i < dims[0]; i++) {
for (int64_t j = 0; j < dims[1]; j++) {
x_in.push_back(x_in[j]);
y_out.push_back(y_out[j]);
Contributor

What's the reason for duplicating the input data? Could we just change the shape to {2, 10} so that we're using unique items for all the input data?

Contributor Author

@wejoncy wejoncy Aug 3, 2022

To simulate a multi-dimensional tensor; I will set 'axis' to -2.

KERNEL_CLASS);

REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, uint8_t, QLinearSoftmax);
REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, int8_t, QLinearSoftmax);
Member

@yufenglee yufenglee Aug 3, 2022

You can register both types with one registration: use ONNX_CPU_OPERATOR_MS_KERNEL with a TypeConstraint that includes both int8_t and uint8_t, like:

.TypeConstraint("TB", {DataTypeImpl::GetTensorType<uint8_t>(), DataTypeImpl::GetTensorType<int8_t>()})

Contributor Author

Sure. Thanks

}

NodeAttributes ExtraAttributes(const RuntimeState&) const override { return extra_attrs_; }

Contributor

what is the reason for making this public?

Contributor Author

Made it protected.

Contributor

@skottmckay skottmckay left a comment

:shipit:

Member

@yufenglee yufenglee left a comment

:shipit:

const double qscale =
fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len), static_cast<double>(0x7fffff));
for (int32_t i = 0; i < 256; i++) {
double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
Member

qscale

What's the benefit of converting it to an integer compared to just using double or float?

Contributor Author

Sounds good to keep using the float type. Will try it in the next PR.

@wejoncy wejoncy merged commit 64e991a into main Aug 10, 2022
@wejoncy wejoncy deleted the jicwen/qlinearsoftmax branch August 10, 2022 02:52
pengwa pushed a commit that referenced this pull request Aug 16, 2022
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contribe operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md