Conversation
```cpp
The output tensor has the same shape
and contains the QLinearSoftmax values of the corresponding input.
)DOC")
    .Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(-1))
```
```cpp
for (; first < last; first++) {
  // reduceMax
  uint8_t xmax = *std::max_element(x_t, x_t + D);
  const size_t adjustment = xmax ^ 255;
```
Yes, the lookup table is built from exp(x - 255); we assume 255 is always the max value in the input tensor.
We only learn the real max value in the middle of the computation, so a shift of (255 - x_max) for each number would be required at runtime; we just adjust the lookup-table index for convenience.
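The index-adjustment trick described above can be sketched in isolation. This is a minimal illustration, not the actual kernel: `SoftmaxNumerators` is a hypothetical helper, and it uses `double` values instead of the fixed-point uint32 table.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the actual kernel): unquantized softmax numerators
// via a 256-entry table built once for exp((i - 255) * x_scale), i.e. as if
// the maximum input value were always 255.
std::vector<double> SoftmaxNumerators(const std::vector<uint8_t>& x, float x_scale) {
  double table[256];
  for (int i = 0; i < 256; ++i) {
    table[i] = std::exp(static_cast<double>(i - 255) * x_scale);
  }
  // At runtime the real max is known. For uint8, xmax ^ 255 == 255 - xmax,
  // so adding it to every index turns table[x + adjustment] into
  // exp((x - xmax) * x_scale) without rebuilding the table.
  const uint8_t xmax = *std::max_element(x.begin(), x.end());
  const size_t adjustment = static_cast<size_t>(xmax ^ 255);
  std::vector<double> out;
  out.reserve(x.size());
  for (uint8_t v : x) out.push_back(table[v + adjustment]);
  return out;
}
```

Since every input is at most `xmax`, the shifted index `v + adjustment` never exceeds 255, and the table never needs to be rebuilt per row.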
This pull request introduces 1 alert when merging 896ac1b into 0fa3aeb - view on LGTM.com new alerts:
```cpp
                               const Tensor& input, Tensor& output,
                               concurrency::ThreadPool* thread_pool,
                               const uint32_t* lookup_table) const {
  const auto* Y_scale_tensor = context->Input<Tensor>(3);
```
Would be good to get an idea of the binary size of this op using something like SizeBench or bloaty.
There are a lot of places where we could reduce the templatized code or refactor to split out common code.
After refactoring to remove the templatized code, this file accounts for 26.1Ki of VM size, which should not be too much.
After consideration, I chose to keep QLinearSoftmaxCPU even though these two functions may take a lot of VM size; keeping the two functions separate makes them clear and fast. They may also need speedups with platform-specific intrinsic instructions later.
On-disk size is more important. There's a size limit for mobile apps in the Google Play store that some partners are very close to.
VM size and file size have the same value here: 22.2Ki after stripping debug info.
```cpp
class QLinearSoftmax final : public OpKernel {
 public:
  QLinearSoftmax(const OpKernelInfo& info);
  void BuildLookupTableIfFixed(const OpKernelInfo& info, uint32_t channels);
```
Why does BuildLookupTableIfFixed need to be public (or part of the API at all)? Seems like a file local helper could be used. Keep the class declaration to the minimum required.
Would keeping this function within the class make the binary size bigger?
I moved this function to a local helper function in an anonymous namespace.
It shouldn't make the binary size bigger unless the class itself was templatized.
```cpp
  std::vector<uint32_t> fixed_lookup_table_;
  mutable std::vector<uint32_t> tmp_lookup_table_;
```
I don't think either of these should be part of the class.
Definitely a temporary variable should not be a class member.
Thanks, you are right. If we want an op to be thread safe, different threads would compete with each other for this buffer.
```cpp
// concept enabled in cpp20
template <typename T>
constexpr bool ValidType = std::is_same_v<T, int8_t> || std::is_same_v<T, uint8_t>;
```
What's the purpose of this given we only register int8_t and uint8_t kernels?
i.e. in what scenario do we need BuildLookupTableIfFixed?
Assuming we don't actually need it, we can minimize the binary size with something like the below. Minimizes templatized code. Return value optimization will mean there's no copy to return the vector from QlinearBuildLookupTableUint32. Utilizes a local static for one-off init in GetLookupTable. The code calling GetLookupTable can use a constexpr std::is_same<T, int8_t> which should allow the compiler to throw away the unused path for each calling site.
```cpp
namespace {
std::vector<uint32_t> QlinearBuildLookupTableUint32(const float x_scale,
                                                    size_t reduce_len,
                                                    bool is_signed) {
  std::vector<uint32_t> lookup_table(256);
  const double qscale = fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len),
                             static_cast<double>(0x7fffff));
  for (int32_t i = 0; i < 256; ++i) {
    double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
    // we can't get the real max value of the input tensor here, so we just assume 255.
    // during computation, all numbers will be shifted to align with 255.
    // if is_signed: 1 2 3 ... 126 127 -128 -127 ... -3 -2 -1
    uint8_t index = static_cast<uint8_t>(is_signed ? i - 128 : i);
    lookup_table[index] = static_cast<uint32_t>(lrint(scaled_exp_xi));
  }
  return lookup_table;
}

gsl::span<const uint32_t> GetLookupTable(const float x_scale,
                                         size_t reduce_len,
                                         bool is_signed) {
  if (is_signed) {
    static auto signed_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return signed_lookup_table;
  } else {
    static auto unsigned_lookup_table = QlinearBuildLookupTableUint32(x_scale, reduce_len, is_signed);
    return unsigned_lookup_table;
  }
}
}  // namespace
```
If x_scale is fixed, then we only need to build lookup_table once.
But if it's dynamic, then we have to build it every time.
It's true that we can get rid of the template; is_signed is enough.
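The one-off initialization that a function-local static provides, which the suggestion above relies on, can be demonstrated in isolation. This is a toy sketch (`GetTable` and `g_builds` are illustrative names, not from the PR), and it also shows why the approach only works when the scale is fixed: later calls with a different value silently reuse the first table.

```cpp
#include <cassert>

// Counts how many times the table initializer has actually run.
static int g_builds = 0;

// A function-local static runs its initializer exactly once (thread-safely
// since C++11), so a lookup table built this way is only correct if x_scale
// never changes across calls.
int GetTable(int x_scale) {
  static int table = (++g_builds, x_scale);  // built on the first call only
  return table;
}
```

Calling `GetTable(3)` and then `GetTable(7)` returns 3 both times while `g_builds` stays at 1, which is exactly the fixed-versus-dynamic distinction raised in this thread.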
```cpp
  const uint32_t* lookup_table = fixed_lookup_table_.data();
  if (fixed_lookup_table_.size() == 0) {
    tmp_lookup_table_.resize(256);
    lookup_table = tmp_lookup_table_.data();
    const float X_scale = *(context->Input<Tensor>(1)->Data<float>());
    QlinearBuildLookupTableUint32<T>(tmp_lookup_table_.data(), X_scale, reduce_len);
```
you do an allocation in tmp_lookup_table_ (so tmp_lookup_table_ owns that data) and assign it to the pointer in lookup_table (which is a local variable). what actually a) allocates data in the fixed_lookup_table_ vector, and b) transfers the values in tmp_lookup_table_ to it?
Possibly a moot question if the above suggestion to rework how the lookup table is created works.
What I think is actually happening is that you're re-creating the lookup table in tmp_lookup_table_ on every call to GetLookupTable given tmp_lookup_table_.data() is the pointer being returned. Guessing that would break (silently) as soon as you had concurrent calls given tmp_lookup_table_ could have a resize done from a second call whilst the first call was still using the data.
Thanks for pointing that out. Would one object run in multiple threads? I didn't know that.
I don't want to store the data in fixed_lookup_table_ because its meaning is straightforward: the quantization parameters are fixed. So I have to recalculate the lookup table every time it is dynamic, and I am worried about the memory allocation cost.
If the op has to be kept thread-safe, I would rather allocate 256x4 bytes on the stack so that each thread owns a separate buffer.
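The stack-buffer alternative mentioned above can be sketched as follows. The names (`BuildLookupTable`, `LastEntryFor`) and the table contents are hypothetical placeholders; the point is only the per-call storage pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical builder: fills a caller-provided fixed-size buffer instead of
// a mutable class member, so nothing is shared between concurrent callers.
void BuildLookupTable(uint32_t (&table)[256], uint32_t seed) {
  for (size_t i = 0; i < 256; ++i) {
    table[i] = seed + static_cast<uint32_t>(i);  // placeholder entries
  }
}

uint32_t LastEntryFor(uint32_t seed) {
  uint32_t table[256];  // 256 x 4 bytes of per-call stack storage
  BuildLookupTable(table, seed);
  return table[255];
}
```

Because each call owns its own `table`, there is no data race even if two threads run the computation at once, at the cost of rebuilding the table per call when the scale is dynamic.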
```cpp
  // special case for Softmax, opset 1-12 and 13+ have different semantic meaning of performing axis
  if (target.OpType() == "Softmax") {
    replacement_attributes.insert_or_assign(
        "opset", utils::MakeAttribute(std::string("opset"), int64_t(target.SinceVersion())));
  }
```
CreateReplacementNode is a general purpose helper. It doesn't scale to have op specific logic in the implementation. The caller should pass in extra_attributes to set the opset.
Basically you need to create a derived class from UnaryReplaceWithQLinear to add the Softmax-specific behavior at that level. When you register the selector and action for Softmax in qdq_selector_action_transformer.cc, that registration should use the derived action.
Moved the Softmax-specific logic into UnaryReplaceWithQLinear and overrode the ExtraAttributes method.
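The resulting pattern can be sketched roughly like this. The type names mirror the discussion, but these definitions are simplified stand-ins, not the real ONNX Runtime classes:

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for the real attribute container.
using NodeAttributes = std::map<std::string, long long>;

// Base action exposes an overridable hook; by default no extra attributes.
struct UnaryReplaceWithQLinear {
  virtual ~UnaryReplaceWithQLinear() = default;
  virtual NodeAttributes ExtraAttributes() const { return {}; }
};

// Only the Softmax variant injects the extra 'opset' attribute.
struct SoftmaxReplaceWithQLinear : UnaryReplaceWithQLinear {
  long long since_version = 13;  // would come from target.SinceVersion()
  NodeAttributes ExtraAttributes() const override {
    // Softmax-specific: opset 1-12 and 13+ interpret 'axis' differently,
    // so the replacement node records which semantics to use.
    return {{"opset", since_version}};
  }
};
```

This keeps the general-purpose replacement helper free of op-specific logic, as requested in the review comment above.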
```cpp
  run_test(true);
}

TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8) {
```
The QLinearSoftmax kernel requires the 'opset' attribute to be set but I don't see where that happens. I see it added when we convert a QDQ group to Softmax but that should be a different path than using OpTester with a QLinearSoftmax.
Also wondering if we have tests to cover opsets prior to 13 and 13 or later.
I modified the QLinearSoftmax schema; "opset" is now a required attribute.
The QDQ transformer test covers QDQ Softmax.
Added opset 12 and 13 tests for QLinearSoftmax, with axis set to the second-to-last dimension.
```cpp
  for (int64_t i = 1; i < dims[0]; i++) {
    for (int64_t j = 0; j < dims[1]; j++) {
      x_in.push_back(x_in[j]);
      y_out.push_back(y_out[j]);
```
What's the reason for duplicating the input data? Could we just change the shape to {2, 10} so that we're using unique items for all the input data?
To simulate a multi-dimensional tensor; I will set 'axis' to -2.
```cpp
      KERNEL_CLASS);

REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, uint8_t, QLinearSoftmax);
REGISTER_QLINEAR_LOOKUPTABLE_TYPED_KERNEL(QLinearSoftmax, 1, int8_t, QLinearSoftmax);
```
You can register both types with one registration: use ONNX_CPU_OPERATOR_MS_KERNEL with a TypeConstraint that includes both int8_t and uint8_t, like:
```cpp
  }

  NodeAttributes ExtraAttributes(const RuntimeState&) const override { return extra_attrs_; }
```
what is the reason for making this public?
Changed it to protected.
```cpp
  const double qscale =
      fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len), static_cast<double>(0x7fffff));
  for (int32_t i = 0; i < 256; i++) {
    double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
```
Sounds good to keep using the float type. Will try it in the next PR.
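For reference, the qscale clamp quoted above can be checked in isolation. `QScale` here is a hypothetical extraction of the quoted expression: capping at UINT32_MAX / reduce_len guarantees that a sum of reduce_len table entries, each at most qscale, stays within uint32_t range, while the 0x7fffff cap bounds the size of any single entry.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical extraction of the quoted qscale expression: each lookup-table
// entry is at most qscale, so summing reduce_len of them cannot overflow a
// uint32_t accumulator.
double QScale(size_t reduce_len) {
  return std::fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len),
                   static_cast<double>(0x7fffff));
}
```

For small reductions the 0x7fffff cap dominates; for long reductions the UINT32_MAX / reduce_len term takes over, keeping the worst-case sum at or below UINT32_MAX.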
* [Qlinearsoftmax] contrib cpu
* int8 implementation
* contrib operator md
* qdq transformer test
* new attribute: opset
* doc
* quantized tool
* remove template to reduce binary size
* doc of contrib operators
* enforce x_shape is valid
* fix reduce_size if input shape is dynamic
* add UT
* register one op for reducing binary size
* kernel hash update
* docs/ContribOperators.md
Description: QLinearSoftmax

Motivation and Context