
Fix #334: Improve the batch norm algorithm #339

Closed
zolkis wants to merge 1 commit into webmachinelearning:main from zolkis:batchnorm-algorithm

Conversation

zolkis commented Feb 7, 2023

A draft PR for now; review is still welcome.
It depends on the prior merge of #337, and possibly #329.
Please put general comments and discussion in #334, and syntax-related comments in this PR.



zolkis commented Feb 8, 2023

Note: this is not expected to build until #337 is merged.

zolkis commented Mar 18, 2023

Fixed the initialization vs. execution issue for batch norm.
Moved the layout checks from the note into the algorithmic steps.
Removed the statements affected by the linking issues with #337. After that PR is merged, this one will be re-checked and marked for review.

@zolkis zolkis changed the title [WiP] Fix #334: Improve the batch norm algorithm Fix #334: Improve the batch norm algorithm Mar 19, 2023
zolkis commented Mar 21, 2023

I am preparing the PR for input() and constant(), and that changes this PR, too. Moving it back to WiP.

@zolkis zolkis changed the title Fix #334: Improve the batch norm algorithm WiP: Fix #334: Improve the batch norm algorithm Mar 21, 2023
zolkis commented Mar 22, 2023

Small changes done, removing WiP. Can be reviewed/merged. @huningxin PTAL.

@zolkis zolkis changed the title WiP: Fix #334: Improve the batch norm algorithm Fix #334: Improve the batch norm algorithm Mar 22, 2023
index.bs Outdated
Comment on lines 1186 to 1188
Normalize the tensor values of input features using [[Batch-Normalization]].
Batch normalization modifies the *mean* and *variance* of the inputs of a layer's activation functions, using parameters learned during the training phase over mini-batches of the training set. The goal is to reduce the internal covariate shift of the inputs, hence allowing higher learning rates and regularizing the inputs.
During training, a pair of scaling and biasing parameters *gamma* and *beta* are learned over the input distribution of the mini-batches of the training set. Also, a pair of *moving mean* and *moving variance* are calculated using a parameter called *momentum* and the averages of means and variances calculated over the mini-batches. During inference, the layer output is normalized using the learned *gamma* (used as *scale*) and *beta* (used as *bias*) parameters, as well as the *moving mean* (used as *mean*) and *moving variance* (used as *variance*) parameters.
Collaborator

Although it would be hard for me to precisely review the accuracy of these new statements about batchnorm, I can say that this additional information is not necessarily needed in the context of describing the behavior of this operation during model inference.

For instance, the paragraph starts with "Batch normalization modifies the mean and variance...". This isn't true for the inference pass. The mean and variance values are simply given as inputs to the inference pass of batchnorm. They were previously computed during the training pass, but WebNN only concerns itself with inference and not training, so how these values came to be may be irrelevant. The original text mentions that the mean and variance values are precomputed only to draw a distinction with the similar operation instanceNormalization, where both value vectors are computed on the fly, within a single batch, during the inference pass.

My recommendation is to keep the original text as-is.
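The distinction drawn above can be illustrated with a minimal Python sketch of the inference pass (a hypothetical helper for illustration, not WebNN API code): mean and variance arrive as precomputed inputs, and nothing is modified or re-estimated.

```python
import math

def batch_norm_inference(x, mean, variance, scale, bias, epsilon=1e-5):
    # Per-channel inference-pass batch normalization. All of mean,
    # variance, scale and bias are precomputed inputs (learned during
    # training); the inference pass only reads them.
    return [
        scale[c] * (x[c] - mean[c]) / math.sqrt(variance[c] + epsilon) + bias[c]
        for c in range(len(x))
    ]

# Two channels: a value equal to its channel mean normalizes to 0;
# a value two standard deviations above the mean normalizes to ~2.
out = batch_norm_inference(
    x=[1.0, 5.0], mean=[1.0, 3.0], variance=[4.0, 1.0],
    scale=[1.0, 1.0], bias=[0.0, 0.0])
```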

index.bs Outdated
Comment on lines 1179 to 1210
- *axis*: an {{unsigned long}} scalar. The index to the feature count dimension of the input shape for which the mean and variance values are. Its value must be in the range [0, N-1] where N is the rank of input tensor. When it's not specified, the default value is 1.
- *epsilon*: a {{float}} scalar. A small value to prevent computational error due to divide-by-zero. The default value is 0.00001 when not specified.
- *axis*: a {{long}} scalar, the index of the feature dimension of the input shape, which is collapsed for calculating the mean and variance for all other input dimensions. The value is between 0 and the rank of the input tensor. When it's not specified, the default value is 1, corresponding to the channel (*"c"*) dimension in the *"nchw"* data layout.
Collaborator

axis is an unsigned long type. I believe that's what we agreed on, which is consistent with other operations with the same param. Why do we change it here to long?

Also, please keep the existing description of the axis param. The new description doesn't read well to me. For instance, the index is to the "feature count" dimension of the input shape, and not the "feature" dimension. Nor was it collapsed for the calculation of the means and variance values. Those wordings just seem wrong.

index.bs Outdated
Comment on lines 1211 to 1212
- *scale*: an {{MLOperand}}. The 1-D tensor (vector) of the scaling values (the *gamma* parameters learned during training), whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *bias*: an {{MLOperand}}. The 1-D tensor (vector) of the bias values (the *beta* parameter learned during training) whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
Collaborator

The additional text added here (about the gamma and beta parameters) isn't necessary and could well be confusing to readers, since neither is a parameter of the batchnorm API.
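The shape rule in the quoted text (scale and bias are 1-D tensors whose size equals the input dimension selected by axis) can be sketched as a hypothetical validation helper; the function name and error messages are illustrative, not part of the spec:

```python
def validate_batch_norm_args(input_shape, axis, scale_size, bias_size):
    # Illustrative check: scale and bias must be 1-D vectors sized to
    # the input dimension indexed by axis.
    if not (0 <= axis < len(input_shape)):
        raise ValueError("axis out of range")
    expected = input_shape[axis]
    if scale_size != expected or bias_size != expected:
        raise ValueError(f"scale/bias must have size {expected}")
    return expected

# "nchw" input with 3 channels and the default axis=1:
# scale and bias must both have size 3.
assert validate_batch_norm_args([2, 3, 4, 4], axis=1,
                                scale_size=3, bias_size=3) == 3
```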

index.bs Outdated
- *axis*: a {{long}} scalar, the index of the feature dimension of the input shape, which is collapsed for calculating the mean and variance for all other input dimensions. The value is between 0 and the rank of the input tensor. When it's not specified, the default value is 1, corresponding to the channel (*"c"*) dimension in the *"nchw"* data layout.
- *scale*: an {{MLOperand}}. The 1-D tensor (vector) of the scaling values (the *gamma* parameters learned during training), whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *bias*: an {{MLOperand}}. The 1-D tensor (vector) of the bias values (the *beta* parameter learned during training) whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *epsilon*: a {{float}} scalar. A small value added to the variance of the batch in order to prevent computational error due to divide-by-zero. When it's not specified, the default value is 0.00001.
Collaborator

According to the batchnorm calculations, the epsilon value is added to the variance of the mini-batch being inspected, inside the square root of the normalization denominator. I think this is extra information that may not be needed in the description of epsilon, because the spec already provides a reference to the full paper (which includes all the calculations relevant to batchnorm). Moreover, the additional wording being introduced here may cause confusion, as it sounds like each of the precomputed variance values is adjusted by the epsilon amount, which would be incorrect from the mathematical aspect of the behavior. My recommendation is to keep the existing wording.
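For reference, epsilon enters the computation only inside the square root of the normalization denominator; the stored variance values themselves are never changed. A small sketch (the helper name is hypothetical):

```python
import math

def norm_denominator(variance, epsilon=1e-5):
    # Epsilon is added to the variance inside the square root, so even
    # a zero variance yields a finite, non-zero denominator.
    return math.sqrt(variance + epsilon)

assert norm_denominator(0.0) > 0.0             # no divide-by-zero
assert norm_denominator(4.0) > math.sqrt(4.0)  # only slightly above sqrt(var)
```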

index.bs Outdated
Comment on lines 1231 to 1233
<div class="note">
The axis value designates the feature or channel count dimension of the input tensor.
</div>
Collaborator

This is redundant with what is already stated above this section.

Comment on lines 1236 to 1417
<div class="informalsteps">
1. If |options|.activation [=map/exists=], the implementation MAY use it to fuse it with the batch normalization.
</div>
Collaborator

If we feel compelled to add this note, there may be a better way to express it. The current wording doesn't read well because it doesn't explain what "fuse" means in this context, or what "it" refers to. A more complete wording may be: "At graph execution time, the implementation may choose to combine the execution of the standard batch normalization operation with the optional activation function provided by options.activation, to improve the overall execution performance of the combined operation."

Collaborator Author

I agree, but I think the use of "fuse" should be handled in a separate issue/PR, consistently over the whole spec.
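The semantics both comments are circling can be sketched in Python: whatever "fusing" means at the implementation level, the observable result must match running batch normalization and then the activation function in sequence. The helper names below are hypothetical:

```python
import math

def relu(v):
    return v if v > 0.0 else 0.0

def batch_norm_with_activation(x, mean, variance, scale, bias,
                               activation=None, epsilon=1e-5):
    # Reference semantics: a fused implementation may combine the two
    # steps for performance, but must produce exactly this result.
    y = [scale[c] * (x[c] - mean[c]) / math.sqrt(variance[c] + epsilon) + bias[c]
         for c in range(len(x))]
    return [activation(v) for v in y] if activation else y

# The negative normalized value is clamped to 0 by the activation.
out = batch_norm_with_activation(
    [0.0, 4.0], mean=[2.0, 2.0], variance=[1.0, 1.0],
    scale=[1.0, 1.0], bias=[0.0, 0.0], activation=relu)
```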

@zolkis zolkis changed the title Fix #334: Improve the batch norm algorithm [WiP] Fix #334: Improve the batch norm algorithm Apr 13, 2023
zolkis commented Apr 13, 2023

I will reboot this PR completely. Marking WiP.

@zolkis zolkis changed the title [WiP] Fix #334: Improve the batch norm algorithm [Draft] Fix #334: Improve the batch norm algorithm Apr 13, 2023
@zolkis zolkis force-pushed the batchnorm-algorithm branch 2 times, most recently from d74b0f2 to 710e5ba Compare April 17, 2023 19:54
zolkis commented Apr 17, 2023

I have reset to the original text, and kept the new/moved text.
@wchao1115 @huningxin: it can be re-reviewed. Thanks for the patience.

@zolkis zolkis changed the title [Draft] Fix #334: Improve the batch norm algorithm Fix #334: Improve the batch norm algorithm Apr 17, 2023
@zolkis zolkis requested a review from wchao1115 April 20, 2023 13:39
@zolkis zolkis force-pushed the batchnorm-algorithm branch from 710e5ba to 24fbd8d Compare May 16, 2023 15:25
@zolkis zolkis force-pushed the batchnorm-algorithm branch from 24fbd8d to 0eea13e Compare May 16, 2023 15:30
zolkis commented May 16, 2023

@wchao1115 PTAL - reduced the branch to one clean commit.

zolkis commented Jun 13, 2023

Requested changes have been done in 0eea13e.
Continued in / moved to #397 (changed the target branch from main to
https://github.com/webmachinelearning/webnn/tree/zk-conventions-integration).

zolkis commented Jun 13, 2023

Closing this PR, as it's moved to #397.

@zolkis zolkis closed this Jun 13, 2023
@zolkis zolkis deleted the batchnorm-algorithm branch November 12, 2024 13:57