Fix #334: Improve the batch norm algorithm #339
zolkis wants to merge 1 commit into webmachinelearning:main from
Conversation
Note this is not supposed to build yet until #337 is merged.
Fixed the issue of initialization vs execution for batch norm.
I am preparing the PR for input() and constant(), and that changes this PR, too. Moving it back to WiP.
Small changes done, removing WiP. Can be reviewed/merged. @huningxin PTAL.
index.bs
Outdated
Normalize the tensor values of input features using [[Batch-Normalization]].
Batch normalization modifies the *mean* and *variance* of the inputs of a layer's activation functions, using parameters learned during the training phase over mini-batches of the training set. The goal is to reduce the internal covariate shift of the inputs, hence allowing higher learning rates and regularizing the inputs.
During training, a pair of scaling and biasing parameters *gamma* and *beta* are learned over the input distribution of the mini-batches of the training set. Also, a pair of *moving mean* and *moving variance* are calculated using a parameter called *momentum* and the averages of means and variances calculated over the mini-batches. During inference, the layer output is normalized using the learned *gamma* (used as *scale*) and *beta* (used as *bias*) parameters, as well as the *moving mean* (used as *mean*) and *moving variance* (used as *variance*) parameters.
Although it would be hard for me to precisely review the accuracy of these new statements about batchnorm, what I can say here is that this additional information is not necessarily needed when describing the behavior of this operation during model inference.
For instance, the paragraph starts with "Batch normalization modifies the mean and variance...". This isn't true for the inference pass. The mean and variance values are simply given as inputs to the inference pass of batchnorm. They were previously computed during the training pass, but WebNN only concerns itself with inference, not training, so how these values came to be may be irrelevant. The only reason the original text even mentions that the mean and variance values are previously computed is to draw a distinction with the similar operation instanceNormalization, where both value vectors are computed on the fly, within a single batch, during the inference pass.
My recommendation is to keep the original text as-is.
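To illustrate the reviewer's point, here is a minimal, hypothetical sketch (plain Python, not the WebNN API) of what the inference pass of batch normalization does: mean and variance arrive as precomputed inputs, and nothing is derived from the current batch. The function name and parameters are illustrative only.

```python
import math

# Hypothetical sketch of inference-time batch normalization.
# mean and variance are *inputs*, precomputed during training;
# the inference pass computes nothing from the current batch.
def batch_norm_inference(x, mean, variance, scale, bias, epsilon=1e-5):
    # y = (x - mean) / sqrt(variance + epsilon) * scale + bias
    inv_std = 1.0 / math.sqrt(variance + epsilon)
    return [(v - mean) * inv_std * scale + bias for v in x]
```

With epsilon set to 0 for a clean round-trip, an input of [1.0, 3.0] with mean 2.0 and variance 1.0 normalizes to [-1.0, 1.0].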
index.bs
Outdated
- *axis*: an {{unsigned long}} scalar. The index to the feature count dimension of the input shape for which the mean and variance values are. Its value must be in the range [0, N-1] where N is the rank of input tensor. When it's not specified, the default value is 1.
- *epsilon*: a {{float}} scalar. A small value to prevent computational error due to divide-by-zero. The default value is 0.00001 when not specified.
- *axis*: a {{long}} scalar, the index of the feature dimension of the input shape, which is collapsed for calculating the mean and variance for all other input dimensions. The value is between 0 and the rank of the input tensor. When it's not specified, the default value is 1, corresponding to the channel (*"c"*) dimension in the *"nchw"* data layout.
axis is an {{unsigned long}} type. I believe that's what we agreed on, and it's consistent with other operations that take the same param. Why do we change it here to long?
Also, please keep the existing description of the axis param. The new description doesn't read well to me. For instance, the index refers to the "feature count" dimension of the input shape, not the "feature" dimension. Nor is that dimension collapsed for the calculation of the mean and variance values. Those wordings just seem wrong.
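To make the axis semantics concrete, here is a hypothetical sketch (plain Python lists, not the WebNN API) of batch normalization over an "nchw"-shaped input with axis=1: the 1-D mean, variance, scale, and bias vectors each have length input.shape[axis], and each channel is normalized with its own per-channel values. All names are illustrative.

```python
import math

# Hypothetical sketch: batchnorm over an "nchw" input with axis=1
# (the channel dimension). mean/var/scale/bias are 1-D vectors whose
# length equals the size of the input's dimension at `axis`; they are
# broadcast across all the other dimensions.
def batch_norm_nchw(x, mean, var, scale, bias, eps=1e-5):
    n_, c_, h_, w_ = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    assert len(mean) == len(var) == len(scale) == len(bias) == c_
    return [[[[(x[n][c][h][w] - mean[c]) / math.sqrt(var[c] + eps)
               * scale[c] + bias[c]
               for w in range(w_)] for h in range(h_)]
             for c in range(c_)] for n in range(n_)]
```

For a tiny (1, 2, 1, 1) input whose two channel values equal their per-channel means, the output is all zeros.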
index.bs
Outdated
- *scale*: an {{MLOperand}}. The 1-D tensor (vector) of the scaling values (the *gamma* parameters learned during training), whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *bias*: an {{MLOperand}}. The 1-D tensor (vector) of the bias values (the *beta* parameter learned during training) whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
The additional text added here (about the gamma and beta params) isn't necessary and could well be confusing to readers, since neither is a parameter of the batchnorm API.
index.bs
Outdated
- *axis*: a {{long}} scalar, the index of the feature dimension of the input shape, which is collapsed for calculating the mean and variance for all other input dimensions. The value is between 0 and the rank of the input tensor. When it's not specified, the default value is 1, corresponding to the channel (*"c"*) dimension in the *"nchw"* data layout.
- *scale*: an {{MLOperand}}. The 1-D tensor (vector) of the scaling values (the *gamma* parameters learned during training), whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *bias*: an {{MLOperand}}. The 1-D tensor (vector) of the bias values (the *beta* parameter learned during training) whose size is equal to the size of the input dimension indexed by {{MLBatchNormalizationOptions/axis}}.
- *epsilon*: a {{float}} scalar. A small value added to the variance of the batch in order to prevent computational error due to divide-by-zero. When it's not specified, the default value is 0.00001.
According to the batchnorm calculations, the epsilon value is added to the square of the corresponding variance value of the mini-batch being inspected. I think this is extra information that may not be needed in the description of epsilon, because the spec already provides a reference to the full paper, which includes all the calculations relevant to batchnorm. Moreover, the additional wording introduced here may, in my opinion, cause more confusion: it sounds as if each of the precomputed variance values is adjusted by the epsilon amount, which would be incorrect from the mathematical aspect of the behavior. My recommendation is to keep the existing wording.
index.bs
Outdated
<div class="note">
The axis value designates the feature or channel count dimension of the input tensor.
</div>
This is redundant with what is already stated above this section.
<div class="informalsteps">
1. If |options|.activation [=map/exists=], the implementation MAY use it to fuse it with the batch normalization.
</div>
If we feel compelled to add this note, there may be a better way to express it. The current wording doesn't read well because it doesn't explain what "fuse" means in this context, or what "it" refers to. For example, a more complete wording may be: "At graph execution time, the implementation may choose to combine the execution of the standard batch normalization operation with the optional activation function provided by options.activation, to improve the overall execution performance of the combined operation."
I agree, but I think the use of "fuse" should be handled in a separate issue/PR, consistently over the whole spec.
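To clarify what fusion buys in this discussion, here is a hypothetical sketch (plain Python, not the WebNN API) contrasting the unfused two-pass form with a fused single-pass form. Both produce the same result; the fused version traverses the data once. All names are illustrative.

```python
import math

def relu(v):
    # A simple activation used for the demonstration.
    return max(v, 0.0)

def batch_norm_then_activation(x, mean, var, scale, bias, eps, act):
    # Unfused: two passes over the data (normalize, then activate).
    normalized = [(v - mean) / math.sqrt(var + eps) * scale + bias for v in x]
    return [act(v) for v in normalized]

def fused_batch_norm_activation(x, mean, var, scale, bias, eps, act):
    # Fused: a single pass, applying the activation immediately
    # to each normalized element.
    inv_std = 1.0 / math.sqrt(var + eps)
    return [act((v - mean) * inv_std * scale + bias) for v in x]
```

An implementation could pick either form at graph execution time; the observable output is identical, which is why fusion is purely a performance choice.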
I will reboot this PR completely. Marking WiP.
Force-pushed: d74b0f2 to 710e5ba
I have reset to the original text, and kept the new/moved text.
Force-pushed: 710e5ba to 24fbd8d
Signed-off-by: Zoltan Kis <[email protected]>
Force-pushed: 24fbd8d to 0eea13e
@wchao1115 PTAL - reduced the branch to one clean commit.
Requested changes have been done in 0eea13e.
Closing this PR, as it's moved to #397.
A draft PR for now. Needs review, though.
Depends on prior merge of #337 and possibly #329.
Please make general comments and discussion in #334, and syntax-related comments in this PR.