Batch Normalization is a popular technique used to train deep neural networks. It normalizes the input to a layer during every training iteration using the statistics of the current mini-batch of data. It smooths the optimization landscape, leading to more stable and faster training.
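To make the mechanics concrete, here is a minimal NumPy sketch of what Batch Normalization computes during training. The mini-batch, the feature count, and the initial values of the learnable scale (gamma) and shift (beta) parameters are illustrative assumptions, not values from the text:

import numpy as np

# Hypothetical mini-batch of 32 samples with 4 features each.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
var = x.var(axis=0)                     # per-feature variance over the mini-batch
eps = 1e-5                              # small constant for numerical stability
x_hat = (x - mu) / np.sqrt(var + eps)   # roughly zero mean, unit variance

gamma, beta = np.ones(4), np.zeros(4)   # learnable scale and shift (initial values)
y = gamma * x_hat + beta                # output passed on to the next layer

print(x_hat.mean(axis=0).round(3), x_hat.std(axis=0).round(3))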
Maria has two options to batch-normalize the input of the hidden layer she is interested in: she can place Batch Normalization either before the previous layer's activation function or after it. Remember that Batch Normalization normalizes the data that goes into the layer, so Maria needs to ensure she applies it before that layer.
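A minimal Keras sketch of the two placements follows. The input dimension, layer sizes, and ReLU activations are placeholder assumptions chosen only to illustrate where the BatchNormalization layer sits relative to the activation and to the hidden layer Maria cares about:

from tensorflow import keras
from tensorflow.keras import layers

# Option A: Batch Normalization before the previous layer's activation
# (the placement recommended in the original paper).
model_a = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, use_bias=False),      # bias is redundant when BN follows
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(32, activation="relu"),   # the hidden layer Maria wants to affect
    layers.Dense(1),
])

# Option B: Batch Normalization after the previous layer's activation,
# so the normalized statistics reach the next layer unchanged.
model_b = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(32, activation="relu"),   # the hidden layer Maria wants to affect
    layers.Dense(1),
])

In both models the normalization happens before the hidden layer of interest; the only difference is whether the previous layer's activation runs before or after it.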
But where is it better to use Batch Normalization? Before or after the previous layer’s activation function?
The authors of “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” recommend using it right before the activation function:
The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution.
That has been the way many teams have used Batch Normalization, but later experiments have shown that it often works better when applied after the previous layer's activation function. Here is an excerpt from "Busting the Myth About Batch Normalization" from the Paperspace blog:
While the original paper talks about applying batch norm just before the activation function, it has been found in practice that applying batch norm after the activation yields better results. This seems to make sense, as if we were to put an activation after batch norm, then the batch norm layer cannot fully control the statistics of the input going into the next layer since the output of the batch norm layer has to go through an activation.
Based on this more recent evidence, for her use case, Maria should use Batch Normalization after the previous layer's activation function.
Recommended reading