A morning walk

Published

October 30, 2022

Maria was taking her morning walk when she decided to modify her deep neural network architecture.

She had been researching different ways to speed up her training process and decided to experiment with Batch Normalization right before one of the hidden layers of her network.

There was only one question left to answer:

How should Maria use Batch Normalization as part of her model?

  1. Maria should use Batch Normalization before the activation function of the layer that comes before the layer she wants to affect.

  2. Maria should use Batch Normalization after the activation function of the layer that comes before the layer she wants to affect.

  3. Maria should use Batch Normalization right after the layer she wants to affect but before that layer’s activation function.

  4. Maria should use Batch Normalization right after the activation function of the layer she wants to affect.

The correct answer is 2.

Batch Normalization is a popular technique used to train deep neural networks. It normalizes the input to a layer during every training iteration using the statistics of a mini-batch of data. It smooths the optimization landscape, leading to more stable and faster training.
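To make the mechanics concrete, here is a minimal NumPy sketch of the training-time computation, where gamma, beta, and eps stand for the usual learnable scale, learnable shift, and numerical-stability constant (the shapes and values below are only illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: a mini-batch of layer inputs, shape (batch_size, features)
    mean = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

x = 3.0 * np.random.randn(32, 4) + 7.0        # a mini-batch with arbitrary statistics
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```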

Maria has two options to batch-normalize the input of the hidden layer she is interested in: she can apply Batch Normalization either before the previous layer’s activation function or after it. Remember that Batch Normalization normalizes the data that goes into the layer, so Maria needs to make sure she applies it before that layer.

But where is it better to use Batch Normalization? Before or after the previous layer’s activation function?

The authors of “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” recommend using it right before the activation function:

The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution.
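In layer code, that recommendation translates into placing the normalization between the affine transformation and the nonlinearity. The following is a minimal Keras sketch of that ordering (Keras and the layer sizes are assumptions for illustration, not part of Maria’s story):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Ordering recommended by the original paper: Dense -> BatchNormalization -> activation.
model = tf.keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, use_bias=False),   # bias is redundant here: BN's beta already shifts the output
    layers.BatchNormalization(),        # normalize the pre-activation values
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
```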

That is how many teams have used Batch Normalization, but later experiments showed that it works best when applied after the previous layer’s activation function. Here is an excerpt from “Busting the Myth About Batch Normalization” on the Paperspace blog:

While the original paper talks about applying batch norm just before the activation function, it has been found in practice that applying batch norm after the activation yields better results. This seems to make sense, as if we were to put an activation after batch norm, then the batch norm layer cannot fully control the statistics of the input going into the next layer since the output of the batch norm layer has to go through an activation.

Based on this more recent evidence, for her use case Maria should use Batch Normalization after the previous layer’s activation function.
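Under the same assumptions as the previous sketch, Maria’s setup would place the normalization after the previous layer’s activation, so the normalized values feed directly into the layer she wants to affect:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),  # previous layer together with its activation
    layers.BatchNormalization(),          # normalize what flows into the next layer
    layers.Dense(64, activation="relu"),  # the hidden layer Maria wants to affect
    layers.Dense(10, activation="softmax"),
])
```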

Recommended reading