The following is defined so that gradients are only taken with respect to β and γ in batch norm layers:
```julia
trainable(bn::BatchNorm) = (bn.β, bn.γ)
```
However, this stops us from using `params` and `loadparams!` to save and load the full layer state, because the other two fields, μ and σ², are also updated during training and likewise need to be saved and loaded.
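As a minimal sketch of the problem (assuming a Flux version where `trainable` is defined as above), `params` walks the model through `trainable`, so the running statistics never make it into the parameter list:

```julia
using Flux

bn = BatchNorm(3)

# params collects only the trainable fields, so just β and γ show up;
# the running statistics μ and σ² are left out.
ps = Flux.params(bn)
length(ps)  # == 2

# loadparams! restores those same two arrays, so μ and σ² would be
# lost across a params-based save/load round trip.
```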
Maybe it's just fine not to define `trainable(bn::BatchNorm) = (bn.β, bn.γ)`, since μ and σ² don't seem to have gradients anyway?
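If the immediate goal is just persistence, one possible workaround (a sketch, not necessarily the right fix for `trainable` itself) is to serialize the whole layer rather than only the arrays returned by `params`, e.g. with BSON, so that μ and σ² travel along with β and γ:

```julia
using Flux, BSON

bn = BatchNorm(3)

# Saving the struct itself keeps all four fields (β, γ, μ, σ²),
# unlike saving only the arrays collected by Flux.params(bn).
BSON.@save "bn.bson" bn

# Later, reload the complete layer, running statistics included.
BSON.@load "bn.bson" bn
```

This sidesteps `loadparams!` entirely, at the cost of tying the saved file to the layer's struct layout.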