(Receptive Field - Field of View) - Effective Receptive Field (ERF) - Theoretical Receptive Field (TRF)
the receptive field (field of view) is the region of the input image that a particular neuron in a convolutional layer is “looking at”, i.e. takes into account when extracting features or making predictions
its size is determined by the kernel (filter) size, stride, and dilation of every layer up to and including the neuron's own layer, so it also grows with network depth
two types of receptive fields:
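theoretical receptive field (TRF) - every input pixel that can influence the neuron; in theory all of these pixels are treated as contributing equally
effective receptive field (ERF) - the part of the TRF that actually has a noticeable influence on the neuron; in CNNs it is usually concentrated around the center of the TRF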
Theoretical Receptive Field (TRF) - Examples
the receptive field in a convolutional neural network with two stacked 3x3 convolutional (conv) layers
in the output of the 2nd conv layer, every pixel has a 5x5 field of view, a.k.a. receptive field, on the input (assuming stride 1)
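The 5x5 figure can be checked with the usual layer-by-layer recurrence for the TRF. The following is a minimal sketch, assuming square kernels and no dilation; the function name `theoretical_rf` and the `(kernel_size, stride)` tuple format are illustrative choices, not something defined in these notes.

```python
def theoretical_rf(layers):
    """Return the TRF side length after a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, ordered from the
    input towards the output. (Hypothetical helper for illustration.)
    """
    rf = 1    # a single input pixel only sees itself
    jump = 1  # spacing, in input pixels, between neighbouring units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each extra kernel tap adds `jump` input pixels
        jump *= stride                  # striding spreads the next layer's units apart
    return rf


print(theoretical_rf([(3, 1)]))          # 3
print(theoretical_rf([(3, 1), (3, 1)]))  # 5
```

With stride 1 every additional 3x3 layer adds 2 pixels to the side length, which is the incremental growth described for CNNs.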
Receptive fields of CNNs vs. Transformers
in CNNs, the receptive field grows incrementally one layer after another
in transformers, the receptive field spans all input tokens after a single layer. Yet these receptive field estimates are only theoretical
in CNNs, the actual receptive field differs from the theoretical one
Theoretical Receptive Field (TRF) vs Effective Receptive Field (ERF)
In CNNs, the pixels at the center of a receptive field have a much larger impact on the output than the pixels near its border
In the forward pass, the center pixels can propagate information to the output through many different paths, while border pixels have only a few
Therefore, during the backward pass, the center pixels receive a much larger gradient magnitude
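The path argument can be made concrete by counting paths in a simplified setting. This is a sketch under stated assumptions: a 1D input, stride 1, all kernel weights equal to one, and no nonlinearity, so the gradient of the output with respect to an input pixel equals the number of paths from that pixel to the output. The helper names `conv_full` and `path_counts` are hypothetical.

```python
def conv_full(signal, kernel):
    """Full 1D discrete convolution of two lists of numbers."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def path_counts(num_layers, kernel_size=3):
    """Paths from each input pixel inside the TRF to a single output unit."""
    counts = [1]  # the output unit itself
    for _ in range(num_layers):
        # Each layer lets every unit reach `kernel_size` neighbours below it.
        counts = conv_full(counts, [1] * kernel_size)
    return counts


print(path_counts(2))  # [1, 2, 3, 2, 1]
print(path_counts(3))  # [1, 3, 6, 7, 6, 3, 1]
```

Even with three layers the center pixel already has seven paths to the output while each border pixel has one, and the gap widens quickly with depth.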
The ERF is the receptive field of a CNN evaluated empirically.
While the TRF depends on the architecture only, the ERF is measured as the gradient dy/dx of an output unit y with respect to the input x, so it depends on the input, i.e., different inputs generate different ERFs
The ERF is approximately Gaussian in shape and occupies only a fraction of the full TRF
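A rough way to see why the ERF covers only a fraction of the TRF is to measure how much of the gradient profile is non-negligible as depth grows. The sketch below assumes the same simplified setting as a linear 1D network with all-ones 3-tap kernels and stride 1. The 5% cut-off relative to the peak is an arbitrary threshold chosen for illustration, and `uniform_erf_1d` and `erf_fraction` are hypothetical names.

```python
def uniform_erf_1d(num_layers, kernel_size=3):
    """Gradient profile dy/dx of one output unit for all-ones kernels (1D, stride 1)."""
    profile = [1]
    for _ in range(num_layers):
        pad = [0] * (kernel_size - 1)
        padded = pad + profile + pad
        # Sliding-window sum is a full convolution with an all-ones kernel.
        profile = [sum(padded[i:i + kernel_size])
                   for i in range(len(padded) - kernel_size + 1)]
    return profile


def erf_fraction(num_layers, rel_threshold=0.05):
    """Share of TRF positions whose gradient is at least `rel_threshold` of the peak."""
    profile = uniform_erf_1d(num_layers)
    peak = max(profile)
    effective = sum(1 for v in profile if v >= rel_threshold * peak)
    return effective / len(profile)


for n in (5, 20, 80):
    trf = 2 * n + 1  # TRF width for n stacked 3-tap layers
    print(n, trf, round(erf_fraction(n), 2))
```

In this toy setting the TRF width grows linearly with the number of layers, while the width of the bell-shaped profile grows only with its square root, so the printed fraction shrinks as the network gets deeper.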
comparing the effect of:
(1) the number of layers on the ERF
(2) random weight initialization on the ERF
(3) nonlinear activation on the ERF
Kernel size is fixed at 3 × 3 for all the networks.
Uniform: convolutional kernel weights are all ones, no nonlinearity;
Random: random kernel weights, no nonlinearity;
Random + ReLU: random kernel weights, ReLU nonlinearity
Comparing the effect of nonlinearities (ReLU, Tanh, and Sigmoid) on the ERF.
ReLU makes the distribution a bit less Gaussian. A ReLU unit outputs exactly zero for roughly half of its inputs, so it is easy to get a zero output for the center pixel on the output plane
Comparing the effect of subsampling and dilation on the ERF. Both increase the ERF significantly
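The effect of subsampling and dilation can be illustrated on the same kind of toy model. This is a sketch, assuming a linear 1D network with all-ones 3-tap kernels, where each layer is described by a hypothetical `(dilation, stride)` pair. The spread of the gradient profile (its standard deviation over positions) is used here as a stand-in for ERF size, which is an assumption of this example.

```python
def erf_profile(layers, kernel_size=3):
    """Gradient profile for all-ones kernels; `layers` lists (dilation, stride) per layer."""
    profile = [1]
    jump = 1  # spacing, in input pixels, between neighbouring units of the current layer
    for dilation, stride in layers:
        step = dilation * jump  # spacing of the kernel taps measured in input pixels
        taps = [0] * ((kernel_size - 1) * step + 1)
        for t in range(kernel_size):
            taps[t * step] = 1
        out = [0] * (len(profile) + len(taps) - 1)
        for i, p in enumerate(profile):
            for j, w in enumerate(taps):
                out[i + j] += p * w
        profile = out
        jump *= stride  # subsampling widens the spacing seen by later layers
    return profile


def spread(profile):
    """Standard deviation of the profile, treated as a distribution over positions."""
    total = sum(profile)
    mean = sum(i * v for i, v in enumerate(profile)) / total
    var = sum((i - mean) ** 2 * v for i, v in enumerate(profile)) / total
    return var ** 0.5


plain = erf_profile([(1, 1)] * 4)
dilated = erf_profile([(1, 1), (2, 1), (4, 1), (8, 1)])
strided = erf_profile([(1, 2), (1, 2), (1, 2), (1, 1)])

print(len(plain), round(spread(plain), 2))      # 9 1.63
print(len(dilated), round(spread(dilated), 2))  # 31 7.53
print(len(strided), round(spread(strided), 2))  # 31 7.53
```

With the same four layers, doubling the dilation per layer or subsampling by 2 after each of the first three layers spreads the taps over 31 input pixels instead of 9, and the profile becomes several times wider.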
Comparison of ERF before and after training for models trained on:
CIFAR-10 classification
CamVid semantic segmentation tasks
The effective receptive field grows significantly after training.
In the CIFAR experiment, the TRF is 74x74 (i.e. bigger than the 32x32 input image). Yet the ERF still does not cover the whole input image