The receptive field is the region of the input image that a particular neuron in a convolutional layer is “looking at”, i.e. takes into account when extracting features or making predictions
It is determined by the kernel (filter) sizes and strides of the convolutions leading up to that neuron, and therefore also by the number of layers
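There are two types of receptive fields:
- theoretical receptive field (TRF) - the full input region that can reach the neuron; every pixel in it is treated as contributing equally
- effective receptive field (ERF) - the part of that region that actually influences the neuron; in CNNs it is usually concentrated around the center of the TRF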

Theoretical Receptive Field (TRF) - Examples


Receptive field in a convolutional neural network with two 3x3 convolutional (conv) layers: in the 2nd conv layer, every pixel has a 5x5 field of view, a.k.a. receptive field

Receptive fields of CNNs vs. Transformers: in CNNs, the receptive field grows incrementally, one layer after another; in transformers, the receptive field spans all input tokens after a single layer. Yet these receptive field estimates are only theoretical: in CNNs, the actual receptive field differs from the theoretical one
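The 5x5 figure for two 3x3 conv layers can be reproduced with the usual layer-by-layer recurrence. The sketch below is a minimal illustration, assuming square kernels and no dilation; the function name `theoretical_receptive_field` and the `(kernel_size, stride)` layer format are made up for this note.

```python
def theoretical_receptive_field(layers):
    """Side length of the TRF of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Hypothetical helper for illustration; assumes square kernels, no dilation.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride                  # striding spreads later units further apart
    return rf


# Two stacked 3x3 convolutions with stride 1, as in the example above.
print(theoretical_receptive_field([(3, 1), (3, 1)]))  # 5
```

With stride 1 everywhere, each extra 3x3 layer adds 2 pixels to the side length, which is the incremental, layer-by-layer growth described for CNNs.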

Theoretical Receptive Field (TRF) vs Effective Receptive Field (ERF)

In CNNs, the pixels at the center of a receptive field have a large impact on the output
In the forward pass, the center pixels can propagate information to the output through many different paths
Therefore, during a backward pass, the center pixels have a much larger gradient magnitude
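
The "many different paths" argument can be made concrete by counting paths. The sketch below is a simplified 1D illustration, assuming stride 1 and no padding effects; `count_paths` is a hypothetical helper, not something from the source.

```python
def count_paths(num_layers, kernel_size=3):
    """Count the paths from each input position to a single output unit (1D).

    Hypothetical helper for illustration; assumes stride 1 and that every
    kernel tap is a valid connection.
    """
    paths = [1]  # start at the output unit: one trivial path to itself
    for _ in range(num_layers):
        widened = [0] * (len(paths) + kernel_size - 1)
        for i, p in enumerate(paths):
            for k in range(kernel_size):
                widened[i + k] += p  # each unit is reachable through kernel_size taps
        paths = widened
    return paths


print(count_paths(2))  # [1, 2, 3, 2, 1]
print(count_paths(3))  # [1, 3, 6, 7, 6, 3, 1]
```

The center position always has the most paths and the borders have exactly one. With all-ones weights these counts are also the gradient magnitudes, and for a 2D all-ones kernel the count at pixel (i, j) is the product of the 1D counts at i and j.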

The ERF is the receptive field of a CNN as measured empirically, rather than derived from the architecture

While the TRF depends only on the architecture, the ERF, measured as the gradient dy/dx of an output unit y with respect to the input x, also depends on the input, i.e., different inputs generate different ERFs
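
A minimal sketch of measuring dy/dx by hand, assuming a toy 1D network of 3-tap convolutions with zero padding and random weights. All function names and sizes here are hypothetical, and a real experiment would use an autodiff framework on 2D images.

```python
import random


def conv1d_same(x, w):
    """Cross-correlate x with a 3-tap kernel w, zero-padded to keep the length."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(w[k] * padded[i + k] for k in range(3)) for i in range(len(x))]


def conv1d_same_backward(grad_out, w):
    """Gradient w.r.t. the input of conv1d_same, given the gradient w.r.t. its output."""
    n = len(grad_out)
    grad_in = [0.0] * n
    for i in range(n):
        for k in range(3):
            j = i + k - 1  # input index that output i reads through tap k
            if 0 <= j < n:
                grad_in[j] += w[k] * grad_out[i]
    return grad_in


def erf_1d(x, kernels, use_relu):
    """Return dy/dx, where y is the center unit of the last layer."""
    masks = []
    h = list(x)
    for w in kernels:  # forward pass, remembering which ReLUs were active
        h = conv1d_same(h, w)
        if use_relu:
            masks.append([1.0 if v > 0 else 0.0 for v in h])
            h = [max(v, 0.0) for v in h]
    grad = [0.0] * len(x)
    grad[len(x) // 2] = 1.0  # dy/dy = 1 at the center output unit
    for idx in range(len(kernels) - 1, -1, -1):  # backward pass
        if use_relu:
            grad = [g * m for g, m in zip(grad, masks[idx])]
        grad = conv1d_same_backward(grad, kernels[idx])
    return grad


rng = random.Random(0)
kernels = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
x_a = [rng.gauss(0, 1) for _ in range(21)]
x_b = [rng.gauss(0, 1) for _ in range(21)]

lin_a = erf_1d(x_a, kernels, use_relu=False)
lin_b = erf_1d(x_b, kernels, use_relu=False)
relu_a = erf_1d(x_a, kernels, use_relu=True)
relu_b = erf_1d(x_b, kernels, use_relu=True)

print(lin_a == lin_b)  # True: without a nonlinearity the gradient ignores the input
print([round(g, 3) for g in relu_a])
print([round(g, 3) for g in relu_b])
```

In this sketch the input dependence comes entirely from the ReLU masks: in the purely linear case dy/dx is fixed by the weights alone. In both cases the gradient is zero outside the 9-pixel TRF of the four layers.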

The ERF both follows an approximately Gaussian distribution and occupies only a fraction of the full TRF
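
Both properties can be seen at once in the simplest setting, assumed here for illustration: 1D, stride 1, all-ones k-tap kernels, no nonlinearity (the symbols n, k, and σ are notation introduced for this note). The ERF is then the n-fold convolution of a uniform k-tap kernel with itself, which the central limit theorem pushes toward a Gaussian whose width grows only like √n, while the TRF grows linearly in n.

```latex
\begin{aligned}
\text{variance of one uniform } k\text{-tap kernel} &= \frac{k^2 - 1}{12} \\
\sigma_{\text{ERF}} &= \sqrt{\frac{n\,(k^2 - 1)}{12}} \\
\text{TRF width} &= n\,(k - 1) + 1 \\
\frac{\sigma_{\text{ERF}}}{\text{TRF width}} &\approx \frac{1}{\sqrt{n}}\,\sqrt{\frac{k + 1}{12\,(k - 1)}} \qquad (n \text{ large})
\end{aligned}
```

For example, with k = 3 and n = 20 this gives σ ≈ 3.65 pixels against a TRF width of 41 pixels.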

Comparing the effect on the ERF of:
- (1) the number of layers
- (2) random weight initialization
- (3) nonlinear activation
Kernel size is fixed at 3 × 3 for all the networks. The settings compared are:
- Uniform: convolutional kernel weights are all ones, no nonlinearity
- Random: random kernel weights, no nonlinearity
- Random + ReLU: random kernel weights, ReLU nonlinearity

Comparing the effect of nonlinearities (ReLU, Tanh, and Sigmoid) on the ERF.
ReLU makes the distribution a bit less Gaussian. ReLU units output exactly zero for about half of their inputs. Thus, it is easy to get a zero output for the center pixel on the output plane

Comparing the effect of subsampling and dilation on the ERF. Both increase the ERF significantly
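
A rough way to see why, sketched for the theoretical receptive field only, which bounds the ERF from above. The symbols below are notation introduced for this note: r_l is the receptive field side length after layer l, j_l the spacing in input pixels between adjacent units of layer l, and k_l, s_l, d_l the kernel size, stride, and dilation of layer l.

```latex
\begin{aligned}
k_l^{\text{eff}} &= d_l\,(k_l - 1) + 1 \\
r_l &= r_{l-1} + \bigl(k_l^{\text{eff}} - 1\bigr)\, j_{l-1} \\
j_l &= j_{l-1}\, s_l, \qquad r_0 = j_0 = 1
\end{aligned}
```

A 3 × 3 kernel with dilation 2 therefore widens the field by 4 j instead of 2 j, and every subsampling step with stride 2 doubles the growth contributed by all later layers.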

Comparison of ERF before and after training for models trained on:
- CIFAR-10 classification
- CamVid semantic segmentation tasks
The ERF grows significantly after training
- In the CIFAR-10 experiment, the TRF is 74x74 (i.e. bigger than the 32x32 input image), yet the ERF still does not cover the whole input image