Receptive Field (Field of View): Theoretical Receptive Field (TRF) and Effective Receptive Field (ERF)
  • the receptive field of a neuron in a convolutional layer is the region of the input image that the neuron is “looking at”, i.e. the input pixels that can affect its prediction or extracted feature
  • its size is determined by the kernel (filter) size, stride, and dilation of every layer up to and including the neuron's own layer
  • two types of receptive fields:
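    • theoretical receptive field (TRF) - the full input region that can affect the neuron; in theory every pixel inside it contributes equally
    • effective receptive field (ERF) - the part of the TRF that actually has a non-negligible effect on the neuron; in CNNs it is usually concentrated around the center of the TRF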

Theoretical Receptive Field (TRF) - Examples

  • Example: the receptive field in a convolutional neural network with two 3x3 convolutional (conv) layers (stride 1)
  • in the output of the 2nd conv layer, every pixel has a 5x5 field of view on the input, a.k.a. receptive field
  • Receptive fields of CNNs vs. Transformers:
    • in CNNs, the receptive field grows incrementally, one layer after another
    • in transformers, the receptive field spans the whole input (all tokens) after a single layer
    • these receptive field estimates are only theoretical: in CNNs, the actual receptive field differs from the theoretical one
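
The 5x5 figure for two 3x3 conv layers can be checked with the usual layer-by-layer recurrence for the theoretical receptive field. The snippet below is a minimal sketch, not taken from the notes: the function name and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for illustration.

```python
def theoretical_receptive_field(layers):
    """Return the TRF side length, in input pixels, of one unit after `layers`.

    layers: list of (kernel_size, stride, dilation) tuples, ordered from the
    input towards the output (hypothetical layout chosen for this sketch).
    """
    rf = 1    # a single input pixel "sees" only itself
    jump = 1  # distance in input pixels between neighbouring units of the current layer
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


two_conv_layers = [(3, 1, 1), (3, 1, 1)]
print(theoretical_receptive_field(two_conv_layers))  # 5
```

Each extra 3x3, stride-1 layer adds 2 pixels to the side length, which is the incremental growth described for CNNs.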

Theoretical Receptive Field (TRF) vs Effective Receptive Field (ERF)

  • In CNNs, the pixels at the center of a receptive field have a much larger impact on the output than the pixels near its border
  • In the forward pass, the center pixels can propagate information to the output through many different paths
  • Therefore, during a backward pass, the center pixels have a much larger gradient magnitude
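
The "many different paths" argument can be made concrete by counting paths. This is a sketch under simplifying assumptions that are not part of the notes: a 1D network, stride 1, all kernel weights equal to one, and no nonlinearity, so the gradient of the output with respect to an input pixel is exactly the number of paths connecting them. `convolve` and `path_counts` are hypothetical helper names.

```python
def convolve(a, b):
    """Full discrete convolution of two lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def path_counts(num_layers, kernel_size=3):
    """Number of paths from each input pixel in the TRF to one output unit."""
    counts = [1]
    for _ in range(num_layers):
        counts = convolve(counts, [1] * kernel_size)
    return counts


print(path_counts(2))  # [1, 2, 3, 2, 1]
```

With two layers the center pixel already has three paths to the output while each border pixel has one; the gap widens as layers are added.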

The ERF is the receptive field of a CNN evaluated empirically, via the gradient dy/dx of an output unit y with respect to the input pixels x.

While the TRF depends only on the architecture, the ERF (dy/dx) also depends on the input, i.e., different inputs generate different ERFs.
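
The input dependence can be seen on a toy example. The sketch below assumes a hypothetical 1D network with two 3-tap conv layers and hand-picked weights, and approximates dy/dx with central finite differences; none of these choices come from the notes. Without a nonlinearity the gradient map is the same for every input, while with ReLU it changes with the input.

```python
def conv1d_valid(x, w):
    """Cross-correlate x with kernel w (no padding, stride 1)."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def tiny_net(x, use_relu):
    """Two 3-tap conv layers on a length-5 input, giving one scalar output y."""
    hidden = conv1d_valid(x, [1.0, -1.0, 1.0])
    if use_relu:
        hidden = [max(0.0, h) for h in hidden]
    return conv1d_valid(hidden, [1.0, 1.0, 1.0])[0]


def input_gradient(x, use_relu, eps=1e-4):
    """Approximate dy/dx with central finite differences."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((tiny_net(up, use_relu) - tiny_net(down, use_relu)) / (2 * eps))
    return [round(g, 3) for g in grads]


input_a = [1.0, 0.0, 1.0, 3.0, 1.0]
input_b = [1.0, 3.0, 1.0, 0.0, 1.0]

linear_a = input_gradient(input_a, use_relu=False)
linear_b = input_gradient(input_b, use_relu=False)
relu_a = input_gradient(input_a, use_relu=True)
relu_b = input_gradient(input_b, use_relu=True)

print(linear_a == linear_b)  # True: without ReLU the gradient map ignores the input
print(relu_a == relu_b)      # False: with ReLU the gradient map depends on the input
```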

The ERF approximately follows a Gaussian distribution and occupies only a fraction of the full TRF.
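
Both properties can be derived in the simplest case. This is a sketch under assumptions introduced here for illustration: n stacked layers, 1D kernels of size k with equal weights, stride 1, and no nonlinearity. The gradient profile is then the n-fold convolution of a uniform kernel, which by the central limit theorem approaches a Gaussian whose variance is n times the variance of one kernel:

```latex
\sigma_{\mathrm{ERF}}^{2} = n \cdot \frac{k^{2} - 1}{12},
\qquad
\mathrm{TRF} = n (k - 1) + 1,
\qquad
\frac{\sigma_{\mathrm{ERF}}}{\mathrm{TRF}} \approx \frac{1}{\sqrt{n}} \sqrt{\frac{k + 1}{12 (k - 1)}}
\quad \text{for large } n .
```

Under these assumptions the ERF width grows like the square root of n while the TRF grows linearly in n, so the fraction of the TRF covered by the ERF shrinks as the network gets deeper.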

  • comparing the effect of:
    • (1) the number of layers on the ERF
    • (2) random weight initialization on the ERF
    • (3) nonlinear activation on the ERF
  • Kernel size is fixed at 3 × 3 for all the networks.
    • Uniform: convolutional kernel weights are all ones, no nonlinearity;
    • Random: random kernel weights, no nonlinearity;
    • Random + ReLU: random kernel weights, ReLU nonlinearity

  • Comparing the effect of nonlinearities (ReLU, Tanh, and Sigmoid) on the ERF.
  • ReLU makes the distribution a bit less Gaussian. ReLU units output exactly zero for half of their inputs, so it is easy to get a zero output for the center pixel on the output plane

  • Comparing the effect of subsampling and dilation on the ERF. Both increase the ERF significantly
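
A rough way to see why both help is to track the spread of the gradient profile layer by layer. The sketch below is an illustration under assumptions not made in the notes: 1D kernels with equal weights, no nonlinearity, and subsampling modelled as a stride in the conv layer. Each layer then adds the variance of a uniform kernel whose taps are spaced `jump * dilation` input pixels apart. The function name and the `(kernel_size, stride, dilation)` layout are hypothetical.

```python
import math


def erf_std(layers):
    """Standard deviation, in input pixels, of the gradient profile.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    Assumes equal kernel weights and no nonlinearity (illustrative only).
    """
    variance = 0.0
    jump = 1  # spacing in input pixels between neighbouring units of the current layer
    for kernel_size, stride, dilation in layers:
        tap_spacing = jump * dilation
        variance += tap_spacing ** 2 * (kernel_size ** 2 - 1) / 12
        jump *= stride
    return math.sqrt(variance)


plain = [(3, 1, 1)] * 4
subsampled = [(3, 1, 1), (3, 2, 1), (3, 1, 1), (3, 2, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]

for name, layers in [("plain", plain), ("subsampled", subsampled), ("dilated", dilated)]:
    print(name, round(erf_std(layers), 2))
```

For these four-layer stacks the plain network has the narrowest profile, and both the subsampled and the dilated variants spread the gradient over a wider input region.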

  • Comparison of ERF before and after training for models trained on:
    • CIFAR-10 classification
    • CamVid semantic segmentation tasks
  • The effective receptive field grows significantly after training.
    • In the CIFAR experiment, the TRF is 74x74, i.e. larger than the 32x32 input image, yet the ERF still does not cover the whole input image

Resources