Rectified Linear Unit (ReLU)

ReLU Visual

ReLU Problem

The downside of outputting zero for all negative inputs is a problem called “dying ReLU.”

A ReLU neuron is “dead” if it is stuck on the negative side and always outputs 0. Because the slope of ReLU in the negative range is also 0, no gradient flows back through the neuron, so once its pre-activation is negative for every input it is unlikely to recover. Such neurons play no role in discriminating the input and are essentially useless. Over time, a large part of the network may end up doing nothing.
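The mechanism can be illustrated with a single neuron. The sketch below is a minimal, hypothetical setup (the weight, bias, inputs, and the upstream gradient of 1 are all made-up values): a large negative bias keeps the pre-activation below zero for every input, so both the outputs and the gradients with respect to the weight and bias are exactly zero, and a gradient-descent step would leave the neuron unchanged.

```python
def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive z, 0 otherwise (0 is chosen at z == 0).
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron with one weight and one bias; values are illustrative only.
w, b = 0.5, -10.0            # the large negative bias keeps z below zero
inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
upstream = 1.0               # pretend dLoss/dOutput is 1 for every sample

grad_w = 0.0
grad_b = 0.0
for x in inputs:
    z = w * x + b            # pre-activation, negative for all inputs here
    local = relu_grad(z)     # 0 whenever z is not positive
    grad_w += upstream * local * x
    grad_b += upstream * local

outputs = [relu(w * x + b) for x in inputs]
print(outputs, grad_w, grad_b)
```

Since every gradient term is multiplied by the local slope, no choice of loss can move `w` or `b` while all pre-activations stay negative.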

ReLU Variants

Variant

Description

Graph

Rectified Linear Unit (ReLU)

𝑓(𝑧) = 𝑚𝑎𝑥(𝑧, 0)

  • ✔ simple
  • ✔ can wipe the negative signal out
  • ✘ zero gradient for negative inputs, thus can be fragile during training and “die”

Leaky ReLU

𝑓(𝑧) = 𝑚𝑎𝑥(𝑧, 𝛼𝑧), with a small fixed slope 0 < 𝛼 < 1 (for example 0.01)

  • ✔ non-zero gradient for negative inputs
  • ✘ slope 𝛼 needs to be hand-tuned
  • ✘ cannot wipe the negative signal out
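A minimal sketch of Leaky ReLU and its derivative, assuming 0 < 𝛼 < 1 so that 𝑚𝑎𝑥(𝑧, 𝛼𝑧) selects 𝑧 for positive inputs and 𝛼𝑧 for negative ones. The default of 0.01 is only an illustrative choice.

```python
def leaky_relu(z, alpha=0.01):
    # With 0 < alpha < 1, max picks z when z >= 0 and alpha * z when z < 0.
    return max(z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # The slope on the negative side is alpha instead of 0,
    # so some gradient always flows back.
    return 1.0 if z > 0 else alpha

print(leaky_relu(3.0), leaky_relu(-2.0, alpha=0.1), leaky_relu_grad(-2.0, alpha=0.1))
```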

Parametric ReLU (PReLU)

𝑓(𝑧) = 𝑚𝑎𝑥(𝑧, 𝛼𝑧)

Is Leaky ReLU where 𝛼 is learned during training.

  • ✔ non-zero gradient for negative inputs
  • ✘ cannot wipe the negative signal out
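To make "𝛼 is learned" concrete, the sketch below treats 𝛼 as a parameter and applies one plain gradient-descent step to it. The starting value, learning rate, input, and upstream gradient are hypothetical. The function is written piecewise rather than with 𝑚𝑎𝑥, because the two forms only agree while 𝛼 stays below 1, which a learned 𝛼 is not guaranteed to do.

```python
def prelu(z, alpha):
    # Piecewise form: identity for positive z, slope alpha for the rest.
    return z if z > 0 else alpha * z

def prelu_grad_alpha(z):
    # Derivative of the output with respect to alpha:
    # 0 on the positive side, z on the negative side.
    return 0.0 if z > 0 else z

alpha = 0.25      # illustrative starting value
lr = 0.1          # illustrative learning rate
z = -2.0          # a negative pre-activation
upstream = 1.0    # pretend dLoss/dOutput is 1

# One gradient-descent update on alpha.
alpha_new = alpha - lr * upstream * prelu_grad_alpha(z)
print(alpha_new)
```

Only negative inputs contribute to the update of 𝛼, since the positive side does not depend on it.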

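Scaled Exponential Linear Unit (SELU)

𝑓(𝑧) = 𝜆𝑧 if 𝑧 > 0, else 𝜆𝛼(𝑒𝑥𝑝(𝑧) − 1)

  • ✔ non-zero gradient for negative inputs
  • ✔ 𝛼 ≈ 1.6733 and 𝜆 ≈ 1.0507 are fixed constants, so nothing needs hand-tuning
  • ✘ exponential is computationally expensive

Exponential Linear Unit (ELU)

𝑓(𝑧) = 𝑧 if 𝑧 > 0, else 𝛼(𝑒𝑥𝑝(𝑧) − 1)

Is SELU without the scale factor (𝜆 = 1) and with a free 𝛼.

  • ✔ non-zero gradient for negative inputs
  • ✘ 𝛼 needs to be hand-tuned (𝛼 = 1 is a common choice)
  • ✘ exponential is computationally expensive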
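A short sketch of both functions, assuming the standard definitions: ELU is 𝑧 for positive inputs and 𝛼(𝑒𝑥𝑝(𝑧) − 1) otherwise, and SELU is ELU multiplied by a scale 𝜆 with the fixed constants from the SELU paper. The constant names are my own.

```python
import math

# Fixed SELU constants (rounded to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def elu(z, alpha=1.0):
    # Identity for positive z; saturates smoothly towards -alpha for very negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    # ELU with the fixed alpha, scaled by the fixed lambda.
    return SELU_LAMBDA * elu(z, SELU_ALPHA)

print(elu(-1.0), selu(1.0), selu(-1.0))
```

For very negative inputs ELU approaches −𝛼 and SELU approaches −𝜆𝛼, so both keep a bounded negative signal instead of zeroing it out.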

Gaussian Error Linear Unit (GELU)

𝑓(𝑧) = 𝑧 ⨯ 𝛷(𝑧)

where 𝛷(𝑧) is the CDF of the standard Gaussian

  • ✔ non-zero gradient for negative inputs
  • ✘ is computationally expensive
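The formula can be evaluated exactly with the error function, since 𝛷(𝑧) = ½(1 + erf(𝑧 / √2)). The sketch below uses only Python's `math` module; the helper name `std_normal_cdf` is my own.

```python
import math

def std_normal_cdf(z):
    # CDF of the standard Gaussian, written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z):
    # Weights the input by the probability mass of the Gaussian below z.
    return z * std_normal_cdf(z)

print(gelu(-1.0), gelu(0.0), gelu(1.0))
```

Unlike ReLU, small negative inputs yield a small negative output rather than zero, while large negative inputs are still pushed towards zero.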

Concatenated ReLU (CReLU)

𝑓(𝑧) = [𝑚𝑎𝑥(𝑧, 0), 𝑚𝑎𝑥(−𝑧, 0)]

  • has two outputs concatenated together (this DOUBLES the output dimension):
    • one normal ReLU
    • one ReLU applied to the negated input
  • In other words:
    • for positive input 𝑧 it outputs the two values {𝑧, 0}
    • for negative input 𝑧 it outputs the two values {0, −𝑧}
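A minimal sketch on a plain list of activations, assuming the common definition in which the second half is a ReLU of the negated input, so it carries the magnitude of each negative value. The function name and the list-based layout are illustrative.

```python
def crelu(values):
    # First half: ordinary ReLU of each value.
    positive_part = [max(v, 0.0) for v in values]
    # Second half: ReLU of the negated value, i.e. the magnitude of negative inputs.
    negative_part = [max(-v, 0.0) for v in values]
    # Concatenation doubles the output dimension.
    return positive_part + negative_part

features = crelu([2.0, -3.0])
print(features)
```

The output has twice as many entries as the input, and every entry is non-negative.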

ReLU-6

𝑓(𝑧) = 𝑚𝑎𝑥(𝑚𝑖𝑛(𝑧, 6), 0)

  • is ReLU capped at 6
  • according to the authors, the upper bound encouraged their model to learn sparse features earlier
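The clamp is a one-liner; a small sketch of the formula above:

```python
def relu6(z):
    # Clamp from above at 6 first, then from below at 0.
    return max(min(z, 6.0), 0.0)

print(relu6(-1.0), relu6(3.0), relu6(10.0))
```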