Notation
| Symbol | Meaning |
|---|---|
| $\odot$ | Elementwise multiplication |
Recurrent network architectures
LSTM
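Here, $d_x$ is the scalar input dimension, and $d_h$ is the scalar hidden dimension. At time step $t$, $x_t \in \mathbb{R}^{d_x}$ is the input vector, $h_t, c_t \in \mathbb{R}^{d_h}$ are the hidden-state and cell-state vectors, and $\sigma$ is the logistic sigmoid applied elementwise.
Parameters: input weights $W_f, W_i, W_o, W_c \in \mathbb{R}^{d_h \times d_x}$, recurrent weights $U_f, U_i, U_o, U_c \in \mathbb{R}^{d_h \times d_h}$, biases $b_f, b_i, b_o, b_c \in \mathbb{R}^{d_h}$, and optional peephole vectors $p_f, p_i, p_o \in \mathbb{R}^{d_h}$.
| Step | Equation | Role |
|---|---|---|
| Hidden-state update | $h_t = o_t \odot \tanh(c_t)$ | Produces the next hidden-state vector (and for stacked LSTMs, the input vector for the next layer). |
| Output gate | $o_t = \sigma(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o)$ | Controls how much of the updated cell state is exposed. |
| Cell-state update | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | Combines retained memory with newly written memory. |
| Forget gate | $f_t = \sigma(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f)$ | Controls how much previous memory is retained. |
| Input gate | $i_t = \sigma(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i)$ | Controls how much candidate memory is written. |
| Candidate cell state | $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ | Proposes new memory content. |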
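The steps in the table above can be sketched as a single time step in plain Python. This is a minimal illustration, assuming the common peephole formulation in which the input and forget gates read the previous cell state and the output gate reads the updated one. The names `lstm_step` and `zero_params` and the parameter keys `W`, `U`, `b`, `p` are illustrative choices, not a library API; vectors are plain lists so the example has no dependencies.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def zero_params(d_x, d_h):
    """Hypothetical helper: all-zero parameters for the gates i, f, o and the candidate c."""
    def block():
        return {
            "W": [[0.0] * d_x for _ in range(d_h)],  # input weights
            "U": [[0.0] * d_h for _ in range(d_h)],  # recurrent weights
            "b": [0.0] * d_h,                        # bias
            "p": [0.0] * d_h,                        # peephole vector (unused by the candidate)
        }
    return {name: block() for name in ("i", "f", "o", "c")}

def lstm_step(x, h_prev, c_prev, params, peephole=True):
    """Compute (h_t, c_t) from x_t, h_{t-1}, c_{t-1}."""
    def pre_activation(name, c_peek=None):
        p = params[name]
        z = [wx + uh + b for wx, uh, b in
             zip(matvec(p["W"], x), matvec(p["U"], h_prev), p["b"])]
        if peephole and c_peek is not None:
            # Peephole term: elementwise product of the peephole vector and a cell state.
            z = [zk + pk * ck for zk, pk, ck in zip(z, p["p"], c_peek)]
        return z

    c_tilde = [math.tanh(z) for z in pre_activation("c")]       # candidate cell state
    i = [sigmoid(z) for z in pre_activation("i", c_prev)]       # input gate sees c_{t-1}
    f = [sigmoid(z) for z in pre_activation("f", c_prev)]       # forget gate sees c_{t-1}
    c = [fk * ck + ik * gk
         for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]      # cell-state update
    o = [sigmoid(z) for z in pre_activation("o", c)]            # output gate sees c_t
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]            # hidden-state update
    return h, c
```

With all-zero parameters every gate evaluates to 0.5 and the candidate to 0, so the step halves the previous cell state. That makes a convenient sanity check before plugging in trained weights.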
Notes:
- The input and forget gates inspect $c_{t-1}$ because they are computed before the cell-state update. The output gate inspects $c_t$, so it must be computed after the new cell state.
- The optional peephole connections allow the gates to use the information stored in the cell state in their decisions, rather than relying only on the current input ($x_t$) and what was previously exposed from the cell state ($h_{t-1}$).
- For stacked LSTMs, let $l$ be the scalar layer index. At the same time step $t$, the input vector to layer $l$ is the hidden-state vector produced by layer $l-1$: $x_t^{(l)} = h_t^{(l-1)}$.
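The stacking rule in the last note can be sketched as a loop over layers at a fixed time step. This is an illustration only: `stacked_step` is a hypothetical helper, and `toy_cell` is a deliberately simplified stand-in for a real LSTM cell (fixed gates of 0.5, no learned weights), used so the wiring between layers is easy to follow.

```python
import math

def stacked_step(x_t, states, cells):
    """Advance every layer by one time step.

    states: list of (h_prev, c_prev) pairs, one per layer, bottom layer first.
    cells:  list of callables (x, h_prev, c_prev) -> (h, c), one per layer.
    Returns the top layer's hidden state and the new per-layer states.
    """
    layer_input = x_t
    new_states = []
    for cell, (h_prev, c_prev) in zip(cells, states):
        h, c = cell(layer_input, h_prev, c_prev)
        new_states.append((h, c))
        # The hidden state of this layer is the input of the next layer
        # at the same time step.
        layer_input = h
    return layer_input, new_states

def toy_cell(x, h_prev, c_prev):
    """Hypothetical stand-in cell: gates fixed at 0.5, candidate = tanh(mean(x)).

    h_prev is accepted only to match the cell signature; it is not used here.
    """
    candidate = math.tanh(sum(x) / len(x))
    c = [0.5 * ck + 0.5 * candidate for ck in c_prev]
    h = [0.5 * math.tanh(ck) for ck in c]
    return h, c

# Two layers with hidden sizes 3 and 2, all states initialized to zero.
states = [([0.0] * 3, [0.0] * 3), ([0.0] * 2, [0.0] * 2)]
top_h, states = stacked_step([2.0, 0.0], states, [toy_cell, toy_cell])
```

Each layer keeps its own hidden and cell state across time steps; only the hidden state is passed upward, so the cell state of one layer is never seen directly by the next.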
Activation functions
GELU
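The GELU function is formally defined as $\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$,
where $\Phi$ is the standard normal cumulative distribution function, and $\mathrm{erf}$ is the error function. The $\mathrm{erf}$ function is defined as $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$.
The figure above shows the commonly used tanh approximation, $\mathrm{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right]$.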
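To compare the exact definition with the tanh approximation numerically, both can be written with Python's standard `math` module. The function names `gelu_exact` and `gelu_tanh` are illustrative; the constant 0.044715 is the one used in the widely cited tanh form.

```python
import math

def gelu_exact(x):
    """GELU via the error function: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute gap between the two on a grid over [-5, 5].
grid = [k / 10.0 for k in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
```

On this grid the two curves differ only slightly, which is why the approximation is often used where `erf` is slow or unavailable.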
Activation function tradeoffs
| Activation | Pros | Cons | Common use cases |
|---|---|---|---|
| ReLU (rectified linear unit) | Cheap, sparse outputs, reduces vanishing gradients for positive inputs. | Can create dead units when activations stay negative; output is not centered. | Default hidden activation for CNNs and MLPs. |
| Leaky ReLU (leaky rectified linear unit) | Keeps a small negative slope, reducing dead ReLU behavior. | Negative slope is a fixed hyperparameter; less common in modern transformer blocks. | CNNs, GAN discriminators, and networks where dead ReLUs appear. |
| PReLU (parametric rectified linear unit) | Learns the negative slope from data, making Leaky ReLU adaptive. | Adds parameters and can overfit in small models or datasets. | Computer vision models, especially deeper CNNs. |
| Sigmoid | Maps values to $(0, 1)$, making it useful as a probability gate. | Saturates for large positive or negative inputs, causing small gradients; not zero-centered. | Binary output heads and gates in recurrent networks. |
| tanh (hyperbolic tangent) | Zero-centered output in $(-1, 1)$; stronger gradients near zero than sigmoid. | Saturates for large magnitudes, causing vanishing gradients. | RNN state updates, bounded hidden states, and small MLPs. |
| ELU (exponential linear unit) | Smooth negative branch and negative outputs help center activations. | Uses exponentials, so it is more expensive than ReLU; includes a fixed scale parameter. | CNNs and MLPs where centered activations improve optimization. |
| SELU (scaled exponential linear unit) | Designed to preserve mean and variance in self-normalizing networks. | Requires LeCun normal initialization and AlphaDropout-like regularization assumptions. | Self-normalizing feedforward networks. |
| GELU (Gaussian error linear unit) | Smooth probabilistic gating; strong empirical performance in transformer MLP blocks. | More expensive than ReLU and less interpretable; often implemented with an approximation. | Transformers, large language models, and modern vision architectures. |
| SiLU (sigmoid linear unit) | Smooth, non-monotonic, and often competitive with GELU. | More expensive than ReLU; can be slightly harder to reason about than piecewise activations. | Efficient CNNs, diffusion models, and transformer variants. |
| Mish | Smooth, non-monotonic, and can improve gradient flow in some CNNs. | More expensive than SiLU and GELU; gains are architecture-dependent. | Experimental CNNs and MLPs where activation choice is tuned. |
| Softplus | Smooth differentiable approximation to ReLU; always positive. | Does not create exact sparsity and can saturate for large negative values. | Positive-valued parameters, smooth ReLU replacement, and probabilistic models. |
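For reference, several of the activations in the table can be written as scalar functions using only the standard library. This is a sketch under the usual textbook definitions; the default slope of 0.01 for Leaky ReLU and the default `alpha=1.0` for ELU are common choices assumed here, and the SELU constants are the standard self-normalizing values. PReLU has the same form as `leaky_relu`, with the slope learned rather than fixed.

```python
import math

# Standard SELU constants for self-normalizing networks.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # PReLU uses the same formula with a learned slope.
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # expm1(x) = exp(x) - 1, accurate for small |x|.
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * math.expm1(x))

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    # Stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def silu(x):
    return x * sigmoid(x)

def mish(x):
    return x * math.tanh(softplus(x))
```

Evaluating `silu` or `mish` at a few negative inputs shows the non-monotonic dip mentioned in the table: both return small negative values there, whereas `relu` returns exactly zero.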