Notation

Symbol	Meaning
$\odot$	Elementwise multiplication

Recurrent network architectures

LSTM

\begin{aligned} \vec{x}_t &\in \mathbb{R}^{d_x}, &&\text{current input vector, shape }(d_x,),\\[4pt] \vec{h}_{t-1}&\in \mathbb{R}^{d_h}, &&\text{previous hidden-state vector, shape }(d_h,),\\[4pt] \vec{c}_{t-1} &\in \mathbb{R}^{d_h}, &&\text{previous cell-state vector, shape }(d_h,),\\[4pt] \vec{i}_t,\vec{f}_t,\vec{o}_t &\in (0,1)^{d_h}, &&\text{input, forget, and output gate vectors, shape }(d_h,),\\[4pt] \vec{g}_t &\in (-1,1)^{d_h}, &&\text{candidate cell-state vector, shape }(d_h,),\\[4pt] \vec{c}_t,\vec{h}_t &\in \mathbb{R}^{d_h}, &&\text{updated cell and hidden-state vectors, shape }(d_h,). \end{aligned}

Here, $d_x\in\mathbb{N}_{>0}$ is the scalar input dimension, and $d_h\in\mathbb{N}_{>0}$ is the scalar hidden dimension.

Parameters:

\begin{aligned} \mathbf{W}_i,\mathbf{W}_f,\mathbf{W}_g,\mathbf{W}_o &\in \mathbb{R}^{d_h\times d_x}, &&\text{input-weight matrices, shape }(d_h,d_x),\\[4pt] \mathbf{U}_i,\mathbf{U}_f,\mathbf{U}_g,\mathbf{U}_o &\in \mathbb{R}^{d_h\times d_h}, &&\text{recurrent-weight matrices, shape }(d_h,d_h),\\[4pt] \vec{b}_i,\vec{b}_f,\vec{b}_g,\vec{b}_o &\in \mathbb{R}^{d_h}, &&\text{bias vectors, shape }(d_h,),\\[4pt] \vec{p}_i,\vec{p}_f,\vec{p}_o &\in \mathbb{R}^{d_h}, &&\text{optional diagonal peephole-weight vectors, shape }(d_h,). \end{aligned}

Step	Equation	Role
Hidden-state update	$\vec{h}_t = \vec{o}_t\odot\tanh\left(\vec{c}_t\right)$	Produces the next hidden-state vector (and for stacked LSTMs, the input vector for the next layer).
Output gate	$\vec{o}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_o\vec{x}_t} + \textcolor{purple}{\mathbf{U}_o\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_o\odot\vec{c}_t} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_o} \right)$	Controls how much of the updated cell state is exposed.
Cell-state update	$\vec{c}_t = \underbrace{ \textcolor{teal}{\vec{f}_t\odot\vec{c}_{t-1}} }_{\text{retained previous memory}} + \underbrace{ \textcolor{red}{\vec{i}_t\odot\vec{g}_t} }_{\text{newly written memory}}$	Combines retained memory with newly written memory.
Forget gate	$\vec{f}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_f\vec{x}_t} + \textcolor{purple}{\mathbf{U}_f\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_f\odot\vec{c}_{t-1}} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_f} \right)$	Controls how much previous memory is retained.
Input gate	$\vec{i}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_i\vec{x}_t} + \textcolor{purple}{\mathbf{U}_i\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_i\odot\vec{c}_{t-1}} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_i} \right)$	Controls how much candidate memory is written.
Candidate cell state	$\vec{g}_t = \tanh\left( \textcolor{blue}{\mathbf{W}_g\vec{x}_t} + \textcolor{purple}{\mathbf{U}_g\vec{h}_{t-1}} + \textcolor{gray}{\vec{b}_g} \right)$	Proposes new memory content.

Notes:

The input and forget gates inspect $\vec{c}_{t-1}$ because they are computed before the cell-state update. The output gate inspects $\vec{c}_t$, so it must be computed after the new cell state.
The optional peephole connections let the gates condition on the cell state itself, rather than only on the current input ($\vec{x}_t$) and on what the previous step exposed from the cell state ($\vec{h}_{t-1}$).
For stacked LSTMs, let $\ell\in\mathbb{N}_{>0}$ be the scalar layer index. At the same time step $t$, the input vector to layer $\ell+1$ is the hidden-state vector produced by layer $\ell$: $\vec{x}_t^{(\ell+1)}=\vec{h}_t^{(\ell)}$.
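
The equations and notes above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the parameter dictionary, its key names (`W_i`, `U_i`, `b_i`, `p_i`, ...), the helper `init_params`, and the small random initialization are all assumptions made for the example. The computation order follows the notes: input gate, forget gate, and candidate first, then the cell state, then the output gate.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, written via tanh for numerical stability.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def lstm_step(x_t, h_prev, c_prev, p, use_peepholes=False):
    """One LSTM time step. `p` is a hypothetical dict keyed like the symbols above."""
    def peep(name, c):
        # Diagonal peephole term p_* ⊙ c; contributes nothing when disabled.
        return p[name] * c if use_peepholes else 0.0

    # Gates that read the previous cell state c_{t-1}.
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + peep("p_i", c_prev) + p["b_i"])
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + peep("p_f", c_prev) + p["b_f"])
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])

    # Cell-state update: retained memory plus newly written memory.
    c_t = f_t * c_prev + i_t * g_t

    # The output gate reads the new cell state c_t, so it comes last.
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + peep("p_o", c_t) + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(d_x, d_h, rng):
    # Illustrative initialization only (small Gaussian weights, zero biases).
    p = {}
    for gate in "ifgo":
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(d_h, d_x))
        p[f"U_{gate}"] = rng.normal(scale=0.1, size=(d_h, d_h))
        p[f"b_{gate}"] = np.zeros(d_h)
    for gate in "ifo":
        p[f"p_{gate}"] = rng.normal(scale=0.1, size=d_h)
    return p

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
layer1 = init_params(d_x, d_h, rng)
layer2 = init_params(d_h, d_h, rng)  # layer 2 consumes layer 1's hidden state

x_t = rng.normal(size=d_x)
h1, c1 = lstm_step(x_t, np.zeros(d_h), np.zeros(d_h), layer1, use_peepholes=True)
# Stacking: at the same time step, x_t^(2) = h_t^(1).
h2, c2 = lstm_step(h1, np.zeros(d_h), np.zeros(d_h), layer2, use_peepholes=True)
```

With `use_peepholes=False` the peephole vectors are never read, so they can be omitted from the dictionary.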

Activation functions

GELU

The GELU function is formally defined as $\operatorname{GELU}: \mathbb{R} \to \mathbb{R}$,

\operatorname{GELU}(x) = x \Phi(x) = \frac{1}{2}x\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)

where $\Phi$ is the standard normal cumulative distribution function and $\operatorname{erf}$ is the error function. The error function is defined as $\operatorname{erf}: \mathbb{R} \to (-1, 1)$,

\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt
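
As a sanity check on the integral definition, the sketch below approximates it with a midpoint rule and compares the result against Python's built-in `math.erf`. The function name `erf_midpoint` and the step count `n` are assumptions chosen for illustration. In practice one would call `math.erf` directly.

```python
import math

def erf_midpoint(z, n=10_000):
    """Approximate erf(z) = (2/sqrt(pi)) * integral_0^z exp(-t^2) dt by the midpoint rule."""
    h = z / n  # signed step, so negative z yields a negative result (erf is odd)
    total = sum(math.exp(-((k + 0.5) * h) ** 2) for k in range(n))
    return 2.0 / math.sqrt(math.pi) * total * h

for z in (-0.5, 0.0, 1.0, 2.0):
    print(z, erf_midpoint(z), math.erf(z))
```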

The figure above shows the commonly used tanh approximation,

\operatorname{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)
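
To make the relationship between the two formulas concrete, the following sketch evaluates the exact erf-based definition and the tanh approximation on a grid and reports the largest gap. The function names `gelu_exact` and `gelu_tanh` and the grid over $[-6, 6]$ are assumptions for this example.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi written in terms of erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation with the cubic correction term 0.044715 * x^3.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 100.0 for i in range(-600, 601)]
max_abs_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approximation| on [-6, 6]: {max_abs_err:.2e}")
```

The gap is small relative to typical activation magnitudes, which is why the approximation is often used as a drop-in replacement where `erf` is slow or unavailable.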

Activation function tradeoffs

Activation	Pros	Cons	Common use cases
ReLU (rectified linear unit)	Cheap, sparse outputs, reduces vanishing gradients for positive inputs.	Can create dead units when activations stay negative; output is not centered.	Default hidden activation for CNNs and MLPs.
Leaky ReLU (leaky rectified linear unit)	Keeps a small negative slope, reducing dead ReLU behavior.	Negative slope is a fixed hyperparameter; less common in modern transformer blocks.	CNNs, GAN discriminators, and networks where dead ReLUs appear.
PReLU (parametric rectified linear unit)	Learns the negative slope from data, making Leaky ReLU adaptive.	Adds parameters and can overfit in small models or datasets.	Computer vision models, especially deeper CNNs.
Sigmoid	Maps values to $(0, 1)$, making it useful as a probability gate.	Saturates for large positive or negative inputs, causing small gradients; not zero-centered.	Binary output heads and gates in recurrent networks.
tanh (hyperbolic tangent)	Zero-centered output in $(-1, 1)$; stronger gradients near zero than sigmoid.	Saturates for large magnitudes, causing vanishing gradients.	RNN state updates, bounded hidden states, and small MLPs.
ELU (exponential linear unit)	Smooth negative branch and negative outputs help center activations.	Uses exponentials, so it is more expensive than ReLU; includes a fixed scale parameter.	CNNs and MLPs where centered activations improve optimization.
SELU (scaled exponential linear unit)	Designed to preserve mean and variance in self-normalizing networks.	Self-normalization assumes LeCun normal initialization and AlphaDropout in place of standard dropout.	Self-normalizing feedforward networks.
GELU (Gaussian error linear unit)	Smooth probabilistic gating; strong empirical performance in transformer MLP blocks.	More expensive than ReLU and less interpretable; often implemented with an approximation.	Transformers, large language models, and modern vision architectures.
SiLU (sigmoid linear unit)	Smooth, non-monotonic, and often competitive with GELU.	More expensive than ReLU; can be slightly harder to reason about than piecewise activations.	Efficient CNNs, diffusion models, and transformer variants.
Mish	Smooth, non-monotonic, and can improve gradient flow in some CNNs.	More expensive than SiLU and GELU; gains are architecture-dependent.	Experimental CNNs and MLPs where activation choice is tuned.
Softplus	Smooth differentiable approximation to ReLU; always positive.	Does not create exact sparsity and can saturate for large negative values.	Positive-valued parameters, smooth ReLU replacement, and probabilistic models.
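
The table does not state formulas for these activations, so the sketch below writes out the standard definitions in NumPy for reference. The default values (`slope=0.01`, `alpha=1.0`) are common conventions assumed here, and the SELU constants are the usual published values rounded to double precision. PReLU has the same form as `leaky_relu`, with `slope` learned during training rather than fixed.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # PReLU uses the same expression with `slope` as a trainable parameter.
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    # Equivalent to 1 / (1 + exp(-x)), written via tanh to avoid overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def elu(x, alpha=1.0):
    # expm1 is evaluated only on the clipped negative part to avoid overflow.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * elu(x, alpha=SELU_ALPHA)

def silu(x):
    return x * sigmoid(x)

def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)

def mish(x):
    return x * np.tanh(softplus(x))

x = np.linspace(-3.0, 3.0, 7)
for fn in (relu, leaky_relu, sigmoid, elu, selu, silu, softplus, mish):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```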