Deep learning quick reference

Notation

| Symbol | Meaning |
|---|---|
| $\odot$ | Elementwise multiplication |

Recurrent network architectures

LSTM

$$
\begin{aligned}
\vec{x}_t &\in \mathbb{R}^{d_x}, &&\text{current input vector, shape }(d_x,),\\[4pt]
\vec{h}_{t-1} &\in \mathbb{R}^{d_h}, &&\text{previous hidden-state vector, shape }(d_h,),\\[4pt]
\vec{c}_{t-1} &\in \mathbb{R}^{d_h}, &&\text{previous cell-state vector, shape }(d_h,),\\[4pt]
\vec{i}_t,\vec{f}_t,\vec{o}_t &\in (0,1)^{d_h}, &&\text{input, forget, and output gate vectors, shape }(d_h,),\\[4pt]
\vec{g}_t &\in (-1,1)^{d_h}, &&\text{candidate cell-state vector, shape }(d_h,),\\[4pt]
\vec{c}_t,\vec{h}_t &\in \mathbb{R}^{d_h}, &&\text{updated cell and hidden-state vectors, shape }(d_h,).
\end{aligned}
$$

Here, $d_x\in\mathbb{N}_{>0}$ is the scalar input dimension, and $d_h\in\mathbb{N}_{>0}$ is the scalar hidden dimension.

Parameters:

$$
\begin{aligned}
\mathbf{W}_i,\mathbf{W}_f,\mathbf{W}_g,\mathbf{W}_o &\in \mathbb{R}^{d_h\times d_x}, &&\text{input-weight matrices, shape }(d_h,d_x),\\[4pt]
\mathbf{U}_i,\mathbf{U}_f,\mathbf{U}_g,\mathbf{U}_o &\in \mathbb{R}^{d_h\times d_h}, &&\text{recurrent-weight matrices, shape }(d_h,d_h),\\[4pt]
\vec{b}_i,\vec{b}_f,\vec{b}_g,\vec{b}_o &\in \mathbb{R}^{d_h}, &&\text{bias vectors, shape }(d_h,),\\[4pt]
\vec{p}_i,\vec{p}_f,\vec{p}_o &\in \mathbb{R}^{d_h}, &&\text{optional diagonal peephole-weight vectors, shape }(d_h,).
\end{aligned}
$$
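
The shapes listed above fix the number of trainable scalars in one LSTM layer. The following is a minimal sketch that adds them up; the function name `lstm_param_count` and its `peephole` flag are hypothetical, introduced only for illustration. It assumes exactly one bias vector per gate, as in the parameter list above; library implementations may store biases differently and report a different total.

```python
def lstm_param_count(d_x, d_h, peephole=False):
    """Count scalars in W, U, b (and optionally p) for one LSTM layer."""
    # One gate (i, f, g, or o) owns W: (d_h, d_x), U: (d_h, d_h), b: (d_h,).
    per_gate = d_h * d_x + d_h * d_h + d_h
    total = 4 * per_gate
    if peephole:
        # Peephole vectors exist only for the i, f, and o gates.
        total += 3 * d_h
    return total


print(lstm_param_count(d_x=3, d_h=4))                 # 4 * (12 + 16 + 4) = 128
print(lstm_param_count(d_x=3, d_h=4, peephole=True))  # 128 + 12 = 140
```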

| Step | Equation | Role |
|---|---|---|
| Hidden-state update | $\vec{h}_t = \vec{o}_t\odot\tanh\left(\vec{c}_t\right)$ | Produces the next hidden-state vector (and for stacked LSTMs, the input vector for the next layer). |
| Output gate | $\vec{o}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_o\vec{x}_t} + \textcolor{purple}{\mathbf{U}_o\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_o\odot\vec{c}_t} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_o} \right)$ | Controls how much of the updated cell state is exposed. |
| Cell-state update | $\vec{c}_t = \underbrace{ \textcolor{teal}{\vec{f}_t\odot\vec{c}_{t-1}} }_{\text{retained previous memory}} + \underbrace{ \textcolor{red}{\vec{i}_t\odot\vec{g}_t} }_{\text{newly written memory}}$ | Combines retained memory with newly written memory. |
| Forget gate | $\vec{f}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_f\vec{x}_t} + \textcolor{purple}{\mathbf{U}_f\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_f\odot\vec{c}_{t-1}} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_f} \right)$ | Controls how much previous memory is retained. |
| Input gate | $\vec{i}_t = \sigma\left( \textcolor{blue}{\mathbf{W}_i\vec{x}_t} + \textcolor{purple}{\mathbf{U}_i\vec{h}_{t-1}} + \underbrace{ \textcolor{orange}{\vec{p}_i\odot\vec{c}_{t-1}} }_{\text{optional peephole}} + \textcolor{gray}{\vec{b}_i} \right)$ | Controls how much candidate memory is written. |
| Candidate cell state | $\vec{g}_t = \tanh\left( \textcolor{blue}{\mathbf{W}_g\vec{x}_t} + \textcolor{purple}{\mathbf{U}_g\vec{h}_{t-1}} + \textcolor{gray}{\vec{b}_g} \right)$ | Proposes new memory content. |

Notes:

  • The input and forget gates inspect $\vec{c}_{t-1}$ because they are computed before the cell-state update. The output gate inspects $\vec{c}_t$, so it must be computed after the cell-state update.
  • The optional peephole connections let the gates use the information stored in the cell state in their decisions, rather than relying only on the current input ($\vec{x}_t$) and what was previously exposed from the cell state ($\vec{h}_{t-1}$).
  • For stacked LSTMs, let $\ell\in\mathbb{N}_{>0}$ be the scalar layer index. At the same time step $t$, the input vector to layer $\ell+1$ is the hidden-state vector produced by layer $\ell$: $\vec{x}_t^{(\ell+1)}=\vec{h}_t^{(\ell)}$.
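
The computation order described in the notes can be sketched in plain Python. This is an illustrative single-step implementation under stated assumptions, not a reference implementation: the helper names (`sigmoid`, `matvec`, `gate_preact`, `lstm_step`, `init_params`), the nested-dict parameter layout, and the uniform random initialization are all hypothetical choices made for this example. Vectors are lists of floats and matrices are lists of rows.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function for a scalar."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gate_preact(W, U, b, x, h_prev, p=None, c=None):
    """W x + U h_prev + b, plus the optional peephole term p * c (elementwise)."""
    z = [wx + uh + bk for wx, uh, bk in zip(matvec(W, x), matvec(U, h_prev), b)]
    if p is not None:
        z = [zk + pk * ck for zk, pk, ck in zip(z, p, c)]
    return z


def lstm_step(x, h_prev, c_prev, params, peephole=False):
    """One LSTM time step; returns the new (h, c)."""
    W, U, b = params["W"], params["U"], params["b"]
    p = params["p"] if peephole else {"i": None, "f": None, "o": None}
    # Input and forget gates can only look at c_{t-1}: c_t does not exist yet.
    i = [sigmoid(z) for z in gate_preact(W["i"], U["i"], b["i"], x, h_prev, p["i"], c_prev)]
    f = [sigmoid(z) for z in gate_preact(W["f"], U["f"], b["f"], x, h_prev, p["f"], c_prev)]
    g = [math.tanh(z) for z in gate_preact(W["g"], U["g"], b["g"], x, h_prev)]
    # Cell-state update: retained memory plus newly written memory.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output gate is computed last because its peephole reads the new c_t.
    o = [sigmoid(z) for z in gate_preact(W["o"], U["o"], b["o"], x, h_prev, p["o"], c)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def init_params(d_x, d_h, seed=0):
    """Random parameters with the shapes listed above (initialization is arbitrary)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    def vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return {
        "W": {k: mat(d_h, d_x) for k in "ifgo"},
        "U": {k: mat(d_h, d_h) for k in "ifgo"},
        "b": {k: vec(d_h) for k in "ifgo"},
        "p": {k: vec(d_h) for k in "ifo"},
    }


# Run two time steps with d_x = 3 and d_h = 2, starting from zero state.
params = init_params(d_x=3, d_h=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h, c = lstm_step(x, h, c, params, peephole=True)
print(h, c)
```

For a stacked LSTM, the same function could be called once per layer at each time step, passing the `h` returned by layer $\ell$ as the `x` argument of layer $\ell+1$, with separate parameters and state per layer.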

Activation functions

GELU

The GELU function is formally defined as $\operatorname{GELU}: \mathbb{R} \to \mathbb{R}$,

$$
\operatorname{GELU}(x) = x \Phi(x) = \frac{1}{2}x\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
$$

where $\Phi$ is the standard normal cumulative distribution function, and $\operatorname{erf}$ is the error function. The error function is defined as $\operatorname{erf}: \mathbb{R} \to (-1, 1)$,

$$
\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt
$$

The figure above shows the commonly used tanh approximation,

$$
\operatorname{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)
$$
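
To make the relationship between the two formulas concrete, here is a small sketch that evaluates both with Python's standard `math` module and measures their largest gap on a grid. The function names `gelu_exact` and `gelu_tanh` and the grid over $[-6, 6]$ are assumptions chosen for illustration.

```python
import math


def gelu_exact(x):
    """GELU via the error function: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap on an illustrative grid (the grid choice is arbitrary).
grid = [k / 100.0 for k in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on grid: {max_gap:.2e}")
```

Because `1 + erf(...)` loses precision for large negative arguments, an implementation that needs accurate tails might compute $\Phi(x)$ as `0.5 * math.erfc(-x / math.sqrt(2.0))` instead.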

Activation function tradeoffs

| Activation | Pros | Cons | Common use cases |
|---|---|---|---|
| ReLU (rectified linear unit) | Cheap, sparse outputs, reduces vanishing gradients for positive inputs. | Can create dead units when activations stay negative; output is not centered. | Default hidden activation for CNNs and MLPs. |
| Leaky ReLU (leaky rectified linear unit) | Keeps a small negative slope, reducing dead ReLU behavior. | Negative slope is a fixed hyperparameter; less common in modern transformer blocks. | CNNs, GAN discriminators, and networks where dead ReLUs appear. |
| PReLU (parametric rectified linear unit) | Learns the negative slope from data, making Leaky ReLU adaptive. | Adds parameters and can overfit in small models or datasets. | Computer vision models, especially deeper CNNs. |
| Sigmoid | Maps values to $(0, 1)$, making it useful as a probability gate. | Saturates for large positive or negative inputs, causing small gradients; not zero-centered. | Binary output heads and gates in recurrent networks. |
| tanh (hyperbolic tangent) | Zero-centered output in $(-1, 1)$; stronger gradients near zero than sigmoid. | Saturates for large magnitudes, causing vanishing gradients. | RNN state updates, bounded hidden states, and small MLPs. |
| ELU (exponential linear unit) | Smooth negative branch and negative outputs help center activations. | Uses exponentials, so it is more expensive than ReLU; includes a fixed scale parameter. | CNNs and MLPs where centered activations improve optimization. |
| SELU (scaled exponential linear unit) | Designed to preserve mean and variance in self-normalizing networks. | Requires LeCun normal initialization and AlphaDropout-like regularization assumptions. | Self-normalizing feedforward networks. |
| GELU (Gaussian error linear unit) | Smooth probabilistic gating; strong empirical performance in transformer MLP blocks. | More expensive than ReLU and less interpretable; often implemented with an approximation. | Transformers, large language models, and modern vision architectures. |
| SiLU (sigmoid linear unit) | Smooth, non-monotonic, and often competitive with GELU. | More expensive than ReLU; can be slightly harder to reason about than piecewise activations. | Efficient CNNs, diffusion models, and transformer variants. |
| Mish | Smooth, non-monotonic, and can improve gradient flow in some CNNs. | More expensive than SiLU and GELU; gains are architecture-dependent. | Experimental CNNs and MLPs where activation choice is tuned. |
| Softplus | Smooth differentiable approximation to ReLU; always positive. | Does not create exact sparsity and can saturate for large negative values. | Positive-valued parameters, smooth ReLU replacement, and probabilistic models. |
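
Several of the activations in the table can be written as short scalar functions, which makes the tradeoffs easier to inspect. The sketch below uses only Python's standard `math` module. The default values `negative_slope=0.01` and `alpha=1.0` are assumed common choices rather than values given in this reference, and the overflow-safe forms of `sigmoid` and `softplus` are one possible implementation.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    # Assumed default slope; it is a hyperparameter.
    return x if x >= 0 else negative_slope * x


def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def elu(x, alpha=1.0):
    # expm1 keeps precision for small |x| on the negative branch.
    return x if x > 0 else alpha * math.expm1(x)


def softplus(x):
    # log(1 + e^x) rewritten as max(x, 0) + log1p(e^{-|x|}) to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def silu(x):
    return x * sigmoid(x)


def mish(x):
    return x * math.tanh(softplus(x))


for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu),
                 ("softplus", softplus), ("silu", silu), ("mish", mish)]:
    print(name, [round(fn(x), 4) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)])
```

Comparing the values at $-5$ and $-1$ shows the non-monotonic dip of SiLU and Mish on the negative side, while ReLU is exactly zero there and Softplus stays strictly positive.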