A dive into the theory behind GLMs. This post covers the GLM model, canonical and non-canonical link functions, optimization of the log-likelihood, and inference.

This post is meant to be a reasonably self-contained exposition of GLM theory. I explore various topics, including canonical and non-canonical link functions, optimization of the log likelihood, and inference.

This post originally was written in 2020. Motivated by our recent work on robust association testing via resampling score statistics, I added a new section about GLM score tests.

Let \(\{P_\theta\}, \theta \in \Theta \subset \mathbb{R}\) be a family of distributions. Recall that \(\{P_\theta\}\) belongs to the 1-parameter exponential family if its density can be written as \[f(y|\theta) = e^{ \theta T(y) - \psi(\theta)}h(y),\] where \(\theta\) is called the canonical parameter. In this post we will assume the data \(\{y_1, \dots, y_n\}\) come from a slightly restricted version of the exponential family. Specifically, we will assume the \(y_i\)s have density \[f(y_i | \theta_i) = e^{\theta_i y_i - \psi(\theta_i)}h(y_i),\] i.e., we will assume the sufficient statistic \(T(y_i)\) is the identity. This slightly narrower class encompasses the most important distributions for statistical modeling, including the binomial, exponential, Poisson, Bernoulli, negative binomial (assuming fixed number of failures), and normal (assuming unit variance) distributions. Note that the function \(h\) is not a function of the unknown parameter \(\theta_i\) and thus will show up as a constant in the log-likelihood. The only function that contains information (relevant to inference) about the distribution is the cumulant generating function \(\psi\). Thus, we can identify exponential families (with identity sufficient statistic) with their corresponding canonical-form cumulant generating function.

We observe independent (but not necessarily identically-distributed) tuples \(\{(y_1, x_1), \dots, (y_n, x_n)\},\) where \(y_i \in \mathbb{R}\) and \(x_i \in \mathbb{R}^p\). Assume the \(y_i\)s are drawn from a one-parameter exponential family with identity sufficient statistic and possibly different canonical parameters: \[ y_i \sim f(y_i|\theta_i) = e^{ \theta_i y_i - \psi(\theta_i)}h(y).\] Assume the \(x_i\)s are fixed and known. Recall that in exponential families, there exists a bijective map between the natural parameter and the mean. Thus, the observations \(\{y_1, \dots, y_n\}\) have means \(\{ \mu_1, \dots, \mu_n \}.\)

We must link the responses \(y_i\)
to the covariates \(x_i\). Assume there
exists a strictly increasing and differentiable function \(g: \mathbb{R} \to \mathbb{R}\) and an
unknown vector \(\beta \in
\mathbb{R}^p\) such that \[ g(\mu_i) =
\langle x_i, \beta \rangle \] for all \(i \in \{ 1, \dots, n \}\). The function
\(g\) is called the **link
function**, and the function \(g^{-1}\) is called the **inverse link
function**. We sometimes use the symbol \(\eta_i\) to denote the linear component of
the model, i.e. \(\eta_i = \langle x_i, \beta
\rangle\). With these various pieces in place, we can frame our
statistical problem as follows: given data \(\{(y_i, x_i)\}_{i=1}^n\), a link function
\(g\), and an exponential family
density \(f\), estimate the unknown
parameter \(\beta\). We will estimate
\(\beta\) and obtain associated
standard errors through MLE.

For a given exponential family, there exists a special link function
called the **canonical link function** that imbues the GLM
with very nice mathematical properties. Use of the canonical link is
entirely optional; however, when given the option, people generally
choose to use the canonical link given its nice properties.

The canonical link function is the link function that results from setting the linear component of the model \(\langle x_i, \beta_i \rangle\) equal to the natural parameter (\(\theta_i\)). In math, the canonical link function is that function \(g_c\) that satisfies \[g_c(\mu_i) = \langle x_i, \beta \rangle = \theta_i.\] We can express the canonical link function in terms of known quantities. Recall that for an exponential family in canonical form (with identity sufficient statistic), we have \(\psi'(\theta_i) = \mathbb{E}[Y_i] = \mu_i.\) Therefore, by the definition of \(g_c\) above, we obtain \(g_c = [\psi']^{-1}.\) That is, the canonical link function is equal to the inverse of the derivative of \(\psi\).

The main advantage of using the canonical link function is that the canonical link renders the canonical parameter of the joint distribution \(y = (y_1, \dots, y_n)^T\) equal to \(\beta\).

**Theorem**: *Suppose we model the mean* \(\mu_i\) of \(y_i\) using the canonical link function,
i.e. \(g_c(\mu_i) = \langle x_i,\beta
\rangle\). Then the joint density of \(y = (y_1, \dots, y_n)^T\) is a p-parameter
exponential family with canonical parameter \(\beta\).

**Proof sketch**: Define \(\theta = X \beta\), where \(\theta = [\theta_1, \dots, \theta_n]^T \in
\mathbb{R}^n\) and \(X\) is the
\(n \times p\) design matrix of
observations. Let \([S_1(y), \dots,
S_p(y)]^T\) denote the \(p\)-dimensional product of the
matrix-vector multiplication \(X^T y\).
Observe that \[ \sum_{i=1}^n \theta_i y_i =
\sum_{i=1}^n x_i^T \beta y_i = \beta^T X^T y = \langle \beta, [S_1(y),
\dots, S_p(y)]^T \rangle.\] Next, define \(\phi: \mathbb{R}^p \to \mathbb{R}\) by
\[\phi(\beta) = \sum_{i=1}^n \psi(x_i^T
\beta) = \sum_{i=1}^n \psi(\theta_i).\] Finally, define \[H(y) = \prod_{i=1}^n h(y_i).\] We can
express the joint density \(f_c\) of
\(y = (y_1, \dots, y_n)^T\) as \[f_c(y|\beta) = e^{ \sum_{i=1}^n \theta_i y_i -
\psi(\theta_i)}\prod_{i=1}^n h(y_i) = \left( e^{\langle \beta, [S_1(y),
\dots, S_p(y)]^T \rangle - \phi(\beta)}\right) H(y).\] \(\square\)

We see that \(f_c\) is a \(p\)-dimensional exponential family density in canonical form. We have that

- \(\beta\) is the \(p\)-dimensional canonical parameter
- \([S_1(y), \dots, S_p(y)]^T\) is the p-dimensional sufficient statistic
- \(\phi\) is the cumulant generating function
- \(H\) is the carrier density.

The density \(f_c(y|\beta)\) inherits all the nice properties of a \(p\)-dimensional exponential family density in canonical form, including convexity of the log-likelihood and equality of the observed and Fisher information matrices:

**Convexity**. The log-likelihood for \(\beta\) is a concave function defined over a convex set. Typically, the log-likelihood for \(\beta\) is strictly concave (although this depends on \(\phi\)). Thus, the MLE for \(\beta\) exists, (typically) is unique, and is easy to compute.**Equality of observed and Fisher information matrices**. The observed and Fisher information matrices for \(\beta\) coincide. This fact simplifies some aspects of GLM fitting and inference, as we will see.

We can use a link function that is non-canonical. Consider the same
setup as before (i.e., the problem setup as presented in the *GLM
model* section.) Given an exponential family distribution, a link
function, and set of data, our goal is to estimate the unknown parameter
\(\beta\) of the GLM through MLE.

We begin by introducing some notation. Let \(h : \mathbb{R} \to \mathbb{R}\) be the function that maps the linear predictor into the canonical parameter, i.e. \[h( \langle x_i, \beta \rangle ) = \theta_i.\] The function \(h\) is simply the identity function when we use the canonical link. In general, we can express \(h\) in terms of \(g\) and \(\psi\): \[ h( \langle x_i, \beta \rangle) = \theta_i = [\psi']^{-1}(\mu_i) = [\psi']^{-1}(g^{-1}(\langle x_i, \beta \rangle)).\] Thus, \(h = [\psi']^{-1} \circ g^{-1}.\)

At this point we have defined a lot of functions. There are three functions in particular that are relevant to our analysis: the cumulant generating function \(\psi\), the link function \(g\), and the above-defined function \(h\). We review the definitions and properties of these functions in the list below.

- Function \(\psi\)

- Definition: the cumulant generating function of the exponential family in canonical form.
- Properties: \(\psi'(\theta_i) = \mu_i\) and \(\psi''(\theta_i) = \sigma_i^2\), where \(\sigma_i^2 = \mathbb{V}(y_i)\).

- Function \(g\)

- Definition: the function that maps the mean to the linear component, i.e. \(g(\mu_i) = \langle x_i, \beta \rangle\).
- Properties: The canonical link function \(g_c\) satisfies \(g_c = [\psi']^{-1}\).

- Function \(h\)

- Definition: the function that maps the linear component to the canonical parameter, i.e. \(h(\langle x_i, \beta \rangle) = \theta_i\).
- Properties: \(h\) can be expressed in terms of \(\psi\) and \(g\) as \(h = [\psi']^{-1} \circ g^{-1}\). When \(g\) is the canonical link, \(h\) is the identity.

With these definitions in hand, we can express the log-likelihood of the model (up to a constant) as follows: \[\mathcal{L}(\beta;y,X) = \sum_{i=1}^n \theta_i y_i - \psi(\theta_i) = \sum_{i=1}^n y_i \cdot h(\langle x_i, \beta \rangle) - \psi( h(\langle x_i, \beta \rangle)).\] For optimization and inference purposes, we need to compute the gradient and Hessian of the log-likelihood. Recall that the gradient of the log-likelihood is the score statistic, the Hessian of the log-likelihood is the negative observed information matrix, and the expected Hessian of the log-likelihood is the negative Fisher information matrix. We need to define some matrices and vectors to express these quantities compactly. Yes, this is painful, but it is necessary. Define the matrices

\[ \begin{cases} \Delta = \textrm{diag}\left\{ h'( \langle x_i, \beta \rangle ) \right\}_{i=1}^n \\ \Delta' = \textrm{diag}\left\{ h''( \langle x_i, \beta \rangle) \right\}_{i=1}^n \\ V = \textrm{diag}\left\{ \psi''(\theta_i) \right\}_{i=1}^n \\ H = \textrm{diag}\left\{ y_i - \mu_i \right\}_{i=1}^n \\ W = \Delta V \Delta. \end{cases} \] Also, define the vector \(s = [ y_1 - \mu_1, \dots, y_n - \mu_n]^T.\) We can show through calculus (use the chain rule!) that \[ \nabla \mathcal{L}(\beta) = X^T \Delta s\] and \[ \nabla^2 \mathcal{L}(\beta) = - X^T (\Delta V \Delta - \Delta'H ) X.\]

Suppose we use the canonical link function. Then \(h\) is the identity, implying \(h'\) is identically equal to \(1\) and \(h''\) is identically equal to \(0\). Thus, \(\Delta = I\) and \(\Delta' = 0\), yielding a simpler expression for the observed information matrix: \[\nabla^2 \mathcal{L}(\beta) = -X^TWX.\]

We can compute the Fisher information matrix \(I(\beta)\) by taking the expectation of the observed information matrix: \[ I(\beta) = - \mathbb{E} \left[ \nabla^2 \mathcal{L}(\beta) \right] = X^T ( \Delta V \Delta - \Delta' \mathbb{E}[H])X = X^T ( \Delta V \Delta)X = X^TWX.\]

Note that the observed and Fisher matrices coincide if we use the canonical link function (as predicted by exponential family theory).

We make a brief digression to discuss the log-likelihood of weighted GLMs. Let \(T_1, \dots, T_n > 0\) be given scalar weights. We can generalize the log-likelihood of the GLM slightly by multiplying the \(i\)th term by \(T_i\): \[ \mathcal{L}(\beta) = \sum_{i=1}^n T_i \left[ y_i \cdot h(\langle x_i, \beta \rangle) - \psi( h(\langle x_i, \beta \rangle)) \right].\] We might do this if we consider some observations to be more “important” than others. Let \(T = \textrm{diag} \{ T_i \}_{i=1}^n\) be the diagonal matrix of weights. It is easy to show that the gradient \(\nabla \mathcal{L}\) and Hessian \(\nabla^2 \mathcal{L}\) of the weighted log-likelihood are \(\nabla \mathcal{L}(\beta) = X^T T \Delta s\) and \(\nabla^2 \mathcal{L}(\beta) = - X^T T (\Delta V \Delta - \Delta'H ) X\), respectively.

We can optimize the log likelihood through one of three related methods: Newton-Raphson, Fisher scoring, or iteratively reweighted least squares. We discuss these methods seriatim.

**Newton-Raphson**: Newton-Raphson is a general
optimization algorithm. Let \(f: \mathbb{R}^p
\to \mathbb{R}\) be a twice continuously differentiable, concave
function. The following iterative algorithm converges to the global
maximum of \(f\): \[ x^{(k+1)} \leftarrow x^{(k)} - [\nabla^2
f(x^{(k)})]^{-1} [ \nabla f(x^{(k)})].\] To use this algorithm to
optimize \(\mathcal{L},\) substitute
the negative observed information matrix for \(\nabla^2 f(x^{(k)})\) and the score
statistic for \(\nabla f(x^{(k)})\):
\[ \beta^{(k+1)} \leftarrow \beta^{(k)} +
\left[ X^T (\Delta V \Delta - \Delta'H ) X \right]^{-1} X^T \Delta s
|_{\beta = \beta^{(k)}}.\]

**Fisher scoring**: The Fisher information matrix is the
expected observed information matrix. It turns out that we can replace
the observed information matrix with the Fisher information matrix in
the Newton-Raphson algorithm and retain global convergence. Making this
substitution, we obtain \[ \beta^{(k+1)}
\leftarrow \beta^{(k)} + [ X^T W X]^{-1} [ X^T \Delta s]|_{ \beta =
\beta^{(k)} }.\] This modified algorithm is known as the Fisher
scoring algorithm.

**Iteratively reweighted least squares**. We can rewrite
the Fisher scoring algorithm in a clever way to derive a fast
implementation of the algorithm. Define \[\tilde{y} = [g'(\mu_1)y_1, \dots,
g'(\mu_n)y_n]^T\] and \[\tilde{\mu} = [g'(\mu_1) \mu_1, \dots,
g'(\mu_n)\mu_n]^T.\] We begin with a lemma.

**Lemma**: \(\Delta s =
W(\tilde{y} - \tilde{\mu})\).

**Proof**: Recall that \(g^{-1}(t) = \psi'(h(t)).\) By the
inverse derivative theorem, \[
\frac{1}{g'(g^{-1}(t))} = \psi''(h(t))h'(t).\]
Plugging in \(\langle X_i, \beta
\rangle\) for \(t\), we find
\[ \begin{multline}
\frac{1}{g'(g^{-1}(\langle X_i, \beta \rangle))} =
\psi''(h(\langle X_i, \beta \rangle)) h'(\langle X_i, \beta
\rangle) \iff \psi''(\theta_i) = \frac{1}{ g'(\mu_i) h'(
\langle X_i, \beta \rangle)} .\end{multline}\] Next, note that
\(W = \Delta V \Delta = \textrm{diag}\left\{
h'( \langle X_i, \beta \rangle )^2 \psi''(\theta_i)
\right\}_{i=1}^n.\) Plugging in the expression we derived for
\(\psi''(\theta_i)\) above, we
obtain \[ W = \textrm{diag}\left\{
\frac{h'(\langle X_i, \beta \rangle)}{ g'(\mu_i) }
\right\}_{i=1}^n.\] Finally, it is clear that \[\begin{multline}
\Delta s = \textrm{diag}\{ h'(\langle X_i, \beta \rangle)\}_{i=1}^n
([ y_1 - \mu_1, \dots, y_n - \mu_n]) \\ = \textrm{diag}\left\{
\frac{h'(\langle X_i, \beta \rangle)}{ g'(\mu_i) }
\right\}_{i=1}^n ( g'(\mu_1)[y_1 - \mu_1], \dots, g'(\mu_n)[y_n
- \mu_n]) = W(\tilde{y} - \tilde{\mu}).
\end{multline}\] \(\square\)

Returning to the optimization problem, we can rewrite Fisher’s scoring algorithm as \[ \beta^{(k+1)} \leftarrow \beta^{(k)} + (X^TWX)^{-1} X^T \Delta s\\ = (X^TWX)^{-1} X^TW(\tilde{y} - \tilde{\mu}) \\ = (X^TWX)^{-1} X^TW(\tilde{y} - \tilde{\mu} + X\beta^{(k)}).\]

Recall the weighted least squares problem. Given an objective function \(f: \mathbb{R}^p \to \mathbb{R}\) defined by \[ f(\beta) = (y - X \beta) W (y - X\beta)\] for an \(n \times p\) design matrix \(X\), a diagonal matrix of weights \(W\), and a response vector \(y\), the global minimizer is \[ \hat{\beta} = (X^TWX)^{-1}X^TWy.\] We can use a weighted least squares solver to compute \(\beta^{(k+1)}\): set the design matrix equal to \(X\), the diagonal matrix of weights equal to \(W\), and the response vector equal to \(\tilde{y} - \tilde{\mu} + X\beta^{(k)}\). This procedure is called iteratively reweighted least squares (IRLS).

R uses IRLS to implement the *glm* function. R initializes the
parameter \(\beta^{(0)}\) to a random
value. R then repeatedly solves weighted least squares problems until
the sequence \(\{ \beta^{(0)}, \beta^{(1)},
\dots \}\) converges.

Sometimes, the score vector is expressed using slightly different notation. Define the function \(V\) by \(V(\mu_i) := \mathbb{V}(y_i),\) i.e., \(V\) takes as an argument the mean \(\mu_i\) and outputs the variance of \(\mathbb{V}(y_i)\) of \(y_i\). Given that \(\mu_i = \psi'(\theta_i)\), we can use implicit differentiation to express \(V(\mu_i)\) as follows:

\[V(\mu_i) = \frac{d\mu_i}{d\theta_i} = \psi''(\theta_i) = \mathbb{V}(y_i),\] i.e., \(V(\mu_i)\) is the differential of \(\mu_i\) with respect to \(\theta_i\). Next, recalling that \(g(\mu_i) = \eta_i\) (where \(\eta_i\)) is the linear component of the model, we consider the following, additional implicit derivative: \[ \frac{d \eta_i}{d \mu_i} = g'(\mu_i).\] Using this alternate notation, we can express the diagonal entries of the weight matrix \(W\) as follows: \[W_i = \frac{1}{V(\mu_i)(d\eta_i/d\mu_i)^2}.\] This expression for \(W_i\) coincides with our previous expression for \(W_i\), i.e., \[W_i = h'(\eta_i)^2 \psi''(\theta_i).\] This holds for the following reason: recall from the previous lemma that \[\psi''(\theta_i) = \frac{1}{g'(\mu_i) h'(\eta_i)}.\] Thus,

\[\psi''(\theta_i) = \frac{1}{g'(\mu_i) h'(\eta_i)} \iff \psi''(\theta_i)^2 = \frac{1}{g'(\mu_i)^2 h'(\eta_i)^2} \] \[ \iff \frac{1}{\psi''(\theta_i) g'(\mu_i)^2} = h'(\eta_i)^2 \psi''(\theta_i) \iff \frac{1}{ V(\mu_i) (d \eta_i/d \mu_i)^2} = h'(\eta_i)^2\psi''(\theta_i).\] Define the matrix \(M\) by \(\textrm{diag} \left\{ d\eta_i/d\mu_i \right\}_{i=1}^n.\) Then we can express the score vector as \[ \nabla \mathcal{L}(\beta) = X^T WM (y - \mu),\] because \[W_i M_i = \frac{1}{V(\mu_i) (d\eta_i/d\mu_i)} = h'(\eta_i) = \Delta_i,\] again by the lemma. Under this formulation the Fisher information matrix \(I(\beta) = X^T W X\) remains unchanged.

Using this alternate notation, the \(r\)th Fisher scoring iteration can be written as \[\hat{\beta}^{(r+1)} \leftarrow \hat{\beta}^{(r)} + (X^T W X)^{-1}X^T W M (y - \hat{\mu}),\] where all terms on the right hand side have been evaluated at \(\beta = \hat{\beta}^{(r)}\). Define the \(r\)th “working response vector” \(z^{(r)}\) as \[z^{(r)} = \hat{\eta}^{(r)} + M(y - \hat{\mu}^{(r)}) = X \hat{\beta}^{(r)} + M(y - \hat{\mu}^{(r)}).\] We can write the \(r\)th scoring iteration in terms of \(z^{(r)}\) as follows:

\[ \hat{\beta}^{(r)} + (X^TWX)^{-1}X^TWM(y - \hat{\mu}^{(r)}) = (X^TWX)^{-1}(X^TWX) \hat{\beta}^{(r)} + (X^TWX)^{-1}X^TW M(y - \hat{\mu}^{(r)})\] \[= (X^TWX)^{-1} X^TW ( X \hat{\beta}^{(r)} + M(y - \hat{\mu}^{(r)})) = (X^TWX)^{-1} X^TWz.\]

The \(r\)th iteration of the procedure therefore reduces to the least squares problem \[ \hat{\beta}^{(r+1)} \leftarrow (X^TWX)^{-1}X^TWz^{(r)}.\] We proceed until we converge upon the MLE \(\hat{\beta}\), recomputing the weight matrix \(W\) and the working response vector \(z\) at each step using our updated estimates for \(\beta\).

We can obtain Wald standard errors and \(p\)-values for the estimated parameter
\(\hat{\beta}\) using MLE asymptotics.
Recall that \(I(\beta)\) is the Fisher
information matrix of the model evaluated at \(\beta\). If the number of samples \(n\) is large, we have the approximation
\[ \hat{\beta} \sim N_p(\beta,
[I(\hat{\beta})]^{-1}).\] We can compute \(I(\hat{\beta})\) according to the formula
\(I(\hat{\beta}) = X^T
W(\hat{\beta})X,\) where \(W(\hat{\beta})\) is the weight matrix
evaluated at \(\hat{\beta}\). Note that
R returns the matrix \(W(\hat{\beta})\)
as part of the *glm* output.

We can use a score test to test whether a new predictor should be added to an existing (i.e., already-fitted) GLM. Let \(z\), \(W\), and \(\hat{\eta}\) denote the final (converged) values of the working response vector, weight matrix, and linear component vector, respectively. Let the “working residuals” \(e = [e_1, \dots, e_n]\) be defined by \[e_i = z_i - \hat{\eta}_i,\] i.e., \(e_i\) is the \(i\)th residual of the weighted least squares regression of \(z\) onto the columns of \(X\) using weights \(W\). Let \(X_2 \in \mathbb{R}^n\) be a new column of predictors. Let \(E_2\) be the vector of residuals that we obtain after regressing \(X_2\) on the columns of \(X\) with weights \(W\): \[E_2 := X_2 - X(X^T W X)^{-1} X^T W X_2.\] The z-score test statistic is then

\[Z = \frac{E_2^T W e}{\sqrt{E_2^T W
E_2}}.\] Under the null hypothesis \(Z
\sim N(0,1)\) for large \(n\).
Note that \(e\) is only a function of
the fitted GLM (and thus remains unchanged if we decide to test new
vectors \(X_2\)). \(E_2\), by contrast, is a function of both
the fitted GLM *and* the new vector of predictors.

In R we can obtain the working residuals vector \(e\) from a fitted GLM object
`fit`

via `fit$residuals`

; likewise, we can obtain
the working weights \(W\) via
`fit$weights`

. We can use the function
`glm.scoretest()`

from the package **statmod**
to run a score test on a fitted GLM.

This ends our three-part mini series on exponential families, information matrices, and GLMs. We saw that GLMs, though old, are an elegant, general, and powerful method for modeling and inference. In later posts we will see how GLMs can be used to model genomic and genetic data, with an eye toward single-cell and bulk-tissue RNA-seq.

*9gag*

- Lecture Notes provided by Professor Bradley Efron.
- Lecture Notes provided by Professor Philippe Rigollet.
- Ibrahim, Joseph G. “Incomplete data in generalized linear models.” Journal of the American Statistical Association 85.411 (1990): 765-769.