A dive into the theory behind GLMs. This post covers the GLM model, canonical and non-canonical link functions, optimization of the log-likelihood, and inference.
This post is meant to be a reasonably self-contained exposition of GLM theory. I explore various topics, including canonical and non-canonical link functions, optimization of the log-likelihood, and inference.
This post was originally written in 2020. Motivated by our recent work on robust association testing via resampling score statistics, I have added a new section on GLM score tests.
Let \{P_\theta\}, \theta \in \Theta \subset \mathbb{R}, be a family of distributions. Recall that \{P_\theta\} belongs to the one-parameter exponential family if its density can be written as f(y \mid \theta) = e^{\theta T(y) - \psi(\theta)} h(y), where \theta is called the canonical parameter. In this post we will assume the data \{y_1, \dots, y_n\} come from a slightly restricted version of the exponential family. Specifically, we will assume the y_i's have density f(y_i \mid \theta_i) = e^{\theta_i y_i - \psi(\theta_i)} h(y_i), i.e., we will assume the sufficient statistic T(y_i) is the identity. This slightly narrower class encompasses the most important distributions for statistical modeling, including the binomial, exponential, Poisson, Bernoulli, negative binomial (assuming a fixed number of failures), and normal (assuming unit variance) distributions. Note that the function h is not a function of the unknown parameter \theta_i and thus will show up as a constant in the log-likelihood. The only function that contains information (relevant to inference) about the distribution is the cumulant generating function \psi. Thus, we can identify an exponential family (with identity sufficient statistic) with its canonical-form cumulant generating function.
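For concreteness, consider the Poisson distribution with mean \lambda. Writing \theta = \log \lambda, we have f(y \mid \theta) = \frac{\lambda^y e^{-\lambda}}{y!} = e^{\theta y - e^{\theta}} \cdot \frac{1}{y!}, so that \psi(\theta) = e^{\theta} and h(y) = 1/y!.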
We observe independent (but not necessarily identically distributed) tuples \{(y_1, x_1), \dots, (y_n, x_n)\}, where y_i \in \mathbb{R} and x_i \in \mathbb{R}^p. Assume the y_i's are drawn from a one-parameter exponential family with identity sufficient statistic and possibly different canonical parameters: y_i \sim f(y_i \mid \theta_i) = e^{\theta_i y_i - \psi(\theta_i)} h(y_i). Assume the x_i's are fixed and known. Recall that in exponential families, there exists a bijective map between the natural parameter and the mean (namely \mu_i = \psi'(\theta_i)). Let \{\mu_1, \dots, \mu_n\} denote the means of the observations \{y_1, \dots, y_n\}.
We must link the responses y_i to the covariates x_i. Assume there exists a strictly increasing and differentiable function g: \mathbb{R} \to \mathbb{R} and an unknown vector \beta \in \mathbb{R}^p such that g(\mu_i) = \langle x_i, \beta \rangle for all i \in \{1, \dots, n\}. The function g is called the link function, and the function g^{-1} is called the inverse link function. We sometimes use the symbol \eta_i to denote the linear component of the model, i.e., \eta_i = \langle x_i, \beta \rangle. With these various pieces in place, we can frame our statistical problem as follows: given data \{(y_i, x_i)\}_{i=1}^n, a link function g, and an exponential family density f, estimate the unknown parameter \beta. We will estimate \beta and obtain associated standard errors through MLE.
For a given exponential family, there exists a special link function called the canonical link function that imbues the GLM with very nice mathematical properties. Use of the canonical link is entirely optional; however, when given the option, people generally choose to use the canonical link given its nice properties.
The canonical link function is the link function that results from setting the linear component of the model \langle x_i, \beta \rangle equal to the natural parameter \theta_i. In math, the canonical link function is the function g_c that satisfies g_c(\mu_i) = \langle x_i, \beta \rangle = \theta_i. We can express the canonical link function in terms of known quantities. Recall that for an exponential family in canonical form (with identity sufficient statistic), we have \psi'(\theta_i) = \mathbb{E}[Y_i] = \mu_i. Therefore, by the definition of g_c above, we obtain g_c = [\psi']^{-1}. That is, the canonical link function is equal to the inverse of the derivative of \psi.
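For example, for the Poisson distribution we have \psi(\theta) = e^{\theta}, so that \psi'(\theta) = e^{\theta} and g_c(\mu) = [\psi']^{-1}(\mu) = \log \mu, i.e., the familiar log link. Likewise, for the Bernoulli distribution we have \psi(\theta) = \log(1 + e^{\theta}), so that \psi'(\theta) = e^{\theta}/(1 + e^{\theta}) and g_c(\mu) = \log\left( \frac{\mu}{1 - \mu} \right), i.e., the logit link.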
The main advantage of using the canonical link function is that the canonical link renders the canonical parameter of the joint distribution of y = (y_1, \dots, y_n)^T equal to \beta.
Theorem: Suppose we model the mean \mu_i of y_i using the canonical link function, i.e., g_c(\mu_i) = \langle x_i, \beta \rangle. Then the joint density of y = (y_1, \dots, y_n)^T is a p-parameter exponential family with canonical parameter \beta.
Proof sketch: Define \theta = X\beta, where \theta = [\theta_1, \dots, \theta_n]^T \in \mathbb{R}^n and X is the n \times p design matrix of observations. Let [S_1(y), \dots, S_p(y)]^T denote the p-dimensional vector given by the matrix-vector product X^T y. Observe that \sum_{i=1}^n \theta_i y_i = \sum_{i=1}^n (x_i^T \beta) y_i = \beta^T X^T y = \langle \beta, [S_1(y), \dots, S_p(y)]^T \rangle. Next, define \phi: \mathbb{R}^p \to \mathbb{R} by \phi(\beta) = \sum_{i=1}^n \psi(x_i^T \beta) = \sum_{i=1}^n \psi(\theta_i). Finally, define H(y) = \prod_{i=1}^n h(y_i). We can express the joint density f_c of y = (y_1, \dots, y_n)^T as f_c(y \mid \beta) = e^{\sum_{i=1}^n \left[ \theta_i y_i - \psi(\theta_i) \right]} \prod_{i=1}^n h(y_i) = \left( e^{\langle \beta, [S_1(y), \dots, S_p(y)]^T \rangle - \phi(\beta)} \right) H(y). \square
We see that f_c(y \mid \beta) is a p-dimensional exponential family density in canonical form. It therefore inherits all the nice properties of such densities, including concavity of the log-likelihood and equality of the observed and Fisher information matrices:
Concavity. The log-likelihood for \beta is a concave function defined over a convex set. Typically, the log-likelihood for \beta is strictly concave (although this depends on \phi). Thus, the MLE for \beta typically exists, is unique, and is easy to compute.
Equality of observed and Fisher information matrices. The observed and Fisher information matrices for β coincide. This fact simplifies some aspects of GLM fitting and inference, as we will see.
We can also use a link function that is non-canonical. Consider the same setup as before (i.e., the problem setup presented in the GLM model section). Given an exponential family distribution, a link function, and a set of data, our goal is to estimate the unknown parameter \beta of the GLM through MLE.
We begin by introducing some notation. Let h: \mathbb{R} \to \mathbb{R} be the function that maps the linear predictor to the canonical parameter, i.e., h(\langle x_i, \beta \rangle) = \theta_i. The function h is simply the identity function when we use the canonical link. In general, we can express h in terms of g and \psi: h(\langle x_i, \beta \rangle) = \theta_i = [\psi']^{-1}(\mu_i) = [\psi']^{-1}(g^{-1}(\langle x_i, \beta \rangle)). Thus, h = [\psi']^{-1} \circ g^{-1}.
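For example, consider Bernoulli responses paired with the (non-canonical) probit link g = \Phi^{-1}, where \Phi is the standard normal CDF. For the Bernoulli family [\psi']^{-1}(\mu) = \log(\mu/(1 - \mu)), so h(t) = [\psi']^{-1}(g^{-1}(t)) = \log\left( \frac{\Phi(t)}{1 - \Phi(t)} \right), which is not the identity.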
At this point we have defined a lot of functions. Three in particular are relevant to our analysis: the cumulant generating function \psi, whose derivative \psi' maps the canonical parameter \theta_i to the mean \mu_i; the link function g, which maps the mean \mu_i to the linear predictor \eta_i = \langle x_i, \beta \rangle; and the function h = [\psi']^{-1} \circ g^{-1} defined above, which maps the linear predictor \eta_i back to the canonical parameter \theta_i.
With these definitions in hand, we can express the log-likelihood of the model (up to a constant) as follows: \mathcal{L}(\beta;y,X) = \sum_{i=1}^n \theta_i y_i - \psi(\theta_i) = \sum_{i=1}^n y_i \cdot h(\langle x_i, \beta \rangle) - \psi( h(\langle x_i, \beta \rangle)). For optimization and inference purposes, we need to compute the gradient and Hessian of the log-likelihood. Recall that the gradient of the log-likelihood is the score statistic, the Hessian of the log-likelihood is the negative observed information matrix, and the expected Hessian of the log-likelihood is the negative Fisher information matrix. We need to define some matrices and vectors to express these quantities compactly. Yes, this is painful, but it is necessary. Define the matrices
\begin{cases} \Delta = \textrm{diag}\left\{ h'( \langle x_i, \beta \rangle ) \right\}_{i=1}^n \\ \Delta' = \textrm{diag}\left\{ h''( \langle x_i, \beta \rangle) \right\}_{i=1}^n \\ V = \textrm{diag}\left\{ \psi''(\theta_i) \right\}_{i=1}^n \\ H = \textrm{diag}\left\{ y_i - \mu_i \right\}_{i=1}^n \\ W = \Delta V \Delta. \end{cases} Also, define the vector s = [ y_1 - \mu_1, \dots, y_n - \mu_n]^T. We can show through calculus (use the chain rule!) that \nabla \mathcal{L}(\beta) = X^T \Delta s and \nabla^2 \mathcal{L}(\beta) = - X^T (\Delta V \Delta - \Delta'H ) X.
Suppose we use the canonical link function. Then h is the identity, implying h' is identically equal to 1 and h'' is identically equal to 0. Thus, \Delta = I and \Delta' = 0, yielding a simpler expression for the observed information matrix: \nabla^2 \mathcal{L}(\beta) = -X^TWX.
We can compute the Fisher information matrix I(\beta) by taking the expectation of the observed information matrix: I(\beta) = - \mathbb{E} \left[ \nabla^2 \mathcal{L}(\beta) \right] = X^T ( \Delta V \Delta - \Delta' \mathbb{E}[H])X = X^T ( \Delta V \Delta)X = X^TWX.
Note that the observed and Fisher information matrices coincide when we use the canonical link function (as predicted by exponential family theory).
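To make these formulas concrete, here is a minimal R sketch (simulated data; the variable names are purely illustrative) that checks the score formula X^T \Delta s against a finite-difference gradient of the log-likelihood for a canonical-link Bernoulli (logistic) GLM, where \Delta = I and W = V:

```r
# Sketch: numerically check the score and (canonical-link) information formulas
# for logistic regression, where h is the identity, Delta = I, and W = V.
set.seed(1)
n <- 500; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(-0.5, 1, -1)
mu <- plogis(X %*% beta)               # inverse canonical (logit) link
y <- rbinom(n, 1, mu)

loglik <- function(b) {
  theta <- X %*% b                      # canonical link: theta_i = <x_i, b>
  sum(y * theta - log(1 + exp(theta)))  # psi(theta) = log(1 + e^theta)
}

# Score via the formula X^T Delta s = X^T (y - mu) under the canonical link
score_formula <- t(X) %*% (y - plogis(X %*% beta))

# Score via central finite differences of the log-likelihood
eps <- 1e-6
score_numeric <- sapply(seq_len(p), function(j) {
  e_j <- replace(numeric(p), j, eps)
  (loglik(beta + e_j) - loglik(beta - e_j)) / (2 * eps)
})

max(abs(score_formula - score_numeric))  # should be near zero (finite-difference error only)

# Fisher information via X^T W X, with W = diag{psi''(theta_i)} = diag{mu_i (1 - mu_i)};
# under the canonical link this equals the observed information as well.
w <- as.vector(mu * (1 - mu))
info_formula <- t(X) %*% (w * X)
```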
We make a brief digression to discuss the log-likelihood of weighted GLMs. Let T_1, \dots, T_n > 0 be given scalar weights. We can generalize the log-likelihood of the GLM slightly by multiplying the ith term by T_i: \mathcal{L}(\beta) = \sum_{i=1}^n T_i \left[ y_i \cdot h(\langle x_i, \beta \rangle) - \psi( h(\langle x_i, \beta \rangle)) \right]. We might do this if we consider some observations to be more “important” than others. Let T = \textrm{diag} \{ T_i \}_{i=1}^n be the diagonal matrix of weights. It is easy to show that the gradient \nabla \mathcal{L} and Hessian \nabla^2 \mathcal{L} of the weighted log-likelihood are \nabla \mathcal{L}(\beta) = X^T T \Delta s and \nabla^2 \mathcal{L}(\beta) = - X^T T (\Delta V \Delta - \Delta'H ) X, respectively.
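As a small illustration, and under the assumption that glm's prior weights enter the Poisson log-likelihood exactly as the T_i do above (the dispersion is fixed at one for this family), supplying integer weights should give the same MLE as replicating each observation T_i times:

```r
# Sketch: for a dispersion-one family like the Poisson, integer prior weights
# passed to glm() should yield the same MLE as replicating each row T_i times,
# since both maximize the same weighted log-likelihood.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))
T_wts <- sample(1:3, n, replace = TRUE)

fit_weighted <- glm(y ~ x, family = poisson(), weights = T_wts)
fit_replicated <- glm(rep(y, T_wts) ~ rep(x, T_wts), family = poisson())

all.equal(coef(fit_weighted), coef(fit_replicated), check.attributes = FALSE)  # should return TRUE
```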
We can optimize the log-likelihood through one of three related methods: Newton-Raphson, Fisher scoring, or iteratively reweighted least squares. We discuss these methods seriatim.
Newton-Raphson: Newton-Raphson is a general optimization algorithm. Let f: \mathbb{R}^p \to \mathbb{R} be a twice continuously differentiable, concave function. Under suitable regularity conditions, the following iterative algorithm converges to the global maximum of f: x^{(k+1)} \leftarrow x^{(k)} - [\nabla^2 f(x^{(k)})]^{-1} [ \nabla f(x^{(k)})]. To use this algorithm to optimize \mathcal{L}, substitute the negative observed information matrix for \nabla^2 f(x^{(k)}) and the score statistic for \nabla f(x^{(k)}): \beta^{(k+1)} \leftarrow \beta^{(k)} + \left[ X^T (\Delta V \Delta - \Delta'H ) X \right]^{-1} X^T \Delta s |_{\beta = \beta^{(k)}}.
Fisher scoring: The Fisher information matrix is the expected observed information matrix. It turns out that we can replace the observed information matrix with the Fisher information matrix in the Newton-Raphson algorithm and retain global convergence. Making this substitution, we obtain \beta^{(k+1)} \leftarrow \beta^{(k)} + [ X^T W X]^{-1} [ X^T \Delta s]|_{ \beta = \beta^{(k)} }. This modified algorithm is known as the Fisher scoring algorithm.
Iteratively reweighted least squares. We can rewrite the Fisher scoring algorithm in a clever way to derive a fast implementation of the algorithm. Define \tilde{y} = [g'(\mu_1)y_1, \dots, g'(\mu_n)y_n]^T and \tilde{\mu} = [g'(\mu_1) \mu_1, \dots, g'(\mu_n)\mu_n]^T. We begin with a lemma.
Lemma: \Delta s = W(\tilde{y} - \tilde{\mu}).
Proof: Recall that g^{-1}(t) = \psi'(h(t)). By the formula for the derivative of an inverse function, \frac{1}{g'(g^{-1}(t))} = \psi''(h(t))h'(t). Plugging in \langle x_i, \beta \rangle for t, we find \begin{multline} \frac{1}{g'(g^{-1}(\langle x_i, \beta \rangle))} = \psi''(h(\langle x_i, \beta \rangle)) h'(\langle x_i, \beta \rangle) \iff \psi''(\theta_i) = \frac{1}{ g'(\mu_i) h'( \langle x_i, \beta \rangle)} .\end{multline} Next, note that W = \Delta V \Delta = \textrm{diag}\left\{ h'( \langle x_i, \beta \rangle )^2 \psi''(\theta_i) \right\}_{i=1}^n. Plugging in the expression we derived for \psi''(\theta_i) above, we obtain W = \textrm{diag}\left\{ \frac{h'(\langle x_i, \beta \rangle)}{ g'(\mu_i) } \right\}_{i=1}^n. Finally, \begin{multline} \Delta s = \textrm{diag}\{ h'(\langle x_i, \beta \rangle)\}_{i=1}^n [ y_1 - \mu_1, \dots, y_n - \mu_n]^T \\ = \textrm{diag}\left\{ \frac{h'(\langle x_i, \beta \rangle)}{ g'(\mu_i) } \right\}_{i=1}^n [ g'(\mu_1)(y_1 - \mu_1), \dots, g'(\mu_n)(y_n - \mu_n)]^T = W(\tilde{y} - \tilde{\mu}). \end{multline} \square
Returning to the optimization problem, we can rewrite the Fisher scoring update as \beta^{(k+1)} \leftarrow \beta^{(k)} + (X^TWX)^{-1} X^T \Delta s = \beta^{(k)} + (X^TWX)^{-1} X^TW(\tilde{y} - \tilde{\mu}) = (X^TWX)^{-1} X^TW(\tilde{y} - \tilde{\mu} + X\beta^{(k)}).
Recall the weighted least squares problem: given an objective function f: \mathbb{R}^p \to \mathbb{R} defined by f(\beta) = (y - X \beta)^T W (y - X\beta) for an n \times p design matrix X, a diagonal matrix of weights W, and a response vector y, the global minimizer is \hat{\beta} = (X^TWX)^{-1}X^TWy. We can therefore use a weighted least squares solver to compute \beta^{(k+1)}: set the design matrix equal to X, the diagonal matrix of weights equal to W, and the response vector equal to \tilde{y} - \tilde{\mu} + X\beta^{(k)}. This procedure is called iteratively reweighted least squares (IRLS).
R uses IRLS to implement the glm function. R initializes the algorithm using family-specific starting values for the fitted means (or user-supplied starting values) and then repeatedly solves weighted least squares problems until the sequence \{ \beta^{(0)}, \beta^{(1)}, \dots \} converges.
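To make the algorithm concrete, here is a bare-bones IRLS sketch in R for a Poisson GLM with the canonical log link (simulated data; a pedagogical sketch rather than a production implementation). It should agree with glm() up to numerical precision:

```r
# Sketch: IRLS for a Poisson GLM with canonical log link.
# Here g(mu) = log(mu), so g'(mu) = 1/mu, V(mu) = mu, and the
# working weights are W_i = 1 / (V(mu_i) g'(mu_i)^2) = mu_i.
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.5 * x))
X <- cbind(1, x)

beta <- rep(0, ncol(X))        # simple starting value (R's glm uses family-specific starts)
for (iter in 1:25) {
  eta <- as.vector(X %*% beta)                  # linear predictor
  mu <- exp(eta)                                # inverse link
  w <- mu                                       # working weights W_i
  z <- eta + (y - mu) / mu                      # working response: eta + g'(mu)(y - mu),
                                                # equivalently ytilde - mutilde + X beta
  beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted least squares step
  if (max(abs(beta_new - beta)) < 1e-10) { beta <- beta_new; break }
  beta <- beta_new
}

cbind(irls = as.vector(beta), glm = coef(glm(y ~ x, family = poisson())))
```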
Sometimes, the score vector is expressed using slightly different notation. Define the function V by V(\mu_i) := \mathbb{V}(y_i), i.e., V takes as an argument the mean \mu_i and outputs the variance \mathbb{V}(y_i) of y_i. (This overloads the symbol V, but the two usages are consistent: the matrix V defined earlier is \textrm{diag}\{ V(\mu_i) \}_{i=1}^n.) Given that \mu_i = \psi'(\theta_i), we can differentiate to express V(\mu_i) as follows:
V(\mu_i) = \frac{d\mu_i}{d\theta_i} = \psi''(\theta_i) = \mathbb{V}(y_i), i.e., V(\mu_i) is the derivative of \mu_i with respect to \theta_i. Next, recalling that g(\mu_i) = \eta_i (where \eta_i is the linear component of the model), we consider the additional derivative \frac{d \eta_i}{d \mu_i} = g'(\mu_i). Using this alternate notation, we can express the diagonal entries of the weight matrix W as follows: W_i = \frac{1}{V(\mu_i)(d\eta_i/d\mu_i)^2}. This expression for W_i coincides with our previous expression for W_i, i.e., W_i = h'(\eta_i)^2 \psi''(\theta_i). This holds for the following reason: recall from the previous lemma that \psi''(\theta_i) = \frac{1}{g'(\mu_i) h'(\eta_i)}. Thus,
\psi''(\theta_i) = \frac{1}{g'(\mu_i) h'(\eta_i)} \iff \psi''(\theta_i)^2 = \frac{1}{g'(\mu_i)^2 h'(\eta_i)^2} \iff \frac{1}{\psi''(\theta_i) g'(\mu_i)^2} = h'(\eta_i)^2 \psi''(\theta_i) \iff \frac{1}{ V(\mu_i) (d \eta_i/d \mu_i)^2} = h'(\eta_i)^2\psi''(\theta_i). Define the matrix M by \textrm{diag} \left\{ d\eta_i/d\mu_i \right\}_{i=1}^n. Then we can express the score vector as \nabla \mathcal{L}(\beta) = X^T WM (y - \mu), because W_i M_i = \frac{1}{V(\mu_i) (d\eta_i/d\mu_i)} = h'(\eta_i) = \Delta_i, again by the lemma. Under this formulation the Fisher information matrix I(\beta) = X^T W X remains unchanged.
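For example, for a Poisson GLM with the canonical log link we have V(\mu_i) = \mu_i and d\eta_i/d\mu_i = 1/\mu_i, so that W_i = \mu_i, M_i = 1/\mu_i, and W_i M_i = 1. The score therefore reduces to \nabla \mathcal{L}(\beta) = X^T(y - \mu), as expected under a canonical link (where \Delta = I).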
Using this alternate notation, the rth Fisher scoring iteration can be written as \hat{\beta}^{(r+1)} \leftarrow \hat{\beta}^{(r)} + (X^T W X)^{-1}X^T W M (y - \hat{\mu}), where all terms on the right hand side have been evaluated at \beta = \hat{\beta}^{(r)}. Define the rth “working response vector” z^{(r)} as z^{(r)} = \hat{\eta}^{(r)} + M(y - \hat{\mu}^{(r)}) = X \hat{\beta}^{(r)} + M(y - \hat{\mu}^{(r)}). We can write the rth scoring iteration in terms of z^{(r)} as follows:
\hat{\beta}^{(r)} + (X^TWX)^{-1}X^TWM(y - \hat{\mu}^{(r)}) = (X^TWX)^{-1}(X^TWX) \hat{\beta}^{(r)} + (X^TWX)^{-1}X^TW M(y - \hat{\mu}^{(r)}) = (X^TWX)^{-1} X^TW ( X \hat{\beta}^{(r)} + M(y - \hat{\mu}^{(r)})) = (X^TWX)^{-1} X^TWz^{(r)}.
The rth iteration of the procedure therefore reduces to the least squares problem \hat{\beta}^{(r+1)} \leftarrow (X^TWX)^{-1}X^TWz^{(r)}. We proceed until we converge upon the MLE \hat{\beta}, recomputing the weight matrix W and the working response vector z at each step using our updated estimates for \beta.
We can obtain Wald standard errors and p-values for the estimated parameter \hat{\beta} using MLE asymptotics. Recall that I(\beta) is the Fisher information matrix of the model evaluated at \beta. If the number of samples n is large, we have the approximation \hat{\beta} \sim N_p(\beta, [I(\hat{\beta})]^{-1}). We can compute I(\hat{\beta}) according to the formula I(\hat{\beta}) = X^T W(\hat{\beta})X, where W(\hat{\beta}) is the weight matrix evaluated at \hat{\beta}. Note that R returns the matrix W(\hat{\beta}) as part of the glm output.
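To connect this to R output, here is a short sketch (simulated Poisson data, so the dispersion is fixed at one) that reconstructs the Wald standard errors reported by summary.glm from X and the working weights W(\hat{\beta}) returned by glm:

```r
# Sketch: recompute Wald standard errors from the Fisher information
# I(beta_hat) = X^T W(beta_hat) X for a dispersion-one family (e.g., Poisson).
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.2 + 0.7 * x))
fit <- glm(y ~ x, family = poisson())

X <- model.matrix(fit)
w <- fit$weights                      # working weights, i.e., the diagonal of W(beta_hat)
info <- t(X) %*% (w * X)              # Fisher information X^T W X
se_by_hand <- sqrt(diag(solve(info))) # Wald standard errors

cbind(by_hand = se_by_hand, summary_glm = summary(fit)$coefficients[, "Std. Error"])
```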
We can use a score test to test whether a new predictor should be added to an existing (i.e., already-fitted) GLM. Let z, W, and \hat{\eta} denote the final (converged) values of the working response vector, weight matrix, and linear component vector, respectively. Let the “working residuals” e = [e_1, \dots, e_n] be defined by e_i = z_i - \hat{\eta}_i, i.e., e_i is the ith residual of the weighted least squares regression of z onto the columns of X using weights W. Let X_2 \in \mathbb{R}^n be a new column of predictors. Let E_2 be the vector of residuals that we obtain after regressing X_2 on the columns of X with weights W: E_2 := X_2 - X(X^T W X)^{-1} X^T W X_2. The z-score test statistic is then
Z = \frac{E_2^T W e}{\sqrt{E_2^T W E_2}}. Under the null hypothesis Z \sim N(0,1) for large n. Note that e is only a function of the fitted GLM (and thus remains unchanged if we decide to test new vectors X_2). E_2, by contrast, is a function of both the fitted GLM and the new vector of predictors.
In R we can obtain the working residuals vector e from a fitted GLM object fit via fit$residuals; likewise, we can obtain the working weights W via fit$weights. We can use the function glm.scoretest() from the package statmod to run a score test on a fitted GLM.
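Putting the pieces together, the following sketch computes the score statistic Z from the formula above using the quantities stored in a fitted glm object, and compares it against glm.scoretest (this assumes the statmod package is installed and that glm.scoretest accepts the fitted model and the new covariate as its first two arguments):

```r
# Sketch: compute the GLM score test statistic "by hand" using the formula above,
# and compare with statmod::glm.scoretest (assumed interface: fitted model, new covariate).
set.seed(4)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rpois(n, lambda = exp(0.3 + 0.6 * x1))   # x2 has no effect under the null
fit <- glm(y ~ x1, family = poisson())

X <- model.matrix(fit)
w <- fit$weights                 # working weights W
e <- fit$residuals               # working residuals e = z - eta_hat

# Residual of x2 after weighted regression onto the columns of X
E2 <- x2 - X %*% solve(t(X) %*% (w * X), t(X) %*% (w * x2))

# Score test z-statistic from the formula above
Z <- sum(E2 * w * e) / sqrt(sum(E2 * w * E2))

# Compare (should agree for a dispersion-one family like the Poisson)
c(by_hand = Z, statmod = statmod::glm.scoretest(fit, x2))
```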
This ends our three-part mini series on exponential families, information matrices, and GLMs. We saw that GLMs, though old, are an elegant, general, and powerful method for modeling and inference. In later posts we will see how GLMs can be used to model genomic and genetic data, with an eye toward single-cell and bulk-tissue RNA-seq.