Part III of this book describes the statistical and algorithmic methods that underlie sceptre
. Users who wish to apply sceptre
to analyze their data — but do not need to know about the specifics of the methods contained within the software — safely can skip this part.
Several statistical tasks are involved in the analysis of single-cell CRISPR screen data, including assigning gRNAs to cells, testing for association between gRNAs and the expression of responses (such as genes), and integrating information across gRNAs that target the same site. This chapter provides a high-level overview of the statistical methods that sceptre
employs to carry out these analysis tasks.
Notation
We begin by introducing some notation that we use throughout the chapter. Let denote the set of nonnegative integers. For a given single-cell CRISPR screen dataset, let denote the number of responses, the number of gRNAs, and the number of cells in the dataset. Next, let be the matrix of response UMI counts, where responses are in the rows and cells in the columns. A given entry of indicates the number of UMIs from response sequenced in cell . Similarly, let be the matrix of gRNA UMI counts; a given entry of indicates the number of UMIs from gRNA sequenced in cell . Finally, let be the cell-specific covariates (e.g., response_n_umis
, response_n_nonzero
, response_p_mito
, grna_n_umis
, grna_n_nonzero
, batch
, etc.), where we have included a “1” in each to serve as an intercept term. (We assume that the count-based covariates, such as grna_n_nonzero
and grna_n_umis
, have been log-transformed.) Finally, let denote the vectors assembled into a matrix. Throughout, we use an uppercase letter (e.g., ) to refer to a random variable and a lowercase letter (e.g., ) to refer to the realized value of the random variable.
The first statistical task involved in single-cell CRISPR screen analysis is to assign gRNAs to cells. The gRNA assignment task involves leveraging the gRNA count matrix (and possibly the covariate matrix ) to impute the (unobserved) binary matrix of gRNA presences and absences, where a given entry of is defined to be (resp., ) if gRNA is present (resp., absent) in cell . sceptre
implements three gRNA assignment strategies: the thresholding method, maximum method, and the mixture method. All three methods account for background contamination, the phenomenon by which gRNA reads sometimes map to cells that do not contain the corresponding gRNA. (See Assign gRNAs for a real-data example of background contamination.) We describe each method in greater depth here.
Thresholding method
The thresholding method assigns a gRNA to a cell if the UMI count of the gRNA in the cell exceeds some integer threshold . Formally, the thresholding method sets to if and to if . The default value for is 5 (as proposed in Gasperini et al. (2019) and further validated by Barry, Roeder, and Katsevich (2024)). An important special case involves setting to , which implements the “naive” (but occasionally useful) strategy of assigning any gRNA expressed in a given cell to that cell (and thus ignoring background contamination).
Maximum method
The maximum method assigns the gRNA that exhibits the greatest UMI count in a given cell to that cell. Formally, consider a given cell . Let be the index such that (If multiple indices satisfy this criterion, select among these arbitrarily.) We assign gRNA to cell , i.e. we set to and to for . In carrying out the maximum assignment step, sceptre
also flags cells that likely contain multiple gRNAs. Let be a user-specified threshold (default value ). Suppose that the gRNA assigned to cell constitutes fewer than of the UMIs in that cell. In other words, suppose that
Then cell is flagged as containing multiple gRNAs. Similarly, scepre
flags cells that likely contain zero gRNAs. Suppose that the sum of the gRNA UMI counts in a given cell is less than (default value 5):
Then cell is flagged as containing zero gRNAs. Cells containing zero gRNAs or multiple gRNAs are removed as part of the cellwise QC step. (Note that if multiple gRNAs are tied for most highly expressed in a given cell, and if , then that cell is flagged as containing multiple gRNAs, per the above rule.)
Mixture method
The mixture method assigns gRNAs to cells using a latent variable generalized linear model (GLM). Consider a given gRNA . Let be the UMI count of gRNA in cell , and let be the (unobserved) variable indicating whether gRNA is present () or absent () in cell . (We suppress the subscript for notational compactness.) We model the gRNA UMI counts using a latent variable Poisson GLM:
Here, is the mean expression level of gRNA in cell (given the covariates); and are the (unknown) regression coefficients; and is the (unknown) probability that gRNA is present in a given cell. We fit the model Equation 10.1 using an EM algorithm, producing estimates , and for the model coefficients , , and . Using these estimates, we can compute the probability that a given cell contains gRNA . (The probabilities are sometimes called “posterior probabilities.”) Finally, we threshold the posterior probabilities at some threshold (default value 0.8) to assign gRNA to the cells, i.e. we set if and if . An advantage of the Poisson GLM framework is that it enables us to account for cell-specific covariates that might affect the expression level of the gRNA, such as the number of gRNA UMIs sequenced in a given cell or batch. Formulating the model Equation 10.1 is fairly straightforward; fitting the model to data in a fast and numerically stable way is more challenging. sceptre
leverages a novel variant of the EM algorithm for this purpose, which we describe below.
Approximating the model Equation 10.1 with a simpler model. We begin by approximating the latent variable model Equation 10.1 with a simpler latent variable model that is easier to estimate. First, we obtain accurate estimates for the parameter by exploiting the fact that the gRNA is present in only a small fraction (typically ) of cells. Let denote the probability mass function of the Poisson distribution with mean evaluated at , i.e.,
Conditioning on the covariates (i.e., treating and as fixed), we can express the log-likelihood of the GLM in Equation 10.1 as follows:
In words, the gRNA indicator is equal to zero in the large majority of cells, and so we can approximate the GLM in Equation 10.1 with the GLM that results from excluding from the model:
We can estimate the GLM Equation 10.2 using standard maximum likelihood estimation, as all the variables in this GLM are observed. Doing so yields an estimate for . Assuming that (which holds for large and small ), we can write
where we have set to , i.e. the th fitted value (on the scale of the cannonical parameter) of the GLM Equation 10.2. Finally, we propose a simplified model for expression of the gRNA:
Here, the s are offset terms in the GLM. Notice that the GLM in Equation 10.3 does not include covariates or an intercept term; rather, all information about the covariates and the baseline expression level of the gRNA has been “encoded” into the offsets . In summary we (i) estimated using the GLM Equation 10.2 and then (ii) replaced in Equation 10.1 with its estimate, yielding the model Equation 10.3. The model Equation 10.3 is a good approximation to Equation 10.1 in the sense that the maximum likelihood estimates for and in Equation 10.3 are close to the corresponding estimates for these parameters in Equation 10.1 when is large and is small (as is the case on most single-cell CRISPR screen datasets). We turn our attention to estimating the model Equation 10.3, treating the offset terms as known and fixed.
EM algorithm. We derive an EM algorithm to estimate the model Equation 10.3. Let denote the unknown model parameters. We begin by writing down the complete-data likelihood of this model, which is the likelihood that would result if had been observed.
We obtain the complete-data log-likelihood by taking the log of :
We derive the E and M steps for the EM algorithm in this model.
E step: The E step entails computing the membership probability of each cell (i.e., the probability that each cell contains the gRNA given the current parameter estimates and the observed gRNA counts). Let be the parameter estimate for at the th iteration of the algorithm. The th membership probability at the th iteration of the algorithm is defined as Let be the mean gRNA UMI count in the th cell at the th iteration of the algorithm that results from setting to , i.e.,
Applying Bayes rule, we can express the membership probability as
where we define
Plugging in the Poisson probability mass function for , we can express is follows.
This expression Equation 10.4 for is numerically stable and fast to evaluate. In summary the M step consists of computing for all by computing (using Equation 10.4) and then evaluating
M step: The M-step involves maximizing the so-called “Q function,” which is the function that results from taking the expectation of the complete-data log-likelihood with respect to the s while conditioning on the s and the current parameter estimates . Formally, the Q function is defined as . We can express the Q function as
The goal of the M step to identify the parameter that maximizes Equation 10.5. The first two terms of Equation 10.5 are a function of , and the last two terms are a function of . Thus, we can optimize these sets of terms separately.
To find the maximizer in , we differentiate the first two terms of Equation 10.5 with respect to , yielding
Setting Equation 10.6 to zero and solving for produces the maximizer :
We turn our attention to the last two terms of Equation 10.5. The final term is not a function of and thus can be ignored. We rewrite the penultimate term as follows:
The derivative of Equation 10.8 with respect to is
Finally, setting Equation 10.9 to zero and solving for yields the maximizer
In summary the M step entails computing the updated parameter estimates and using the formulas Equation 10.7 and Equation 10.10, respectively.
Convergence: The incomplete-data likelihood (which we obtain by integrating the complete-data likelihood with respect to the s) is
The incomplete-data log-likelihood is
Note that the incomplete-data log-likelihood — in contrast to the complete-data log-likelihood — is computable (albeit hard to optimize directly). We iterate between E and M steps until converges. We declare that the sequence of estimates has converged when
for some small (in the code, ). We set the final parameter estimates and to and . Additionally, we use the final membership probabilities to assign the gRNA to cells.
As is standard, we run the EM algorithm several times over random starting estimates for and to improve chances of converging to the global maximizer of the incomplete-data likelihood. We use the run whose estimates yield the greatest incomplete-data log-likelihood to assign gRNAs to cells.
Definition of “treatment group” and “control group”
To test for association between a given gRNA (or gRNA target) and a response, sceptre
divides the cells into two groups: the “treatment group” and the “control group.” sceptre
tests for differential expression of the response across these two groups of cells, yielding a p-value for the test of association between the gRNA and response. Defining the cells that constitute the treatment group and control group is somewhat subtle and depends on the analysis parameters of gRNA integration strategy (i.e., union, singleton, or Bonferroni), control group (i.e., complement set or NT cells), and analysis type (i.e., discovery analysis or calibration check). (See Set analysis parameters for a discussion of these parameters.) This section aims to carefully define “treatment group” and “control group,” in particular as a function of the aforementioned analysis parameters.
We first introduce some notation. Let denote the entire set of cells. Let denote the matrix of imputed gRNA presences and absences (as determined in the gRNA assignment step). Index the gRNAs by . For , let denote the set of cells that gRNA has infected, i.e. Let denote the set of non-targeting gRNAs, and let denote the set of targeting gRNAs. (Typically, consists of both discovery gRNAs and positive control gRNAs.) Note that in low MOI, after applying quality control, each cell contains a single gRNA, implying that the sets are disjoint, i.e.
This is not the case in high-MOI; in that setting the intersection of and for is nonempty (in general). We proceed by taking cases on gRNA integration strategy, first considering the union integration strategy and then considering the singleton and Bonferroni integration strategies.
Union gRNA integration strategy. We first consider the “union” gRNA integration strategy. The union integration strategy involves combining multiple gRNAs into a single “combined” gRNA and then testing this combined gRNA against responses as if it were a singleton gRNA. Suppose that we are running a discovery analysis, and let be a given set of gRNAs that targets the same site. The table below defines the treatment group, the complement set control group, and the NT cells control group relative to .
Union gRNA integration strategy, discovery analysis.
Treatment group |
Complement set control group |
NT cells control group |
|
|
|
Next, suppose that we are running a calibration check analysis (again using the union gRNA integration strategy). Let be a set of non-targeting gRNAs that has been randomly combined into a “negative control” gRNA target. (By default, , i.e., the discovery target and negative control target contain the same number of gRNAs.) The treatment group, complement set control group, and NT cells control group are defined relative to as follows.
Union gRNA integration strategy, calibration check.
Treatment group |
Complement set control group |
NT cells control group |
|
|
|
Singleton and Bonferroni gRNA integration strategies. Next, we consider the “singleton” and “Bonferroni” gRNA integration strategies. Both strategies involve testing singleton gRNAs for association against responses. Suppose that we are running a discovery analysis, and let be a given targeting gRNA. The table below defines the treatment group, complement set control group, and NT cells control group relative to .
Singleton/Bonferroni gRNA integration strategy, discovery analysis.
Treatment group |
Complement set control group |
NT cells control group |
|
|
|
Finally, suppose that we are running a calibration check using the singleton or Bonferroni gRNA integration strategy. Let be a non-targeting gRNA. The treatment group, complement set control group, and NT cells control group are defined relative to as follows.
Singleton/Bonferroni gRNA integration strategy, calibration check.
Treatment group |
Complement set control group |
NT cells control group |
|
|
|
Note that the Bonferroni gRNA integration strategy combines p-values across gRNAs that target the same site into a single “target-wide” p-value via a Bonferroni correction. Note also that the complement set is available as a control group in both low- and high-MOI settings, while the NT cells are available as a control group only in the low-MOI setting.
Differential expression testing
We next explicate the differential expression testing paradigm of sceptre
. sceptre
is based on the combination of a negative binomial model and a resampling framework for differential expression, improving its robustness relative to a standard negative binomial model. We begin by refining the notation introduced above. Consider a given target-response pair. Suppose that there are cells across the treatment and control groups. (We are abusing notation slightly here; in a low-MOI analysis in which we use the NT cells as our control group, the number of cells in a given pair is not equal to the number of cells in the whole dataset.) Let be the vector of raw gene (or protein, etc.) UMI counts. Next, let be a binary vector indicating whether a given cell is in the treatment group () or the control group (). Finally, let be the cell-specific covariates, where we have included a “1” in each to serve as an intercept term. Finally, let denote the vectors assembled into a matrix, and let denote the matrix that results from concatenating and . Testing for association between a gRNA target and response involves three steps: first, “resampling” the treatment vector times to form “null” treatment vectors ; second, evaluating a negative binomial test statistic on the original treatment vector and null treatment vectors ; and third, comparing the original test statistic to the null test statistics to compute a p-value. We describe each of these steps below.
Resampling the treatment vector
The first step is to “resample” the treatment vector to form “null” treatment vectors . The “null” treatment vectors are constructed such that they are not associated with the response vector , enabling a natural comparison to the original treatment vector. sceptre
provides two methods for resampling the original treatment vector: permutations and conditional resampling. One can set the resampling_mechanism
argument of set_analysis_parameters()
to "permutations"
and "crt"
to select between these options.
The permutation strategy simply involves permuting the vector times. More precisely, among the possible permutations of , we sample permutations randomly (with replacement and uniformly over the set of all permutations), yielding .
The conditional resampling strategy involves resampling conditional on the values of the cell-specific covariates . It involves several steps. First, we model as a function of using a logistic regression GLM:
Here, is an unknown regression coefficient, and is the logit function
We produce an estimate for by regressing onto via a logistic GLM. Next, we estimate the conditional mean of given by evaluating the fitted mean of the GLM , where is the logistic function: Finally, for each , we sample the null treatment vector from the distribution yielding null treatment vectors . The permutation and conditional resampling strategies pose distinct advantages and disadvantages; see Section 2.7 for a discussion.
Computing the test statistics
The second step is to evaluate a test statistic on the original treatment vector and the resampled treatment vectors , enabling us to compare the original treatment vector to the resampled treatment vectors (via their test statistics). The test statistic that we use for this purpose is based on a negative binomial GLM. We propose the following working model for the response UMI count as a function of the cell-specific covariates : Here, is the (unknown) negative binomial size parameter, and is an (unknown) regression coefficient. The probability mass function of the negative binomial distribution with mean and size parameter evaluated at is as follows:
This is the parameterization of the negative binomial distribution in which the variance of given mean is The model (Equation 10.11) corresponds to the null hypothesis of no association between the response and the treatment .
We regress onto via a negative binomial GLM, yielding estimates and for and , respectively. Next, we test the original treatment vector and resampled treatment vectors for inclusion in the fitted GLM by computing a GLM score test on each of these vectors. The GLM score test statistic for the original treatment vector is given by
where and are a matrix and vector, respectively, that depend on the fitted means , response expression vector , and estimated size parameter :
We compute the score test statistic corresponding to the th resampled treatment vector for by replacing with in (Equation 10.12).
Note that the score test statistic is asymptotically equivalent to a Wald statistic. In other words, we instead could compute the test statistic regressing onto the matrix and testing the hypothesis that the coefficient corresponding to is equal to zero via a Wald test. This latter approach is asymptotically equivalent to the score statistic approach outlined above but is much less computationally efficient.
Computing the p-value
The third and final step is to compute a p-value. The standard negative binomial GLM p-value is obtained by evaluating the tail probability of a standard Gaussian distribution at . For example, the left-tailed negative binomial p-value is given by where is the standard Gaussian distribution. We instead propose to compute the p-value by comparing the original test statistic to the null test statistics ; this latter approach is more robust. The “exact” left-, right-, and two-tailed permutation test p-values (which also are valid for a conditional randomization test) are given below.
sceptre
returns the exact permutation test p-value when resampling_approximation
is set to "no_approximation"
in set_analysis_parameters()
.
Unfortunately, computing an exact permutation test p-value can be slow, especially when the number of hypotheses to be tested is large. Thus, sceptre
provides the option of approximating the null distribution of the test statistics via a skew-normal distribution. A skew-normal distribution is a bell-shaped distribution that generalizes a Gaussian distribution by allowing for skew (i.e., asymmetry) about the mode of the distribution. A skew-normal distribution has three paramters: (the mean), (the standard deviation), and (the skew; setting recovers a Gaussian distribution with parameters and ). One can activate the skew-normal functionality of sceptre
by setting resampling_approximation
to "skew_normal"
in set_analysis_parameters()
. In this case sceptre
fits a skew-normal distribution to the null test statistics via a method of moments estimator, yielding estimates and for , , and , respectively. Next, sceptre
computes the p-value by evaluating the tail probability of the fitted skew-normal distribution at the original test statistic . In particular, letting denote the probability density function of the skew-normal distribution with parameters , and , the left-, right-, and two-tailed p-values are given as follows.
Estimating the log-fold change
sceptre
employs a simple and fast strategy for to estimate the log-fold change. Recall that, before computing the test statistics , we fit the negative binomial GLM (Equation 10.11), yielding parameter estimates and for and , respectively. To estimate the log-fold change, we posit the simplified model where is the th fitted value of the negative binomial GLM (Equation 10.11). (This simplification is analogous to the simplification employed in the context of fitting the gRNA mixture model; Section 10.1.3.) Estimating the log fold change reduces to estimating . We treat the terms as i.i.d. draws from some distribution ; this approximation is reasoanble for large . The log-likelihood of the model (Equation 10.13) is
where is a constant in the unknown parameter . Observe that, among the set of observations , we can ignore the observations for which , as these observations do not contribute to the log likelihood. Relabel the observations for which as , where is the number of observations for which . Using this notation, the log-likelihood of (Equation 10.13) can be expressed as
Differentiating the above in and setting to zero yields the MLE equation:
or
The unknown parameter appears on both sides of (Equation 10.14). Thus, we cannot solve for the MLE analytically. However, we can derive an approximate solution for the MLE. Notice that both numerator and denominator of (Equation 10.14) are the mean of i.i.d. random variables. The expectation of the numerator is , and that of the denominator is Thus, by the law of large numbers and the continuous mapping theorem, the MLE for converges in probability to However, the following quantity converges in probability to the same limit:
Thus, we can approximate the MLE with the right-hand side of (Equation 10.15):
This is the simple estimator that sceptre
uses for the log fold change. An advantage to this estimator is that it is quite fast to compute using the fitted GLM (Equation 10.11). Interestingly, this estimator circumvents estimation of the size parameter .
Accelerations
sceptre
leverages several statistical and algorithmic accelerations to considerably improve the speed of the methods described above, including a sparsity-exploiting algorithm for computing GLM score tests and a strategy for recycling computation across permutation tests. We plan to describe these accelerations in a subsequent chapter.
Barry, Timothy, Kathryn Roeder, and Eugene Katsevich. 2024. “Exponential family measurement error models for single-cell CRISPR screens.” Biostatistics, kxae010.
Gasperini, Molly, Andrew J. Hill, José L. McFaline-Figueroa, Beth Martin, Seungsoo Kim, Melissa D. Zhang, Dana Jackson, et al. 2019. “A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens.” Cell 176 (1): 377–90.