10  Overview of methods

Tip

Part III of this book describes the statistical and algorithmic methods that underlie sceptre. Users who wish to apply sceptre to analyze their data — but do not need to know about the specifics of the methods contained within the software — safely can skip this part.

Several statistical tasks are involved in the analysis of single-cell CRISPR screen data, including assigning gRNAs to cells, testing for association between gRNAs and the expression of responses (such as genes), and integrating information across gRNAs that target the same site. This chapter provides a high-level overview of the statistical methods that sceptre employs to carry out these analysis tasks.

10.1 Notation

We begin by introducing some notation that we use throughout the chapter. Let N={0,1,2,} denote the set of nonnegative integers. For a given single-cell CRISPR screen dataset, let pN denote the number of responses, rN the number of gRNAs, and nN the number of cells in the dataset. Next, let YNp×n be the matrix of response UMI counts, where responses are in the rows and cells in the columns. A given entry Yij of Y indicates the number of UMIs from response i sequenced in cell j. Similarly, let GNr×n be the matrix of gRNA UMI counts; a given entry Gij of G indicates the number of UMIs from gRNA i sequenced in cell j. Finally, let Z1,,ZnRd be the cell-specific covariates (e.g., response_n_umis, response_n_nonzero, response_p_mito, grna_n_umis, grna_n_nonzero, batch, etc.), where we have included a “1” in each Zj to serve as an intercept term. (We assume that the count-based covariates, such as grna_n_nonzero and grna_n_umis, have been log-transformed.) Finally, let ZRn×d denote the vectors Z1,,Zn assembled into a matrix. Throughout, we use an uppercase letter (e.g., Gij) to refer to a random variable and a lowercase letter (e.g., gij) to refer to the realized value of the random variable.

The first statistical task involved in single-cell CRISPR screen analysis is to assign gRNAs to cells. The gRNA assignment task involves leveraging the gRNA count matrix G (and possibly the covariate matrix Z) to impute the (unobserved) binary matrix X{0,1}r×n of gRNA presences and absences, where a given entry Xij of X is defined to be 1 (resp., 0) if gRNA i is present (resp., absent) in cell j. sceptre implements three gRNA assignment strategies: the thresholding method, maximum method, and the mixture method. All three methods account for background contamination, the phenomenon by which gRNA reads sometimes map to cells that do not contain the corresponding gRNA. (See Assign gRNAs for a real-data example of background contamination.) We describe each method in greater depth here.

10.1.1 Thresholding method

The thresholding method assigns a gRNA to a cell if the UMI count of the gRNA in the cell exceeds some integer threshold cN. Formally, the thresholding method sets Xij to 1 if Gijc and to 0 if Gij<c. The default value for c is 5 (as proposed in Gasperini et al. () and further validated by Barry, Roeder, and Katsevich ()). An important special case involves setting c to 1, which implements the “naive” (but occasionally useful) strategy of assigning any gRNA expressed in a given cell to that cell (and thus ignoring background contamination).

10.1.2 Maximum method

The maximum method assigns the gRNA that exhibits the greatest UMI count in a given cell to that cell. Formally, consider a given cell j. Let i{1,,r} be the index such that Gij=maxi{1,,r}Gij. (If multiple indices satisfy this criterion, select i among these arbitrarily.) We assign gRNA i to cell j, i.e. we set Xij to 1 and Xij to 0 for ii. In carrying out the maximum assignment step, sceptre also flags cells that likely contain multiple gRNAs. Let u[0,1] be a user-specified threshold (default value 0.8). Suppose that the gRNA assigned to cell j constitutes fewer than u of the UMIs in that cell. In other words, suppose that

Giji=1rGij<u.

Then cell j is flagged as containing multiple gRNAs. Similarly, scepre flags cells that likely contain zero gRNAs. Suppose that the sum of the gRNA UMI counts in a given cell j is less than tN (default value 5):

i=1rGij<t.

Then cell j is flagged as containing zero gRNAs. Cells containing zero gRNAs or multiple gRNAs are removed as part of the cellwise QC step. (Note that if multiple gRNAs are tied for most highly expressed in a given cell, and if u>0.5, then that cell is flagged as containing multiple gRNAs, per the above rule.)

10.1.3 Mixture method

The mixture method assigns gRNAs to cells using a latent variable generalized linear model (GLM). Consider a given gRNA i. Let Gj be the UMI count of gRNA i in cell j, and let Xj be the (unobserved) variable indicating whether gRNA i is present (Xj=1) or absent (Xj=0) in cell j. (We suppress the i subscript for notational compactness.) We model the gRNA UMI counts using a latent variable Poisson GLM:

(10.1){Gj|μjPois(μj)log(μj|Xj,Zj)=γXj+βTZjXjBernoulli(π).

Here, μj is the mean expression level of gRNA i in cell j (given the covariates); γR and βRd are the (unknown) regression coefficients; and π[0,1] is the (unknown) probability that gRNA i is present in a given cell. We fit the model using an EM algorithm, producing estimates β^, γ^ and π^ for the model coefficients β, γ, and π. Using these estimates, we can compute the probability Tij:=P(Xij=1) that a given cell j contains gRNA i. (The probabilities Ti1,,Tin are sometimes called “posterior probabilities.”) Finally, we threshold the posterior probabilities at some threshold u (default value 0.8) to assign gRNA i to the cells, i.e. we set Xij=1 if Tij>u and Xij=0 if Tiju. An advantage of the Poisson GLM framework is that it enables us to account for cell-specific covariates that might affect the expression level of the gRNA, such as the number of gRNA UMIs sequenced in a given cell or batch. Formulating the model is fairly straightforward; fitting the model to data in a fast and numerically stable way is more challenging. sceptre leverages a novel variant of the EM algorithm for this purpose, which we describe below.

Approximating the model with a simpler model. We begin by approximating the latent variable model with a simpler latent variable model that is easier to estimate. First, we obtain accurate estimates for the parameter β by exploiting the fact that the gRNA is present in only a small fraction (typically <2%) of cells. Let f(gj;μj) denote the probability mass function of the Poisson distribution with mean μj evaluated at gj, i.e.,

f(gj;μj)=μjgjeμjgj!.Conditioning on the covariates (i.e., treating Xj=xj and Zj=zj as fixed), we can express the log-likelihood of the GLM in as follows:
L(β,γ)=j=1nlog[f(gj;μj)]=j=1nlog[f(gj;exp(γxj+βTzj))]=j:xj=1log[f(gj;exp(γ+βTzj))]few terms+j:xj=0log[f(gj;exp(βTzj))]many termsj=1nlog[f(gj;exp(βTzj))].

In words, the gRNA indicator xj is equal to zero in the large majority of cells, and so we can approximate the GLM in with the GLM that results from excluding xj from the model:

(10.2){Gj|μjPois(μj)log(μj|zj)=βTZj.

We can estimate the GLM using standard maximum likelihood estimation, as all the variables in this GLM are observed. Doing so yields an estimate β^ for β. Assuming that β^β (which holds for large n and small π), we can write

γXj+βZjγXj+β^Zj=γXj+oj,where we have set oj to β^TZj, i.e. the jth fitted value (on the scale of the cannonical parameter) of the GLM . Finally, we propose a simplified model for expression of the gRNA:

(10.3){Gj|μjPois(μj)log(μj|Xj)=γXj+ojXjBernoulli(π).

Here, the ojs are offset terms in the GLM. Notice that the GLM in does not include covariates or an intercept term; rather, all information about the covariates and the baseline expression level of the gRNA has been “encoded” into the offsets o1,,on. In summary we (i) estimated β using the GLM and then (ii) replaced β in with its estimate, yielding the model . The model is a good approximation to in the sense that the maximum likelihood estimates for γ and π in are close to the corresponding estimates for these parameters in when n is large and π is small (as is the case on most single-cell CRISPR screen datasets). We turn our attention to estimating the model , treating the offset terms o1,,on as known and fixed.

EM algorithm. We derive an EM algorithm to estimate the model . Let θ=(γ,π) denote the unknown model parameters. We begin by writing down the complete-data likelihood l of this model, which is the likelihood that would result if X1,,Xn had been observed.

l(θ)=j=1nP(Gj=gj,Xj=xj)=j=1nP(Gj=gj|Xj=xj)P(Xj=xj)=j=1nf(gj;exp(γxj+oj))[πxi(1π)1xi].

We obtain the complete-data log-likelihood L by taking the log of l:

L(θ)=log(l(θ))=j=1nlog[f(gj;exp(γxj+oj)]+j=1nxjlog(π)+(1xj)(1π).

We derive the E and M steps for the EM algorithm in this model.

E step: The E step entails computing the membership probability of each cell (i.e., the probability that each cell contains the gRNA given the current parameter estimates and the observed gRNA counts). Let θ(t)=(γ(t),π(t)) be the parameter estimate for θ at the tth iteration of the algorithm. The jth membership probability at the tth iteration of the algorithm Tj(t) is defined as Tj(t)=P(Xj=1|Gj=gj,θ(t)). Let [μj(k)](t) be the mean gRNA UMI count in the jth cell at the tth iteration of the algorithm that results from setting xj to k{0,1}, i.e.,

[μj(k)](t)=exp(γ(t)k+oj).

Applying Bayes rule, we can express the membership probability Tj(t) as

Tj(t)=P(Xj=1|Gj=gj,θ(t))=P(Gj=gj|Xj=1,θ(t))P(Xj=1|θ(t))k=01P(Gj=gj|Xj=k,θ(t))P(Xj=k|θ(t))=(P(Gj=gj|Xj=0,θ(t))P(Xj=0)P(Gj=gj|Xj=1,θ(t))P(Xj=1)+1)1=(f(gj;[μj(0)](t))(1π(t)))f(gj;[μj(1)](t))π(t))+1)1=(exp(qj(t))+1)1, where we define qj(t):=log(f(gj;[μj(0)](t))(1π(t))f(gj;[μj(1)](t))π(t)).

Plugging in the Poisson probability mass function for f, we can express qj(t) is follows.

(10.4)qj(t):=log(1π(t))log(π(t))+gj(log([μj(0)](t))log([μj(1)](t)))+[μj(1)](t)[μj(0)](t).

This expression for qj(t) is numerically stable and fast to evaluate. In summary the M step consists of computing Tj(t) for all j{1,,n} by computing qj(t) (using ) and then evaluating Tj(t)=exp(qj(t)+1)1.

M step: The M-step involves maximizing the so-called “Q function,” which is the function that results from taking the expectation of the complete-data log-likelihood with respect to the Xjs while conditioning on the Gjs and the current parameter estimates θ(t). Formally, the Q function Q(θ|θ(t)) is defined as Q(θ|θ(t))=EX1,,Xn[L(θ)|G1=g1,,Gn=gnθ(t)]. We can express the Q function as

(10.5)Q(θ|θ(t))=j=1nTj(t)log(π)+j=1n(1Tj(t))log(1π)+j=1nTj(t)log[f(gj;exp[γ+oj])]+j=1n(1Tj(t))log[f(gj;exp(oj)].

The goal of the M step to identify the parameter θ=(π,γ) that maximizes . The first two terms of are a function of π, and the last two terms are a function of γ. Thus, we can optimize these sets of terms separately.

To find the maximizer in π, we differentiate the first two terms of with respect to π, yielding

(10.6)j=1nTj(t)πj=1n(1Tj(t))1π.

Setting to zero and solving for π produces the maximizer π(t+1):

(10.7)π(t+1)=(1/n)j=1nTj(t).

We turn our attention to the last two terms of . The final term is not a function of γ and thus can be ignored. We rewrite the penultimate term as follows:

(10.8)j=1nTj(t)log[f(gj;exp[γ+oj])]=j=1nTj(t)log[exp(γ+oj)gjexp(exp[γ+oj])gj!]=j=1nTj(t)[gj(γ+oj)exp(γ+oj)log(gj!)].

The derivative of with respect to γ is

(10.9)j=1nTj(t)gjTj(t)exp(γ+oj)=j=1nTj(t)gjexp(γ)j=1nTj(t)exp(oj).

Finally, setting to zero and solving for γ yields the maximizer

(10.10)γ(t+1)=log(j=1nTj(t)yjj=1nTj(t)eoj).

In summary the M step entails computing the updated parameter estimates π(t+1) and γ(t+1) using the formulas and , respectively.

Convergence: The incomplete-data likelihood (which we obtain by integrating the complete-data likelihood with respect to the Xjs) is

lincomplete(θ)=j=1nf(gj;γ+oj)π+f(gj;oj)(1π).

The incomplete-data log-likelihood is

Lincomplete(θ)=log(lincomplete(θ))=j=1nlog[f(gj;γ+oj)π+f(gj;oj)(1π)].

Note that the incomplete-data log-likelihood — in contrast to the complete-data log-likelihood — is computable (albeit hard to optimize directly). We iterate between E and M steps until Lincomplete converges. We declare that the sequence of estimates θ(0),θ(1),θ(2), has converged when

|Lincomplete(θ(t+1))Lincomplete(θ(t))|min{|Lincomplete(θ(t+1))|,|Lincomplete(θ(t))|}<ϵ

for some small ϵ>0 (in the code, ϵ=0.5104). We set the final parameter estimates π^ and γ^ to π(t+1) and γ(t+1). Additionally, we use the final membership probabilities T1(t+1),,Tn(t+1) to assign the gRNA to cells.

As is standard, we run the EM algorithm several times over random starting estimates for π and γ to improve chances of converging to the global maximizer of the incomplete-data likelihood. We use the run whose estimates yield the greatest incomplete-data log-likelihood to assign gRNAs to cells.

10.2 Definition of “treatment group” and “control group”

To test for association between a given gRNA (or gRNA target) and a response, sceptre divides the cells into two groups: the “treatment group” and the “control group.” sceptre tests for differential expression of the response across these two groups of cells, yielding a p-value for the test of association between the gRNA and response. Defining the cells that constitute the treatment group and control group is somewhat subtle and depends on the analysis parameters of gRNA integration strategy (i.e., union, singleton, or Bonferroni), control group (i.e., complement set or NT cells), and analysis type (i.e., discovery analysis or calibration check). (See Set analysis parameters for a discussion of these parameters.) This section aims to carefully define “treatment group” and “control group,” in particular as a function of the aforementioned analysis parameters.

We first introduce some notation. Let S={1,,n} denote the entire set of cells. Let X{0,1}r×n denote the matrix of imputed gRNA presences and absences (as determined in the gRNA assignment step). Index the gRNAs by {1,,r}. For i{1,,r}, let Gi denote the set of cells that gRNA i has infected, i.e. Gi:={j:Xij=1}. Let N{1,,r} denote the set of non-targeting gRNAs, and let T={1,,r}N denote the set of targeting gRNAs. (Typically, T consists of both discovery gRNAs and positive control gRNAs.) Note that in low MOI, after applying quality control, each cell contains a single gRNA, implying that the sets G1,,Gr are disjoint, i.e.

i=1rGi=.

This is not the case in high-MOI; in that setting the intersection of Gi1 and Gi2 for i1i2 is nonempty (in general). We proceed by taking cases on gRNA integration strategy, first considering the union integration strategy and then considering the singleton and Bonferroni integration strategies.

Union gRNA integration strategy. We first consider the “union” gRNA integration strategy. The union integration strategy involves combining multiple gRNAs into a single “combined” gRNA and then testing this combined gRNA against responses as if it were a singleton gRNA. Suppose that we are running a discovery analysis, and let IT be a given set of gRNAs that targets the same site. The table below defines the treatment group, the complement set control group, and the NT cells control group relative to I.

Union gRNA integration strategy, discovery analysis.
Treatment group Complement set control group NT cells control group
iIGi SiIGi iNGi

Next, suppose that we are running a calibration check analysis (again using the union gRNA integration strategy). Let LN be a set of non-targeting gRNAs that has been randomly combined into a “negative control” gRNA target. (By default, |L|=|I|, i.e., the discovery target and negative control target contain the same number of gRNAs.) The treatment group, complement set control group, and NT cells control group are defined relative to L as follows.

Union gRNA integration strategy, calibration check.
Treatment group Complement set control group NT cells control group
lLGl SlLGl l(NL)Gl

Singleton and Bonferroni gRNA integration strategies. Next, we consider the “singleton” and “Bonferroni” gRNA integration strategies. Both strategies involve testing singleton gRNAs for association against responses. Suppose that we are running a discovery analysis, and let iT be a given targeting gRNA. The table below defines the treatment group, complement set control group, and NT cells control group relative to i.

Singleton/Bonferroni gRNA integration strategy, discovery analysis.
Treatment group Complement set control group NT cells control group
Gi SGi iNGi

Finally, suppose that we are running a calibration check using the singleton or Bonferroni gRNA integration strategy. Let lN be a non-targeting gRNA. The treatment group, complement set control group, and NT cells control group are defined relative to l as follows.

Singleton/Bonferroni gRNA integration strategy, calibration check.
Treatment group Complement set control group NT cells control group
Gl SGl i(N{l})Gi

Note that the Bonferroni gRNA integration strategy combines p-values across gRNAs that target the same site into a single “target-wide” p-value via a Bonferroni correction. Note also that the complement set is available as a control group in both low- and high-MOI settings, while the NT cells are available as a control group only in the low-MOI setting.

10.3 Differential expression testing

We next explicate the differential expression testing paradigm of sceptre. sceptre is based on the combination of a negative binomial model and a resampling framework for differential expression, improving its robustness relative to a standard negative binomial model. We begin by refining the notation introduced above. Consider a given target-response pair. Suppose that there are n cells across the treatment and control groups. (We are abusing notation slightly here; in a low-MOI analysis in which we use the NT cells as our control group, the number of cells in a given pair is not equal to the number of cells in the whole dataset.) Let Y=[Y1,,Yn]T be the vector of raw gene (or protein, etc.) UMI counts. Next, let X=[X1,,Xn]T be a binary vector indicating whether a given cell j is in the treatment group (Xj=1) or the control group (Xj=0). Finally, let Z1,,ZnRd be the cell-specific covariates, where we have included a “1” in each Zj to serve as an intercept term. Finally, let ZRn×d denote the vectors Z1,,Zn assembled into a matrix, and let [X,Z]Rn×(d+1) denote the matrix that results from concatenating X and Z. Testing for association between a gRNA target and response involves three steps: first, “resampling” the treatment vector B times to form “null” treatment vectors X~1,X~B; second, evaluating a negative binomial test statistic on the original treatment vector X and null treatment vectors X~1,,X~B; and third, comparing the original test statistic to the null test statistics to compute a p-value. We describe each of these steps below.

10.3.1 Resampling the treatment vector

The first step is to “resample” the treatment vector X to form B “null” treatment vectors X~1,,X~B. The “null” treatment vectors are constructed such that they are not associated with the response vector Y, enabling a natural comparison to the original treatment vector. sceptre provides two methods for resampling the original treatment vector: permutations and conditional resampling. One can set the resampling_mechanism argument of set_analysis_parameters() to "permutations" and "crt" to select between these options.

The permutation strategy simply involves permuting the X vector B times. More precisely, among the n! possible permutations of X, we sample B permutations randomly (with replacement and uniformly over the set of all permutations), yielding X~1,,X~B.

The conditional resampling strategy involves resampling X~1,,X~B conditional on the values of the cell-specific covariates Z1,,Zn. It involves several steps. First, we model Xj as a function of Zj using a logistic regression GLM:

XjBern(μi);logit(μi)=δTZj.

Here, δTRd is an unknown regression coefficient, and logit is the logit function logit(μ)=log(μ1μ).

We produce an estimate δ^ for δ by regressing X onto Z via a logistic GLM. Next, we estimate the conditional mean μ^j of Xj given Zj by evaluating the fitted mean of the GLM μ^j=σ(δ^TZj), where σ is the logistic function: σ(η)=11+eη. Finally, for each k{1,,B}, we sample the null treatment vector X~k from the distribution [Bern(μ^1),Bern(μ^2),,Bern(μ^n)]T, yielding null treatment vectors X~1,,X~B. The permutation and conditional resampling strategies pose distinct advantages and disadvantages; see for a discussion.

10.3.2 Computing the test statistics

The second step is to evaluate a test statistic on the original treatment vector X and the resampled treatment vectors X~1,,X~B, enabling us to compare the original treatment vector to the resampled treatment vectors (via their test statistics). The test statistic that we use for this purpose is based on a negative binomial GLM. We propose the following working model for the response UMI count Yj as a function of the cell-specific covariates Zj: (10.11)YjNBθ(μj);log(μj)=τTZj. Here, θ is the (unknown) negative binomial size parameter, and τTRd is an (unknown) regression coefficient. The probability mass function fθ(yj;μj) of the negative binomial distribution with mean μj and size parameter θ evaluated at yj is as follows: fθ(yj;μj)=(yj+θ1yj)(μjμj+θ)yj(θμj+θ)θ.

This is the parameterization of the negative binomial distribution in which the variance V(Yi|μi) of Yi given mean μi is V(Yi|μi)=μi+μi2/θ. The model () corresponds to the null hypothesis of no association between the response Y and the treatment X.

We regress Y onto Z via a negative binomial GLM, yielding estimates τ^T and θ^ for τ and θ, respectively. Next, we test the original treatment vector X and resampled treatment vectors X~1,,X~B for inclusion in the fitted GLM by computing a GLM score test on each of these vectors. The GLM score test statistic Torig for the original treatment vector X is given by (10.12)Torig=XTWM(Yμ^)XTWXXTWZ(ZTWZ)1ZTWX,

where W and M(Yμ^) are a matrix and vector, respectively, that depend on the fitted means μ^, response expression vector Y, and estimated size parameter θ^:

μ^=[exp(τ^Tz1),,exp(τ^Tzn)]T;W=diag{μ^11+μ^1/θ^,,μ^n1+μ^n/θ^};M(Yμ^)=[Y1μ^11,,Ynμ^n1]T.

We compute the score test statistic Tk corresponding to the kth resampled treatment vector for k{1,,B} by replacing X with X~k in ().

Note that the score test statistic is asymptotically equivalent to a Wald statistic. In other words, we instead could compute the test statistic regressing Y onto the matrix [X,Z] and testing the hypothesis that the coefficient corresponding to X is equal to zero via a Wald test. This latter approach is asymptotically equivalent to the score statistic approach outlined above but is much less computationally efficient.

10.3.3 Computing the p-value

The third and final step is to compute a p-value. The standard negative binomial GLM p-value pNB is obtained by evaluating the tail probability of a standard Gaussian distribution at Torig. For example, the left-tailed negative binomial p-value is given by pNB=F(Torig), where F is the standard Gaussian distribution. We instead propose to compute the p-value by comparing the original test statistic Torig to the null test statistics T1,,TB; this latter approach is more robust. The “exact” left-, right-, and two-tailed permutation test p-values (which also are valid for a conditional randomization test) are given below.

pleft=1B+1(1+i=1BI(TorigTi)),pright=1B+1(1+i=1BI(TorigTi)),pboth=2min{pleft,pright}. sceptre returns the exact permutation test p-value when resampling_approximation is set to "no_approximation" in set_analysis_parameters().

Unfortunately, computing an exact permutation test p-value can be slow, especially when the number of hypotheses to be tested is large. Thus, sceptre provides the option of approximating the null distribution of the test statistics via a skew-normal distribution. A skew-normal distribution is a bell-shaped distribution that generalizes a Gaussian distribution by allowing for skew (i.e., asymmetry) about the mode of the distribution. A skew-normal distribution has three paramters: μ (the mean), σ (the standard deviation), and α (the skew; setting α=0 recovers a Gaussian distribution with parameters μ and σ). One can activate the skew-normal functionality of sceptre by setting resampling_approximation to "skew_normal" in set_analysis_parameters(). In this case sceptre fits a skew-normal distribution to the null test statistics T1,,TB via a method of moments estimator, yielding estimates μ^,σ^ and α^ for μ, σ, and α, respectively. Next, sceptre computes the p-value by evaluating the tail probability of the fitted skew-normal distribution at the original test statistic zorig. In particular, letting fμ^,σ^,α^ denote the probability density function of the skew-normal distribution with parameters μ^,σ^, and α^, the left-, right-, and two-tailed p-values are given as follows.

pleft=Torigfμ^,σ^,α^(x)dx,pright=Torigfμ^,σ^,α^(x)dx,pboth=2min{pright,pleft}.

10.4 Estimating the log-fold change

sceptre employs a simple and fast strategy for to estimate the log-fold change. Recall that, before computing the test statistics Torig,T1,,TB, we fit the negative binomial GLM (), yielding parameter estimates τ^ and θ^ for τ and θ, respectively. To estimate the log-fold change, we posit the simplified model (10.13)YjNBθ(μj);log(μj)=ξXj+oj, where oj=τ^TZj is the jth fitted value of the negative binomial GLM (). (This simplification is analogous to the simplification employed in the context of fitting the gRNA mixture model; .) Estimating the log fold change reduces to estimating ξ. We treat the terms o1,,on as i.i.d. draws from some distribution P; this approximation is reasoanble for large n. The log-likelihood of the model () is

L(ξ)=j:xj=0nlog[fθ(yj;exp[oj])]+j:xj=1nlog[fθ(yj;exp[ξ+oj])]=C+j:xj=1nlog[fθ(yj;exp[ξ+oj])],

where C is a constant in the unknown parameter ξ. Observe that, among the set of observations {(Y1,X1),,(Yn,Xn)}, we can ignore the observations for which Xi=0, as these observations do not contribute to the log likelihood. Relabel the observations for which Xi=1 as {(Y1,X1),,(Ys,Xs)}, where s=i=1nXi is the number of observations for which Xi=1. Using this notation, the log-likelihood of () can be expressed as

L(ξ)=ξj=1syjj=1syjlog(θ+exp[ξ+oj])θj=1slog(θ+exp[ξ+oj])=ξj=1syjj=1s(yj+θ)log(θ+exp[ξ+oj]).

Differentiating the above in ξ and setting to zero yields the MLE equation: L(ξ)=exp(ξ)j=1s(yj+θ)exp(oj)θ+exp(ξ+oj)=j=1syj,

or

(10.14)ξ=log{(1/n)j=1syj(1/n)j=1s(yj+θ)exp(oj)/[θ+exp(ξ+oj)]}.

The unknown parameter ξ appears on both sides of (). Thus, we cannot solve for the MLE analytically. However, we can derive an approximate solution for the MLE. Notice that both numerator and denominator of () are the mean of i.i.d. random variables. The expectation of the numerator is E[Yj], and that of the denominator is E[exp(oj)]. Thus, by the law of large numbers and the continuous mapping theorem, the MLE ξ^ for ξ converges in probability to ξ^Plog(E[Yj]E[exp(oj)]). However, the following quantity converges in probability to the same limit: (10.15)log((1/n)j=1sYj(1/n)j=1sexp[oj])Plog(E[Yj]E[exp(oj)]).

Thus, we can approximate the MLE ξ^ with the right-hand side of ():

ξ^log(j=1sYjj=1sexp[oj]).

This is the simple estimator that sceptre uses for the log fold change. An advantage to this estimator is that it is quite fast to compute using the fitted GLM (). Interestingly, this estimator circumvents estimation of the size parameter θ.

10.5 Integrating information across gRNAs

sceptre provides three strategies for integrating information across gRNAs that target the same site: "singleton", "union", "bonferroni". (One can select between these strategies by specifying the grna_integration_strategy argument in the set_analysis_parameters() function.) The “singleton” and “union” strategies are straightforward; see and for more information. The Bonferroni strategy is slightly more complicated. Consider a given target-response pair. Suppose that k gRNAs gRNA1gRNAk target this target. Moreover, suppose that, upon testing for association between each of these gRNAs and the response, we obtain gRNA-wise p-values p1,,pk. The Bonferroni p-value is defined as follows:

p=kmin{p1,,pk}. The Bonferroni p-value is valid in the following sense: if p1,,pk are valid (i.e., uniformly distributed) p-values, then p too is a valid (i.e., uniformly distributed) p-value. (The gRNA-wise p-values p1,,pk can be arbitrarily dependent.) Below, we state this result in a slightly more precise way and prove its correctness. “Superuniformity” is more general than uniformity and is sufficient for p-value validity. This proof is standard; see, e.g., the Bonferroni correction Wikipedia page.

Theorem: Suppose that p1,,pk are superuniform, i.e. P(piα)α for all i{1,,k} and all α[0,1]. Then p=kmin{p1,,pk} also is superuniform.

Proof:

P(kmin{p1,,pk}α)=P(min{p1,,pk}α/k)=P(p1α/k or  or pkα/k)i=1kP(piα/k)i=1kα/k=α.

The penultimate inequality follows from the union bound, while the final inequality follows from the superuniformity of pi.

10.6 Accelerations

sceptre leverages several statistical and algorithmic accelerations to considerably improve the speed of the methods described above, including a sparsity-exploiting algorithm for computing GLM score tests and a strategy for recycling computation across permutation tests. We plan to describe these accelerations in a subsequent chapter.


  1. We fit the negative binomial GLM by fitting a Poisson GLM to the data, taking as the negative binomial GLM coefficients the Poisson GLM coefficients, and then estimating the negative binomial size parameter using the residuals of the fitted Poisson GLM. This procedure is faster than standard procedures for fitting a negative binomial GLM. A more detailed description of this procedure will appear in a subsequent chapter.↩︎