Abstract

Motivation

One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback–Leibler information divergence and the Yang–Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R2 measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R2 index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements.

Results

We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.

Availability and implementation

R code for the proposed methods is available at github.com/lburns27/Feature-Selection.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

There have been significant developments in high-throughput genomic and related technologies in the past two decades. Examples include microarray technology to measure mRNA and microRNA expression, methylation arrays to quantify DNA methylation, SNP arrays to measure allele-specific expression and DNA copy number variation, next-generation sequencing technologies such as RNA-Seq, ChIP-Seq, etc. for the measurement of digital gene expression as well as mass spectroscopy and nuclear magnetic resonance spectroscopy for the measurement of protein and metabolite expression. With the wealth of data available from large-scale genomic studies, researchers can now attempt to understand and estimate the effects of specific genomic features on various diseases and characteristics associated with those diseases. A genomic feature may represent a gene that codes for a protein or a non-gene-centric element such as a microRNA, CpG site or a particular genomic region of interest. Our use of the terms feature measurement, feature value and feature is broad in scope and includes gene or protein expression, DNA methylation and copy number variation. One specific area of interest is in studying the relationship between genomic features and time to death or recurrence of some disease, often referred to as ‘survival time’. Let Y and C denote, respectively, the survival and censoring times, and let δ=I(YC) be the indicator of whether the event has been observed. Because of censoring, it is not possible to observe all true survival times, so we let T=min(Y,C) be the observed survival time. In this study, we will deal with p features measured for each of n subjects, where pn. We let Z denote the n × p matrix of features and z represent the p-vector of features for a subject.

These high-dimensional datasets offer some explicit challenges when applying standard statistical methods. One of the most commonly used models in survival analysis is the Cox proportional hazards (PH) regression model (Cox, 1972) which postulates that the risk (or hazard) of death of an individual, given their feature measurement, is simply proportional to their baseline risk in the absence of the feature. It is a multiplicative hazards model that implies constant hazard ratio (HR). In a high-throughput genomic study, for instance, the PH assumption would imply a constant effect of features on survival over the entire period of follow-up in a study. However, this assumption cannot be verified for each feature and it is also unlikely that PH will actually hold for each feature. Moreover, there is empirical evidence indicating that some features may not have a multiplicative effect on the hazard and that non-proportional hazards (NPH) can occur when feature effect increases or decreases over time leading to converging or diverging hazards (CoH or DH); in some studies, features with DH are found more often than features with CoH (Bhattacharjee et al., 2001; Dunkler et al., 2010; Rouam et al., 2011; Xu et al., 2005). It is, therefore, unreasonable to expect the many thousands of features to exhibit PH. In a review of survival analyses published in cancer journals, it was revealed that only 5% of all studies using the Cox PH model actually checked the PH assumption (Altman et al., 1995). Applying the PH model to data that do not support the underlying assumptions may result in inaccurate and sometimes erroneous conclusions. For instance, it could lead to under- or over-estimation of effects for a considerable number of features. Consequently, some features are falsely declared as important in predicting survival and other relevant features are completely missed. Furthermore, if some features exhibit NPH then their HRs, estimated by ignoring their time-dependence, are not comparable to those of features with PH or of features exhibiting different patterns of NPH. Although NPH typically arises from time-dependent effects of features on survival, it could also result from model mis-specification such as from omitting relevant clinical covariates like age at diagnosis and stage of disease. Feature selection methods involving univariate analyses are particularly prone to this problem. Testing the time-varying effect of features, in addition to significance testing, is important in the presence of such covariates and for small sample sizes given that in many genomic studies, typically np (Anderson and Fleming, 1995; Li et al., 1996; Struthers and Kalbfleisch, 1986). It is also important to be aware of the potential effects of dichotomization of features which may have a bearing on the results and interpretation. The goal of this study is to discuss alternative models and methods that can be used to link large-scale genomic data to a survival outcome, with the ultimate goal of feature selection under different types of hazards that may be present.

This article is organized as follows. In Section 2, we survey a variety of well-known survival models, many of which are designed to handle NPH, and provide an overview of marginal screening methods that currently exist in the literature. In Section 3, we use real-life genomic data to motivate the need for these models. Through this analysis, we identify specific models that fit a large proportion of genomic features and allow for NPH. Then in Section 4, we propose several feature selection methods that do not rely on the PH assumption and, instead, are developed from models capable of handling NPH. These methods are evaluated using extensive simulations and publicly available large-scale datasets in cancer genomics in Sections 5 and 6. Finally, Section 7 provides a summary and some concluding remarks.

2 A brief survey of survival models and existing methods

2.1 The PH model and its generalization

The Cox PH model is a semi-parametric regression model proposed by Cox (1972). The hazard rate, λ(t|z), is defined as the instantaneous risk of an event at time t for covariate vector z, with Λ(t|z) representing the cumulative hazard function. The model is given by
(1)
where t >0, λo(t) and Λo(t) are the baseline hazard and cumulative hazards functions, and β is a vector of regression coefficients. Estimation for the coefficient β can be done by maximizing the log partial likelihood l(β)=i=1nδi{ziβlog[jR(ti)exp(zjβ)]}, where ti is the survival time for subject i, δi is the censoring indicator, and R(ti) is the risk set, the individuals who have yet to experience the event at time ti. The HR corresponding to two different feature vectors z and z*, given by λ(t|z)λ(t|z*)=exp[β(zz*)], depends only on the features and not on time. This fundamental assumption is unlikely to hold for all p features in the genomic setting. A semi-parametric generalization of the PH model which allows crossing hazards (CH) is described in Devarajan and Ebrahimi (2011). Its cumulative hazard and survival functions are,
(2)
respectively. This model has a more flexible, general form, but retains the Cox PH model as a special case when γ=0. Since the partial likelihood approach cannot be applied, Devarajan and Ebrahimi (2011) utilize a flexible parametric approach via a cubic B-spline approximation for the baseline hazard to estimate the model parameters. Rouam et al. (2011) considered the special case obtained by setting β=0 and proposed a pseudo-R2 measure for genomic feature selection. In this article, we refer to this special case as the CH model.

2.2 The proportional odds model and its generalization

The proportional odds (PO) model is an alternative to the PH model, and does not require the assumption of PH. It allows some forms of NPH and, instead, assumes that the effect of covariates will proportionately increase or decrease the odds of dying or recurrence at time t (Bennett, 1983). The PO model is given by
(3)
where So(t) and ψ0(t)=1So(t)So(t) are the baseline survival and odds functions, respectively, at time t. The multiplier exp(βz) quantifies the amount of proportionate increase or decrease in the odds associated with covariate z. A semi-parametric generalization of the PO model can be specified as
(4)
where γ=0 results in the PO model. This model allows for both CH and crossing odds (CO). Later, we consider the special case where γ=β and refer to it as the CO model.

2.3 Yang–Prentice model

Yang and Prentice (2005) proposed the Yang–Prentice (YP) model as a generalization of both the PH and PO models. Its survival function is given by
(5)

Note that when γ=β, it becomes the PH model, and when γ=0, it becomes the PO model. Thus, it is a versatile and useful model that encompasses both the PH and PO models, as well as a host of other models, and allows for time-varying hazards.

2.4 Existing methods for feature selection

Few specific methods are currently available in the literature for the purpose of feature selection when the PH assumption is violated. Dunkler et al. (2010) proposed concordance regression (CON), using a generalized concordance probability as a measure of the effect size, to select genes in microarray studies that are related to survival irrespective of the type of hazard. The basic concordance probability is c=P(T1<T0), where T1 is a randomly chosen survival time from group 1 and T0 is a randomly chosen survival time from group 0, and is independent of the PH assumption. This is generalized to continuous data which has the form c=exp(γ)1+exp(γ), where γ are the log odds that the survival time decreases if the gene expression is doubled. Genes are ranked based on the absolute effect size estimate c^+=0.5+|c^0.5|. Using simulated and microarray data, Dunkler et al. (2010) showed that when some genes showed a time-dependent effect on survival, CON produced the least biased and most stable estimates compared to methods based on Cox regression. Rouam et al. (2010, 2011) developed pseudo-R2 measures for genomic feature selection based on the PH and CH models which rely on the partial likelihood of the respective model.

3 Motivating examples

We motivate our problem using five datasets from large-scale studies in cancer utilizing a variety of high-throughput genomic technologies. Datasets 1 and 2 consist of DNA methylation and microRNA expression profiles, respectively, measured on glioblastoma samples while dataset 3 consists of digital gene expression profiles obtained using RNA sequencing from subjects with head and neck squamous cell carcinoma (HNSCC) [The Cancer Genome Atlas (TCGA), http://cancergenome.nih.gov]. Datasets 4 and 5 consist of gene expression data obtained using Affymetrix and HG 1.ST microarrays, respectively, from patients with ovarian and oral cancer (Tothill et al., 2008; Saintigny et al., 2011). Raw data were pre-processed using standard methods for each dataset as described in the Supplementary Section 8.1, and relevant characteristics of these datasets are displayed in Tables 1 and 2. We describe a comprehensive analysis of these datasets using the PH, PO and YP models to test for the goodness-of-fit (GOF) of each model using the methods outlined in Grambsch and Therneau (1994), Martinussen and Scheike (2006) and Yang and Zhao (2012), respectively (see Supplementary Section 8.3, for details on this analysis and GOF tests). All analyses were performed at the genomic feature level, the purpose of which is to identify features that exhibit some form of NPH and to demonstrate the need for alternatives to the PH model. Wherever possible, clinical covariates such as age at diagnosis and stage of cancer were adjusted for in the analysis (datasets 3–5, Table 1). The currently available method for the YP model is only capable of handling a single dichotomized covariate (Yang and Prentice, 2005), and, hence, we utilized dichotomized genomic features and did not adjust for age and stage in analyses reported in Table 2. Thus, the PH, PO and YP model fits in Table 2 utilize only dichotomized features. The q-value approach was employed to estimate the false discovery rate (FDR) and to evaluate the effect of testing multiple hypotheses (Storey and Tibshirani, 2003). In summary, GOF results based on quantitative features for the PH and PO models (Table 1) and dichotomized features for the PH, PO and YP models (Table 2) are presented. All resulted reported are based on p-value 0.05 and cases that satisfy q-value 0.25 at this threshold are indicated in the footnotes. It should be noted that different GOF tests cannot be directly compared with each other but their intended purpose is as stated above.

Table 1

Summary of model fits: quantitative features.

Dataset12345
p945245419 34124 73912 776
n28041622127686
% Censored2619623259
A: lack of PH fit21713075238811810
(%)(2)(29)a(4)(16)a(14)a
B: good PO fit870838517 34922 42411 544
(%)(92)a(85)a(90)a(91)a(90)a
A B170852882513947
A B′47454641368863
Table 2

Summary of model fits: dichotomized features

Dataset12345
A: lack of PH fit9418293433661294
(%)(10)a(18)a(5)(14)a(10)
B: good PO fit888042617 74322 62811 755
(%)(94)(94)a(92)(91)(92)a
C: good YP fit903844118 61422 79611 609
(%)(96)(97)(96)(92)(91)
A B854724712325598
A C8968087930511045
A B′ ( C)87 (82)10 (9)463 (428)1041 (926)696 (537)

The following notations apply to Tables 1 and 2: p = Number of features; n = Number of observations; all results based on p0.05; a: q0.25 (at p =0.05); A and B′ denote, respectively, subsets of features for which PH and PO does not fit; B and C denote, respectively, subsets of features for which PO and YP fits.

The results shown in Tables 1 and 2 motivate the development of methods based on the PO and YP models because of their ability to fit not only a large number of genomic features in general, but specifically features for which the PH assumption is violated. These models provide flexible alternatives for handling various forms of NPH for quantitative as well as dichotomized features. Further details are provided in Supplementary Section 8.3.

4 Proposed methods for feature selection and ranking

In Section 3, we demonstrated the need for alternative methods to the PH model that can handle various types of NPH. We also showed in Tables 1 and 2 that there are many features for which the PH model does not fit but the PO or YP model does; and for some features, we observed that both the PH and PO models do not fit thereby suggesting the need for more complex models such as the YP or CO. In this section, we construct several marginal screening approaches based on the PO, CO and YP models. First, we adopt an information-theoretic approach and develop a test for genomic feature effect under the YP model using Kullback–Leibler (KL) information divergence. This approach includes tests as well as R2 measures for its two important special cases—the PO and PH models. Following this, we propose a unified framework to compute pseudo-R2 measures for a wide range of survival models that allow different types of NPH. It includes the PH, CH, PO and CO models and generalizes prior work (Rouam et al., 2010, 2011). The utility of these methods is demonstrated using simulated and real-life genomic datasets and by comparison to existing methods.

4.1 Methods based on information divergence

4.1.1 Test for genomic feature effect under the YP model

Within the framework of the YP model, we would like to test the null hypothesis H0:Λ(t|z)=Λ0(t) against the alternative H1:Λ(t|z)=0tλ(t|z)dt, where λ(t|z)=dlog(S(t|z))dt from equation (5). These hypotheses can be rewritten as H0:β=0andγ=0vs.H1:β0orγ0, i.e., we are interested in testing whether a particular genomic feature has an effect on survival time according to the YP model. To this end, we utilize KL information divergence and construct a test statistic. Let f0(t) and f(t|z) denote the densities of T under H0 and H1, respectively, and let F0 and F be the corresponding distribution functions. Then KL information divergence is defined as
(6)
and is the directed divergence that measures the discrepancy between F0 and F. Equation (6) quantifies the divergence between the null and alternative hypotheses and can be viewed as a weighted log-likelihood ratio, i.e., I(F0:F)=Ef0(log{f0(t)f(t|z)}). Feature-specific KL information divergence for the YP model has the form
(7)
which we denote by IYP(β,γ;z). Let zij represent the measurement for individual i and feature j, where i=1,,n and j=1,,p. We estimate this measure by replacing β and γ with β^ and γ^, the pseudo maximum likelihood estimates obtained using the approach in Yang and Prentice (2005), and by summing over the n individuals in a study. Thus, we obtain the following estimator of IYPj for feature j,
(8)

From Theorem 1 (see Supplementary Section 9.1 for details on IYP), it can be seen that I^YP is a maximum likelihood estimator and is asymptotically normal with mean IYP. Despite the complexity of the YP model, I^YP has the computational advantage of not requiring an estimate of the baseline. In addition, it combines feature effects quantified by the two model parameters β and γ into a single measure and provides a simpler interpretation of feature effect. However, a practical limitation of the currently available estimation method for this model is that it requires dichotomized features (Yang and Prentice, 2005). Therefore, we propose I^YP as a measure for ranking feature effect. In our applications, we compute I^YP using β^ and γ^ obtained by fitting the YP model to features dichotomized by the median.

4.1.2 Tests for genomic feature effect under the PO and PH models

As outlined earlier, the YP model contains the PO and PH models as special cases. In equation (5), setting γ=β results in the PH model and setting γ = 0 results in the PO model. In each model there is a single parameter β and we are interested in evaluating feature effect under the respective model by testing the null hypothesis H0:β=0 against the alternative H1:β0. The test statistic for feature j under the PO model is obtained by setting γ = 0 in equation (7) and using equation (8) which simplifies to
(9)
where β^ is the modified partial likelihood estimator (Martinussen and Scheike, 2006). From Theorem 2 (see Supplementary Section 9.2 for details on IPO), it can be seen that I^PO is a maximum likelihood estimator and is asymptotically normal with mean IPO. The variance of I^PO can be estimated using the delta method and Taylor series expansion as
(10)
where σ is the variance under H0. Using equation (10), we can reject H0 if
(11)

The test statistic for feature j under the PH model, I^PH, is obtained in the limit as γβ in equation (7) resulting in a test similar to that outlined above for the PO model. However, it would potentially over- or under-estimate feature effects when the PH assumption is violated. In that regard, the proposed methods based on YP and PO models benefit from the simplicity of this approach while addressing the issue of NPH. Furthermore, our prior studies of information divergence measures for the PH and generalized CH (equation (2)) models indicate that the proposed methods will have good small sample performance in terms of power and consistency (Devarajan and Ebrahimi, 2002, 2009).

4.1.3 Feature ranking and selection

Since I^ quantifies the effect of a genomic feature according to the particular model of interest, it can be directly used to serve as a measure for feature ranking. Both I^PO and I^YP can be calculated for each feature in a dataset where a higher I^ indicates a larger effect on survival based on the particular model chosen. Similarly, the test statistic in equation (11) based on the PO model can be used to compute a p-value for each feature and features can be selected by controlling the FDR, at a pre-determined level such as 5% (Benjamini and Hochberg, 1995). A similar approach can be used for feature selection and ranking using I^PH under the PH model. It is straightforward to account for potential confounders (such as age and stage of disease) by utilizing the model parameter estimate from the adjusted model in the computation of I^PO and I^PH and their respective test statistics. Multiple GOF tests at the feature level to determine which measure to use would have implications on the Type I Error and would miss features with small to moderate effects; however, since the PO accommodates forms of NPH while also allowing PH under certain conditions, a practical strategy would be to use I^PO and χ^PO2. An added benefit is that these methods can handle quantitative or dichotomized features as well as account for confounders.

4.2 Measures of explained randomness

We provide a brief overview of existing measures of explained randomness (ER), explained variation (EV) and predictive accuracy (PA) for censored survival data. While most of the measures require the PH assumption to hold, there are a few measures that are relevant to other survival models or do not assume a specific working model. A comprehensive review on this topic is provided in Choodari-Oskooei et al. (2012a,b). Several R2 measures of ER for the PH model have been developed based on the likelihood ratio (LR), logL(β^)L(0), where L(0) and L(β^) denote the partial likelihood for the PH model under the null and alternative hypotheses, respectively, and β^ is the maximum partial likelihood estimator of β. For example, Allison’s index (Allison, 1995) has the form RLR,A2=1exp(2N[logL(β^)L(0)]) where N is the number of subjects. O’Quigley et al. (2005) proposed a modified version of RLR,A2 where N is replaced by k, the number of failures, which is an approximation of the measure in Xu and O’Quigley (1999). It is less sensitive to censoring which is beneficial in our application due to the observed high fraction of censored observations and is given by RLR2=1exp(2k[logL(β^)L(0)]). Another modified version of RLR,A2 was proposed by Nagelkerke (1991) and has the form RLR,N2=RLR,A2Rmax2 where Rmax2=1exp(2NlogL(0)). Kent and O’Quigley (1988), O’Quigley and Flandre (1994) and Royston (2006) defined measures of EV for the PH model while Royston and Sauerbrei (2004) proposed similar measures for PH, PO and lognormal models. Schemper and Henderson (2000) developed a measure of PA, RSH2, based on absolute deviation of prediction under the PH assumption. Graf et al. (1999) proposed a model independent measure of PA, RG2, based on the mean squared error of prediction by weighted averages of time-dependent residuals, and Gerds and Schumacher (2006) extended it using regression models for the censoring distribution. RG2 can be used for both PH and PO models.

4.2.1 R2 measures based on information divergence

We utilize tests for genomic feature effect proposed in Section 4.1.2 to develop R2 measures, that quantify ER, for the PO and PH models. These indices take values on the [0,1] scale and are easy to interpret. From equation (7), feature-specific KL information divergence for the PO model can be obtained by setting γ = 0 and expressed as
(12)
where z represents the feature. Similarly, feature-specific KL information divergence for the PH model can be written as
(13)
which is obtained in the limit as γβ in equation (7). In Sections 4.1.1 and 4.1.2, we derived tests for feature effect in the YP, PO and PH models using the respective I^ by summing over n individuals in a dataset. This approach accounts for the feature measurement of each individual in the study. Here, we propose an alternative, but more robust, approach by integrating over the covariate distribution; in this case, the distribution of feature measurements. A normalizing transformation stabilizes the variance and can be applied to a variety of large-scale genomic data for this purpose. Examples include the logarithmic transformation for mRNA and miRNA gene expression, the log(x+1) transformation for digital gene expression and the logit transformation for DNA methylation (measured using beta values) while copy number variation is expressed as log-ratios. These transformations are described in Smyth (2004), Irizarry et al. (2003), Du et al. (2010), Ritchie et al. (2015) and Wineinger et al. (2008). In addition, if we standardize (the measurements for) each feature to have zero mean and unit standard deviation, then Zzijiid Normal(0, 1), where i=1,,n and j=1,,p. We define I˜ as the expectation of I(F0:F) with respect to the marginal distribution of Z,
(14)
where ϕZ(z) is the standard normal density. I˜ is computed for the PO and PH models using I(F0:F|z) in equations (12) and (13), respectively. An R2-type measure is then defined as RI˜2=1exp(2I˜^) (Soofi et al., 1995). For the PO model, I˜PO is calculated using Taylor series expansion as
(15)
and, hence, RI˜PO2=1exp(2I˜^PO)=1exp(13β^2), where β^ is the modified partial likelihood estimator of β in the PO model obtained using the standardized feature (Martinussen and Scheike, 2006). For the PH model, we calculate I˜ directly using IPH as
(16)

Hence, RI˜PH2=1exp(2I˜^PH), where β^ is the partial likelihood estimator of β in the PH model obtained using the standardized feature (Cox, 1972). Details regarding the derivation of I˜PO and I˜PH are provided in Supplementary Section 9.3. Both RI˜PO2 and RI˜PH2 can be easily seen to fall in the [0,1] range and can be used for feature ranking and selection where genomic features with larger R2 values can be interpreted as exhibiting larger effects on survival under the respective models. These measures have an information-theoretic foundation and are easy to compute; from equations (15) and (16), we observe that both measures are simple functions of the respective model parameter β which contains the required information if features can be normalized to follow the standard normal model.

4.2.2 R2 measures based on the LR

We propose three different R2 measures of ER for the PO model based on LR, logL(β^)L(0), where L(0) and L(β^) denote the modified partial likelihood for the PO model under the null and alternative hypotheses, respectively, and the parameter β is estimated using the approach outlined in Martinussen and Scheike (2006) (see Supplementary Section 9.4 for the modified partial likelihood). These measures parallel corresponding R2 measures for the PH model—RLR,A2,RLR2 and RLR,N2 described in Section 4.2—and can be interpreted as the proportion of ER by the PO model.

4.3 A generalized pseudo-R2 index

We develop pseudo-R2 measures for the PO and CO models, denoted by RPO2 and RCO2, as well as a generalized pseudo-R2 index that embeds such measures for the PH, CH, PO and CO models. The proposed approach utilizes the partial likelihood of the respective models and does not require an estimate of the parameter β in these models. It generalizes the work of Rouam et al. (2010, 2011) where measures for the PH and CH models, which we denote by RPH2 and RCH2, respectively, were proposed. These pseudo-R2 measures can be interpreted in terms of the difference in the values of a genomic feature between subjects experiencing and not experiencing the event of interest, and can be used as tools for feature ranking and selection under a variety of scenarios involving NPH. An obvious disadvantage of RPH2 is that it is based on the PH model. In the CH model, the HR between two individuals with feature values z and z* cross over time (Rouam et al., 2011). Thus, while RCH2 does address the inherent problem with RPH2, it forces CH. On the other hand, the PO model is more generally applicable to the variety of hazard structures observed in genomic data as seen in Tables 1 and 2. The CO model generalizes the PO model in the same manner as the CH model generalizes the PH model and, thus, has a more versatile form. For our purposes, we will use the special case γ=β in equation (4). Hence, the measures RPO2 and RCO2 based on these models offer significant advantages and flexibility that is not afforded by currently available measures.

The generalized pseudo-R2 index can then be expressed as R2=1k(i=1nW^i)2i=1nW^i2, where Wi^=δiw^(ti)(zij=1nYj(ti)zjj=1nYj(ti))j=1nδjw^(tj)Yi(tj)r=1nYr(tj)(zir=1nYr(tj)zrr=1nYr(tj)), zi is the value of a given feature for subject i, k is the number of uncensored failure times and Y¯=j=1nYj is the number of subjects at risk at time ti. This index can be seen as the robust score statistic divided by the number of distinct uncensored failure times, a quantity that falls between 0 and 1 and can be interpreted in terms of the percentage of separability between subjects experiencing and not experiencing the event in relation to the value of a genomic feature. Thus, R2 indices corresponding to the PH, CH, PO and CO models based on model-specific choice of the weight, w(t), determined by the respective partial likelihood can be obtained. The estimated weight, w^(t), for each special case is shown in Supplementary Table 6 and Supplementary Section 9.4, and can be interpreted as the derivative of the log HR for the corresponding model with respect to the predictor η=βz evaluated at η= 0. In Supplementary Table 6, Λ0(t) is estimated by the left-continuous version of Nelson’s estimator and S0(t) is estimated using the Kaplan–Meier estimator. We note that RPH2 and RCH2 are the measures described in Rouam et al. (2010, 2011) while RPO2 and RCO2 are our newly proposed measures that allow for various types of NPH as well as PH. In order to avoid numerical issues, we applied empirical corrections in the computation of RCH2 (which we refer to as RCH˜2 henceforth) and RCO2 (see Supplementary Section 9.4 for details).

5 Application to simulated data

In this section, we evaluate our newly proposed ranking methods, I^PO,I^YP,RI˜PO2,RI˜PH2,RPO2,RCO2,RCH˜2 and RLR2 using simulated datasets under a variety of scenarios and compare their performance to existing methods, RPH2,RCH2 and c^+ (Dunkler et al., 2010; Rouam et al., 2010, 2011) and measures of PA, RG2 and RSH2, described in Section 4.2. The goal of this study is to assess the performance of these methods in selecting features that are truly associated with survival in the high-dimensional setting.

5.1 Simulation schemes

We considered two different simulation schemes to generate artificial survival and genomic datasets based on the approach outlined in Dunkler et al. (2010). In order to account for various types of hazards, survival times Yi,i=1,,n, were generated from each of five different models specified as follows: standard log-normal LN (μ  =  0, σ  =  1); log-logistic LL1 (α1=2,λ1=2,λ2=4) and LL2 (α1=3,α2=4,λ1=1,λ2=2); and Weibull W1 (α1=1,λ1=12) and W2 (α1=3,α2=2,λ1=1,λ2=12), where LL1 and W1 refer to the respective models where the shape parameters are the same but the scale parameters differ, and LL2 and W2 refer to the respective models where both the shape and scale parameters differ. In the LN model, μ and σ are the location and scale parameters, respectively. We use a more informed approach that is broader in scope compared to that of Dunkler et al. (2010), who only considered W1 in their simulations. Here, LN, LL2 and W2 cases are of particular interest because of their ability to simulate CH. To simulate censoring, we drew random samples with uniform follow-up times C from U(0,τ) and defined the observed survival time as T=min(Y,C) with censoring indicator δ=I(T=Y). We chose τ to get censoring proportions of 0, 33 and 67%. For each model, we simulated censored survival times and genomic data for N =200 subjects and p =5000 mock features that are linked to survival time based on the logarithm of the HR (HR(t)), βg(t)=β0log(HR(t)). Genomic data were generated from the standard normal model which covers a variety of features seen in large-scale genomic studies. Following Klein and Moeschberger (2003), log(HR(t)) was calculated based on the respective model of interest. For LN, we used βg(t)=β0(t21) to simulate CH similar to what was done in Dunkler et al. (2010). Then, β0 was chosen so that only the first 400 features were assumed to have an effect on survival time, with 200 having a large effect and 200 having a small effect. In Scheme 1, we adopt a univariate approach where each feature is separately linked to survival, and in Scheme 2, we adopt a multivariate approach that incorporates correlations between features. Under each scenario, 200 datasets were generated and the ranking methods were assessed using mean AUC (area under the receiver operating characteristic curve) and Youden index (Youden, 1950). More details can be found in Dunkler et al. (2010) and Supplementary Section 10.

5.2 Simulation results

Overall, under different data generating models and censoring proportions, the proposed methods outperformed and, in some cases, performed as well as existing methods. In most cases, we noticed some form of improvement. Selected results under different NPH model assumptions are highlighted in Table 3 while detailed simulation results under various scenarios outlined above are provided in Supplementary Sections 10.3, 10.4 and Tables 7–11. Since RLR,A2,RLR2, and RLR,N2 resulted in the same feature rankings, we only report results for RLR2. Our PO model-based methods (all R2 measures and I^PO) performed strongly overall and I^YP performed particularly well for lower censoring (0% and 33%). RCO2 outperformed RCH2 in many cases where NPH was present, across censoring levels and simulations schemes; but it is important to note that our modified version, RCH˜2, performed better than RCH2 in most cases and similarly to it in other cases. Overall, depending on the simulation scheme and type of NPH present, we can identify the benefits of utilizing each of our measures. Most importantly, the proposed feature selection methods are more flexible and generalize existing methods.

Table 3

Comparison of methods (AUC): simulation schemes 1 and 2.

CensoringScheme 1, 33%
Scheme 1, 67%
Scheme 2, 33%
Scheme 2, 67%
MeasureLNLL2W2LNLL2W2LNLL2W2LNLL2W2
I  PO0.890.771.000.930.810.950.870.620.900.890.800.86
I  YP0.960.751.000.830.580.930.840.820.860.690.680.68
c^+0.820.711.000.880.710.980.800.500.900.840.530.87
RI˜PO20.880.781.000.930.810.950.860.640.900.880.800.89
RPO20.890.770.990.930.810.950.870.630.890.890.810.84
RCO20.820.800.870.680.740.890.590.870.880.560.710.87
RCH˜20.970.870.820.870.840.710.880.890.740.810.870.55
RPH20.770.611.000.920.780.970.750.540.890.890.710.87

6 Application to genomic data

In this section, we compare the performance of the proposed methods using several datasets representing a broad spectrum of high-throughput genomic technologies. In addition to datasets 1–5 described in Section 3, we utilized the following datasets. Dataset 6 consists of microarray gene expression profiles measured on the same set of glioblastoma samples used in datasets 1 and 2, while datasets 7 and 8 consist of DNA methylation and copy number variation profiles, respectively, from subjects with HNSCC (see Supplementary Section 8.1). Since we do not have prior knowledge on the number of genomic features significantly associated with survival in real data, this approach will differ from that of the simulations. Here, we will rank features based on each method and compare them using a pre-selected number of top features across methods. Whenever possible, parameter estimates required for computing certain I and R2 measures were obtained from respective models adjusted for potential confounders such as age and stage of disease. These measures include I^PO,RLR2,RI˜PO2,RI˜PH2 and the absolute effect size estimate, c^+, from CON. In addition to quantitative features, dichotomized features were utilized for I^PO and c^+ and no adjustment for confounders was done for analyses involving dichotomized data in order to enable direct comparison of results to I^YP which accommodates only a single dichotomized covariate as explained earlier.

We begin by focusing attention on the application of our proposed I measures—I^PO and I^YP—and c^+, selecting the top 500 features based on each measure. We observe that there are few features commonly selected by all three measures, as evidenced in the Venn diagrams presented in Supplementary Figures 1–3. These results are captured in Table 5 which summarizes the fraction of features that are identified in common by different measures (for quantitative features). This is not surprising given that these measures are based on different model assumptions. More details can be found in Supplementary Material. Using the test statistic for I^PO in equation (11), we computed a p-value for each feature and selected features by controlling the FDR at 5% (Benjamini and Hochberg, 1995); for comparison, we applied the test from CON (Section 2.4) using the same threshold. It should be noted that this approach can also be used to rank features and to compare different measures. Table 4 summarizes the results of this analysis for quantitative and dichotomized features for various datasets. In almost all cases, χ^PO2 selects a much larger number of features compared to CON which does not select many features at all in the first place. This limitation is recognized in Dunkler et al. (2010), and it renders CON as a tool for feature ranking only (based on c^+) rather than feature selection, unlike χ^PO2 which can be used for ranking as well as selecting features based on a pre-specified p-value or FDR threshold. Even though some overlap is observed in features selected by I^PO and c^+ when used purely as ranking measures (as seen in Table 5), significant improvement is observed in the performance of the proposed statistical test based on I^PO,χ^PO2, as evident from Table 4. In addition, χ^PO2 enables better enrichment of selected feature sets as demonstrated in Supplementary Section 11, where we also outline an exploratory approach to evaluate and visualize the combined effect of selected features on survival. Overall, we recommend the use of I^PO,χ^PO2 or I^YP because of their inherent versatility.

Table 4

Number of features selected by χ^PO2 and CON

Data typeQuantitative
Dichotomous
Datasetχ^PO2CONχ^PO2CON
11097993180
2330170
31580719632
447841760570
575814734099323
6310000
757990388521
84640640
Table 5

Fraction (%) of common features selected by different measures.

DatasetsI^YP and I^POI^YP and c^+I^PO and c^+I^YP, I^PO and c^+RPO2 and RCH˜2RCO2 and RCH˜2RPO2 and RCO2RPO2, RCO2, and RCH˜2
11114627745386
2343688324042128
3711485141860
456423189110
510233972738<1
616157514113750
71<162<119910
8616534215770

Next, we examine the differences between genomic features selected by various R2 measures, once again selecting the top 500 features in each case. Supplementary Figure 4 shows the overlaps between PO-based R2 measures—RPO2,RLR2 and RI˜PO2—for the glioblastoma datasets (1, 2 and 6). Since dataset 2 contained a small number of features (454), the top 50 selected features were compared. The results are, however, consistent across these datasets where we observe a large fraction of common features selected by different measures. In all three datasets, a large fraction of features (>80%) are common to all three measures as well as between any pair of measures. Similar observations can be made for the HNSCC (3, 7 and 8), ovarian (4) and oral (5) datasets shown in Supplementary Figure 5.

Venn diagrams corresponding to RPO2,RCO2 and RCH˜2 are shown in Supplementary Figures 6 and 7, and the size of each intersecting feature set is shown in Table 5. We observe that there are only minor overlaps between different measures across all datasets, with the largest overlaps occurring between RPO2 and RCH˜2 for datasets 5 and 8 and between RCO2 and RCH˜2 for datasets 1, 6 and 8. There are also no features common to all three measures in datasets 3, 4, 6, 7 and 8, and only a very small number of common features in the remaining datasets. Overall, these results are not surprising since each of these methods is based on a different model, so we would expect them to select different features; however, the results also indicate that PO and CO model-based measures can identify various forms of NPH. The differences observed in selected features across the three measures indicate the presence of various types of NPH in a given dataset. Thus, the appropriate measure can be chosen based on the goal of feature selection and type of hazards present or expected. We emphasize that our proposed measures provide a more versatile and general framework that allows for inclusion of various types of hazards.

7 Discussion and conclusion

In this article, we proposed unified methods for feature selection in large-scale genomic data in the presence of censored survival outcomes. We illustrated the utility of these methods using real-life studies in cancer genomics; in particular, we demonstrated their ability to handle the challenges due to various forms of NPH. The proposed methods are based on models that relax the PH assumption and are able to identify genomic features with a time-varying effect with increased specificity and sensitivity.

Our methods are flexible and generalize existing methods for feature selection by casting them within unifying frameworks. First, we proposed a general framework to test for the effect of a genomic feature under the YP model using KL information divergence by developing the measure I^YP which quantifies this effect. I^YP includes measures for the PO and PH models—I^PO and I^PH, respectively—as special cases. An advantage of these measures is that they do not require an estimate of the baseline hazard and, instead, are simple functions of respective model parameter estimates. Using these measures, we developed a statistical test (χ^PO2) for genomic feature effect in the PO model where the test statistic or p-value could be used for feature selection; in addition, we developed R2 measures of ER for the PO and PH models that only rely on the respective model parameter estimate or the LR. Finally, we proposed a generalized pseudo-R2 index that embeds measures for the PO, CO, PH and CH models as special cases. These measures do not require an estimate of the model parameter and can be easily interpreted as the percentage of separability between subjects in the event and non-event groups according to feature values. Since RPO2,RCO2 and RCH˜2 select fairly independent subsets of features, each set can be used for further exploration and study. PO model-based measures demonstrated their ability to handle PH and various forms of NPH while RCH˜2 only performed well in detecting certain types of CH; moreover, RCO2 offers a useful alternative and is better suited for handling particular forms of NPH. Thus, each measure provides useful information specific to a particular model of interest while PO model-based R2 measures provide overall flexibility relative to other measures.

All proposed R2 measures can be applied to quantitative (continuous, count or ordered categorical) features and I measures are applicable to quantitative or dichotomized features; however, I^YP is currently usable only on dichotomized features. Use of (appropriately variance stabilized and normalized) quantitative features aids in robust estimation of time-varying effects and interpretability while dichotomized features facilitate visualization of results. However, it is important to be aware of the potential effect of dichotomization—using the median split or any other arbitrary cut-off—on the PH assumption. This is evidenced by the results in Tables 1 and 2 for datasets 1 and 2 where dichotomization has a negative and positive effect, respectively, on GOF. The proposed I measures are useful for feature ranking in general and, in particular, when the distribution of feature measurements or a normalizing transformation for it is unknown. Furthermore, I^PO,RI˜PO2,RI˜PH2 and RLR2 can accommodate potential confounders such as age and stage of disease directly or indirectly in their computation (as illustrated in our examples) as well as a group of pre-selected features depending on the application of interest. While I^YP offers an approach for feature ranking using a flexible survival model for dichotomized features, χ^PO2 provides a method for feature ranking as well as selection for both quantitative and dichotomized features. Typically, χ^PO2 selects more features at the same significance threshold and, thus, provides a more lenient approach for feature discovery relative to standard methods and CON as evidenced in Table 4. Moreover, our results demonstrate that it includes a significant fraction of features identified by CON. This is a desirable property of any feature selection method and it enables appropriate correction for multiple testing through use of FDR based not only on the Benjamini–Hochberg approach which assumes independent hypotheses but also on more stringent methods that account for dependence amongst hypotheses. Such flexibility is not possible with currently available methods for feature selection and ranking especially within a broad framework that includes the YP, PO and PH models.

Our extensive simulation studies demonstrated that there were a variety of scenarios where our proposed measures outperformed currently available methods. An important consideration is that when marginal screening methods are utilized for the purpose of feature ranking and selection, the issue of multiple testing becomes less important in comparison to adjusting for potential confounders when considering different models and measures. The proposed methods demonstrated their ability to correctly select genomic features associated with survival in the presence of different types of time-varying effects in real genomic data, after adjusting for potential confounders and for multiple testing, as well as in simulated data. As genomic technologies continue to advance and as more clinical, demographic and genomic data are generated and stored in repositories such as TCGA, Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) etc., feature selection methods will become increasingly important as we attempt to identify genomic features with a prognostic impact on patient survival. Although we focused on genomic feature selection in this article, it should be noted that the proposed methods are directly applicable to a broad array of high-throughput ‘omics’ studies such as those involving genome-wide association, proteomics, metabolomics, transcriptomics and radiomics. In particular, radiomics is a rapidly developing area where there has been considerable interest in correlating intra-tumor heterogeneity based on textural features to survival endpoints.

Acknowledgements

The authors would like to thank the reviewers for thoughtful comments that helped improve the presentation in this article.

Funding

This work was partially supported by the National Institutes of Health/National Cancer Institute [P30 CA006927 to K.D.]; the National Science Foundation [625061]; and US Army grant [W911NF-16-2-0189] (Temple University’s HPC resources).

Conflict of Interest: none declared.

References

Allison
 
P.D.
(
1995
).
Survival Analysis Using SAS: A Practical Guide
.
SAS Publishing, Cary, North Carolina
.

Altman
 
D.G.
 et al. (
1995
)
Review of survival analysis published in cancer journals
.
Br. J. Cancer
,
72
,
511
518
.

Anderson
 
G.A.
,
Fleming
T.R.
(
1995
)
Model misspecification in proportional hazards regression
.
Biometrika
,
82
,
527
541
.

Benjamini
 
Y.
,
Hochberg
Y.
(
1995
)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
.
J. R. Stat. Soc
.,
57
,
289
300
.

Bennett
 
S.
(
1983
)
Analysis of survival data by the proportional odds model
.
Stat. Med
.,
2
,
273
277
.

Bhattacharjee
 
A.
 et al. (
2001
)
Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses
.
Proc. Natl. Acad. Sci. USA
,
98
,
13790
13795
.

Choodari-Oskooei
 
B.
 et al. (
2012a
)
A simulation study of predictive ability measures in a survival model I: explained variation measures
.
Stat. Med
.,
31
,
2627
2643
.

Choodari-Oskooei
 
B.
 et al. (
2012b
)
A simulation study of predictive ability measures in a survival model II: explained randomness and predictive accuracy
.
Stat. Med
.,
31
,
2644
2659
.

Cox
 
D.R.
(
1972
)
Regression models and life-tables
.
J. R. Stat. Soc
.,
34
,
187
220
.

Devarajan
 
K.
,
Ebrahimi
N.
(
2002
). Goodness-of-fit testing for the Cox proportional hazards model. In: Huber-Carol, C., et al. (eds)
Goodness-of-Fit Tests and Model Validity
. Statistics for Industry and Technology.
Birkhauser
,
Boston
, Massachusetts, pp.
237
254
.

Devarajan
 
K.
,
Ebrahimi
N.
(
2009
)
Testing for covariate effect in the cox proportional hazards regression model
.
Commun. Stat
.,
38
,
2333
2347
.

Devarajan
 
K.
,
Ebrahimi
N.
(
2011
)
A semi-parametric generalization of the Cox proportional hazards regression model: inference and applications
.
Comput. Stat. Data Anal
.,
55
,
667
676
.

Du
 
P.
 et al. (
2010
)
Comparison of beta-value and M-value methods for quantifying methylation levels by microarray analysis
.
BMC Bioinformatics
,
11
,
587
.

Dunkler
 
D.
 et al. (
2010
)
Gene selection in microarray survival studies under possibly non-proportional hazards
.
Bioinformatics
,
26
,
784
790
.

Gerds
 
T.A.
,
Schumacher
M.
(
2006
)
Consistent estimation of the expected Brier score in general survival models with right-censored event times
.
Biometric. J
.,
48
,
1029
1040
.

Graf
 
E.
 et al. (
1999
)
Assessment and comparison of prognostic classification schemes for survival data
.
Stat. Med
.,
18
,
2529
2545
.

Grambsch
 
P.
,
Therneau
T.
(
1994
)
Proportional hazards tests and diagnostics based on weighted residuals
.
Biometrika
,
81
,
515
526
.

Irizarry
 
R.A.
 et al. (
2003
)
Exploration, normalization, and summaries of high density oligonucleotide array probe level data
.
Biostatistics
,
4
,
249
264
.

Kent
 
J.T.
,
O'Quigley
J.
(
1988
)
Measures of dependence for censored survival data
.
Biometrika
,
75
,
525
534
.

Klein
 
J.P.
,
Moeschberger
M.L.
(
2003
).
Survival Analysis: Techniques for Censored and Truncated Data
.
Springer
,
New York
.

Li
 
Y.-H.
 et al. (
1996
)
Effects of model misspecification in estimating covariate effects in survival analysis for small sample sizes
.
Comput. Stat. Data Anal
.,
22
,
177
192
.

Martinussen
 
T.
,
Scheike
T.H.
(
2006
). Dynamic Regression Models for Survival Data. Statistics for Biology and Health, Springer, New York.

Nagelkerke
 
N.
(
1991
)
A note on a general definition of the coefficient of determination
.
Biometrika
,
78
,
691
692
.

O’Quigley
 
J.
,
Flandre
P.
(
1994
)
Predictive capability of proportional hazards regression
.
Proc. Natl. Acad. Sci. USA
,
91
,
2310
2314
.

O’Quigley
 
J.
 et al. (
2005
)
Explained randomness in proportional hazards models
.
Stat. Med
.,
24
,
479
489
.

Ritchie
 
M.E.
 et al. (
2015
)
limma powers differential expression analyses for RNA-sequencing and microarray studies
.
Nucleic Acids Res
.,
43
,
e47
.

Rouam
 
S.
 et al. (
2010
)
Identifying common prognostic factors in genomic cancer studies: a novel index for censored outcomes
.
BMC Bioinformatics
,
11
.

Rouam
 
S.
 et al. (
2011
)
A pseudo-R2 measure for selecting genomic markers with crossing hazards functions
.
BMC Med. Res. Methodol
.,
11
,
28
.

Royston
 
P.
(
2006
)
Explained variation for survival models
.
Stata J
.,
6
,
83
14
.

Royston
 
P.
,
Sauerbrei
W.
(
2004
)
A new measure of prognostic separation in survival data
.
Stat. Med
.,
23
,
723
748
.

Saintigny
 
P.
 et al. (
2011
)
Gene expression profiling predicts the development of oral cancer
.
Cancer Prev. Res. (Phila)
,
4
,
218
229
.

Schemper
 
M.
,
Henderson
R.
(
2000
)
Predictive accuracy and explained variation in Cox regression
.
Biometrics
,
56
,
249
255
.

Smyth
 
G.K.
(
2004
)
Linear models and empirical Bayes methods for assessing differential expression in microarray experiments
.
Stat. Appl. Genet. Mol. Biol
.,
3
,
1
25
.

Soofi
 
E.S.
 et al. (
1995
)
Information distinguishability with application to analysis of failure data
.
J. Am. Stat. Assoc
.,
90
,
657
668
.

Storey
 
J.D.
,
Tibshirani
R.J.
(
2003
)
Statistical significance for genomewide studies
.
Proc. Natl. Acad. Sci. USA
,
100
,
9440
9445
.

Struthers
 
C.A.
,
Kalbfleisch
J.D.
(
1986
)
Misspecified proportional hazard models
.
Biometrika
,
73
,
363
369
.

Tothill
 
R.W.
 et al. (
2008
)
Novel molecular subtypes of serous and endometroid ovarian cancer linked to clinical outcome
.
Clin. Cancer Res
.,
14
,
5198
5208
.

Wineinger
 
N.E.
 et al. (
2008
)
Statistical issues in the analysis of DNA copy number variations
.
Int. J. Comput. Biol. Drug Des
.,
1
,
368
395
.

Xu
 
R.
,
O’Quigley
J.
(
1999
)
A measure of dependence for proportional hazards models
.
J. Nonparametric Stat
.,
12
,
83
107
.

Xu
 
J.
 et al. (
2005
)
Survival analysis of microarray expression data by transformation models
.
Comput. Biol. Chem
.,
29
,
91
94
.

Yang
 
S.
,
Prentice
R.
(
2005
)
Semiparametric analysis of short-term and long-term hazard ratios with two-sample survival data
.
Biometrika
,
92
,
1
17
.

Yang
 
S.
,
Zhao
Y.
(
2012
)
Checking the short-term and long-term hazard ratio model for survival data
.
Scand. J. Stat
.,
39
,
554
567
.

Youden
 
W.J.
(
1950
)
Index for rating diagnostic tests
.
Cancer
,
3
,
32
35
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Alfonso Valencia
Alfonso Valencia
Associate Editor
Search for other works by this author on:

Supplementary data