-
PDF
- Split View
-
Views
-
Cite
Cite
Lauren Spirko-Burns, Karthik Devarajan, Unified methods for feature selection in large-scale genomic studies with censored survival outcomes, Bioinformatics, Volume 36, Issue 11, June 2020, Pages 3409–3417, https://doi.org/10.1093/bioinformatics/btaa161
Close - Share Icon Share
Abstract
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback–Leibler information divergence and the Yang–Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R2 measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R2 index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements.
We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
R code for the proposed methods is available at github.com/lburns27/Feature-Selection.
Supplementary data are available at Bioinformatics online.
1 Introduction
There have been significant developments in high-throughput genomic and related technologies in the past two decades. Examples include microarray technology to measure mRNA and microRNA expression, methylation arrays to quantify DNA methylation, SNP arrays to measure allele-specific expression and DNA copy number variation, next-generation sequencing technologies such as RNA-Seq, ChIP-Seq, etc. for the measurement of digital gene expression as well as mass spectroscopy and nuclear magnetic resonance spectroscopy for the measurement of protein and metabolite expression. With the wealth of data available from large-scale genomic studies, researchers can now attempt to understand and estimate the effects of specific genomic features on various diseases and characteristics associated with those diseases. A genomic feature may represent a gene that codes for a protein or a non-gene-centric element such as a microRNA, CpG site or a particular genomic region of interest. Our use of the terms feature measurement, feature value and feature is broad in scope and includes gene or protein expression, DNA methylation and copy number variation. One specific area of interest is in studying the relationship between genomic features and time to death or recurrence of some disease, often referred to as ‘survival time’. Let Y and C denote, respectively, the survival and censoring times, and let be the indicator of whether the event has been observed. Because of censoring, it is not possible to observe all true survival times, so we let be the observed survival time. In this study, we will deal with p features measured for each of n subjects, where . We let Z denote the n × p matrix of features and z represent the p-vector of features for a subject.
These high-dimensional datasets offer some explicit challenges when applying standard statistical methods. One of the most commonly used models in survival analysis is the Cox proportional hazards (PH) regression model (Cox, 1972) which postulates that the risk (or hazard) of death of an individual, given their feature measurement, is simply proportional to their baseline risk in the absence of the feature. It is a multiplicative hazards model that implies constant hazard ratio (HR). In a high-throughput genomic study, for instance, the PH assumption would imply a constant effect of features on survival over the entire period of follow-up in a study. However, this assumption cannot be verified for each feature and it is also unlikely that PH will actually hold for each feature. Moreover, there is empirical evidence indicating that some features may not have a multiplicative effect on the hazard and that non-proportional hazards (NPH) can occur when feature effect increases or decreases over time leading to converging or diverging hazards (CoH or DH); in some studies, features with DH are found more often than features with CoH (Bhattacharjee et al., 2001; Dunkler et al., 2010; Rouam et al., 2011; Xu et al., 2005). It is, therefore, unreasonable to expect the many thousands of features to exhibit PH. In a review of survival analyses published in cancer journals, it was revealed that only 5% of all studies using the Cox PH model actually checked the PH assumption (Altman et al., 1995). Applying the PH model to data that do not support the underlying assumptions may result in inaccurate and sometimes erroneous conclusions. For instance, it could lead to under- or over-estimation of effects for a considerable number of features. Consequently, some features are falsely declared as important in predicting survival and other relevant features are completely missed. Furthermore, if some features exhibit NPH then their HRs, estimated by ignoring their time-dependence, are not comparable to those of features with PH or of features exhibiting different patterns of NPH. Although NPH typically arises from time-dependent effects of features on survival, it could also result from model mis-specification such as from omitting relevant clinical covariates like age at diagnosis and stage of disease. Feature selection methods involving univariate analyses are particularly prone to this problem. Testing the time-varying effect of features, in addition to significance testing, is important in the presence of such covariates and for small sample sizes given that in many genomic studies, typically (Anderson and Fleming, 1995; Li et al., 1996; Struthers and Kalbfleisch, 1986). It is also important to be aware of the potential effects of dichotomization of features which may have a bearing on the results and interpretation. The goal of this study is to discuss alternative models and methods that can be used to link large-scale genomic data to a survival outcome, with the ultimate goal of feature selection under different types of hazards that may be present.
This article is organized as follows. In Section 2, we survey a variety of well-known survival models, many of which are designed to handle NPH, and provide an overview of marginal screening methods that currently exist in the literature. In Section 3, we use real-life genomic data to motivate the need for these models. Through this analysis, we identify specific models that fit a large proportion of genomic features and allow for NPH. Then in Section 4, we propose several feature selection methods that do not rely on the PH assumption and, instead, are developed from models capable of handling NPH. These methods are evaluated using extensive simulations and publicly available large-scale datasets in cancer genomics in Sections 5 and 6. Finally, Section 7 provides a summary and some concluding remarks.
2 A brief survey of survival models and existing methods
2.1 The PH model and its generalization
2.2 The proportional odds model and its generalization
2.3 Yang–Prentice model
Note that when , it becomes the PH model, and when , it becomes the PO model. Thus, it is a versatile and useful model that encompasses both the PH and PO models, as well as a host of other models, and allows for time-varying hazards.
2.4 Existing methods for feature selection
Few specific methods are currently available in the literature for the purpose of feature selection when the PH assumption is violated. Dunkler et al. (2010) proposed concordance regression (CON), using a generalized concordance probability as a measure of the effect size, to select genes in microarray studies that are related to survival irrespective of the type of hazard. The basic concordance probability is , where T1 is a randomly chosen survival time from group 1 and T0 is a randomly chosen survival time from group 0, and is independent of the PH assumption. This is generalized to continuous data which has the form , where γ are the log odds that the survival time decreases if the gene expression is doubled. Genes are ranked based on the absolute effect size estimate . Using simulated and microarray data, Dunkler et al. (2010) showed that when some genes showed a time-dependent effect on survival, CON produced the least biased and most stable estimates compared to methods based on Cox regression. Rouam et al. (2010, 2011) developed pseudo-R2 measures for genomic feature selection based on the PH and CH models which rely on the partial likelihood of the respective model.
3 Motivating examples
We motivate our problem using five datasets from large-scale studies in cancer utilizing a variety of high-throughput genomic technologies. Datasets 1 and 2 consist of DNA methylation and microRNA expression profiles, respectively, measured on glioblastoma samples while dataset 3 consists of digital gene expression profiles obtained using RNA sequencing from subjects with head and neck squamous cell carcinoma (HNSCC) [The Cancer Genome Atlas (TCGA), http://cancergenome.nih.gov]. Datasets 4 and 5 consist of gene expression data obtained using Affymetrix and HG 1.ST microarrays, respectively, from patients with ovarian and oral cancer (Tothill et al., 2008; Saintigny et al., 2011). Raw data were pre-processed using standard methods for each dataset as described in the Supplementary Section 8.1, and relevant characteristics of these datasets are displayed in Tables 1 and 2. We describe a comprehensive analysis of these datasets using the PH, PO and YP models to test for the goodness-of-fit (GOF) of each model using the methods outlined in Grambsch and Therneau (1994), Martinussen and Scheike (2006) and Yang and Zhao (2012), respectively (see Supplementary Section 8.3, for details on this analysis and GOF tests). All analyses were performed at the genomic feature level, the purpose of which is to identify features that exhibit some form of NPH and to demonstrate the need for alternatives to the PH model. Wherever possible, clinical covariates such as age at diagnosis and stage of cancer were adjusted for in the analysis (datasets 3–5, Table 1). The currently available method for the YP model is only capable of handling a single dichotomized covariate (Yang and Prentice, 2005), and, hence, we utilized dichotomized genomic features and did not adjust for age and stage in analyses reported in Table 2. Thus, the PH, PO and YP model fits in Table 2 utilize only dichotomized features. The q-value approach was employed to estimate the false discovery rate (FDR) and to evaluate the effect of testing multiple hypotheses (Storey and Tibshirani, 2003). In summary, GOF results based on quantitative features for the PH and PO models (Table 1) and dichotomized features for the PH, PO and YP models (Table 2) are presented. All resulted reported are based on p-value and cases that satisfy q-value at this threshold are indicated in the footnotes. It should be noted that different GOF tests cannot be directly compared with each other but their intended purpose is as stated above.
| Dataset . | 1 . | 2 . | 3 . | 4 . | 5 . |
|---|---|---|---|---|---|
| p | 9452 | 454 | 19 341 | 24 739 | 12 776 |
| n | 280 | 416 | 221 | 276 | 86 |
| % Censored | 26 | 19 | 62 | 32 | 59 |
| A: lack of PH fit | 217 | 130 | 752 | 3881 | 1810 |
| (%) | (2) | (29)a | (4) | (16)a | (14)a |
| B: good PO fit | 8708 | 385 | 17 349 | 22 424 | 11 544 |
| (%) | (92)a | (85)a | (90)a | (91)a | (90)a |
| A B | 170 | 85 | 288 | 2513 | 947 |
| A B′ | 47 | 45 | 464 | 1368 | 863 |
| Dataset | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| p | 9452 | 454 | 19 341 | 24 739 | 12 776 |
| n | 280 | 416 | 221 | 276 | 86 |
| % Censored | 26 | 19 | 62 | 32 | 59 |
| A: lack of PH fit | 217 | 130 | 752 | 3881 | 1810 |
| (%) | (2) | (29)a | (4) | (16)a | (14)a |
| B: good PO fit | 8708 | 385 | 17 349 | 22 424 | 11 544 |
| (%) | (92)a | (85)a | (90)a | (91)a | (90)a |
| A | 170 | 85 | 288 | 2513 | 947 |
| A | 47 | 45 | 464 | 1368 | 863 |
| Dataset . | 1 . | 2 . | 3 . | 4 . | 5 . |
|---|---|---|---|---|---|
| A: lack of PH fit | 941 | 82 | 934 | 3366 | 1294 |
| (%) | (10)a | (18)a | (5) | (14)a | (10) |
| B: good PO fit | 8880 | 426 | 17 743 | 22 628 | 11 755 |
| (%) | (94) | (94)a | (92) | (91) | (92)a |
| C: good YP fit | 9038 | 441 | 18 614 | 22 796 | 11 609 |
| (%) | (96) | (97) | (96) | (92) | (91) |
| A B | 854 | 72 | 471 | 2325 | 598 |
| A C | 896 | 80 | 879 | 3051 | 1045 |
| A B′ ( C) | 87 (82) | 10 (9) | 463 (428) | 1041 (926) | 696 (537) |
| Dataset | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A: lack of PH fit | 941 | 82 | 934 | 3366 | 1294 |
| (%) | (10)a | (18)a | (5) | (14)a | (10) |
| B: good PO fit | 8880 | 426 | 17 743 | 22 628 | 11 755 |
| (%) | (94) | (94)a | (92) | (91) | (92)a |
| C: good YP fit | 9038 | 441 | 18 614 | 22 796 | 11 609 |
| (%) | (96) | (97) | (96) | (92) | (91) |
| A | 854 | 72 | 471 | 2325 | 598 |
| A | 896 | 80 | 879 | 3051 | 1045 |
| A | 87 (82) | 10 (9) | 463 (428) | 1041 (926) | 696 (537) |
The following notations apply to Tables 1 and 2: p = Number of features; n = Number of observations; all results based on ; a: (at p = 0.05); A and B′ denote, respectively, subsets of features for which PH and PO does not fit; B and C denote, respectively, subsets of features for which PO and YP fits.
The results shown in Tables 1 and 2 motivate the development of methods based on the PO and YP models because of their ability to fit not only a large number of genomic features in general, but specifically features for which the PH assumption is violated. These models provide flexible alternatives for handling various forms of NPH for quantitative as well as dichotomized features. Further details are provided in Supplementary Section 8.3.
4 Proposed methods for feature selection and ranking
In Section 3, we demonstrated the need for alternative methods to the PH model that can handle various types of NPH. We also showed in Tables 1 and 2 that there are many features for which the PH model does not fit but the PO or YP model does; and for some features, we observed that both the PH and PO models do not fit thereby suggesting the need for more complex models such as the YP or CO. In this section, we construct several marginal screening approaches based on the PO, CO and YP models. First, we adopt an information-theoretic approach and develop a test for genomic feature effect under the YP model using Kullback–Leibler (KL) information divergence. This approach includes tests as well as R2 measures for its two important special cases—the PO and PH models. Following this, we propose a unified framework to compute pseudo-R2 measures for a wide range of survival models that allow different types of NPH. It includes the PH, CH, PO and CO models and generalizes prior work (Rouam et al., 2010, 2011). The utility of these methods is demonstrated using simulated and real-life genomic datasets and by comparison to existing methods.
4.1 Methods based on information divergence
4.1.1 Test for genomic feature effect under the YP model
From Theorem 1 (see Supplementary Section 9.1 for details on IYP), it can be seen that is a maximum likelihood estimator and is asymptotically normal with mean . Despite the complexity of the YP model, has the computational advantage of not requiring an estimate of the baseline. In addition, it combines feature effects quantified by the two model parameters β and γ into a single measure and provides a simpler interpretation of feature effect. However, a practical limitation of the currently available estimation method for this model is that it requires dichotomized features (Yang and Prentice, 2005). Therefore, we propose as a measure for ranking feature effect. In our applications, we compute using and obtained by fitting the YP model to features dichotomized by the median.
4.1.2 Tests for genomic feature effect under the PO and PH models
The test statistic for feature j under the PH model, , is obtained in the limit as in equation (7) resulting in a test similar to that outlined above for the PO model. However, it would potentially over- or under-estimate feature effects when the PH assumption is violated. In that regard, the proposed methods based on YP and PO models benefit from the simplicity of this approach while addressing the issue of NPH. Furthermore, our prior studies of information divergence measures for the PH and generalized CH (equation (2)) models indicate that the proposed methods will have good small sample performance in terms of power and consistency (Devarajan and Ebrahimi, 2002, 2009).
4.1.3 Feature ranking and selection
Since quantifies the effect of a genomic feature according to the particular model of interest, it can be directly used to serve as a measure for feature ranking. Both and can be calculated for each feature in a dataset where a higher indicates a larger effect on survival based on the particular model chosen. Similarly, the test statistic in equation (11) based on the PO model can be used to compute a p-value for each feature and features can be selected by controlling the FDR, at a pre-determined level such as 5% (Benjamini and Hochberg, 1995). A similar approach can be used for feature selection and ranking using under the PH model. It is straightforward to account for potential confounders (such as age and stage of disease) by utilizing the model parameter estimate from the adjusted model in the computation of and and their respective test statistics. Multiple GOF tests at the feature level to determine which measure to use would have implications on the Type I Error and would miss features with small to moderate effects; however, since the PO accommodates forms of NPH while also allowing PH under certain conditions, a practical strategy would be to use and . An added benefit is that these methods can handle quantitative or dichotomized features as well as account for confounders.
4.2 Measures of explained randomness
We provide a brief overview of existing measures of explained randomness (ER), explained variation (EV) and predictive accuracy (PA) for censored survival data. While most of the measures require the PH assumption to hold, there are a few measures that are relevant to other survival models or do not assume a specific working model. A comprehensive review on this topic is provided in Choodari-Oskooei et al. (2012a,b). Several R2 measures of ER for the PH model have been developed based on the likelihood ratio (LR), , where L(0) and denote the partial likelihood for the PH model under the null and alternative hypotheses, respectively, and is the maximum partial likelihood estimator of β. For example, Allison’s index (Allison, 1995) has the form where N is the number of subjects. O’Quigley et al. (2005) proposed a modified version of where N is replaced by k, the number of failures, which is an approximation of the measure in Xu and O’Quigley (1999). It is less sensitive to censoring which is beneficial in our application due to the observed high fraction of censored observations and is given by . Another modified version of was proposed by Nagelkerke (1991) and has the form where . Kent and O’Quigley (1988), O’Quigley and Flandre (1994) and Royston (2006) defined measures of EV for the PH model while Royston and Sauerbrei (2004) proposed similar measures for PH, PO and lognormal models. Schemper and Henderson (2000) developed a measure of PA, , based on absolute deviation of prediction under the PH assumption. Graf et al. (1999) proposed a model independent measure of PA, , based on the mean squared error of prediction by weighted averages of time-dependent residuals, and Gerds and Schumacher (2006) extended it using regression models for the censoring distribution. can be used for both PH and PO models.
4.2.1 R2 measures based on information divergence
Hence, , where is the partial likelihood estimator of β in the PH model obtained using the standardized feature (Cox, 1972). Details regarding the derivation of and are provided in Supplementary Section 9.3. Both and can be easily seen to fall in the range and can be used for feature ranking and selection where genomic features with larger R2 values can be interpreted as exhibiting larger effects on survival under the respective models. These measures have an information-theoretic foundation and are easy to compute; from equations (15) and (16), we observe that both measures are simple functions of the respective model parameter β which contains the required information if features can be normalized to follow the standard normal model.
4.2.2 R2 measures based on the LR
We propose three different R2 measures of ER for the PO model based on LR, , where L(0) and denote the modified partial likelihood for the PO model under the null and alternative hypotheses, respectively, and the parameter β is estimated using the approach outlined in Martinussen and Scheike (2006) (see Supplementary Section 9.4 for the modified partial likelihood). These measures parallel corresponding R2 measures for the PH model— and described in Section 4.2—and can be interpreted as the proportion of ER by the PO model.
4.3 A generalized pseudo-R2 index
We develop pseudo-R2 measures for the PO and CO models, denoted by and , as well as a generalized pseudo-R2 index that embeds such measures for the PH, CH, PO and CO models. The proposed approach utilizes the partial likelihood of the respective models and does not require an estimate of the parameter β in these models. It generalizes the work of Rouam et al. (2010, 2011) where measures for the PH and CH models, which we denote by and , respectively, were proposed. These pseudo-R2 measures can be interpreted in terms of the difference in the values of a genomic feature between subjects experiencing and not experiencing the event of interest, and can be used as tools for feature ranking and selection under a variety of scenarios involving NPH. An obvious disadvantage of is that it is based on the PH model. In the CH model, the HR between two individuals with feature values z and cross over time (Rouam et al., 2011). Thus, while does address the inherent problem with , it forces CH. On the other hand, the PO model is more generally applicable to the variety of hazard structures observed in genomic data as seen in Tables 1 and 2. The CO model generalizes the PO model in the same manner as the CH model generalizes the PH model and, thus, has a more versatile form. For our purposes, we will use the special case in equation (4). Hence, the measures and based on these models offer significant advantages and flexibility that is not afforded by currently available measures.
The generalized pseudo-R2 index can then be expressed as , where , zi is the value of a given feature for subject i, k is the number of uncensored failure times and is the number of subjects at risk at time ti. This index can be seen as the robust score statistic divided by the number of distinct uncensored failure times, a quantity that falls between 0 and 1 and can be interpreted in terms of the percentage of separability between subjects experiencing and not experiencing the event in relation to the value of a genomic feature. Thus, R2 indices corresponding to the PH, CH, PO and CO models based on model-specific choice of the weight, w(t), determined by the respective partial likelihood can be obtained. The estimated weight, , for each special case is shown in Supplementary Table 6 and Supplementary Section 9.4, and can be interpreted as the derivative of the log HR for the corresponding model with respect to the predictor η=βz evaluated at η = 0. In Supplementary Table 6, is estimated by the left-continuous version of Nelson’s estimator and is estimated using the Kaplan–Meier estimator. We note that and are the measures described in Rouam et al. (2010, 2011) while and are our newly proposed measures that allow for various types of NPH as well as PH. In order to avoid numerical issues, we applied empirical corrections in the computation of (which we refer to as henceforth) and (see Supplementary Section 9.4 for details).
5 Application to simulated data
In this section, we evaluate our newly proposed ranking methods, and using simulated datasets under a variety of scenarios and compare their performance to existing methods, and (Dunkler et al., 2010; Rouam et al., 2010, 2011) and measures of PA, and , described in Section 4.2. The goal of this study is to assess the performance of these methods in selecting features that are truly associated with survival in the high-dimensional setting.
5.1 Simulation schemes
We considered two different simulation schemes to generate artificial survival and genomic datasets based on the approach outlined in Dunkler et al. (2010). In order to account for various types of hazards, survival times , were generated from each of five different models specified as follows: standard log-normal LN (μ = 0, σ = 1); log-logistic LL1 () and LL2 (); and Weibull W1 () and W2 (), where LL1 and W1 refer to the respective models where the shape parameters are the same but the scale parameters differ, and LL2 and W2 refer to the respective models where both the shape and scale parameters differ. In the LN model, μ and σ are the location and scale parameters, respectively. We use a more informed approach that is broader in scope compared to that of Dunkler et al. (2010), who only considered W1 in their simulations. Here, LN, LL2 and W2 cases are of particular interest because of their ability to simulate CH. To simulate censoring, we drew random samples with uniform follow-up times C from and defined the observed survival time as with censoring indicator . We chose τ to get censoring proportions of 0, 33 and 67%. For each model, we simulated censored survival times and genomic data for N = 200 subjects and p = 5000 mock features that are linked to survival time based on the logarithm of the HR , . Genomic data were generated from the standard normal model which covers a variety of features seen in large-scale genomic studies. Following Klein and Moeschberger (2003), was calculated based on the respective model of interest. For LN, we used to simulate CH similar to what was done in Dunkler et al. (2010). Then, β0 was chosen so that only the first 400 features were assumed to have an effect on survival time, with 200 having a large effect and 200 having a small effect. In Scheme 1, we adopt a univariate approach where each feature is separately linked to survival, and in Scheme 2, we adopt a multivariate approach that incorporates correlations between features. Under each scenario, 200 datasets were generated and the ranking methods were assessed using mean AUC (area under the receiver operating characteristic curve) and Youden index (Youden, 1950). More details can be found in Dunkler et al. (2010) and Supplementary Section 10.
5.2 Simulation results
Overall, under different data generating models and censoring proportions, the proposed methods outperformed and, in some cases, performed as well as existing methods. In most cases, we noticed some form of improvement. Selected results under different NPH model assumptions are highlighted in Table 3 while detailed simulation results under various scenarios outlined above are provided in Supplementary Sections 10.3, 10.4 and Tables 7–11. Since , and resulted in the same feature rankings, we only report results for . Our PO model-based methods (all R2 measures and ) performed strongly overall and performed particularly well for lower censoring (0% and 33%). outperformed in many cases where NPH was present, across censoring levels and simulations schemes; but it is important to note that our modified version, , performed better than in most cases and similarly to it in other cases. Overall, depending on the simulation scheme and type of NPH present, we can identify the benefits of utilizing each of our measures. Most importantly, the proposed feature selection methods are more flexible and generalize existing methods.
| Censoring . | Scheme 1, 33% . | Scheme 1, 67% . | Scheme 2, 33% . | Scheme 2, 67% . | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Measure . | LN . | LL2 . | W2 . | LN . | LL2 . | W2 . | LN . | LL2 . | W2 . | LN . | LL2 . | W2 . |
| I PO | 0.89 | 0.77 | 1.00 | 0.93 | 0.81 | 0.95 | 0.87 | 0.62 | 0.90 | 0.89 | 0.80 | 0.86 |
| I YP | 0.96 | 0.75 | 1.00 | 0.83 | 0.58 | 0.93 | 0.84 | 0.82 | 0.86 | 0.69 | 0.68 | 0.68 |
| 0.82 | 0.71 | 1.00 | 0.88 | 0.71 | 0.98 | 0.80 | 0.50 | 0.90 | 0.84 | 0.53 | 0.87 | |
| 0.88 | 0.78 | 1.00 | 0.93 | 0.81 | 0.95 | 0.86 | 0.64 | 0.90 | 0.88 | 0.80 | 0.89 | |
| 0.89 | 0.77 | 0.99 | 0.93 | 0.81 | 0.95 | 0.87 | 0.63 | 0.89 | 0.89 | 0.81 | 0.84 | |
| 0.82 | 0.80 | 0.87 | 0.68 | 0.74 | 0.89 | 0.59 | 0.87 | 0.88 | 0.56 | 0.71 | 0.87 | |
| 0.97 | 0.87 | 0.82 | 0.87 | 0.84 | 0.71 | 0.88 | 0.89 | 0.74 | 0.81 | 0.87 | 0.55 | |
| 0.77 | 0.61 | 1.00 | 0.92 | 0.78 | 0.97 | 0.75 | 0.54 | 0.89 | 0.89 | 0.71 | 0.87 | |
| Censoring | Scheme 1, 33% | Scheme 1, 67% | Scheme 2, 33% | Scheme 2, 67% | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Measure | LN | LL2 | W2 | LN | LL2 | W2 | LN | LL2 | W2 | LN | LL2 | W2 |
| I PO | 0.89 | 0.77 | 1.00 | 0.93 | 0.81 | 0.95 | 0.87 | 0.62 | 0.90 | 0.89 | 0.80 | 0.86 |
| I YP | 0.96 | 0.75 | 1.00 | 0.83 | 0.58 | 0.93 | 0.84 | 0.82 | 0.86 | 0.69 | 0.68 | 0.68 |
| 0.82 | 0.71 | 1.00 | 0.88 | 0.71 | 0.98 | 0.80 | 0.50 | 0.90 | 0.84 | 0.53 | 0.87 | |
| 0.88 | 0.78 | 1.00 | 0.93 | 0.81 | 0.95 | 0.86 | 0.64 | 0.90 | 0.88 | 0.80 | 0.89 | |
| 0.89 | 0.77 | 0.99 | 0.93 | 0.81 | 0.95 | 0.87 | 0.63 | 0.89 | 0.89 | 0.81 | 0.84 | |
| 0.82 | 0.80 | 0.87 | 0.68 | 0.74 | 0.89 | 0.59 | 0.87 | 0.88 | 0.56 | 0.71 | 0.87 | |
| 0.97 | 0.87 | 0.82 | 0.87 | 0.84 | 0.71 | 0.88 | 0.89 | 0.74 | 0.81 | 0.87 | 0.55 | |
| 0.77 | 0.61 | 1.00 | 0.92 | 0.78 | 0.97 | 0.75 | 0.54 | 0.89 | 0.89 | 0.71 | 0.87 | |
6 Application to genomic data
In this section, we compare the performance of the proposed methods using several datasets representing a broad spectrum of high-throughput genomic technologies. In addition to datasets 1–5 described in Section 3, we utilized the following datasets. Dataset 6 consists of microarray gene expression profiles measured on the same set of glioblastoma samples used in datasets 1 and 2, while datasets 7 and 8 consist of DNA methylation and copy number variation profiles, respectively, from subjects with HNSCC (see Supplementary Section 8.1). Since we do not have prior knowledge on the number of genomic features significantly associated with survival in real data, this approach will differ from that of the simulations. Here, we will rank features based on each method and compare them using a pre-selected number of top features across methods. Whenever possible, parameter estimates required for computing certain I and R2 measures were obtained from respective models adjusted for potential confounders such as age and stage of disease. These measures include and the absolute effect size estimate, , from CON. In addition to quantitative features, dichotomized features were utilized for and and no adjustment for confounders was done for analyses involving dichotomized data in order to enable direct comparison of results to which accommodates only a single dichotomized covariate as explained earlier.
We begin by focusing attention on the application of our proposed I measures— and —and , selecting the top 500 features based on each measure. We observe that there are few features commonly selected by all three measures, as evidenced in the Venn diagrams presented in Supplementary Figures 1–3. These results are captured in Table 5 which summarizes the fraction of features that are identified in common by different measures (for quantitative features). This is not surprising given that these measures are based on different model assumptions. More details can be found in Supplementary Material. Using the test statistic for in equation (11), we computed a p-value for each feature and selected features by controlling the FDR at 5% (Benjamini and Hochberg, 1995); for comparison, we applied the test from CON (Section 2.4) using the same threshold. It should be noted that this approach can also be used to rank features and to compare different measures. Table 4 summarizes the results of this analysis for quantitative and dichotomized features for various datasets. In almost all cases, selects a much larger number of features compared to CON which does not select many features at all in the first place. This limitation is recognized in Dunkler et al. (2010), and it renders CON as a tool for feature ranking only (based on ) rather than feature selection, unlike which can be used for ranking as well as selecting features based on a pre-specified p-value or FDR threshold. Even though some overlap is observed in features selected by and when used purely as ranking measures (as seen in Table 5), significant improvement is observed in the performance of the proposed statistical test based on , as evident from Table 4. In addition, enables better enrichment of selected feature sets as demonstrated in Supplementary Section 11, where we also outline an exploratory approach to evaluate and visualize the combined effect of selected features on survival. Overall, we recommend the use of or because of their inherent versatility.
| Data type . | Quantitative . | Dichotomous . | ||
|---|---|---|---|---|
| Dataset . | . | CON . | . | CON . |
| 1 | 1097 | 99 | 318 | 0 |
| 2 | 33 | 0 | 17 | 0 |
| 3 | 1580 | 7 | 1963 | 2 |
| 4 | 4784 | 17 | 6057 | 0 |
| 5 | 758 | 1473 | 4099 | 323 |
| 6 | 310 | 0 | 0 | 0 |
| 7 | 5799 | 0 | 3885 | 21 |
| 8 | 464 | 0 | 64 | 0 |
| Data type | Quantitative | Dichotomous | ||
|---|---|---|---|---|
| Dataset | CON | CON | ||
| 1 | 1097 | 99 | 318 | 0 |
| 2 | 33 | 0 | 17 | 0 |
| 3 | 1580 | 7 | 1963 | 2 |
| 4 | 4784 | 17 | 6057 | 0 |
| 5 | 758 | 1473 | 4099 | 323 |
| 6 | 310 | 0 | 0 | 0 |
| 7 | 5799 | 0 | 3885 | 21 |
| 8 | 464 | 0 | 64 | 0 |
| Datasets . | and . | and . | and . | , and . | and . | and . | and . | , , and . |
|---|---|---|---|---|---|---|---|---|
| 1 | 11 | 14 | 62 | 7 | 7 | 45 | 38 | 6 |
| 2 | 34 | 36 | 88 | 32 | 40 | 42 | 12 | 8 |
| 3 | 7 | 11 | 48 | 5 | 14 | 18 | 6 | 0 |
| 4 | 5 | 6 | 42 | 3 | 18 | 9 | 11 | 0 |
| 5 | 10 | 23 | 39 | 7 | 27 | 3 | 8 | <1 |
| 6 | 16 | 15 | 75 | 14 | 11 | 37 | 5 | 0 |
| 7 | 1 | <1 | 62 | <1 | 19 | 9 | 1 | 0 |
| 8 | 6 | 16 | 53 | 4 | 21 | 57 | 7 | 0 |
| Datasets | ||||||||
|---|---|---|---|---|---|---|---|---|
| 1 | 11 | 14 | 62 | 7 | 7 | 45 | 38 | 6 |
| 2 | 34 | 36 | 88 | 32 | 40 | 42 | 12 | 8 |
| 3 | 7 | 11 | 48 | 5 | 14 | 18 | 6 | 0 |
| 4 | 5 | 6 | 42 | 3 | 18 | 9 | 11 | 0 |
| 5 | 10 | 23 | 39 | 7 | 27 | 3 | 8 | <1 |
| 6 | 16 | 15 | 75 | 14 | 11 | 37 | 5 | 0 |
| 7 | 1 | <1 | 62 | <1 | 19 | 9 | 1 | 0 |
| 8 | 6 | 16 | 53 | 4 | 21 | 57 | 7 | 0 |
Next, we examine the differences between genomic features selected by various R2 measures, once again selecting the top 500 features in each case. Supplementary Figure 4 shows the overlaps between PO-based R2 measures— and —for the glioblastoma datasets (1, 2 and 6). Since dataset 2 contained a small number of features (454), the top 50 selected features were compared. The results are, however, consistent across these datasets where we observe a large fraction of common features selected by different measures. In all three datasets, a large fraction of features () are common to all three measures as well as between any pair of measures. Similar observations can be made for the HNSCC (3, 7 and 8), ovarian (4) and oral (5) datasets shown in Supplementary Figure 5.
Venn diagrams corresponding to and are shown in Supplementary Figures 6 and 7, and the size of each intersecting feature set is shown in Table 5. We observe that there are only minor overlaps between different measures across all datasets, with the largest overlaps occurring between and for datasets 5 and 8 and between and for datasets 1, 6 and 8. There are also no features common to all three measures in datasets 3, 4, 6, 7 and 8, and only a very small number of common features in the remaining datasets. Overall, these results are not surprising since each of these methods is based on a different model, so we would expect them to select different features; however, the results also indicate that PO and CO model-based measures can identify various forms of NPH. The differences observed in selected features across the three measures indicate the presence of various types of NPH in a given dataset. Thus, the appropriate measure can be chosen based on the goal of feature selection and type of hazards present or expected. We emphasize that our proposed measures provide a more versatile and general framework that allows for inclusion of various types of hazards.
7 Discussion and conclusion
In this article, we proposed unified methods for feature selection in large-scale genomic data in the presence of censored survival outcomes. We illustrated the utility of these methods using real-life studies in cancer genomics; in particular, we demonstrated their ability to handle the challenges due to various forms of NPH. The proposed methods are based on models that relax the PH assumption and are able to identify genomic features with a time-varying effect with increased specificity and sensitivity.
Our methods are flexible and generalize existing methods for feature selection by casting them within unifying frameworks. First, we proposed a general framework to test for the effect of a genomic feature under the YP model using KL information divergence by developing the measure which quantifies this effect. includes measures for the PO and PH models— and , respectively—as special cases. An advantage of these measures is that they do not require an estimate of the baseline hazard and, instead, are simple functions of respective model parameter estimates. Using these measures, we developed a statistical test () for genomic feature effect in the PO model where the test statistic or p-value could be used for feature selection; in addition, we developed R2 measures of ER for the PO and PH models that only rely on the respective model parameter estimate or the LR. Finally, we proposed a generalized pseudo-R2 index that embeds measures for the PO, CO, PH and CH models as special cases. These measures do not require an estimate of the model parameter and can be easily interpreted as the percentage of separability between subjects in the event and non-event groups according to feature values. Since and select fairly independent subsets of features, each set can be used for further exploration and study. PO model-based measures demonstrated their ability to handle PH and various forms of NPH while only performed well in detecting certain types of CH; moreover, offers a useful alternative and is better suited for handling particular forms of NPH. Thus, each measure provides useful information specific to a particular model of interest while PO model-based R2 measures provide overall flexibility relative to other measures.
All proposed R2 measures can be applied to quantitative (continuous, count or ordered categorical) features and I measures are applicable to quantitative or dichotomized features; however, is currently usable only on dichotomized features. Use of (appropriately variance stabilized and normalized) quantitative features aids in robust estimation of time-varying effects and interpretability while dichotomized features facilitate visualization of results. However, it is important to be aware of the potential effect of dichotomization—using the median split or any other arbitrary cut-off—on the PH assumption. This is evidenced by the results in Tables 1 and 2 for datasets 1 and 2 where dichotomization has a negative and positive effect, respectively, on GOF. The proposed I measures are useful for feature ranking in general and, in particular, when the distribution of feature measurements or a normalizing transformation for it is unknown. Furthermore, and can accommodate potential confounders such as age and stage of disease directly or indirectly in their computation (as illustrated in our examples) as well as a group of pre-selected features depending on the application of interest. While offers an approach for feature ranking using a flexible survival model for dichotomized features, provides a method for feature ranking as well as selection for both quantitative and dichotomized features. Typically, selects more features at the same significance threshold and, thus, provides a more lenient approach for feature discovery relative to standard methods and CON as evidenced in Table 4. Moreover, our results demonstrate that it includes a significant fraction of features identified by CON. This is a desirable property of any feature selection method and it enables appropriate correction for multiple testing through use of FDR based not only on the Benjamini–Hochberg approach which assumes independent hypotheses but also on more stringent methods that account for dependence amongst hypotheses. Such flexibility is not possible with currently available methods for feature selection and ranking especially within a broad framework that includes the YP, PO and PH models.
Our extensive simulation studies demonstrated that there were a variety of scenarios where our proposed measures outperformed currently available methods. An important consideration is that when marginal screening methods are utilized for the purpose of feature ranking and selection, the issue of multiple testing becomes less important in comparison to adjusting for potential confounders when considering different models and measures. The proposed methods demonstrated their ability to correctly select genomic features associated with survival in the presence of different types of time-varying effects in real genomic data, after adjusting for potential confounders and for multiple testing, as well as in simulated data. As genomic technologies continue to advance and as more clinical, demographic and genomic data are generated and stored in repositories such as TCGA, Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) etc., feature selection methods will become increasingly important as we attempt to identify genomic features with a prognostic impact on patient survival. Although we focused on genomic feature selection in this article, it should be noted that the proposed methods are directly applicable to a broad array of high-throughput ‘omics’ studies such as those involving genome-wide association, proteomics, metabolomics, transcriptomics and radiomics. In particular, radiomics is a rapidly developing area where there has been considerable interest in correlating intra-tumor heterogeneity based on textural features to survival endpoints.
Acknowledgements
The authors would like to thank the reviewers for thoughtful comments that helped improve the presentation in this article.
Funding
This work was partially supported by the National Institutes of Health/National Cancer Institute [P30 CA006927 to K.D.]; the National Science Foundation [625061]; and US Army grant [W911NF-16-2-0189] (Temple University’s HPC resources).
Conflict of Interest: none declared.
References