The ovarian cancer dataset is based on gene expression data from oncological patients and comprises 517 observations over 12042 covariates. The data were obtained from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/).
The dataset is publicly available (https://gdc-portal.nci.nih.gov/) and was normalized and aggregated by the TCGA consortium, so the analysis can be reproduced with the original dataset. We only removed corrupted case records from the data, i.e., cases where related attributes carried inconsistent information.
The clinical data were cleaned using the attributes “Days to last follow-up” and “Days to death” to detect inconsistencies between them; only the cases where the number of days matched were included in the analysis. The same process was applied to the attributes “Days to death” and “Vital status”, since some cases had the status “deceased” but a missing “Days to death”. A sketch of this cleaning step is given below.
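This is only a sketch: the data frame clinical and the column names days_to_last_followup, days_to_death and vital_status are hypothetical placeholders for the corresponding TCGA clinical attributes.
consistent.days <- is.na(clinical$days_to_death) |
  is.na(clinical$days_to_last_followup) |
  clinical$days_to_death == clinical$days_to_last_followup   # days must match when both are present
consistent.status <- !(clinical$vital_status == "deceased" &
  is.na(clinical$days_to_death))                              # deceased cases need a "Days to death"
clinical.clean <- clinical[consistent.days & consistent.status, ]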
For the analysis, 18 genes were considered, based on those selected in a previous study by Zhang (2013).
Let's read the dataset. Do not forget to change the file path to the correct directory on your computer.
The covariate matrix has to be defined; the data are stored in a txt file.
data.survival<- read.table("./tcga18.txt",dec = ",", header = TRUE)
head(data.survival)
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 5.610701 7.511564 5.251405 4.348473 3.753541 4.789777 8.197449 4.852636
## 2 3.927762 6.238349 5.916283 6.667097 4.025553 4.925367 8.380464 3.204983
## 3 6.667391 4.218046 4.300340 3.395904 3.872636 3.835449 5.995941 3.164932
## 4 3.737681 6.878165 3.823494 2.997130 3.349100 3.585837 4.163529 3.209809
## 5 4.649516 7.648314 6.883327 7.845051 4.645337 5.202604 9.693533 3.193201
## 6 3.693657 6.110123 4.855396 6.772189 4.604459 6.536206 10.045702 3.260278
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 3.468236 4.035410 6.582996 7.641697 6.138341 5.342155 6.256299
## 2 3.401826 4.431276 7.088650 10.527365 6.665464 7.184504 7.493833
## 3 3.538999 4.054605 3.874755 7.404490 4.243858 4.089568 4.210808
## 4 3.499236 3.793138 3.569516 3.744800 3.656709 6.956911 6.939087
## 5 3.251680 5.470451 8.267006 9.763219 6.290778 10.038258 5.326372
## 6 3.445472 5.797378 8.239962 9.495431 7.657357 6.589208 6.701060
## SPARC CNN1 FBN1
## 1 9.882551 5.000291 5.514547
## 2 10.482700 6.432753 7.220278
## 3 7.042135 4.435957 4.330573
## 4 7.010820 4.459158 3.850500
## 5 11.077949 9.074373 7.972447
## 6 10.894212 6.020242 6.567972
And the matrix for the time and status is given by:
data.time.status<-read.table("./timestatus.txt",dec = ",", header = TRUE)
head(data.time.status)
## time status
## 1 1336 1
## 2 1247 1
## 3 55 1
## 4 1495 0
## 5 61 1
## 6 1418 0
A Cox proportional hazards model was fitted with all covariates. Let's start the analysis with the Cox regression model by loading the survival package.
library(survival)
Fit the Cox proportional hazards model to the dataset.
fit.cox.all<-coxph(Surv(data.time.status$time,data.time.status$status) ~ .,data=data.survival)
fit.cox.all
## Call:
## coxph(formula = Surv(data.time.status$time, data.time.status$status) ~
## ., data = data.survival)
##
## coef exp(coef) se(coef) z p
## LPL 0.1263 1.1346 0.0751 1.68 0.0924
## IGF1 0.0210 1.0212 0.0600 0.35 0.7266
## EDNRA 0.0224 1.0227 0.1227 0.18 0.8549
## MFAP5 0.0165 1.0166 0.0482 0.34 0.7327
## LOX 0.1918 1.2114 0.1251 1.53 0.1254
## INHBA -0.1432 0.8666 0.1786 -0.80 0.4227
## THBS2 0.0639 1.0660 0.0902 0.71 0.4787
## ADIPOQ -0.1256 0.8819 0.0910 -1.38 0.1676
## NPY 0.0552 1.0567 0.0496 1.11 0.2655
## CCL11 -0.1296 0.8785 0.0960 -1.35 0.1771
## VCAN 0.0578 1.0595 0.1009 0.57 0.5664
## DCN 0.0729 1.0757 0.0892 0.82 0.4133
## TIMP3 0.0719 1.0745 0.0835 0.86 0.3891
## CRYAB 0.1092 1.1154 0.0424 2.58 0.0100
## CXCL12 0.0204 1.0206 0.0818 0.25 0.8030
## SPARC -0.3811 0.6831 0.1402 -2.72 0.0066
## CNN1 0.0863 1.0901 0.1141 0.76 0.4493
## FBN1 0.1135 1.1202 0.1690 0.67 0.5018
##
## Likelihood ratio test=26.8 on 18 df, p=0.0837
## n= 517, number of events= 284
Next, check whether the proportional hazards assumption is violated. According to the results obtained, the assumption is not violated: the global test gives p = 0.38, and only ADIPOQ shows a borderline individual p-value (0.027).
pph<-cox.zph(fit.cox.all, transform="km", global=TRUE)
pph
## rho chisq p
## LPL 0.0584 1.09e+00 0.2970
## IGF1 -0.0378 4.75e-01 0.4909
## EDNRA -0.0256 3.00e-01 0.5839
## MFAP5 0.0674 1.52e+00 0.2184
## LOX 0.0179 1.09e-01 0.7418
## INHBA 0.0118 4.31e-02 0.8355
## THBS2 -0.0379 4.56e-01 0.4997
## ADIPOQ -0.1246 4.87e+00 0.0273
## NPY -0.0310 3.28e-01 0.5669
## CCL11 0.0665 1.36e+00 0.2440
## VCAN 0.0240 1.81e-01 0.6705
## DCN -0.0100 3.12e-02 0.8598
## TIMP3 -0.0202 1.27e-01 0.7216
## CRYAB -0.0544 9.81e-01 0.3219
## CXCL12 -0.0014 6.83e-04 0.9791
## SPARC 0.0577 9.08e-01 0.3406
## CNN1 -0.0850 2.82e+00 0.0928
## FBN1 -0.0218 1.91e-01 0.6618
## GLOBAL NA 1.92e+01 0.3795
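The proportional hazards diagnostics can also be inspected graphically. As a quick sketch, the scaled Schoenfeld residuals of ADIPOQ (the 8th covariate, the only one with a borderline individual p-value) can be plotted over time; an approximately flat smooth supports the assumption.
plot(pph[8])                                  # scaled Schoenfeld residuals for ADIPOQ
abline(h=coef(fit.cox.all)["ADIPOQ"], lty=2)  # time-constant fitted coefficient, for reference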
In order to see whether the results obtained above are consistent, a robust version of the Cox regression model is fitted next. First, load the following package.
library(coxrobust)
To see the results obtained by fitting the robust Cox model using exponential weights with a 5% level of trimming, run the following:
fit.coxrobust<-coxr(Surv(data.time.status$time,data.time.status$status) ~ .,data=data.survival,trunc=0.95, f.weight="exponential",singular.ok=T,model=F)
fit.coxrobust
##
## Call:
## coxr(formula = Surv(data.time.status$time, data.time.status$status) ~ ., data = data.survival, trunc = 0.95, f.weight = "exponential", singular.ok = T, model = F)
##
## Partial likelihood estimator
## coef exp(coef) se(coef) p
## LPL 0.1263 1.135 0.0751 0.09257
## IGF1 0.0207 1.021 0.0601 0.73040
## EDNRA 0.0223 1.023 0.1227 0.85570
## MFAP5 0.0166 1.017 0.0482 0.73068
## LOX 0.1922 1.212 0.1251 0.12450
## INHBA -0.1431 0.867 0.1786 0.42301
## THBS2 0.0640 1.066 0.0902 0.47786
## ADIPOQ -0.1254 0.882 0.0910 0.16841
## NPY 0.0552 1.057 0.0496 0.26528
## CCL11 -0.1296 0.878 0.0960 0.17703
## VCAN 0.0576 1.059 0.1009 0.56776
## DCN 0.0728 1.076 0.0892 0.41396
## TIMP3 0.0719 1.074 0.0835 0.38923
## CRYAB 0.1092 1.115 0.0424 0.00999
## CXCL12 0.0204 1.021 0.0818 0.80330
## SPARC -0.3818 0.683 0.1403 0.00649
## CNN1 0.0866 1.091 0.1141 0.44763
## FBN1 0.1141 1.121 0.1690 0.49972
##
## Wald test=27 on 18 df, p=0.079
##
## Robust estimator
## coef exp(coef) se(coef) p
## LPL 0.10107 1.106 0.0856 0.2375
## IGF1 0.03407 1.035 0.0705 0.6289
## EDNRA 0.06185 1.064 0.2119 0.7704
## MFAP5 0.00888 1.009 0.0622 0.8865
## LOX 0.16876 1.184 0.1499 0.2604
## INHBA -0.15555 0.856 0.1895 0.4118
## THBS2 0.08631 1.090 0.1072 0.4205
## ADIPOQ -0.07269 0.930 0.1047 0.4875
## NPY 0.06251 1.065 0.0710 0.3785
## CCL11 -0.15784 0.854 0.1212 0.1927
## VCAN 0.02857 1.029 0.1419 0.8404
## DCN 0.07907 1.082 0.0993 0.4257
## TIMP3 0.07745 1.081 0.0906 0.3925
## CRYAB 0.11793 1.125 0.0544 0.0302
## CXCL12 0.01292 1.013 0.0962 0.8932
## SPARC -0.39783 0.672 0.2020 0.0489
## CNN1 0.13126 1.140 0.1395 0.3468
## FBN1 0.11223 1.119 0.2234 0.6154
##
## Extended Wald test=27.8 on 18 df, p=0.0647
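To compare the two sets of estimates directly, a small sketch is given below; it assumes the coxr object stores the partial likelihood and robust estimates in the components ple.coefficients and coefficients, as documented for the coxrobust package.
coef.compare<-cbind(partial=fit.coxrobust$ple.coefficients, robust=fit.coxrobust$coefficients)
round(coef.compare,3)                       # side-by-side estimates
plot(coef.compare, xlab="Partial likelihood estimate", ylab="Robust estimate")
abline(a=0, b=1, lty=2)                     # points far from the diagonal suggest influential observations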
Another proposal for a robust Cox model is given by Heritier (2009) and is fitted next.
filepath="../functions/"
source(paste(filepath,"Chapter7_functions.r",sep=""))
The model fitted using exponential weights with a 5% level of trimming is given below.
fit.coxrobust.H=rcoxph(data.time.status$time,data.time.status$status,data.survival,wt.type="exponential",quant=.95)
fit.coxrobust.H$coefficients
## estimate SE z p.value
## LPL 0.101118182 0.07169238 1.4104454 0.15840
## IGF1 0.034040440 0.06700056 0.5080620 0.61140
## EDNRA 0.062118582 0.14823520 0.4190542 0.67517
## MFAP5 0.008912579 0.05164610 0.1725702 0.86298
## LOX 0.168964994 0.12809588 1.3190510 0.18715
## INHBA -0.155638383 0.18406256 -0.8455733 0.39779
## THBS2 0.086216158 0.09076783 0.9498537 0.34218
## ADIPOQ -0.072837762 0.10007700 -0.7278172 0.46672
## NPY 0.062470225 0.05534324 1.1287778 0.25899
## CCL11 -0.157642473 0.10130146 -1.5561718 0.11966
## VCAN 0.028569797 0.09559983 0.2988478 0.76505
## DCN 0.079087093 0.09756531 0.8106067 0.41759
## TIMP3 0.077490703 0.08806427 0.8799335 0.37889
## CRYAB 0.117990400 0.04369162 2.7005269 0.00692
## CXCL12 0.012983156 0.08792611 0.1476598 0.88261
## SPARC -0.397470586 0.13323024 -2.9833361 0.00285
## CNN1 0.131279965 0.13408693 0.9790660 0.32754
## FBN1 0.111613834 0.18058542 0.6180667 0.53653
The only statistically significant genes, for both the Cox model and the robust Cox fits, were CRYAB and SPARC.
The next figure shows that observations 113 and 219 are identified as influential observations, in the sense that the weight assigned to each of them is the lowest among all observations.
In the next section we evaluate whether the observations identified above are indeed influential.
In order to identify which observations might be outliers, several outlier detection methods for survival analysis are performed.
The first method used for outlier detection is based on residuals, namely the deviance residuals.
The results for the deviance residuals are shown below; observations 415 and 346 present the largest absolute values.
res.dev.2<- resid(fit.cox.all,type="deviance")
head(sort(abs(res.dev.2),decreasing = TRUE))
## 415 346 202 311 219 310
## 3.158922 3.156986 2.999861 2.749230 2.669330 2.664042
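A quick plot of these residuals helps to spot the extreme observations; the cut-off of 2.5 used below is only a common rule of thumb, not a formal test.
plot(res.dev.2, xlab="Observation index", ylab="Deviance residual")
abline(h=c(-2.5,2.5), lty=2)                           # rule-of-thumb thresholds
big<-which(abs(res.dev.2)>2.5)
text(big, res.dev.2[big], labels=big, pos=3, cex=0.7)  # label the flagged observations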
The outlier detection method based on the concordance c-index relies on the definition of an outlier as “an observation that, when absent from the data, will likely decrease the prediction error of the fitted model”. The concordance c-index is used as a test statistic that is sensitive to the presence of outliers, i.e., the larger the number of outliers, the lower the performance of the fitted model. The concordance c-index measures how well the predicted values are concordant with the rank-ordered response variable. Two algorithms based on this idea are performed next. First, let's call the package that implements this outlier detection method.
library("BCSOD")
The input of this algorithm is:
The dual bootstrap hypothesis testing (DBHT) is an improvement on the BHT described before. The BHT removes one observation at a time from the dataset and evaluates the impact of each removal on the concordance c-index. Notice that the refitted model has fewer observations than the original dataset, so the concordance c-index tends to increase, which may hamper the evaluation of statistical significance and increase the number of false positives.
The input is the same as before; notice that B.N corresponds to the number of observations.
Here are the top observations, with the lowest p-values.
outliers_dbht1000[1:10,]
## pvalues
## [1,] 346 4.223057e-20
## [2,] 74 1.628611e-16
## [3,] 42 1.535752e-15
## [4,] 201 8.492635e-10
## [5,] 219 4.414806e-09
## [6,] 113 7.301784e-09
## [7,] 508 2.953401e-08
## [8,] 277 2.739335e-07
## [9,] 466 3.007432e-07
## [10,] 115 5.547855e-07
dbht<-as.data.frame(outliers_dbht1000) # convert to a data frame (easier to handle)
names(dbht)[1]="id" # rename the first column
dbht.order<-dbht[order(dbht$id),]
dbht.order[1:10,]
## id pvalues
## 345 1 7.981046e-01
## 301 2 6.801899e-01
## 136 3 1.550638e-01
## 211 4 3.970933e-01
## 509 5 9.999235e-01
## 504 6 9.993949e-01
## 363 7 8.307585e-01
## 17 8 2.927809e-05
## 323 9 7.307515e-01
## 63 10 2.272372e-02
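The idea behind these bootstrap tests can be illustrated with a bare-bones leave-one-out computation, shown below. This sketch is not the BCSOD implementation: it simply refits the Cox model without each observation and records the change in the concordance c-index, using the concordance function available in recent versions of the survival package (it is slow, since the model is refitted n times).
c.full<-concordance(fit.cox.all)$concordance
delta.c<-sapply(seq_len(nrow(data.survival)), function(i){
  fit.i<-coxph(Surv(data.time.status$time[-i], data.time.status$status[-i]) ~ .,
               data=data.survival[-i,])
  concordance(fit.i)$concordance - c.full   # change in c-index when observation i is removed
})
head(order(delta.c, decreasing=TRUE))       # removals that increase the c-index the most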
Censored quantile regression was introduced by Portnoy (2003) as an alternative to the Cox regression model for survival data with censored observations. Because the Cox regression model assumes proportional hazards, censored quantile regression can provide more flexibility. The method proposed by Portnoy (2003) is a generalization of the Kaplan-Meier one-sample estimator.
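As a point of reference (separate from the outlier detection workflow that follows), Portnoy's estimator itself can be fitted with the crq function of the quantreg package; the sketch below is restricted to two genes and assumes the default right-censoring setup.
library(quantreg)
fit.crq<-crq(Surv(data.time.status$time, data.time.status$status) ~ CRYAB + SPARC,
             data=data.survival, method="Portnoy")
summary(fit.crq, taus=c(0.25,0.50,0.75))    # coefficients at selected quantiles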
Here two algorithms are considered. The results and the command lines to obtain the outliers based on this theory are presented next. First, load the package OutlierDC and make sure that the covariate matrix is in the form of a data frame.
library(OutlierDC)
## Loading required package: quantreg
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'quantreg'
## The following object is masked from 'package:survival':
##
## untangle.specials
## Loading required package: Formula
##
## Package OutlierDC (0.3-0) loaded.
data.survival<-as.data.frame(data.survival)
The score algorithm arises from the outlier detection method for the Cox regression model with censored data proposed by Nardi (1999), but is now based on quantile regression.
odcS_tcga<-odc(formula=Surv((data.time.status$time),data.time.status$status)~
LPL+IGF1+EDNRA+MFAP5+LOX+INHBA+THBS2+ADIPOQ+ NPY+ CCL11+VCAN+ DCN+ TIMP3+ CRYAB+ CXCL12+ SPARC+ CNN1+ FBN1, data = data.survival,method="score",rq.model="Wang",h=0.05)
## Please wait...
## Done.
The top outliers are:
head(sort(odcS_tcga@score,decreasing = TRUE))
## 113 221 455 26 452 297
## 10.239740 8.456885 5.719224 5.289057 4.443739 4.214216
Put the scores obtained above in the form of a vector.
odS.vec<-as.vector(odcS_tcga@score)
Define a matrix with the id and the values obtained from the score algorithm.
n<-dim(data.survival)[1] # number of observations
id<-as.vector(seq(1:n)) # vector with the ids
odS.matrix<-as.data.frame(cbind(id,odS.vec))
In order to obtain a global result that combines the techniques performed above, a rank product test is used. First, the ranks given by each outlier detection method are computed.
res.dev.vec<-as.vector(abs(res.dev.2))
rank_deviance<-rank(-res.dev.vec,ties.method = "first")
rank_dbht<-rank(dbht.order$pvalues,ties.method = "first")
rank_score<-rank(-(odS.matrix$odS.vec),ties.method ="first")
Now, based on the ranks above, let us obtain the rank product. First, build the rank matrix that gathers the ranks obtained by each method.
rankMat<-cbind(rank_deviance,rank_dbht,rank_score)
Second, compute the rank product, which is obtained row-wise from the rank matrix.
RankProduct <- function(rankMat)
{
  return(apply(rankMat, 1, prod))
}
rankproduct<-RankProduct(rankMat)
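For a single observation, the rank product is simply the product of its three ranks; for example, for observation 113:
rankMat[113,]        # ranks of observation 113 under the three methods
prod(rankMat[113,])  # its rank product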
In order to obtain the p-values, the algorithm proposed by Heskes (2014) was used. Notice that the p-values are based on the geometric mean of upper and lower bounds, which are defined recursively.
filepath="../functions/"
source(paste(filepath,"Heskes_pvalues.R",sep=""))
The input to obtain the p-values is the following:
rho=rankproduct
n<-dim(data.survival)[1]
k<-dim(rankMat)[2]
The p-values are obtained as follows, where the Delta option selects the geometric mean.
pvalues<-as.vector(rankprodbounds(rho,n,k,Delta ='geometric'))
To obtain the observations with the lowest p-values, we can run the following lines:
pvalues.matrix<-cbind(pvalues)
which(pvalues.matrix<0.01)## pvalues < to 1%
## [1] 32 42 74 113 115 201 202 211 219 221 279 346 455 508
which(pvalues.matrix<0.05) ## pvalues < to 5%
## [1] 10 28 32 42 55 74 113 114 115 128 155 159 179 197 201 202 211
## [18] 219 221 277 279 311 323 346 348 372 389 405 407 415 422 452 455 466
## [35] 500 506 508 516
Another key issue when performing rank product tests is the multiple testing problem: since many observations are tested, the number of type-I errors (false positives) increases.
The FDR, which is the expected proportion of false positives among the tests declared significant, is controlled by sorting the p-values in ascending order and dividing them by their percentile rank. The measure used to estimate the FDR is the q-value. A p-value threshold of 0.05 implies that 5% of all tests will result in false positives, whereas a q-value threshold of 0.05 implies that 5% of the tests declared significant will be false positives. In this sense the q-value controls the number of false discoveries among the significant tests, and is therefore better suited to identifying truly significant results.
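As a quick cross-check, base R already provides the Benjamini-Hochberg adjustment, whose FDR-controlling adjusted p-values can be compared with the q-values computed next.
pvalues.bh<-p.adjust(pvalues, method="BH")  # Benjamini-Hochberg adjusted p-values
which(pvalues.bh<0.05)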
The qvalue package must be installed and loaded.
library(qvalue)
Here the only input needed is the vector of p-values obtained before, one for each observation.
qobj<-qvalue(pvalues)
qvalues<-qobj$qvalues
which(qvalues<0.01)
## integer(0)
which(qvalues<0.05)
## [1] 113 219 221 346 455
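For a quick overview of the estimated proportion of true null hypotheses and the number of significant calls at several cut-offs, the summary method provided by the qvalue package can be used.
summary(qobj)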
In order to summarize all the information, let us build an id vector and a data frame that combines the ranks, the p-values and the q-values.
id<-as.vector(seq(1:n))# vector with the id
tcga.outliers<-as.data.frame(cbind(id,rank_deviance,rank_dbht,rank_score,pvalues,qvalues))
Next we can obtain the top 15 outlier observations, with the qvalues sorted in increasing order.
sort.tcga.outliers <- tcga.outliers[order(qvalues) , ]
sort.tcga.outliers[1:15,]
## id rank_deviance rank_dbht rank_score pvalues qvalues
## 113 113 79 6 1 8.714172e-05 0.01505800
## 219 219 5 5 19 8.737720e-05 0.01505800
## 221 221 14 12 2 5.595444e-05 0.01505800
## 346 346 2 1 500 2.123937e-04 0.02745189
## 455 455 11 61 3 4.688042e-04 0.04847436
## 202 202 3 11 486 4.092992e-03 0.35267951
## 32 32 45 14 43 6.802920e-03 0.36002620
## 42 42 29 3 453 9.685479e-03 0.36002620
## 74 74 22 2 510 5.678407e-03 0.36002620
## 115 115 47 10 70 8.178812e-03 0.36002620
## 201 201 17 4 485 8.197544e-03 0.36002620
## 211 211 18 315 7 9.749259e-03 0.36002620
## 279 279 164 15 15 9.108626e-03 0.36002620
## 508 508 57 7 49 4.968524e-03 0.36002620
## 372 372 23 330 8 1.439240e-02 0.46505439
Notice that the observations with the lowest q-values are the ones we believe to be influential and that should be taken for further study.
Let us now look at the original dataset with the 18 covariates.
summary(data.survival)
## LPL IGF1 EDNRA MFAP5
## Min. :3.236 Min. :3.344 Min. :3.461 Min. : 2.858
## 1st Qu.:3.926 1st Qu.:4.573 1st Qu.:3.935 1st Qu.: 3.794
## Median :4.323 Median :5.388 Median :4.413 Median : 4.880
## Mean :4.668 Mean :5.590 Mean :4.662 Mean : 5.225
## 3rd Qu.:5.108 3rd Qu.:6.428 3rd Qu.:5.207 3rd Qu.: 6.489
## Max. :9.381 Max. :9.123 Max. :8.788 Max. :10.200
## LOX INHBA THBS2 ADIPOQ
## Min. :3.103 Min. :3.224 Min. : 3.535 Min. :2.833
## 1st Qu.:3.450 1st Qu.:3.637 1st Qu.: 5.052 1st Qu.:3.130
## Median :3.779 Median :3.949 Median : 6.593 Median :3.221
## Mean :3.972 Mean :4.287 Mean : 6.783 Mean :3.501
## 3rd Qu.:4.313 3rd Qu.:4.797 3rd Qu.: 8.514 3rd Qu.:3.352
## Max. :8.436 Max. :7.782 Max. :10.702 Max. :9.766
## NPY CCL11 VCAN DCN
## Min. : 2.989 Min. :3.250 Min. : 3.406 Min. : 3.529
## 1st Qu.: 3.437 1st Qu.:3.697 1st Qu.: 4.613 1st Qu.: 7.153
## Median : 3.609 Median :4.047 Median : 5.846 Median : 8.211
## Mean : 4.047 Mean :4.420 Mean : 6.050 Mean : 8.173
## 3rd Qu.: 3.994 3rd Qu.:4.839 3rd Qu.: 7.455 3rd Qu.: 9.458
## Max. :10.625 Max. :8.724 Max. :10.154 Max. :11.542
## TIMP3 CRYAB CXCL12 SPARC
## Min. :3.495 Min. : 3.160 Min. :3.349 Min. : 6.018
## 1st Qu.:4.638 1st Qu.: 6.086 1st Qu.:4.390 1st Qu.: 8.274
## Median :5.592 Median : 7.106 Median :5.256 Median : 9.104
## Mean :5.763 Mean : 7.137 Mean :5.485 Mean : 9.106
## 3rd Qu.:6.710 3rd Qu.: 8.279 3rd Qu.:6.364 3rd Qu.: 9.950
## Max. :9.676 Max. :11.356 Max. :9.406 Max. :11.698
## CNN1 FBN1
## Min. :3.715 Min. :3.604
## 1st Qu.:4.485 1st Qu.:4.587
## Median :4.747 Median :5.302
## Mean :4.932 Mean :5.550
## 3rd Qu.:5.241 3rd Qu.:6.224
## Max. :9.074 Max. :9.741
Regarding the time and status,
summary(data.time.status)
## time status
## Min. : 8 Min. :0.0000
## 1st Qu.: 376 1st Qu.:0.0000
## Median : 923 Median :1.0000
## Mean :1043 Mean :0.5493
## 3rd Qu.:1483 3rd Qu.:1.0000
## Max. :5481 Max. :1.0000
Now let us look at the covariate values of the observations that appear to be influential.
data.survival[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1
## 113 6.568068 9.314609 6.669734 5.871749
## 219 4.824111 9.307861 4.432514 5.683262
## 221 5.575577 9.289752 6.349214 5.611158
## 346 3.704212 8.624038 4.066246 4.520293
## 455 5.432554 9.675083 5.237223 6.012563
## 202 4.129214 7.526688 4.497028 4.453816
## 32 3.879755 6.924337 4.567773 4.161153
## 42 4.441712 9.012392 4.009992 5.035908
## 74 3.619439 7.511828 4.256769 4.193958
## 115 6.175115 10.238335 5.359044 7.017225
data.time.status[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## time status
## 113 3260 1
## 219 3525 0
## 221 2788 0
## 346 9 1
## 455 3532 0
## 202 9 1
## 32 1966 0
## 42 304 1
## 74 204 1
## 115 2259 0
Based on the survival times, we can see that these observations generally take extreme values: several have long follow-up times (above 2000 days), while observations 346 and 202 have very short times (9 days). Notice that observation 452, also flagged at the 5% level, has the maximum value, with time 5481. To further characterize the flagged observations, a k-means clustering of the covariate matrix with four clusters is performed next.
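Since kmeans starts from random centroids, the cluster labels and sizes reported below can change between runs; the short sketch below (with an arbitrary seed and number of starts, and a hypothetical object name) makes the clustering reproducible and less sensitive to the initialization.
set.seed(2016)                                   # arbitrary seed, for reproducibility
kc4.stable<-kmeans(data.survival, 4, nstart=25)  # 25 random starts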
kc4<-kmeans(data.survival,4)
kc4
## K-means clustering with 4 clusters of sizes 127, 124, 111, 155
##
## Cluster means:
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 4.507494 5.079685 4.049926 3.834330 3.628492 3.667702 4.975273 3.195383
## 2 4.378266 5.095369 4.072482 5.371933 3.594146 3.670304 5.277970 3.222208
## 3 5.335314 6.552470 5.835925 7.046514 4.713080 5.572208 9.416523 4.346545
## 4 4.552070 5.714899 4.795556 4.944086 4.023904 4.368217 7.582466 3.367945
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 3.851242 3.818282 4.532819 6.611667 4.718139 5.723988 4.611607
## 2 4.241725 3.859527 4.722062 7.328610 5.121022 8.077714 4.641390
## 3 4.176950 5.517743 8.430778 10.133088 7.565960 7.777593 7.099515
## 4 3.958158 4.575160 6.651720 8.723868 5.842716 7.083832 5.719584
## SPARC CNN1 FBN1
## 1 8.072844 4.631832 4.569274
## 2 8.435018 4.756706 4.747612
## 3 10.617322 5.453678 7.268519
## 4 9.405838 4.943186 5.765576
##
## Clustering vector:
## [1] 4 3 1 1 3 3 4 4 4 1 4 1 2 2 4 1 2 2 1 1 1 1 1 2 1 3 2 2 2 1 2 2 1 2 1
## [36] 2 1 2 1 2 4 1 4 4 1 1 2 4 4 2 2 2 1 1 2 2 1 1 2 1 2 2 4 4 2 2 2 4 4 2
## [71] 3 1 1 1 1 1 2 1 4 2 2 1 4 4 4 4 4 1 2 4 4 4 2 1 2 1 3 2 2 4 1 2 2 1 1
## [106] 2 2 4 1 4 3 1 4 4 3 4 1 4 3 2 4 3 4 4 3 4 4 2 2 2 3 4 3 2 1 1 3 2 2 2
## [141] 4 4 4 3 1 3 2 4 3 4 4 3 4 3 1 3 3 4 1 2 1 3 4 2 4 4 3 4 3 3 3 1 4 3 2
## [176] 3 4 4 4 4 1 1 2 1 3 4 1 1 1 4 2 1 4 1 4 4 1 2 2 2 1 1 4 1 2 4 4 3 3 4
## [211] 4 4 1 4 1 4 1 4 2 3 4 4 4 1 4 3 3 2 4 4 1 2 4 1 2 4 3 1 3 2 4 3 2 3 2
## [246] 4 2 3 3 4 3 1 1 1 1 1 1 1 1 3 3 1 1 4 1 2 2 1 1 1 2 1 3 3 3 1 1 2 2 1
## [281] 2 4 4 1 1 4 4 3 3 1 2 4 4 3 4 2 1 2 4 2 2 3 4 3 3 2 2 4 4 4 4 2 3 3 4
## [316] 3 3 4 3 4 1 3 2 4 2 1 1 4 4 1 3 2 4 4 1 4 4 3 4 3 4 3 2 3 2 1 2 3 2 3
## [351] 1 1 4 1 3 2 2 4 4 2 3 4 2 3 3 3 4 4 4 4 3 3 4 3 1 1 3 4 3 2 1 1 3 1 3
## [386] 4 2 1 4 3 3 3 4 3 2 1 2 3 3 2 1 4 1 4 3 2 4 4 4 1 3 2 3 4 3 2 4 3 3 2
## [421] 3 4 3 3 3 4 3 1 3 4 3 3 4 3 1 2 4 2 3 4 2 3 1 4 4 2 3 1 2 1 2 4 1 2 4
## [456] 4 3 4 2 3 4 4 4 4 2 2 3 1 2 3 2 4 4 1 4 1 2 2 1 2 2 1 1 4 1 4 2 4 4 2
## [491] 3 2 3 4 3 3 4 3 4 1 2 2 1 1 1 1 4 2 4 1 3 4 2 2 2 1 4
##
## Within cluster sum of squares by cluster:
## [1] 1543.203 1681.734 1688.129 2159.681
## (between_SS / total_SS = 52.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
mydata <- data.frame(data.survival, kc4$cluster)
mydata[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1 kc4.cluster
## 113 6.568068 9.314609 6.669734 5.871749 4
## 219 4.824111 9.307861 4.432514 5.683262 2
## 221 5.575577 9.289752 6.349214 5.611158 4
## 346 3.704212 8.624038 4.066246 4.520293 1
## 455 5.432554 9.675083 5.237223 6.012563 4
## 202 4.129214 7.526688 4.497028 4.453816 1
## 32 3.879755 6.924337 4.567773 4.161153 2
## 42 4.441712 9.012392 4.009992 5.035908 1
## 74 3.619439 7.511828 4.256769 4.193958 1
## 115 6.175115 10.238335 5.359044 7.017225 3
The observations with a q-value smaller than 5% are:
mydata[c(113, 219, 221, 346, 455),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## CXCL12 SPARC CNN1 FBN1 kc4.cluster
## 113 6.568068 9.314609 6.669734 5.871749 4
## 219 4.824111 9.307861 4.432514 5.683262 2
## 221 5.575577 9.289752 6.349214 5.611158 4
## 346 3.704212 8.624038 4.066246 4.520293 1
## 455 5.432554 9.675083 5.237223 6.012563 4
The same clustering is now repeated with five clusters.
kc5<-kmeans(data.survival,5)
kc5
## K-means clustering with 5 clusters of sizes 124, 101, 98, 97, 97
##
## Cluster means:
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 4.618342 5.807890 4.910235 5.152469 4.120732 4.509613 7.922721 3.436084
## 2 5.359920 6.600361 5.895569 7.143613 4.738249 5.638378 9.464933 4.395812
## 3 4.415086 5.332284 4.181659 5.282843 3.635519 3.781646 5.902508 3.214585
## 4 4.258753 4.729927 3.974906 4.587060 3.568523 3.593207 4.609537 3.208088
## 5 4.673645 5.380388 4.234404 3.902078 3.725651 3.800970 5.596883 3.232988
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 4.028975 4.678772 6.943050 8.933593 6.071771 7.051747 5.854258
## 2 4.202736 5.569152 8.480410 10.180507 7.652817 7.849420 7.163588
## 3 4.095147 3.979528 5.173194 7.977305 5.360167 8.426339 4.861124
## 4 3.866376 3.767748 4.203868 6.078190 4.468760 7.118908 4.313165
## 5 4.039233 3.989742 5.111638 7.402521 5.103638 5.220272 5.067739
## SPARC CNN1 FBN1
## 1 9.630790 4.975456 5.955204
## 2 10.661374 5.494766 7.364244
## 3 8.860077 4.806302 5.049814
## 4 7.674751 4.653158 4.314266
## 5 8.493464 4.694065 4.885280
##
## Clustering vector:
## [1] 1 2 5 4 2 2 3 1 1 4 5 4 3 4 5 5 3 4 5 5 4 5 5 3 5 2 3 4 4 4 4 4 4 3 5
## [36] 3 5 4 5 4 1 5 1 3 5 5 4 5 1 5 3 3 4 5 4 4 5 5 3 5 5 3 1 1 3 3 4 1 1 4
## [71] 2 4 5 4 5 4 3 4 1 3 4 4 1 1 3 3 1 4 3 1 1 1 4 4 4 5 2 4 4 1 4 3 4 4 4
## [106] 4 5 3 4 3 2 5 1 5 2 3 5 1 2 4 1 2 1 1 2 1 5 4 3 3 2 1 2 4 4 5 2 5 4 4
## [141] 1 1 1 1 4 2 3 1 2 1 1 2 1 1 5 2 2 5 4 3 5 2 1 4 1 1 2 1 2 2 2 4 1 2 3
## [176] 1 1 5 1 1 4 5 4 5 2 3 5 5 5 1 3 5 1 4 1 1 4 3 3 4 4 5 1 5 3 1 1 2 2 1
## [211] 3 3 5 1 4 3 5 3 3 2 1 1 1 4 1 2 2 3 1 1 4 3 1 4 4 1 2 4 2 3 1 2 3 2 3
## [246] 1 3 2 2 3 2 5 4 4 5 5 4 5 4 2 2 4 5 1 5 3 3 4 5 5 3 4 1 2 2 4 5 3 3 5
## [281] 3 1 3 5 4 1 3 2 2 5 3 1 5 1 1 5 5 4 1 3 3 1 1 1 2 3 4 1 1 1 1 3 2 2 5
## [316] 2 2 5 2 1 5 2 3 3 3 4 5 1 1 4 2 3 1 1 5 5 1 1 1 2 3 2 3 2 3 5 4 2 5 2
## [351] 5 5 5 4 2 4 3 5 1 4 2 1 4 2 2 2 1 1 1 5 2 2 1 2 4 4 2 1 2 3 5 5 2 4 2
## [386] 1 3 5 3 2 2 2 1 2 3 5 4 2 2 3 5 1 5 3 2 3 1 5 1 5 2 3 1 1 2 3 5 2 2 3
## [421] 2 1 2 2 2 1 2 5 2 1 2 2 1 2 4 3 1 4 2 5 3 2 5 1 1 4 2 5 3 5 4 1 5 3 1
## [456] 3 2 1 3 2 1 1 1 1 3 3 2 4 3 2 4 1 5 5 1 4 3 3 4 3 3 4 5 5 5 3 4 1 3 3
## [491] 2 3 1 1 2 2 1 2 1 5 3 3 4 4 4 4 1 3 1 5 2 1 3 4 3 5 1
##
## Within cluster sum of squares by cluster:
## [1] 1645.332 1545.320 1178.679 1161.628 1179.167
## (between_SS / total_SS = 55.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
mydata2 <- data.frame(data.survival, kc5$cluster)
mydata2[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1 kc5.cluster
## 113 6.568068 9.314609 6.669734 5.871749 1
## 219 4.824111 9.307861 4.432514 5.683262 3
## 221 5.575577 9.289752 6.349214 5.611158 1
## 346 3.704212 8.624038 4.066246 4.520293 5
## 455 5.432554 9.675083 5.237223 6.012563 1
## 202 4.129214 7.526688 4.497028 4.453816 5
## 32 3.879755 6.924337 4.567773 4.161153 4
## 42 4.441712 9.012392 4.009992 5.035908 5
## 74 3.619439 7.511828 4.256769 4.193958 4
## 115 6.175115 10.238335 5.359044 7.017225 2
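To check how stable the grouping of the flagged observations is across the two solutions, the 4-cluster and 5-cluster assignments can be cross-tabulated.
table(kc4=kc4$cluster, kc5=kc5$cluster)                           # correspondence between the two solutions
cbind(kc4=kc4$cluster, kc5=kc5$cluster)[c(113,219,221,346,455),]  # labels of the flagged observations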