The ovarian cancer dataset is based on gene expression data from oncological patients and comprises 517 observations over 12042 covariates. The data were obtained from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/).
The dataset is publicly available (https://gdc-portal.nci.nih.gov/) and was normalized and aggregated by the TCGA consortium, so the analysis can be reproduced with the original dataset. We only removed corrupted case records from the data, i.e., cases where related attributes carried inconsistent information.
The clinical data were cleaned using the attributes “Days to last follow-up” and “Days to death” to detect inconsistencies between them; only the cases where the number of days matched were included in the analysis. The same process was applied to the attributes “Days to death” and “Vital status”, since some cases had the status “deceased” but a missing “Days to death”. A sketch of this cleaning step is given below.
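This is only a sketch: the data frame clinical and the column names days_to_last_followup, days_to_death and vital_status are hypothetical placeholders for the corresponding TCGA clinical attributes.
consistent.days <- is.na(clinical$days_to_death) |
  is.na(clinical$days_to_last_followup) |
  clinical$days_to_death == clinical$days_to_last_followup   # days must match when both are present
consistent.status <- !(clinical$vital_status == "deceased" &
  is.na(clinical$days_to_death))                              # deceased cases need a "Days to death"
clinical.clean <- clinical[consistent.days & consistent.status, ]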
For the analysis, 18 genes were considered, based on those selected in a previous study by Zhang (2013).
Let's read the dataset. Do not forget to change the file path to the correct directory on your computer.
The covariate matrix has to be defined; the data are stored in a txt file.
data.survival<- read.table("./tcga18.txt",dec = ",", header = TRUE)
head(data.survival)
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 5.610701 7.511564 5.251405 4.348473 3.753541 4.789777 8.197449 4.852636
## 2 3.927762 6.238349 5.916283 6.667097 4.025553 4.925367 8.380464 3.204983
## 3 6.667391 4.218046 4.300340 3.395904 3.872636 3.835449 5.995941 3.164932
## 4 3.737681 6.878165 3.823494 2.997130 3.349100 3.585837 4.163529 3.209809
## 5 4.649516 7.648314 6.883327 7.845051 4.645337 5.202604 9.693533 3.193201
## 6 3.693657 6.110123 4.855396 6.772189 4.604459 6.536206 10.045702 3.260278
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 3.468236 4.035410 6.582996 7.641697 6.138341 5.342155 6.256299
## 2 3.401826 4.431276 7.088650 10.527365 6.665464 7.184504 7.493833
## 3 3.538999 4.054605 3.874755 7.404490 4.243858 4.089568 4.210808
## 4 3.499236 3.793138 3.569516 3.744800 3.656709 6.956911 6.939087
## 5 3.251680 5.470451 8.267006 9.763219 6.290778 10.038258 5.326372
## 6 3.445472 5.797378 8.239962 9.495431 7.657357 6.589208 6.701060
## SPARC CNN1 FBN1
## 1 9.882551 5.000291 5.514547
## 2 10.482700 6.432753 7.220278
## 3 7.042135 4.435957 4.330573
## 4 7.010820 4.459158 3.850500
## 5 11.077949 9.074373 7.972447
## 6 10.894212 6.020242 6.567972
And the matrix for the time and status is given by:
data.time.status<-read.table("./timestatus.txt",dec = ",", header = TRUE)
head(data.time.status)
## time status
## 1 1336 1
## 2 1247 1
## 3 55 1
## 4 1495 0
## 5 61 1
## 6 1418 0
A Cox proportional hazards model was fitted with all covariates. Let's start the analysis with the Cox regression model by loading the survival package.
library(survival)
Fit the Cox proportional hazards model to the dataset.
fit.cox.all<-coxph(Surv(data.time.status$time,data.time.status$status) ~ .,data=data.survival)
fit.cox.all
## Call:
## coxph(formula = Surv(data.time.status$time, data.time.status$status) ~
## ., data = data.survival)
##
## coef exp(coef) se(coef) z p
## LPL 0.1263 1.1346 0.0751 1.68 0.0924
## IGF1 0.0210 1.0212 0.0600 0.35 0.7266
## EDNRA 0.0224 1.0227 0.1227 0.18 0.8549
## MFAP5 0.0165 1.0166 0.0482 0.34 0.7327
## LOX 0.1918 1.2114 0.1251 1.53 0.1254
## INHBA -0.1432 0.8666 0.1786 -0.80 0.4227
## THBS2 0.0639 1.0660 0.0902 0.71 0.4787
## ADIPOQ -0.1256 0.8819 0.0910 -1.38 0.1676
## NPY 0.0552 1.0567 0.0496 1.11 0.2655
## CCL11 -0.1296 0.8785 0.0960 -1.35 0.1771
## VCAN 0.0578 1.0595 0.1009 0.57 0.5664
## DCN 0.0729 1.0757 0.0892 0.82 0.4133
## TIMP3 0.0719 1.0745 0.0835 0.86 0.3891
## CRYAB 0.1092 1.1154 0.0424 2.58 0.0100
## CXCL12 0.0204 1.0206 0.0818 0.25 0.8030
## SPARC -0.3811 0.6831 0.1402 -2.72 0.0066
## CNN1 0.0863 1.0901 0.1141 0.76 0.4493
## FBN1 0.1135 1.1202 0.1690 0.67 0.5018
##
## Likelihood ratio test=26.8 on 18 df, p=0.0837
## n= 517, number of events= 284
Next, check whether the proportional hazards assumption is violated. According to the results obtained, the assumption is not violated: the global test gives p = 0.38, and only ADIPOQ shows a borderline individual p-value (0.027).
pph<-cox.zph(fit.cox.all, transform="km", global=TRUE)
pph
## rho chisq p
## LPL 0.0584 1.09e+00 0.2970
## IGF1 -0.0378 4.75e-01 0.4909
## EDNRA -0.0256 3.00e-01 0.5839
## MFAP5 0.0674 1.52e+00 0.2184
## LOX 0.0179 1.09e-01 0.7418
## INHBA 0.0118 4.31e-02 0.8355
## THBS2 -0.0379 4.56e-01 0.4997
## ADIPOQ -0.1246 4.87e+00 0.0273
## NPY -0.0310 3.28e-01 0.5669
## CCL11 0.0665 1.36e+00 0.2440
## VCAN 0.0240 1.81e-01 0.6705
## DCN -0.0100 3.12e-02 0.8598
## TIMP3 -0.0202 1.27e-01 0.7216
## CRYAB -0.0544 9.81e-01 0.3219
## CXCL12 -0.0014 6.83e-04 0.9791
## SPARC 0.0577 9.08e-01 0.3406
## CNN1 -0.0850 2.82e+00 0.0928
## FBN1 -0.0218 1.91e-01 0.6618
## GLOBAL NA 1.92e+01 0.3795
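The proportional hazards diagnostics can also be inspected graphically. As a quick sketch, the scaled Schoenfeld residuals of ADIPOQ (the 8th covariate, the only one with a borderline individual p-value) can be plotted over time; an approximately flat smooth supports the assumption.
plot(pph[8])                                  # scaled Schoenfeld residuals for ADIPOQ
abline(h=coef(fit.cox.all)["ADIPOQ"], lty=2)  # time-constant fitted coefficient, for reference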
In order to see whether the results obtained above are consistent, a robust version of the Cox regression model is fitted next. First, load the following package.
library(coxrobust)
To see the results obtained by fitting the robust Cox model using exponential weights with a 5% level of trimming, run the following:
fit.coxrobust<-coxr(Surv(data.time.status$time,data.time.status$status) ~ .,data=data.survival,trunc=0.95, f.weight="exponential",singular.ok=T,model=F)
fit.coxrobust
##
## Call:
## coxr(formula = Surv(data.time.status$time, data.time.status$status) ~ ., data = data.survival, trunc = 0.95, f.weight = "exponential", singular.ok = T, model = F)
##
## Partial likelihood estimator
## coef exp(coef) se(coef) p
## LPL 0.1263 1.135 0.0751 0.09257
## IGF1 0.0207 1.021 0.0601 0.73040
## EDNRA 0.0223 1.023 0.1227 0.85570
## MFAP5 0.0166 1.017 0.0482 0.73068
## LOX 0.1922 1.212 0.1251 0.12450
## INHBA -0.1431 0.867 0.1786 0.42301
## THBS2 0.0640 1.066 0.0902 0.47786
## ADIPOQ -0.1254 0.882 0.0910 0.16841
## NPY 0.0552 1.057 0.0496 0.26528
## CCL11 -0.1296 0.878 0.0960 0.17703
## VCAN 0.0576 1.059 0.1009 0.56776
## DCN 0.0728 1.076 0.0892 0.41396
## TIMP3 0.0719 1.074 0.0835 0.38923
## CRYAB 0.1092 1.115 0.0424 0.00999
## CXCL12 0.0204 1.021 0.0818 0.80330
## SPARC -0.3818 0.683 0.1403 0.00649
## CNN1 0.0866 1.091 0.1141 0.44763
## FBN1 0.1141 1.121 0.1690 0.49972
##
## Wald test=27 on 18 df, p=0.079
##
## Robust estimator
## coef exp(coef) se(coef) p
## LPL 0.10107 1.106 0.0856 0.2375
## IGF1 0.03407 1.035 0.0705 0.6289
## EDNRA 0.06185 1.064 0.2119 0.7704
## MFAP5 0.00888 1.009 0.0622 0.8865
## LOX 0.16876 1.184 0.1499 0.2604
## INHBA -0.15555 0.856 0.1895 0.4118
## THBS2 0.08631 1.090 0.1072 0.4205
## ADIPOQ -0.07269 0.930 0.1047 0.4875
## NPY 0.06251 1.065 0.0710 0.3785
## CCL11 -0.15784 0.854 0.1212 0.1927
## VCAN 0.02857 1.029 0.1419 0.8404
## DCN 0.07907 1.082 0.0993 0.4257
## TIMP3 0.07745 1.081 0.0906 0.3925
## CRYAB 0.11793 1.125 0.0544 0.0302
## CXCL12 0.01292 1.013 0.0962 0.8932
## SPARC -0.39783 0.672 0.2020 0.0489
## CNN1 0.13126 1.140 0.1395 0.3468
## FBN1 0.11223 1.119 0.2234 0.6154
##
## Extended Wald test=27.8 on 18 df, p=0.0647
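To compare the two sets of estimates directly, a small sketch is given below; it assumes the coxr object stores the partial likelihood and robust estimates in the components ple.coefficients and coefficients, as documented for the coxrobust package.
coef.compare<-cbind(partial=fit.coxrobust$ple.coefficients, robust=fit.coxrobust$coefficients)
round(coef.compare,3)                       # side-by-side estimates
plot(coef.compare, xlab="Partial likelihood estimate", ylab="Robust estimate")
abline(a=0, b=1, lty=2)                     # points far from the diagonal suggest influential observations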
Another proposal for a robust Cox model is given by Heritier (2009) and is fitted next.
filepath="../functions/"
source(paste(filepath,"Chapter7_functions.r",sep=""))
The model fitted using exponential weights with a 5% level of trimming is given below.
fit.coxrobust.H=rcoxph(data.time.status$time,data.time.status$status,data.survival,wt.type="exponential",quant=.95)
fit.coxrobust.H$coefficients
## estimate SE z p.value
## LPL 0.101118182 0.07169238 1.4104454 0.15840
## IGF1 0.034040440 0.06700056 0.5080620 0.61140
## EDNRA 0.062118582 0.14823520 0.4190542 0.67517
## MFAP5 0.008912579 0.05164610 0.1725702 0.86298
## LOX 0.168964994 0.12809588 1.3190510 0.18715
## INHBA -0.155638383 0.18406256 -0.8455733 0.39779
## THBS2 0.086216158 0.09076783 0.9498537 0.34218
## ADIPOQ -0.072837762 0.10007700 -0.7278172 0.46672
## NPY 0.062470225 0.05534324 1.1287778 0.25899
## CCL11 -0.157642473 0.10130146 -1.5561718 0.11966
## VCAN 0.028569797 0.09559983 0.2988478 0.76505
## DCN 0.079087093 0.09756531 0.8106067 0.41759
## TIMP3 0.077490703 0.08806427 0.8799335 0.37889
## CRYAB 0.117990400 0.04369162 2.7005269 0.00692
## CXCL12 0.012983156 0.08792611 0.1476598 0.88261
## SPARC -0.397470586 0.13323024 -2.9833361 0.00285
## CNN1 0.131279965 0.13408693 0.9790660 0.32754
## FBN1 0.111613834 0.18058542 0.6180667 0.53653
The only statistically significant genes, for both the Cox model and the robust Cox fits, were CRYAB and SPARC.
The next figure shows that observations 113 and 219 are identified as influential observations, in the sense that the weight assigned to each of them is the lowest among all observations.
In the next section we evaluate whether the observations identified above are indeed influential.
In order to identify which observations might be outliers, several outlier detection methods for survival analysis are performed.
The first method used for outlier detection is based on residuals, namely the deviance residuals.
The results for the deviance residuals are shown below; observations 415 and 346 present the largest absolute values.
res.dev.2<- resid(fit.cox.all,type="deviance")
head(sort(abs(res.dev.2),decreasing = TRUE))
## 415 346 202 311 219 310
## 3.158922 3.156986 2.999861 2.749230 2.669330 2.664042
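A quick plot of these residuals helps to spot the extreme observations; the cut-off of 2.5 used below is only a common rule of thumb, not a formal test.
plot(res.dev.2, xlab="Observation index", ylab="Deviance residual")
abline(h=c(-2.5,2.5), lty=2)                           # rule-of-thumb thresholds
big<-which(abs(res.dev.2)>2.5)
text(big, res.dev.2[big], labels=big, pos=3, cex=0.7)  # label the flagged observations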
The outlier detection method based on the concordance c-index relies on the definition of an outlier as “an observation that, when absent from the data, will likely decrease the prediction error of the fitted model”. The concordance c-index is used as a test statistic that is sensitive to the presence of outliers, i.e., the larger the number of outliers, the lower the performance of the fitted model. The concordance c-index measures how well the predicted values are concordant with the rank-ordered response variable. Two algorithms based on this idea are performed next. First, let's call the package that implements this outlier detection method.
library("BCSOD")
The input of this algorithm is:
The dual bootstrap hypothesis testing (DBHT) is an improvement on the BHT described before. The BHT removes one observation at a time from the dataset and evaluates the impact of each removal on the concordance c-index. Notice that the refitted model has fewer observations than the original dataset, so the concordance c-index tends to increase, which may hamper the evaluation of statistical significance and increase the number of false positives.
The input is the same as before; notice that B.N corresponds to the number of observations.
Here are the top observations, with the lowest p-values.
outliers_dbht1000[1:10,]
## pvalues
## [1,] 346 4.223057e-20
## [2,] 74 1.628611e-16
## [3,] 42 1.535752e-15
## [4,] 201 8.492635e-10
## [5,] 219 4.414806e-09
## [6,] 113 7.301784e-09
## [7,] 508 2.953401e-08
## [8,] 277 2.739335e-07
## [9,] 466 3.007432e-07
## [10,] 115 5.547855e-07
dbht<-as.data.frame(outliers_dbht1000) # convert to a data frame (easier to handle)
names(dbht)[1]="id" # rename the first column
dbht.order<-dbht[order(dbht$id),]
dbht.order[1:10,]
## id pvalues
## 345 1 7.981046e-01
## 301 2 6.801899e-01
## 136 3 1.550638e-01
## 211 4 3.970933e-01
## 509 5 9.999235e-01
## 504 6 9.993949e-01
## 363 7 8.307585e-01
## 17 8 2.927809e-05
## 323 9 7.307515e-01
## 63 10 2.272372e-02
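The idea behind these bootstrap tests can be illustrated with a bare-bones leave-one-out computation, shown below. This sketch is not the BCSOD implementation: it simply refits the Cox model without each observation and records the change in the concordance c-index, using the concordance function available in recent versions of the survival package (it is slow, since the model is refitted n times).
c.full<-concordance(fit.cox.all)$concordance
delta.c<-sapply(seq_len(nrow(data.survival)), function(i){
  fit.i<-coxph(Surv(data.time.status$time[-i], data.time.status$status[-i]) ~ .,
               data=data.survival[-i,])
  concordance(fit.i)$concordance - c.full   # change in c-index when observation i is removed
})
head(order(delta.c, decreasing=TRUE))       # removals that increase the c-index the most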
Censored quantile regression was introduced by Portnoy (2003) as an alternative to the Cox regression model for survival data with censored observations. Because the Cox regression model assumes proportional hazards, censored quantile regression can provide more flexibility. The method proposed by Portnoy (2003) is a generalization of the Kaplan-Meier one-sample estimator.
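As a point of reference (separate from the outlier detection workflow that follows), Portnoy's estimator itself can be fitted with the crq function of the quantreg package; the sketch below is restricted to two genes and assumes the default right-censoring setup.
library(quantreg)
fit.crq<-crq(Surv(data.time.status$time, data.time.status$status) ~ CRYAB + SPARC,
             data=data.survival, method="Portnoy")
summary(fit.crq, taus=c(0.25,0.50,0.75))    # coefficients at selected quantiles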
Here two algorithms are considered. The results and the command lines to obtain the outliers based on this theory are presented next. First, load the package OutlierDC and make sure that the covariate matrix is in the form of a data frame.
library(OutlierDC)
## Loading required package: quantreg
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'quantreg'
## The following object is masked from 'package:survival':
##
## untangle.specials
## Loading required package: Formula
##
## Package OutlierDC (0.3-0) loaded.
data.survival<-as.data.frame(data.survival)
The score algorithm arises from the outlier detection method for the Cox regression model with censored data proposed by Nardi (1999), but is now based on quantile regression.
odcS_tcga<-odc(formula=Surv((data.time.status$time),data.time.status$status)~
LPL+IGF1+EDNRA+MFAP5+LOX+INHBA+THBS2+ADIPOQ+ NPY+ CCL11+VCAN+ DCN+ TIMP3+ CRYAB+ CXCL12+ SPARC+ CNN1+ FBN1, data = data.survival,method="score",rq.model="Wang",h=0.05)
## Please wait...
## Done.
The top outliers are:
head(sort(odcS_tcga@score,decreasing = TRUE))
## 113 221 455 26 452 297
## 10.239740 8.456885 5.719224 5.289057 4.443739 4.214216
Put the scores obtained above in the form of a vector.
odS.vec<-as.vector(odcS_tcga@score)
Define a matrix with the id and the values obtained from the score algorithm.
n<-dim(data.survival)[1] # number of observations
id<-as.vector(seq(1:n)) # vector with the ids
odS.matrix<-as.data.frame(cbind(id,odS.vec))
In order to obtain a global result that combines the techniques performed above, a rank product test is used. First, the ranks given by each outlier detection method are computed.
res.dev.vec<-as.vector(abs(res.dev.2))
rank_deviance<-rank(-res.dev.vec,ties.method = "first")
rank_dbht<-rank(dbht.order$pvalues,ties.method = "first")
rank_score<-rank(-(odS.matrix$odS.vec),ties.method ="first")
Now, based on the ranks above, let us obtain the rank product. First, build the rank matrix that gathers the ranks obtained by each method.
rankMat<-cbind(rank_deviance,rank_dbht,rank_score)
Second, compute the rank product, which is obtained row-wise from the rank matrix.
RankProduct <- function(rankMat)
{
  return(apply(rankMat, 1, prod))
}
rankproduct<-RankProduct(rankMat)
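For a single observation, the rank product is simply the product of its three ranks; for example, for observation 113:
rankMat[113,]        # ranks of observation 113 under the three methods
prod(rankMat[113,])  # its rank product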
In order to obtain the p-values, the algorithm proposed by Heskes (2014) was used. Notice that the p-values are based on the geometric mean of upper and lower bounds, which are defined recursively.
filepath="../functions/"
source(paste(filepath,"Heskes_pvalues.R",sep=""))
The input to obtain the p-values is the following:
rho=rankproduct
n<-dim(data.survival)[1]
k<-dim(rankMat)[2]
The p-values are obtained as follows, where the Delta option selects the geometric mean.
pvalues<-as.vector(rankprodbounds(rho,n,k,Delta ='geometric'))
To obtain the observations with the lowest p-values, we can run the following lines:
pvalues.matrix<-cbind(pvalues)
which(pvalues.matrix<0.01)## pvalues < to 1%
## [1] 32 42 74 113 115 201 202 211 219 221 279 346 455 508
which(pvalues.matrix<0.05) ## pvalues < to 5%
## [1] 10 28 32 42 55 74 113 114 115 128 155 159 179 197 201 202 211
## [18] 219 221 277 279 311 323 346 348 372 389 405 407 415 422 452 455 466
## [35] 500 506 508 516
Another key issue when performing rank product tests is the multiple testing problem: since many observations are tested, the number of type-I errors (false positives) increases.
The FDR, which is the expected proportion of false positives among the tests declared significant, is controlled by sorting the p-values in ascending order and dividing them by their percentile rank. The measure used to estimate the FDR is the q-value. A p-value threshold of 0.05 implies that 5% of all tests will result in false positives, whereas a q-value threshold of 0.05 implies that 5% of the tests declared significant will be false positives. In this sense the q-value controls the number of false discoveries among the significant tests, and is therefore better suited to identifying truly significant results.
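As a quick cross-check, base R already provides the Benjamini-Hochberg adjustment, whose FDR-controlling adjusted p-values can be compared with the q-values computed next.
pvalues.bh<-p.adjust(pvalues, method="BH")  # Benjamini-Hochberg adjusted p-values
which(pvalues.bh<0.05)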
The qvalue package must be installed and loaded.
library(qvalue)
Here the only input needed is the vector of p-values obtained before, one for each observation.
qobj<-qvalue(pvalues)
qvalues<-qobj$qvalues
which(qvalues<0.01)
## integer(0)
which(qvalues<0.05)
## [1] 113 219 221 346 455
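For a quick overview of the estimated proportion of true null hypotheses and the number of significant calls at several cut-offs, the summary method provided by the qvalue package can be used.
summary(qobj)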
In order to summarize all the information, let us build an id vector and a data frame that combines the ranks, the p-values and the q-values.
id<-as.vector(seq(1:n))# vector with the id
tcga.outliers<-as.data.frame(cbind(id,rank_deviance,rank_dbht,rank_score,pvalues,qvalues))
Next we can obtain the top 15 outlier observations, with the qvalues sorted in increasing order.
sort.tcga.outliers <- tcga.outliers[order(qvalues) , ]
sort.tcga.outliers[1:15,]
## id rank_deviance rank_dbht rank_score pvalues qvalues
## 113 113 79 6 1 8.714172e-05 0.01505800
## 219 219 5 5 19 8.737720e-05 0.01505800
## 221 221 14 12 2 5.595444e-05 0.01505800
## 346 346 2 1 500 2.123937e-04 0.02745189
## 455 455 11 61 3 4.688042e-04 0.04847436
## 202 202 3 11 486 4.092992e-03 0.35267951
## 32 32 45 14 43 6.802920e-03 0.36002620
## 42 42 29 3 453 9.685479e-03 0.36002620
## 74 74 22 2 510 5.678407e-03 0.36002620
## 115 115 47 10 70 8.178812e-03 0.36002620
## 201 201 17 4 485 8.197544e-03 0.36002620
## 211 211 18 315 7 9.749259e-03 0.36002620
## 279 279 164 15 15 9.108626e-03 0.36002620
## 508 508 57 7 49 4.968524e-03 0.36002620
## 372 372 23 330 8 1.439240e-02 0.46505439
Notice that the observations with the lowest q-values are the ones we believe to be influential and that should be taken for further study.
Let us now look at the original dataset with the 18 covariates.
summary(data.survival)
## LPL IGF1 EDNRA MFAP5
## Min. :3.236 Min. :3.344 Min. :3.461 Min. : 2.858
## 1st Qu.:3.926 1st Qu.:4.573 1st Qu.:3.935 1st Qu.: 3.794
## Median :4.323 Median :5.388 Median :4.413 Median : 4.880
## Mean :4.668 Mean :5.590 Mean :4.662 Mean : 5.225
## 3rd Qu.:5.108 3rd Qu.:6.428 3rd Qu.:5.207 3rd Qu.: 6.489
## Max. :9.381 Max. :9.123 Max. :8.788 Max. :10.200
## LOX INHBA THBS2 ADIPOQ
## Min. :3.103 Min. :3.224 Min. : 3.535 Min. :2.833
## 1st Qu.:3.450 1st Qu.:3.637 1st Qu.: 5.052 1st Qu.:3.130
## Median :3.779 Median :3.949 Median : 6.593 Median :3.221
## Mean :3.972 Mean :4.287 Mean : 6.783 Mean :3.501
## 3rd Qu.:4.313 3rd Qu.:4.797 3rd Qu.: 8.514 3rd Qu.:3.352
## Max. :8.436 Max. :7.782 Max. :10.702 Max. :9.766
## NPY CCL11 VCAN DCN
## Min. : 2.989 Min. :3.250 Min. : 3.406 Min. : 3.529
## 1st Qu.: 3.437 1st Qu.:3.697 1st Qu.: 4.613 1st Qu.: 7.153
## Median : 3.609 Median :4.047 Median : 5.846 Median : 8.211
## Mean : 4.047 Mean :4.420 Mean : 6.050 Mean : 8.173
## 3rd Qu.: 3.994 3rd Qu.:4.839 3rd Qu.: 7.455 3rd Qu.: 9.458
## Max. :10.625 Max. :8.724 Max. :10.154 Max. :11.542
## TIMP3 CRYAB CXCL12 SPARC
## Min. :3.495 Min. : 3.160 Min. :3.349 Min. : 6.018
## 1st Qu.:4.638 1st Qu.: 6.086 1st Qu.:4.390 1st Qu.: 8.274
## Median :5.592 Median : 7.106 Median :5.256 Median : 9.104
## Mean :5.763 Mean : 7.137 Mean :5.485 Mean : 9.106
## 3rd Qu.:6.710 3rd Qu.: 8.279 3rd Qu.:6.364 3rd Qu.: 9.950
## Max. :9.676 Max. :11.356 Max. :9.406 Max. :11.698
## CNN1 FBN1
## Min. :3.715 Min. :3.604
## 1st Qu.:4.485 1st Qu.:4.587
## Median :4.747 Median :5.302
## Mean :4.932 Mean :5.550
## 3rd Qu.:5.241 3rd Qu.:6.224
## Max. :9.074 Max. :9.741
Regarding the time and status,
summary(data.time.status)
## time status
## Min. : 8 Min. :0.0000
## 1st Qu.: 376 1st Qu.:0.0000
## Median : 923 Median :1.0000
## Mean :1043 Mean :0.5493
## 3rd Qu.:1483 3rd Qu.:1.0000
## Max. :5481 Max. :1.0000
Now let us look at the covariate values of the observations that appear to be influential.
data.survival[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1
## 113 6.568068 9.314609 6.669734 5.871749
## 219 4.824111 9.307861 4.432514 5.683262
## 221 5.575577 9.289752 6.349214 5.611158
## 346 3.704212 8.624038 4.066246 4.520293
## 455 5.432554 9.675083 5.237223 6.012563
## 202 4.129214 7.526688 4.497028 4.453816
## 32 3.879755 6.924337 4.567773 4.161153
## 42 4.441712 9.012392 4.009992 5.035908
## 74 3.619439 7.511828 4.256769 4.193958
## 115 6.175115 10.238335 5.359044 7.017225
data.time.status[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## time status
## 113 3260 1
## 219 3525 0
## 221 2788 0
## 346 9 1
## 455 3532 0
## 202 9 1
## 32 1966 0
## 42 304 1
## 74 204 1
## 115 2259 0
Based on the survival times, we can see that these observations generally take extreme values: several have long follow-up times (above 2000 days), while observations 346 and 202 have very short times (9 days). Notice that observation 452, also flagged at the 5% level, has the maximum value, with time 5481. To further characterize the flagged observations, a k-means clustering of the covariate matrix with four clusters is performed next.
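Since kmeans starts from random centroids, the cluster labels and sizes reported below can change between runs; the short sketch below (with an arbitrary seed and number of starts, and a hypothetical object name) makes the clustering reproducible and less sensitive to the initialization.
set.seed(2016)                                   # arbitrary seed, for reproducibility
kc4.stable<-kmeans(data.survival, 4, nstart=25)  # 25 random starts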
kc4<-kmeans(data.survival,4)
kc4
## K-means clustering with 4 clusters of sizes 127, 124, 111, 155
##
## Cluster means:
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 4.507494 5.079685 4.049926 3.834330 3.628492 3.667702 4.975273 3.195383
## 2 4.378266 5.095369 4.072482 5.371933 3.594146 3.670304 5.277970 3.222208
## 3 5.335314 6.552470 5.835925 7.046514 4.713080 5.572208 9.416523 4.346545
## 4 4.552070 5.714899 4.795556 4.944086 4.023904 4.368217 7.582466 3.367945
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 3.851242 3.818282 4.532819 6.611667 4.718139 5.723988 4.611607
## 2 4.241725 3.859527 4.722062 7.328610 5.121022 8.077714 4.641390
## 3 4.176950 5.517743 8.430778 10.133088 7.565960 7.777593 7.099515
## 4 3.958158 4.575160 6.651720 8.723868 5.842716 7.083832 5.719584
## SPARC CNN1 FBN1
## 1 8.072844 4.631832 4.569274
## 2 8.435018 4.756706 4.747612
## 3 10.617322 5.453678 7.268519
## 4 9.405838 4.943186 5.765576
##
## Clustering vector:
## [1] 4 3 1 1 3 3 4 4 4 1 4 1 2 2 4 1 2 2 1 1 1 1 1 2 1 3 2 2 2 1 2 2 1 2 1
## [36] 2 1 2 1 2 4 1 4 4 1 1 2 4 4 2 2 2 1 1 2 2 1 1 2 1 2 2 4 4 2 2 2 4 4 2
## [71] 3 1 1 1 1 1 2 1 4 2 2 1 4 4 4 4 4 1 2 4 4 4 2 1 2 1 3 2 2 4 1 2 2 1 1
## [106] 2 2 4 1 4 3 1 4 4 3 4 1 4 3 2 4 3 4 4 3 4 4 2 2 2 3 4 3 2 1 1 3 2 2 2
## [141] 4 4 4 3 1 3 2 4 3 4 4 3 4 3 1 3 3 4 1 2 1 3 4 2 4 4 3 4 3 3 3 1 4 3 2
## [176] 3 4 4 4 4 1 1 2 1 3 4 1 1 1 4 2 1 4 1 4 4 1 2 2 2 1 1 4 1 2 4 4 3 3 4
## [211] 4 4 1 4 1 4 1 4 2 3 4 4 4 1 4 3 3 2 4 4 1 2 4 1 2 4 3 1 3 2 4 3 2 3 2
## [246] 4 2 3 3 4 3 1 1 1 1 1 1 1 1 3 3 1 1 4 1 2 2 1 1 1 2 1 3 3 3 1 1 2 2 1
## [281] 2 4 4 1 1 4 4 3 3 1 2 4 4 3 4 2 1 2 4 2 2 3 4 3 3 2 2 4 4 4 4 2 3 3 4
## [316] 3 3 4 3 4 1 3 2 4 2 1 1 4 4 1 3 2 4 4 1 4 4 3 4 3 4 3 2 3 2 1 2 3 2 3
## [351] 1 1 4 1 3 2 2 4 4 2 3 4 2 3 3 3 4 4 4 4 3 3 4 3 1 1 3 4 3 2 1 1 3 1 3
## [386] 4 2 1 4 3 3 3 4 3 2 1 2 3 3 2 1 4 1 4 3 2 4 4 4 1 3 2 3 4 3 2 4 3 3 2
## [421] 3 4 3 3 3 4 3 1 3 4 3 3 4 3 1 2 4 2 3 4 2 3 1 4 4 2 3 1 2 1 2 4 1 2 4
## [456] 4 3 4 2 3 4 4 4 4 2 2 3 1 2 3 2 4 4 1 4 1 2 2 1 2 2 1 1 4 1 4 2 4 4 2
## [491] 3 2 3 4 3 3 4 3 4 1 2 2 1 1 1 1 4 2 4 1 3 4 2 2 2 1 4
##
## Within cluster sum of squares by cluster:
## [1] 1543.203 1681.734 1688.129 2159.681
## (between_SS / total_SS = 52.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
mydata <- data.frame(data.survival, kc4$cluster)
mydata[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1 kc4.cluster
## 113 6.568068 9.314609 6.669734 5.871749 4
## 219 4.824111 9.307861 4.432514 5.683262 2
## 221 5.575577 9.289752 6.349214 5.611158 4
## 346 3.704212 8.624038 4.066246 4.520293 1
## 455 5.432554 9.675083 5.237223 6.012563 4
## 202 4.129214 7.526688 4.497028 4.453816 1
## 32 3.879755 6.924337 4.567773 4.161153 2
## 42 4.441712 9.012392 4.009992 5.035908 1
## 74 3.619439 7.511828 4.256769 4.193958 1
## 115 6.175115 10.238335 5.359044 7.017225 3
The observations with a q-value smaller than 5% are:
mydata[c(113, 219, 221, 346, 455),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## CXCL12 SPARC CNN1 FBN1 kc4.cluster
## 113 6.568068 9.314609 6.669734 5.871749 4
## 219 4.824111 9.307861 4.432514 5.683262 2
## 221 5.575577 9.289752 6.349214 5.611158 4
## 346 3.704212 8.624038 4.066246 4.520293 1
## 455 5.432554 9.675083 5.237223 6.012563 4
The same clustering is now repeated with five clusters.
kc5<-kmeans(data.survival,5)
kc5
## K-means clustering with 5 clusters of sizes 124, 101, 98, 97, 97
##
## Cluster means:
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2 ADIPOQ
## 1 4.618342 5.807890 4.910235 5.152469 4.120732 4.509613 7.922721 3.436084
## 2 5.359920 6.600361 5.895569 7.143613 4.738249 5.638378 9.464933 4.395812
## 3 4.415086 5.332284 4.181659 5.282843 3.635519 3.781646 5.902508 3.214585
## 4 4.258753 4.729927 3.974906 4.587060 3.568523 3.593207 4.609537 3.208088
## 5 4.673645 5.380388 4.234404 3.902078 3.725651 3.800970 5.596883 3.232988
## NPY CCL11 VCAN DCN TIMP3 CRYAB CXCL12
## 1 4.028975 4.678772 6.943050 8.933593 6.071771 7.051747 5.854258
## 2 4.202736 5.569152 8.480410 10.180507 7.652817 7.849420 7.163588
## 3 4.095147 3.979528 5.173194 7.977305 5.360167 8.426339 4.861124
## 4 3.866376 3.767748 4.203868 6.078190 4.468760 7.118908 4.313165
## 5 4.039233 3.989742 5.111638 7.402521 5.103638 5.220272 5.067739
## SPARC CNN1 FBN1
## 1 9.630790 4.975456 5.955204
## 2 10.661374 5.494766 7.364244
## 3 8.860077 4.806302 5.049814
## 4 7.674751 4.653158 4.314266
## 5 8.493464 4.694065 4.885280
##
## Clustering vector:
## [1] 1 2 5 4 2 2 3 1 1 4 5 4 3 4 5 5 3 4 5 5 4 5 5 3 5 2 3 4 4 4 4 4 4 3 5
## [36] 3 5 4 5 4 1 5 1 3 5 5 4 5 1 5 3 3 4 5 4 4 5 5 3 5 5 3 1 1 3 3 4 1 1 4
## [71] 2 4 5 4 5 4 3 4 1 3 4 4 1 1 3 3 1 4 3 1 1 1 4 4 4 5 2 4 4 1 4 3 4 4 4
## [106] 4 5 3 4 3 2 5 1 5 2 3 5 1 2 4 1 2 1 1 2 1 5 4 3 3 2 1 2 4 4 5 2 5 4 4
## [141] 1 1 1 1 4 2 3 1 2 1 1 2 1 1 5 2 2 5 4 3 5 2 1 4 1 1 2 1 2 2 2 4 1 2 3
## [176] 1 1 5 1 1 4 5 4 5 2 3 5 5 5 1 3 5 1 4 1 1 4 3 3 4 4 5 1 5 3 1 1 2 2 1
## [211] 3 3 5 1 4 3 5 3 3 2 1 1 1 4 1 2 2 3 1 1 4 3 1 4 4 1 2 4 2 3 1 2 3 2 3
## [246] 1 3 2 2 3 2 5 4 4 5 5 4 5 4 2 2 4 5 1 5 3 3 4 5 5 3 4 1 2 2 4 5 3 3 5
## [281] 3 1 3 5 4 1 3 2 2 5 3 1 5 1 1 5 5 4 1 3 3 1 1 1 2 3 4 1 1 1 1 3 2 2 5
## [316] 2 2 5 2 1 5 2 3 3 3 4 5 1 1 4 2 3 1 1 5 5 1 1 1 2 3 2 3 2 3 5 4 2 5 2
## [351] 5 5 5 4 2 4 3 5 1 4 2 1 4 2 2 2 1 1 1 5 2 2 1 2 4 4 2 1 2 3 5 5 2 4 2
## [386] 1 3 5 3 2 2 2 1 2 3 5 4 2 2 3 5 1 5 3 2 3 1 5 1 5 2 3 1 1 2 3 5 2 2 3
## [421] 2 1 2 2 2 1 2 5 2 1 2 2 1 2 4 3 1 4 2 5 3 2 5 1 1 4 2 5 3 5 4 1 5 3 1
## [456] 3 2 1 3 2 1 1 1 1 3 3 2 4 3 2 4 1 5 5 1 4 3 3 4 3 3 4 5 5 5 3 4 1 3 3
## [491] 2 3 1 1 2 2 1 2 1 5 3 3 4 4 4 4 1 3 1 5 2 1 3 4 3 5 1
##
## Within cluster sum of squares by cluster:
## [1] 1645.332 1545.320 1178.679 1161.628 1179.167
## (between_SS / total_SS = 55.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
mydata2 <- data.frame(data.survival, kc5$cluster)
mydata2[c(113, 219, 221, 346, 455, 202, 32, 42, 74, 115),]
## LPL IGF1 EDNRA MFAP5 LOX INHBA THBS2
## 113 3.730568 7.384083 4.595852 7.600399 3.778588 5.199009 8.488414
## 219 4.278560 4.817638 5.811480 4.905208 3.356856 3.663791 5.162079
## 221 5.298230 7.500011 4.423761 4.854608 3.931041 4.374053 7.728778
## 346 4.172526 4.717430 3.853260 3.190443 3.350616 3.524537 4.484999
## 455 4.323476 6.277478 4.410887 5.600682 3.863177 4.226022 8.377245
## 202 6.367273 4.404099 3.770298 3.299618 3.726482 3.746802 5.417982
## 32 4.046278 4.250899 3.985438 3.329091 3.663556 3.606930 6.112304
## 42 4.024678 4.240507 4.235880 3.094953 3.386156 3.693511 6.245130
## 74 3.603795 3.642994 3.748988 9.415976 3.837903 3.386590 3.666802
## 115 6.787121 7.322588 6.449485 6.252702 4.061837 5.664429 8.799108
## ADIPOQ NPY CCL11 VCAN DCN TIMP3 CRYAB
## 113 3.310786 3.536249 4.288686 6.920805 7.967912 6.828998 9.597481
## 219 3.268250 7.738465 3.536991 5.014663 9.210073 6.504730 9.267920
## 221 3.249411 3.910235 4.687864 6.101423 7.986346 6.485240 7.311449
## 346 3.219537 3.686063 3.620383 4.794432 7.328997 4.605971 5.616922
## 455 3.254565 3.249004 3.588039 6.844709 8.371540 6.169093 7.423490
## 202 3.159120 3.325088 3.693970 4.613538 5.897462 4.562150 3.769068
## 32 3.375889 3.896632 3.735272 3.781419 5.718897 3.722733 9.046409
## 42 3.165178 3.877778 4.760985 5.599226 9.181945 5.116395 6.021069
## 74 3.421356 3.527337 3.427662 3.960211 5.308635 4.293556 4.046809
## 115 6.985946 3.572123 4.706295 8.018179 9.535735 6.762000 9.454499
## CXCL12 SPARC CNN1 FBN1 kc5.cluster
## 113 6.568068 9.314609 6.669734 5.871749 1
## 219 4.824111 9.307861 4.432514 5.683262 3
## 221 5.575577 9.289752 6.349214 5.611158 1
## 346 3.704212 8.624038 4.066246 4.520293 5
## 455 5.432554 9.675083 5.237223 6.012563 1
## 202 4.129214 7.526688 4.497028 4.453816 5
## 32 3.879755 6.924337 4.567773 4.161153 4
## 42 4.441712 9.012392 4.009992 5.035908 5
## 74 3.619439 7.511828 4.256769 4.193958 4
## 115 6.175115 10.238335 5.359044 7.017225 2
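To check how stable the grouping of the flagged observations is across the two solutions, the 4-cluster and 5-cluster assignments can be cross-tabulated.
table(kc4=kc4$cluster, kc5=kc5$cluster)                           # correspondence between the two solutions
cbind(kc4=kc4$cluster, kc5=kc5$cluster)[c(113,219,221,346,455),]  # labels of the flagged observations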