
I am learning R and how to use it for data analysis, so sorry if my questions are very basic...

I am analyzing a data frame with a lot of data on protein presence in a group of cell lines from the AMD Anderson database. The data frame ("AMD_database") has the cell lines as rows and the proteins with their data as columns. I need to correlate these data, but when I try, I get the error "Evaluation error: not enough finite observations".

Example of "AMD_database": rows are cell lines, columns are protein names with their data.

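library(tidyr)  # gather()
library(dplyr)  # %>%, group_by(), summarise()

# Reshape to long format: one row per cell line and protein
actividad_protein_long <- gather(data = AMD_database, key = protein, value = level, -(1:5))

# cor.test() matches method names case-sensitively, so use "kendall"
correlation_table <- na.omit(actividad_protein_long) %>%
  group_by(protein) %>%
  summarise(r = cor.test(rel_IC50_uM, level, method = "kendall")$estimate,
            p_value = cor.test(rel_IC50_uM, level, method = "kendall")$p.value)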

I understand that the problem is related to the number of observations in each column: this number varies a lot, and some proteins fall below the threshold of three data points, so the analysis cannot be completed.

How can I filter the data beforehand to remove every protein with fewer than the three observations required for the analysis?

I tried

filteredData <- AMD_database[which(,colSums(!is.na(AMD_database)))>3]
filteredData <- AMD_database[which(AMD_database[,colSums(AMD_database)]>3)]

But neither attempt works. How can I make sure that every column in AMD_database contains enough non-NA values before running the correlation? Is there a better workaround than this? And is it correct to avoid the error this way, or am I making a big mistake by sidestepping it?

I have already checked these related questions:

cor.test ,"not enough finite observations"
How to ignore cor.test:“not enough finite observations” and continue, when using tidyverse and ggplot2 (ggpmisc)
R cor.test : "not enough finite observations"

But I cannot use the "purrr" package, because this is intended to be shared on a server that cannot have that package available, and I am not sure I want to bypass the error the way the other post suggests...

Many thanks in advance :)


1 Answer


If I understood correctly, the problem is that each column has missing values in different rows. Here is my crude solution: for every pair of columns, it counts the non-missing rows they share and computes cor.test only if they share more than three.

## Creating data 

dat <- replicate(100, rnorm(20))

# Adding NAs 

dat[sample(2000, 700)] <- NA


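cor.wrap <- function(dat) {
  p <- ncol(dat)
  cor.mat <- matrix(NA, ncol = p, nrow = p)
  # Fill the upper triangle only: each pair of distinct columns once
  for (i in 1:(p - 1)) {
    for (j in (i + 1):p) {
      # Rows where both columns are non-missing
      num.ind <- !is.na(dat[, i]) & !is.na(dat[, j])
      # Run the test only when the pair shares more than 3 complete rows
      if (sum(num.ind) > 3) {
        # $statistic is the test statistic; use $estimate for the coefficient
        cor.mat[i, j] <- cor.test(dat[num.ind, i], dat[num.ind, j])$statistic
      }
    }
  }
  return(cor.mat)  # return the result matrix, not the input data
}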

cor.wrap(dat)
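
For the layout described in the question (one response column, rel_IC50_uM, against many protein columns), the same idea can be sketched in base R only, so it does not depend on purrr. This is a minimal sketch under assumptions: AMD_toy, its values, and the column names cell_line, protA, protB and protC are made up for illustration, and min_obs is a hypothetical name for the threshold of three mentioned in the question.

```r
# Toy stand-in for AMD_database (hypothetical values and column names,
# except rel_IC50_uM, which comes from the question).
set.seed(1)
AMD_toy <- data.frame(
  cell_line   = paste0("line", 1:8),
  rel_IC50_uM = runif(8),
  protA       = rnorm(8),
  protB       = c(rnorm(2), rep(NA, 6)),  # only 2 observations
  protC       = c(rnorm(5), rep(NA, 3))   # 5 observations
)

min_obs <- 3  # threshold named in the question (assumed name)
protein_cols <- setdiff(names(AMD_toy), c("cell_line", "rel_IC50_uM"))

# Count rows where both the response and the protein level are present.
n_pairs <- sapply(protein_cols, function(p) {
  sum(complete.cases(AMD_toy$rel_IC50_uM, AMD_toy[[p]]))
})
keep <- protein_cols[n_pairs >= min_obs]

# Run cor.test only on the proteins that passed the filter.
correlation_table <- do.call(rbind, lapply(keep, function(p) {
  ct <- cor.test(AMD_toy$rel_IC50_uM, AMD_toy[[p]], method = "kendall")
  data.frame(protein = p, n = n_pairs[[p]],
             r = unname(ct$estimate), p_value = ct$p.value)
}))

# Wide-format filter the question was attempting: keep only columns
# with at least min_obs non-NA values.
filteredData <- AMD_toy[, colSums(!is.na(AMD_toy)) >= min_obs, drop = FALSE]
```

Counting complete pairs is a bit stricter than counting non-NA values per column, because a protein value only helps if the response is also present in that row. Filtering this way does not hide the error; it just skips proteins for which no test can be computed, and keeping n in the output makes it visible how little data some estimates rest on. In the dplyr pipeline from the question, adding filter(n() >= 3) between group_by(protein) and summarise(...) should have the same effect.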