0
votes

I have several data frames, each containing a list of gene names without a header. Each file roughly looks like this:

SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00004
SCA-6_Chr1v1_00005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00010
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_00015
SCA-6_Chr1v1_00017

Each of these data frames is written to a separate .txt file, and I have read them all into one list like so:

temp = list.files(pattern = "*.txt")
myfiles = lapply(temp, FUN=read.table, header=FALSE)

With the myfiles list I want to determine all of the values unique to each file (i.e. not present in any of the other files) and return them in a list (I assume I can do this with lapply). I have tried running the following code, but it is not dropping the shared values:

unique.genes = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]], unlist(myfiles[-n])))

Any help would be greatly appreciated.

1
Could you provide a couple of data frames so we can work on them? You should also explain what exactly you want the output to be. Is it a list of unique values in a field common to all the DFs? - Pablo Herreros Cantis
With myfiles = lapply(temp, FUN = scan, what = character()) the lapply/setdiff loop will work and it's much faster. - Rui Barradas

1 Answer

0
votes

Here is an approach. First, provide reproducible data:

set.seed(42)
myfiles <- replicate(2, sample(LETTERS, 25, replace=TRUE), simplify=FALSE)
myfiles
# [[1]]
#  [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "Q" "O" "X" "G" "D" "Y" "E" "N" "T" "Z" "R" "O" "C" "I" "Y" "D" "E"
# 
# [[2]]
#  [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "Z" "H" "D" "D" "V" "R" "M" "E" "D" "B" "X" "R"

Now find the unique values within each element (this drops duplicates inside each file):

result <- lapply(myfiles, unique)
result
# [[1]]
#  [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "O" "X" "G" "N" "T" "C" "I"
# 
# [[2]]
#  [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "D" "R"

Or this will sort them for easier comparison:

result2 <- lapply(myfiles, function(x) sort(unique(x)))
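If the goal is instead the values unique to each file across files (as the question asks), the setdiff loop itself is sound; it fails because read.table returns data frames, so each myfiles[[n]] is a data frame rather than a character vector. Reading with scan (as Rui Barradas suggests in the comments) or extracting the first column first makes it work. A minimal sketch using the reproducible vectors above:

```r
set.seed(42)
myfiles <- replicate(2, sample(LETTERS, 25, replace = TRUE), simplify = FALSE)

# If the elements came from read.table, pull out the character column first:
# myfiles <- lapply(myfiles, `[[`, 1)

# For each element, keep only values that appear in no other element
unique.genes <- lapply(seq_along(myfiles), function(n)
  setdiff(myfiles[[n]], unlist(myfiles[-n])))
```

Each element of unique.genes then contains only gene names absent from every other file, with within-file duplicates already removed (setdiff returns unique values).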