
This appears to be a simple question, but it is causing me a lot of headaches (it's not homework, but a sticking point in real research).

I have a single list with 2266 levels. The list looks somewhat like this:

[1] ~/folder1/folder1/a.bin
[2] ~/folder1/folder1/b.bin
[3] ~/folder1/folder1/c.bin
[4] ~/folder1/folder2/a.bin
[5] ~/folder1/folder2/b.bin
[6] ~/folder1/folder2/c.bin

To explain: the list holds the filenames of binary files that I am reading in with the readBin function. I want to compare every file with every other file, so I need two columns that together contain every unique pair of entries from my single column.

choose(2266, 2) tells me that there are 2566245 ways of pairing up the entries of my single column.

expand.grid() appears to get me halfway there. But there are twice as many combinations as I require: I get two columns of 5132490 rows each. This means that there are duplications: 1 + 2 and 2 + 1 are the same thing for my purpose.
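
To illustrate the duplication described above, here is a minimal sketch on a small stand-in vector. The names files, idx, keep and pairs are hypothetical and only for illustration. It assumes expand.grid() is called on the row indices, and keeps each unordered pair once by requiring the first index to be smaller than the second:

```r
# Hypothetical stand-in for the 2266 file paths
files <- c("a.bin", "b.bin", "c.bin", "d.bin")
n <- length(files)

# expand.grid() on the indices yields every ordered pair, self-pairs included
idx <- expand.grid(i = seq_len(n), j = seq_len(n))
nrow(idx)  # n^2 rows

# Keep each unordered pair exactly once: only rows with i < j survive
keep <- idx[idx$i < idx$j, ]
pairs <- data.frame(first  = files[keep$i],
                    second = files[keep$j],
                    stringsAsFactors = FALSE)

# The number of remaining rows should match choose(n, 2)
nrow(pairs) == choose(n, 2)
```

Filtering on the indices avoids comparing strings or hashing rows, which should matter more as the list grows.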

expand.grid.df with unique=TRUE also doesn't seem to help.

My last idea was md5 hashing each of the 5 million rows and trying to detect duplicates that way.

I am looking for some way of making two lists which cover the 2566245 combinations of my list. Alternatively some way of removing all the duplicates. I guess I am not absolutely wedded to using R and have investigated awk or sed to do the same thing. No success yet though.

Did you try combn(all_files, 2)? - Arun
I tried combn too. I can try again... - Frank Zafka
combn(all_files, 2) doesn't work (not even close). - Frank Zafka
Could you elaborate on "doesn't work"? - Arun
how many combinations do you get on just the 6 files you've shown above? - Arun

1 Answer


I think you are looking for combn, with its output reshaped to look like the result of expand.grid (one pair per row). Using @Arun's data:

v <- c("~/folder1/folder1/a.bin", 
       "~/folder1/folder1/b.bin", 
       "~/folder1/folder1/c.bin", 
       "~/folder1/folder2/a.bin", 
       "~/folder1/folder2/b.bin", 
       "~/folder1/folder2/c.bin")
do.call(rbind, combn(v, 2, simplify = FALSE))

      [,1]                      [,2]                     
 [1,] "~/folder1/folder1/a.bin" "~/folder1/folder1/b.bin"
 [2,] "~/folder1/folder1/a.bin" "~/folder1/folder1/c.bin"
 [3,] "~/folder1/folder1/a.bin" "~/folder1/folder2/a.bin"
 [4,] "~/folder1/folder1/a.bin" "~/folder1/folder2/b.bin"
 [5,] "~/folder1/folder1/a.bin" "~/folder1/folder2/c.bin"
 [6,] "~/folder1/folder1/b.bin" "~/folder1/folder1/c.bin"
 [7,] "~/folder1/folder1/b.bin" "~/folder1/folder2/a.bin"
 [8,] "~/folder1/folder1/b.bin" "~/folder1/folder2/b.bin"
 [9,] "~/folder1/folder1/b.bin" "~/folder1/folder2/c.bin"
[10,] "~/folder1/folder1/c.bin" "~/folder1/folder2/a.bin"
[11,] "~/folder1/folder1/c.bin" "~/folder1/folder2/b.bin"
[12,] "~/folder1/folder1/c.bin" "~/folder1/folder2/c.bin"
[13,] "~/folder1/folder2/a.bin" "~/folder1/folder2/b.bin"
[14,] "~/folder1/folder2/a.bin" "~/folder1/folder2/c.bin"
[15,] "~/folder1/folder2/b.bin" "~/folder1/folder2/c.bin"
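
As a possible alternative to the do.call(rbind, ...) form, combn() with its default simplify = TRUE returns a matrix with one pair per column, so transposing it should give the same layout. The sketch below assumes that a two-column data frame is the desired end result; the names m, pairs, file1 and file2 are made up for illustration:

```r
v <- c("~/folder1/folder1/a.bin",
       "~/folder1/folder1/b.bin",
       "~/folder1/folder1/c.bin",
       "~/folder1/folder2/a.bin",
       "~/folder1/folder2/b.bin",
       "~/folder1/folder2/c.bin")

# combn() returns a 2 x choose(n, 2) matrix; t() turns it into one pair per row
m <- t(combn(v, 2))

# Two named columns, one row per unordered pair (column names are hypothetical)
pairs <- data.frame(file1 = m[, 1], file2 = m[, 2],
                    stringsAsFactors = FALSE)

# Same entries as the do.call(rbind, ...) version
all(m == do.call(rbind, combn(v, 2, simplify = FALSE)))
```

Each row of pairs can then be passed to whatever comparison is run on the two files.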

EDIT

I think that the path format overcomplicates the problem. If we use, for example, letters in place of file names, we get:

do.call(rbind, combn(letters[1:4], 2, simplify = FALSE))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "a"  "d" 
[4,] "b"  "c" 
[5,] "b"  "d" 
[6,] "c"  "d"  

So, as you can see, there are no duplicates.
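
If you would rather check this than read it off the printout, one way (a sketch, with x, m and key as hypothetical names) is to build an order-insensitive key for each row and confirm that no key repeats and that the row count matches choose():

```r
x <- letters[1:4]
m <- do.call(rbind, combn(x, 2, simplify = FALSE))

# Sort within each row so that ("a", "b") and ("b", "a") map to the same key
key <- apply(m, 1, function(r) paste(sort(r), collapse = "|"))

# anyDuplicated() returns 0 when no key appears more than once
anyDuplicated(key) == 0

# The number of rows should equal the number of unordered pairs
nrow(m) == choose(length(x), 2)
```

The same check should apply unchanged to the vector of file paths, assuming no path contains the "|" separator used for the key.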