For instance, imagine that you want to count all identical residues in an 80-residue peptide, where a match occurs when the same residue occurs at the same position in another peptide. The catch is that the number of factor levels is probably not the same, as some of the letters [A - Z] that represent residues will be present in one peptide but not in the next. For simplicity, imagine that we are looking for exactly identical residues (the letters match at the same positions) in all three peptides, so the answer at each position is a Boolean TRUE or FALSE: TRUE if they all match and FALSE if they do not. Again, the catch is that the factor levels are not the same, so you can't test peptide_x == peptide_y.
Coding:
> peptide_x <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_y <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_z <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
You can check which letters from the alphabet of 26 residues are missing in your peptide with the command:
> setdiff(LETTERS[1:26], peptide_x)
[1] "Y"
So we see that "Y" (Tyrosine) is missing. Because the peptides are random, your run may be missing a different letter or two; the same check works for any of the peptides.
If I try to compare vectors that contain the same set of letters, then that works:
> x <- c("M", "N", "A", "Q", "C")
> y <- c("N", "M", "A", "C", "Q")
> xy_frame <- data.frame(x,y)
> xy_frame
> x == y
[1] FALSE FALSE TRUE FALSE FALSE
As you can see, the A's match up, so the third element "A" is the only TRUE.
Shockingly this test works:
> x <- c("A", "A", "B", "Q", "C")
> y <- c("A", "Q", "C", "D", "R")
> x == y
[1] TRUE FALSE FALSE FALSE FALSE
even though the two vectors do not contain the same set of letters. So I wonder whether something is wrong with my data type, which would explain why I can't run this test:
> peptides <- data.frame(peptide_x, peptide_y)
> peptides$peptide_x == peptides$peptide_y
Error in Ops.factor(peptides$peptide_x, peptides$peptide_y) : level sets of factors are different
So how can I fix my data type if that's the issue, or am I running the right test?
I just want to count the TRUE and FALSE results when comparing factors whose level sets are not identical.
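One possible way to run the position-wise test, sketched here under the assumption that only the letters matter and not the factor encoding, is to convert each factor to character before comparing. The vectors `x` and `y` reuse the letters from the example above but are stored as factors; `z` is a hypothetical third peptide added only for illustration.

```r
# Same letters as the small example above, but stored as factors,
# so their level sets differ and x == y would raise the Ops.factor error
x <- factor(c("A", "A", "B", "Q", "C"))
y <- factor(c("A", "Q", "C", "D", "R"))
z <- factor(c("A", "Q", "B", "D", "C"))  # hypothetical third peptide

# Compare the underlying letters position by position
same_xy <- as.character(x) == as.character(y)
same_xy
# [1]  TRUE FALSE FALSE FALSE FALSE

sum(same_xy)    # number of matching positions: 1
table(same_xy)  # counts of FALSE and TRUE

# TRUE only where all three peptides carry the same residue
same_xyz <- same_xy & as.character(y) == as.character(z)
sum(same_xyz)   # 1
```

An alternative that keeps the data as factors would be to rebuild each one with a shared level set, for example `factor(x, levels = LETTERS)`, after which `==` should no longer complain about differing levels.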
Comment:
Is the %in% not working correctly because ...
> head(peptide_x)
[1] "C" "T" "X" "Z" "M" "A"
> head(peptide_y)
[1] "R" "G" "T" "U" "G" "U"
> head(peptide_x %in% peptide_y)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
The first 6 letters of each peptide, for example, don't match up, but it says TRUE! How?
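A likely explanation is that `%in%` tests set membership (does each element of the left-hand vector occur anywhere in the right-hand vector?), whereas `==` compares element by element. The sketch below uses only the six letters shown above; `x6` and `y6` are hypothetical names for those short vectors.

```r
# Only the first six residues quoted in the comment
x6 <- c("C", "T", "X", "Z", "M", "A")
y6 <- c("R", "G", "T", "U", "G", "U")

# Membership: is each letter of x6 found anywhere in y6?
x6 %in% y6
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE

# Position-wise comparison: same letter at the same index?
x6 == y6
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
```

With the full 80-residue peptides, nearly every letter of the alphabet appears somewhere in `peptide_y`, which would explain why `%in%` returns TRUE for almost every position even though the letters do not line up.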